![Page 1: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/1.jpg)
Human/Computer Communications Using Speech
Ellis K. ‘Skip’ Cave
InterVoice-Brite Inc.
skip@intervoice.com
![Page 2: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/2.jpg)
Famous Human/Computer Communication - 1968
![Page 3: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/3.jpg)
InterVoice-Brite
• Twenty years building speech applications
• Largest provider of VUI applications and systems in the world
• Turnkey Systems
– Hardware, software, application design, managed services
• 1000’s of installations worldwide
• Banking, Travel, Stock Brokerage, Help Desk, etc.
– Bank of America
– American Express
– E-Trade
– Microsoft help-desk
![Page 4: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/4.jpg)
Growth of Speech-Enabled Applications
• Analysts estimate that 15% of IVR ports sold in 2000 were speech enabled
• By 2004, 48.5% of IVR ports sold will be speech-enabled
– Source: Frost & Sullivan - U.S. IVR Systems Market, 2001
• IVB estimates that in 2002, 50% of IVB ports sold will be speech enabled.
![Page 5: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/5.jpg)
Overview
• Brief History of Speech Recognition
• How ASR works
• Directed Dialog & Applications
• Standards & Trends
• Natural Language & Applications
![Page 6: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/6.jpg)
History
• Natural Language Processing
– Computational Linguistics
– Computer Science
– Text understanding
• Auto translation
• Question/Answer
• Web search
• Speech Recognition
– Electrical Engineering
– Speech-to-text
• Dictation
• Control
![Page 7: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/7.jpg)
Turing Test
• Alan M. Turing
• Paper - “Computing Machinery and Intelligence” (Mind, 1950 - Vol. 59, No. 236, pp. 433-460)
• First two sentences of the article:
– I propose to consider the question, "Can machines think?" This should begin with definitions of the meaning of the terms "machine" and "think."
• To answer this question, Turing proposed the “Imitation Game”, later named the “Turing Test”
– Requires an Interrogator & 2 subjects
![Page 8: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/8.jpg)
Turing Test
Observer
Subject #1
Subject #2
Which subject is a machine?
![Page 9: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/9.jpg)
Turing Test
• Turing assumed communications would be written (typed)
• Assumed communications would be unrestricted as to subject
• Predicted that test would be “passed” in 50 years (2000)
• The ability to communicate is equated to “Thinking” and “intelligence”
![Page 10: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/10.jpg)
Turing Test - 50 Years Later
• Today - NL systems still unable to fool interrogator on unrestricted subjects
• Speech Input & Output possible
• Transactional dialogs in restricted subject areas possible
• Question/Answer queries feasible on large text databases
• May not fool the interrogator, but can provide useful functions
– Travel Reservations, Stock Brokerages, Banking, etc.
![Page 11: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/11.jpg)
Speech Recognition
![Page 12: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/12.jpg)
Voice Input - The New Paradigm
• Automatic Speech Recognition (ASR)
• Tremendous technical advances in the last few years
• From small to large vocabularies
– 5,000 - 10,000 word vocabulary
• Stock brokerage - E-Trade - Ameritrade
• Travel - Travelocity, Delta Airlines
• From isolated word to connected words
– Modern ASR recognizes connected words
• From speaker dependent to speaker independent
– Modern ASR is fully speaker independent
• Natural Language
![Page 13: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/13.jpg)
Signal Processing Front-End
Feature Extraction: 13 Parameters + 13 Parameters + 13 Parameters
![Page 14: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/14.jpg)
Overlapping Sample Windows
25 ms sample window - 15 ms overlap (10 ms step) - 100 windows/sec.
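A minimal sketch of the windowing arithmetic on this slide: a 25 ms window advanced in 10 ms steps leaves a 15 ms overlap and gives roughly 100 windows per second. The 8000 Hz telephone sampling rate and the name `frame_signal` are assumptions for illustration; they do not come from the slides.

```python
def frame_signal(samples, sample_rate, window_ms=25, step_ms=10):
    """Split a sample list into overlapping windows (window_ms long, step_ms apart)."""
    window = sample_rate * window_ms // 1000   # samples per window
    step = sample_rate * step_ms // 1000       # samples between window starts
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, step)]

# One second of placeholder audio at an assumed 8000 Hz sampling rate.
one_second = list(range(8000))
frames = frame_signal(one_second, 8000)

# Each window holds 200 samples; neighbours share 120 samples (15 ms).
print(len(frames), len(frames[0]))
```

Only complete windows are kept, so one second of audio yields slightly fewer than 100 windows.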
![Page 15: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/15.jpg)
Cepstrum
• Cepstrum is the inverse Fourier transform of the log spectrum

$$c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(e^{j\omega})\, e^{j\omega n}\, d\omega, \qquad n = 0, 1, \ldots, L-1$$
![Page 16: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/16.jpg)
Mel Cepstral Coefficients
• Construct mel-frequency domain using a triangularly-shaped weighting function applied to mel-transformed log-magnitude spectral samples
• Mel-Filtered Cepstral Coefficients
– Most common feature set for recognizers
– Motivated by human auditory response characteristics
![Page 17: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/17.jpg)
Mel Cepstrum
• After computing the DFT, and the log magnitude spectrum (to obtain the real cepstrum), we compute the filterbank outputs, and then use a discrete cosine transform to compute the mel-frequency cepstrum coefficients:
• Mel Cepstrum
– 39 features representing one 25 ms voice sample
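The pipeline described above can be sketched in plain Python. This is one common textbook ordering (power spectrum, triangular mel filterbank, log, discrete cosine transform), which may differ in detail from the variant on the slide. The mel formula, the 24 filters, the 13 coefficients and the 8000 Hz sampling rate are all assumptions chosen for illustration.

```python
import cmath
import math

def hz_to_mel(hz):
    """Convert Hz to mel using a widely used approximation."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow, but dependency-free)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]

def mel_filterbank(num_filters, frame_len, sample_rate):
    """Triangular weighting functions with centres equally spaced in mel."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    edges = [hz * frame_len / sample_rate for hz in edges_hz]  # fractional bins
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(frame_len // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, num_filters=24, num_coeffs=13):
    """Log filterbank energies followed by a discrete cosine transform."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    log_energies = [math.log(sum(w * p for w, p in zip(row, spectrum)) + 1e-10)
                    for row in bank]
    return [
        sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
            for m, e in enumerate(log_energies))
        for c in range(num_coeffs)
    ]

# A 25 ms window of a 1000 Hz tone at an assumed 8000 Hz sampling rate.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(200)]
coefficients = mfcc(tone, 8000)
print(len(coefficients))
```

The sketch yields 13 coefficients per window. A 39-element vector is commonly built by appending first and second time differences of those 13 values, but the slides do not spell that out, so treat it as an assumption.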
![Page 18: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/18.jpg)
Cepstrum as Vector Space Features
![Page 19: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/19.jpg)
Feature Ambiguity
• After the signal processing front-end
• How to resolve overlap or ambiguity in Mel-Cepstrum features
• Need to use context information
– What precedes? What follows?
• N-phones and N-grams
• All probabilistic computations
![Page 20: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/20.jpg)
The Speech Recognition Problem
Find the most likely word sequence Ŵ among all possible sequences given acoustic evidence A
A tractable reformulation of the problem is:
Language model
Acoustic model
Daunting search task
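The equations for this slide did not survive extraction. The standard formulation that matches the three labels above is reproduced here as a reconstruction, not as the author's exact notation: the maximisation over word sequences is the search task, $P(A \mid W)$ is the acoustic model, and $P(W)$ is the language model.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)

% Bayes' rule: P(W | A) = P(A | W) P(W) / P(A).
% P(A) does not depend on W, so it can be dropped from the maximisation:
\hat{W} = \arg\max_{W} \; P(A \mid W) \, P(W)
```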
![Page 21: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/21.jpg)
ASR Resolution
• Need to convert Mel Cepstrum features into probabilities
– Acoustic Model (tri-phone probabilities)
• Phonetic probabilities
– Language Model (bi-gram probabilities)
• Word probabilities
• Apply Dynamic Programming techniques
– Find most-likely sequence of phonemes & words
– Viterbi Search
![Page 22: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/22.jpg)
Acoustic Models
• Acoustic states represented by Hidden Markov Models (HMMs)
– Probabilistic State Machines - state sequence unknown, only feature vector outputs observed
– Each state has output symbol distribution
– Each state has transition probability distribution
Figure: three-state HMM with states s0, s1, s2; initial probability p(s0); output distributions q(i|s0), q(i|s1), q(i|s2); transitions t(s0|s0), t(s1|s0), t(s1|s1), t(s2|s1), t(s2|s2)
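A small sketch of how such a model assigns a probability to an observed sequence, using the forward algorithm on a three-state left-to-right HMM. Real recognizers emit continuous feature vectors; the discrete symbols "a" and "b" and every probability value below are made-up assumptions, used only to keep the arithmetic easy to check.

```python
from itertools import product

# p(s0): the model always starts in s0 (assumed).
INITIAL = {"s0": 1.0}

# t(next | current): self-loops plus a step to the next state (assumed values).
TRANSITIONS = {
    "s0": {"s0": 0.6, "s1": 0.4},
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s2": 1.0},
}

# q(symbol | state): output distribution of each state (assumed values).
OUTPUTS = {
    "s0": {"a": 0.8, "b": 0.2},
    "s1": {"a": 0.3, "b": 0.7},
    "s2": {"a": 0.5, "b": 0.5},
}

def forward_probability(observations, initial=INITIAL,
                        transitions=TRANSITIONS, outputs=OUTPUTS):
    """Probability of the observations, summed over all hidden state paths."""
    alpha = {s: initial.get(s, 0.0) * outputs[s][observations[0]]
             for s in outputs}
    for symbol in observations[1:]:
        alpha = {
            nxt: outputs[nxt][symbol] * sum(
                alpha[prev] * transitions[prev].get(nxt, 0.0) for prev in alpha)
            for nxt in outputs
        }
    return sum(alpha.values())

print(forward_probability(["a", "a", "b"]))

# Sanity check: probabilities of all length-3 sequences add up to one.
total = sum(forward_probability(list(seq)) for seq in product("ab", repeat=3))
print(total)
```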
![Page 23: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/23.jpg)
Subword Models
• Objective: Create a set of HMMs representing the basic sounds (phones) of a language
– English has about 40 distinct phonemes
– Need “lexicon” for pronunciations
– Letter to sound rules for unusual words
– Problem - co-articulation effects must be modeled
• “barter” vs “bartender”
• Solution - “tri-phones” - each phone modified by onset and trailing context phones
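To illustrate the tri-phone idea, the sketch below rewrites a phone sequence so that each phone carries its left and right neighbours. The `left-phone+right` notation follows a convention used by some HMM toolkits, and the phone spellings for "barter" and "bartender" are rough guesses rather than entries from a real lexicon; both are assumptions.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone with its preceding and following context phone."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        "%s-%s+%s" % (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]

# Hypothetical pronunciations, for illustration only.
BARTER = ["b", "aa", "r", "t", "er"]
BARTENDER = ["b", "aa", "r", "t", "eh", "n", "d", "er"]

print(to_triphones(BARTER))
print(to_triphones(BARTENDER))
```

The /t/ becomes `r-t+er` in one word and `r-t+eh` in the other, so the two contexts can be given separate acoustic models.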
![Page 24: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/24.jpg)
Language Models
• What is a language model?
– Quantitative ordering of the likelihood of word sequences
• Why use language models?
– Not all word sequences equally likely
– Search space optimization
– Improved accuracy
• Bridges the gap between acoustic ambiguities and ontology
![Page 25: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/25.jpg)
Finite State Grammars
Allowable word sequences are explicitly specified using a structured syntax
• Creates a word network
• Word sequences not enabled do not exist!
• Application developer must construct grammar
• Excellent for directed dialog and closed prompting
![Page 26: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/26.jpg)
Finite-State Language Model
• Narrow range of responses allowed
– Only word sequences coded in grammar are recognized
• Straightforward ASR engine. Follows grammar rules exactly
– Easy to add words to grammar
• Allows name lists
• “I want to fly to $CITY”
• “I want to buy $STOCK”
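The behaviour of a finite-state grammar with name lists can be imitated on text with regular expressions, as a rough sketch. A real recognizer compiles the grammar into a word network and matches audio, not strings. The city and stock lists and the names `compile_template` and `recognize` are hypothetical.

```python
import re

# Name lists that fill the $CITY and $STOCK slots (hypothetical values).
SLOTS = {
    "CITY": ["Boston", "Chicago", "Dallas"],
    "STOCK": ["IBM", "Microsoft"],
}

# The only word sequences this grammar allows.
TEMPLATES = ["I want to fly to $CITY", "I want to buy $STOCK"]

def compile_template(template, slots):
    """Turn a template into a regex whose named groups capture the slots."""
    pattern = re.escape(template)
    for name, values in slots.items():
        alternatives = "|".join(re.escape(v) for v in values)
        pattern = pattern.replace(
            re.escape("$" + name), "(?P<%s>%s)" % (name, alternatives))
    return re.compile(pattern, re.IGNORECASE)

GRAMMAR = [compile_template(t, SLOTS) for t in TEMPLATES]

def recognize(utterance):
    """Return the filled slots, or None when the whole sentence does not match."""
    for rule in GRAMMAR:
        match = rule.fullmatch(utterance.strip())
        if match:
            return match.groupdict()
    return None

print(recognize("I want to fly to Chicago"))
print(recognize("Please fly me to Chicago"))
```

Adding a city only means extending the list, which mirrors the "easy to add words" point, while any sentence outside the templates is simply rejected.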
![Page 27: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/27.jpg)
Statistical Language Models
Stochastic Context-Free Grammars
• Only specifies word transition probabilities
• N-gram language model
• Required for open ended prompts: “How may I direct your inquiry?”
• Much more difficult to analyze possible results
– Not for every interaction
• Data, Data, Data: 10,000+ transcribed responses for each input task
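As a sketch of what "word transition probabilities" means, the snippet below estimates bi-gram probabilities from a few transcribed responses by simple counting. The three sentences are invented, and a usable model would need smoothing and far more data, as the slide notes.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def bigram_probability(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# Invented transcriptions standing in for real caller responses.
TRANSCRIPTS = [
    "I want to buy stock",
    "I want to sell stock",
    "I want a quote",
]

model = train_bigrams(TRANSCRIPTS)
print(bigram_probability(model, "want", "to"))
print(bigram_probability(model, "to", "buy"))
```

Any transition that never occurred in the transcripts gets probability zero here, which is one reason so many examples are needed.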
![Page 28: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/28.jpg)
Statistical State Machines
![Page 29: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/29.jpg)
Mixed Language Models
• SLM statistics are unstable (useless) unless examples of each word in each context are presented
• Consider a flight reservation tri-gram language model:
I’d like to fly from Boston to Chicago on Monday
Training sentences required for 100 cities: (100*100 + 100*7) = 10,700
• A better way is to consider classes of words:
I’d like to fly from $(CITY) to $(CITY) on $(DATE)
Only one transcription is needed to represent 70,000 variations (100*100*7)
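The two counts on this slide can be checked by expanding the class-based template. The city names are placeholders, and the assumption that $(DATE) covers the seven days of the week follows from the slide's own arithmetic.

```python
from itertools import product

# 100 placeholder city names and the seven days of the week.
CITIES = ["City%d" % i for i in range(100)]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

TEMPLATE = "I'd like to fly from {origin} to {dest} on {day}"

def expand(template, cities, days):
    """Generate every sentence the class-based template stands for."""
    for origin, dest, day in product(cities, cities, days):
        yield template.format(origin=origin, dest=dest, day=day)

# Sentences covered by one class-based transcription.
variations = sum(1 for _ in expand(TEMPLATE, CITIES, DAYS))

# The slide's estimate of training sentences needed without classes.
word_level_estimate = len(CITIES) * len(CITIES) + len(CITIES) * len(DAYS)

print(variations, word_level_estimate)
```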
![Page 30: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/30.jpg)
Viterbi
• How do you determine the most probable utterance?
• The Viterbi Search returns the n-best paths through the Acoustic model and the Language Model
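A compact sketch of the dynamic-programming idea behind the Viterbi search. It returns only the single most probable state path, not an n-best list, and it works on a toy model: the two phone-like states, the quantised feature labels and all probabilities are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] holds the best score and path that end in state s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for symbol in observations[1:]:
        step = {}
        for nxt in states:
            prob, path = max(
                ((best[prev][0] * trans_p[prev][nxt] * emit_p[nxt][symbol],
                  best[prev][1]) for prev in states),
                key=lambda item: item[0])
            step[nxt] = (prob, path + [nxt])
        best = step
    return max(best.values(), key=lambda item: item[0])

# Toy model (all values are assumptions).
STATES = ["ih", "iy"]
START = {"ih": 0.6, "iy": 0.4}
TRANS = {"ih": {"ih": 0.7, "iy": 0.3}, "iy": {"ih": 0.4, "iy": 0.6}}
EMIT = {
    "ih": {"f1": 0.5, "f2": 0.4, "f3": 0.1},
    "iy": {"f1": 0.1, "f2": 0.3, "f3": 0.6},
}

probability, path = viterbi(["f1", "f2", "f3"], STATES, START, TRANS, EMIT)
print(path, probability)
```

Keeping only the best path into each state at each step is what makes the search tractable; an n-best variant would keep several candidates per state instead of one.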
![Page 31: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/31.jpg)
Dynamic Programming (Viterbi)
![Page 32: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/32.jpg)
N-Best Speech Results
• ASR converts speech to text
• Use “grammar” to guide recognition
• Focus on “speaker independent” ASRs
• Must allow for open context
Figure: Speech Waveform and Grammar feed the ASR, which outputs the N-Best Result
N=1 “Get me two movie tickets…”
N=2 “I want to movie trips…”
N=3 “My car’s too groovy”
![Page 33: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/33.jpg)
What does it all Mean?
Text output is nice, but how do we represent meaning?
• Finite state grammars - constructs can be tagged with semantics
<item> get me the operator <tag>OPERATOR</tag> </item>
• SLM uses concept spotting
Itinerary:slm "flightinfo.pfsg" = FlightConcepts
FlightConcepts [
  (from City:c) {<origin $c>}
  (to City:c) {<dest $c>}
  (on Date:d) {<date $d>}
]
• Concepts may also be trained statistically
– but that requires even more data!
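The concept-spotting rules above can be approximated on recognized text with a few patterns, as a rough sketch. This is not the vendor grammar syntax shown on the slide. The city and day lists and the name `spot_concepts` are assumptions.

```python
import re

# Hypothetical word classes standing in for the City and Date sub-grammars.
CITIES = ["Boston", "Chicago", "Dallas"]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def spot_concepts(utterance):
    """Pick out origin, dest and date phrases wherever they occur."""
    city = "|".join(CITIES)
    day = "|".join(DAYS)
    rules = {
        "origin": r"\bfrom (%s)\b" % city,
        "dest": r"\bto (%s)\b" % city,
        "date": r"\bon (%s)\b" % day,
    }
    found = {}
    for concept, pattern in rules.items():
        match = re.search(pattern, utterance, re.IGNORECASE)
        if match:
            found[concept] = match.group(1)
    return found

print(spot_concepts("I'd like to fly from Boston to Chicago on Monday"))
print(spot_concepts("On Monday I want to go to Chicago from Boston"))
```

Both word orders produce the same three concepts, and words that match no rule are ignored.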
![Page 34: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/34.jpg)
Directed Dialogs
![Page 35: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/35.jpg)
Directed Dialog
• Finite-State Grammars - Currently most common method to implement speech-enabled applications
• More flexible & user-friendly than key (Touch-Tone) input
• Allows Spoken List selection
– System: “What City are you leaving from?”
– User: “Birmingham”
• Keywords easier to remember than numeric codes
– “Account balance” instead of “two”
• Easy to skip ahead through menus
– Tellme - “Sports, Basketball, Mavericks”
![Page 36: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/36.jpg)
Issues With Directed Dialogue
• Computer asks all the questions
– Usually presented as a menu
– “Do you want your account balance, cleared checks, or deposits?”
• Computer always has the initiative
– User just answers questions, never gets to ask any questions
• All possible answers must be pre-defined by the application developer (grammars)
• Will eventually get the job done, but can be tedious
• Still much better than Touch-tone menus
![Page 37: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/37.jpg)
Issues With Directed Dialogue
• Application developer must design scripts that never have the machine ask open-ended questions
– “What can I do for you?”
• Application Developer’s job - design questions where answers can be explicitly predicted.
– “Do you want to buy or sell stocks?”
• Developer must explicitly define all possible responses
– Buy, purchase, get some, acquire
– Sell, dump, get rid of it
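A minimal sketch of what "explicitly define all possible responses" looks like in practice: each expected phrase is listed under the action it triggers, and anything else is rejected so the system can re-prompt. The action names and the `interpret` function are hypothetical.

```python
# Every accepted answer to "Do you want to buy or sell stocks?" (from the slide).
RESPONSES = {
    "BUY": ["buy", "purchase", "get some", "acquire"],
    "SELL": ["sell", "dump", "get rid of it"],
}

def interpret(utterance):
    """Map a recognized answer to an action, or None if it was not predicted."""
    text = utterance.lower().strip()
    for action, phrases in RESPONSES.items():
        if text in phrases:
            return action
    return None

print(interpret("Purchase"))
print(interpret("hold"))
```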
![Page 38: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/38.jpg)
Examples of Directed Dialog
Southwest Airlines
Pizza Inn
Brokerage
![Page 39: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/39.jpg)
Standards & Trends
![Page 40: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/40.jpg)
VoiceXML
• VoiceXML - A web-oriented voice-application programming language
– W3C Standard - www.w3.org
– Version 1.0 released March 2000
– Version 2.0 ready to be approved
• http://www.w3.org/TR/voicexml20/
– Voice dialogues scripted using XML structures
• Other VoiceXML support
– www.voicexml.org
– voicexmlreview.org
![Page 41: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/41.jpg)
VoiceXML
• Assume telephone as user device
• Voice or key input
• Pre-recorded or Text-to-Speech output
![Page 42: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/42.jpg)
Why VoiceXML?
• Provides environment similar to web for web developers to build speech applications
• Applications are distributed on document servers similar to web
• Leverages the investment companies have made in the development of a web presence.
• Data from Web databases can be used in the call automation system.
• Designed for distributed and/or hosted (ASP) environment.
![Page 43: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/43.jpg)
VoiceXML Architecture
VoiceXMLBrowser/Gateway
Mobile Device
VUI
Telephone
Network
InternetVoiceXML
Browser
WebServer
WebServer
Serve VoiceXML
Document
Voice
![Page 44: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/44.jpg)
VoiceXML Example
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Example 1 for VoiceXML Review -->
  <form>
    <block> Hello, World! </block>
  </form>
</vxml>
![Page 45: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/45.jpg)
VoiceXML Applications
• Voice Portals
– TellMe
• 1-800-555-8355 (TELL)
• http://www.tellme.com
– BeVocal
• 1-408-850-2255 (BVOCAL)
• www.bevocal.com
![Page 46: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/46.jpg)
The VoiceXML Plan
– Third party developers write VoiceXML scripts that they will publish on the web
– Callers to the Voice Portals will access these voice applications like browsing the web
– VoiceXML will use VUI with directed dialog
• Voice output
• Voice or key input
– hands/eyes free or privacy
![Page 47: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/47.jpg)
Speech Application Language Tags (SALT)
• Microsoft, Cisco Systems, Comverse Inc., Intel, Philips Speech Processing, and SpeechWorks
• www.saltforum.org
• Extension of existing Web standards such as HTML, xHTML and XML
• Support multi-modal and telephone access to information, applications, and Web services, independently or concurrently.
![Page 48: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/48.jpg)
SALT - “Multi-modal”
• Input might come from speech recognition, a keyboard or keypad, and/or a stylus or mouse
• Output to screen or speaker (speech)
• Embedded in HTML documents
• Will require SALT-enabled browsers
• Working Draft V1.9
• Public Release - March 2002
• Submit to IETF - midyear 2002
![Page 49: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/49.jpg)
SALT Code
<!-- Speech Application Language Tags -->
<salt:prompt id="askOriginCity"> Where would you like to leave from? </salt:prompt>
<salt:prompt id="askDestCity"> Where would you like to go to? </salt:prompt>
<salt:prompt id="sayDidntUnderstand" onComplete="runAsk()">
  Sorry, I didn't understand.
</salt:prompt>
<salt:listen id="recoOriginCity"
    onReco="procOriginCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>
<salt:listen id="recoDestCity"
    onReco="procDestCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>
![Page 50: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/50.jpg)
Evolution of the Speech Interface
• Touch-Tone Input
• Directed Dialogue
• Natural Language
– Word spotting
– Phrase spotting
– Deep parsing
![Page 51: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/51.jpg)
Natural Language Understanding
![Page 52: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/52.jpg)
What is Natural Language?
• User can take the initiative
– Computer says “How can I help you?”
• User can state request in a single interaction
– What is the price of IBM?
– “I want to sell all my IBM stock today at the market”
• User can change initiatives midstream
– I want to buy some stock - How much do I have in my account?
• Natural Language is closely related to Artificial Intelligence
• Feasible today, if scope of discussion is limited
![Page 53: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/53.jpg)
Natural Language
• NLP, NLU, NLG
– All of these mean very specific things to the computational linguistics community
– NLP - Machine processing of human text or speech
– NLU - Machine understanding of unconstrained human-originated text (or speech)
– NLG - Machine generation of human-understandable text (or speech)
• To build a true NL dialog engine you must deal with NLU and NLG
![Page 54: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/54.jpg)
Natural Language Engine
Natural Language Understanding
Natural Language Generation
Database Manipulation
Text in
Text out
![Page 55: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/55.jpg)
Spoken Natural Language Engine
Natural Language Understanding
Text in Text out
Speech in
Automatic Speech Recognition
Text-to-Speech
Speech out
![Page 56: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/56.jpg)
Text-to-Speech
• Modern Text-to-Speech
– More natural-sounding
– Better prosody
– Improved proper noun handling
• People Names
• Street Names
• Cities
– Abbreviation handling
• Dr. = Doctor or Drive
![Page 57: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/57.jpg)
Overview of Dialog Systems
• Openness of dialog
• Can ASR hear everything?
• Can NLP understand everything heard?
• Can DM cope with multiple strands / directions?
• Does Prompt Generator sound natural?
Figure components: Voice in, ASR, NL Parser, Dialog Manager, B/O, Prompt Generator, TTS, Voice out
![Page 58: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/58.jpg)
Dialog Sophistication
• NL Parser
• Ontology + syntax
• Ontology
• Morphology Concept Spotting
• Word spotting
more
less
• Dialog Management
• Complex Mixed Initiatives
• Dialog context
• Query Answering (Q & A)
• Directed dialog
less
more
![Page 59: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/59.jpg)
Natural Language Technologies
• Improve Existing Applications
– Scheduling - Airlines, Hotels
– Financial - Banks, Brokerages
• Enabling New Applications
– Catalogue Order
– Complex Travel Planning
– Information Mining - Voice Web browsing
– Direct Telemarketing AI
• Many applications require Text-to-Speech
![Page 60: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/60.jpg)
Shallow-Parse NLU
• Nuance, SpeechWorks, Philips,…
• W3C Semantic Attachments for Speech Recognition Grammars (Working Draft 11 May 2001)
• Driven by an Interpretation Grammar.
– “Full” - Finite State Grammar (whole sentence must match)
– “Robust” Statistical Grammar (word or phrase-spotting - partial parses of phrase fragments)
– Associating semantic tags (ABNF { }, XML <tag/>) to the Recognition Grammar rules
– Semantic Attachments for Speech Recognition Grammars
• Constrained by the Recognition Grammar
– Need to write partial Recognition Grammar even when using SLM
• No interpretation in-context
![Page 61: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/61.jpg)
Statistical Language Model
Figure components: Speech Waveform, ASR Front End, SLM (Probabilistic Word Transitions), N-Best Phrases, Phrase-Spotting Rules (SLM Interpretive or Recognition Grammar)
N-Best Phrases:
1- “I want to buy stock…”
2- “I want to go by the lock…”
3- “High doesn’t mock….”
Phrase-spotting outputs: Buy stock, Sell stock, Get stock quote, Get operator
Percentages shown in the figure: 60%, 20%, 90%
![Page 62: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/62.jpg)
Shallow-Parse NLU
• Latest thing from ASR vendors
– Philips
• Speech Pearl - SLM
• Speech Mania - FSG
– Nuance
• Say Anything - SLM
• Requires two new constructs
– Statistical Language Model
– Interpretation Grammar
– Allows more open responses
![Page 63: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/63.jpg)
Shallow-Parse NLU
• Developer must define and code “interpretation grammar” that lists keywords and phrases that can occur
– Need programming skills to define interpretive grammars. More complex coding than fixed grammars
– wild cards, concept spotting, phrase matching
• SLM
– Need large number of example utterances to get reliable word-sequence statistics
– Usually requires recording and transcription of live conversations in application topic
– More difficult to add new words to grammar
• Must also provide usage and sequence statistics with each word added
– SLM outputs n-best utterance list
![Page 64: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/64.jpg)
Shallow-Parse NLU
• Interpretive Grammars
– Allows “wild card” descriptions: “(to $CITY)”
• “I want to fly to $CITY” - “I want to go to $CITY”
• “I need to fly to $CITY”
– Allows out-of sequence phrases
• I want to go from Chicago to Dallas today
• Today I want to go to Dallas from Chicago
![Page 65: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/65.jpg)
Shallow-Parse NLU
• Allows more open dialog than Finite-State grammars
• Requires more development effort than Finite-State Grammars
• Will allow a smaller technology step than full NLU for less-demanding applications
![Page 66: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/66.jpg)
Deep Parsing NLU
• Full linguistic analysis of speech
– Syntactic and semantic analysis
• Core Language Model contains data structures required for language understanding
– Lexicon
– Ontology
– Parsing Rules
• Eliminate
– Scripting
– Manually-defined semantic tags
• Developer doesn’t have to define concepts
![Page 67: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/67.jpg)
Deep Parsing NLU
• Lexicon - A List of Words and their Syntactical and Semantic Attributes
• Root or stem word form
– fox, run, Boston
• Optional forms: plurals, tenses
– fox, foxes
– run, ran, running
• part of speech
– fox - noun
– run - verb
– Boston - proper noun
• Link to Ontology
– fox - animal, brown, furry
– run - action, move fast
– Boston - city
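One way to picture a lexicon entry is as a small record holding the four attributes listed above. The sketch below reuses the slide's three example words; the `LexiconEntry` class, its field names, and the `lookup` helper are hypothetical and do not describe any particular parser's data format.

```python
from dataclasses import dataclass, field

@dataclass
class LexiconEntry:
    root: str                                      # root or stem word form
    part_of_speech: str                            # e.g. noun, verb, proper noun
    forms: list = field(default_factory=list)      # plural and tense variants
    concepts: list = field(default_factory=list)   # links into the ontology

LEXICON = [
    LexiconEntry("fox", "noun", ["fox", "foxes"], ["animal", "brown", "furry"]),
    LexiconEntry("run", "verb", ["run", "ran", "running"], ["action", "move fast"]),
    LexiconEntry("Boston", "proper noun", ["Boston"], ["city"]),
]

# Index every surface form so that an inflected word finds its root entry
BY_FORM = {form.lower(): entry for entry in LEXICON for form in entry.forms}

def lookup(word):
    """Return the entry for any known form of a word, or None if unknown."""
    return BY_FORM.get(word.lower())
```

Looking up "foxes" or "ran" returns the entry for the root form, together with its part of speech and ontology links.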
![Page 68: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/68.jpg)
Deep Parsing NLU
• Ontology - Classes & Relationships
[Figure: ontology hierarchy. Top-level classes: PERSON, LOCATION, DATE, TIME, PRODUCT, NUMERICAL, MONEY, ORGANIZATION, MANNER, VALUE, DEGREE, DIMENSION, RATE, DURATION, PERCENTAGE, COUNT. Example nodes shown beneath them: time of day; midnight; prime time; clock time; hockey team; team, squad; institution, establishment; financial institution; educational institution; numerosity, multiplicity; integer, whole number; population; denominator; thickness; width, breadth; distance, length; altitude; wingspan.]
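An ontology of classes and relationships can be sketched as a table of child-to-parent links that is walked upward to reach a top-level class. The node names below come from the slide, but the specific parent links are assumptions made for illustration, since the flattened figure does not preserve which node hangs under which class.

```python
# Child -> parent links. The links themselves are assumed for illustration.
PARENT = {
    "hockey team": "team",
    "team": "ORGANIZATION",
    "financial institution": "institution",
    "institution": "ORGANIZATION",
    "midnight": "time of day",
    "time of day": "TIME",
}

def ancestors(concept):
    """Return the chain of parents from the concept up to its top-level class."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def top_class(concept):
    """Return the top-level class a concept belongs to (itself if it has no parent)."""
    chain = ancestors(concept)
    return chain[-1] if chain else concept

def is_a(concept, cls):
    """True if cls is the concept itself or one of its ancestors."""
    return concept == cls or cls in ancestors(concept)
```

A lookup like `top_class("midnight")` is the kind of step that lets a parser treat a specific word as an instance of a general class without the developer tagging it by hand.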
![Page 69: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/69.jpg)
Deep Parsing NLU
Who was the first Russian astronaut to walk in space
WP VBD DT JJ NNP NN TO VB IN NN
[Figure: Parsing Rules. Parse tree of the sentence above, with phrase nodes NP, PP, VP and S, each labeled with a word from the sentence (walk, space, astronaut, first, Russian); "astronaut" is linked to the ontology class PERSON.]
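The tag row lines up one-to-one with the words of the question; the tags are Penn Treebank part-of-speech labels. The short sketch below only pairs the two rows and picks out the noun-tagged words. It does not perform any parsing, and the helper name `align` is made up for illustration.

```python
SENTENCE = "Who was the first Russian astronaut to walk in space"
TAGS = "WP VBD DT JJ NNP NN TO VB IN NN"

def align(sentence, tags):
    """Pair each word with its part-of-speech tag, one tag per word."""
    words, labels = sentence.split(), tags.split()
    if len(words) != len(labels):
        raise ValueError("need exactly one tag per word")
    return list(zip(words, labels))

TAGGED = align(SENTENCE, TAGS)

# Words whose tag starts with NN (common or proper nouns)
NOUNS = [word for word, tag in TAGGED if tag.startswith("NN")]
```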
![Page 70: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/70.jpg)
Natural Language Understanding
• NL Parser
– N-Best input in NLSML format
– Context-free parsing based on Ontology, Lexicon and Rules
– Filtering out redundant interpretations
– Extended W3C NLSML output
• Dialog Server
– Extended W3C NLSML input
– In-context interpretation
– Dialog Directives output
[Diagram: Voice in → ASR → (NLSML) → NL Parser → (NLSML) → NL Dialog Server; a KB block is also shown.]
NL Interpretation is split between two components
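The "filtering out redundant interpretations" step can be pictured as collapsing n-best hypotheses that differ in wording but mean the same thing. The sketch below is an assumption-laden illustration: it uses plain Python tuples rather than NLSML, and the hypotheses, scores, and slot names are invented.

```python
def filter_redundant(n_best):
    """Keep the best-scoring hypothesis for each distinct interpretation."""
    best = {}
    for text, score, slots in n_best:
        # Two hypotheses are redundant when their slot/value pairs are identical
        key = tuple(sorted(slots.items()))
        if key not in best or score > best[key][1]:
            best[key] = (text, score, slots)
    return sorted(best.values(), key=lambda hyp: hyp[1], reverse=True)

# Hypothetical recognizer output: (text, confidence, interpretation)
N_BEST = [
    ("I want to fly to Dallas", 0.62, {"destination": "dallas"}),
    ("I want to go to Dallas", 0.21, {"destination": "dallas"}),
    ("I want to fly to Dulles", 0.17, {"destination": "dulles"}),
]

FILTERED = filter_redundant(N_BEST)
```

Here the first two hypotheses collapse into one, leaving the dialog server to choose between two genuinely different interpretations using the dialog context.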
![Page 71: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/71.jpg)
Natural Language
• Stock Transaction
• 30 seconds
• Airline Reservation
• TTS
![Page 72: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/72.jpg)
Human Language Technology Institute at UTD
• Opening March 2002 (open house March 7-8)
• Foster research in Human Language Technology
• Establish ties with local industry
• 6 new faculty positions
• Currently funded at about 1 million per year by state and government agencies
![Page 73: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/73.jpg)
Speech Technology Professionals Forum
• Monthly meetings in Telecom Corridor
– Third Tuesday - March 19
• Rick Tett
– [email protected]
– www.candora.com/stpf
![Page 74: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/74.jpg)
Conclusions
• Consumer applications in Stock Brokerages, Travel Agencies, etc. are raising the standards for directed-dialogue production quality and usability
• VoiceXML and SALT will open VUI Directed Dialog application development
• Artificial Intelligence and Natural Language technology are making rapid advances, enabling highly conversational applications
![Page 75: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/75.jpg)
The Speech User Interface
• Speech will emerge as the preferred mobile interface
• Selectable voice or key input
– Hands/eyes free or privacy
• Selectable voice or screen output
– When small screens make sense
• Hands-free Wireless Web Access
– WAP phones not required
– 3G phones not required
![Page 76: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/76.jpg)
The Speech User Interface
• Natural Language will make VUI highly conversational
– no menus
– no memorizing keywords or keystrokes
• Many so-called graphical applications can be made more efficient with a pure speech interface
– Driving Maps, Bank Statements
![Page 77: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/77.jpg)
The Speech User Interface
The computing terminal of tomorrow -
is already here!
![Page 78: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/78.jpg)
Speech -
The optimal computing interface
![Page 79: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/79.jpg)
Thank You!
![Page 80: Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com](https://reader037.vdocuments.site/reader037/viewer/2022110400/56649dba5503460f94aaab03/html5/thumbnails/80.jpg)
References
• Speech & Language Processing
– Jurafsky & Martin - Prentice Hall - 2000
• Statistical Methods for Speech Recognition
– Jelinek - MIT Press - 1999
• Foundations of Statistical Natural Language Processing
– Manning & Schütze - MIT Press - 1999
• Dr. J. Picone - Speech Website
– www.isip.msstate.edu