TRANSCRIPT
SAI User-System Interaction u1, Speech in the Interface: 3. Speech input and output technology
Module u1: Speech in the Interface
3: Speech input and output technology
Jacques Terken
contents
Speech input technology
– Speech recognition
– Language understanding
– Consequences for design
Speech output technology
– Language generation
– Speech synthesis
– Consequences for design
Project
Components of conversational interfaces
Speech recognition
Natural Language Analysis
Dialogue Manager
Speech Synthesis
Language Generation
Application
Noise suppression
Speech recognition
Advances have come both through progress in speech and language engineering and through increases in CPU power
Developments
[Chart: progress of speech recognition from 1980 to 2000, plotted as vocabulary size (2, 20, 200, 2000, 20000 words, up to unrestricted) against speaking style (isolated words, connected speech, read speech, fluent speech, spontaneous speech). Applications placed along these dimensions include word spotting, digit strings, voice commands, name dialing, directory assistance, form fill by voice, office dictation, transcription, system-driven dialogue, 2-way dialogue, network agents & intelligent messaging, and natural conversation.]
State of the art
Why is generic speech recognition so difficult?
– Variability of the input, due to many different sources
– Understanding requires vast amounts of world knowledge and common-sense reasoning for generating and pruning hypotheses
Dealing with this variability, and with the storage of and access to world knowledge, exceeds the capabilities of current technology.
Sources of variation
– Task/Context: man-machine dialogue, dictation, free conversation, interview; phonetic/prosodic context
– Speaker: voice quality, pitch, gender, dialect; speaking style: stress/emotion, speaking rate, Lombard effect
– Noise: other speakers, background noise, reverberations
– Microphone: distortion, electrical noise, directional characteristics
– Channel: distortion, noise, echoes, dropouts
All of these affect the input that reaches the speech recognition system.
No generic speech recognizer
The idea of a generic speech recognizer has been given up (for the time being); automatic speech recognition is possible by virtue of self-imposed limitations:
– vocabulary size
– multiple vs single speaker
– real-time vs offline
– recognition vs understanding
Speech recognition systems: relevant dimensions
– Speaker-dependent vs speaker-independent
– Vocabulary size
– Grammar: fixed grammar vs probabilistic language model
There is a trade-off between the different dimensions in terms of performance: the choice of technology is determined by the application requirements.
Command and control
Examples: controlling the functionality of a PC or PDA; controlling consumer appliances (stereo, TV, etc.)
– Individual words and multi-word expressions: “File”, “Edit”, “Save as webpage”, “Columns to the left”
– Speaker-independent: no training needed before use
– Limited vocabulary gives high recognition performance
– Fixed-format expressions (defined by a grammar)
– Real-time
The user needs to know which items are in the vocabulary and what expressions can be used; (usually) not customizable.
Information services
Examples: train travel information, integrated trip planning
– Continuous speech
– Speaker-independent: multiple users
– Mid-size vocabulary, typically fewer than 5000 words
– Flexibility of input: extensive grammar that can handle expected user inputs
– Requires interpretation
– Real-time
Dictation systems
– Continuous speech
– Speaker-dependent: requires training by the user
– (Almost) unrestricted input: large vocabulary, > 200,000 words
– Probabilistic language model instead of a fixed grammar
– No understanding, just recognition
– Off-line (but near-online performance possible, depending on system properties)
State of the art ASR: statistical approach. Two phases:
– Training: creating an inventory of acoustic models and computing transition probabilities
– Testing (classification): mapping the input onto the inventory
Writing vs speech
Writing and speech group words differently: in writing {see} {eat, break} {lake}, in speech {si:, i:t} {brek, lek}
Alphabetic languages: approximately 25 signs
Average language: approximately 40 sounds
Phonetic alphabet: 1:1 mapping between character and sound
Speech and sounds
[Figure: waveform and spectrogram of “How are you”]
Speech is made up of non-discrete events
Representation of the speech signal
– Sounds are coded as successions of states (one state every 10-30 ms)
– States are represented by acoustic vectors
[Figure: two spectrogram-style plots, frequency against time]
Acoustic models
– Inventory of elementary probabilistic models of basic linguistic units, e.g. phonemes
– Words stored as networks of elementary models
[Figure: states with associated probability density functions (pdf)]
Training of acoustic models
– Compute acoustic vectors and transition probabilities from large corpora
– Each state holds statistics concerning parameter values and parameter variation
– The larger the amount of training data, the better the estimates of parameter values and variation
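As a toy illustration of the statistics a state holds, here is a sketch computing the mean and variance of a single acoustic parameter over the frames assigned to one state (the frame values are invented; a real system stores e.g. Gaussian mixture parameters over whole acoustic vectors):

```python
import statistics

# Invented parameter values for frames assigned to one state during training.
frames_for_state = [2.1, 2.4, 1.9, 2.2, 2.0]

# The state's model: central value and spread of the parameter.
state_model = {
    "mean": statistics.mean(frames_for_state),
    "variance": statistics.variance(frames_for_state),  # sample variance
}
```

With more training frames, these estimates become more reliable, which is exactly the point made above.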
Language model
1. Defined by grammar
– Grammar: rules for combining words into sentences (defining the admissible strings in the language); the basic unit of analysis is the utterance/sentence
– A sentence is composed of words representing word classes, e.g.
  determiner: the
  noun: boy
  verb: eat
noun: boy; verb: eat; determiner: the
rule 1: noun_phrase → det n
rule 2: sentence → noun_phrase verb
Morphology: base forms vs derived forms
– eat: stem, 1st person singular
– stem + s: 3rd person singular
– stem + en: past participle
– stem + er: substantive (noun)
These rules admit “the boy eats” but exclude:
*the eats
*boy eats
*eats the boy
2. Statistical language model
– Probabilities for words and transition probabilities for word sequences in a corpus:
  unigram: probability of individual words
  bigram: probability of a word given the preceding word
  trigram: probability of a word given the two preceding words
– Training materials: language corpora (journal articles; application-specific)
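These n-gram probabilities can be sketched with a toy corpus (the corpus is invented for illustration; a real language model is trained on the large corpora mentioned above and needs smoothing for unseen sequences):

```python
from collections import Counter

# Tiny stand-in corpus; real models are trained on large text corpora.
corpus = "the boy eats the apple the boy sleeps".split()

unigrams = Counter(corpus)                  # counts of single words
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs

def p_unigram(w):
    # unigram: probability of an individual word
    return unigrams[w] / len(corpus)

def p_bigram(w, prev):
    # bigram: probability of a word given the preceding word
    return bigrams[(prev, w)] / unigrams[prev]
```

Here p_bigram("boy", "the") is 2/3, because two of the three occurrences of “the” are followed by “boy”.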
Recognition / classification
[Diagram: speech input → acoustic analysis → feature vectors x1...xT → global search, which maximizes P(x1...xT | w1...wk) · P(w1...wk) over word sequences w1...wk → recognized word sequence. The acoustic model P(x1...xT | w1...wk) draws on the phoneme inventory and the pronunciation lexicon; P(w1...wk) comes from the language model.]
Compute the probability of a sequence of states, given the probabilities of the states, the probabilities of transitions between states, and the language model. This gives the best path. Usually not just the best path but an n-best list is produced for further processing.
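The best-path computation can be sketched with a minimal Viterbi implementation (all states, observations and probabilities below are invented toy values, not from a real recognizer; a real system works in log probabilities and keeps an n-best list rather than one path):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best predecessor for state s at this time step
            prob, prev = max(
                (V[-2][p] * trans_p[p][s] * emit_p[s][o], p) for p in states
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(V[-1], key=V[-1].get)
    return path[best], V[-1][best]

# Toy two-state example (numbers are illustrative only).
states = ("speech", "silence")
start_p = {"speech": 0.6, "silence": 0.4}
trans_p = {"speech": {"speech": 0.7, "silence": 0.3},
           "silence": {"speech": 0.4, "silence": 0.6}}
emit_p = {"speech": {"loud": 0.9, "quiet": 0.1},
          "silence": {"loud": 0.2, "quiet": 0.8}}
best_path, best_prob = viterbi(["loud", "quiet"], states,
                               start_p, trans_p, emit_p)
```

For the observations ["loud", "quiet"] the best path here is ["speech", "silence"].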
Caveats
– The properties of the acoustic models are strongly determined by the recording conditions: recognition performance depends on the match between recording conditions and run-time conditions
– Use of a language model induces a word bias: for words outside the vocabulary, the best-matching in-vocabulary word is selected. Solution: use a garbage model
Advances
– Confidence measures for recognition results, based on acoustic similarity, on actual confusions in a database, or on the acoustic properties of the input signal
– Dynamic (state-dependent) loading of the language model
– Parallel recognizers, e.g. In-Vehicle Information Systems (IVIS): separate recognizers for the navigation system, entertainment systems, mobile phone, and general purpose; choice on the basis of confidence scores
Further developments
– Parallel recognizer for hyper-articulate speech
State of the art performance
– 98 - 99.8 % correct for small-vocabulary, speaker-independent recognition
– 92 - 98 % correct for speaker-dependent, large-vocabulary recognition
– 50 - 70 % correct for speaker-independent, mid-size vocabulary recognition
Recognition of prosody
– Observable manifestations: pitch, temporal properties, silence
– Function: emphasis, phrasing (e.g. through pauses), sentence type (question/statement), emotion, etc.
– Relevant to understanding/interpretation, e.g.:
  Mary knows many languages you know
  Mary knows many languages, you know
– Influence on the realisation of phonemes: prosody used to be considered noise, but it contains relevant information
contents
Speech input technology
– Speech recognition
– Language understanding
– Consequences for design
Speech output technology
Consequences for design
Project
Natural language processing
Full parse or keyword spotting (concept spotting)
Keyword spotting: <any> keyword <any>
e.g. <any> $DEPARTURE <any> $DESTINATION <any>
can handle:
– Boston New York
– I want to go from Boston to New York
– I want a flight leaving at Boston and arriving at New York
Semantics (mapping onto functionality) can be specified in the grammar
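The $DEPARTURE / $DESTINATION pattern above can be sketched as a regular expression (the city list and slot names are illustrative stand-ins; a real system would use a grammar covering the full city vocabulary):

```python
import re

# Toy city vocabulary; a real grammar would enumerate many more names.
CITIES = r"(Boston|New York)"

# <any> $DEPARTURE <any> $DESTINATION <any>, with <any> as lazy ".*?"
PATTERN = re.compile(r".*?\b" + CITIES + r"\b.*?\b" + CITIES + r"\b.*")

def spot(utterance):
    # Map the utterance onto departure/destination slots, or None.
    m = PATTERN.match(utterance)
    if m:
        return {"departure": m.group(1), "destination": m.group(2)}
    return None
```

This accepts all three example inputs above, from the terse “Boston New York” to the full sentence, because the filler material is simply skipped.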
contents
Speech input technology
– Speech recognition
– Language understanding
– Consequences for design
Speech output technology
Consequences for design
Project
Coping with technological shortcomings of ASR
Shortcomings:
– Reliability/robustness
– Architectural complexity of an “always open” system
– Lack of transparency in case of input limitations
Task for the design of speech interfaces: induce the user to modify their behaviour to fit the requirements (restrictions) of the technology.
Solutions
“Always open” ideal:
– Push-to-talk button: recognition window; “spoke-too-soon” problem
– Barge-in (requires echo cancellation, which may be complicated depending on the reverberation properties of the environment)
Make the training conditions (properties of the training corpus) similar to the test conditions, e.g. special corpora for the car environment.
Good prompt design gives clues about the required input:

Response                         “Will you accept    “Say yes if you accept the
                                 the call?”          call, otherwise say no”
Isolated yes or no               54.5 %              80.8 %
Multiword yes or no              24.2 %              5.7 %
Other affirmative or negative    10.7 %              3.4 %
Inappropriate                    10.4 %              10.2 %
contents
Speech input technology
Consequences for design
Speech output technology
– Technology
– Human factors in speech understanding
– Consequences for design
Project
Components of conversational interfaces
Speech recognition
Natural Language Analysis
Dialogue Manager
Speech Synthesis
Language Generation
Application
demos
http://www.ims.uni-stuttgart.de/~moehler/synthspeech/examples.html
http://www.research.att.com/~ttsweb/tts/demo.php
http://www.acapela-group.com/text-to-speech-interactive-demo.html
http://cslu.cse.ogi.edu/tts/
Audiovisual speech synthesis:
http://www.speech.kth.se/multimodal/
http://mambo.ucsc.edu/demos.html
Emotional synthesis (Janet Cahn):
http://xenia.media.mit.edu/%7Ecahn/emot-speech.html
Applications
Information access by phone
– news / weather, timetables (OVR), reverse directory, name dialling, spoken e-mail, etc.
Customer ordering by phone (call centers)
– IVR: ASR replaces tedious touch-tone actions
Car driver information by voice
– navigation, car traffic info (RDS/TMC), Command & Control (VODIS)
Interfaces for the disabled
– MIT/DECTalk (Stephen Hawking)
In the office and at home (near future?)
– Command & Control, navigation for home entertainment
Output technology
[Diagram: Dialogue Manager → Language Generation → Speech Synthesis, drawing content from an application (e.g. e-mail or an information service)]
Language generation
Example data: train connections Eindhoven – Amsterdam CS

Departure time    08:32   08:47   09:02   09:17   09:32
Arrival time      09:52   10:10   10:22   10:40   10:52
Transfers         0       1       0       1       0
If $nr_of_records >1
I have found $n connections:
The first connection leaves at $time_dep from $departure and arrives at $time_arr at $destination
The second connection leaves at $time_dep from $departure and arrives at $time_arr at $destination
If the user also wants information about whether there are transfers, either other templates have to be used, or templates might be composed from template elements
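Composing output from template elements, as suggested above, might look like this sketch (the template wording and field names are illustrative, mirroring the slide's $variables; the transfer clause is an optional element added only when requested and applicable):

```python
# Base template plus an optional transfer element.
BASE = ("The {ordinal} connection leaves at {time_dep} from {departure} "
        "and arrives at {time_arr} at {destination}")
TRANSFERS = ", with {transfers} transfer(s)"

def describe(record, ordinal, mention_transfers=False):
    # Fill the base template; unused record fields are simply ignored.
    text = BASE.format(ordinal=ordinal, **record)
    # Append the transfer element only when asked for and applicable.
    if mention_transfers and record.get("transfers", 0) > 0:
        text += TRANSFERS.format(**record)
    return text + "."

record = {"time_dep": "08:47", "departure": "Eindhoven",
          "time_arr": "10:10", "destination": "Amsterdam CS",
          "transfers": 1}
sentence = describe(record, "second", mention_transfers=True)
```

Each optional piece of information thus becomes its own template element, instead of requiring a separate full template per combination.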
Speech output technologies
Canned (pre-recorded) speech
– Suited for call centers, IVR: fixed messages/announcements
Concatenation of pre-recorded phrases
– Suited for bank account information, database enquiry systems with structured data, and the like
– Template-based, e.g. “your account is <$account>”, “the flight from <$departure> to <$destination> leaves at <$date> at <$time> from <$gate>”, “the number of <$customer> is <$telephone_number>”
– Requirements: a database of phrases to be concatenated
– Some knowledge of speech science required:
  – words are pronounced differently depending on emphasis, position in the utterance, and type of utterance
  – the differences concern both pitch and temporal properties (prosody)
Compare the different realisations of “Amsterdam” in:
– Do you want to go to Amsterdam? (emphasis, question, utterance-final)
– I want to go to Amsterdam (emphasis, statement, utterance-final)
– Are there two stations in Amsterdam? (no emphasis, question, utterance-final)
– There are two stations in Amsterdam (no emphasis, statement, utterance-final)
– Do you want to go to Amsterdam Central Station? (no emphasis, statement, utterance-medial)
Solution:
– have the words pronounced in context to obtain different tokens
– apply clever splicing techniques for smooth concatenation
Text-to-speech conversion (TTS)
– Suited for unrestricted text input: all kinds of text
  – reading e-mail, fax (in combination with optical character recognition)
  – information retrieval for unstructured data (preferably in combination with automatic summarisation)
– Utterances are made up by concatenation of small units with post-processing for prosody, or by concatenation of variable-size units
TtS technology
Distinction between
– linguistic pre-processing and
– synthesis
Linguistic pre-processing:
– Grapheme-phoneme conversion: mapping written text onto a phonemic representation, including word stress
– Prosodic structure (emphasis, boundaries including pauses)
TtS: linguistic pre-processing: grapheme-phoneme conversion
To determine how a word is pronounced:
– consult a lexicon, containing a phoneme transcription, syllable boundaries, and word accent(s)
– and/or develop pronunciation rules
Output (Dutch place names):
  Enschede → . ‘En-sx@-de .
  Kerkrade → . ‘kErk-ra-d@ .
  ‘s-Hertogenbosch → . sEr-to-x@n-‘bOs .
Pros and cons of a lexicon
– phoneme transcriptions are accurate
– (high) risk of out-of-vocabulary words, because the lexicon:
  – often contains only stems, no inflections or compounds
  – is never up to date / complete
– but usually the application includes a user lexicon
Pros and cons of pronunciation rules
– no out-of-vocabulary words
– transcription results are often wrong for:
  – (longer) combinations of words / morphemes
  – exceptions and loan words from other languages
The best solution is a combination of the two methods:
– develop a list of words incorrectly transcribed by the rules and put these words in an exception lexicon
– words not in the exception list are then transcribed by rule
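A minimal sketch of this combined method (the exception entry mirrors the earlier Enschede transcription in ASCII form; the letter-to-sound “rules” are a naive one-character-one-symbol stand-in, not a real grapheme-to-phoneme component):

```python
# Words the rules would transcribe incorrectly go into the exception lexicon.
EXCEPTIONS = {"enschede": "'En-sx@-de"}

# Toy letter-to-sound rules: one character maps to one phoneme symbol.
RULES = {"a": "A", "b": "b", "d": "d", "e": "e", "n": "n", "s": "s"}

def transcribe(word):
    w = word.lower()
    if w in EXCEPTIONS:
        # exception lexicon takes precedence
        return EXCEPTIONS[w]
    # otherwise transcribe by rule, character by character
    return "".join(RULES.get(ch, ch) for ch in w)
```

The lookup order implements exactly the combination described above: exceptions first, rules for everything else.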
Complications
– Words with the same written form but different pronunciations and meanings: ‘record vs re’cord; requires parsing or a statistical approach
– Proper names and other specialized vocabularies, acronyms/abbreviations (small announcements in journals!): need to be included in a (user) lexicon
– Different kinds of numbers (telephone numbers, amounts, credit card numbers, etc.): require number grammars
TtS: linguistic pre-processing: prosody
– Emphasis, boundaries (including pauses), sentence type
– Observable manifestations: pitch, temporal properties, silence
– Requires analysis of linguistic structure (parsing) and (ideally) discourse-level information (cf. the earlier “Amsterdam” example)
TtS: synthesis
Concatenation from whole words and phrases is practically impossible:
– the database becomes too large (especially if several versions of each word are needed), and
– there is no full coverage (out-of-vocabulary words)
Approaches:
– sub-word units
– data-oriented approach
Synthesis by subword units
Common approach: diphone synthesis
– linking together pre-recorded diphones, i.e. short segments (transitions between two successive phonemes) extracted from natural speech:
  ‘s-Hertogenbosch
  phonemes: . s E r t o x @ n b O s .
  diphones: .s sE Er rt to ox x@ @n nb bO Os s.
– In all, 1600 transitions per language (40 × 40)
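The diphone decomposition of the ‘s-Hertogenbosch example can be reproduced directly (using "." for the silence at the utterance edges, as on the slide):

```python
def to_diphones(phonemes):
    # Each diphone spans the transition between two successive phonemes.
    return [a + b for a, b in zip(phonemes, phonemes[1:])]

phonemes = [".", "s", "E", "r", "t", "o", "x", "@", "n", "b", "O", "s", "."]
diphones = to_diphones(phonemes)
```

A sequence of n phonemes always yields n−1 diphones, which is why roughly 40 × 40 = 1600 recorded transitions cover a whole language.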
Synthesis:
– concatenate the diphones in the correct order
– perform some (intensity) smoothing at the diphone borders
– adjust phoneme duration and pitch course according to prosody rules
Data-oriented approach
– Generalization of the diphone approach
– Store a large database of speech (running text)
– At run-time:
  – generate a structure representing the phoneme sequence and the prosodic properties needed
  – search algorithm: find the largest possible fragments in the database containing the required properties
Example: frei-burg and nürn-berg together yield frei-berg; /fr/ also serves for fr-iedrichshafen: items in the database are re-usable.
Concatenate the fragments as they are, without post-processing for pitch and duration; in this way, not only the phoneme parameters and transitions are taken from the data, but also the pitch and temporal properties.
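The “largest possible fragments” search can be sketched as a greedy longest-match over a toy database (a real unit-selection system also scores prosodic match and join costs; the “recordings” here are plain strings standing in for labelled speech):

```python
# Toy database of "recorded" items, as in the Freiburg/Nürnberg example.
DATABASE = ["freiburg", "nürnberg"]

def select_units(target):
    # Greedily cover the target with the longest fragments found in the DB.
    units = []
    i = 0
    while i < len(target):
        best = target[i]  # fall back to a single symbol if nothing matches
        for j in range(len(target), i, -1):
            frag = target[i:j]
            if any(frag in rec for rec in DATABASE):
                best = frag
                break
        units.append(best)
        i += len(best)
    return units
```

For the target "freiberg" this selects "freib" (from freiburg) plus "erg" (from nürnberg), illustrating how database items are re-used.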
Advantage: natural speech quality is preserved (though this may not always be desirable: perhaps it should be made clear to people that they are talking to a system)
Disadvantage: no explicit control over voice characteristics and prosodic characteristics such as pitch and speaking rate (which you might want to manipulate to synthesize emotional speech or convey a certain personality)
– Difficult or impossible to modify speaker characteristics
– Other speaker: new database required
– Other speaking style: new database required
– Research: post-processing of the result with preservation of speech quality
Hybrid synthesis
– Combination of phrase concatenation and TTS
– Suited for template-based synthesis with a fixed message structure and variable slots:
  “the flight from <$departure> to <$destination> leaves at <$date> at <$time> from <$gate>”
– In dialogue systems the system has knowledge of the message structure and can select the proper tokens from the database on the basis of this knowledge
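Token selection based on message-structure knowledge can be sketched as a simple lookup (the token filenames and context labels are invented for illustration; a real database would index recorded tokens by word and prosodic context):

```python
# Recorded tokens of the same word in different prosodic contexts.
TOKENS = {
    ("Amsterdam", "utterance-final"): "amsterdam_final.wav",
    ("Amsterdam", "utterance-medial"): "amsterdam_medial.wav",
}

def select_token(word, position):
    # The dialogue system knows the slot's position in the message
    # structure, so it can pick the matching recorded token.
    return TOKENS[(word, position)]
```

Because the message template fixes each slot's position, the system knows at generation time which prosodic variant to splice in.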
Future: markup languages
– structured text
– current TTS systems strip text annotations (plain-ASCII standard)
– draft proposal for an XML markup for synthesis; SALT
Contents
Speech input technology
Consequences for design
Speech output technology
– Technology
– Human factors in speech understanding
– Consequences for design
Project
Issues in comprehension
Speech quality:
– reduced quality slows down feature extraction and the mapping of the input onto feature vectors: it increases the number of matching vectors
– this requires compensation by top-down processing, which takes more time, effort and practice
“Text-to-speech”: written text is often difficult to understand when read aloud, due to complex structures, high information density, etc.
Application to synthetic speech
– the substandard quality of synthetic speech requires compensation by (resource-limited) top-down processing
– potential overload of the system due to time constraints
– slowing down the speaking rate is a very effective way to give the listener more processing time
Case study: picking up information from speech
Study on auditory exploration of lists (Pitt & Edwards, 1997): recall of a list of 48 file names
– presented in groups
– group size varied (2, 3 or 4)
– presentation of groups listener-paced
– recall immediately after each group
Number of filenames per group    % correct recall
2                                76.7
3                                49.7
4                                20.5
Adjustments: analysis of “list” speaking style
– grouping principles:
  • always try to group
  • grouping by filenames and extensions
  • large groups first
  • mnemonic links between groups
– prosodic structuring
Evaluation: a directory with four subdirectories, each containing files with four different names, corresponding to the modules of a programming project, and with three different extensions
Task: find the most recent version of the four files containing the source code for the modules and copy them into a new directory
Measures: objective (task completion) and subjective
Results for task completion: new algorithm 10.39 min, old version 24.12 min
Contents
Speech input technology
Consequences for design
Speech output technology
– Technology
– Human factors in speech understanding
– Consequences for design
Project
Design implications
– The choice of technology can be made dependent on the needs of the application
– For restricted domains, very high quality can be achieved through canned speech or phrase concatenation with multiple tokens
– For concatenation with unit selection there is a relation between quality and the size of the database; for good systems there is usually no problem with intelligibility, even for inexperienced listeners
– High quality for diphone speech, needed for uncommon forms such as proper names or company names that are unlikely to be available from a corpus, still requires much effort
– Importance of learning effects
– A general finding is that acceptance of synthetic speech depends strongly on voice quality
– If the trade-off between quality and added value is negative, the prospects for acceptance of the speech interface are poor
Speech as output modality: speech vs text/graphics
Text/graphics:
– an image [may be] worth a thousand words
– an image/written text is persistent
– an image is (at least) two-dimensional: temporal and spatial organisation
– visual expression of hierarchical structure
– receiver-paced
– but non-adaptive (until recently); now: adaptive hypertext
Speech:
– one-dimensional: extends only in time; spatial issues are better dealt with in another modality
– sender-paced
– a poor medium, yet popular: a large amount of speech-based communication serves primarily a social function
– no need for supporting aids such as paper and pen
– no special motoric abilities needed
– speaking is fast
Heuristics: speech output is preferred when
– the message is simple
– the message is short
– the message need not be referred to later
– the message deals with events in time
– the message requires an immediate response
– the visual channels are overloaded
– the environment is brightly lit, poorly lit, subject to severe vibration, or otherwise adverse to the transmission of visual information
– the user must be free to move around
– pronunciation is the subject of interaction
(from Michaelis & Wiggins)
But speech output is preferably not used when
– the message is complex or uses unfamiliar terms
– the message is long
– the message needs to be referred to later
– the message deals with spatial topics
– the message is not urgent
– the auditory channels are overloaded
– the environment is too noisy
– the user has easy access to a screen
– the system output consists of many different kinds of information which must be available simultaneously and be monitored and acted upon by the user
(from Michaelis & Wiggins)
Environmental variation and “mixed” interaction call for multimodal interfaces.
contents
Speech input technology
Consequences for design
Speech output technology
Main points
Project
Main points
– The database approach, requiring large databases for individual languages and speaking styles, is dominant for both speech input and output
  – Input: databases for training acoustic models and the language model
  – Output: concatenation of segments and phrases taken from a database
– Large differences in recognition performance and output quality across languages and target groups (e.g. recognition for children)
– Speech input: three major classes of applications: command & control, information services, dictation systems
– Major parameters: speaker-dependent vs speaker-independent, vocabulary size (small, medium, large), rigid vs free-format input
Dialogue management
– Finite-state or frame-based approach for task-oriented dialogue acts
– Verification strategies and repair mechanisms for dialogue control
Pragmatic approaches to language understanding and language generation:
– Input: directly mapped onto application functionality
– Output: template-based approaches
Not covered: speech monitoring, speech data mining applications and technology
Exercises with the CSLU toolkit and other demonstrators
– Try out your name, telephone numbers, dates, e-mail addresses, abbreviations, etc.
Project
– Protocol development:
  • Dialogue structure
  • Strategies and prompts
– Tomorrow:
  • Wizard of Oz test