TRANSCRIPT
SAI User-System Interaction u1, Speech in the Interface: 3. Speech input and output technology
Module u1: Speech in the Interface
3: Speech input and output technology
Jacques Terken
contents
Speech input technology
– Speech recognition
– Language understanding
– Consequences for design
Speech output technology
– Language generation
– Speech synthesis
– Consequences for design
Project
Components of conversational interfaces
Speech recognition
Natural Language Analysis
Dialogue Manager
Speech Synthesis
Language Generation
Application
Noise suppression
Speech recognition
Advances have come both through progress in speech and language engineering and through increases in CPU power
Developments
[Chart: progress of speech recognition from 1980 to 2000, plotted as vocabulary size (2, 20, 200, 2000, 20000 words, up to unrestricted) against speaking style (isolated words, connected speech, read speech, fluent speech, spontaneous speech). Applications placed along these dimensions include word spotting, digit strings, voice commands, name dialing, directory assistance, form fill by voice, office dictation, transcription, system-driven dialogue, 2-way dialogue, network agents & intelligent messaging, and natural conversation.]
State of the art
Why is generic speech recognition so difficult?
– Variability of the input, due to many different sources
– Understanding requires vast amounts of world knowledge and common-sense reasoning for generating and pruning hypotheses
Dealing with this variability, and with the storage of and access to world knowledge, exceeds the capabilities of current technology.
Sources of variation
– Task/Context: man-machine dialogue, dictation, free conversation, interview; phonetic/prosodic context
– Speaker: voice quality, pitch, gender, dialect; speaking style: stress/emotion, speaking rate, Lombard effect
– Noise: other speakers, background noise, reverberations
– Microphone: distortion, electrical noise, directional characteristics
– Channel: distortion, noise, echoes, dropouts
All of these affect the input that reaches the speech recognition system.
No generic speech recognizer
The idea of a generic speech recognizer has been given up (for the time being); automatic speech recognition is possible by virtue of self-imposed limitations:
– vocabulary size
– multiple vs single speaker
– real-time vs offline
– recognition vs understanding
Speech recognition systems: relevant dimensions
– Speaker-dependent vs speaker-independent
– Vocabulary size
– Grammar: fixed grammar vs probabilistic language model
There is a trade-off between the different dimensions in terms of performance: the choice of technology is determined by the application requirements.
Command and control
Examples: controlling the functionality of a PC or PDA; controlling consumer appliances (stereo, TV, etc.)
– Individual words and multi-word expressions: “File”, “Edit”, “Save as webpage”, “Columns to the left”
– Speaker-independent: no training needed before use
– Limited vocabulary gives high recognition performance
– Fixed-format expressions (defined by a grammar)
– Real-time
The user needs to know which items are in the vocabulary and what expressions can be used; (usually) not customizable.
Information services
Examples: train travel information, integrated trip planning
– Continuous speech
– Speaker-independent: multiple users
– Mid-size vocabulary, typically fewer than 5000 words
– Flexibility of input: extensive grammar that can handle expected user inputs
– Requires interpretation
– Real-time
Dictation systems
– Continuous speech
– Speaker-dependent: requires training by the user
– (Almost) unrestricted input: large vocabulary, > 200,000 words
– Probabilistic language model instead of a fixed grammar
– No understanding, just recognition
– Off-line (but near-online performance possible, depending on system properties)
State of the art ASR: statistical approach. Two phases:
– Training: creating an inventory of acoustic models and computing transition probabilities
– Testing (classification): mapping the input onto the inventory
Writing vs speech
Writing and speech group words differently: in writing {see} {eat, break} {lake}, in speech {si:, i:t} {brek, lek}
Alphabetic languages: approximately 25 signs
Average language: approximately 40 sounds
Phonetic alphabet: 1:1 mapping between character and sound
Speech and sounds
[Figure: waveform and spectrogram of “How are you”]
Speech is made up of non-discrete events
Representation of the speech signal
– Sounds are coded as successions of states (one state every 10-30 ms)
– States are represented by acoustic vectors
[Figure: two spectrogram-style plots, frequency against time]
Acoustic models
– Inventory of elementary probabilistic models of basic linguistic units, e.g. phonemes
– Words stored as networks of elementary models
[Figure: states with associated probability density functions (pdf)]
Training of acoustic models
– Compute acoustic vectors and transition probabilities from large corpora
– Each state holds statistics concerning parameter values and parameter variation
– The larger the amount of training data, the better the estimates of parameter values and variation
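As a toy illustration of the statistics a state holds, here is a sketch computing the mean and variance of a single acoustic parameter over the frames assigned to one state (the frame values are invented; a real system stores e.g. Gaussian mixture parameters over whole acoustic vectors):

```python
import statistics

# Invented parameter values for frames assigned to one state during training.
frames_for_state = [2.1, 2.4, 1.9, 2.2, 2.0]

# The state's model: central value and spread of the parameter.
state_model = {
    "mean": statistics.mean(frames_for_state),
    "variance": statistics.variance(frames_for_state),  # sample variance
}
```

With more training frames, these estimates become more reliable, which is exactly the point made above.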
Language model
1. Defined by grammar
– Grammar: rules for combining words into sentences (defining the admissible strings in the language); the basic unit of analysis is the utterance/sentence
– A sentence is composed of words representing word classes, e.g.
  determiner: the
  noun: boy
  verb: eat
noun: boy; verb: eat; determiner: the
rule 1: noun_phrase → det n
rule 2: sentence → noun_phrase verb
Morphology: base forms vs derived forms
– eat: stem, 1st person singular
– stem + s: 3rd person singular
– stem + en: past participle
– stem + er: substantive (noun)
These rules admit “the boy eats” but exclude:
*the eats
*boy eats
*eats the boy
2. Statistical language model
– Probabilities for words and transition probabilities for word sequences in a corpus:
  unigram: probability of individual words
  bigram: probability of a word given the preceding word
  trigram: probability of a word given the two preceding words
– Training materials: language corpora (journal articles; application-specific)
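These n-gram probabilities can be sketched with a toy corpus (the corpus is invented for illustration; a real language model is trained on the large corpora mentioned above and needs smoothing for unseen sequences):

```python
from collections import Counter

# Tiny stand-in corpus; real models are trained on large text corpora.
corpus = "the boy eats the apple the boy sleeps".split()

unigrams = Counter(corpus)                  # counts of single words
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs

def p_unigram(w):
    # unigram: probability of an individual word
    return unigrams[w] / len(corpus)

def p_bigram(w, prev):
    # bigram: probability of a word given the preceding word
    return bigrams[(prev, w)] / unigrams[prev]
```

Here p_bigram("boy", "the") is 2/3, because two of the three occurrences of “the” are followed by “boy”.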
Recognition / classification
[Diagram: speech input → acoustic analysis → feature vectors x1...xT → global search, which maximizes P(x1...xT | w1...wk) · P(w1...wk) over word sequences w1...wk → recognized word sequence. The acoustic model P(x1...xT | w1...wk) draws on the phoneme inventory and the pronunciation lexicon; P(w1...wk) comes from the language model.]
Compute the probability of a sequence of states, given the probabilities of the states, the probabilities of transitions between states, and the language model. This gives the best path. Usually not just the best path but an n-best list is produced for further processing.
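The best-path computation can be sketched with a minimal Viterbi implementation (all states, observations and probabilities below are invented toy values, not from a real recognizer; a real system works in log probabilities and keeps an n-best list rather than one path):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best predecessor for state s at this time step
            prob, prev = max(
                (V[-2][p] * trans_p[p][s] * emit_p[s][o], p) for p in states
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(V[-1], key=V[-1].get)
    return path[best], V[-1][best]

# Toy two-state example (numbers are illustrative only).
states = ("speech", "silence")
start_p = {"speech": 0.6, "silence": 0.4}
trans_p = {"speech": {"speech": 0.7, "silence": 0.3},
           "silence": {"speech": 0.4, "silence": 0.6}}
emit_p = {"speech": {"loud": 0.9, "quiet": 0.1},
          "silence": {"loud": 0.2, "quiet": 0.8}}
best_path, best_prob = viterbi(["loud", "quiet"], states,
                               start_p, trans_p, emit_p)
```

For the observations ["loud", "quiet"] the best path here is ["speech", "silence"].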
Caveats
– The properties of the acoustic models are strongly determined by the recording conditions: recognition performance depends on the match between recording conditions and run-time conditions
– Use of a language model induces a word bias: for words outside the vocabulary, the best-matching in-vocabulary word is selected. Solution: use a garbage model
Advances
– Confidence measures for recognition results, based on acoustic similarity, on actual confusions in a database, or on the acoustic properties of the input signal
– Dynamic (state-dependent) loading of the language model
– Parallel recognizers, e.g. In-Vehicle Information Systems (IVIS): separate recognizers for the navigation system, entertainment systems, mobile phone, and general purpose; choice on the basis of confidence scores
Further developments
– Parallel recognizer for hyper-articulate speech
State of the art performance
– 98 - 99.8 % correct for small-vocabulary, speaker-independent recognition
– 92 - 98 % correct for speaker-dependent, large-vocabulary recognition
– 50 - 70 % correct for speaker-independent, mid-size vocabulary recognition
Recognition of prosody
– Observable manifestations: pitch, temporal properties, silence
– Function: emphasis, phrasing (e.g. through pauses), sentence type (question/statement), emotion, etc.
– Relevant to understanding/interpretation, e.g.:
  Mary knows many languages you know
  Mary knows many languages, you know
– Influence on the realisation of phonemes: prosody used to be considered noise, but it contains relevant information
contents
Speech input technology
– Speech recognition
– Language understanding
– Consequences for design
Speech output technology
Consequences for design
Project
Natural language processing
Full parse or keyword spotting (concept spotting)
Keyword spotting: <any> keyword <any>
e.g. <any> $DEPARTURE <any> $DESTINATION <any>
can handle:
– Boston New York
– I want to go from Boston to New York
– I want a flight leaving at Boston and arriving at New York
Semantics (mapping onto functionality) can be specified in the grammar
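The $DEPARTURE / $DESTINATION pattern above can be sketched as a regular expression (the city list and slot names are illustrative stand-ins; a real system would use a grammar covering the full city vocabulary):

```python
import re

# Toy city vocabulary; a real grammar would enumerate many more names.
CITIES = r"(Boston|New York)"

# <any> $DEPARTURE <any> $DESTINATION <any>, with <any> as lazy ".*?"
PATTERN = re.compile(r".*?\b" + CITIES + r"\b.*?\b" + CITIES + r"\b.*")

def spot(utterance):
    # Map the utterance onto departure/destination slots, or None.
    m = PATTERN.match(utterance)
    if m:
        return {"departure": m.group(1), "destination": m.group(2)}
    return None
```

This accepts all three example inputs above, from the terse “Boston New York” to the full sentence, because the filler material is simply skipped.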
contents
Speech input technology
– Speech recognition
– Language understanding
– Consequences for design
Speech output technology
Consequences for design
Project
Coping with technological shortcomings of ASR
Shortcomings:
– Reliability/robustness
– Architectural complexity of an “always open” system
– Lack of transparency in case of input limitations
Task for the design of speech interfaces: induce the user to modify their behaviour to fit the requirements (restrictions) of the technology.
Solutions
“Always open” ideal:
– Push-to-talk button: recognition window; “spoke-too-soon” problem
– Barge-in (requires echo cancellation, which may be complicated depending on the reverberation properties of the environment)
Make the training conditions (properties of the training corpus) similar to the test conditions, e.g. special corpora for the car environment.
Good prompt design gives clues about the required input:

Response                         “Will you accept    “Say yes if you accept the
                                 the call?”          call, otherwise say no”
Isolated yes or no               54.5 %              80.8 %
Multiword yes or no              24.2 %              5.7 %
Other affirmative or negative    10.7 %              3.4 %
Inappropriate                    10.4 %              10.2 %
contents
Speech input technology
Consequences for design
Speech output technology
– Technology
– Human factors in speech understanding
– Consequences for design
Project
Components of conversational interfaces
Speech recognition
Natural Language Analysis
Dialogue Manager
Speech Synthesis
Language Generation
Application
demos
http://www.ims.uni-stuttgart.de/~moehler/synthspeech/examples.html
http://www.research.att.com/~ttsweb/tts/demo.php
http://www.acapela-group.com/text-to-speech-interactive-demo.html
http://cslu.cse.ogi.edu/tts/
Audiovisual speech synthesis:
http://www.speech.kth.se/multimodal/
http://mambo.ucsc.edu/demos.html
Emotional synthesis (Janet Cahn):
http://xenia.media.mit.edu/%7Ecahn/emot-speech.html
Applications
Information access by phone
– news / weather, timetables (OVR), reverse directory, name dialling, spoken e-mail, etc.
Customer ordering by phone (call centers)
– IVR: ASR replaces tedious touch-tone actions
Car driver information by voice
– navigation, car traffic info (RDS/TMC), Command & Control (VODIS)
Interfaces for the disabled
– MIT/DECTalk (Stephen Hawking)
In the office and at home (near future?)
– Command & Control, navigation for home entertainment
Output technology
[Diagram: Dialogue Manager → Language Generation → Speech Synthesis, drawing content from an application (e.g. e-mail or an information service)]
Language generation
Example data: train connections Eindhoven – Amsterdam CS

Departure time    08:32   08:47   09:02   09:17   09:32
Arrival time      09:52   10:10   10:22   10:40   10:52
Transfers         0       1       0       1       0
If $nr_of_records >1
I have found $n connections:
The first connection leaves at $time_dep from $departure and arrives at $time_arr at $destination
The second connection leaves at $time_dep from $departure and arrives at $time_arr at $destination
If the user also wants information about whether there are transfers, either other templates have to be used, or templates might be composed from template elements
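Composing output from template elements, as suggested above, might look like this sketch (the template wording and field names are illustrative, mirroring the slide's $variables; the transfer clause is an optional element added only when requested and applicable):

```python
# Base template plus an optional transfer element.
BASE = ("The {ordinal} connection leaves at {time_dep} from {departure} "
        "and arrives at {time_arr} at {destination}")
TRANSFERS = ", with {transfers} transfer(s)"

def describe(record, ordinal, mention_transfers=False):
    # Fill the base template; unused record fields are simply ignored.
    text = BASE.format(ordinal=ordinal, **record)
    # Append the transfer element only when asked for and applicable.
    if mention_transfers and record.get("transfers", 0) > 0:
        text += TRANSFERS.format(**record)
    return text + "."

record = {"time_dep": "08:47", "departure": "Eindhoven",
          "time_arr": "10:10", "destination": "Amsterdam CS",
          "transfers": 1}
sentence = describe(record, "second", mention_transfers=True)
```

Each optional piece of information thus becomes its own template element, instead of requiring a separate full template per combination.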
Speech output technologies
Canned (pre-recorded) speech
– Suited for call centers, IVR: fixed messages/announcements
Concatenation of pre-recorded phrases
– Suited for bank account information, database enquiry systems with structured data, and the like
– Template-based, e.g. “your account is <$account>”, “the flight from <$departure> to <$destination> leaves at <$date> at <$time> from <$gate>”, “the number of <$customer> is <$telephone_number>”
– Requirements: a database of phrases to be concatenated
– Some knowledge of speech science required:
  – words are pronounced differently depending on emphasis, position in the utterance, and type of utterance
  – the differences concern both pitch and temporal properties (prosody)
Compare the different realisations of “Amsterdam” in:
– Do you want to go to Amsterdam? (emphasis, question, utterance-final)
– I want to go to Amsterdam (emphasis, statement, utterance-final)
– Are there two stations in Amsterdam? (no emphasis, question, utterance-final)
– There are two stations in Amsterdam (no emphasis, statement, utterance-final)
– Do you want to go to Amsterdam Central Station? (no emphasis, statement, utterance-medial)
Solution:
– have the words pronounced in context to obtain different tokens
– apply clever splicing techniques for smooth concatenation
Text-to-speech conversion (TTS)
– Suited for unrestricted text input: all kinds of text
  – reading e-mail, fax (in combination with optical character recognition)
  – information retrieval for unstructured data (preferably in combination with automatic summarisation)
– Utterances are made up by concatenation of small units with post-processing for prosody, or by concatenation of variable-size units
TtS technology
Distinction between
– linguistic pre-processing and
– synthesis
Linguistic pre-processing:
– Grapheme-phoneme conversion: mapping written text onto a phonemic representation, including word stress
– Prosodic structure (emphasis, boundaries including pauses)
TtS: linguistic pre-processing: grapheme-phoneme conversion
To determine how a word is pronounced:
– consult a lexicon, containing a phoneme transcription, syllable boundaries, and word accent(s)
– and/or develop pronunciation rules
Output (Dutch place names):
  Enschede → . ‘En-sx@-de .
  Kerkrade → . ‘kErk-ra-d@ .
  ‘s-Hertogenbosch → . sEr-to-x@n-‘bOs .
Pros and cons of a lexicon
– phoneme transcriptions are accurate
– (high) risk of out-of-vocabulary words, because the lexicon:
  – often contains only stems, no inflections or compounds
  – is never up to date / complete
– but usually the application includes a user lexicon
Pros and cons of pronunciation rules
– no out-of-vocabulary words
– transcription results are often wrong for:
  – (longer) combinations of words / morphemes
  – exceptions and loan words from other languages
The best solution is a combination of the two methods:
– develop a list of words incorrectly transcribed by the rules and put these words in an exception lexicon
– words not in the exception list are then transcribed by rule
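A minimal sketch of this combined method (the exception entry mirrors the earlier Enschede transcription in ASCII form; the letter-to-sound “rules” are a naive one-character-one-symbol stand-in, not a real grapheme-to-phoneme component):

```python
# Words the rules would transcribe incorrectly go into the exception lexicon.
EXCEPTIONS = {"enschede": "'En-sx@-de"}

# Toy letter-to-sound rules: one character maps to one phoneme symbol.
RULES = {"a": "A", "b": "b", "d": "d", "e": "e", "n": "n", "s": "s"}

def transcribe(word):
    w = word.lower()
    if w in EXCEPTIONS:
        # exception lexicon takes precedence
        return EXCEPTIONS[w]
    # otherwise transcribe by rule, character by character
    return "".join(RULES.get(ch, ch) for ch in w)
```

The lookup order implements exactly the combination described above: exceptions first, rules for everything else.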
Complications
– Words with the same written form but different pronunciations and meanings: ‘record vs re’cord; requires parsing or a statistical approach
– Proper names and other specialized vocabularies, acronyms/abbreviations (small announcements in journals!): need to be included in a (user) lexicon
– Different kinds of numbers (telephone numbers, amounts, credit card numbers, etc.): require number grammars
TtS: linguistic pre-processing: prosody
– Emphasis, boundaries (including pauses), sentence type
– Observable manifestations: pitch, temporal properties, silence
– Requires analysis of linguistic structure (parsing) and (ideally) discourse-level information (cf. the earlier “Amsterdam” example)
TtS: synthesis
Concatenation from whole words and phrases is practically impossible:
– the database becomes too large (especially if several versions of each word are needed), and
– there is no full coverage (out-of-vocabulary words)
Approaches:
– sub-word units
– data-oriented approach
Synthesis by subword units
Common approach: diphone synthesis
– linking together pre-recorded diphones, i.e. short segments (transitions between two successive phonemes) extracted from natural speech:
  ‘s-Hertogenbosch
  phonemes: . s E r t o x @ n b O s .
  diphones: .s sE Er rt to ox x@ @n nb bO Os s.
– In all, 1600 transitions per language (40 × 40)
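The diphone decomposition of the ‘s-Hertogenbosch example can be reproduced directly (using "." for the silence at the utterance edges, as on the slide):

```python
def to_diphones(phonemes):
    # Each diphone spans the transition between two successive phonemes.
    return [a + b for a, b in zip(phonemes, phonemes[1:])]

phonemes = [".", "s", "E", "r", "t", "o", "x", "@", "n", "b", "O", "s", "."]
diphones = to_diphones(phonemes)
```

A sequence of n phonemes always yields n−1 diphones, which is why roughly 40 × 40 = 1600 recorded transitions cover a whole language.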
Synthesis:
– concatenate the diphones in the correct order
– perform some (intensity) smoothing at the diphone borders
– adjust phoneme duration and pitch course according to prosody rules
Data-oriented approach
– Generalization of the diphone approach
– Store a large database of speech (running text)
– At run-time:
  – generate a structure representing the phoneme sequence and the prosodic properties needed
  – search algorithm: find the largest possible fragments in the database containing the required properties
Example: frei-burg and nürn-berg together yield frei-berg; /fr/ also serves for fr-iedrichshafen: items in the database are re-usable.
Concatenate the fragments as they are, without post-processing for pitch and duration; in this way, not only the phoneme parameters and transitions are taken from the data, but also the pitch and temporal properties.
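The “largest possible fragments” search can be sketched as a greedy longest-match over a toy database (a real unit-selection system also scores prosodic match and join costs; the “recordings” here are plain strings standing in for labelled speech):

```python
# Toy database of "recorded" items, as in the Freiburg/Nürnberg example.
DATABASE = ["freiburg", "nürnberg"]

def select_units(target):
    # Greedily cover the target with the longest fragments found in the DB.
    units = []
    i = 0
    while i < len(target):
        best = target[i]  # fall back to a single symbol if nothing matches
        for j in range(len(target), i, -1):
            frag = target[i:j]
            if any(frag in rec for rec in DATABASE):
                best = frag
                break
        units.append(best)
        i += len(best)
    return units
```

For the target "freiberg" this selects "freib" (from freiburg) plus "erg" (from nürnberg), illustrating how database items are re-used.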
Advantage: natural speech quality is preserved (though this may not always be desirable: perhaps it should be made clear to people that they are talking to a system)
Disadvantage: no explicit control over voice characteristics and prosodic characteristics such as pitch and speaking rate (which you might want to manipulate to synthesize emotional speech or convey a certain personality)
– Difficult or impossible to modify speaker characteristics
– Other speaker: new database required
– Other speaking style: new database required
– Research: post-processing of the result with preservation of speech quality
Hybrid synthesis
– Combination of phrase concatenation and TTS
– Suited for template-based synthesis with a fixed message structure and variable slots:
  “the flight from <$departure> to <$destination> leaves at <$date> at <$time> from <$gate>”
– In dialogue systems the system has knowledge of the message structure and can select the proper tokens from the database on the basis of this knowledge
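Token selection based on message-structure knowledge can be sketched as a simple lookup (the token filenames and context labels are invented for illustration; a real database would index recorded tokens by word and prosodic context):

```python
# Recorded tokens of the same word in different prosodic contexts.
TOKENS = {
    ("Amsterdam", "utterance-final"): "amsterdam_final.wav",
    ("Amsterdam", "utterance-medial"): "amsterdam_medial.wav",
}

def select_token(word, position):
    # The dialogue system knows the slot's position in the message
    # structure, so it can pick the matching recorded token.
    return TOKENS[(word, position)]
```

Because the message template fixes each slot's position, the system knows at generation time which prosodic variant to splice in.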
Future: markup languages
– structured text
– current TTS systems strip text annotations (plain-ASCII standard)
– draft proposal for an XML markup for synthesis; SALT
Contents
Speech input technology
Consequences for design
Speech output technology
– Technology
– Human factors in speech understanding
– Consequences for design
Project
Issues in comprehension
Speech quality:
– reduced quality slows down feature extraction and the mapping of the input onto feature vectors: it increases the number of matching vectors
– this requires compensation by top-down processing, which takes more time, effort and practice
“Text-to-speech”: written text is often difficult to understand when read aloud, due to complex structures, high information density, etc.
Application to synthetic speech
– the substandard quality of synthetic speech requires compensation by (resource-limited) top-down processing
– potential overload of the system due to time constraints
– slowing down the speaking rate is a very effective way to give the listener more processing time
Case study: picking up information from speech
Study on auditory exploration of lists (Pitt & Edwards, 1997): recall of a list of 48 file names
– presented in groups
– group size varied (2, 3 or 4)
– presentation of groups listener-paced
– recall immediately after each group
Number of filenames per group    % correct recall
2                                76.7
3                                49.7
4                                20.5
Adjustments: analysis of “list” speaking style
– grouping principles:
  • always try to group
  • grouping by filenames and extensions
  • large groups first
  • mnemonic links between groups
– prosodic structuring
Evaluation: a directory with four subdirectories, each containing files with four different names, corresponding to the modules of a programming project, and with three different extensions
Task: find the most recent version of the four files containing the source code for the modules and copy them into a new directory
Measures: objective (task completion) and subjective
Results for task completion: new algorithm 10.39 min, old version 24.12 min
Contents
Speech input technology
Consequences for design
Speech output technology
– Technology
– Human factors in speech understanding
– Consequences for design
Project
Design implications
– The choice of technology can be made dependent on the needs of the application
– For restricted domains, very high quality can be achieved through canned speech or phrase concatenation with multiple tokens
– For concatenation with unit selection there is a relation between quality and the size of the database; for good systems there is usually no problem with intelligibility, even for inexperienced listeners
– High quality for diphone speech, needed for uncommon forms such as proper names or company names that are unlikely to be available from a corpus, still requires much effort
– Importance of learning effects
– A general finding is that acceptance of synthetic speech depends strongly on voice quality
– If the trade-off between quality and added value is negative, the prospects for acceptance of the speech interface are poor
Speech as output modality: speech vs text/graphics
Text/graphics:
– an image [may be] worth a thousand words
– an image/written text is persistent
– an image is (at least) two-dimensional: temporal and spatial organisation
– visual expression of hierarchical structure
– receiver-paced
– but non-adaptive (until recently); now: adaptive hypertext
Speech:
– one-dimensional: extends only in time; spatial issues are better dealt with in another modality
– sender-paced
– a poor medium, yet popular: a large amount of speech-based communication serves primarily a social function
– no need for supporting aids such as paper and pen
– no special motoric abilities needed
– speaking is fast
Heuristics: speech output is preferred when
– the message is simple
– the message is short
– the message need not be referred to later
– the message deals with events in time
– the message requires an immediate response
– the visual channels are overloaded
– the environment is brightly lit, poorly lit, subject to severe vibration, or otherwise adverse to the transmission of visual information
– the user must be free to move around
– pronunciation is the subject of interaction
(from Michaelis & Wiggins)
But speech output is preferably not used when
– the message is complex or uses unfamiliar terms
– the message is long
– the message needs to be referred to later
– the message deals with spatial topics
– the message is not urgent
– the auditory channels are overloaded
– the environment is too noisy
– the user has easy access to a screen
– the system output consists of many different kinds of information which must be available simultaneously and be monitored and acted upon by the user
(from Michaelis & Wiggins)
Environmental variation and “mixed” interaction call for multimodal interfaces.
contents
Speech input technology
Consequences for design
Speech output technology
Main points
Project
Main points
– The database approach, requiring large databases for individual languages and speaking styles, is dominant for both speech input and output
  – Input: databases for training acoustic models and the language model
  – Output: concatenation of segments and phrases taken from a database
– Large differences in recognition performance and output quality across languages and target groups (e.g. recognition for children)
– Speech input: three major classes of applications: command & control, information services, dictation systems
– Major parameters: speaker-dependent vs speaker-independent, vocabulary size (small, medium, large), rigid vs free-format input
Dialogue management
– Finite-state or frame-based approach for task-oriented dialogue acts
– Verification strategies and repair mechanisms for dialogue control
Pragmatic approaches to language understanding and language generation:
– Input: directly mapped onto application functionality
– Output: template-based approaches
Not covered: speech monitoring, speech data mining applications and technology
Exercises with the CSLU toolkit and other demonstrators
– Try out your name, telephone numbers, dates, e-mail addresses, abbreviations, etc.
Project
– Protocol development:
  • Dialogue structure
  • Strategies and prompts
– Tomorrow:
  • Wizard of Oz test