0910 spktek talteknologi jg.ppt · dialogsystem: joakim gustafson tal, musik och hörsel 5 j....
TRANSCRIPT
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
1
J. Gustafson, CTT, KTH http://www.speech.kth.se
Joakim Gustafson
Centrum för Talteknologi, KTHTal, musik och hörsel, KTH
DialogsystemDialogsystemDialogsystemDialogsystem
J. Gustafson, CTT, KTH 2
AgendaAgendaAgendaAgenda
• Introduction to speaker• Spoken Dialogue Systems• Research SDSs• Commercial SDSs• SDS Components• An example SDS
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
2
J. Gustafson, CTT, KTH 3
IntroductionIntroductionIntroductionIntroduction
J. Gustafson, CTT, KTH 4
Joakim GustafsonJoakim GustafsonJoakim GustafsonJoakim Gustafson
1987-1992 Electrical Engineering program at KTH
1992-1993 Linguistics at Stockholm University
1993-2000 PhD studies at CTT/KTH
2000-2007 Senior researcher at Telia R&D department
2007– Assistant Professor CTT/KTH
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
3
J. Gustafson, CTT, KTH 5
My research field:My research field:My research field:My research field:spoken dialogue systemsspoken dialogue systemsspoken dialogue systemsspoken dialogue systems
J. Gustafson, CTT, KTH 6
Multi Multi Multi Multi ----disciplinary fielddisciplinary fielddisciplinary fielddisciplinary field
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
4
J. Gustafson, CTT, KTH
Research areas• Speech production
• Speech perception
• Communication aids
• Multimodal speech synthesis
• Speech recognition
• Speaker verification
• Spoken conversational speech
• Interactive dialogue systems
CTT CTT CTT CTT ---- Centrum Centrum Centrum Centrum förförförför talteknologitalteknologitalteknologitalteknologi
J. Gustafson, CTT, KTH
The KTH speech groupThe KTH speech groupThe KTH speech groupThe KTH speech group
---- Early daysEarly daysEarly daysEarly days
Gunnar Fant and OVE I 1953
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
5
J. Gustafson, CTT, KTH
Namn Nr Årskurs Period Poäng
Talteknologi(Speech Technology) 2F1111 D3-4, E4, F4 3 4
Talteknologi, utökad kurs(Speech Technology, extended course) 2F1112 Media 3-4 3 5
Spektrala transformer för Media 2F1120 Media 3-4 2 4
Musikakustik(Music Acoustics) 2F1212 D4, E4, F4 4 4
Musical communication & music technology(Musikalisk kommunikation och musikteknologi)
2F1213 Media 3-4,E4, D4 4 5
Elektroakustik(Electroacoustics) 2F1400 E4-5, M4-5 1 4
Audioteknik(Audio technology) 2F1410 Media 3-4,
E4, D4 2 5
Orkesterspelets teori(Theory of orchestral performance) 2F1601 alla 6
Orkesterspelets praktik(Orchestral performance) 2F1602 alla 6
Elektroprojekt 2U1700 E1 3-4 5
Medieteknik grundkurs (avsnitt Ljud) 2D1574 Media 2 1-2 12
J. Gustafson, CTT, KTH 10
Spoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue Systems
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
6
J. Gustafson, CTT, KTH 11
Speech application DomainsSpeech application DomainsSpeech application DomainsSpeech application Domains
• Information retrieval systems– e.g. train time table information or directory
inquiries
• Ordering– ticket booking, buying music
• Command control systems– home control (“turn the radio off”) or voice
command shortcuts (“save”)
• Dictation
J. Gustafson, CTT, KTH 12
Application Domains (cont)Application Domains (cont)Application Domains (cont)Application Domains (cont)
• Games and entertainment– Games can take advantage of the more social aspects of
spoken dialogue
• Co-ordinated collaboration– The task of controlling or over-viewing complex situations
requires flexibility and efficiency.
• Expert systems– Diagnose and help systems that need to reason about facts
and goals and may benefit from natural dialogue
• Learning and training – Naturalness, flexibility, and robustness are attractive features
in training environments
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
7
J. Gustafson, CTT, KTH 13
The spoken dialogueThe spoken dialogueThe spoken dialogueThe spoken dialoguesystem vision system vision system vision system vision
An interface that allows speakers to interact with a computer using spontaneous, unconstrained speech
J. Gustafson, CTT, KTH 14
Talking machines as we Talking machines as we Talking machines as we Talking machines as we know them from movies…know them from movies…know them from movies…know them from movies…
• Star Trek
• HAL 9000 (2001)
• C3PO (Star Wars)
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
8
J. Gustafson, CTT, KTH 15
“Välkommen till
Telia…Här beskriver du
ditt ärende med egna ord
istället för att knappa dig
fram. Om du säger vad du
vill ha hjälp med så kan
jag koppla fram dig. Vad
gäller ditt ärende?”
Hallå? 80 kategorier
Support Agent
Billing info
Mobile Refill
Broadband order
Broadband availability
Mobile info agent
“Ja hallå, vad
gäller ditt ärende?
Det gäller en faktura
Ok ett faktura ärende
[PAUS] Gäller det fast
telefoni,mobil telefoni
bredband eller uppringt
internet?
Bredband
...and since a couple of ...and since a couple of ...and since a couple of ...and since a couple of years from real systemsyears from real systemsyears from real systemsyears from real systems
(an example from 90200)
J. Gustafson, CTT, KTH
How do we want speaking How do we want speaking How do we want speaking How do we want speaking machines to behave?machines to behave?machines to behave?machines to behave?
As us...
...but to a limit?
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
9
J. Gustafson, CTT, KTH 17
Interface metaphorsInterface metaphorsInterface metaphorsInterface metaphors
• Exploit specific knowledge that users already have from other domains
• Give the user instantaneous knowledge about how to interact
• Examples include – The desktop – the computer is a workplace
– File system – the computer is a filing cabinet
– Games – the computer is a toy
J. Gustafson, CTT, KTH 18
Metaphors for Metaphors for Metaphors for Metaphors for speech interfacesspeech interfacesspeech interfacesspeech interfaces
• The tool/interface metaphor–Speech as an alternative interface technology in GUIs (multimodality)
• The servant/human metaphor– The computer as an entity with human-like conversational abilities
One is not necessarily better than the other!
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
10
J. Gustafson, CTT, KTH 19
Applications of machines with Applications of machines with Applications of machines with Applications of machines with human metaphor interactionhuman metaphor interactionhuman metaphor interactionhuman metaphor interaction
• Research tool• Cassel: ‘‘a machine that acts human enough that we respond to it as we respond to another human”
• Second LanguageLearning
• Social Robots
• Computer Games
J. Gustafson, CTT, KTH 20
What would make a What would make a What would make a What would make a speech interface intuitive?speech interface intuitive?speech interface intuitive?speech interface intuitive?
• The user should not have to learn: – When to speak
– How to speak
– What to say
• The system should support the user by:– Clearly showing when it will speak
– Saying things that reflects what it understands
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
11
J. Gustafson, CTT, KTH 21
When to speakWhen to speakWhen to speakWhen to speak
• The computer understands when the user intends to stop or continue speaking
• The computer indicates when it intends to stop or continue speaking
J. Gustafson, CTT, KTH 22
Controlling the dialogue flowControlling the dialogue flowControlling the dialogue flowControlling the dialogue flow
• Conversation has two channels– Exchange of information
– Control of the exchange of information
• Control– Turn-taking (who speaks when)
– Feedback (attention, attitude…)
Dialogue systems needs both controls
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
12
J. Gustafson, CTT, KTH
Using superficial methods to Using superficial methods to Using superficial methods to Using superficial methods to handle dialogue flowhandle dialogue flowhandle dialogue flowhandle dialogue flow
• Talk after the beep
• Talk-now icon/cursor
• Push-to-talk button
J. Gustafson, CTT, KTH 24
• Facial gestures and head movements
• Prosody and extra-linguistic sounds
Using humanUsing humanUsing humanUsing human----like methods to like methods to like methods to like methods to handle dialogue flowhandle dialogue flowhandle dialogue flowhandle dialogue flow
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
13
J. Gustafson, CTT, KTH 25
The MonAMI ReminderThe MonAMI ReminderThe MonAMI ReminderThe MonAMI Reminder
Google Calendar
ECA
SMS notifications
Web GUI
15.00 Meeting with Sara at Wayne’s Coffee
Monday 13
Handwriting
J. Gustafson, CTT, KTH 26
Attention and interaction controllerAttention and interaction controllerAttention and interaction controllerAttention and interaction controller
NonAttentive
Attentive Speaking
Listening
UserStartLook
SystemInitiative
SystemResponse
SystemStopSpeak
UserStopLook
Holding
Timeout
Interrupted
UserStartSpeak SystemIgnore (resume)
SystemResponse (restart)SystemIgnore
Calling
PausingSpeakingAttending
Not attending
Attending
Speaking
Not attending
SystemInitiative
UserStartLook
SystemResponse
UserStartSpeak
UserStopLookUserStartLook (resume)
SystemStopSpeak
System
User
SystemAbortSpeak
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
14
J. Gustafson, CTT, KTH 27
Facial animationFacial animationFacial animationFacial animation
NonAttentive Attentive SystemIgnoreListening
J. Gustafson, CTT, KTH 28
Experiments in a tutoring settingExperiments in a tutoring settingExperiments in a tutoring settingExperiments in a tutoring setting
Tutoring setting: Tutor explaining the system to a user
Problem• Is the user talking to the tutor or the system?
• Compare push-to-talk (PTT) with look-to-talk (LTT)
Subjects • 4 elderly persons from the target group
• 4 collegues
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
15
J. Gustafson, CTT, KTH 29
Example Example Example Example
J. Gustafson, CTT, KTH 30
Research SpokenResearch SpokenResearch SpokenResearch SpokenDialogue SystemsDialogue SystemsDialogue SystemsDialogue Systems
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
16
J. Gustafson, CTT, KTH 31
Classic research dialogue systemsClassic research dialogue systemsClassic research dialogue systemsClassic research dialogue systems
• Historic research systems– Voyager (1989)– ATIS (1992)– SUNDIAL (1993)– TRAINS (1996)
• Application– Philips Train Information (1995)
• Large Efforts– Communicator – Verbmobil– SmartKom
J. Gustafson, CTT, KTH 32
The Nordic SceneThe Nordic SceneThe Nordic SceneThe Nordic Scene
• Stockholm, Sweden– KTH (Waxholm, August, AdApt, Higgins)– Telia (Resebokning, NICE)
• Linköping, Sweden– IDA (LINLIN, Ornitology, Movie Finder)
• Göteborg, Sweden– Lingvistiken (TRINDI, GODIS, TALK)– Chalmers (State-chart xml)
• Aalborg, Odense Denmark• Helsinki, Tampere Finland• Trondheim, Norway
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
17
J. Gustafson, CTT, KTH 33
The Waxholm Project (92The Waxholm Project (92The Waxholm Project (92The Waxholm Project (92----95)95)95)95)
• Tourist information– Stockholm archipelago
– time-tables, hotels, hostels, camping and dining possibilities.
• Mixed initiative dialogue– speech recognition
– multimodal synthesis
• Graphic information– pictures, maps, charts and
time-tables.
J. Gustafson, CTT, KTH 34
Waxholm SystemWaxholm SystemWaxholm SystemWaxholm System
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
18
J. Gustafson, CTT, KTH
The Waxholm systemThe Waxholm systemThe Waxholm systemThe Waxholm system
J. Gustafson, CTT, KTH 36
The August kiosk at The August kiosk at The August kiosk at The August kiosk at KulturhusetKulturhusetKulturhusetKulturhuset 98 98 98 98
• Swedish spoken dialogue system for public use • Animated agent named after August Strindberg• Speech technology components developed at KTH• Designed to be easily expanded and reconfigured • Used to collect spontaneous speech data
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
19
J. Gustafson, CTT, KTH
The August system
J. Gustafson, CTT, KTH 38
The HIGGINS system (03The HIGGINS system (03The HIGGINS system (03The HIGGINS system (03----))))
• The primary domain of HIGGINS is city navigation for pedestrians.
• Secondarily, HIGGINS is intended to provide simple information about the immediate surroundings.
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
20
J. Gustafson, CTT, KTH 39
HigginsHigginsHigginsHiggins
J. Gustafson, CTT, KTH 40
The Commercial Speech The Commercial Speech The Commercial Speech The Commercial Speech ApplicationsApplicationsApplicationsApplications
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
21
J. Gustafson, CTT, KTH 41
Access Convergence & SpeechAccess Convergence & SpeechAccess Convergence & SpeechAccess Convergence & Speech
J. Gustafson, CTT, KTH 42
Commercial ApplicationsCommercial ApplicationsCommercial ApplicationsCommercial Applicationsof Speech technologyof Speech technologyof Speech technologyof Speech technology
• Dictation– Medical
– Legal
– Windows Vista (95-99% without training?)
• Customer care center automation– Call Routing
– Technical support
• Multimodal mobile applications– Information search
– Content download
– Entertainment
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
22
J. Gustafson, CTT, KTH 43
Problem solvingProblem solvingProblem solvingProblem solving
• SpeechCycle– Broadband support Agent
– Video support Agent
– IP Telephony Agent
J. Gustafson, CTT, KTH 44
Some nice dialogue features of Some nice dialogue features of Some nice dialogue features of Some nice dialogue features of Speechcycle’s Agents Speechcycle’s Agents Speechcycle’s Agents Speechcycle’s Agents
• Open prompts with free speech ASR
• References to user’s previous attempts to fix problem
• Answer to free prompts
• Error identification
• Memory of earlier turns
• Encouraging the user to stay
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
23
J. Gustafson, CTT, KTH 45
SDS ComponentsSDS ComponentsSDS ComponentsSDS Components
J. Gustafson, CTT, KTH 46
Spoken dialogue processingSpoken dialogue processingSpoken dialogue processingSpoken dialogue processing
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
24
J. Gustafson, CTT, KTH 47
The Components of The Components of The Components of The Components of Spoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue Systems
• Automatic Speech Recognition (ASR)• Natural Language Understanding (NLU)• Dialogue Management (DM)• Natural Language Generation (NLG)• Text-To-Speech synthesis (TTS)• Multimodal modules
J. Gustafson, CTT, KTH 48
Speech RecognizerSpeech RecognizerSpeech RecognizerSpeech Recognizer
Purpose: convert speech to text
Input: digital speech
Output: String(s) or lattice that represent the most probable utterance(s)
Considerations: Vocabulary size, Grammar and Speech type
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
25
J. Gustafson, CTT, KTH 49
Speech Recognition Speech Recognition Speech Recognition Speech Recognition
• Identify the speech sounds– Measure the spectrum of the speech signal 100
times per second
• Match the recognized sounds against a pronunciation dictionary– Only the allowed words are active
• Construct a sentence of the most probable words – Taking the probability of the context into account
J. Gustafson, CTT, KTH 50
Statistically based recognitionStatistically based recognitionStatistically based recognitionStatistically based recognition
Search
Language modelPossible word sequences
N-best
Sentence understanding
N-bestN-bestN-best
Phoneme modelsAcoustic deskr.
Lexiconvocabulary + transcription
Chosen sentenceKnowledge
Analysis
Comparison
ParameterisationFFT16 kHz 100 Hz
Spectral analysis
Classification
ContinuousDiscrete
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
26
J. Gustafson, CTT, KTH
What makes speechWhat makes speechWhat makes speechWhat makes speechrecognition difficult? recognition difficult? recognition difficult? recognition difficult?
Speaker Channel Listener
• Between speakers– Age
– Gender
– Anatomy
– Dialect
• Within speaker– Stress
– Mood
– Health
– Formel / Spontanteous
• Environment– Additive noise
– Room
• Microphone, Telephone– Bandwidth
– Noise
• Listener– Age
– First Language
– Level of hearing
– Known / Unknown
– Human / Maschine
J. Gustafson, CTT, KTH 52
(.wav)
(.wav)
(.wav)
(.wav)
(.wav)
(.wav)
(.wav) (.wav)
”Flyget, tåget och bilbranschen tävlar om lönsamhet och folkets gunst”.
Född iUSAex-Jugoslavien
(.wav)
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
27
J. Gustafson, CTT, KTH
Results 2005Results 2005Results 2005Results 2005
(earlier result in parentesis)(earlier result in parentesis)(earlier result in parentesis)(earlier result in parentesis)
Corpus Speech type Lexicon size Word Error Rate (%)
Human Error Rate (%)
Digit string (phone) spontaneous 10 0.3 0.009
Resource Management
read 1000 3.6 0.1
ATIS spontaneous 2000 2 -
Wall Street Journal
read 64000 6.6 1
Radio News mixed 64000 13.5 (20) -
Switchboard (phone)
conversation 10000 19.3 (38) 4
Call Home (phone) conversation 10000 30 (50) -
J. Gustafson, CTT, KTH 54
Natural Language Natural Language Natural Language Natural Language UnderstandingUnderstandingUnderstandingUnderstanding
Purpose: produce meaning representation from ASR output
Input: String(s) lattice of words
Output: Meaning representation
Considerations: Type of grammar (Finite-state, Full parse or Word-spotting)
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
28
J. Gustafson, CTT, KTH 55
Phrases and syntaxPhrases and syntaxPhrases and syntaxPhrases and syntax
I want to take the boat
J. Gustafson, CTT, KTH 56
Robust Utterance AnalysisRobust Utterance AnalysisRobust Utterance AnalysisRobust Utterance Analysis
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
29
J. Gustafson, CTT, KTH 57
Pragmatics/world knowledgePragmatics/world knowledgePragmatics/world knowledgePragmatics/world knowledge
• Flights can be delayed• To be delayed is bad• ”Usually” → recurring in time. How often is ”usually”?
• Is this particular flight delayed a lot? • What would that mean for connecting flights?
S: There is a flight at 8.10.
U: Isn’t that usually delayed?
J. Gustafson, CTT, KTH 58
Dialogue ManagementDialogue ManagementDialogue ManagementDialogue Management
Purpose: decide the next system action
Input: a meaning representation from NLU
Output: High-level communicative goal(s)
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
30
J. Gustafson, CTT, KTH 59
Automating DialogueAutomating DialogueAutomating DialogueAutomating Dialogue
• To automate a dialogue, we must model:User’s Intentions User’s Intentions User’s Intentions User’s Intentions
• Recognized from input media forms (multi-modality)
• Disambiguated by the Context
System ActionsSystem ActionsSystem ActionsSystem Actions• Based on the task model and on output medium forms
(multi-media)
Dialogue FlowDialogue FlowDialogue FlowDialogue Flow• Content Selection (what to say)
• Information Packaging (how much to say)
• Turn-taking (who can speak/interrupt)
J. Gustafson, CTT, KTH 60
Knowledge SourcesKnowledge SourcesKnowledge SourcesKnowledge Sourcesfor Dialogue Systemsfor Dialogue Systemsfor Dialogue Systemsfor Dialogue Systems
• Context:– Dialogue history– Current task, slot, state
• Ontology:– World Knowledge model– Domain model
• Language:– Models of conversational competence:
• Turn-taking strategies• Discourse obligations
– Linguistic models:• Grammars, Dictionaries, Corpora
• User Model:– Information about user that could be relevant to the dialogue
• Age, gender, spoken language, location, preferences, etc.• Dynamic information: the current Mental State (Belief, Desires, Intentions)
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
31
J. Gustafson, CTT, KTH 61
Dialogue Control StrategiesDialogue Control StrategiesDialogue Control StrategiesDialogue Control Strategies
• Finite-state based systems– dialog and states explicitly specified
• Frame based systems– dialog separated from information states
• Agent based systems– model of intentions, goals, beliefs
J. Gustafson, CTT, KTH 62
Natural Language GenerationNatural Language GenerationNatural Language GenerationNatural Language Generation
Purpose: produce an output string to be shown/spoken to the user
Input: Representation from DM
Output: String for TTS (possibly marked for prosody, etc.)
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
32
J. Gustafson, CTT, KTH 63
Utterance GenerationUtterance GenerationUtterance GenerationUtterance Generation
• Predefined utterances• Frames with slots• Generation based on grammar and underlying semantics
J. Gustafson, CTT, KTH 64
TextTextTextText----totototo----Speech SynthesisSpeech SynthesisSpeech SynthesisSpeech Synthesis
Input: text string (with tags)
Purpose: speak string to user
Considerations: Human-like, Flexibility
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
33
J. Gustafson, CTT, KTH 65
What is speech synthesis?What is speech synthesis?What is speech synthesis?What is speech synthesis?• Recorded human speech
– Words and phrases
– fix vocabulary
• Systematically recorded fragments of speech– one fix speaker
• Record the phoneme transitions "diphones"
• Parametric synthesis (artificial)– Formant synthesis
– articulatory synthesis
• Multimodal synthesis (talking head)
J. Gustafson, CTT, KTH 66
TextTextTextText----totototo----speech (TTS)speech (TTS)speech (TTS)speech (TTS)
text
Linguistic analysis
Prosodic analysis
Phonetic description
Sound generation
Morphological analysisLexicon and rulesSyntax analysis
Rules and lexicon
Rules and unit selection
Concatenation Rules
Language ident.
“abcd”
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
34
J. Gustafson, CTT, KTH 67
Text PreprocessingText PreprocessingText PreprocessingText Preprocessing
• Sentence end detection (semicolon, period – ratio, time and decimal point, sentence ending respectively)
• Abbreviations (e.g. – for instance). Changed to their full form with the help of lexicons
• Acronyms (I.B.M – these can be read as a sequence of characters, or NASA which can be read following the default way)
• Numbers (Once detected, first interpreted as rational, time of the day, dates and ordinal depending on their context)
• Idioms (e.g. “In spite of”, “as a matter of fact”– these are combined into single FSU using a special lexicon)
Olov Engwall, Speech synthesis, 2008
J. Gustafson, CTT, KTH 68
GraphemeGraphemeGraphemeGrapheme----totototo----phoneme conversionphoneme conversionphoneme conversionphoneme conversion
• Dictionary:– Store a maximum of phonological knowledge into a lexicon.– Compounding rules describe how the morphemes of dictionary items are modified.– Hand-corrected, expensive– The lexicon is never complete: needs out of vocabulary pronouncer, transcribed by rule.
• Rules:– A set of letter to sound (grapheme to phoneme) rules.– Words pronounced in a such a particular way that they have their own rule are stored in
exceptions directory.– Fast & easy, but lower accuracy
• Machine learning:– Cart tree– Analogy pronunciation
Olov Engwall, Speech synthesis, 2008
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
35
J. Gustafson, CTT, KTH 69
Sound generation approachesSound generation approachesSound generation approachesSound generation approaches
• Parametric synthesis (Synthesis by Rule )– Formant synthesis or articulatory synthesis
– Cognitive approach of the phonation mechanism– Speech is produced by mathematical rules that formally describe the
influence of phonemes on one another
– Flexible, but lower quality - today
• Synthesis by Concatenation– Diphones or larger units (unit selection)
– Limited knowledge of the data to be handled– Elementary speech units are stored in a database and then
concatenated and processed to produce the speech signal– One speaker
J. Gustafson, CTT, KTH
TextTextTextText----totototo----Speech Synthesis Speech Synthesis Speech Synthesis Speech Synthesis (TTS) Evolution(TTS) Evolution(TTS) Evolution(TTS) Evolution
1962 1967 1972 1977 1982 1987 1992 1997 2002
Year
Formant
Synthesis
Bell Labs; Joint
Speech Research
Unit; MIT (DEC-
Talk); Haskins
Lab; KTH
LPC-Based
Diphone/Dyad
Synthesis
Bell Labs; CNET;
Bellcore; Berkeley
Speech
Technology
Unit Selection
Synthesis
ATR in Japan; CSTR
in Scotland; BT in
England; AT&T Labs
(1998); L&H in
Belgium
Poor
Intelligibility;
Poor Naturalness
Good
Intelligibility;
Poor Naturalness
Good
Intelligibility;
Customer Quality
Naturalness
(Limited Context)
HMM
Synthesis
HTS, Nagoya,
Edinburgh
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
36
J. Gustafson, CTT, KTH 7171
Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959--------1987)1987)1987)1987)1987)1987)1987)1987)
• Haskins, 1959 • KTH – Stockholm, 1962• Bell Labs, 1973 • MIT, 1976• MIT-talk, 1979• Speak ‘N spell, 1980• BELL Labs, 1985• Dec talk, 1987
J. Gustafson, CTT, KTH
Source filter theorySource filter theorySource filter theorySource filter theory
Source
(glottal flow)filter
(shape of vocal tract)
Radiation
(lips)
Time:
Frequency:
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
37
J. Gustafson, CTT, KTH 73
ArticulatoryArticulatoryArticulatoryArticulatory parametersparametersparametersparameters
• Jaw opening
• Lip rounding
• Lip Protrusion
• Tongue position
• Tongue height
• Tongue tip
• Velum
• Hyoid
J. Gustafson, CTT, KTH
Concatenative synthesisConcatenative synthesisConcatenative synthesisConcatenative synthesis• Diphone synthesis
– Svensk(Mbrola)
• Unit selection– Svensk weather (CSTR) good, less good
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
38
J. Gustafson, CTT, KTH
Diphone synthesisDiphone synthesisDiphone synthesisDiphone synthesis
• Mid-phone is more stable than edge:
• Mid-phone is more stable than edge:
Slide from Dan Jurafsky
J. Gustafson, CTT, KTH
Diphone SynthesisDiphone SynthesisDiphone SynthesisDiphone Synthesis
• Well-understood, mature technology
• Problems:– Signal processing still necessary for modifying durations
– Source data is still not natural
– Units are just not large enough; can’t handle word-specific effects, etc
Slide from Dan Jurafsky
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
39
J. Gustafson, CTT, KTH
Unit Selection SynthesisUnit Selection SynthesisUnit Selection SynthesisUnit Selection Synthesis
• Natural data solves problems with diphones– Diphone databases are carefully designed but:
• Speaker makes errors• Speaker doesn’t speak intended dialect• Require database design to be right
– If it’s automatic• Labeled with what the speaker actually said• Coarticulation, schwas, flaps are natural
• “There’s no data like more data”– Lots of copies of each unit mean you can choose just the
right one for the context– Larger units mean you can capture wider effects
Slide from Dan Jurafsky
J. Gustafson, CTT, KTH
Unit Selection SummaryUnit Selection SummaryUnit Selection SummaryUnit Selection Summary
• Advantages– Quality is far superior to diphones
– Natural prosody selection sounds better
• Disadvantages:– Quality can be very bad in places
• HCI problem: mix of very good and very bad is quite annoying
– Synthesis is computationally expensive
– Can’t synthesize everything you want:
• Diphone technique can move emphasis
• Unit selection gives good (but possibly incorrect) result
Slide from Richard Sproat
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
40
J. Gustafson, CTT, KTH
Current trend Current trend Current trend Current trend –––– machine learningmachine learningmachine learningmachine learning
• Problems with Unit Selection Synthesis– Can’t modify signal
– (mixing modified and unmodified sounds bad)
– But database often doesn’t have exactly what you want
• Solution: HMM (Hidden Markov Model) Synthesis– Won the last TTS bakeoff.
– Sounds unnatural to researchers
– But naïve subjects preferred it
– Has the potential to improve on both diphone and unit selection.
Slide from Dan Jurafsky
J. Gustafson, CTT, KTH 80
HMM synthesisHMM synthesisHMM synthesisHMM synthesis
• A speech synthesis technique based on• HTK (Hidden Markov Model Toolkit)• Developed by the HTS working group• at the Department of Computer Science Nagoya
• Institute of Technology• Interdisciplinary Graduate School of Science and
• Engineering Tokyo Institute of Technology• http://hts.sp.nitech.ac.jp
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
41
J. Gustafson, CTT, KTH 81
Features used in our Features used in our Features used in our Features used in our Swedish HMM synthesis trainingSwedish HMM synthesis trainingSwedish HMM synthesis trainingSwedish HMM synthesis training
• Segment features: – immediate context– position in syllable
• Syllable features – Stress and lexical accent type– position in word and phrase
• Word features – number of syllables– position in phrase– morphological feature (compound or not)– part-of-speech tag (content or function word)
• Phrase features– phrase length in terms of syllables and words
• Utterance features:– length in syllables, words and phrases
• Speaker – dialect
J. Gustafson, CTT, KTH
Dalarna Norrland Skåne Gotland SvealandGötaland
vart tar universum slut
centerpartister och kristdemokrater menar dock att
brudparet kan slippa betala det kan bådas föräldrar
göra
gula sidorna finns från och med i fredags sökbart via
mobiltelefonen
om man till exempel tar en telefon och frågar hur den
fungerar så svarar vetenskapsmannen med att lyfta på
luren och slå numret
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
42
J. Gustafson, CTT, KTH 83
Multimodal SynthesisMultimodal SynthesisMultimodal SynthesisMultimodal Synthesis(talking heads)(talking heads)(talking heads)(talking heads)
Input: tagged text string
Purpose: support synthesis with lip movements
Considerations: Synchronization, Human-like?
J. Gustafson, CTT, KTH 84
Why Talking Heads?Why Talking Heads?Why Talking Heads?Why Talking Heads?
• The most natural form of communication that we know of is face-to-face interaction
• A virtual talking head can improve information transfer
– Through expression, gaze, head movements
– Through lip- cheek and tongue movements
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
43
J. Gustafson, CTT, KTH 85
Tasks of an Animated AgentTasks of an Animated AgentTasks of an Animated AgentTasks of an Animated Agent
• Provide intelligible synthetic speech• Indicate emphasis and focus in utterances • Support turn-taking• Give spatial references (gaze, pointing etc)• Provide non-verbal back-channeling• Indicate the system’s internal state
J. Gustafson, CTT, KTH 86
ApplicationsApplicationsApplicationsApplications
• Improved speech synthesis• Human-Computer Interface in spoken dialogue systems
• Aid for hearing impaired• Educational software• Stimuli for perceptual experiments
• Entertainment: games, virtual reality, movies etc.
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
44
J. Gustafson, CTT, KTH 87
Techniques for facial animationTechniques for facial animationTechniques for facial animationTechniques for facial animation
• Talking head synthesis requires a signal model and a control model.
The signal model may be • Video-based (2D)
– Enables realistic reproduction– Lacking flexibility
• Muscle based (3D)– Highly flexible– Difficult to gather anatomical data– Computationally intensive (although less of a problem today...)
• Direct parametrisation/morphing based (3D)– Good compromise between flexibility and realism– Simple data acuistion using optical methods
J. Gustafson, CTT, KTH
Direct parameterisationDirect parameterisationDirect parameterisationDirect parameterisation
• 3D-modelling
• Deformation through high-level parameters
• Different possible parameterisations:
– Articulatory oriented
– MPEG-4 (low-level)
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
45
J. Gustafson, CTT, KTH 89
Articulatory oriented direct Articulatory oriented direct Articulatory oriented direct Articulatory oriented direct parameterisationparameterisationparameterisationparameterisation
• High-level parameterisation tailored for visual speech animation
• Parameter set includes– Jaw opening
– Lip rounding
– Bilabial closure
– Laboidental closure
• Parameters are normalized relatve to spatial targets
J. Gustafson, CTT, KTH 90
MPEGMPEGMPEGMPEG----4 direct parameterisation4 direct parameterisation4 direct parameterisation4 direct parameterisation
• Original purpose: model based video coding
• The standard defines a generic face object
• 84 feature points (FPs)• 68 facial animation parameters (FAPs)
• FAPs are normalized relative to distances in the face
• Expressed in FAPU (FAP unit)
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
46
J. Gustafson, CTT, KTH 91
Measuring talking facesMeasuring talking facesMeasuring talking facesMeasuring talking faces
• Cyberware scanning– High definition 3D shape and texture
– Static
– Expensive equipment
• Motion capture– Captures points in 3D
– Dynamic
– Starting to become affordable
• Video analysis– Can provide texture, contours and points
– Inexpensive hardware
– Typically less accurate than motion capture
J. Gustafson, CTT, KTH
Collection of Collection of Collection of Collection of audioaudioaudioaudio----visual databasesvisual databasesvisual databasesvisual databases
� Eliciting technique: information
seeking scenario
� Focus on the speaker who has the
role of information giver
� The speaker seats facing 4 infrared
cameras, a digital video-camera, a
microphone The other person is only
video recorded.
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
47
J. Gustafson, CTT, KTH
Recording and modelRecording and modelRecording and modelRecording and model
J. Gustafson, CTT, KTH
Emotional synthesisEmotional synthesisEmotional synthesisEmotional synthesis
Happiness Anger Fear
Surprise Disgust Sadness
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
48
J. Gustafson, CTT, KTH
Interactions: emotion and articulationInteractions: emotion and articulationInteractions: emotion and articulationInteractions: emotion and articulation(from AV speech database (from AV speech database (from AV speech database (from AV speech database –––– EU/PF_STAR EU/PF_STAR EU/PF_STAR EU/PF_STAR
project)project)project)project)
J. Gustafson, CTT, KTH
Datadriven facial synthesis with MPEG4 modelDatadriven facial synthesis with MPEG4 modelDatadriven facial synthesis with MPEG4 modelDatadriven facial synthesis with MPEG4 model
Happy Angry
Sad Surprised
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
49
J. Gustafson, CTT, KTH 97
An example SDSAn example SDSAn example SDSAn example SDS
J. Gustafson, CTT, KTH 98
The NICE systemThe NICE systemThe NICE systemThe NICE system
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
50
J. Gustafson, CTT, KTH 99
BackgroundBackgroundBackgroundBackground
• NICE was a EU project 2002-2005• Five Partners
– Liquid Media, Sweden
– Scansoft (initially Philips, now Nuance), Germany
– Limsi, France
– TeliaSonera, Sweden
– Nislab, Denmark
• Two systems were developed– a Swedish Fairytale game
– an English HC Andersen story-teller
J. Gustafson, CTT, KTH 100
The NICE system setupThe NICE system setupThe NICE system setupThe NICE system setup
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
51
J. Gustafson, CTT, KTH 101
A computer game with A computer game with A computer game with A computer game with multimodal dialoguemultimodal dialoguemultimodal dialoguemultimodal dialogue
• It puts new & different demands on speech and dialogue technology– young users– a game lasts for many hours– many kinds of dialogue: problem solving, negotiation,
information-seeking, socializing, command & control, multi-party, …
– the setup is multimodal and asynchronous
• Many things are easier than for ‘normal’ dialogue systems– low penalty for recognition errors– voluntary suspension of belief– gamers already know all the clichés
• $ Might pay off big some day $
J. Gustafson, CTT, KTH 102
The fairyThe fairyThe fairyThe fairy----tale gametale gametale gametale game
• Two scenes – the room and the fairy-tale world– Scene one – Multimodal training session put-that-there– Scene two – Multi-party negotiation dialogue
• Three characters– Cloddy Hans – slow but friendly Helper character– Karen – fast and sullen Gatekeeper character– Thumbelina – silent supporting character with attitudinal gestures
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
52
J. Gustafson, CTT, KTH 103
Inside a fairyInside a fairyInside a fairyInside a fairy----tale braintale braintale braintale brain
• World model: Interrelated (Java) objects representing things, characters, and their relationships.
• Discourse history: Who said what when?
• Agenda: Goals, actions and their causal relationships.
J. Gustafson, CTT, KTH 104
EventEventEventEvent----based dialoguebased dialoguebased dialoguebased dialogue
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
53
J. Gustafson, CTT, KTH 105
AttentionalAttentionalAttentionalAttentional and emotionaland emotionaland emotionaland emotionaloutput behaviorsoutput behaviorsoutput behaviorsoutput behaviors
J. Gustafson, CTT, KTH 106
Body gestures and Body gestures and Body gestures and Body gestures and physical actionsphysical actionsphysical actionsphysical actions
Dialogsystem: Joakim Gustafson
Tal, musik och hörsel
54
J. Gustafson, CTT, KTH 107
The End