0910 spktek talteknologi jg.ppt · dialogsystem: joakim gustafson tal, musik och hörsel 5 j....

Dialogsystem: Joakim Gustafson

Tal, musik och hörsel

1

J. Gustafson, CTT, KTH http://www.speech.kth.se

Joakim Gustafson

Centrum för Talteknologi, KTHTal, musik och hörsel, KTH

DialogsystemDialogsystemDialogsystemDialogsystem

J. Gustafson, CTT, KTH 2

AgendaAgendaAgendaAgenda

• Introduction to speaker• Spoken Dialogue Systems• Research SDSs• Commercial SDSs• SDS Components• An example SDS



2


IntroductionIntroductionIntroductionIntroduction


Joakim GustafsonJoakim GustafsonJoakim GustafsonJoakim Gustafson

1987-1992 Electrical Engineering program at KTH

1992-1993 Linguistics at Stockholm University

1993-2000 PhD studies at CTT/KTH

2000-2007 Senior researcher at Telia R&D department

2007– Assistant Professor CTT/KTH



3


My research field:My research field:My research field:My research field:spoken dialogue systemsspoken dialogue systemsspoken dialogue systemsspoken dialogue systems


Multi Multi Multi Multi ----disciplinary fielddisciplinary fielddisciplinary fielddisciplinary field



4

J. Gustafson, CTT, KTH

Research areas• Speech production

• Speech perception

• Communication aids

• Multimodal speech synthesis

• Speech recognition

• Speaker verification

• Spoken conversational speech

• Interactive dialogue systems

CTT CTT CTT CTT ---- Centrum Centrum Centrum Centrum förförförför talteknologitalteknologitalteknologitalteknologi


The KTH speech groupThe KTH speech groupThe KTH speech groupThe KTH speech group

---- Early daysEarly daysEarly daysEarly days

Gunnar Fant and OVE I 1953



5


Namn Nr Årskurs Period Poäng

Talteknologi(Speech Technology) 2F1111 D3-4, E4, F4 3 4

Talteknologi, utökad kurs(Speech Technology, extended course) 2F1112 Media 3-4 3 5

Spektrala transformer för Media 2F1120 Media 3-4 2 4

Musikakustik(Music Acoustics) 2F1212 D4, E4, F4 4 4

Musical communication & music technology(Musikalisk kommunikation och musikteknologi)

2F1213 Media 3-4,E4, D4 4 5

Elektroakustik(Electroacoustics) 2F1400 E4-5, M4-5 1 4

Audioteknik(Audio technology) 2F1410 Media 3-4,

E4, D4 2 5

Orkesterspelets teori(Theory of orchestral performance) 2F1601 alla 6

Orkesterspelets praktik(Orchestral performance) 2F1602 alla 6

Elektroprojekt 2U1700 E1 3-4 5

Medieteknik grundkurs (avsnitt Ljud) 2D1574 Media 2 1-2 12


Spoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue Systems



6


Speech application DomainsSpeech application DomainsSpeech application DomainsSpeech application Domains

• Information retrieval systems– e.g. train time table information or directory

inquiries

• Ordering– ticket booking, buying music

• Command control systems– home control (“turn the radio off”) or voice

command shortcuts (“save”)

• Dictation


Application Domains (cont)Application Domains (cont)Application Domains (cont)Application Domains (cont)

• Games and entertainment– Games can take advantage of the more social aspects of

spoken dialogue

• Co-ordinated collaboration– The task of controlling or over-viewing complex situations

requires flexibility and efficiency.

• Expert systems– Diagnose and help systems that need to reason about facts

and goals and may benefit from natural dialogue

• Learning and training – Naturalness, flexibility, and robustness are attractive features

in training environments



7


The spoken dialogueThe spoken dialogueThe spoken dialogueThe spoken dialoguesystem vision system vision system vision system vision

An interface that allows speakers to interact with a computer using spontaneous, unconstrained speech


Talking machines as we Talking machines as we Talking machines as we Talking machines as we know them from movies…know them from movies…know them from movies…know them from movies…

• Star Trek

• HAL 9000 (2001)

• C3PO (Star Wars)



8


“Välkommen till

Telia…Här beskriver du

ditt ärende med egna ord

istället för att knappa dig

fram. Om du säger vad du

vill ha hjälp med så kan

jag koppla fram dig. Vad

gäller ditt ärende?”

Hallå? 80 kategorier

Support Agent

Billing info

Mobile Refill

Broadband order

Broadband availability

Mobile info agent

“Ja hallå, vad

gäller ditt ärende?

Det gäller en faktura

Ok ett faktura ärende

[PAUS] Gäller det fast

telefoni,mobil telefoni

bredband eller uppringt

internet?

Bredband

...and since a couple of ...and since a couple of ...and since a couple of ...and since a couple of years from real systemsyears from real systemsyears from real systemsyears from real systems

(an example from 90200)


How do we want speaking How do we want speaking How do we want speaking How do we want speaking machines to behave?machines to behave?machines to behave?machines to behave?

As us...

...but to a limit?



9


Interface metaphorsInterface metaphorsInterface metaphorsInterface metaphors

• Exploit specific knowledge that users already have from other domains

• Give the user instantaneous knowledge about how to interact

• Examples include – The desktop – the computer is a workplace

– File system – the computer is a filing cabinet

– Games – the computer is a toy


Metaphors for Metaphors for Metaphors for Metaphors for speech interfacesspeech interfacesspeech interfacesspeech interfaces

• The tool/interface metaphor–Speech as an alternative interface technology in GUIs (multimodality)

• The servant/human metaphor– The computer as an entity with human-like conversational abilities

One is not necessarily better than the other!



10


Applications of machines with Applications of machines with Applications of machines with Applications of machines with human metaphor interactionhuman metaphor interactionhuman metaphor interactionhuman metaphor interaction

• Research tool• Cassel: ‘‘a machine that acts human enough that we respond to it as we respond to another human”

• Second LanguageLearning

• Social Robots

• Computer Games


What would make a What would make a What would make a What would make a speech interface intuitive?speech interface intuitive?speech interface intuitive?speech interface intuitive?

• The user should not have to learn: – When to speak

– How to speak

– What to say

• The system should support the user by:– Clearly showing when it will speak

– Saying things that reflects what it understands



11


When to speakWhen to speakWhen to speakWhen to speak

• The computer understands when the user intends to stop or continue speaking

• The computer indicates when it intends to stop or continue speaking


Controlling the dialogue flowControlling the dialogue flowControlling the dialogue flowControlling the dialogue flow

• Conversation has two channels– Exchange of information

– Control of the exchange of information

• Control– Turn-taking (who speaks when)

– Feedback (attention, attitude…)

Dialogue systems needs both controls



12


Using superficial methods to Using superficial methods to Using superficial methods to Using superficial methods to handle dialogue flowhandle dialogue flowhandle dialogue flowhandle dialogue flow

• Talk after the beep

• Talk-now icon/cursor

• Push-to-talk button


• Facial gestures and head movements

• Prosody and extra-linguistic sounds

Using humanUsing humanUsing humanUsing human----like methods to like methods to like methods to like methods to handle dialogue flowhandle dialogue flowhandle dialogue flowhandle dialogue flow



13


The MonAMI ReminderThe MonAMI ReminderThe MonAMI ReminderThe MonAMI Reminder

Google Calendar

ECA

SMS notifications

Web GUI

15.00 Meeting with Sara at Wayne’s Coffee

Monday 13

Handwriting


Attention and interaction controllerAttention and interaction controllerAttention and interaction controllerAttention and interaction controller

NonAttentive

Attentive Speaking

Listening

UserStartLook

SystemInitiative

SystemResponse

SystemStopSpeak

UserStopLook

Holding

Timeout

Interrupted

UserStartSpeak SystemIgnore (resume)

SystemResponse (restart)SystemIgnore

Calling

PausingSpeakingAttending

Not attending

Attending

Speaking

Not attending

SystemInitiative

UserStartLook

SystemResponse

UserStartSpeak

UserStopLookUserStartLook (resume)

SystemStopSpeak

System

User

SystemAbortSpeak



14


Facial animationFacial animationFacial animationFacial animation

NonAttentive Attentive SystemIgnoreListening


Experiments in a tutoring settingExperiments in a tutoring settingExperiments in a tutoring settingExperiments in a tutoring setting

Tutoring setting: Tutor explaining the system to a user

Problem• Is the user talking to the tutor or the system?

• Compare push-to-talk (PTT) with look-to-talk (LTT)

Subjects • 4 elderly persons from the target group

• 4 collegues



15


Example Example Example Example


Research SpokenResearch SpokenResearch SpokenResearch SpokenDialogue SystemsDialogue SystemsDialogue SystemsDialogue Systems



16


Classic research dialogue systemsClassic research dialogue systemsClassic research dialogue systemsClassic research dialogue systems

• Historic research systems– Voyager (1989)– ATIS (1992)– SUNDIAL (1993)– TRAINS (1996)

• Application– Philips Train Information (1995)

• Large Efforts– Communicator – Verbmobil– SmartKom


The Nordic SceneThe Nordic SceneThe Nordic SceneThe Nordic Scene

• Stockholm, Sweden– KTH (Waxholm, August, AdApt, Higgins)– Telia (Resebokning, NICE)

• Linköping, Sweden– IDA (LINLIN, Ornitology, Movie Finder)

• Göteborg, Sweden– Lingvistiken (TRINDI, GODIS, TALK)– Chalmers (State-chart xml)

• Aalborg, Odense Denmark• Helsinki, Tampere Finland• Trondheim, Norway



17


The Waxholm Project (92The Waxholm Project (92The Waxholm Project (92The Waxholm Project (92----95)95)95)95)

• Tourist information– Stockholm archipelago

– time-tables, hotels, hostels, camping and dining possibilities.

• Mixed initiative dialogue– speech recognition

– multimodal synthesis

• Graphic information– pictures, maps, charts and

time-tables.


Waxholm SystemWaxholm SystemWaxholm SystemWaxholm System



18


The Waxholm systemThe Waxholm systemThe Waxholm systemThe Waxholm system


The August kiosk at The August kiosk at The August kiosk at The August kiosk at KulturhusetKulturhusetKulturhusetKulturhuset 98 98 98 98

• Swedish spoken dialogue system for public use • Animated agent named after August Strindberg• Speech technology components developed at KTH• Designed to be easily expanded and reconfigured • Used to collect spontaneous speech data



19


The August system


The HIGGINS system (03The HIGGINS system (03The HIGGINS system (03The HIGGINS system (03----))))

• The primary domain of HIGGINS is city navigation for pedestrians.

• Secondarily, HIGGINS is intended to provide simple information about the immediate surroundings.



20


HigginsHigginsHigginsHiggins


The Commercial Speech The Commercial Speech The Commercial Speech The Commercial Speech ApplicationsApplicationsApplicationsApplications



21


Access Convergence & SpeechAccess Convergence & SpeechAccess Convergence & SpeechAccess Convergence & Speech


Commercial ApplicationsCommercial ApplicationsCommercial ApplicationsCommercial Applicationsof Speech technologyof Speech technologyof Speech technologyof Speech technology

• Dictation– Medical

– Legal

– Windows Vista (95-99% without training?)

• Customer care center automation– Call Routing

– Technical support

• Multimodal mobile applications– Information search

– Content download

– Entertainment



22


Problem solvingProblem solvingProblem solvingProblem solving

• SpeechCycle– Broadband support Agent

– Video support Agent

– IP Telephony Agent


Some nice dialogue features of Some nice dialogue features of Some nice dialogue features of Some nice dialogue features of Speechcycle’s Agents Speechcycle’s Agents Speechcycle’s Agents Speechcycle’s Agents

• Open prompts with free speech ASR

• References to user’s previous attempts to fix problem

• Answer to free prompts

• Error identification

• Memory of earlier turns

• Encouraging the user to stay



23


SDS ComponentsSDS ComponentsSDS ComponentsSDS Components


Spoken dialogue processingSpoken dialogue processingSpoken dialogue processingSpoken dialogue processing



24


The Components of The Components of The Components of The Components of Spoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue Systems

• Automatic Speech Recognition (ASR)• Natural Language Understanding (NLU)• Dialogue Management (DM)• Natural Language Generation (NLG)• Text-To-Speech synthesis (TTS)• Multimodal modules


Speech RecognizerSpeech RecognizerSpeech RecognizerSpeech Recognizer

Purpose: convert speech to text

Input: digital speech

Output: String(s) or lattice that represent the most probable utterance(s)

Considerations: Vocabulary size, Grammar and Speech type



25


Speech Recognition Speech Recognition Speech Recognition Speech Recognition

• Identify the speech sounds– Measure the spectrum of the speech signal 100

times per second

• Match the recognized sounds against a pronunciation dictionary– Only the allowed words are active

• Construct a sentence of the most probable words – Taking the probability of the context into account


Statistically based recognitionStatistically based recognitionStatistically based recognitionStatistically based recognition

Search

Language modelPossible word sequences

N-best

Sentence understanding

N-bestN-bestN-best

Phoneme modelsAcoustic deskr.

Lexiconvocabulary + transcription

Chosen sentenceKnowledge

Analysis

Comparison

ParameterisationFFT16 kHz 100 Hz

Spectral analysis

Classification

ContinuousDiscrete



26


What makes speechWhat makes speechWhat makes speechWhat makes speechrecognition difficult? recognition difficult? recognition difficult? recognition difficult?

Speaker Channel Listener

• Between speakers– Age

– Gender

– Anatomy

– Dialect

• Within speaker– Stress

– Mood

– Health

– Formel / Spontanteous

• Environment– Additive noise

– Room

• Microphone, Telephone– Bandwidth

– Noise

• Listener– Age

– First Language

– Level of hearing

– Known / Unknown

– Human / Maschine


(.wav)

(.wav)

(.wav)

(.wav)

(.wav)

(.wav)

(.wav) (.wav)

”Flyget, tåget och bilbranschen tävlar om lönsamhet och folkets gunst”.

Född iUSAex-Jugoslavien

(.wav)



27


Results 2005Results 2005Results 2005Results 2005

(earlier result in parentesis)(earlier result in parentesis)(earlier result in parentesis)(earlier result in parentesis)

Corpus Speech type Lexicon size Word Error Rate (%)

Human Error Rate (%)

Digit string (phone) spontaneous 10 0.3 0.009

Resource Management

read 1000 3.6 0.1

ATIS spontaneous 2000 2 -

Wall Street Journal

read 64000 6.6 1

Radio News mixed 64000 13.5 (20) -

Switchboard (phone)

conversation 10000 19.3 (38) 4

Call Home (phone) conversation 10000 30 (50) -


Natural Language Natural Language Natural Language Natural Language UnderstandingUnderstandingUnderstandingUnderstanding

Purpose: produce meaning representation from ASR output

Input: String(s) lattice of words

Output: Meaning representation

Considerations: Type of grammar (Finite-state, Full parse or Word-spotting)



28


Phrases and syntaxPhrases and syntaxPhrases and syntaxPhrases and syntax

I want to take the boat


Robust Utterance AnalysisRobust Utterance AnalysisRobust Utterance AnalysisRobust Utterance Analysis



29


Pragmatics/world knowledgePragmatics/world knowledgePragmatics/world knowledgePragmatics/world knowledge

• Flights can be delayed• To be delayed is bad• ”Usually” → recurring in time. How often is ”usually”?

• Is this particular flight delayed a lot? • What would that mean for connecting flights?

S: There is a flight at 8.10.

U: Isn’t that usually delayed?


Dialogue ManagementDialogue ManagementDialogue ManagementDialogue Management

Purpose: decide the next system action

Input: a meaning representation from NLU

Output: High-level communicative goal(s)



30


Automating DialogueAutomating DialogueAutomating DialogueAutomating Dialogue

• To automate a dialogue, we must model:User’s Intentions User’s Intentions User’s Intentions User’s Intentions

• Recognized from input media forms (multi-modality)

• Disambiguated by the Context

System ActionsSystem ActionsSystem ActionsSystem Actions• Based on the task model and on output medium forms

(multi-media)

Dialogue FlowDialogue FlowDialogue FlowDialogue Flow• Content Selection (what to say)

• Information Packaging (how much to say)

• Turn-taking (who can speak/interrupt)


Knowledge SourcesKnowledge SourcesKnowledge SourcesKnowledge Sourcesfor Dialogue Systemsfor Dialogue Systemsfor Dialogue Systemsfor Dialogue Systems

• Context:– Dialogue history– Current task, slot, state

• Ontology:– World Knowledge model– Domain model

• Language:– Models of conversational competence:

• Turn-taking strategies• Discourse obligations

– Linguistic models:• Grammars, Dictionaries, Corpora

• User Model:– Information about user that could be relevant to the dialogue

• Age, gender, spoken language, location, preferences, etc.• Dynamic information: the current Mental State (Belief, Desires, Intentions)



31


Dialogue Control StrategiesDialogue Control StrategiesDialogue Control StrategiesDialogue Control Strategies

• Finite-state based systems– dialog and states explicitly specified

• Frame based systems– dialog separated from information states

• Agent based systems– model of intentions, goals, beliefs


Natural Language GenerationNatural Language GenerationNatural Language GenerationNatural Language Generation

Purpose: produce an output string to be shown/spoken to the user

Input: Representation from DM

Output: String for TTS (possibly marked for prosody, etc.)



32


Utterance GenerationUtterance GenerationUtterance GenerationUtterance Generation

• Predefined utterances• Frames with slots• Generation based on grammar and underlying semantics


TextTextTextText----totototo----Speech SynthesisSpeech SynthesisSpeech SynthesisSpeech Synthesis

Input: text string (with tags)

Purpose: speak string to user

Considerations: Human-like, Flexibility



33


What is speech synthesis?What is speech synthesis?What is speech synthesis?What is speech synthesis?• Recorded human speech

– Words and phrases

– fix vocabulary

• Systematically recorded fragments of speech– one fix speaker

• Record the phoneme transitions "diphones"

• Parametric synthesis (artificial)– Formant synthesis

– articulatory synthesis

• Multimodal synthesis (talking head)


TextTextTextText----totototo----speech (TTS)speech (TTS)speech (TTS)speech (TTS)

text

Linguistic analysis

Prosodic analysis

Phonetic description

Sound generation

Morphological analysisLexicon and rulesSyntax analysis

Rules and lexicon

Rules and unit selection

Concatenation Rules

Language ident.

“abcd”



34


Text PreprocessingText PreprocessingText PreprocessingText Preprocessing

• Sentence end detection (semicolon, period – ratio, time and decimal point, sentence ending respectively)

• Abbreviations (e.g. – for instance). Changed to their full form with the help of lexicons

• Acronyms (I.B.M – these can be read as a sequence of characters, or NASA which can be read following the default way)

• Numbers (Once detected, first interpreted as rational, time of the day, dates and ordinal depending on their context)

• Idioms (e.g. “In spite of”, “as a matter of fact”– these are combined into single FSU using a special lexicon)

Olov Engwall, Speech synthesis, 2008


GraphemeGraphemeGraphemeGrapheme----totototo----phoneme conversionphoneme conversionphoneme conversionphoneme conversion

• Dictionary:– Store a maximum of phonological knowledge into a lexicon.– Compounding rules describe how the morphemes of dictionary items are modified.– Hand-corrected, expensive– The lexicon is never complete: needs out of vocabulary pronouncer, transcribed by rule.

• Rules:– A set of letter to sound (grapheme to phoneme) rules.– Words pronounced in a such a particular way that they have their own rule are stored in

exceptions directory.– Fast & easy, but lower accuracy

• Machine learning:– Cart tree– Analogy pronunciation

Olov Engwall, Speech synthesis, 2008



35


Sound generation approachesSound generation approachesSound generation approachesSound generation approaches

• Parametric synthesis (Synthesis by Rule )– Formant synthesis or articulatory synthesis

– Cognitive approach of the phonation mechanism– Speech is produced by mathematical rules that formally describe the

influence of phonemes on one another

– Flexible, but lower quality - today

• Synthesis by Concatenation– Diphones or larger units (unit selection)

– Limited knowledge of the data to be handled– Elementary speech units are stored in a database and then

concatenated and processed to produce the speech signal– One speaker


TextTextTextText----totototo----Speech Synthesis Speech Synthesis Speech Synthesis Speech Synthesis (TTS) Evolution(TTS) Evolution(TTS) Evolution(TTS) Evolution

1962 1967 1972 1977 1982 1987 1992 1997 2002

Year

Formant

Synthesis

Bell Labs; Joint

Speech Research

Unit; MIT (DEC-

Talk); Haskins

Lab; KTH

LPC-Based

Diphone/Dyad

Synthesis

Bell Labs; CNET;

Bellcore; Berkeley

Speech

Technology

Unit Selection

Synthesis

ATR in Japan; CSTR

in Scotland; BT in

England; AT&T Labs

(1998); L&H in

Belgium

Poor

Intelligibility;

Poor Naturalness

Good

Intelligibility;

Poor Naturalness

Good

Intelligibility;

Customer Quality

Naturalness

(Limited Context)

HMM

Synthesis

HTS, Nagoya,

Edinburgh



36


Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959--------1987)1987)1987)1987)1987)1987)1987)1987)

• Haskins, 1959 • KTH – Stockholm, 1962• Bell Labs, 1973 • MIT, 1976• MIT-talk, 1979• Speak ‘N spell, 1980• BELL Labs, 1985• Dec talk, 1987


Source filter theorySource filter theorySource filter theorySource filter theory

Source

(glottal flow)filter

(shape of vocal tract)

Radiation

(lips)

Time:

Frequency:



37


ArticulatoryArticulatoryArticulatoryArticulatory parametersparametersparametersparameters

• Jaw opening

• Lip rounding

• Lip Protrusion

• Tongue position

• Tongue height

• Tongue tip

• Velum

• Hyoid


Concatenative synthesisConcatenative synthesisConcatenative synthesisConcatenative synthesis• Diphone synthesis

– Svensk(Mbrola)

• Unit selection– Svensk weather (CSTR) good, less good



38


Diphone synthesisDiphone synthesisDiphone synthesisDiphone synthesis

• Mid-phone is more stable than edge:

• Mid-phone is more stable than edge:

Slide from Dan Jurafsky


Diphone SynthesisDiphone SynthesisDiphone SynthesisDiphone Synthesis

• Well-understood, mature technology

• Problems:– Signal processing still necessary for modifying durations

– Source data is still not natural

– Units are just not large enough; can’t handle word-specific effects, etc




39


Unit Selection SynthesisUnit Selection SynthesisUnit Selection SynthesisUnit Selection Synthesis

• Natural data solves problems with diphones– Diphone databases are carefully designed but:

• Speaker makes errors• Speaker doesn’t speak intended dialect• Require database design to be right

– If it’s automatic• Labeled with what the speaker actually said• Coarticulation, schwas, flaps are natural

• “There’s no data like more data”– Lots of copies of each unit mean you can choose just the

right one for the context– Larger units mean you can capture wider effects



Unit Selection SummaryUnit Selection SummaryUnit Selection SummaryUnit Selection Summary

• Advantages– Quality is far superior to diphones

– Natural prosody selection sounds better

• Disadvantages:– Quality can be very bad in places

• HCI problem: mix of very good and very bad is quite annoying

– Synthesis is computationally expensive

– Can’t synthesize everything you want:

• Diphone technique can move emphasis

• Unit selection gives good (but possibly incorrect) result

Slide from Richard Sproat



40


Current trend Current trend Current trend Current trend –––– machine learningmachine learningmachine learningmachine learning

• Problems with Unit Selection Synthesis– Can’t modify signal

– (mixing modified and unmodified sounds bad)

– But database often doesn’t have exactly what you want

• Solution: HMM (Hidden Markov Model) Synthesis– Won the last TTS bakeoff.

– Sounds unnatural to researchers

– But naïve subjects preferred it

– Has the potential to improve on both diphone and unit selection.



HMM synthesisHMM synthesisHMM synthesisHMM synthesis

• A speech synthesis technique based on• HTK (Hidden Markov Model Toolkit)• Developed by the HTS working group• at the Department of Computer Science Nagoya

• Institute of Technology• Interdisciplinary Graduate School of Science and

• Engineering Tokyo Institute of Technology• http://hts.sp.nitech.ac.jp



41


Features used in our Features used in our Features used in our Features used in our Swedish HMM synthesis trainingSwedish HMM synthesis trainingSwedish HMM synthesis trainingSwedish HMM synthesis training

• Segment features: – immediate context– position in syllable

• Syllable features – Stress and lexical accent type– position in word and phrase

• Word features – number of syllables– position in phrase– morphological feature (compound or not)– part-of-speech tag (content or function word)

• Phrase features– phrase length in terms of syllables and words

• Utterance features:– length in syllables, words and phrases

• Speaker – dialect


Dalarna Norrland Skåne Gotland SvealandGötaland

vart tar universum slut

centerpartister och kristdemokrater menar dock att

brudparet kan slippa betala det kan bådas föräldrar

göra

gula sidorna finns från och med i fredags sökbart via

mobiltelefonen

om man till exempel tar en telefon och frågar hur den

fungerar så svarar vetenskapsmannen med att lyfta på

luren och slå numret



42


Multimodal SynthesisMultimodal SynthesisMultimodal SynthesisMultimodal Synthesis(talking heads)(talking heads)(talking heads)(talking heads)

Input: tagged text string

Purpose: support synthesis with lip movements

Considerations: Synchronization, Human-like?


Why Talking Heads?Why Talking Heads?Why Talking Heads?Why Talking Heads?

• The most natural form of communication that we know of is face-to-face interaction

• A virtual talking head can improve information transfer

– Through expression, gaze, head movements

– Through lip- cheek and tongue movements



43


Tasks of an Animated AgentTasks of an Animated AgentTasks of an Animated AgentTasks of an Animated Agent

• Provide intelligible synthetic speech• Indicate emphasis and focus in utterances • Support turn-taking• Give spatial references (gaze, pointing etc)• Provide non-verbal back-channeling• Indicate the system’s internal state


ApplicationsApplicationsApplicationsApplications

• Improved speech synthesis• Human-Computer Interface in spoken dialogue systems

• Aid for hearing impaired• Educational software• Stimuli for perceptual experiments

• Entertainment: games, virtual reality, movies etc.



44


Techniques for facial animationTechniques for facial animationTechniques for facial animationTechniques for facial animation

• Talking head synthesis requires a signal model and a control model.

The signal model may be • Video-based (2D)

– Enables realistic reproduction– Lacking flexibility

• Muscle based (3D)– Highly flexible– Difficult to gather anatomical data– Computationally intensive (although less of a problem today...)

• Direct parametrisation/morphing based (3D)– Good compromise between flexibility and realism– Simple data acuistion using optical methods


Direct parameterisationDirect parameterisationDirect parameterisationDirect parameterisation

• 3D-modelling

• Deformation through high-level parameters

• Different possible parameterisations:

– Articulatory oriented

– MPEG-4 (low-level)



45


Articulatory oriented direct Articulatory oriented direct Articulatory oriented direct Articulatory oriented direct parameterisationparameterisationparameterisationparameterisation

• High-level parameterisation tailored for visual speech animation

• Parameter set includes– Jaw opening

– Lip rounding

– Bilabial closure

– Laboidental closure

• Parameters are normalized relatve to spatial targets


MPEGMPEGMPEGMPEG----4 direct parameterisation4 direct parameterisation4 direct parameterisation4 direct parameterisation

• Original purpose: model based video coding

• The standard defines a generic face object

• 84 feature points (FPs)• 68 facial animation parameters (FAPs)

• FAPs are normalized relative to distances in the face

• Expressed in FAPU (FAP unit)



46


Measuring talking facesMeasuring talking facesMeasuring talking facesMeasuring talking faces

• Cyberware scanning– High definition 3D shape and texture

– Static

– Expensive equipment

• Motion capture– Captures points in 3D

– Dynamic

– Starting to become affordable

• Video analysis– Can provide texture, contours and points

– Inexpensive hardware

– Typically less accurate than motion capture


Collection of Collection of Collection of Collection of audioaudioaudioaudio----visual databasesvisual databasesvisual databasesvisual databases

� Eliciting technique: information

seeking scenario

� Focus on the speaker who has the

role of information giver

� The speaker seats facing 4 infrared

cameras, a digital video-camera, a

microphone The other person is only

video recorded.



47


Recording and modelRecording and modelRecording and modelRecording and model


Emotional synthesisEmotional synthesisEmotional synthesisEmotional synthesis

Happiness Anger Fear

Surprise Disgust Sadness



48


Interactions: emotion and articulationInteractions: emotion and articulationInteractions: emotion and articulationInteractions: emotion and articulation(from AV speech database (from AV speech database (from AV speech database (from AV speech database –––– EU/PF_STAR EU/PF_STAR EU/PF_STAR EU/PF_STAR

project)project)project)project)


Datadriven facial synthesis with MPEG4 modelDatadriven facial synthesis with MPEG4 modelDatadriven facial synthesis with MPEG4 modelDatadriven facial synthesis with MPEG4 model

Happy Angry

Sad Surprised



49


An example SDSAn example SDSAn example SDSAn example SDS


The NICE systemThe NICE systemThe NICE systemThe NICE system



50


BackgroundBackgroundBackgroundBackground

• NICE was a EU project 2002-2005• Five Partners

– Liquid Media, Sweden

– Scansoft (initially Philips, now Nuance), Germany

– Limsi, France

– TeliaSonera, Sweden

– Nislab, Denmark

• Two systems were developed– a Swedish Fairytale game

– an English HC Andersen story-teller


The NICE system setupThe NICE system setupThe NICE system setupThe NICE system setup



51


A computer game with A computer game with A computer game with A computer game with multimodal dialoguemultimodal dialoguemultimodal dialoguemultimodal dialogue

• It puts new & different demands on speech and dialogue technology– young users– a game lasts for many hours– many kinds of dialogue: problem solving, negotiation,

information-seeking, socializing, command & control, multi-party, …

– the setup is multimodal and asynchronous

• Many things are easier than for ‘normal’ dialogue systems– low penalty for recognition errors– voluntary suspension of belief– gamers already know all the clichés

• $ Might pay off big some day $


The fairyThe fairyThe fairyThe fairy----tale gametale gametale gametale game

• Two scenes – the room and the fairy-tale world– Scene one – Multimodal training session put-that-there– Scene two – Multi-party negotiation dialogue

• Three characters– Cloddy Hans – slow but friendly Helper character– Karen – fast and sullen Gatekeeper character– Thumbelina – silent supporting character with attitudinal gestures



52


Inside a fairyInside a fairyInside a fairyInside a fairy----tale braintale braintale braintale brain

• World model: Interrelated (Java) objects representing things, characters, and their relationships.

• Discourse history: Who said what when?

• Agenda: Goals, actions and their causal relationships.


EventEventEventEvent----based dialoguebased dialoguebased dialoguebased dialogue



53


AttentionalAttentionalAttentionalAttentional and emotionaland emotionaland emotionaland emotionaloutput behaviorsoutput behaviorsoutput behaviorsoutput behaviors


Body gestures and Body gestures and Body gestures and Body gestures and physical actionsphysical actionsphysical actionsphysical actions



54


The End

0910 spktek talteknologi jg.ppt · dialogsystem: joakim gustafson tal, musik och hörsel 5 j....

Documents