Speech Processing 11-492/18-492
Spoken Dialog Systems: SDS Components



Spoken Dialog Systems

More than just ASR and TTS:
Recognition
Language understanding
Manipulation of utterances
Generation of new information
Text generation
Synthesis

SDS Architecture

[Architecture diagram: ASR → Language Understanding → Dialog Manager → Language Generation → Synthesis, with error-handling strategies for non-understanding]

SDS Internals

Language Understanding: from words to structure

Dialog Manager:
State of dialog (who is talking)
Direction of dialog (what next)
References, user profile, etc.
Interaction with database/internet

Language Generation: from structure to words

Language Understanding

Parsing of SPEECH, not TEXT:
"Eh, I wanna go, wanna go to Boston tomorrow"
"If it's not too much trouble I'd be very grateful if one might be able to aid me in arranging my travel arrangements to Boston, Logan airport, at sometime tomorrow morning, thank you."

Both reduce to: Boston, tomorrow

Parsing: Output Structure

"I wanna go to Boston, tomorrow"
Destination: BOS
Departure: 20081028, AM
Airline: unspecified
Special: unspecified

Convert speech to structure
Sufficient for further processing/query
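The speech-to-structure step above can be sketched with hand-written keyword rules. This is a minimal illustration, not the course's actual parser; the city-code table and matching patterns are assumptions:

```python
import re

# Hypothetical airport-code table, for illustration only.
CITY_CODES = {"boston": "BOS", "pittsburgh": "PIT"}

def parse_utterance(text):
    """Map a (possibly disfluent) utterance to a slot frame."""
    text = text.lower()
    frame = {"destination": "unspecified", "departure": "unspecified",
             "airline": "unspecified", "special": "unspecified"}
    for city, code in CITY_CODES.items():
        # Disfluencies like "wanna go, wanna go to Boston" still match.
        if re.search(r"\b(to|go to)\s+" + city, text):
            frame["destination"] = code
    if "tomorrow" in text:
        frame["departure"] = "tomorrow"
        if "morning" in text:
            frame["departure"] += ", AM"
    return frame

print(parse_utterance("Eh, I wanna go, wanna go to Boston tomorrow"))
# → {'destination': 'BOS', 'departure': 'tomorrow', 'airline': 'unspecified', 'special': 'unspecified'}
```

Note how the repeated "wanna go" and the filler "Eh" are simply ignored: the parser extracts structure rather than requiring a grammatical sentence.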

Interaction Example

Intelligent Agent

Cheap Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose? I can help you go there.

find a cheap eating place for taiwanese food

User

SDS Process

User: "find a cheap eating place for taiwanese food"

[Diagram, built up across several slides: the intelligent agent dependency-parses the utterance, with relations AMOD, NN, and PREP_FOR linking the words to the slot candidates seeking, target, price, and food, against organized domain knowledge]

Ontology Induction (semantic slot)
Structure Learning (inter-slot relation)
Semantic Decoding: seeking="find", target="eating place", price="cheap", food="taiwanese"

Automatic Slot Induction (Chen et al., ASRU '13)

"can i have a cheap restaurant"

[Diagram: a FrameNet-style semantic parse yields the frames capability, expensiveness, and locale_by_use; frames are split into domain-specific vs. general to propose slot candidates]

Parsing vs Language Model

Language Model: models what actually gets said
Parsing: extracts the information you want
Models *can* be shared
A grammar only accepts things in the grammar, which can be over-limiting
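The over-limiting behavior is easy to demonstrate with a toy regex grammar (the grammar itself is invented for illustration): in-grammar sentences are accepted, but real disfluent speech falls outside it:

```python
import re

# Toy travel grammar: only a few fixed sentence shapes are accepted.
GRAMMAR = re.compile(
    r"^i (want|would like) to (go|fly) to (boston|pittsburgh)( tomorrow)?$")

def in_grammar(utterance):
    """True iff the utterance exactly matches the hand-written grammar."""
    return GRAMMAR.match(utterance.lower()) is not None

print(in_grammar("I want to fly to Boston tomorrow"))  # True
print(in_grammar("wanna go Boston"))  # False: real speech is over-limited away
```

A language model, by contrast, would assign "wanna go Boston" a reasonable probability because people actually say it.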

Neural Networks for SLU

RNN for Slot Filling:
Step 1: word embedding
Step 2: capture short-term dependencies
Step 3: capture long-term dependencies
Step 4: different types of neural architecture

http://deeplearning.net/tutorial/rnnslu.html#rnnslu

Mesnil et al. 2013
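A minimal, untrained Elman RNN forward pass illustrates the data flow of steps 1-3. All weights, the vocabulary, and the tag set below are made up, so the predicted tags are arbitrary; only the architecture (embedding → recurrent hidden state → per-token tag scores) is the point:

```python
import math

# Toy setup: tiny vocabulary, IOB-style slot tags, illustrative dimensions.
VOCAB = {"find": 0, "a": 1, "cheap": 2, "restaurant": 3}
TAGS = ["O", "B-price", "B-target"]

# Step 1: one small embedding vector per word (untrained).
E = [[0.1, 0.2, 0.0], [0.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.1, 0.8, 0.3]]
# Steps 2-3: recurrent weights let the hidden state carry context across steps.
W_xh = [[0.5, -0.2], [0.1, 0.4], [0.3, 0.3]]   # input (3) -> hidden (2)
W_hh = [[0.2, 0.0], [0.0, 0.2]]                # hidden -> hidden
W_hy = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]      # hidden -> tag scores (3)

def matvec(M, v):
    """Multiply v (length = rows of M) through M, returning one value per column."""
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]

def rnn_tag(tokens):
    """Emit one slot tag per input token, reusing the hidden state each step."""
    h = [0.0, 0.0]
    tags = []
    for tok in tokens:
        x = E[VOCAB[tok]]
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]        # new hidden state
        scores = matvec(W_hy, h)
        tags.append(TAGS[scores.index(max(scores))])
    return tags

print(rnn_tag("find a cheap restaurant".split()))
```

With trained weights, the tags would label "cheap" as a price slot and "restaurant" as a target slot, as in the rnnslu tutorial linked above.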

Interactive Learning for SLU

LUIS: interactive machine learning for language understanding

Advantages:
Non-experts can add knowledge via feature engineering
Active learning reduces heavy labeling

https://www.luis.ai/

Williams et al. 2016

Dialog ManagerDialog Manager

Maintain stateMaintain state Where are we in the dialogWhere are we in the dialog Whose turn is itWhose turn is it

Waiting for speakerWaiting for speaker Waiting for database query (stall user)Waiting for database query (stall user)

Deal with barge-inDeal with barge-in

Frame-Based Dialog Manager

Used for transaction dialogs
Generalizes the finite-state approach by allowing multiple paths to acquire info
Central data structure is a frame with slots
DM monitors the frame, filling in slots
Frame:
Set of information needed
Context for utterance interpretation
Context for dialogue progress
Allows mixed initiative
Allows over-answering
Also called form-based (MIT); often called "slot-filling"
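A minimal sketch of frame monitoring, showing how over-answering lets one user turn fill several slots at once (the slot names are illustrative):

```python
# Minimal frame-based dialog manager sketch; slot names are illustrative.
FRAME_SLOTS = ["origin", "destination", "time"]

def update_frame(frame, parsed):
    """Fill any slots present in the parse; over-answering fills several at once."""
    for slot, value in parsed.items():
        if slot in FRAME_SLOTS:
            frame[slot] = value
    return frame

def next_prompt(frame):
    """Ask for the first still-empty slot; done when the frame is full."""
    for slot in FRAME_SLOTS:
        if slot not in frame:
            return f"What is your {slot}?"
    return "Frame complete -- querying database."

frame = {}
# User over-answers the origin question with origin AND destination:
update_frame(frame, {"origin": "CMU", "destination": "Downtown"})
print(next_prompt(frame))  # What is your time?
```

Because the DM just scans for unfilled slots, the user can supply information in any order (mixed initiative) without the designer enumerating every path.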

Problems with Frames

Not easily applicable to complex tasks:
May not be a single frame
Dynamic construction of information
User access to "product"

Agenda + Frame

Product: hierarchical composition of frames

Process: Agenda
Generalization of a stack
Ordered list of topics
List of handlers
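The agenda can be sketched as an ordered list of handlers operating over the product; the Handler class and topic names below are assumptions for illustration:

```python
# Agenda sketch: an ordered list of topic handlers, generalizing a stack.
# Handler fields and topic names are illustrative.
class Handler:
    def __init__(self, topic, is_done):
        self.topic = topic
        self.is_done = is_done   # predicate over the product (frame hierarchy)

def run_agenda(agenda, product):
    """Visit topics in order; finished handlers drop off the agenda."""
    for handler in list(agenda):
        if handler.is_done(product):
            agenda.remove(handler)
        else:
            return handler.topic  # topic to raise next with the user
    return None                   # nothing left to discuss

product = {"trip": {"origin": "CMU"}}  # hierarchical composition of frames
agenda = [
    Handler("origin", lambda p: "origin" in p["trip"]),
    Handler("destination", lambda p: "destination" in p["trip"]),
]
print(run_agenda(agenda, product))  # destination
```

Unlike a plain stack, handlers can be reordered or inserted mid-dialog as the product grows, which is what makes the agenda suited to dynamically constructed tasks.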

Statistical Approaches to DM

Allow for dialog complexity beyond what a human designer can anticipate
Find optimal decisions for non-trivial design problems
Life-long learning

Decisions

Difficult design decisions over the course of interaction:
When to ask open vs. directive questions?
When to confirm?
When to barge in / wait?
Which type of feedback to provide? (e.g., intelligent tutoring systems)

Sample-efficient policy search:
The policy space is too huge to search with traditional ways of SDS development

System-Initiative vs Mixed-Initiative

System-initiative:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: From CMU, did I get that right?
U2: Yes.
S3: Where are you going?
U3: Downtown.
S4: To Downtown, did I get that right?
U4: Yes.

Mixed-initiative:
S1: Welcome to CMU Let's Go. How may I help you?
U1: I'd like to go from CMU to Downtown.
S2: From CMU to Downtown, did I get that right?
U2: Yes.
S3: When are you going to take the bus?
U3: Now.
S4: You want the next bus, is that right?
U4: Yes.
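The system-initiative column is essentially a fixed finite-state script. A sketch (user answers simulated, confirmation turns omitted; in a real system each answer comes from ASR + SLU and is confirmed):

```python
# The system-initiative flow as a fixed finite-state script.
# Prompts follow the Let's Go example; user answers are simulated.
SCRIPT = [
    ("origin", "Where do you leave from?"),
    ("destination", "Where are you going?"),
    ("time", "When are you going to take the bus?"),
]

def run_system_initiative(simulated_answers):
    """Walk the script in order; the system always holds the initiative."""
    frame = {}
    for slot, prompt in SCRIPT:
        print("S:", prompt)
        answer = simulated_answers[slot]   # in reality: ASR + SLU + confirm
        print("U:", answer)
        frame[slot] = answer
    return frame

frame = run_system_initiative(
    {"origin": "CMU", "destination": "Downtown", "time": "now"})
```

Mixed initiative cannot be written this way: the user's first turn may fill any subset of the slots, so the frame-monitoring approach shown earlier is needed instead.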

Confirm Each vs Confirm All

Confirm each:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: From CMU, did I get that right?
U2: Yes.
S3: Where are you going?
U3: Downtown.
S4: To Downtown, did I get that right?
U4: Yes.

Confirm all:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: Where are you going?
U2: Downtown.
S3: When are you going to take the bus?
U3: Now.
S4: Leaving from CMU going to Downtown immediately, is that correct?

Explicit Confirm vs Implicit Confirm

Explicit:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: From CMU, did I get that right?
U2: Yes.
S3: Where are you going?
U3: Downtown.
S4: To Downtown, did I get that right?
U4: Yes.

Implicit:
S1: Welcome to CMU Let's Go. Where do you leave from?
U1: CMU
S2: Departing CMU. Where are you going?
U2: Downtown.
S3: Going to Downtown. When are you going to take the bus?
U3: Now.

Corrective Feedback

S1: Welcome to CMU Let's Go. How may I help you?
U1: I likes go CMU to Downtown.
S2: You must not use a third-person-singular verb after a first-person-singular pronoun. [explicit correction]

S1: Welcome to CMU Let's Go. How may I help you?
U1: I likes go CMU to Downtown.
S2: You said "I like to go from CMU to Downtown." Did I get that right? [recast]

Traditional Policy Design

Traditional SDS development paradigm:
Several versions of a system are developed (each version uses a single dialog policy, intuitively designed by an expert)
Dialog corpora are collected with human users interacting with different versions of the system
A number of evaluation metrics are measured for each dialog
The different versions are statistically compared
The system is updated with the "best" dialog policy

Due to the costs of experimentation, only a handful of policies are usually explored in any one experiment

Problems with the Traditional Policy

The number of possible policies is much larger than can be explored
Limited data

Modeling

The agent updates its state by observing the environment
Given the state, the agent performs the best action to achieve its goal
Need to learn the mapping h from s ∈ S to a ∈ A

Learning

Supervised learning:
For each input state s, the optimal action a is known
Infer the input-output relationship a ≈ h(s)
Examples: neural networks, support vector machines

Markov Decision Process

No data:
Human-human dialog differs from human-machine dialog in perception
Wizard-of-Oz is costly, and the wizard is hard to train

A statistical method to formulate the question:
An MDP computes an optimal dialog policy within a much larger search space, using a relatively small number of training dialogs
An MDP evaluates actions (small number), not policies (large number), by credit assignment of the total reward to each action using dynamic programming

Markov Decision Process

Learning goal: dialog management as an MDP

Model dialog with:
State
Actions
Transition functions
Reward function
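Given these four ingredients, a policy can be learned with tabular Q-learning, which performs exactly the per-action credit assignment described above. The toy dialog MDP below (two slots to fill, then confirm) is invented for illustration:

```python
import random

# Toy dialog MDP: states = number of filled slots (0..2); two actions.
# Transition and reward models are invented for illustration.
ACTIONS = ["ask", "confirm"]

def step(state, action):
    """Environment model: (next_state, reward)."""
    if action == "ask" and state < 2:
        return state + 1, -1          # slot filled, small per-turn cost
    if action == "confirm" and state == 2:
        return 2, +10                 # task completed successfully
    return state, -1                  # wasted turn

def q_learning(episodes=2000, alpha=0.1, gamma=0.9, eps=0.2):
    random.seed(0)
    Q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        for _ in range(10):           # finite dialog horizon
            # epsilon-greedy: mostly exploit, sometimes explore
            a = random.choice(ACTIONS) if random.random() < eps else \
                max(ACTIONS, key=lambda act: Q[(s, act)])
            s2, r = step(s, a)
            # credit assignment of reward to this single action
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                                  - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)}
print(policy)  # learned: ask until both slots are filled, then confirm
```

Note that learning evaluates only 6 state-action pairs here, even though the number of distinct 10-step policies is far larger; that is the search-space reduction the slide describes.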

Example: Non-Task-Oriented System Architecture

[Architecture diagram: User Input → Context Tracking → Response Generation → System Output; a response-generation method is selected by generation confidence (high vs. low), and Conversation Strategy Selection applies lexical-semantic strategies and engagement-appropriateness strategies depending on whether their trigger conditions are met]

Active Participation

Start conversations with a concrete topic
e.g., TickTock: "Hi, I am TickTock, I really like sports, let's talk about sports."

Design each strategy with respect to a topic:
Stay on the topic
Change the topic

Stay on the Topic

S1: Tell a joke (joke)
e.g., "Do you know that people usually spend far more time watching sports than actually playing any?"

S2: Initiate activities (initiate)
e.g., "Do you want to see a game together some time?"

S3: Talk more (more)
e.g., "Let's talk more about sports."

Change the Topic

S4: Switch topics (switch)
e.g., "How about we talk about movies?"

S5: End topics with an open question (end)
e.g., "That's interesting, could you share with me some interesting news on the Internet?"

A Failure Conversation Case

TickTock: Hello, I really like politics. Let's talk about politics.
User: No, I don't like politics.
TickTock: Why is that? [not appropriate!]
User: I just don't like politics.
TickTock: How about we watch a debate together some day? (initiate activity) [not appropriate!]
User: I told you I don't want to talk about politics!

Sentiment Context

TickTock: Hello, I really like politics. Let's talk about politics. [neutral]
User: No, I don't like politics. [negative]
TickTock: Why is that? [neutral]
User: I just don't like politics. [negative]
TickTock: OK, how about we talk about movies? (switch topics)
User: Sure. I do watch movies a lot. [neutral]

Appropriate!

Engagement Maintenance Policy

Goal: improve system appropriateness considering context.
Method: reinforcement learning (Q-learning)

State variables (S):
System-appropriateness confidence
All previous utterance-sentiment confidences
Times each strategy was executed
Turn position
Most recently used strategy

Actions (A): engagement strategies (5 types) and the generated utterance

Model Detail

Reward function (R): cumulated appropriateness × 0.3 + conversation depth × 0.3 + information gain × 0.4

Appropriateness: the current response's coherence with the user utterance.
Automatic predictor: SVM binary classifier (inappropriate/interpretable vs. appropriate)

Conversation depth: the maximum number of consecutive utterances on the same topic.
Automatic predictor: SVM binary classifier (shallow/intermediate vs. deep)

Information gain: the number of unique tokens.

Update: Simulator: A.L.I.C.E. chatbot. Interface: multithreaded text-only API (open source)
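The reward combination can be written down directly. The component scores are assumed here to be pre-normalized to [0, 1]; in the actual system they come from the SVM predictors and the unique-token count:

```python
def engagement_reward(appropriateness, depth, info_gain):
    """Reward from the slide: 0.3*appropriateness + 0.3*depth + 0.4*info gain.
    Inputs are assumed already normalized to [0, 1] (an assumption for this
    sketch; the real predictors are SVM classifiers and a token count)."""
    return 0.3 * appropriateness + 0.3 * depth + 0.4 * info_gain

print(engagement_reward(1.0, 0.5, 0.25))  # 0.55 up to float rounding
```

Weighting information gain highest (0.4) biases the learned policy toward responses that move the conversation forward rather than merely staying coherent.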

Language Generation

Query for flights to Boston
Template fill answer(s):
"The next flight to DEST leaves at DEPART_TIME arriving at ARRIVE_TIME."

Templates may be much more complex
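The template fill itself is a one-liner over the frame; the arrival time below is illustrative:

```python
# Template-based generation sketch: fill slot values into a canned string.
TEMPLATE = ("The next flight to {DEST} leaves at {DEPART_TIME} "
            "arriving at {ARRIVE_TIME}.")

def fill(template, frame):
    """Substitute each {SLOT} placeholder with the frame's value."""
    return template.format(**frame)

print(fill(TEMPLATE, {"DEST": "Boston",
                      "DEPART_TIME": "07:45",
                      "ARRIVE_TIME": "09:10"}))
# The next flight to Boston leaves at 07:45 arriving at 09:10.
```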

Language Generation

Choose which template to use:
Based on state, answer type
Natural variation
Statistical variation

Include <ssml> tags to help synthesis:
Can <emph>emphasize</emph> parts
Can identify dates, numbers, etc.

Humans like variation in the output
It is rare for a human to repeat things exactly

Language Generation

Frame structures to (marked-up) text:
START: Pittsburgh
END: Boston
DATE: 20081028
TIME: 07:45
FLIGHT: US075

Can generate:
"I have US 075 leaving at 07:45 tomorrow"
"US Airways has a flight departing tomorrow at 07:45"

Standardized Things

Help:
User should be able to get help at any time
Explain where they are and what they are expected to say (with explicit examples)

Errors:
"I didn't understand" …

Confirmation:
Did you say "Boston"?

Designing Prompts

Constrain your questions:
"How may I help you?" → invites a long story in reply
"What bus number would you like schedules for?" → expects bus-number replies