speech processing 11-492/18-492
TRANSCRIPT
Spoken Dialog SystemsSpoken Dialog Systems
More than just ASR and TTSMore than just ASR and TTS RecognitionRecognition Language understandingLanguage understanding Manipulation of utterancesManipulation of utterances Generation of new informationGeneration of new information Text generationText generation SynthesisSynthesis
SDS ArchitectureSDS Architecture
Language Generation
ASRLanguage
Understanding
Synthesis
Dialog Manager
Error Handling Strategies
Non Understanding
SDS InternalsSDS Internals
Language UnderstandingLanguage Understanding From words to structureFrom words to structure
Dialog ManagerDialog Manager State of dialog (who is talking)State of dialog (who is talking) Direction of dialog (what next)Direction of dialog (what next) References, user profile etcReferences, user profile etc Interaction of database/internetInteraction of database/internet
Language GenerationLanguage Generation From structure to wordsFrom structure to words
Language UnderstandingLanguage Understanding
Parsing of SPEECH not TEXTParsing of SPEECH not TEXT Eh, I wanna go, wanna go to Boston tomorrowEh, I wanna go, wanna go to Boston tomorrow If its not too much trouble I’d be very grateful if If its not too much trouble I’d be very grateful if
one might be able to aid me in arranging my one might be able to aid me in arranging my travel arrangements to Boston, Logan airport, at travel arrangements to Boston, Logan airport, at sometime tomorrow morning, thank you.sometime tomorrow morning, thank you.
Boston, tomorrowBoston, tomorrow
Parsing: Output structureParsing: Output structure
““I wanna go to Boston, tomorrow”I wanna go to Boston, tomorrow” Destination: BOSDestination: BOS Departure: 20081028, AMDeparture: 20081028, AM Airline: unspecifiedAirline: unspecified Special: unspecifiedSpecial: unspecified
Convert speech to structureConvert speech to structure Sufficient for further processing/querySufficient for further processing/query
Interaction ExampleInteraction Example
Intelligent Agent
Cheap Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose? I can help you go there.
find a cheap eating place for taiwanese food
User
SDS ProcessSDS Process
find a cheap eating place for taiwanese food
User
target
foodpriceAMOD
NN
seeking
PREP_FORIntelligent
Agent
SDS ProcessSDS Process
User
target
foodpriceAMOD
NN
seeking
PREP_FOR
Organized Domain Knowledge
Intelligent Agent
Ontology Induction
(semantic slot)
find a cheap eating place for taiwanese food
SDS ProcessSDS Process
User
target
foodpriceAMOD
NN
seeking
PREP_FOR
Organized Domain Knowledge
Intelligent Agent
Ontology Induction
(semantic slot)
Structure Learning
(inter-slot relation)
find a cheap eating place for taiwanese food
SDS ProcessSDS Process
User
target
foodpriceAMOD
NN
seeking
PREP_FORIntelligent
Agent
seeking=“find”target=“eating place”price=“cheap”food=“taiwanese”
find a cheap eating place for taiwanese food
find a cheap eating place for taiwanese food
SDS ProcessSDS Process
User
target
foodpriceAMOD
NN
seeking
PREP_FORIntelligent
Agent
seeking=“find”target=“eating place”price=“cheap”food=“taiwanese”
Semantic Decoding
Automatic Slot InductionAutomatic Slot Induction
Chen et al. ASRU’13Chen et al. ASRU’13
can i have a cheap restaurant
Frame: capability
Frame: expensiveness
Frame: locale by use
Domain
DomainGeneral
slot candidate
15
can i have a cheap restaurant
Parsing vs Language ModelParsing vs Language Model
Language ModelLanguage Model Model what actually gets saidModel what actually gets said
Parsing Parsing Extract the information you wantExtract the information you want
Models *can* be sharedModels *can* be shared Only accept things in the grammarOnly accept things in the grammar Can be over limitingCan be over limiting
Neural Networks for SLUNeural Networks for SLU
RNN for Slot FillingRNN for Slot Filling
Step 1: word embedding Step 1: word embedding Step 2: short-term dependencies capturingStep 2: short-term dependencies capturing Step 3: long-term dependencies capturingStep 3: long-term dependencies capturing Step 4: different types of neural architectureStep 4: different types of neural architecture
http://deeplearning.net/tutorial/rnnslu.html#rnnsluhttp://deeplearning.net/tutorial/rnnslu.html#rnnslu
Mesnil et al. 2013
Interactive Learning for SLUInteractive Learning for SLU
Luis : Interactive machine learning for Luis : Interactive machine learning for language understandinglanguage understanding
Advantages: Advantages: Non-expert could add in knowledge in Non-expert could add in knowledge in feature engineeringfeature engineeringActive-learning reduces heavy labelingActive-learning reduces heavy labeling
https://www.luis.ai/ https://www.luis.ai/
Williams et al. 2016
Dialog ManagerDialog Manager
Maintain stateMaintain state Where are we in the dialogWhere are we in the dialog Whose turn is itWhose turn is it
Waiting for speakerWaiting for speaker Waiting for database query (stall user)Waiting for database query (stall user)
Deal with barge-inDeal with barge-in
Frame Based Dialog MangerFrame Based Dialog Manger
Used for transaction dialogUsed for transaction dialog Generalizes finite-state approach by allowing Generalizes finite-state approach by allowing
multiple paths to acquire info multiple paths to acquire info Central data structure is frame with slotsCentral data structure is frame with slots
• DM is monitoring frame, filling in slots DM is monitoring frame, filling in slots Frame: Frame:
Set of information needed Set of information needed Context for utterance interpretation Context for utterance interpretation Context for dialogue progress Context for dialogue progress
Allows mixed initiative Allows mixed initiative Allows over-answering Allows over-answering
Also called form-based (MIT): Often called “slot-filling” Also called form-based (MIT): Often called “slot-filling”
Problems with Frames Problems with Frames
Not easily applicable to complex tasks Not easily applicable to complex tasks May not be a single frame May not be a single frame Dynamic construction of information Dynamic construction of information User access to “product”User access to “product”
Agenda + FrameAgenda + Frame
Product: Product: hierarchical composition of frames hierarchical composition of frames
Process: Process: Agenda Agenda
Generalization of stack Generalization of stack Ordered list of topics Ordered list of topics List of handlersList of handlers
Statistical Approaches to DMStatistical Approaches to DM
Allow for dialog complexity beyond human Allow for dialog complexity beyond human mind mind
Find optimal decision for non-trivial design Find optimal decision for non-trivial design problems problems
Life-long learningLife-long learning
DecisionsDecisions
Difficult design decision over the course of Difficult design decision over the course of interaction interaction When to ask open / directive questions? When to ask open / directive questions? When to confirm? When to confirm? When to barge-in / wait? When to barge-in / wait? Which type of feedback to provide? (ex, Which type of feedback to provide? (ex,
intelligent tutoring system) intelligent tutoring system) Sample efficient policy search Sample efficient policy search
Policy space is too huge to search with Policy space is too huge to search with traditional ways of SDS developmenttraditional ways of SDS development
System-Initiative VS Mixed-InitiativeSystem-Initiative VS Mixed-Initiative
S1: Welcome to CMU Let’s Go. S1: Welcome to CMU Let’s Go. Where do you leave from? Where do you leave from?
U1: CMU U1: CMU
S2: From CMU, did I get that right? S2: From CMU, did I get that right?
U2: Yes. U2: Yes.
S3:S3: Where are you going? Where are you going?
U3: Downtown.U3: Downtown.
S4: To Downtown, did I get that right? S4: To Downtown, did I get that right?
U4: Yes.U4: Yes.
S1: Welcome to CMU Let’s Go. S1: Welcome to CMU Let’s Go. How may I help you? How may I help you?
U1: I'd like to go from CMU to Downtown. U1: I'd like to go from CMU to Downtown.
S2: From CMU to Downtown, did I get that right? S2: From CMU to Downtown, did I get that right?
U2: Yes. U2: Yes.
S3S3: When are you going to take the bus? : When are you going to take the bus?
U3: Now U3: Now
S3: You want the next bus, is that right? S3: You want the next bus, is that right?
U3: Yes.U3: Yes.
Confirm Each VS Confirm AllConfirm Each VS Confirm All
S1: Welcome to CMU Let’s Go. Where do you leave from? S1: Welcome to CMU Let’s Go. Where do you leave from?
U1: CMU U1: CMU
S2S2: From CMU, did I get that right? : From CMU, did I get that right?
U2: Yes. U2: Yes.
S3:S3: Where are you going? Where are you going?
U3: Downtown.U3: Downtown.
S4: To Downtown, did I get that right? S4: To Downtown, did I get that right?
U4: Yes.U4: Yes.
S1: Welcome to CMU Let’s Go. Where do you leave from? S1: Welcome to CMU Let’s Go. Where do you leave from?
U1: CMU • S2: Where are you going? U1: CMU • S2: Where are you going?
U2: Downtown. U2: Downtown.
S3: When are you going to take bus? S3: When are you going to take bus?
U3: Now U3: Now
S4: S4: Leaving from CMU going to Downtown immediately, is it correct?Leaving from CMU going to Downtown immediately, is it correct?
Explicit Confirm VS Implicit ConfirmExplicit Confirm VS Implicit Confirm
S1: Welcome to CMU Let’s Go. Where do you leave from? S1: Welcome to CMU Let’s Go. Where do you leave from?
U1: CMU U1: CMU
S2: S2: From CMU, did I get that right? From CMU, did I get that right?
U2: Yes. U2: Yes.
S3:S3: Where are you going? Where are you going?
U3: Downtown.U3: Downtown.
S4: S4: To Downtown, did I get that right? To Downtown, did I get that right?
U4: Yes.U4: Yes.
S1: Welcome to CMU Let’s Go. Where do you leave from? S1: Welcome to CMU Let’s Go. Where do you leave from?
U1: CMU U1: CMU
S2: S2: Departing CMUDeparting CMU. Where are you going? . Where are you going?
U2: Downtown. U2: Downtown.
S3: S3: Going to Downtown. Going to Downtown. When are you going to take bus? When are you going to take bus?
U3: NowU3: Now
Corrective FeedbackCorrective Feedback
S1: Welcome to CMU Let’s Go. How may I help you? S1: Welcome to CMU Let’s Go. How may I help you?
U1: I likes go CMU to Downtown. U1: I likes go CMU to Downtown.
S2: S2: You must not use third person singular verb after first person You must not use third person singular verb after first person singular pronoun [explicit correction]singular pronoun [explicit correction]
S1: Welcome to CMU Let’s Go. How may I help you? S1: Welcome to CMU Let’s Go. How may I help you?
U1: I likes go CMU to Downtown. U1: I likes go CMU to Downtown.
S2: S2: You said I like to go from CMU to Downtown. Did I get that right? You said I like to go from CMU to Downtown. Did I get that right? [recast][recast]
Traditional Policy DesignTraditional Policy Design
Traditional SDS development paradigm Traditional SDS development paradigm Several versions of a system are developed (where each version Several versions of a system are developed (where each version
uses a single dialog policy, intuitively designed by an expert) uses a single dialog policy, intuitively designed by an expert) Dialog corpora are collected with human users interacting with Dialog corpora are collected with human users interacting with
different versions of the system different versions of the system A number of evaluation metrics are measured for each dialog A number of evaluation metrics are measured for each dialog The different versions are statistically compared The different versions are statistically compared Update the system with “best” dialog policy Update the system with “best” dialog policy
Due to the costs of experimentation, only a handful of Due to the costs of experimentation, only a handful of policies are usually explored in any one experimentpolicies are usually explored in any one experiment
Problem of Traditional PolicyProblem of Traditional Policy
The number of possible policies is much larger
Limited Data
ModelingModeling
Agent updates state by observing environment Given the state, agent performs the best action to achieve its goal Need to learn the mapping h from s S to a A∈ ∈
LearningLearning
Supervised learning Supervised learning For each input state s, optimal action a is known For each input state s, optimal action a is known Infer input-output relationship a ≈ h(s) Infer input-output relationship a ≈ h(s) Example: neural networks, support vector Example: neural networks, support vector
machinemachine
Markov Decision ProcessMarkov Decision Process
No data:No data: Human-Human is differences between human Human-Human is differences between human
and machine perceptionand machine perception Wizard of Oz is costly and wizard is hard to Wizard of Oz is costly and wizard is hard to
train.train.Statistical method to formulate the questionStatistical method to formulate the question
MDP computes an optimal dialog policy within a much larger search MDP computes an optimal dialog policy within a much larger search space, using a relatively small number of training dialogspace, using a relatively small number of training dialog
MDP evaluates actions (small number) not policies (large number) by MDP evaluates actions (small number) not policies (large number) by credit assignment of total reward to each action using dynamic credit assignment of total reward to each action using dynamic programmingprogramming
Example: Non-Task-Oriented System Example: Non-Task-Oriented System ArchitectureArchitecture
Lexical Semantic Strategies
Engagement Appropriateness Strategies
User Input
Context Tracking
Strategies
System Output
High GenerationConfidence
Low GenerationConfidence
Trigger Condition Not Meet
Trigger Condition Meet
ResponseGeneration
Method
Response Generation
Conversation Strategy Selection
Active ParticipationActive Participation
Start conversations with a concrete topic Start conversations with a concrete topic
e.g. TickTock: Hi, I am TickTock, I really like sports, e.g. TickTock: Hi, I am TickTock, I really like sports, let’s talk about sports.let’s talk about sports.
Design each strategies with respect to a topicDesign each strategies with respect to a topic
Stay on the topicStay on the topic
Change the topicChange the topic
43
Stay on the Topic:Stay on the Topic:
S1:Tell a joke (joke)S1:Tell a joke (joke)
e.g. Do you know that people usually spend far e.g. Do you know that people usually spend far more time watching sports than actually playing any.more time watching sports than actually playing any.
S2: Initiate activities (initiate)S2: Initiate activities (initiate)
e.g. Do you want to see a game together some e.g. Do you want to see a game together some time?time?
S3:Talk more (more):S3:Talk more (more):
e.g. Let’s talk more about sports.e.g. Let’s talk more about sports.
44
Change the TopicChange the Topic
S4: Switch topics (switch): S4: Switch topics (switch):
e.g. How about we talk about movies?e.g. How about we talk about movies?
S5: End topics with an open question (end): S5: End topics with an open question (end):
e.g. That’s interesting, could you share with me e.g. That’s interesting, could you share with me some interesting news on the Internet?some interesting news on the Internet?
45
A Failure Conversation CaseA Failure Conversation Case
TickTock: Hello, I really like politics. Let’s talk about TickTock: Hello, I really like politics. Let’s talk about politics.politics.
UserUser: : No, I don’t like politics. No, I don’t like politics.
TickTock: Why is that? TickTock: Why is that?
User: User: I just don’t like politics. I just don’t like politics.
TickTock: How about we watch a debate together TickTock: How about we watch a debate together some day? (initiate activity) some day? (initiate activity)
UserUser: : I told you I don’t want to talk about I told you I don’t want to talk about politics! politics!
46
Not Appropriate!
Not Appropriate!
Sentiment ContextSentiment Context
TickTock: Hello, I really like politics. Let’s talk about TickTock: Hello, I really like politics. Let’s talk about politics. politics. neutralneutral
User: User: No, I don’t like politics. No, I don’t like politics. negativenegative
TickTock: Why is that? TickTock: Why is that? neutralneutral
User: User: I just don’t like politics. I just don’t like politics. negativenegative
TickTock: OK, how about we talk about movies? TickTock: OK, how about we talk about movies? (switch topics)(switch topics)
User:User: Sure. I do watch movies a lot. Sure. I do watch movies a lot. neutralneutral
47
Appropriate!Appropriate!
Engagement Maintain PolicyEngagement Maintain PolicyGoal: Improve system Goal: Improve system appropriatenessappropriateness considering considering context.context.
Method: Reinforcement learningMethod: Reinforcement learning
Q Learning : Q Learning :
State Variable (S):State Variable (S): system-appropriateness confidencesystem-appropriateness confidence
all previous utterance-sentiment confidenceall previous utterance-sentiment confidence
time of each strategy executedtime of each strategy executed
turn positionturn position
most recently used strategymost recently used strategy
Actions (A) : engagement strategies (5 types) and Actions (A) : engagement strategies (5 types) and generated utterancegenerated utterance
48
Model DetailModel Detail
Reward function(R): Cumulated Appropriateness *0.3 + Reward function(R): Cumulated Appropriateness *0.3 + Conversation depth*0.3 + Information gain *0.4Conversation depth*0.3 + Information gain *0.4
AppropriatenessAppropriateness : the current response’s coherence with the user : the current response’s coherence with the user utterance.utterance.
Automatic predictor: SVM binary classifier (Inappropriate, Automatic predictor: SVM binary classifier (Inappropriate, interpretable VS Appropriate)interpretable VS Appropriate)
Conversation depthConversation depth: the maximum number of consecutive utterances : the maximum number of consecutive utterances on the same topic.on the same topic.
Automatic predictor: SVM binary classifier (Shallow, Intermediate VS Automatic predictor: SVM binary classifier (Shallow, Intermediate VS Deep)Deep)
Information gainInformation gain: the number of unique tokens.: the number of unique tokens.
Update: Update: Simulator: A.L.I.C.E. chatbot. Interface: Multithreaded text-only API Simulator: A.L.I.C.E. chatbot. Interface: Multithreaded text-only API (open source)(open source)
49
Language GenerationLanguage Generation
Query for flights to BostonQuery for flights to Boston Template fill answer(s)Template fill answer(s)
The next flight to DEST leaves at The next flight to DEST leaves at DEPART_TIME arriving at ARRIVE_TIME.DEPART_TIME arriving at ARRIVE_TIME.
Templates may be much more complexTemplates may be much more complex
Language GenerationLanguage Generation
Choose which template to useChoose which template to use Based on state, answer typeBased on state, answer type Natural variationNatural variation Statistical variationStatistical variation
Include <ssml> tags to help synthesisInclude <ssml> tags to help synthesis Can <emph>emphasize</emph> partsCan <emph>emphasize</emph> parts Can identify dates, numbers etc.Can identify dates, numbers etc.
Humans like variation in the outputHumans like variation in the output It is rare for a human to repeat things exactlyIt is rare for a human to repeat things exactly
Language GenerationLanguage Generation
Frames structures to (marked up) textFrames structures to (marked up) text START: PittsburghSTART: Pittsburgh END: BostonEND: Boston DATE: 20081028DATE: 20081028 TIME: 07:45TIME: 07:45 FLIGHT: US075FLIGHT: US075
Can generateCan generate I have US 075 leaving at 07:45 tomorrowI have US 075 leaving at 07:45 tomorrow US Airways has a flight departing tomorrow at 07:45US Airways has a flight departing tomorrow at 07:45
Standardized thingsStandardized things
HelpHelp User should be able to get help at any timeUser should be able to get help at any time Explain where they are and what they are Explain where they are and what they are
expected to say (with explicit examples)expected to say (with explicit examples) ErrorsErrors
““I didn’t understand” …I didn’t understand” … ConfirmationConfirmation
Did you say “Boston”?Did you say “Boston”?
Designing PromptsDesigning Prompts
Constrain your questions:Constrain your questions: How may I help you?How may I help you?
Long story replyLong story reply What bus number would like schedules for?What bus number would like schedules for?
Expect bus number repliesExpect bus number replies