recognition of dialogue acts in meetings
DESCRIPTION
Recognition of Dialogue Acts in Meetings. Alfred Dielmann – [email protected], Steve Renals – [email protected]. Centre for Speech Technology Research, University of Edinburgh. Agenda: Introduction; Meeting data corpus; System overview; Feature set; Factored Language Models.
TRANSCRIPT
Recognition of Dialogue Acts in Meetings
Alfred Dielmann – [email protected]
Steve Renals – [email protected]
Centre for Speech Technology Research, University of Edinburgh
www.amiproject.org – A. Dielmann, 2 May 2006
Agenda
• Introduction
• Meeting data corpus
• System overview
• Feature set
• Factored Language Models
• DBN infrastructure
• Experimental results
• Conclusions and future directions
Goal
[Diagram] Goal: automatic meeting structuring, via meeting phase detection and Dialogue Act recognition, feeding applications such as summarisation, topic detection and tracking, and language models for automatic speech recognition.
Dialogue Acts
• Building blocks of a conversation
• "Dialogue Acts reflect the functions that utterances serve in a discourse" [Ji et al. 05]
• Several DA coding schemes can be defined:
  • Targeted at different conversational aspects
  • Characterised by multiple hierarchical levels
  • Different numbers of DA labels
ICSI meeting corpus (1)
• Naturally occurring meetings
  • Unconstrained human-to-human interactions
  • Unconstrained recording conditions
• 75 meetings of 4-10 participants (average 6 participants)
• 72 hours of multi-channel audio data
  • Head-mounted microphones
  • 4 tabletop microphones
• Fully transcribed
• Annotated in terms of Dialogue Acts
  • MRDA scheme: 11 generic tags + 39 specific sub-tags
  • More than 2000 unique DA labels
  • Mappings from MRDA tags to reduced tag sets
ICSI meeting corpus (2)
• Five broad DA categories (obtained by manually grouping MRDA tags):
  • Statements, Questions, Back-channels, Fillers, Disruptions
• Imbalanced distribution across DA categories
• Same data-set subdivision as in Ang et al. [2005]:
  • Training (51 meetings) / Development (11) / Test (11)

DA distribution (number of DA units): Statements 59%, Disruptions 13%, Back-channels 12%, Fillers 10%, Questions 6%
DA distribution (temporal duration or number of words): Statements 74%, Disruptions 10%, Fillers 9%, Questions 6%, Back-channels 1%
Methodology
Task definition:
• Joint approach: DA segmentation and classification executed concurrently as a single step, instead of sequentially
Generative approach:
• Observable sequences of words Wt (sentences) and features Yt are generated by hidden Dialogue Acts
System building blocks:
• Trainable statistical model (dynamic Bayesian network)
• Factored language model (maps word sequences into DA units)
• Feature extraction component (DA segmentation)
• Discourse model: trigram language model over DA label sequences
[Figure: a word sequence w1 … w16 segmented at Dialogue Act unit boundaries into units DA1, DA2, DA3]
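The discourse model mentioned above is a trigram language model over DA label sequences. A minimal sketch of how such a model could be trained and queried; the add-one smoothing here is an illustrative assumption, since the slides do not state which smoothing was used:

```python
from collections import defaultdict

LABELS = ["Statement", "Question", "Backchannel", "Filler", "Disruption"]

def train_trigram(sequences):
    """Count DA label trigrams (and their bigram contexts) over
    the training meetings' label sequences."""
    tri, bi = defaultdict(int), defaultdict(int)
    for seq in sequences:
        padded = ["<s>", "<s>"] + seq
        for i in range(2, len(padded)):
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            bi[(padded[i - 2], padded[i - 1])] += 1
    return tri, bi

def prob(tri, bi, prev2, prev1, label, vocab=len(LABELS)):
    """Add-one smoothed trigram probability p(label | prev2, prev1)."""
    return (tri[(prev2, prev1, label)] + 1) / (bi[(prev2, prev1)] + vocab)
```

With this shape, the probability of the next DA label given the two previous ones can be plugged into the DA unit transition process described later.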
System overview
[Diagram] Multi-channel audio recordings feed two paths: feature extraction, and transcriptions (manual or produced by Automatic Speech Recognition). The transcriptions drive a Factored Language Model; DA labels are used to train the Discourse Model. A Dynamic Bayesian Network infrastructure performs joint DA segmentation and tagging, producing DA segment boundaries + DA labels.
ASR transcription
• Goal: estimate the degradation of DA recognition performance caused by an imperfect transcription
• Word-level transcription kindly provided by the AMI ASR team
  • Baseline system developed in early 2005
  • Based on a PLP front-end, decision-tree clustered cross-word triphone models and an interpolated bigram LM
  • Trained on ICSI meetings
  • Covers the whole ICSI corpus through a 4-fold cross-validation
  • About 29% Word Error Rate (ideal for our task)
Features
• 6 continuous word-related features:
  • F0 mean and variance: normalised against baseline pitch
  • RMS energy: normalised by channel and by typical word energy
  • Word length: normalised by typical duration
  • Word relevance: local term frequency / absolute term frequency
  • Pause duration: re-scaled inter-word pauses
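The "word relevance" feature is the ratio of a word's local term frequency to its absolute (corpus-wide) term frequency. A small sketch of that ratio; grouping the local counts at the meeting level is an assumption, as the slides do not specify the exact normalisation window:

```python
from collections import Counter

def word_relevance(meeting_words, corpus_counts, total_corpus_words):
    """Relevance of each word in a meeting: local term frequency
    divided by absolute term frequency, as listed on the slide.
    Words frequent here but rare corpus-wide score high."""
    local = Counter(meeting_words)
    n_local = len(meeting_words)
    relevance = {}
    for word, count in local.items():
        local_tf = count / n_local
        abs_tf = corpus_counts[word] / total_corpus_words
        relevance[word] = local_tf / abs_tf
    return relevance
```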
Factored Language Models
Generalised language models in which words and word-related features (factors) are bundled together.
What can be a factor? Nearly everything: the words themselves, stems, Part-Of-Speech tags, relative position in the sentence, morphological classes, DAs, …
Goal: factorise the joint probability associated with a sentence in terms of factor-related conditional probabilities:

p(w_{1:T}) = ∏_t p(w_t | f_t^1, …, f_t^k, w_{t-1}, f_{t-1}^1, …, f_{t-1}^k, …, w_{t-n}, f_{t-n}^1, …, f_{t-n}^k)
Factored Language Models

DA tagging accuracies (% correct) of the FLM p(w_t | w_{t-1}, n_t, d_t), for the FLM comparison task, estimated by integrating a simple decoder into the SRILM toolkit:

Trained on:   Reference   Reference   ASR
Tested on:    Reference   ASR         ASR
              Dev  Test   Dev  Test   Dev  Test
              69.7 70.9   61.6 61.9   63.4 63.6

Decoder: d̂ = argmax_{d ∈ D} ∏_t p(w_t | w_{t-1}, n_t, d), with D = {Statements, Questions, Fillers, …}
The estimated label d̂ is scored against the reference DA label d (% Correct).
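The decoding rule above picks, for each DA unit, the label that maximises the FLM sentence probability. A log-domain sketch; the `cond_logprob` callback is hypothetical and stands in for a trained FLM queried, e.g., through the SRILM toolkit:

```python
import math

def tag_da(words, labels, cond_logprob):
    """Pick the DA label d maximising prod_t p(w_t | w_{t-1}, n_t, d),
    computed in the log domain. `cond_logprob(w, w_prev, n, d)` is an
    assumed interface to the trained factored language model."""
    best_label, best_score = None, -math.inf
    for d in labels:
        score, prev = 0.0, "<s>"
        for n, w in enumerate(words):
            score += cond_logprob(w, prev, n, d)  # log p(w_t | w_{t-1}, n_t, d)
            prev = w
        if score > best_score:
            best_label, best_score = d, score
    return best_label
```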
Factored Language Models

The relationship between words and DAs has been modelled using a 3-factor FLM with Kneser-Ney discounting:

FLM                                    1st Backoff                    2nd Backoff     Dev   Test
p(w_t | w_{t-1}, n_t, d_t)             p(w_t | n_t, d_t)              p(w_t | n_t)    69.7  70.9
p(w_t | w_{t-1}, p_t, d_t)             p(w_t | p_t, d_t)              p(w_t | p_t)    61.7  63.5
p(w_t | w_{t-1}, m_t, d_t)             p(w_t | m_t, d_t)              p(w_t | m_t)    68.2  68.8
p(w_t | w_{t-1}, n_t, p_t, m_t, d_t)   p(w_t | n_t, p_t, m_t, d_t)    …               67.7  68.2

Factors: w_t word; n_t relative word position (…); d_t dialogue act label (hidden); p_t part-of-speech label; m_t meeting type; s_t word stem; …
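Each FLM row in the table comes with a generalised backoff path. A sketch of the lookup order for the first row, assuming the (Kneser-Ney discounted) probabilities have already been estimated; the dict-based tables are illustrative, not SRILM's actual data structures:

```python
def flm_prob(w, w_prev, n, d, full, bo1, bo2, floor=1e-6):
    """Backoff chain from the table's first row:
    p(w | w_{t-1}, n_t, d_t) -> p(w | n_t, d_t) -> p(w | n_t).
    Only the lookup order is modelled here; the discounting and
    backoff weights are assumed to be folded into the tables."""
    if (w, w_prev, n, d) in full:      # full 3-factor context
        return full[(w, w_prev, n, d)]
    if (w, n, d) in bo1:               # 1st backoff: drop w_{t-1}
        return bo1[(w, n, d)]
    return bo2.get((w, n), floor)      # 2nd backoff: drop d_t
```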
DBNs
• Bayesian Networks are directed probabilistic graphical models:
  • Nodes represent random variables
  • Directed arcs represent conditional (in-)dependences among variables
• DBNs = extension of BNs to process data sequences or time-series:
  • Instantiating a static BN for each temporal slice t
  • Making temporal dependences between variables explicit
• Switching DBNs (Bayesian multi-nets) are adaptive DBNs able to change their internal topology according to the state of one or more variables (switching nodes)
• HMMs, Kalman filter models, and many other state-space models can be represented under the same formalism
Generative DBN model (1)
• Implemented through a switching DBN model:
  • Switching variable: DA boundary detector node Et
  • 2 operative conditions, hence 2 switching model topologies
• The "intra DA" topology (Et=0) operates by modelling a sequence of words Wt:T which is assumed to be part of a single DA unit
  • Updates the FLM-based DA estimations (joint sentence probabilities: p(Wt | Wt-1, Nt, DA0t))
  • Updates a set of deterministic counter variables (word counter Ct and word block counter Nt)
Note: for clarity the next slides show only the BN slices that are actually duplicated for t > 1; the network topologies adopted for t = 0, 1 also take care of variable initialisation.
Generative DBN model (2)
• The "inter DA" state is active only when a transition between different DA units is likely (Et=1)
  • Models the DA unit transition process
  • Integrates the discourse model probability
  • Initialises the deterministic counter variables (Ct, Nt) and forces the FLM to start a new set of estimations (backoff to unigrams)
  • Updates the DA recognition history (DA1t and DA2t)
• The probability of encountering a new DA boundary (Et=1) is estimated, in both operative states, from:
  • The observable continuous feature vector Yt (GMMs)
  • The word block counter Nt (DA duration model)
  • The previous DA recognition history DAkt
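The deterministic counter behaviour the two topologies describe (increment inside a DA unit, reset at a boundary) can be sketched as a per-slice update; the block size linking Ct to Nt is an assumed value, as the slides do not give it:

```python
def step_counters(c_prev, n_prev, e_prev, block_size=5):
    """One time-slice update of the deterministic counters:
    while inside a DA unit (E_{t-1} = 0) the word counter C_t is
    incremented and the word-block counter N_t advances once every
    `block_size` words; a detected boundary (E_{t-1} = 1) resets
    both counters for the new DA unit."""
    if e_prev == 1:          # inter-DA state: start a fresh unit
        return 0, 0
    c = c_prev + 1           # intra-DA state: count the new word
    n = n_prev + (1 if c % block_size == 0 else 0)
    return c, n
```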
Generative DBN model (3)
[Figure: two BN slices, one per switching state. With Et-1=0 (no boundary detected) the word position counters Ct and Nt are incremented (+1); with Et-1=1 (DA boundary detected) they are reset (:=0). Nodes: DA labels DA0t (hidden) with DA recognition history DA1t, DA2t feeding the discourse model; words Wt (observable), scored by the FLM; continuous features Yt (observable); word position counters Ct, Nt; the switching node Et, i.e. the DA boundary detector (hidden). The DA recognition output is read from the DA label nodes.]
Performance evaluation
Segmentation error metric:
• NIST Sentence-like Unit (NIST-SU*): sum of missed DA boundaries and False Alarms, divided by the number of reference DA units
Recognition error metrics:
• NIST Sentence-like Unit (NIST-SU*): sum of missed DAs, False Alarms and Substitutions, divided by the number of reference DA units
• SClite: sum of % DA substitution, insertion and deletion errors after a time-mediated DA alignment (the same as the Word Error Rate metric used in speech recognition)
• Lenient*: 100% − {percentage of correctly classified words (ignoring DA boundaries)}
* The NIST Sentence-like Unit and Lenient metrics are defined in [Ang et al. 2005]
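The NIST-SU recognition error described above can be sketched as follows, assuming each DA unit is represented by its start boundary and its label, and (as a simplification of this sketch) that hypothesised boundaries must match reference boundaries exactly:

```python
def nist_su_recognition_error(ref, hyp):
    """NIST-SU style recognition error: (missed DAs + false alarms +
    substitutions) / number of reference DA units. `ref` and `hyp`
    are lists of (start_boundary, label) pairs."""
    ref_map = dict(ref)   # boundary -> label
    hyp_map = dict(hyp)
    missed = sum(1 for b in ref_map if b not in hyp_map)
    false_alarms = sum(1 for b in hyp_map if b not in ref_map)
    subs = sum(1 for b, label in ref_map.items()
               if b in hyp_map and hyp_map[b] != label)
    return (missed + false_alarms + subs) / len(ref_map)
```

Dropping the label comparison (the `subs` term) turns this into the segmentation-only variant of the metric.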
…
…
Experimental results

Model trained on:           Reference   Reference   ASR
Tested on:                  Reference   ASR         ASR

% correct (DA tagging)
Chance                      21.0
+ Prior distribution        38.8
Everything as Statement     57.4
DA tagging                  76.0        65.5        65.7

Error metric (automatic segmentation)
NIST-SU                     35.6        48.1        43.6

Note: DA tagging accuracy has been estimated by providing the ground-truth segmentation (forcing the state of the Et nodes). DA classification is based on (FLM) lexical features plus a 3-gram discourse model, which improves tagging by ~5%.
…
• "Pause duration" features play a key role in the segmentation task, but optimal recognition performance can be achieved only with the fully comprehensive feature setup
• A system fully trained on ASR transcriptions performs slightly better than one trained on clean transcriptions and tested on ASR output
  • Mismatch between reference and ASR word lists, and systematic substitutions
Recognition results

Model trained on:   Reference   Reference   ASR
Tested on:          Reference   ASR         ASR

Error metrics
NIST-SU             56.8        73.2        69.6
SClite              44.6        55.7        53.5
Lenient             19.7        22.2        22.0

All evaluation metrics show consistent trends; encouraging results can be achieved on ASR transcriptions.
…
Conclusions (1)
• Task:
  • Automatic recognition of five broad DA categories: statements, questions, back-channels, fillers and disruptions
• Approach:
  • A switching-DBN-based infrastructure (Bayesian multi-net) oversees the DA recognition process (joint segmentation and classification) and integrates a heterogeneous set of technologies:
    • Feature-based DA segmentation
    • Factored Language Model for DA classification
    • N-gram DA discourse model (3-gram)
  • The graphical infrastructure encourages the reuse of common resources (such as the discourse model and the word counters) and learns the optimal recognition strategy from data without the need for external supervision
  • The joint approach operates on a wide search space
Conclusions (2)
• Results:
  • Small gap between FLM-based DA tagging and maximum entropy DA classification [Ang et al. 2005]
  • The concurrent evaluation of multiple DA segmentation + tagging hypotheses (the joint approach) provides low recognition error rates…
  • … and seems to cope well with imperfect word transcriptions: 29% WER on the ASR output causes less than 10% degradation of the DA recognition output
• Further directions:
  • Work in progress on the AMI meeting corpus (17 DA classes)
  • Tuning of the FLM and investigation of new factors
  • Experiments with multimodal features
  • Integration of automatic DA recognition into the "meeting action detection framework"