recognition of dialogue acts in meetings
DESCRIPTION
Recognition of Dialogue Acts in Meetings. Alfred Dielmann – [email protected], Steve Renals – [email protected]. Centre for Speech Technology Research, University of Edinburgh. Agenda: Introduction; Meeting data corpus; System overview; Feature set; Factored Language Models.
TRANSCRIPT
Recognition of Dialogue Acts in Meetings
Alfred Dielmann – [email protected]
Steve Renals – [email protected]
Centre for Speech Technology Research, University of Edinburgh
www.amiproject.org – A. Dielmann, 2 May 2006
Agenda
• Introduction
• Meeting data corpus
• System overview
• Feature set
• Factored Language Models
• DBN infrastructure
• Experimental results
• Conclusions and future directions
Goal
[Diagram] Goal: automatic meeting structuring, via meeting phase detection and Dialogue Act recognition, feeding applications such as summarisation, topic detection and tracking, and language models for automatic speech recognition.
Dialogue Acts
• Building blocks of a conversation
• "Dialogue Acts reflect the functions that utterances serve in a discourse" [Ji et al. 05]
• Several DA coding schemes can be defined:
  • Targeted at different conversational aspects
  • Characterised by multiple hierarchical levels
  • Different numbers of DA labels
ICSI meeting corpus (1)
• Naturally occurring meetings
  • Unconstrained human-to-human interactions
  • Unconstrained recording conditions
• 75 meetings of 4-10 participants (average 6 participants)
• 72 hours of multi-channel audio data
  • Head-mounted microphones
  • 4 tabletop microphones
• Fully transcribed
• Annotated in terms of Dialogue Acts
  • MRDA scheme: 11 generic tags + 39 specific sub-tags
  • More than 2000 unique DA labels
  • Mappings from MRDA tags to reduced tag sets
ICSI meeting corpus (2)
• Five broad DA categories (obtained by manually grouping MRDA tags):
  • Statements, Questions, Back-channels, Fillers, Disruptions
• Imbalanced distribution across DA categories
• Same data-set subdivision as in Ang et al. [2005]:
  • Training (51 meetings) / Development (11) / Test (11)

DA distribution (number of DA units): Statements 59%, Disruptions 13%, Back-channels 12%, Fillers 10%, Questions 6%
DA distribution (temporal duration or number of words): Statements 74%, Disruptions 10%, Fillers 9%, Questions 6%, Back-channels 1%
Methodology
Task definition:
• Joint approach: DA segmentation and classification executed concurrently as a single step, instead of sequentially
Generative approach:
• Observable sequences of words Wt (sentences) and features Yt are generated by hidden Dialogue Acts
System building blocks:
• Trainable statistical model (dynamic Bayesian network)
• Factored language model (maps word sequences into DA units)
• Feature extraction component (DA segmentation)
• Discourse model: trigram language model over DA label sequences
[Figure: a word sequence w1 … w16 segmented at Dialogue Act unit boundaries into units DA1, DA2, DA3]
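The discourse model mentioned above is a trigram language model over DA label sequences. A minimal sketch of how such a model could be trained and queried; the add-one smoothing here is an illustrative assumption, since the slides do not state which smoothing was used:

```python
from collections import defaultdict

LABELS = ["Statement", "Question", "Backchannel", "Filler", "Disruption"]

def train_trigram(sequences):
    """Count DA label trigrams (and their bigram contexts) over
    the training meetings' label sequences."""
    tri, bi = defaultdict(int), defaultdict(int)
    for seq in sequences:
        padded = ["<s>", "<s>"] + seq
        for i in range(2, len(padded)):
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            bi[(padded[i - 2], padded[i - 1])] += 1
    return tri, bi

def prob(tri, bi, prev2, prev1, label, vocab=len(LABELS)):
    """Add-one smoothed trigram probability p(label | prev2, prev1)."""
    return (tri[(prev2, prev1, label)] + 1) / (bi[(prev2, prev1)] + vocab)
```

With this shape, the probability of the next DA label given the two previous ones can be plugged into the DA unit transition process described later.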
System overview
[Diagram] Multi-channel audio recordings feed two paths: feature extraction, and transcriptions (manual or produced by Automatic Speech Recognition). The transcriptions drive a Factored Language Model; DA labels are used to train the Discourse Model. A Dynamic Bayesian Network infrastructure performs joint DA segmentation and tagging, producing DA segment boundaries + DA labels.
ASR transcription
• Goal: estimate the degradation of DA recognition performance caused by an imperfect transcription
• Word-level transcription kindly provided by the AMI ASR team
  • Baseline system developed in early 2005
  • Based on a PLP front-end, decision-tree clustered cross-word triphone models and an interpolated bigram LM
  • Trained on ICSI meetings
  • Covers the whole ICSI corpus through a 4-fold cross-validation
  • About 29% Word Error Rate (ideal for our task)
Features
• 6 continuous word-related features:
  • F0 mean and variance: normalised against baseline pitch
  • RMS energy: normalised by channel and by typical word energy
  • Word length: normalised by typical duration
  • Word relevance: local term frequency / absolute term frequency
  • Pause duration: re-scaled inter-word pauses
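The "word relevance" feature is the ratio of a word's local term frequency to its absolute (corpus-wide) term frequency. A small sketch of that ratio; grouping the local counts at the meeting level is an assumption, as the slides do not specify the exact normalisation window:

```python
from collections import Counter

def word_relevance(meeting_words, corpus_counts, total_corpus_words):
    """Relevance of each word in a meeting: local term frequency
    divided by absolute term frequency, as listed on the slide.
    Words frequent here but rare corpus-wide score high."""
    local = Counter(meeting_words)
    n_local = len(meeting_words)
    relevance = {}
    for word, count in local.items():
        local_tf = count / n_local
        abs_tf = corpus_counts[word] / total_corpus_words
        relevance[word] = local_tf / abs_tf
    return relevance
```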
Factored Language Models
Generalised language models in which words and word-related features (factors) are bundled together.
What can be a factor? Nearly everything: the words themselves, stems, Part-Of-Speech tags, relative position in the sentence, morphological classes, DAs, …
Goal: factorise the joint probability associated with a sentence in terms of factor-related conditional probabilities:

p(w_{1:T}) = ∏_t p(w_t | f_t^1, …, f_t^k, w_{t-1}, f_{t-1}^1, …, f_{t-1}^k, …, w_{t-n}, f_{t-n}^1, …, f_{t-n}^k)
Factored Language Models

DA tagging accuracies (% correct) of the FLM p(w_t | w_{t-1}, n_t, d_t), for the FLM comparison task, estimated by integrating a simple decoder into the SRILM toolkit:

Trained on:   Reference   Reference   ASR
Tested on:    Reference   ASR         ASR
              Dev  Test   Dev  Test   Dev  Test
              69.7 70.9   61.6 61.9   63.4 63.6

Decoder: d̂ = argmax_{d ∈ D} ∏_t p(w_t | w_{t-1}, n_t, d), with D = {Statements, Questions, Fillers, …}
The estimated label d̂ is scored against the reference DA label d (% Correct).
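The decoding rule above picks, for each DA unit, the label that maximises the FLM sentence probability. A log-domain sketch; the `cond_logprob` callback is hypothetical and stands in for a trained FLM queried, e.g., through the SRILM toolkit:

```python
import math

def tag_da(words, labels, cond_logprob):
    """Pick the DA label d maximising prod_t p(w_t | w_{t-1}, n_t, d),
    computed in the log domain. `cond_logprob(w, w_prev, n, d)` is an
    assumed interface to the trained factored language model."""
    best_label, best_score = None, -math.inf
    for d in labels:
        score, prev = 0.0, "<s>"
        for n, w in enumerate(words):
            score += cond_logprob(w, prev, n, d)  # log p(w_t | w_{t-1}, n_t, d)
            prev = w
        if score > best_score:
            best_label, best_score = d, score
    return best_label
```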
Factored Language Models

The relationship between words and DAs has been modelled using a 3-factor FLM with Kneser-Ney discounting:

FLM                                    1st Backoff                    2nd Backoff     Dev   Test
p(w_t | w_{t-1}, n_t, d_t)             p(w_t | n_t, d_t)              p(w_t | n_t)    69.7  70.9
p(w_t | w_{t-1}, p_t, d_t)             p(w_t | p_t, d_t)              p(w_t | p_t)    61.7  63.5
p(w_t | w_{t-1}, m_t, d_t)             p(w_t | m_t, d_t)              p(w_t | m_t)    68.2  68.8
p(w_t | w_{t-1}, n_t, p_t, m_t, d_t)   p(w_t | n_t, p_t, m_t, d_t)    …               67.7  68.2

Factors: w_t word; n_t relative word position (…); d_t dialogue act label (hidden); p_t part-of-speech label; m_t meeting type; s_t word stem; …
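Each FLM row in the table comes with a generalised backoff path. A sketch of the lookup order for the first row, assuming the (Kneser-Ney discounted) probabilities have already been estimated; the dict-based tables are illustrative, not SRILM's actual data structures:

```python
def flm_prob(w, w_prev, n, d, full, bo1, bo2, floor=1e-6):
    """Backoff chain from the table's first row:
    p(w | w_{t-1}, n_t, d_t) -> p(w | n_t, d_t) -> p(w | n_t).
    Only the lookup order is modelled here; the discounting and
    backoff weights are assumed to be folded into the tables."""
    if (w, w_prev, n, d) in full:      # full 3-factor context
        return full[(w, w_prev, n, d)]
    if (w, n, d) in bo1:               # 1st backoff: drop w_{t-1}
        return bo1[(w, n, d)]
    return bo2.get((w, n), floor)      # 2nd backoff: drop d_t
```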
DBNs
• Bayesian Networks are directed probabilistic graphical models:
  • Nodes represent random variables
  • Directed arcs represent conditional (in-)dependences among variables
• DBNs = extension of BNs to process data sequences or time-series:
  • Instantiating a static BN for each temporal slice t
  • Making temporal dependences between variables explicit
• Switching DBNs (Bayesian multi-nets) are adaptive DBNs able to change their internal topology according to the state of one or more variables (switching nodes)
• HMMs, Kalman filter models, and many other state-space models can be represented under the same formalism
Generative DBN model (1)
• Implemented through a switching DBN model:
  • Switching variable: DA boundary detector node Et
  • 2 operative conditions, hence 2 switching model topologies
• The "intra DA" topology (Et=0) operates by modelling a sequence of words Wt:T which is assumed to be part of a single DA unit
  • Updates the FLM-based DA estimations (joint sentence probabilities: p(Wt | Wt-1, Nt, DA0t))
  • Updates a set of deterministic counter variables (word counter Ct and word block counter Nt)
Note: for clarity the next slides show only the BN slices that are actually duplicated for t > 1; the network topologies adopted for t = 0, 1 also take care of variable initialisation.
Generative DBN model (2)
• The "inter DA" state is active only when a transition between different DA units is likely (Et=1)
  • Models the DA unit transition process
  • Integrates the discourse model probability
  • Initialises the deterministic counter variables (Ct, Nt) and forces the FLM to start a new set of estimations (backoff to unigrams)
  • Updates the DA recognition history (DA1t and DA2t)
• The probability of encountering a new DA boundary (Et=1) is estimated, in both operative states, from:
  • The observable continuous feature vector Yt (GMMs)
  • The word block counter Nt (DA duration model)
  • The previous DA recognition history DAkt
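The deterministic counter behaviour the two topologies describe (increment inside a DA unit, reset at a boundary) can be sketched as a per-slice update; the block size linking Ct to Nt is an assumed value, as the slides do not give it:

```python
def step_counters(c_prev, n_prev, e_prev, block_size=5):
    """One time-slice update of the deterministic counters:
    while inside a DA unit (E_{t-1} = 0) the word counter C_t is
    incremented and the word-block counter N_t advances once every
    `block_size` words; a detected boundary (E_{t-1} = 1) resets
    both counters for the new DA unit."""
    if e_prev == 1:          # inter-DA state: start a fresh unit
        return 0, 0
    c = c_prev + 1           # intra-DA state: count the new word
    n = n_prev + (1 if c % block_size == 0 else 0)
    return c, n
```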
Generative DBN model (3)
[Figure: two BN slices, one per switching state. With Et-1=0 (no boundary detected) the word position counters Ct and Nt are incremented (+1); with Et-1=1 (DA boundary detected) they are reset (:=0). Nodes: DA labels DA0t (hidden) with DA recognition history DA1t, DA2t feeding the discourse model; words Wt (observable), scored by the FLM; continuous features Yt (observable); word position counters Ct, Nt; the switching node Et, i.e. the DA boundary detector (hidden). The DA recognition output is read from the DA label nodes.]
Performance evaluation
Segmentation error metric:
• NIST Sentence-like Unit (NIST-SU*): sum of missed DA boundaries and False Alarms, divided by the number of reference DA units
Recognition error metrics:
• NIST Sentence-like Unit (NIST-SU*): sum of missed DAs, False Alarms and Substitutions, divided by the number of reference DA units
• SClite: sum of % DA substitution, insertion and deletion errors after a time-mediated DA alignment (the same as the Word Error Rate metric used in speech recognition)
• Lenient*: 100% − {percentage of correctly classified words (ignoring DA boundaries)}
* The NIST Sentence-like Unit and Lenient metrics are defined in [Ang et al. 2005]
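The NIST-SU recognition error described above can be sketched as follows, assuming each DA unit is represented by its start boundary and its label, and (as a simplification of this sketch) that hypothesised boundaries must match reference boundaries exactly:

```python
def nist_su_recognition_error(ref, hyp):
    """NIST-SU style recognition error: (missed DAs + false alarms +
    substitutions) / number of reference DA units. `ref` and `hyp`
    are lists of (start_boundary, label) pairs."""
    ref_map = dict(ref)   # boundary -> label
    hyp_map = dict(hyp)
    missed = sum(1 for b in ref_map if b not in hyp_map)
    false_alarms = sum(1 for b in hyp_map if b not in ref_map)
    subs = sum(1 for b, label in ref_map.items()
               if b in hyp_map and hyp_map[b] != label)
    return (missed + false_alarms + subs) / len(ref_map)
```

Dropping the label comparison (the `subs` term) turns this into the segmentation-only variant of the metric.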
…
…
Experimental results

Model trained on:           Reference   Reference   ASR
Tested on:                  Reference   ASR         ASR

% correct (DA tagging)
Chance                      21.0
+ Prior distribution        38.8
Everything as Statement     57.4
DA tagging                  76.0        65.5        65.7

Error metric (automatic segmentation)
NIST-SU                     35.6        48.1        43.6

Note: DA tagging accuracy has been estimated by providing the ground-truth segmentation (forcing the state of the Et nodes). DA classification is based on (FLM) lexical features plus a 3-gram discourse model, which improves tagging by ~5%.
…
• "Pause duration" features play a key role in the segmentation task, but optimal recognition performance can be achieved only with the fully comprehensive feature setup
• A system fully trained on ASR transcriptions performs slightly better than one trained on clean transcriptions and tested on ASR output
  • Mismatch between reference and ASR word lists, and systematic substitutions
Recognition results

Model trained on:   Reference   Reference   ASR
Tested on:          Reference   ASR         ASR

Error metrics
NIST-SU             56.8        73.2        69.6
SClite              44.6        55.7        53.5
Lenient             19.7        22.2        22.0

All evaluation metrics show consistent trends; encouraging results can be achieved on ASR transcriptions.
…
Conclusions (1)
• Task:
  • Automatic recognition of five broad DA categories: statements, questions, back-channels, fillers and disruptions
• Approach:
  • A switching-DBN-based infrastructure (Bayesian multi-net) oversees the DA recognition process (joint segmentation and classification) and integrates a heterogeneous set of technologies:
    • Feature-based DA segmentation
    • Factored Language Model for DA classification
    • N-gram DA discourse model (3-gram)
  • The graphical infrastructure encourages the reuse of common resources (such as the discourse model and the word counters) and learns the optimal recognition strategy from data without the need for external supervision
  • The joint approach operates on a wide search space
Conclusions (2)
• Results:
  • Small gap between FLM-based DA tagging and maximum entropy DA classification [Ang et al. 2005]
  • The concurrent evaluation of multiple DA segmentation + tagging hypotheses (the joint approach) provides low recognition error rates…
  • … and seems to cope well with imperfect word transcriptions: 29% WER on the ASR output causes less than 10% degradation of the DA recognition output
• Further directions:
  • Work in progress on the AMI meeting corpus (17 DA classes)
  • Tuning of the FLM and investigation of new factors
  • Experiments with multimodal features
  • Integration of automatic DA recognition into the "meeting action detection framework"