
D3.3: Final report on Multimodal information access and integration


Project ref. no. IST–2001–34485
Project acronym M4

Project full title MultiModal Meeting Manager

Security Restricted

Contractual date of delivery M36 (28 February 2005)

Actual date of delivery 28 February 2005

Deliverable number D3.3

Deliverable name Final report on Multimodal information access and integration

Type Report

Status and version V1.0

Number of pages 55

WP contributing to deliverable WP3

WP/Task responsible UniGE

Other contributors UniEd, TUM, IDIAP, Brno, TNO, Twente

Editor(s) Stéphane Marchand-Maillet (UniGE)

EC project officer Mats Ljungqvist

Key words Information access, multimodal integration

Abstract

This deliverable reports on innovative working techniques for multimodal information access and integration. Multimodal integration and access have been addressed throughout the M4 project. We are now able to report on a robust framework for the generic integration of multimodal data and for similarity-based access to it. This framework is coupled with the development of multimodal meeting features that benefit from all the modalities classically encountered. We have also developed a number of approaches for managing data manually where needed, essentially for the creation of multimodal test and training data.


Table of contents:

1 INTRODUCTION

2 MULTIMODAL ACCESS THROUGH EXTRACTIVE SUMMARIZATION

3 INFORMATION ACCESS THROUGH AUTOMATIC ADDRESSEE IDENTIFICATION

4 MULTIMODAL RECOGNITION OF PHONEMES IN MEETING DATA

5 MULTIMODAL INTEGRATION OF FEATURE STREAMS FOR MEETING GROUP ACTION SEGMENTATION AND RECOGNITION

6 INTERACTIVE VIDEO RETRIEVAL BASED ON MULTIMODAL DISSIMILARITY REPRESENTATION

7 MEETING DATA EDITING AND ANNOTATION

8 CONCLUSION


1 Introduction

This deliverable presents and summarises the efforts made within the project regarding the integration of multistream information for multimodal information access. We first present, in Section 2, a framework for multimodal access via summarization. In Section 3, we propose access via addressee identification. Section 4 presents a study of general speech feature extraction tasks in meeting data. A thorough study targeted at localising and recognising meeting group actions is then presented in Section 5. In Section 6, we present an indexing scheme that may embed all the above features in a dissimilarity representation for real-time similarity access to multimodal features. Finally, in Section 7, we present a number of approaches that have been designed and used to process multimodal data manually, with a view to acquiring prior knowledge for our studies.

2. Multimodal access through extractive summarization

Deliverable 3.1 reported on preliminary work aiming at producing short headline summaries for clusters of broadcast news. We have extended the extractive summarization technique initially developed for broadcast news text. The experiments were carried out in the context of the NIST DUC (Document Understanding Conference) evaluation. The system was based on a Naïve Bayes framework in combination with features learned from generative language models. More recently, a derived approach has been evaluated on the meeting transcripts of the ICSI corpus.

Broadcast news and transcribed meetings have very different characteristics. The former is usually quite condensed text, optimized to transfer information using a minimum of airtime or lines of text. Meetings, however, often have a much lower information density, since they consist mostly of spontaneous speech. Another factor is that a meeting transcript is the end result of a communication process between several participants. This means that a large proportion of the utterances deal with managing the communication process itself (i.e. backchannels) rather than with the topic of the meeting. Although there is a considerable discrepancy between the two domains, we still see possibilities for extractive techniques.

All utterances of six manual meeting transcripts of the ICSI meeting corpus (corresponding to roughly six hours of recorded meetings) were annotated for salience using a ternary salience scale. The annotation of the utterances was done in a context-dependent fashion: the annotator selected the most important utterances from the meeting, while maintaining a good coverage of all discussion topics and a good readability (informative summaries). The annotation work itself yielded several important insights about the nature of meeting transcripts. The manual extractive summaries reduced the number of utterances by 80-90%. Since most extracted utterances are long, the summary length measured in word count is about 30% of the original meeting length, which we consider a useful result for a first attempt at meeting summarization.

The second step in the pilot study on extractive meeting summarization consisted of training a machine-learning-based summarizer and evaluating its performance. The six annotated meetings were divided into a set of four meetings for training and two meetings for evaluation, and N-fold cross-validation was applied. Instead of the Naïve Bayes framework used for the summarization of broadcast news, a Maximum Entropy framework was chosen since it does not assume feature independence.

Features

The following features were used in the MaxEnt-based summarizer:

Sentence Length (SL) This feature is divided into four classes: ultra-short are segments that are only 10 characters long or less; these occur often and are often of no importance. The boundaries for


'short' are larger than 10 and smaller than 30 characters. 'Medium' is between 30 and 80 characters, and everything longer than 80 characters is considered 'long'. The measure is in characters rather than words, especially with smaller segments in mind: a segment like 'I like it' contains no information, while a sentence of three long words tends to be more important.
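As a concrete illustration, the following minimal sketch implements the character-based length binning described above (the function name and the returned labels are our own):

    def length_class(segment: str) -> str:
        """Bin a transcript segment into the four SL classes described in the text:
        <= 10 characters ultra-short, 11-29 short, 30-80 medium, > 80 long."""
        n = len(segment)
        if n <= 10:
            return "ultra-short"
        if n < 30:
            return "short"
        if n <= 80:
            return "medium"
        return "long"

    if __name__ == "__main__":
        for seg in ["yeah", "I like it", "so what did everyone come up with then?"]:
            print(repr(seg), "->", length_class(seg))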

TF-IDF (TF) This is an implementation of the TF-IDF information retrieval technique, where words are highlighted that occur more often in the current meeting than on average in a corpus of similar meetings. Segments that contain document-specific words often contain valuable information for a summary.

Frequent words and bigrams (FR, FR1, FRB) A variation on the TF-IDF algorithm, this implementation uses the mean and standard deviation of word occurrences in a corpus to evaluate their importance. As a single feature, it works slightly better than TF-IDF. The FR1 variant implements a feature for every important word, which results in 15 different features, a method to give the model more distinct examples to train with. The bigram version implements the same for significantly important pairs of words. To ensure a clean list of important words in the unigram variants, a closed-word stoplist is used: the use of particular closed words is not a sign of importance, it only reflects the style of the speakers.

Cuephrases Although this feature is not fully implemented, it captures the idea that people tend to use certain phrases to announce something important. To find phrases that stood out from the good segments, a cross-entropy-based metric was used which can distinguish specific terms (or phrases) that stand out from a background corpus. When using bad segments as the background corpus, one should be able to see which phrases are specific to a summary. Unfortunately, this did not work as well as we hoped: the results were words/phrases that were not obviously important. We handpicked a number from the top list, and this feature is triggered by an occurrence in a segment.

Linking to the previous segment (PC) This is done by making a pass over the generated list of feature vectors and adding new features that indicate whether the previous segment was rated good. This results in two features, one for the last and one for the second-last segment. It adds a new layer of uncertainty, because these features are based on the assumption that the segments were correctly labeled in the first pass.

Important speakers (IMP) It turned out that the most frequent speakers also say the most important things. This behavior is captured in three features that name the first, second and third longest speakers, computed by summing up the speech time of each individual speaker.

Dialogue Acts (DA) A dialogue act is metadata about a segment that indicates the intention of the speaker. Examples are floor-grabbing, making a statement, or asking a question. Every segment can contain multiple dialogue acts. When experimenting with them, it turned out that interesting segments were mostly statement-only and question segments.

Results

The system was trained with different feature combinations, because using the complete feature set rendered suboptimal results. A result was calculated for every single feature, after which the best result was selected. Then every feature combined with that best feature was evaluated again. A number of features yield no significant results on their own because the model then selects all segments as important; the strength of these features lies in their combination with others.
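The feature-combination search just described is a greedy forward selection. A minimal sketch under our own naming, with `evaluate` standing in for training the MaxEnt summarizer on a feature subset and returning its F score on held-out data:

    from typing import Callable, FrozenSet, Iterable

    def greedy_feature_search(features: Iterable[str],
                              evaluate: Callable[[FrozenSet[str]], float],
                              rounds: int = 2) -> FrozenSet[str]:
        """Greedy forward search: score every single feature, keep the best one,
        then score every remaining feature combined with the current best set."""
        features = list(features)
        selected: FrozenSet[str] = frozenset()
        best_score = float("-inf")
        for _ in range(rounds):
            candidates = [selected | {f} for f in features if f not in selected]
            if not candidates:
                break
            best = max(candidates, key=evaluate)
            if evaluate(best) <= best_score:
                break  # no combination improves on the current set
            selected, best_score = best, evaluate(best)
        return selected

    # Toy usage with a stand-in scorer:
    # scores = {"SL": 0.40, "TF": 0.38, "IMP": 0.35, "DA": 0.30}
    # greedy_feature_search(scores, lambda s: sum(scores[f] for f in s) * 0.9 ** len(s))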


Figure 1: Results with different feature sets

Figure 1 shows the results of evaluating the maximum entropy based summarizer on the held-out annotated data. Results are shown for different feature sets. As baselines we included two approaches. The first is a random extraction of segments from the test set; this is the absolute baseline, as there is no coherence or logic in this method. The second baseline selects those segments that contain more than 10 words. This can serve as a baseline because it is very easy to implement and has a much better success rate than the first one (a sketch of this baseline and of the F-score computation is given at the end of this subsection). The optimal feature set yielded an F score of 0.505, which exceeds the baselines but shows that the technique has to be improved in order to become really usable. The results have to be seen in the context of the well-known fact that inter-assessor agreement is usually quite low on summarization tasks: humans find it hard to agree on what constitutes a perfect summary. We are currently working on a summarizer based on lexical chaining techniques, in order to enhance the context sensitivity of the system.

SMIL demonstrator

Once a summary has been generated, it is possible to listen to it using a SMIL player while viewing the transcript. While keeping synchronized with the audio (or video), the player skips to each segment that the summarizer has marked as important. Segments are color-coded for each individual, and the format also allows clicking in a topic index to skip certain parts. This system works very well and feels quite natural, even though segments are sometimes not completed.
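A minimal sketch of the word-count baseline and of a balanced F-score computation against the manual salience annotation (the function names and the boolean encoding of "salient" are our own):

    def baseline_selection(segments, min_words=10):
        """Length baseline: mark a segment as summary-worthy if it contains
        more than min_words words."""
        return [len(s.split()) > min_words for s in segments]

    def f_score(predicted, reference):
        """Balanced F score of predicted vs. manually annotated salient segments
        (two equally long lists of booleans)."""
        tp = sum(p and r for p, r in zip(predicted, reference))
        fp = sum(p and not r for p, r in zip(predicted, reference))
        fn = sum(r and not p for p, r in zip(predicted, reference))
        if tp == 0:
            return 0.0
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)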

Parliament recordings test collection

In order to study a realistic meeting context and to get early user feedback on M4 technology, we have established a research partnership with the Dutch parliament. The Dutch parliament offers a live Webcast of its plenary sessions as a service to the public, and is also in the process of implementing electronic services on a much larger scale. As a result of the collaboration, the parliament made some 20 hours of registrations available to the M4 and AMI consortia for research. In addition, the official transcripts of the meetings are available for research.

Parliament recordings demonstrator

In D3.2 an early version of a demonstrator based on a 30-minute fragment of parliamentary sessions was discussed. Recently, a new demonstrator of the browser for parliamentary


meeting recordings has been developed. This version is based on the same principles but has been extended to disclose a larger set of meetings.

The demonstrator consists of two components: a content interpretation and enrichment step that generates searchable indices, and the meeting browser.

Content interpretation and enrichment

The parliament session recordings consist of produced video registrations, distributed as Windows Media files of one hour length. The test collection covers two full days of sessions in October 2003. For these sessions, official transcripts are available, which are based on the spoken text but have been slightly edited to make the written accounts more readable (e.g. by deleting disfluencies). Also, some metadata (e.g. list of attendees and topics) are added to the official transcripts, which were available to M4 in the form of PDF files. The objective of the content enrichment phase was to align the official transcript text to the video registration, i.e. to add timecodes to every word in the official transcripts. In this way the transcripts can be used to locate relevant meeting fragments given a query. This objective was reached by performing two steps:

1. Automatic speech recognition: An existing ASR system for Dutch was adapted to the parliamentary data (by adapting the lexicon and language model) and was used to generate a transcript of the audio signal of the video registrations. This automatic transcript contained timecodes.

2. Transcript and metadata extraction: A module was built to extract the manual transcripts from the PDF versions. Some of the metadata (e.g. speaker name and debate topic) could be extracted as well by heuristic interpretation of the page layout and font information. Both transcript and metadata were converted into XML format.

The time-coded automatic transcript was subsequently aligned with a collection of XML versions of the official records. This involved a combination of dynamic programming (an LCS algorithm), least squares filtering and polynomial interpolation; a sketch of the LCS stage is given at the end of this subsection. No manual pre-selection of the manual transcript and automatic transcript is required prior to the alignment process. Subsequently, the manual transcripts, speakers and debate topics are indexed in three separate indices.

Meeting browser

The meeting browser demonstrator (cf. the screendump in Figure 2) facilitates access to an archive of meeting sessions by offering both browsing and searching functions. The upper left pane shows the browsable structure of the sessions of one day. It consists of a list of debate topics, which are subdivided (not shown here) into a list of speaker turns. When a particular session topic is selected, it is immediately shown in the video viewer. When the video plays, the corresponding transcript from the official meeting record is shown alongside, scrolling as the parliament members speak. The bottom part of the browser interface is reserved for search. A user can search in the transcripts, on debate topic or on speaker name. The meeting browser is implemented as a combination of client-side Javascript and SMIL, served by a relational database and the open source search engine Lucene, glued together by Java and PHP layers.
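A minimal sketch of the LCS-based alignment stage mentioned above, under our own assumptions about the data layout (the ASR output is a list of (word, start_time) pairs); the least-squares filtering and polynomial interpolation steps are omitted:

    def lcs_align(asr_words, official_words):
        """Longest-common-subsequence alignment between the ASR word sequence and
        the official transcript words. Returns (asr_index, official_index) pairs
        for matched words."""
        n, m = len(asr_words), len(official_words)
        # dp[i][j] = LCS length of asr_words[i:] and official_words[j:]
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n - 1, -1, -1):
            for j in range(m - 1, -1, -1):
                if asr_words[i] == official_words[j]:
                    dp[i][j] = dp[i + 1][j + 1] + 1
                else:
                    dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
        pairs, i, j = [], 0, 0
        while i < n and j < m:
            if asr_words[i] == official_words[j]:
                pairs.append((i, j))
                i, j = i + 1, j + 1
            elif dp[i + 1][j] >= dp[i][j + 1]:
                i += 1
            else:
                j += 1
        return pairs

    def transfer_timecodes(asr, official_words):
        """Attach ASR start times to matched official-transcript words and linearly
        interpolate times for the unmatched words in between."""
        pairs = lcs_align([w for w, _ in asr], official_words)
        times = [None] * len(official_words)
        for ai, oj in pairs:
            times[oj] = asr[ai][1]
        known = [j for j, t in enumerate(times) if t is not None]
        for a, b in zip(known, known[1:]):
            for j in range(a + 1, b):
                times[j] = times[a] + (times[b] - times[a]) * (j - a) / (b - a)
        return list(zip(official_words, times))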


Figure 2: Meeting browser for parliament sessions

References

W. Kraaij, M. Spitters, and A. Hulth. Headline extraction based on a combination of uni- and multidocument summarization techniques. In Proceedings of the ACL workshop on Automatic Summarization/Document Understanding Conference (DUC 2002), June 2002.

A.H. Buist, W. Kraaij, and S. Raaijmakers. Extractive meeting summarization. Poster at MLMI'04, Martigny, June 2004.

3. Information access through automatic addressee identification

Information about who is talking to whom in a given meeting situation is of great value for understanding what is going on in the meeting. We present results on automatically recognizing the speakers' addressees in meeting situations by means of supervised machine learning techniques, using information obtained from multimodal channels. Bayesian network classifiers were informed with data concerning visual features, linguistic features, and conversational features. To train the classifiers we annotated several small group meetings for dialogue acts, speakers' addressees and gaze behavior. This report is organized as follows. We first give some theoretical background focussing on addressing in small group face-to-face conversations, to explain the terminology used and to motivate the selection of features that was made for machine classification. Then we present the data, the way it was annotated, and facts and figures on the data that are used by the classifiers. In the third part we present classification results. Finally, we give conclusions.


Theoretical background on addressing

Up to a few years ago, research in the area of computational dialogue systems was restricted to interactions between two participants, the human user and the machine. For conversation analysts, however, two-party dialogues have always been a special case of multi-party conversations. Only recently has research in computational dialogue systems broadened its scope to multi-agent interactive systems, and in comparing the new situation with the more familiar case of two-party dialogues a number of interesting issues have been identified that become salient in interactions involving more than two participants. One aspect that regained interest is addressing.

Addressing is an aspect of every form of communication. A letter or e-mail message is addressed by its producer in order to make clear who is the intended receiver. In the specific types of communication that we encounter in face-to-face meetings we also see specific means that participants use to address their speech. We restricted our analysis to the M4 scripted meetings, which fall into the category of "small group" meetings. Small groups have no more than seven participants; the meetings we studied have four participants. Several studies have shown that speaker turn sequences in small group meetings differ substantially from those in larger groups. We expect that addressing behavior will also be different in small groups, compared to larger groups.

Addressees are those "ratified participants (…) oriented to by the speaker in a manner to suggest that his words are particularly for them, and that some answer is therefore anticipated from them, more so than from the other ratified participants". This is the way Erving Goffman describes the addressee in his "Replies and Responses" [10], one of the papers of Goffman that has been of great value for the analysis of small group meetings that we present here. In multi-party conversations, the speaker not only has the responsibility to make his speech understandable for the listeners, but also to make clear whom he is addressing. If this information is not part of the common ground as believed by the speaker, she is committed to making clear who her intended addressees are. When a speaker contributes to the conversation, all those participants who happen to be in perceptual range of this event will have "some sort of participation status relative to it". These conversational roles that the participants take in a given conversational situation make up the 'participation framework' (see E. Goffman's introduction in 'Forms of Talk', [11], p. 3).

There are different ways to categorize the audience of a speech act in a multi-party conversation. We use the taxonomy of conversational roles proposed in [12]. People around an action are divided into those who really participate in the action (active participants) and those who do not (non-participants). The active participants in a conversation include the speaker and addressee as well as other participants taking part in the conversation but currently not being addressed; Clark called them side participants. All other listeners, who have no rights to take part in the conversation, are called overhearers. Overhearers are divided into two groups: bystanders and eavesdroppers. Bystanders are overhearers who are present and of whose presence the speaker is aware. Eavesdroppers are overhearers who are listening without the speaker's awareness. In our work on small group conversations, we focus on the active participants, i.e. speaker, addressee and side participants.

There are many applications related to meeting research that could benefit from studying addressing behavior in human-human interactions. The results can be used by those who develop communicative agents in interactive intelligent (virtual) environments, meeting managers and presentation assistants. These human-computer interaction systems need to recognize when they are addressed and how they have to address people in the environment. Moreover, if we can induce from recorded meetings the information "who said what, when and to whom", we can use this information for making summarizations of meetings and for meeting browsing. Knowing whom the speaker is addressing in argumentative dialogue may help to understand the position that he takes towards the addressee as well as towards the position that the addressed participant takes in the discussion.


Automatic addressee identification problem

The goal of our research is to automatically identify the addressee of each functional utterance in meeting conversations. That is: not only the type of addressing (whether one participant, a subgroup, or the whole group is addressed), but also which participants are being addressed by the current speaker. As a computational model we use Bayesian networks. In order to build the networks, we identified the addressing mechanisms that people use in identifying their addressees. From these we extracted a set of verbal, non-verbal and contextual features that are relevant for observers to identify the participants the speaker is talking to [8]. Speech, as the main communication channel, is the main source of information for addressee identification. The set of verbal features includes linguistic markers (personal pronouns, possessive pronouns, personal adjectives, indefinite pronouns, quantifying determiners in combination with personal pronouns, etc.), dialogue acts, and names of participants in vocative form. Non-verbal features include the speaker's gaze direction and pointing gestures. The categories of context that contribute to addressee identification are: interaction history, meeting action history, user context and spatial context. Interaction history covers conversational history and non-verbal interaction history; conversational history contains the temporal sequence of speakers, dialogue acts and their addressees. User context includes participants' names, gender, social roles and institutional roles. Spatial context includes participants' locations, the location of environmental objects and participants' visible areas. A subset of these features is used in the initial computational models described in this report.

Data collection To train and test our models, we built a small corpus of hand-annotated meeting dialogues. The corpus consists of 10 meetings recorded at the IDIAP (M4 meeting data collection). Presently, the corpus contains hand-annotated dialogue acts, adjacency pairs, addressees and gaze directions of meeting participants.

Annotation scheme

Dialogue acts. Our dialogue act tag set is based on the MRDA (Meeting Recorder Dialogue Act) set [2]. The MRDA set is a tag set for labeling dialogue acts in multiparty face-to-face meeting conversations. Each functional utterance in MRDA is marked with a label compound of one or more tags from the set. In contrast to MRDA, each functional utterance in our set is either marked as Unlabeled or labeled with exactly one tag from the tag set; the selected tag thus represents the most specific function of the utterance. Our dialogue act tag set, the mapping between our set and the MRDA set, and the distribution of the tags over the five meetings selected from the annotated corpus are shown in Table 1.

Adjacency pairs. Adjacency pairs (APs) are labeled at a separate level from dialogue acts. Utterances in an AP are ordered, with a first part (marked as the "source") and a second part (marked as the "target"). Labeling an AP consists of marking the source dialogue act and the target dialogue act. If a dialogue act is a source with several targets, a new AP is created for each of those targets.

Addressee. The addressee of a dialogue act is the person or group of people to whom the act is addressed. The addressee tag set contains Px, Px(,Py)+ and Unknown, where x, y ∈ {0,1,2,3}; Px is the unique identification of the speaker at channel x, and "+" means "one or more". The utterances that are marked as Unlabeled receive the Unknown tag for the addressee. Also, if it is difficult to determine from all available sources to whom a dialogue act is addressed, the addressee of that dialogue act is marked as Unknown.

Gaze. Labeling gaze direction means labeling the gazed targets of each participant in a meeting. For addressee identification, the only targets of interest are the meeting participants. Therefore, our tag set consists of the speaker IDs (Px) and No-Target. Every time a participant changes her gaze target, this is marked with the new gaze target.


DA tag                              MRDA tags                                               %

Statements
s    Statement                      Statement                                               49.87

Questions
q    Information-Request            Wh-question, Y/N question, OR-question,
                                    Or Clause After Y/N question                            8.86
qo   Open-ended Question            Open-ended question                                     3.54
qh   Rhetorical Question            Rhetorical Question                                     0.25

Backchannels and Acknowledgements
k    Acknowledgement                Acknowledgment, Backchannel                             11.39
ba   Assessment/Appreciation        Assessment/Appreciation                                 1.77

Responses
rp-  Positive response              (Partial) Accept, Affirmative Answer                    11.14
rn-  Negative response              (Partial) Reject, Dispreferred and Negative Answer      2.28
ru-  Uncertain response             Maybe, No Knowledge                                     1.27

Action Motivators
al   Influencing-listeners-action   Command, Suggestion                                     3.29
as   Committing-speaker-action      Commitment, Suggestion                                  3.54

Checks
f    "Follow Me"                    "Follow Me"                                             0.00
br   Repetition Request             Repetition Request                                      0.25
bu   Understanding Check            Understanding Check                                     2.28

Politeness Mechanisms
fa   Apology                        Apology                                                 0.00
ft   Thanks                         Thanks                                                  0.25
fw   Welcome                        Welcome                                                 0.00
fo-  Other polite                   Downplayer, Sympathy, +                                 0.00

Table 1: Dialogue act annotation schema

Reliability of the annotation schema

To obtain valid research results, the data on which they are based must be reliable. Reliability is a function of the agreement achieved among annotators. Each meeting in our corpus has been annotated by exactly two annotators. The annotators were divided into two groups; each group annotated exactly five meetings. We examined inter-annotator agreement on dialogue act, adjacency pair and addressee annotations using two chance-corrected agreement coefficients, Kappa (κ) and Krippendorff's alpha (α): κ for DA and addressee annotation and α for AP [3,4]. Chance-corrected agreement coefficients can all be represented in the canonical form

c = 1 - Do/De

where Do is the observed disagreement and De is the disagreement expected by chance. When Do = 0 the annotators exhibit perfect agreement; when Do = De the agreement equals 0. The way the expected disagreement is calculated is what makes the difference between κ and α. The overall agreement for the first annotator group is: DA (κ = 0.77), addressee (κ = 0.80), AP (α = 0.82). The overall agreement for the second annotator group is: DA (κ = 0.69), addressee


(κ = 0.70), AP (α = 0.81). According to Krippendorff's scale (1), annotators in both groups reached agreement that allows tentative conclusions to be drawn regarding dialogue act and addressee annotation, and good agreement regarding adjacency pair annotation.

(1) Krippendorff's scale discounts any result with κ, α < 0.67, allows tentative conclusions when 0.67 < κ, α < 0.8, and definite conclusions when κ, α ≥ 0.8.
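As an illustration of the agreement computation, a minimal sketch of the chance-corrected coefficient c = 1 - Do/De in its Cohen's kappa form (the expected agreement here is derived from the two annotators' individual label distributions; this is a generic sketch, not the project's evaluation code):

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two annotators' label sequences
        over the same items (e.g. dialogue-act or addressee tags)."""
        assert len(labels_a) == len(labels_b) and labels_a
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        dist_a, dist_b = Counter(labels_a), Counter(labels_b)
        expected = sum((dist_a[c] / n) * (dist_b[c] / n) for c in dist_a)
        do, de = 1 - observed, 1 - expected
        return 1.0 if de == 0 else 1 - do / de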

Annotation tools and data formats

Annotations are made using two annotation tools developed at the University of Twente: DACoder and the CVL (Continuous Video Labeling) tool [5]. The tools are built on NXT (the NITE XML Toolkit) [1]. NXT uses a stand-off XML format which consists of inter-related XML files. The location and structure of those files are described in a "metadata" file. The NXT stand-off XML format enables the capture and efficient manipulation of complex hierarchical structures.

Examples of addressing in the M4 data collection

An example from one of the meetings is shown in Figure 1 as an illustration of some of the types of interactions we observed in the corpus.

<DA sp="p3", type="Open-ended-question", add="ALLP"> so what did everyone come up with <\DA>
<DA sp="p3", type="Open-ended-question", add="p0"> Vivek what ideas did you have? <\DA>
<DA sp="p0", type="Positive response", add="p3"> I excellently agree with that <\DA>
<DA sp="p0", type="Statement", add="ALLP"> but I I think it's not feasible because you know people are so much idealistic! and what kind of programming … programming platform you want to create <\DA>
<DA sp="p1", type="Acknowledgment", add="p0"> yeah <\DA>

Figure 1: An example of addressing in the M4 data collection

Who is talking to whom

From the annotated corpus, we selected five meetings for training and testing our classifiers. Table 2 shows, for each of the five meetings, the number of dialogue acts performed by each of the participants, as well as the number of times each speaker addressed a dialogue act to each of the other participants, to a subgroup of people, or to the whole audience. Table 2 shows that addressing a subgroup of participants is very rare. Moreover, the percentage of dialogue acts that were addressed to all participants is the biggest over all five meetings. The results also show that there is no relation between the talkativeness of participants (in terms of the number of dialogue acts they performed) and the frequency with which participants are addressed.

Automatic addressee classification

In this section we present our initial results on automatic addressee classification using Bayesian Networks and Naïve Bayes. A Bayesian network B over a set of variables U consists of two parts: a directed acyclic graph Bs that represents conditional independence assumptions among the variables in U, and a set of probability distributions Bp associated with the graph.


Meeting      Speaker   p0   p1   p3   p4   ALLP  SubG  UNK   total
m4-1         p0        0    0    1    2    3     0     0     6
             p1        1    0    3    0    2     0     2     8
             p3        1    4    0    11   15    0     2     33
             p4        4    0    10   0    7     0     0     21
             total     6    4    14   13   27    0     4     68
m4-9         p0        0    13   9    6    2     0     4     34
             p1        6    0    3    1    3     0     0     13
             p3        7    2    0    6    26    0     5     46
             p4        2    1    8    0    4     0     4     19
             total     15   16   20   13   35    0     13    112
m4-tst-3     p0        0    6    4    2    20    0     0     32
             p1        3    0    0    3    11    0     2     19
             p3        2    2    0    1    1     1     0     7
             p4        4    1    0    0    3     0     0     8
             total     9    9    4    6    35    1     2     66
m4-tst-14    p0        0    5    2    2    1     0     0     10
             p1        5    0    5    4    27    0     1     42
             p3        1    3    0    5    0     0     0     9
             p4        1    5    3    0    0     0     0     9
             total     7    13   10   11   28    0     1     70
m4-tst-26    p0        0    10   0    7    5     0     0     22
             p1        4    0    6    5    20    0     4     39
             p3        0    6    0    2    4     0     0     12
             p4        5    10   3    0    3     0     2     23
             total     9    26   9    14   32    0     6     96
all          total     46   68   57   57   157   1     26    412

Table 2: The distribution of addressee values for each speaker in the 5 selected meetings from the M4 data collection (SubG: a subgroup of participants, UNK: the Unknown tag). The last column shows the total number of dialogue acts performed by each participant.

Bp = {p(u | pa(u)) : u ∈ U}, where pa(u) is the set of parents of node u in Bs. A Bayesian network represents a decomposition of the full joint probability distribution over U:

P(U) = ∏_{u∈U} p(u | pa(u)).

When using a Bayesian network as a classifier, one node in the network represents the class variable X and the other nodes represent feature variables. Classification with a Bayesian network operates by maximizing the probability of the class variable given the feature values, over all values of the class variable:

c = argmax_x P(X = x | U) = argmax_x α · p(X = x) · ∏_{u} p(u | pa(u)),

where α is a normalizing constant.

A Naïve Bayes classifier is a simplified version of a Bayesian network classifier, in which all features are considered conditionally independent given the class variable. This assumption gives

P(U | c) = ∏_{u} p(u | c).

In a dialogue situation, an event that lasts as long as the dialogue act performed by the speaker in that situation, the class variable is the addressee of the dialogue act performed by the current speaker (ADD). Since only a few instances of addressing a subgroup of people are present in the data, we removed them from the data set and excluded all possible subgroups of meeting participants from the set of class values. Therefore, we define addressee classifiers that identify one of the following values: Px, where x ∈ {0,1,2,3}, and ALLP. ALLP denotes the whole audience: addressee values that include three participants, such as (P1, P2, P3) or (P0, P2, P3), are grouped into the ALLP category, since they represent the audience of speakers P0 and P1, respectively.
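For concreteness, the following is a minimal sketch of a categorical Naïve Bayes addressee classifier of the kind described above. The feature names in the example are illustrative, and the smoothing mirrors the estimate given later in this section; the actual experiments used the WEKA toolbox, not this code:

    import math
    from collections import Counter, defaultdict

    class NaiveBayesAddressee:
        """Features are assumed conditionally independent given the addressee
        class (P0..P3 or ALLP). Each instance is a dict of categorical features,
        e.g. {"SP": "p3", "DA": "q", "PP": "you", "SP_look_P0": "two"}."""

        def __init__(self, alpha=0.5):
            self.alpha = alpha  # small Dirichlet prior on the counts

        def fit(self, X, y):
            self.class_counts = Counter(y)
            self.counts = defaultdict(Counter)   # (class, feature) -> value counts
            self.values = defaultdict(set)       # feature -> set of observed values
            for x, c in zip(X, y):
                for f, v in x.items():
                    self.counts[(c, f)][v] += 1
                    self.values[f].add(v)
            return self

        def _log_posterior(self, x, c):
            total = sum(self.class_counts.values())
            lp = math.log(self.class_counts[c] / total)
            for f, v in x.items():
                cnt = self.counts[(c, f)]
                num = cnt[v] + self.alpha
                den = sum(cnt.values()) + self.alpha * max(len(self.values[f]), 1)
                lp += math.log(num / den)
            return lp

        def predict(self, x):
            return max(self.class_counts, key=lambda c: self._log_posterior(x, c))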

Features

We used three sorts of features: utterance features, gaze features and contextual features. As utterance features, we used a lexical marker (PP) which represents whether the utterance contains the personal pronoun we, the personal pronoun you, both of them, or neither of them; the conversational function of the utterance (DA label); and the duration of the utterance (Short: true, false). If an utterance's duration is less than one second, the utterance is considered a short one. As contextual features, we used features that relate to the conversational history. They include the following:

o The speaker, addressee and conversational function of the immediately preceding utterance on the same or a different channel (SP-1, ADD-1 and DA-1)

o The speaker, addressee and conversational function of the related utterance (SP-R, ADD-R and DA-R). A related utterance is an utterance that is the source of the AP whose target is the current utterance.

o The speaker of the current utterance (SP)

We experimented with a variety of gaze features. In the first experiment, for each participant Px we defined a set of features of the form Px_look_Py and Px_look_NT, where x, y ∈ {0,1,2,3} and x ≠ y; NT (No-Target) represents that the participant looks away. The feature values represent the number of times the participant looks at the corresponding target during the time span of the utterance: "zero" for 0, "one" for 1, "two" for 2 and "more" for 3 or more times. In the second experiment, we defined a feature set that incorporates only the information about the gaze direction of the current speaker. This feature set includes the gaze features SP_look_Px and SP_look_NT, where x ∈ {0,1,2,3}. The value set is the same as in the first experiment.

Results

The instances where the addressee is labeled with the Unknown tag or a subgroup tag were removed from the set. Of the 390 remaining instances, 300 (77%) are used as a training set and 90 (23%) as a test set. We constructed our networks from the empirical data using machine learning techniques. For learning the network structure, we applied the well-known search-and-score approach, the K2 algorithm [6]. For learning the conditional probability distributions, we used direct estimates of the conditional probabilities as implemented in the WEKA toolbox [7], that is

p(xi = k | pa(xi) = j) = (Nijk + N'ijk) / (Nij + N'ij)

where Nijk is the number of samples for which xi takes its k-th value and pa(xi) takes its j-th value; Nij is the number of samples for which pa(xi) takes its j-th value; and N' is the alpha parameter, set equal to 0.5.

For each classifier, we performed 10-fold cross-validation on the training set. Then, we re-evaluated the classifiers on the test set. As a baseline, we chose a simple classifier that always guesses the most likely class. In Table 3 we summarize the overall accuracy results for both experimental setups.
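A minimal sketch of this direct conditional probability estimate, under our own data layout (a list of dicts mapping variable names to values); the real experiments used the estimator built into WEKA:

    from collections import Counter

    def estimate_cpt(samples, child, parents, alpha=0.5):
        """Estimate p(child = k | parents = j) as (N_jk + alpha) / (N_j + alpha * K),
        where K is the number of values the child variable takes; with alpha = 0.5
        this mirrors the formula above."""
        child_values = sorted({s[child] for s in samples})
        joint = Counter((tuple(s[p] for p in parents), s[child]) for s in samples)
        parent_counts = Counter(tuple(s[p] for p in parents) for s in samples)
        cpt = {}
        for j in parent_counts:
            for k in child_values:
                cpt[(j, k)] = (joint[(j, k)] + alpha) / (
                    parent_counts[j] + alpha * len(child_values))
        return cpt

    # Example: estimate_cpt(training_instances, child="ADD", parents=["SP", "DA"])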


                10-fold cross-validation                      Test set
                Naïve Bayes  Bayesian Network  Baseline       Naïve Bayes  Bayesian Network  Baseline
Experiment 1    73.90        77.63             42.37          70.00        77.78             35.56
Experiment 2    72.88        76.61             42.37          71.11        80.00             35.56

Table 3: Accuracy for the addressee classification

The results show that in all cases all classifiers show a significant improvement in comparison to the chosen baseline. The performance of the Bayesian Networks is slightly above that of the Naïve Bayes classifiers. This only small increase in performance can be explained by there not being enough data to learn the conditional probabilities and the structural dependences among the nodes in the Bayesian Networks. An interesting observation is that both classifiers performed better on the test set in the second experiment, where only the information about the speaker's gaze direction is taken into consideration. This mild increase in performance in the second experiment can be explained by the fact that a speaker gazes at his selected addressees more than at the side participants. On the other hand, the side participants spend more time looking at the speaker, i.e. at the participant they listen to, than at the addressed participants. Therefore, the information about the gaze direction of the side participants mostly reflects who the speaker is rather than who is being addressed. Apart from the overall accuracy, we report Kappa as an added measure to assess how well the classifiers agree with the hand-annotated data. The Kappa coefficient indicates how much the agreement is above the agreement expected by chance. It provides a better measure of the accuracy of a classifier than the overall accuracy, since it takes into account the whole confusion matrix instead of only the diagonal elements, as the overall accuracy does. Table 4 shows the Kappa values for both classifiers.

                10-fold cross-validation                 Test set
                Naïve Bayes  Bayesian Network            Naïve Bayes  Bayesian Network
Experiment 1    0.63         0.69                        0.61         0.71
Experiment 2    0.61         0.68                        0.61         0.73

Table 4: Kappa values for the addressee classification

The Kappa values indicate that the Bayesian Network classifiers reach higher agreement with the manual classifications than the Naïve Bayes classifiers. Furthermore, according to Krippendorff's scale the results of the Bayesian Network classifications allow tentative conclusions to be drawn; according to a less restrictive scale such as that of Landis and Koch, both classifiers in all cases reached substantial agreement (0.60 < κ < 0.81) [9]. The confusion matrix of the classifier with the best performance on the test set is shown in Table 5.

        p0   p1   p2   p3   ALLP
p0      7    0    2    0    0
p1      0    21   0    0    5
p2      1    0    7    0    1
p3      0    0    1    12   1
ALLP    2    2    1    2    25

Table 5: Confusion matrix of the identified addressees. Rows and columns represent actual and predicted labels, respectively.

The participant numbers refer to the fixed positions the participants occupy at the rectangular meeting table. The positions are numbered the same in all meetings. So, Table 5


shows that no confusions are made between addressees sitting at positions p0 and p3, p0 and p1, p1 and p3, or p1 and p2. Most confusions are between single addressing and group addressing.

Conclusion

In the sections above, we have presented our research on automatic addressee identification using Bayesian Networks. Due to the limitations of our data collection, we explored only two types of addressees: the addressee as a single participant and the addressee as the whole group. The preliminary experiments show that even with a small set of data, Bayesian network classifiers for addressee prediction perform well. We will continue this research on automatic addressee identification using more data from the more natural meetings that are being recorded within the AMI project.

References

[1] J. Carletta, S. Evert, U. Heid, J. Kilgour, J. Robertson, and H. Voormann, The NITE XML Toolkit: flexible annotation for multi-modal language data, Behavior Research Methods, Instruments, and Computers, 35(3):353–363, 2003.
[2] R. Dhillon, S. Bhagat, H. Carvey, and E. Shriberg, Meeting recorder project: Dialogue act labeling guide, Technical report, ICSI Speech Group, Berkeley, USA, 2003.
[3] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20:37–46, 1960.
[4] K. Krippendorff, Content analysis: An introduction to its methodology, Sage Publications, 1980.
[5] D. Reidsma, N. Jovanovic and D.H.W. Hofs, Designing annotation tools based on properties of annotation problems, Technical Report TR-CTIT-04-45, CTIT, 2004.
[6] G. Cooper and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning, 9:309–347, 1992.
[7] http://www.cs.waikato.ac.nz/~ml/weka/
[8] N. Jovanovic and R. op den Akker, Towards automatic addressee identification in multi-party dialogues, 5th SIGdial Workshop on Discourse and Dialogue, pages 89–92, 2004.
[9] J. Landis and G. Koch, The measurement of observer agreement for categorical data, Biometrics, 33:159–174, 1977.
[10] E. Goffman, Replies and Responses, Language in Society, 5 (1976):257–313; reprinted in: Forms of Talk, University of Pennsylvania Press, Philadelphia, 1981.
[11] E. Goffman, Forms of Talk, University of Pennsylvania Press, Philadelphia, 1981.
[12] H.H. Clark and T.B. Carlson, Hearers and speech acts, in: Arenas of Language Use, pages 332–372, 1992.


4 Multimodal recognition of phonemes in meeting data

Phoneme recognition is one of the key tasks of VUT Brno in M4. This section presents the results of audio-visual phoneme recognition on meeting data [1, 2].

4.1 Feature extraction

The audio parameters are the well-known Mel filterbank log energies (23 bands, window size 20 ms, audio frame rate Fafr = 100 Hz). These parameters are extracted from beamformed audio recordings sampled at 16 kHz.

Prior to the extraction of visual features, we need to process the original video stream in order to detect and track the faces (heads) of the humans (objects) in the meetings. The method employed is based on skin color detection. The visual input in our experiments is a video stream which is supposed to contain a sequence of head poses of one human subject. The video frame rate is Fvfr = 25 Hz and a 70 × 70 pixel input region is obtained for every video frame. Practically, each video frame contains the whole speaker's head, including the hair and neck. After detecting the mouth area regions, the visual features tested were:

• Average brightness of region-of-interest (ROI) (see Fig. 1 for illustration).

• Discrete cosine transform (DCT) coefficients of the ROI.

• Optical flow analysis coefficients.

We also worked on an algorithm detecting the lip position (as the center of the mouth) in the given image frame:

• Red pixels are detected on the face (this operation processes the input image in order to correctly locate the mouth pixels).

• Selection of the largest red area with a "seed algorithm". First, several erosions are applied to the binary image until only a few white pixels remain. Then, the remaining white pixels are used as a "seed" which is extended to all the surrounding white pixels of the binary image.
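A minimal sketch of this seed-based selection, under our own assumptions about the red-pixel test and the stopping criterion (the threshold values and the returned mouth-centre convention are illustrative):

    import numpy as np
    from scipy import ndimage

    def largest_red_region_centre(rgb, red_ratio=1.2, max_seed_pixels=20):
        """Locate the mouth centre as described above:
        1) mark 'red' pixels (red channel clearly dominating green and blue),
        2) erode the binary mask until only a few seed pixels remain,
        3) grow the seeds back to their connected red area and return its centre."""
        r = rgb[..., 0].astype(float)
        g = rgb[..., 1].astype(float)
        b = rgb[..., 2].astype(float)
        mask = (r > red_ratio * g) & (r > red_ratio * b)

        seeds = mask.copy()
        while seeds.sum() > max_seed_pixels:
            eroded = ndimage.binary_erosion(seeds)
            if not eroded.any():
                break
            seeds = eroded

        labels, _ = ndimage.label(mask)            # connected components of the red mask
        seed_labels = set(labels[seeds]) - {0}
        if not seed_labels:
            return None
        region = np.isin(labels, list(seed_labels))
        ys, xs = np.nonzero(region)
        return float(xs.mean()), float(ys.mean())  # (x, y) of the mouth centre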

4.2 Experiments and results

The audio-visual speech database collected in M4 has been used. For experimental purposes, the data is split into three sets: training and cross-validation (CV) (together 41 minutes) and testing (9 minutes). The first two sets are used to train artificial neural nets (NN). The test data is then forward-passed through the trained NN.

Acoustic features (Mel-filterbank log energies) are computed at Fafr = 100 Hz. The derivation of visual features is the major part of our experiments. Each input video sequence is first processed by the head tracking algorithm. The resulting visual data is the input to the following experiments:

A. One visual feature of average brightness from the ROI is derived for each video frame.

Figure 1: Detection of ROI (mouth area) using correlation technique.


[Block diagram: the acoustic signal (16 kHz) undergoes acoustic parameterization into acoustic features (23 dim., 100 Hz); the visual signal (25 Hz) is converted to gray scale, resized, edge-filtered, 2D cross-correlated, maximum-square cropped, low-pass filtered and 2D-DCT transformed into visual features (16 dim., 25 Hz), which are interpolated to 100 Hz; feature fusion then yields acoustic-visual features (39 dim., 100 Hz).]

Figure 2: A/V phoneme recognition system.

B. 16 DCT coefficients are extracted from the previously detected ROI (the 4 lowest DCTs in each dimension). The scheme of the system including DCT features is shown in Fig. 2.
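A minimal sketch of this visual feature extraction (the gray-scale ROI is assumed to be already cropped to a fixed size; scipy's DCT is our choice of implementation):

    import numpy as np
    from scipy.fftpack import dct

    def dct_features(roi, n=4):
        """2D DCT of the gray-scale mouth ROI, keeping the n x n lowest-frequency
        coefficients (4 x 4 = 16 values, as in variant B)."""
        roi = np.asarray(roi, dtype=float)
        coeffs = dct(dct(roi, type=2, norm="ortho", axis=0), type=2, norm="ortho", axis=1)
        return coeffs[:n, :n].flatten()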

C. In the experiments with optical flow analysis, the ROI does not have to be detected. This analysis is applied directly on the sequence of the input video frames. Finally, three visual features are computed: horizontal and vertical variances of flow vector components and their covariance. These features are supposed to indicate the movement of the speaker's mouth. They are especially useful for estimating silence periods.

D. The ROI has also been found by the lip locating algorithm based on edge detection and color filtering. From the detected ROI, 16 DCT coefficients are derived.

E. We have performed several experiments with an intra-frame linear discriminant analysis (LDA), which is supposed to improve the separation of the speech classes and also provides a second-stage dimensionality reduction, the first-stage reduction being performed by principal component analysis (PCA). Each video frame (32 × 64), containing the lower half of the tracked head position, is projected onto the first 512 PCA bases. This data is used to estimate the LDA statistics. The final transformation matrix is obtained by multiplying the PCA and LDA matrices. The visual features used in the recognizer are obtained by projecting the input video frames onto the first 45 PCA-LDA bases.

F. We have also experimented with applying the PCA-LDA transform directly to the ROI (mouth poses detected by our correlation-based approach). In this case, mouth regions (16 × 34) are used to estimate the PCA (256 bases) and PCA-LDA statistics. Finally, projection onto the first 45 PCA-LDA bases is performed.
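A minimal sketch of the two-stage PCA-LDA projection used in variants E and F; the dimensions follow the text, while the use of scikit-learn (rather than the project's own implementation) is our assumption:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def fit_pca_lda(frames, labels, n_pca=256, n_out=45):
        """frames: flattened gray-scale images (e.g. 16x34 mouth regions);
        labels: per-frame phoneme classes. Returns a function projecting a new
        frame onto the first n_out PCA-LDA bases."""
        X = np.asarray(frames, dtype=float).reshape(len(frames), -1)
        pca = PCA(n_components=n_pca).fit(X)                  # first-stage reduction
        lda = LinearDiscriminantAnalysis(n_components=n_out)  # class-discriminative stage
        lda.fit(pca.transform(X), labels)
        return lambda frame: lda.transform(
            pca.transform(np.asarray(frame, dtype=float).reshape(1, -1)))[0]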

In all experiments, the acoustic and interpolated visual features are merged to build N-dimensional audio-visual feature vectors. The evaluation of the different audio-visual features was done on a phoneme set that consists of 46 phonemes. In addition, there were two classes for silence and for the gap (a part of the speech recording belonging to a different speaker). The recognition system is a simple NN, based on the forward-backward algorithm, employing a three-layer perceptron with a softmax non-linearity at the output. The hidden layer consists of 200 neurons with sigmoid non-linearities.
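As a rough stand-in for such a classifier, a scikit-learn sketch with one hidden layer of 200 sigmoid units and a softmax output over the 46 phonemes plus the silence and gap classes (this is our choice of toolkit and does not reproduce the forward-backward training of the original system):

    from sklearn.neural_network import MLPClassifier

    # 39-dim audio-visual vectors in, 48 frame-level classes out.
    clf = MLPClassifier(hidden_layer_sizes=(200,), activation="logistic", max_iter=300)
    # clf.fit(train_features, train_frame_labels)
    # frame_accuracy = clf.score(cv_features, cv_frame_labels)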

To evaluate our various audio-visual feature extraction algorithms, we observe a) the frame-based phoneme accuracy on the CV set and b) the frame-based phoneme accuracy on the forward-passed test data. Results are given in Tab. 1.


Feature extraction        Acc. CV [%]   Acc. FWD [%]   N

Audio only                28.9          31.0           23
Average brightness        29.0          31.5           23+1
DCT coefficients          28.3          31.3           23+16
Optical flow analysis     31.4          31.1           23+3
Seed algorithm            28.8          31.3           23+16
PCA-LDA (head)            26.7          31.8           23+45
PCA-LDA (mouth)           27.6          32.5           23+45

Table 1: Experimental results for A/V phoneme recognition. The last column (N) gives the total size of the vector of acoustic and visual features.

4.3 Conclusions

Experimental results with multimodal features are compared to the acoustic parameters only (the baseline). The obtained results, expressed as frame-based phoneme accuracies, show an improvement over the baseline, but this improvement is smaller than expected. This is caused by:

• the resolution of the input images used to derive video features is very low (several pixels for the mouth area).

• the video data is noisy, with mouth areas hidden or distorted (partially occluded, turned to a bad angle, varying lighting conditions). We did not have any reliable parameter describing the "cleanness" of the video data.

• due to these conditions, the head tracking algorithm as well as the mouth detection method are still not very reliable.

• we concentrated on the extraction of video features, but the classifier itself was quite rudimentary. The temporal information from the sequence of video frames should help the classification.

• additional work was done on the derivation of mathematical models of the speaker's mouth, whose parameters could later have been used for classification. Unfortunately, this did not give us any recognition improvement either.

A/V recognition typically outperforms standard audio-only methods in adverse environments (car, industry). As the meeting room environment is quite clean, we conclude that A/V recognition at the given image resolution does not bring a significant improvement. The efforts in this direction were therefore stopped until better quality meeting data with close-up face captures (as recorded within AMI) becomes available.

References

[1] P. Motlicek, L. Burget, J. Cernocky: Phoneme Recognition of Meetings using Audio-Visual Data, poster at MLMI'04 Workshop, Martigny, CH, 2004.

[2] P. Motlicek, L. Burget, J. Cernocky: Multimodal Phoneme Recognition of Meeting Data, in: 7th International Conference, TSD 2004, Brno, Czech Republic, September 2004, Proceedings, Springer, 2004, pp. 379–384, ISBN 3-540-23049-1.


Table 2: Group Action Lexicon 1

Action         Description
Discussion     most participants engaged in conversations
Monologue      one participant speaking continuously without interruption
Note-taking    most participants taking notes
Presentation   one participant presenting and using the projector screen
White-board    one participant speaking and using the white-board

5 Multimodal Integration of Feature Streams for Meeting Group Action Segmentation and Recognition

5.1 Introduction

In this section, we give an overview of group action recognition in meetings. Section 5.2 describes the tasks for meeting action recognition. Section 5.3 and Section 5.4 describe the action lexicon we defined and the data set we used. Next, we summarize the extracted features in Section 5.5 and the developed models in Section 5.6. We then present the performance measures in Section 5.7 and individual results and discussions in Section 5.8. Finally, we present an overall discussion in Section 5.9.

5.2 Tasks

- Task-1: group action classification (group action boundaries given). Research institutes involved:

– TUM

- Task-2: continuous group action segmentation and recognition. Research institutes involved:

– IDIAP, TUM and EDIN

5.3 Action Lexicon

We have defined two sets of meeting actions. The first set includes 8 meeting actions, shown in Table 2. The second set includes 14 group actions, as in Table 3. Note that we differentiate monologue events by participant, i.e., monologue1 is a monologue by meeting participant 1, etc.

5.4 Data Set

We have collected a corpus of 59 five-minute, four-participant meetings [17] in the IDIAP meeting room, which is equipped with cameras and microphones (1). There are three cameras in the meeting room: two cameras capture a frontal view of the meeting participants, and the third camera captures the white-board and the projector screen. Audio was recorded using lapel microphones attached to the participants and an eight-microphone array in the center of the table.

30 meetings, named following the pattern 'Scripted-Meeting-TRN-NN', are used for training; 29 meetings, named following the pattern 'Scripted-Meeting-TST-NN', are used for testing.



Table 3: Group Action Lexicon 2

Action                      Description
Discussion                  most participants engaged in conversations
Monologue                   one participant speaking continuously without interruption
Monologue+Note-taking       one participant speaking continuously, others taking notes
Note-taking                 most participants taking notes
Presentation                one participant presenting using the projector screen
Presentation+Note-taking    one participant presenting using the projector screen, others taking notes
White-board                 one participant speaking using the white-board
White-board+Note-taking     one participant speaking using the white-board, others taking notes

5.5 Features

5.5.1 TUM Features

Static features. The first features were extracted from the hand-made annotations and put together into a static feature vector for one meeting event. Later we used the results of 'specialized' recognizers such as the speaker turn detection from Guillaume Lathoud [15] and the gesture recognizer from Martin Zobl [34]. All candidate features were normalized to the length of the meeting event, so that each value gives the relative duration of that particular activity. In a monologue, for example, the 'talking' value of the participant delivering the speech would be close to one, since he is the only one talking during the whole meeting event. In the same way all other entries in the annotation were normalized, so that all values lie in the range 0 to 1 for each participant. From all available items only those that are highly discriminative were chosen. This resulted in a nine-dimensional feature vector as shown in Table 4.
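As an illustration only, the duration normalization can be sketched as follows; the interval list and the event window below are hypothetical examples, not the actual M4 annotation format:

# Minimal sketch: relative-duration feature for one meeting event.
def relative_duration(intervals, event_start, event_end):
    """Fraction of the event [event_start, event_end] covered by the given intervals."""
    event_len = float(event_end - event_start)
    covered = 0.0
    for start, end in intervals:
        overlap = min(end, event_end) - max(start, event_start)
        if overlap > 0:
            covered += overlap
    return covered / event_len  # value in [0, 1]

# Example: hypothetical 'talking' annotation (in seconds) for one participant
talking_p1 = [(12.0, 30.5), (41.0, 55.0)]
print(relative_duration(talking_p1, event_start=10.0, event_end=70.0))  # ~0.54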

Dynamic features. Feature vectors have been extracted from the audio-visual stream. In the meeting room the four persons are expected to be at one of six different locations: one of the four chairs, the whiteboard, or the presentation position:

L = \{C_1, C_2, C_3, C_4, W, P\}    (1)

This information has been used to extract position-dependent audio and visual features. Furthermore, the signals from the lapel microphones have been used to add speaker-dependent audio features. The final observation vector O(t) is a concatenation of the audio and visual features, resulting in a 68-dimensional vector. For the HMM approaches the audio feature rate was adjusted to 12.5 Hz to match the visual feature stream. For the DBN approach, adjusting the frame rate was not necessary.

* Audio Features. For each of the speakers, four MFC coefficients and the energy were extracted from the lapel microphones. This results in a 20-dimensional vector x_S(t) containing speaker-dependent information. A binary speech and silence segmentation (BSP) for each of the six locations in the meeting room was extracted with the SRP-PHAT measure [17] from the microphone array. This results in a six-dimensional discrete vector x_BSP(t) containing position-dependent information.


Table 4: Features of the static feature vector

talking(1)        rel. duration of talking of person 1
talking(2)        rel. duration of talking of person 2
talking(3)        rel. duration of talking of person 3
talking(4)        rel. duration of talking of person 4
Σ whiteboard      sum of all activities in front of the whiteboard
Σ presentation    sum of all activities near the projector screen
Σ writing         sum of all writing activities
Σ standing        sum of all standing activities
talkdist          difference of the relative duration of the participant who talks most to the participant who talks second most

* Visual Features. For each of the six locations L in the meeting room a difference image sequence I^L_d(x, y) is calculated by subtracting the pixel values of two subsequent frames of the video stream. Then seven global motion features [29, 34] are derived from the image sequence. The center of motion is calculated for the x- and y-direction according to:

m^L_x(t) = \frac{\sum_{(x,y)} x \cdot |I^L_d(x,y,t)|}{\sum_{(x,y)} |I^L_d(x,y,t)|}
\quad\text{and}\quad
m^L_y(t) = \frac{\sum_{(x,y)} y \cdot |I^L_d(x,y,t)|}{\sum_{(x,y)} |I^L_d(x,y,t)|}    (2)

The changes in motion are used to express the dynamics of movements:

\Delta m^L_x(t) = m^L_x(t) - m^L_x(t-1)
\quad\text{and}\quad
\Delta m^L_y(t) = m^L_y(t) - m^L_y(t-1)    (3)

Furthermore, the mean absolute deviation of the pixels relative to the center of motion is computed:

\sigma^L_x(t) = \frac{\sum_{(x,y)} |I^L_d(x,y,t)| \cdot |x - m^L_x(t)|}{\sum_{(x,y)} |I^L_d(x,y,t)|}
\quad\text{and}\quad
\sigma^L_y(t) = \frac{\sum_{(x,y)} |I^L_d(x,y,t)| \cdot |y - m^L_y(t)|}{\sum_{(x,y)} |I^L_d(x,y,t)|}    (4)

Finally, the intensity of motion is calculated from the average absolute value of the motion distribution:

i^L(t) = \frac{\sum_{(x,y)} |I^L_d(x,y,t)|}{x \cdot y}    (5)

These seven features are concatenated for each time step into the location-dependent motion vector

x^L(t) = [\, m^L_x(t),\ m^L_y(t),\ \Delta m^L_x(t),\ \Delta m^L_y(t),\ \sigma^L_x(t),\ \sigma^L_y(t),\ i^L(t) \,]^T    (6)

With this motion vector the high-dimensional video stream is reduced to a seven-dimensional vector that still preserves the major characteristics of the currently observed motion. Concatenating the motion vectors x^L(t) from each of the six positions leads to the final visual feature vector

x^V(t) = [\, x^{C_1}(t),\ x^{C_2}(t),\ x^{C_3}(t),\ x^{C_4}(t),\ x^W(t),\ x^P(t) \,]^T    (7)

that describes the overall motion in the meeting room with 42 features.
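For illustration, the seven global motion features of Eqs. (2)-(5) for one location can be sketched in a few lines of numpy; the frame arrays and the previous center of motion are placeholders, and this is a sketch rather than the exact project implementation:

import numpy as np

def global_motion_features(frame_prev, frame_cur, m_prev=(0.0, 0.0)):
    """Seven global motion features from two consecutive grayscale frames (2-D arrays)."""
    d = np.abs(frame_cur.astype(float) - frame_prev.astype(float))   # |I_d(x, y, t)|
    total = d.sum()
    if total == 0:
        return np.zeros(7)
    ys, xs = np.indices(d.shape)
    mx = (xs * d).sum() / total                   # center of motion, x (Eq. 2)
    my = (ys * d).sum() / total                   # center of motion, y
    dmx, dmy = mx - m_prev[0], my - m_prev[1]     # changes in motion (Eq. 3)
    sx = (d * np.abs(xs - mx)).sum() / total      # mean absolute deviation, x (Eq. 4)
    sy = (d * np.abs(ys - my)).sum() / total      # mean absolute deviation, y
    intensity = total / d.size                    # intensity of motion (Eq. 5)
    return np.array([mx, my, dmx, dmy, sx, sy, intensity])

Concatenating the seven-dimensional outputs for the six locations then yields the 42-dimensional visual vector of Eq. (7).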


5.5.2 IDIAP Features

There are two types of AV features extracted by IDIAP: person-specific AV features and group-level AV features. The former are extracted from individual participants. The latter are extracted from the whiteboard and projector screen regions.

Person-Specific AV Features. Person-specific visual features were extracted from the cameras that have a close view of the participants. Person-specific audio features were extracted from the lapel microphones attached to each person, and from the microphone array. The complete set of features is listed in Table 5.

Person-specific visual features. For each video frame, the raw image is converted to a skin-color likelihood image, using a 5-component skin-color Gaussian mixture model (GMM). We use the chromatic color space, which is known to be less sensitive to the skin color of different people [31]. The chromatic colors are defined by a normalization process: r = R/(R+G+B), g = G/(R+G+B). Skin pixels were then classified by thresholding the skin likelihood. A morphological post-processing step was performed to remove noise. The skin-color likelihood image is the input to a connected-component algorithm (flood filling) that extracts blobs. All blobs whose areas are smaller than a given threshold were removed. We use 2-D blob features to represent each participant in the meeting, assuming that the extracted blobs correspond to human faces and hands. First, we use a multi-view face detector to verify blobs corresponding to the face. The blob with the highest confidence output by the face detector is recognized as the face. Among the remaining blobs, the one with the rightmost centroid horizontal position is identified as the right hand (we only extracted features from the right hands, since the participants in the corpus are predominantly right-handed). For each person, the detected face blob is represented by its vertical centroid position and eccentricity [26]. The hand blob is represented by its horizontal centroid position, eccentricity, and angle. Additionally, the motion magnitudes of the head and right hand are also extracted and summed into one single feature.

Person-specific audio features. Using the microphone array and the lapels, we extracted two types of person-specific audio features. On one hand, speech activity was estimated at the four seated locations from the microphone array waveforms. The seated locations were fixed 3-D vectors measured on-site. The speech activity measure was SRP-PHAT [7], which is a continuous, bounded value that indicates the activity at a particular location. On the other hand, three acoustic features were estimated from each lapel waveform: energy, pitch and speaking rate. We computed these features on speech segments, setting a value of zero on silence segments. Speech segments were detected using the microphone array, because it is well suited for multiparty speech. We used the SIFT algorithm [16] to extract pitch, and a combination of estimators [18] to extract speaking rate.

Group AV Features. Group AV features were extracted from the white-board and projector screen regions, and are listed in Table 5.

Group visual features. These were extracted from the camera that looks towards the white-board and projector screen area. We first get difference images between a reference background image and the image at each time instant, in the white-board and projector screen regions. On these difference images, we use the average intensity over a grid of 16 x 16 blocks as features.

Group audio features. These are SRP-PHAT features extracted using the microphone array from two locations corresponding to the white-board and projector screen.

5.5.3 EDIN Features

We used three classes of features: prosodic features, speaker activity features, and lexical features. We have based our work mainly on the speech and audio modalities, since these are the most informative in meetings.


Figure 3: Overview of the lexical feature generation process

Prosody. The prosodic features are based on a denoised and stylised version of the intonation contour [25], an estimate of the syllabic rate of speech [19], and the energy. These acoustic features form a 12-dimensional feature vector (3 features for each of the 4 speakers); they highlight the currently active speakers and may indicate each participant's level of engagement in the conversation. In order to cope with the high level of cross-talk between audio channels, each feature set was forced to zero if the corresponding speaker was not active.

Speaker activity features. Information about the locations of the active speakers was extracted using a sound source localization process based on a microphone array [17]. A 216-element feature vector resulted from all the 6^3 possible products of the 6 most probable speaker locations (four seats and two presentation positions) over the most recent three frames [8]. A speaker activity feature vector at time t thus gives a local sample of the speaker interaction pattern in the meeting around time t.
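One plausible reading of this construction, sketched below purely for illustration, takes the per-location activity values of the three most recent frames and forms all 6 x 6 x 6 products; the exact construction used in [8] may differ in detail:

import numpy as np

def speaker_activity_features(act_t, act_t1, act_t2):
    """216-element vector: all 6^3 products of per-location activities over three frames.

    act_t, act_t1, act_t2 are length-6 arrays of speaker-activity values
    (four seats and two presentation positions) at times t, t-1 and t-2.
    """
    outer = np.einsum('i,j,k->ijk', act_t, act_t1, act_t2)
    return outer.ravel()   # shape (216,)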

Lexical features. In addition to the paralinguistic features outlined above, we also used a set of lexical features extracted from the word-level transcription [9]. A transcript is available for each speaker, resulting in a sequence of words. In these preliminary experiments we have used human-generated transcriptions.

To correlate the low-level text transcriptions with high-level "meeting phases" (monologues and discussions), the system outlined in Figure 3 has been adopted. Our approach is based on unigram language models with a multinomial distribution over words, used to model the monologue class M1 and the discussion or dialogue class M2 (although these principles are valid for the other actions also). The sequence of words (from the transcript under test) is compared with each model Mk, and each word w is classified as a member of the class k which provides the highest mutual information I(w; M_k):

k(w) = \arg\max_{k \in K} I(w; M_k)

The sequence of symbols k is very noisy, and the true classification output is hidden by a cloud of misclassified words. To address this drawback we compute a smoothed version of k that uses only the most frequent symbols. The resulting symbol sequence may be used to discriminate between monologues and discussions with an accuracy of 93.6% (correctly classified words).
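A minimal sketch of this word-level classification is given below. It uses Laplace-smoothed unigram models and the pointwise score log P(w|M_k) - log P(w) as a stand-in for the mutual-information criterion; function names and smoothing constants are illustrative, not the project's actual implementation:

import math
from collections import Counter

def train_unigram(words, vocab, alpha=1.0):
    """Laplace-smoothed unigram model P(w | M_k) over a fixed vocabulary."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def classify_word(w, class_models, background):
    """Assign word w to the class whose model maximizes log P(w|M_k) - log P(w)."""
    return max(class_models,
               key=lambda k: math.log(class_models[k].get(w, 1e-12))
                             - math.log(background.get(w, 1e-12)))

The resulting symbol sequence would then be smoothed, as described above, before being used for discrimination.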


Table 5: Feature list from IDIAP

Person-Specific Features
  Visual:  head vertical centroid
           head eccentricity
           right hand horizontal centroid
           right hand angle
           right hand eccentricity
           head and hand motion
  Audio:   SRP-PHAT from each seat
           speech relative pitch
           speech energy
           speech rate

Group Features
  Visual:  mean difference from white-board
           mean difference from projector screen
  Audio:   SRP-PHAT from white-board
           SRP-PHAT from projector screen

5.6 Models

5.6.1 TUM Models for Task-1

This section describes the various algorithms TUM used for the task of recognizing detected segments using the lexicon of eight group actions defined in Table 2.

* Static approaches. First, the approaches for recognizing group actions with given boundaries are explained. For details see [23]. These methods are based on the static features explained in Section 5.5.1.

* Simple classifiers. For the classification task we use a number of different classifiers:

• a simple hybrid Bayesian Network (BN) consisting of a discrete parent node with five states (one for each meeting event) and nine continuous nodes directly connected to the parent node, representing the nine dimensions of the feature vector,

• Gaussian Mixture Models (GMM) with various numbers of Gaussians, depending on the amount of training material,

• a Multilayer Perceptron (MLP) neural network with 3 layers,

• a Radial Basis Network (RBN) with at most 10 neurons,

• Support Vector Machines (SVM) with an RBF kernel.

* Late semantic fusion. Classifier fusion is often used to enhance the recognition results of single recognizers. Here the goal is to provide more robust results throughout the recognition process. The fusion method is derived from a proposal of [14]. Each classifier i produces a pseudo-probability d_{i,j} ∈ [0, 1] for each class j by normalizing its output via a limiting function. Since this method requires no training, it is quite easy and quick


to implement. These classifier outputs are organized in a decision profile (DP) matrix:

DP = \begin{pmatrix}
d_{1,1} & d_{1,2} & d_{1,3} & d_{1,4} & d_{1,5} \\
d_{2,1} & d_{2,2} & d_{2,3} & d_{2,4} & d_{2,5} \\
d_{3,1} & d_{3,2} & d_{3,3} & d_{3,4} & d_{3,5} \\
d_{4,1} & d_{4,2} & d_{4,3} & d_{4,4} & d_{4,5} \\
d_{5,1} & d_{5,2} & d_{5,3} & d_{5,4} & d_{5,5}
\end{pmatrix}    (8)

Here d_{i,j} is the pseudo-probability of classifier i for class j. The rows represent the output of one classifier, whereas the columns contain the probabilities that all classifiers assign to one class. We now take the column-wise minimum of the decision profile and select the class C with the maximum value, as shown in Eq. 9:

\mu = [\, \min(DP_{:,1}),\ \min(DP_{:,2}),\ \ldots,\ \min(DP_{:,5}) \,],
\qquad
C = \arg\max_j \mu_j    (9)

Also a pseudo-probability µ(C) is returned that reflects the support of the fused classifier for this class.
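For illustration, the column-wise minimum/maximum fusion of Eqs. (8)-(9) amounts to a few lines of numpy; the profile values below are invented:

import numpy as np

def fuse_decision_profile(dp):
    """Fuse a decision profile: dp[i, j] = pseudo-probability of classifier i for class j."""
    mu = dp.min(axis=0)        # column-wise minimum over all classifiers
    c = int(mu.argmax())       # class with the maximum support
    return c, mu[c]            # class index C and its support mu(C)

dp = np.array([[0.1, 0.7, 0.1, 0.05, 0.05],
               [0.2, 0.6, 0.1, 0.05, 0.05],
               [0.1, 0.5, 0.2, 0.10, 0.10]])
print(fuse_decision_profile(dp))   # -> (1, 0.5)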

* Late semantic fusion of three different types of recognition techniques. The classification task is performed by three different approaches, which are combined via late semantic fusion: a static approach using simulated results of specialized recognizers, a dynamic approach using the audio files, and a dynamic approach using the transcriptions.

The basic idea of the first approach is to take advantage of the results of various specialized recognizers, such as gesture recognizers, person trackers and so on. For each period of time, global statistics are calculated that give the percentage of each action in that period. Carefully selected items are then put together into a feature vector. The classification of an unknown feature vector is performed by a multi-class Support Vector Machine.

The second approach uses only the audio files. From the four lapel files (one for each participant) the Mel-Frequency Cepstral Coefficients (MFCCs) are calculated [32]. Here we use twelve cepstral coefficients plus the energy. These thirteen features are calculated for each participant and then concatenated, giving a feature vector with 52 dimensions. The MFCCs are extracted every 10 milliseconds. These features are then modelled by a Hidden Markov Model with six states and continuous Gaussian mixture outputs.

Our third approach for recognizing meeting events is based on the transcriptions that are available for a number of these scripted meetings. Following suggestions made by [30], each word is assigned a probability. For each word, the conditional probability that it belongs to a specific class is also calculated. Then, using Bayes' rule, the conditional probability that a sequence of words belongs to a specific class is derived.

* Dynamic approaches. This section gives an overview of the dynamic approaches used for the recognition of group actions. For more detail see [1]. These algorithms are based on the dynamic features described in Section 5.5.1. Three approaches for multi-modal event recognition in meetings have been compared: a single-stream Hidden Markov Model (HMM), a multi-stream HMM, and a DBN. Additionally, three single-stream HMMs without any fusion process have been evaluated for comparison with the multi-stream approaches.

* Training. Each of the different approaches consists of eight sub-models: one for each event class E_j. Each sub-model is trained with a number of sample observation sequences O_i from the corresponding event class. Training of all models has been performed with the EM algorithm [6, 4] using 126 meeting events from 30 training videos.


* Classification. The classification rule was the same for all approaches: an unknown sample sequence O_u is presented to all eight sub-models. The unknown sequence is then assigned to the class E_j^* corresponding to the sub-model with the highest likelihood:

E_j^* = \arg\max_{E_j \in E} P(O_u \mid E_j)    (10)

The HMM classification was performed with the Viterbi algorithm [28]; for the DBN classification, approximate inference [13] was applied. 126 meeting events from 30 unseen test videos have been used for the evaluation. The recognition results for all models are listed in Table 7.
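As an illustrative sketch of this training and classification scheme (Eq. 10), written here with the third-party hmmlearn package as a stand-in for the HMM/DBN implementations actually used in the project:

import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_class_models(train_data, n_states=3):
    """Train one HMM per event class; train_data maps class -> list of (T_i, D) arrays."""
    models = {}
    for label, sequences in train_data.items():
        X = np.vstack(sequences)
        lengths = [len(s) for s in sequences]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, sequence):
    """Assign an unknown sequence to the class with the highest log-likelihood (Eq. 10)."""
    return max(models, key=lambda label: models[label].score(sequence))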

* HMMs without Fusion. Three standard HMMs, one for each modality, have been trained. For all three models, different numbers of hidden states and Gaussian mixtures have been evaluated. These HMMs correspond to single-modality models without any fusion process.

For the binary speech and silence segmentation HMM, the vector x_BSP(t) has been used for training. Best results were achieved with only two hidden states and two Gaussians.

In the case of the speech-only HMM, the MFC vector x_S(t) has been used to train the model. Here, best results were achieved with three hidden states and only one Gaussian.

For the visual HMM, the global motion vector x_V(t) has been used. Again, best results were achieved with three hidden states and only one Gaussian.

* Early Fusion with a Single-Stream HMM. For the multi-modal single-stream approach, the three feature vectors x_BSP(t), x_S(t), and x_V(t) are concatenated at each time step t into one large observation vector O(t). To this end, the different frame rates of the three feature vectors had to be adjusted by up-sampling the visual feature stream. This process is often referred to as early fusion.

The resulting observation vector O(t) is then used to train a standard single-stream HMM. Again, different combinations of hidden states and Gaussian mixtures have been evaluated. The best model had three hidden states and only one Gaussian.
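A minimal sketch of this early-fusion step (the frame rates and stream order are illustrative):

import numpy as np

def early_fusion(x_bsp, x_s, x_v, audio_rate=25.0, video_rate=12.5):
    """Concatenate BSP, speech and visual feature streams into one observation stream.

    The visual stream (lower frame rate) is upsampled by frame repetition so that
    all three streams have the same number of frames.
    """
    factor = int(round(audio_rate / video_rate))          # e.g. 2
    x_v_up = np.repeat(x_v, factor, axis=0)               # upsample the visual stream
    n = min(len(x_bsp), len(x_s), len(x_v_up))            # align stream lengths
    return np.hstack([x_bsp[:n], x_s[:n], x_v_up[:n]])    # 6 + 20 + 42 = 68 dimensions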

* Late Fusion with a Multi-Stream HMM. In the multi-stream HMM, each of the three feature vectors x_BSP(t), x_S(t), and x_V(t) is modeled in a separate stream HMM. However, these three HMMs are connected and the probabilities are recombined after each state. Each stream has a weight ω_Stream for the probability recombination, so streams carrying more information can be weighted higher. This process can be regarded as late fusion.

Multi-stream approaches have been found to be robust against noise in one stream [12]. However, due to the recombination after each state, all streams are expected to be state-synchronous. Therefore the different frame rates of the three feature vectors had to be adjusted as for the early fusion approach.

In this work, a model with three hidden states, one Gaussian mixture, and high stream weights on the speech channel has been used.
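The recombination principle itself is simple; a schematic sketch, ignoring the state-synchronous machinery and using invented weights, is:

def multistream_loglik(stream_logliks, weights):
    """Late-fusion recombination: weighted sum of per-stream log-likelihoods."""
    return sum(weights[s] * stream_logliks[s] for s in stream_logliks)

weights = {"bsp": 0.2, "speech": 0.6, "visual": 0.2}   # higher weight on the speech stream
print(multistream_loglik({"bsp": -10.0, "speech": -5.0, "visual": -20.0}, weights))  # -> -9.0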

* Dynamic Bayesian Network Model. The new DBN model for multi-modal event classification is shown in Fig. 4. Continuous Gaussian nodes are represented by circles and discrete probability tables by squares. Observed nodes (O^Stream_i) are marked gray; hidden state nodes (H^Stream_i) and mixture nodes (M^Stream_i) are white; the index i denotes time. Probabilistic dependencies are denoted by arrows, where the head points to the statistically dependent child node. In this model, time flows from top to bottom. The last two rows, X_2 and X_3, represent one time slice and are repeated until the last sample point in the feature vector has been processed.


Figure 4: A DBN with three streams for multi-modal event classification

The model can be divided into three parts. The first two columns X^BSP_i represent the BSP feature stream x_BSP(t). This stream is observed as discrete values, so the nodes O^BSP_i can be implemented as conditional probability tables (CPT). Their parent nodes H^BSP_i represent the state probabilities and are implemented as CPTs as well. For the BSP stream, the CPT of the hidden states has only three entries; this stream can thus be viewed as a discrete HMM with three hidden states.

Columns three, four, and five (X^S_i) represent the MFC speech stream x_S(t). Here the stream is observed as continuous values, so the observed nodes O^S_i are implemented as Gaussian probability distributions (GPD). With the CPTs M^S_i these observation nodes are extended to Gaussian mixtures. Again, the parent nodes H^S_i represent the state probabilities. The number of hidden states has been set to five and the number of mixtures to two; this stream is therefore a continuous Gaussian mixture HMM with five hidden states.

Finally, the last three columns X^V_i represent another continuous Gaussian mixture HMM, with three hidden states and two mixtures, for the visual stream x_V(t). However, the third row is missing for this stream: the visual stream has only half the feature frame rate of the BSP and MFC streams. Hence, where the other two streams need two observation nodes in each time slice, only one observation node is necessary for the visual stream.

The three independent streams are connected and exchange information over their hidden nodes H^Stream_i; the DBN is therefore a coupled HMM. However, there are some major differences to the multi-stream HMM approach. Within the DBN, each stream can have its own observation representation and its own number of hidden states. Unlike the multi-stream HMM, the weights for the different streams are not set beforehand but are trained within the model, so the stream weights are adapted to the problem. Furthermore, the DBN is not state-synchronous, which allows different feature frame rates for each stream.

5.6.2 TUM Models for Task-2

This section gives an overview of the various techniques for the segmentation and integration of recorded meetings into group events. For details see [24]. All segmentation algorithms are based on the static features defined in Section 5.5.1.

Dynamic programming approach

Here the segmentation task is performed in two steps. First, potential segment boundaries are searched for; in the second step, from all these candidate boundaries, those that give the highest overall score are chosen.

First, the possible boundaries have to be found. Again, two connected windows are shifted over the time scale, as shown in Figure 5. This time the length of the windows remains fixed at 10 seconds each. Inside these two windows the feature vector is calculated and classified. If the results differ, a potential segment boundary is assumed. In the same step a clustering of all found boundaries is performed. As


Figure 5: Two connected windows are shifted over the time scale to produce potential boundaries.

long as the classification result K(a, b) in the left window remains the same, the newly assumed boundary is appended to the existing cluster G_i; otherwise a new cluster G_{i+1} is created. After that, all clusters that contain fewer than three possible boundaries are discarded, so that only important boundaries remain. We then have a collection of arrays G_i, i = 1, ..., N, where N is the number of clusters, containing the potential boundaries.

Having found all candidate boundaries, the 'best' boundary in each cluster G_i has to be chosen. This is accomplished via Dynamic Programming (DP). This approach assumes that the meeting events are mutually independent, so each boundary of a meeting event can be found if only its direct predecessor is known. The first and the last boundary are known a priori (beginning and end of the meeting), so the task is to choose the remaining inner boundaries that give the highest overall score. The score of a meeting event is calculated as the pseudo-probability that the classifier returns for the examined interval; this could be, for example, the normalized probability of the GMM or the normalized output of the neural net. As an additional constraint, only boundaries that ensure a minimum meeting event length of 15 seconds may be chosen.


Figure 6: Finding the optimal boundaries: the path with the highest overall score is found through backtracking. The abscissa denotes the clusters of potential boundaries, the ordinate the number of the boundary.

In Figure 6 the procedure for finding the optimal segment boundaries is illustrated. For each boundary x ∈ G_i, the score s_x(y) to each boundary y ∈ G_{i-1}, i = 2, ..., N, is calculated. Then the maximum score s_{x,max} for each x is chosen:

s_{x,\max} = \max_y s_x(y)    (11)

The sum of this score and the overall score up to cluster i - 1 is calculated and saved in a score matrix S^{G_i},


together with the predecessor y:

S^{G_i} = \begin{pmatrix}
\vdots & \vdots & \vdots \\
x & s_{x,\max} + S^{G_{i-1}}_{y,2} & y \\
\vdots & \vdots & \vdots
\end{pmatrix}    (12)

This is done for all clusters G_i. Afterwards, the best path through all score matrices is found through backtracking. Starting with the last score matrix S^{G_N}, which contains only one boundary, and following the indices in the third column, those boundaries are chosen that produce the best overall score. In a final step, two adjacent segments that contain the same meeting event are merged.
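For illustration, the cluster-wise dynamic programming with backtracking can be sketched as follows; segment_score is a placeholder for the classifier pseudo-probability of an interval, the first cluster is assumed to hold the known meeting start, the last cluster the single final boundary, and the minimum-length constraint is omitted:

def best_boundaries(clusters, segment_score):
    """Pick one boundary per cluster so that the summed segment scores are maximal."""
    # best[i][x] = (accumulated score up to boundary x, predecessor boundary in cluster i-1)
    best = [{x: (0.0, None) for x in clusters[0]}]
    for i in range(1, len(clusters)):
        layer = {}
        for x in clusters[i]:
            layer[x] = max((best[i - 1][y][0] + segment_score(y, x), y)
                           for y in clusters[i - 1])
        best.append(layer)
    path = [clusters[-1][0]]                       # backtracking from the final boundary
    for i in range(len(clusters) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))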

This approach has the advantage of being computationally much less expensive than the following method, since there are far fewer segments to test due to the fixed length of the sliding windows.

Integrated approach

The integrated approach combines the detection of the boundaries and the classification of the segments in one step. The strategy is similar to the one used in the BIC algorithm [27] and is illustrated in Figure 5. Two connected windows with variable length are shifted over the time scale. The inner border is shifted from left to right in steps of one second, and in each window the feature vector is classified. If the result differs between the two windows, the inner border is considered a boundary of a meeting event. If no boundary is detected in the current window, the whole window is enlarged and the inner border is again shifted from left to right. This procedure can be described by the following algorithm (a is the left border, b the inner border, c the right border of the window, L the minimum length of a meeting event, and K(a, b) the classification result of the interval [a, b]):

(1) initialize the interval [a, c]:
      a = 1; b = a + L; c = a + 3L;
(2) if K(a, b) ≠ K(b, c) then
      save b as a boundary;
      a = c; b = a + L; c = a + 3L;
    else
      b = b + 1;
(3) if (c − b) < L then
      c = c + 1; b = a + L;
      goto (2)
    else
      goto (2)
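For illustration, the two-window procedure can also be written directly in code; classify(a, b) is a placeholder for the feature extraction and classification of the interval [a, b]:

def integrated_segmentation(classify, T, L):
    """Sketch of the integrated approach (a: left, b: inner, c: right window border)."""
    boundaries = []
    a, b, c = 1, 1 + L, 1 + 3 * L
    while c <= T:
        if classify(a, b) != classify(b, c):
            boundaries.append(b)                 # inner border becomes an event boundary
            a, b, c = c, c + L, c + 3 * L        # restart behind the detected boundary
        else:
            b += 1
            if c - b < L:                        # right window too short: enlarge it
                c += 1
                b = a + L
    return boundaries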

This algorithm is run until the right border c reaches the end of the video file. To find a comparable measure for the results of the various methods, we use an error measure first proposed in [11].

Early integration HMM based on IDIAP-features

For our collaboration project we conducted a quick experiment using a relatively straightforward early-integration HMM in which all features from the data files were included. We tried various numbers of states and observed that the results were largely independent of the number of states; for example, an HMM with three states gave the same results as an HMM with eight states. This is rather unexpected behaviour that we cannot yet explain.


The Neural-Field-like system

For the task of segmenting the meetings into group actions, we propose a new approach based on the theory of neural fields, first analyzed by Amari [2]. More details can be found in [22]. The idea is to present the features of a whole meeting to the neural field simultaneously and obtain a segmentation and classification as output. In this way, elements from the end of a meeting can influence elements at the beginning, which should increase the robustness of the classification.

A detailed analysis (cf. [22]) shows that it is possible to define an equivalent recurrent neural net that has almost the same architecture as our proposed neural-field-like system.

For each frame i ∈ [1, ..., N] there is one neuron in the recurrent neural net. The input of each neuron consists of six or twelve features, from the speaker-turn detection and the global-motion detection respectively, depending on whether we use a unimodal or a multi-modal approach. The output is binary coded; the output layer therefore consists of 8 · N neurons, since we have eight classes. For each of the N time frames, the resulting group action is determined by the neuron with the highest activity.


Figure 7: Architecture of the equivalent recurrent neural net (not all connections are shown)

5.6.3 IDIAP Models for Task-2

Details of our models were reported in [33]. In our framework, we distinguish group actions (which belong to the whole set of participants) from individual actions (belonging to specific persons). Our ultimate goal is the recognition of group activity, so individual actions should act as the bridge between group actions and low-level features, thus decomposing the problem into stages. The definition of both action sets is therefore clearly intertwined.

Let I-HMM denote the lower recognition layer (individual action), and G-HMM the upper layer (group action). I-HMM receives as input AV features extracted from each participant, and outputs recognition results, either as soft or hard decisions. In turn, G-HMM receives as input the output from I-HMM together with a set of group features, directly extracted from the raw streams, which are not associated with any particular individual. In our framework, each layer is trained independently, and can be substituted by any of the HMM variants that might better capture the characteristics of the data, more specifically asynchrony [3], or different noise conditions [10] between the audio and visual streams. Our approach is summarized in Figure 8.

Compared with a single-layer HMM, the layered approach has the following advantages, some of which were previously pointed out by [20]: (1) A single-layer HMM is defined on a possibly large observation space, which might face the problem of over-fitting with limited training data. It is important to notice that the amount of training data becomes an issue in meetings, where data labeling is not a cheap task. In contrast, the layers in our approach are defined over small-dimensional observation spaces, resulting in more stable performance in cases of limited training data. (2) The I-HMMs are person-independent and in practice can be trained with much more data from different persons, as each meeting provides multiple individual streams of training data. Better generalization performance can then be expected. (3) The G-HMMs are less sensitive to slight changes in the low-level features because their


1. (Audio-Visual Feature Extraction)
   1-1. extract individual-level AV features
   1-2. extract group-level AV features
2. (Individual Action Recognition)
   2-1. given individual features for each person, train I-HMM, selecting the best model by cross-validation
   2-2. output individual action recognition results
3. (Group Action Recognition)
   3-1. construct a feature space by concatenating individual action results and group-level features
   3-2. train G-HMM, selecting the best model by cross-validation
   3-3. output group action recognition results

Figure 8: Two-layer HMM framework

observations are the outputs of the individual action recognizers, which are expected to be well trained. (4) The two layers are trained independently, so we can explore different HMM combination systems. In particular, we can replace the baseline I-HMMs with models that are more suitable for multi-modal asynchronous data sequences, with the goal of gaining understanding of the nature of the data. The framework thus becomes simpler to understand, and amenable to improvements at each separate level. (5) The framework is general and extensible to new group actions defined in the future.
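Purely for illustration, the layered pipeline of Figure 8 can be sketched as follows; all model interfaces (framewise_loglik, decode) are placeholders rather than a real library API:

import numpy as np

def two_layer_recognition(person_features, group_features, i_hmms, g_hmm, soft=True):
    """Schematic two-layer recognition: the individual layer feeds the group layer."""
    outputs = []
    for feats in person_features:                       # one (T, D) array per participant
        scores = np.stack([i_hmms[a].framewise_loglik(feats) for a in sorted(i_hmms)], axis=1)
        if soft:
            outputs.append(scores)                      # soft decision: per-action scores
        else:
            hard = np.eye(scores.shape[1])[scores.argmax(axis=1)]
            outputs.append(hard)                        # hard decision: one-hot labels
    g_obs = np.hstack(outputs + [group_features])       # G-HMM observation space
    return g_hmm.decode(g_obs)                          # group-action segmentation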

5.6.4 EDIN Models for Task-2

The DBN formalism allows the construction and development of a variety of models, starting from a simple HMM and extending to more sophisticated models with richer hidden state. Among the many advantages provided by the adoption of a DBN formalism, one benefit is the flexibility in the factorization of the model's internal state. With a small effort, DBNs are able to factorize the internal hidden state, organizing it into a set of interconnected and specialised hidden variables.

Our multi-stream model (bottom of Figure 9) exploits this principle in two ways: decomposing meeting actions into smaller logical units, and modelling the three feature streams independently. We assume that a meeting action can be decomposed into a sequence of small units: meeting subactions. In accordance with this assumption, the state space is decomposed into two levels of resolution: meeting actions (nodes A) and meeting subactions (nodes S^F). Note that the decomposition of meeting actions into meeting subactions is done automatically through the training process. These synthetic subactions do not necessarily have a clear human interpretation.

Feature sets derived from different modalities are usually governed by different laws, have different characteristic time-scales and highlight different aspects of the communicative process. Starting from this hypothesis, we further subdivided the model state space according to the nature of the features being processed, modelling each feature stream independently (multi-stream approach). The resulting model has an independent substate node S^F for each feature class F (prosodic features, speaker activities, lexical features, etc.), and integrates the information carried by each feature stream at a 'higher level' of the model structure (arcs between A and S^F, F = 1, ..., n).

Each substate node S^F, F = 1, ..., n, follows an independent Markov chain, but its substate transition matrix and initial state distribution are functions of the action variable state A_t = k. The discrete substates S^F generate the continuous observation vectors Y^F through mixtures of Gaussians, and the sequence of action nodes A forms a Markov chain with the subaction nodes S^F, F = 1, ..., n, as parents. Like any ordinary Markov chain, A has an associated transition matrix and an initial state probability vector. A has a cardinality of 8, since there is a dictionary of 8 meeting actions.


Figure 9: Multistream DBN model (a) enhanced with a "counter structure" (b); square nodes represent discrete hidden variables and circles represent continuous observations

The cardinalities of the sub-action nodes S^F are part of the parameter set; for our experiments a typical cardinality is 6.

The probability of remaining in an HMM state follows an inverse exponential [21], and a similar behaviour is displayed by the proposed model. This distribution is not well matched to the behaviour of meeting action durations. Rather than adopting ad hoc solutions, such as action transition penalties, we preferred to improve the flexibility of state duration modelling by enhancing the existing model with a counter structure (top of Figure 9). The counter variable C, which is ideally incremented at each action transition, attempts to model the expected number of recognized actions. Action variables A now also generate the hidden sequence of counter nodes C, together with the sequence of sub-action nodes S. Binary enabler variables E have an interface role between action variables A and counter nodes C.
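For concreteness, a standard HMM with self-transition probability a_{ii} implicitly assigns a state duration d the geometric (inverse exponential) distribution

P(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \ldots

whose mode is always d = 1 and whose mean is 1/(1 - a_{ii}); this monotonically decaying shape is what makes plain HMM-style duration modelling a poor match for typical meeting action durations.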

This model presents several advantages over a simpler HMM in which features are "early integrated" into a single feature vector:

• feature classes are processed independently according to their nature

• more freedom is allowed in the state space partitioning and in the optimization of the sub-state space assigned to each feature class

• higher flexibility, for example when the feature set needs to be modified

• knowledge from different streams is integrated at a higher level of the model structure

• state duration modelling is improved

Unfortunately these advantages, and the increased accuracy that can be achieved, are balanced by an increased model size, and therefore by an increased computational complexity.


Table 6: Recognition rates of the classifiers (BN: Bayesian Network, GMM: Gaussian Mixture Models, MLP: Multilayer Perceptron Network, RBN: Radial Basis Network, SVM: Support Vector Machines)

Classifier   Recognition Rate (%)
BN           95.90
GMM          88.04
MLP          96.72
RBN          97.54
SVM          97.54
FUSED        96.72

5.7 Performance Measures

5.7.1 Measures for Task-1

We use the Recognition Rate to evaluate results for Task-1. The recognition rate is defined as the proportion of correctly recognized actions over the total number of actions:

\text{Recognition Rate} = \frac{\text{number of correct actions}}{\text{number of total actions}} \times 100\%    (13)

5.7.2 Measures for Task-2

We use the action error rate (AER) to evaluate results for Task-2. AER is defined as the sum of insertion (Ins), deletion (Del), and substitution (Subs) errors, divided by the total number of actions in the ground truth:

\text{AER} = \frac{\text{Subs} + \text{Del} + \text{Ins}}{\text{Total Actions}} \times 100\%    (14)

Individual deletion (Del), insertion (Ins) and substitution (Subs) counts are also reported to evaluate action recognition results, and a confusion matrix is used for performance evaluation.
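As a small worked example of Eq. 14 (the counts are invented): with 10 substitutions, 4 deletions and 6 insertions over 120 ground-truth actions, the AER is (10 + 4 + 6) / 120 x 100% ≈ 16.7%.

def action_error_rate(n_subs, n_del, n_ins, n_total):
    """Action error rate (Eq. 14) from alignment counts."""
    return 100.0 * (n_subs + n_del + n_ins) / n_total

print(action_error_rate(10, 4, 6, 120))   # -> 16.66...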

5.8 Individual Results and Discussions

5.8.1 TUM Results for Task-1

This section presents TUM results for Task-1 using a lexicon of eight group actions defined in Table 2.

Simple classifier

Table 6 shows the recognition rate of each of the simple classifiers (cf. Section 5.6.1). Two classifiers (RBN and SVM) yield a quite good result of 97.54%, whereas the GMMs seem unable to adapt well enough and achieve a recognition rate of 88.04%. One cause of this difference may be the relatively small amount of training material available.

Late semantic fusion

This approach yields a recognition rate as high as that of the MLP (see Table 6). A better result should be achievable if more training data were available. Until now we have only used the initially available 53 meetings. With more material, the single recognizers could be trained on distinct sets of training data, as is recommended for fusion techniques.


Table 7: Recognition results for the different classifiers

                   HMMs without Fusion              Early Fusion   Late Fusion
                   BSP       Speech     Visual      HMM            HMM        DBN
Recognition rate   73.77%    81.15%     67.24%      84.48%         83.62%     84.35%

Late semantic fusion of three different types of recognition techniques

A comparison of the three classification techniques introduced in Section 5.6.1 gives an interesting result. If only the video were available, without the audio signal, and the individual actions of the participants could be extracted perfectly, the recognition rate would already be fairly high at 82.79%. If only the audio files are used, the recognition rate decreases significantly, by about 14 percent. This is probably due to the loss of the local information needed to distinguish e.g. monologues from presentation events. The even worse result obtained using the transcriptions can be explained by the lack of information about who is saying what: only the words themselves are considered, not who uttered them. A combination of all three methods should therefore give better results, because the classifiers use complementary information.

Each of the three meeting recognition techniques produces an output in which the most likely meeting event is reported. In addition, a score is delivered that indicates how reliable this result is. To reach better results than each of the classifiers alone, a simple fusion technique is used: if two or more of the classifiers deliver the same class, then the fused result is that class; only if all three classifiers disagree is the one with the highest score taken. With this fusion technique a gain of about three percent in recognition rate was obtained.

Classifier         Annotations   MFCCs     Transcripts   Fused
Recognition Rate   82.79 %       68.03 %   44.44 %       86.07 %

Dynamic approaches

All models have been tested with an unseen set of 126 meeting events, extracted from 30 meeting videos. The overall recognition performance of all models is shown in Table 7.

The best single-modality model was the speech-stream HMM, which reached a recognition rate of 81.15%. All multi-modal approaches increased the recognition performance significantly, the best by more than 3% compared to the speech-stream HMM. Among the multi-modal models, early fusion with a single-stream HMM reached the best performance of 84.48%, while the multi-stream DBN had nearly the same recognition rate of 84.35%. The multi-stream HMM approach, at 83.62%, gave the worst result among the multi-modal models.

5.8.2 TUM Results for Task-2

This section presents TUM results for Task-2 using a lexicon of eight group actions defined in Table 2.

Dynamic programming and integrated approach

The results of the segmentation are shown in Table 8 and Table 9, respectively (BN: Bayesian Network, GMM: Gaussian Mixture Models, MLP: Multilayer Perceptron Network, RBF: Radial Basis Network, SVM: Support Vector Machines).


Table 8: Segmentation results using the integrated approach (BN: Bayesian Network, GMM: Gaussian Mixture Models, MLP: Multilayer Perceptron Network, RBF: Radial Basis Network, SVM: Support Vector Machines). The columns denote the insertion rate, the deletion rate, the mean absolute boundary error in seconds and the action error rate.

Classifier   Insertion (%)   Deletion (%)   Mean absolute error (s)   AER (%)
BN           14.7            6.22           7.93                      39.0
GMM          24.7            2.33           10.8                      41.4
MLP          8.61            1.67           6.33                      32.4
RBF          6.89            3.00           5.66                      31.6
SVM          17.7            0.83           9.08                      35.7

Table 9: Segmentation results using Dynamic Programming.

Classifier   Insertion (%)   Deletion (%)   Mean absolute error (s)   AER (%)
BN           16.5            4.67           6.66                      36.6
GMM          29.7            2.50           33.2                      49.1
MLP          18.7            3.17           16.0                      38.9
RBF          17.3            0.83           16.0                      39.6

Each row denotes the classifier that was used. The columns show the insertion rate (number of insertions with respect to all meeting events), the deletion rate (number of deletions with respect to all meeting events), the mean absolute error of the found segment boundaries in seconds, and the action error rate. In all columns, lower numbers denote better results.

As can be seen from the tables, the results are quite variable and depend heavily on the classifier used. With the integrated approach (cf. Table 8), the best outcome is achieved by the radial basis network, which has the lowest insertion rate. The detected segment boundaries match the originally defined boundaries quite well, with a deviation of only about five seconds.

The results of the segmentation with dynamic programming were in general slightly worse. Since no score can be obtained from the SVMs, they were not used here. Remarkable is the difference of about ten seconds in boundary accuracy between the Bayesian Network and the neural networks: the Bayesian Network misses the given boundaries by 6.6 seconds on average, whereas the neural network approaches produce a larger deviation of approximately 16 seconds.

Early integration HMM based on IDIAP-features

With the definition of the action error rate in Eq. 14, we obtained an AER of 28.994% on the scripted meeting data. The exact outcome can be seen in the confusion matrix in Table 10.

The Neural-Field-like system

Results using dynamic features

All features (cf. Section 5.5.1) of an entire meeting are presented to the recurrent neural net in parallel. With a frame rate of five Hertz and a meeting length of approximately five minutes, this results in a total of 5 min × 60 s/min × 5 frames/s = 1500 feature vectors of at least six dimensions. Unfortunately, such an amount of data is not feasible. Therefore, several consecutive feature vectors have to be merged into one item. Another reason for this procedure is that we can then guarantee that a recognized group action has at least the length of the merged features; we refer to this as the minimum length of a group action.


Table 10: Confusion matrix of early-integration HMM with eight states

                monologue1  monologue2  monologue3  monologue4  discussion  note taking  presentation  whiteboard  Deletions
monologue1      14          0           0           0           0           0            0             0           0
monologue2      0           14          0           0           0           0            0             0           0
monologue3      0           0           19          0           0           0            0             0           0
monologue4      0           0           0           11          0           0            0             0           2
discussion      3           7           0           0           39          0            0             0           14
note taking     1           1           0           1           0           0            0             1           2
presentation    0           0           0           0           0           0            13            3           2
whiteboard      0           0           0           0           0           0            0             22          0
Insertions      2           3           4           2           0           0            0             1

We conducted several experiments with varying numbers of inputs and various time scales. First, experiments with only one modality (speaker turns only) were carried out. Table 11 shows the results of several passes with different minimum lengths. Here the best result is achieved when the group actions have a minimum length of twenty seconds; the frame error rate is then only 0.333. Unfortunately, there is no clear dependency between the number of seconds merged and the frame error rate, so it cannot be predicted which configuration will perform best.

Doing the same experiments using only global motion features gives similar but slightly worse results (cf. Table 11). The best frame error rate of 0.463 was achieved at a minimum length of 20 seconds and with no coupling to other neurons.

In Table 12 the minimum length of a group action is twenty seconds (features of 20 seconds are merged), and the FER for different coupling widths is shown. As can be seen, there is also no clear dependency between the coupling length and the frame error rate.

One would expect the results to improve if more information (i.e. speaker turns and global motion features) is combined. However, comparing the columns of Table 12, there is never an improvement in the frame error rate when both modalities are used. This could have various explanations. One reason could be that the global motion features are not suitable for the task of group action recognition. Another reason may be that our architecture cannot profit from the additional information, but is instead confused by it.

Nevertheless, quite promising results could be achieved. The overall best frame error rate is obtained when only speaker turns are used, features of twenty seconds are merged, and the recurrent neural net has a coupling length of three neurons; the frame error rate is then roughly 0.333. This result is comparable to the one we achieved using a completely different approach in [24], where the best results were frame error rates between 0.3180 and 0.3495, using only speaker turns, depending on which classifier was used. This neural-field-like approach therefore seems able to compete with conventional methods.

Results using features from IDIAP

In a comparative experiment we used the features kindly provided by IDIAP. With the approach presented above we obtained a frame error rate of 0.45. Using predefined group action boundaries for the recognition task, we reached an action recognition rate of 73.7% using only video features. An overview


Table 11: Results with different time granularities and a coupling length of three neurons, using only speaker-turn detection (ST), only global-motion features (GM), and both modalities (ST&GM). The first column shows the minimum length of a group action in seconds; the following columns give the frame error rates.

#sec.   FER (ST) (%)   FER (GM) (%)   FER (ST&GM) (%)
2       42.5           58.9           53.1
4       42.4           62.2           46.7
6       40.2           62.6           44.3
8       42.1           54.3           49.7
10      37.0           52.4           42.7
12      40.2           54.2           44.7
14      42.0           52.4           41.7
16      39.5           49.5           46.8
18      38.9           55.5           45.7
20      33.3           51.1           43.8

of the results is presented in Table 13.

5.8.3 IDIAP Results for Task-2

This section presents IDIAP results for Task-2 using the lexicon of fourteen group actions defined in Table 3.

Table 14 shows results using the layered HMM framework, compared with the single-layer HMM. We investigated the following cases:

• Early integration, visual-only, soft decision. A normal HMM is trained using the combination of the results of the I-HMM trained on visual-only features, and the visual group features. The soft decision criterion is used.

• Early integration, audio-only, soft decision. Same as above, but replacing visual-only by audio-only information.

• Early integration, AV, hard decision. Same as above, but replacing visual-only by audio-visual information. The hard decision criterion is used.

• Early integration, AV, soft decision. Same as above, but changing the criterion used to link the two HMM layers.

• Multi-stream, AV, hard decision, using the multi-stream HMM approach as I-HMM. The hard decision criterion is used.

• Multi-stream, AV, soft decision. Same as above, but changing the criterion used to link the two HMM layers.

• Asynchronous HMM, AV, hard decision. We use the asynchronous HMM for the individual action layer and audio-visual features. The hard decision criterion is used.

• Asynchronous HMM, AV, soft decision. Same as above, but changing the criterion used to link the two HMM layers.


Table 12: Results of experiments with various coupling lengths (i.e. numbers of neurons that can influence each other), using only speaker-turn detection (ST), only global-motion features (GM), and both modalities (ST&GM). The minimum length of a group action is 20 seconds. The first column shows the coupling length; the following columns give the frame error rates (in %).

#N   FER (ST)   FER (GM)   FER (ST&GM)
1    39.3       46.3       45.4
2    34.2       55.4       45.0
3    33.3       51.1       43.8
4    40.4       53.2       41.7
5    36.0       51.2       39.4
6    40.6       48.2       44.5
7    42.5       54.4       45.8
8    36.8       48.0       42.5
9    39.6       50.0       40.6
10   39.3       50.1       46.2
11   41.4       58.5       43.7
12   41.6       51.1       44.7
13   40.1       66.9       44.3
14   36.0       50.0       41.9
15   35.7       49.7       43.6
17   38.1       46.4       49.0
19   38.1       46.4       49.0
20   38.1       46.4       49.0

We observe from Table 14 that the use of AV features outperformed the use of single modalities for both the single-layer HMM and the two-layer HMM methods. This result supports the hypothesis that the group actions we defined are inherently multimodal. Furthermore, the best two-layer HMM method (A-HMM) using AV features improved the performance by over 8% compared to the AV single-layer HMM. Given the small number of group actions in the corpus, a standard proportion test indicates that the difference in performance between the AV single-layer and the best two-layer HMM is significant at the 96% confidence level. Additionally, the standard deviation for the two-layer approach is half the baseline's, which suggests that our approach might be more robust to variations in initialization, given that each HMM stage in our approach is trained using an observation space of relatively low dimension. Regarding hard vs. soft decision, soft decision produced a slightly better result, although not statistically significant given the number of group actions. However, the standard deviation using soft decision is again around half of that corresponding to hard decision. Overall, the soft-decision two-layer HMM appears to be favored by the results.

5.8.4 EDIN Results for Task-2

All our experiments were conducted on 53 meetings using a lexicon of eight group actions defined in Table 2. We implemented the proposed models using the Graphical Models Toolkit (GMTK) [5]. The evaluation was performed using a leave-one-out procedure, in which the system was trained on 52 meetings and tested on the remaining one, iterating this procedure 53 times.
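The leave-one-out protocol can be sketched as follows; the train/test helpers are placeholders standing in for the actual GMTK-based training and decoding scripts, which are not reproduced here.

```python
def leave_one_out(meetings, train_fn, test_fn):
    """Leave-one-out cross-validation over a list of meetings.

    train_fn(train_set) -> model and test_fn(model, meeting) -> error
    are placeholders for the real training/decoding procedures.
    """
    errors = []
    for held_out in range(len(meetings)):
        train_set = [m for i, m in enumerate(meetings) if i != held_out]
        model = train_fn(train_set)                         # train on the other 52 meetings
        errors.append(test_fn(model, meetings[held_out]))   # test on the held-out meeting
    return sum(errors) / len(errors)                        # average error over all folds
```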

Table 15 shows some experimental results achieved using: an ergodic 11-state HMM, a multi-stream approach (section 5.6.4) with two feature streams, and the full counter-enhanced multi-stream model. The base 2-stream approach has been tested in two different sub-action configurations: imposing |S1| = |S2| = 6 and fixing these cardinalities to |S1| = |S2| = 7. Therefore four experimental setups were investigated, and each setup has been tested with 3 different feature sets, leading to 12 independent experiments. The first feature configuration ("UEDIN") assigns the prosodic features outlined in section 5.5.3 to the stream S1 and the speaker activity features (section 5.5.3) to the second sub-action variable S2. The feature configuration labelled "IDIAP" makes use of the multimodal features extracted at IDIAP, representing audio-related features (prosodic data and speaker localisation) through the observable node Y1 and video-related measures through Y2. The last setup ("TUM") relies on two feature families extracted at the Technische Universität München: binary speech profiles derived from IDIAP speaker locations, and video-related global motion features; each of these has been assigned to an independent sub-action node. Note that in the HMM-based experiment only one observable feature stream Y is available, so Y has been obtained by merging the feature vectors Y1 and Y2.

Table 13: Action recognition rates (ARR, in %) of the neural-field-like system using features from IDIAP, with a coupling length of twelve neurons and given action boundaries.

#sec.   ARR (Audio)   ARR (Video)   ARR (AV)
  1        42.4          42.8         40.5
  2        44.9          42.9         40.0
  3        44.9          30.3         44.4
  4        55.0          28.4         42.4
  5        33.1          30.7         42.5
  6        47.3          45.2         44.4
  7        47.3          40.5         42.5
  8        59.4          45.2         42.9
  9        18.7          28.3         42.4
 10        42.9          40.5         42.4
 11        20.6          27.8         57.0
 12        37.5          44.8         57.5
 13        43.0          40.5         35.2
 14        44.8          42.3         40.8
 15        64.8          64.9         53.8
 16        34.5          48.7         51.7
 17        37.8          64.8         23.5
 18        62.7          73.7         21.3
 19        35.1          37.2         22.9
 20        63.1          46.1         64.8

Looking only at the results of Table 15 obtained with the UEDIN feature setup, it is clear that the simple HMM shows a much higher error than any multi-stream configuration. The adoption of a multistream-based approach reduces the AER to less than 20%, providing the lowest AER (11%) when the sub-action cardinalities are fixed to 7. The percentage correct also rises from around 64% with the HMM approach to values well above 80%. Unfortunately, with this feature configuration the counter structure seems to be ineffective: the AER increases from 17.1% to 18.9%. Further experimental results that also use lexical features (section 5.5.3) are shown in Table 16. Since only 30 meetings are transcribed at the word level, and lexical features rely on those transcriptions, these experiments have been conducted with a leave-one-out cross-evaluation over a corpus subset containing only those 30 meetings. The following comparisons between Table 15 and Table 16 are therefore not strictly fair and must be regarded as approximate. The lexically extended feature set seems to improve results both with the HMM approach and with the multi-stream framework.
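The correct / substitution / deletion / insertion percentages and the resulting AER reported in Tables 15 and 16 follow the usual edit-distance-based scoring of a recognised action sequence against the reference, analogous to word error rate. The sketch below illustrates such a scorer; it is a generic implementation, not the actual evaluation script used in the experiments.

```python
def action_error_rate(reference, hypothesis):
    """Align two action sequences by edit distance and return
    (correct, substitutions, deletions, insertions, AER) as percentages
    of the number of reference actions."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimal edit cost between reference[:i] and hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    # Backtrack to count the individual error types.
    i, j, subs, dels, ins, corr = n, m, 0, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]):
            corr += reference[i - 1] == hypothesis[j - 1]
            subs += reference[i - 1] != hypothesis[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    scale = 100.0 / n
    return corr * scale, subs * scale, dels * scale, ins * scale, (subs + dels + ins) * scale

# Example with hypothetical action sequences.
print(action_error_rate(["discussion", "monologue", "presentation"],
                        ["discussion", "whiteboard", "presentation", "note-taking"]))
```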


Table 14: Results of group action recognition

Method                                               AER (%)
Single-layer HMM   Visual-only                        48.20
                   Audio-only                         36.70
                   Early Int.                         23.74
                   MS-HMM                             23.13
                   A-HMM                              22.20
Two-layer HMM      Visual-only                        42.45
                   Audio-only                         32.37
                   Early Int.      hard decision      17.98
                                   soft decision      16.55
                   MS-HMM          hard decision      17.27
                                   soft decision      15.83
                   A-HMM           hard decision      17.85
                                   soft decision      15.11

Using a basic multi-stream approach alone, this improvement is barely perceptible (from 11% to 10.9% AER), but the adoption of a counter structure provides a larger improvement, leading to our best AER (only 9%).

5.9 Overall Discussions

The results of Table 15 show the UEDIN features achieving higher accuracy than the IDIAP and TUM setups, but it is essential to remember that our DBN models have been optimised for the UEDIN features. In particular, sub-action cardinalities have been studied intensively with our features; it will be interesting to discover optimal values for the IDIAP and TUM features too. Moreover, the overall performances achieved with the multistream approach are very similar (AERs always lie in the range from 26.7% to 11.0%), and all may be considered promising. The TUM setup appears to be the configuration for which switching from an HMM to a multistream DBN approach provides the greatest improvement in performance: the error rate decreases from 92.9% to 21.4%. While the adoption of a counter structure is not particularly effective with the features outlined in section 5.5.3, with the IDIAP features the counter provides a significant AER reduction (from 26.7% to 24.9%).

Independently of the feature configuration, the best overall results are achieved with the multistream approach and a state space of 7 by 7 sub-states. We are confident that further improvements with the IDIAP features could be obtained by using more than 2 streams (such as the 3-stream multistream model introduced in section 5.8.4).

References

[1] M. A. Al-Hames and G. Rigoll. An investigation of different modeling techniques for multi-modal event classification in meeting scenarios. MLMI 2004, Joint AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms, 2004. Poster presentation.

[2] S.-I. Amari. Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27:77–87, 1977.

[3] S. Bengio. An asynchronous hidden Markov model for audio-visual speech recognition. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, NIPS 15. MIT Press, 2003.


Model                         Feature Set   Corr.   Sub.   Del.   Ins.    AER
HMM                           UEDIN          63.3   13.2   23.5   11.7   48.4
                              IDIAP          62.6   19.9   17.4   24.2   61.6
                              TUM            60.9   25.6   13.5   53.7   92.9
2 streams (|S1| = |S2| = 6)   UEDIN          86.1    5.7    8.2    3.2   17.1
                              IDIAP          77.9    8.9   13.2    4.6   26.7
                              TUM            85.4    9.3    5.3    6.8   21.4
2 streams (|S1| = |S2| = 6)   UEDIN          85.8    7.5    6.8    4.6   18.9
  + counter                   IDIAP          79.4   10.0   10.7    4.3   24.9
                              TUM            85.1    5.7    9.3    6.4   21.4
2 streams (|S1| = |S2| = 7)   UEDIN          90.7    2.8    6.4    1.8   11.0
                              IDIAP          86.5    7.8    5.7    3.2   16.7
                              TUM            82.9    7.1   10.0    4.3   21.4

Table 15: Action error rates (%) for a simple hidden Markov model and for a multi-stream (2 streams) approach with and without the "counter structure", using two different sub-action spaces; all these models have been individually tested with 3 different feature configurations.

Model                       Corr.   Sub.   Del.   Ins.    AER
HMM                          70.5   10.3   19.2   14.7   44.2
Multistream (3 streams)      91.7    4.5    3.8    2.6   10.9
Multistream + counter        92.9    5.1    1.9    1.9    9.0

Table 16: Performances (%) for a simple HMM, the multistream approach, and the multistream model enhanced with a "counter structure", using three feature streams: prosody, speaker activity and lexical features.

[4] J. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, ICSI, April 1998.

[5] J. Bilmes. Graphical models and automatic speech recognition. Mathematical Foundations of Speech and Language Processing, 2003.

[6] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1–38, 1977.

[7] J. DiBiase, H. Silverman, and M. Brandstein. Robust localization in reverberant rooms. In M. Brandstein and D. Ward, editors, Microphone Arrays, chapter 8, pages 157–180. Springer, 2001.

[8] A. Dielmann and S. Renals. Dynamic Bayesian networks for meeting structuring. Proc. IEEE ICASSP, pages 629–632, May 2004.

[9] A. Dielmann and S. Renals. Multistream dynamic Bayesian network for meeting segmentation. Lecture Notes in Computer Science, 3361:76–86, 2005.

[10] S. Dupont and J. Luettin. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2(3):141–151, September 2000.

[11] S. Eickeler and G. Rigoll. A novel error measure for the evaluation of video indexing systems. In IEEE Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.


[12] J. Gowdy, A. Subramanya, C. Bartels, and J. Bilmes. DBN-based multi-stream models for audio-visual speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004.

[13] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. Jordan, editor, Learning in Graphical Models, pages 105–161. MIT Press, 1998.

[14] L. I. Kuncheva, J. C. Bezdek, and R. P. Duin. Decision templates for multiple classifier fusion: An experimental comparison. Pattern Recognition, 34(2):299–314, 1999.

[15] G. Lathoud, I. A. McCowan, and J.-M. Odobez. Unsupervised Location-Based Segmentation of Multi-Party Speech. In Proceedings of the 2004 ICASSP-NIST Meeting Recognition Workshop, Montreal, Canada, May 2004. IDIAP-RR 04-14.

[16] J. D. Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20:367–377, 1972.

[17] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang. Automatic analysis of multimodal group actions in meetings. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27:305–317, 2005.

[18] N. Morgan and E. Fosler-Lussier. Combining multiple estimators of speaking rate. In Proc. of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 1998.

[19] N. Morgan and E. Fosler-Lussier. Combining multiple estimators of speaking rate. Proc. IEEE ICASSP, pages 729–732, 1998.

[20] N. Oliver, E. Horvitz, and A. Garg. Layered representations for learning and inferring office activity from multiple sensory channels. In Proc. ICMI, Pittsburgh, Oct. 2002.

[21] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257–286, 1989.

[22] S. Reiter and G. Rigoll. A neural-field-like approach for modeling human group actions in meetings. Submitted.

[23] S. Reiter and G. Rigoll. Segmentation and classification of meeting events using multiple classifier fusion and dynamic programming. Proc. IEEE ICPR, August 2004.

[24] S. Reiter and G. Rigoll. Multimodal meeting analysis by segmentation and classification of meeting events based on a higher level semantic approach. In Proceedings of the 30th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, USA, March 2005.

[25] K. Sonmez, E. Shriberg, L. Heck, and M. Weintraub. Modelling dynamic prosodic variation for speaker verification. Proc. ICSLP, 7(920):3189–3192, 1998.

[26] T. Starner and A. Pentland. Visual recognition of American Sign Language using HMMs. In Proc. Int. Work. on AFGR, Zurich, 1995.

[27] A. Tritschler and R. A. Gopinath. Improved speaker segmentation and segments clustering using the Bayesian information criterion. In Proceedings of EUROSPEECH, pages 679–682, 1999.

[28] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 1967.


[29] F. Wallhoff, M. Zobl, and G. Rigoll. Action segmentation and recognition in meeting room scenarios. In Proceedings of the IEEE International Conference on Image Processing (ICIP), October 2004.

[30] F. Walls, H. Jin, S. Sista, and R. Schwartz. Topic detection in broadcast news. In Proceedings of the DARPA Broadcast News Workshop, pages 193–198, 1999.

[31] J. Yang, W. Lu, and A. Waibel. Skin-color modeling and adaptation. Proc. of Asian Conference on Computer Vision, 2:687–694, 1998.

[32] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev, and P. Woodland. The HTK Book. Cambridge University Engineering Department, 2002.

[33] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud. Modeling individual and group actions in meetings: a two-layer HMM framework. In IEEE Workshop on Event Mining at the Conference on Computer Vision and Pattern Recognition (CVPR), 2004.

[34] M. Zobl, F. Wallhoff, and G. Rigoll. Action recognition in meeting scenarios using global motion features. In J. Ferryman, editor, Proceedings of the Fourth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS-ICVS), 2003.

6 Interactive video retrieval based on multimodal dissimilarity representation

Determining semantic concepts by allowing users to iteratively refine their queries is a key issue in multimedia content-based retrieval. The relevance feedback loop allows the user to build complex queries made of positive and negative example documents. From this training set, a learning process should then extract relevant documents from the feature spaces. Many relevance feedback techniques have been developed that operate directly in the feature space [1, 2].

Describing the content of videos requires dealing in parallel with many high-dimensional feature spaces expressing the multimodal characteristics of the audiovisual stream. This mass of data makes retrieval operations computationally expensive when dealing directly with features. Even the simple task of computing the distance between a query and all other elements becomes infeasible when tens of thousands of documents and thousands of feature space components are involved. A more convenient way is to compute offline monomodal dissimilarity relationships between elements and to use the dissimilarity matrices as an index for retrieval operations.

In this section, we show how dissimilarities can be used to build a low-dimensional multimodal representation space in which learning machines based on, e.g., non-linear discriminant analysis can operate. Our thorough evaluation on a large video corpus shows that this multimodal dissimilarity space allows effective retrieval of video documents in real time.

6.1 Classification in dissimilarity space

In the proposed retrieval system, video segments are represented by their dissimilarity relationships computed over several audiovisual features. The user can formulate complex queries by iteratively providing positive and negative examples in a relevance feedback loop. From these training data, the aim is to perform a real-time dissimilarity-based classification that returns relevant documents to the user.

6.1.1 Dissimilarity space

Let d(x_i, x_j) be the distance between elements i and j according to their descriptors x ∈ F, where F denotes the (unavailable) original feature space. The dissimilarity space is defined as the mapping d(z, Ω) : F → IR^N given by (see [3] for details):

d(z, Ω) = [d(z, x_1), d(z, x_2), ..., d(z, x_N)].   (15)

The representation set Ω = {x_1, ..., x_N} is a subset of N objects defining the new space: the new "features" of an input element are now its dissimilarities to the representation objects. As a consequence, learning and classification tools designed for feature representations are also available to deal with the dissimilarities.

The dimensionality of the dissimilarity space is directly linked to the size of Ω, which controls the approximation made on the original feature space (such an approximation could be computed using projection algorithms like classical scaling [4]). Increasing the number of elements in Ω increases the representation accuracy. On the other hand, we are interested in minimizing the space dimensionality so as to limit computation and to speed up the response time of the system. The selection of Ω will, however, be driven by considerations on the classification problem, as explained next.
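In code, the mapping of equation (15) is simply a distance computation against the representation set. The sketch below uses illustrative names and assumes a Euclidean distance; the actual system may use other distance measures per feature space.

```python
import numpy as np

def dissimilarity_representation(z, representation_set, metric=None):
    """Map a feature vector z into the dissimilarity space defined by the
    representation set Omega = {x_1, ..., x_N} (equation 15)."""
    if metric is None:
        # Euclidean distance assumed by default.
        metric = lambda a, b: float(np.linalg.norm(a - b))
    return np.array([metric(z, x) for x in representation_set])

# Example: a 3-object representation set in a 2-D feature space.
omega = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]
print(dissimilarity_representation(np.array([1.0, 1.0]), omega))
```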

6.1.2 Non-linear discriminant analysis

Let us define the set T of positive and negative training examples (respectively denoted P and N, with T = P ∪ N). Their coordinates in the dissimilarity space are respectively d_i^+ = d(z_{i∈P}, Ω) and d_i^- = d(z_{i∈N}, Ω).

Given a query T, the aim is to find a relevance measure D(d) : IR^N → IR that maximizes the following Fisher criterion:

max_D  [ Σ_i D²(d_i^-) ] / [ Σ_i D²(d_i^+) ].   (16)

The measure D(d) gives us a new ranking function in which positive elements tend to be placed at the top of the list while negative ones are pushed to the end.

Depending on the separability of the data according to a query T, the ranking function D(d) may be chosen as a linear or non-linear function of the dissimilarities. Following the kernel machine formulation, D(d) is written in both cases (linear or not) as an expansion of kernels centered on the training patterns [5]:

D(d) = Σ_{i∈T} α_i k(d, d_i^±) + b.   (17)

Using such a non-linear model in criterion (16) leads to the formulation of the Kernel Fisher Discriminant (KFD) [6]. It has been shown that this problem can be solved using mathematical programs (quadratic or linear). The proofs and the implementation of the algorithm we use to optimize (16) can be found in [7].

In general, we are dealing with a 1+x class setup, with 1 class associated with the positives and x with the negatives [2]. Complex decision functions must then be estimated to learn the semantic concepts, increasing the risk of difficulties in choosing and tuning well-adapted kernels. However, selecting the representation set as the set of positive examples P turns the problem into a binary classification. Assuming that the positive examples are close to each other while being far from the negatives, the vectors d(z_{i∈P}, P) (within scatter) have lower norms than the vectors d(z_{i∈N}, P) (between scatter), leading to a binarization of the classification, as illustrated in Figure 10. In addition, this choice naturally leads to working in a low-dimensional space of p = |P| components, where online learning processes are dramatically sped up.

Kernel selection and setting is a critical issue for successfully learning queries. It effectively decides the classical trade-off between over-fitting and generalization properties of the classifier and hence is very dependent on the considered dissimilarity space. This problem is discussed in the next section.
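The overall workflow described above — take the positive examples as the representation set, fit a kernel discriminant of the form (17) on the resulting low-dimensional dissimilarity vectors, and use the learned D(d) to rank the database — can be sketched as follows. Note that a regularised least-squares (kernel ridge) fit to ±1 targets is used here as a stand-in for the mathematical-programming KFD solver of [7]; all names are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_ranker(d_pos, d_neg, sigma, reg=1e-3):
    """Fit D(d) = sum_i alpha_i k(d, d_i) + b on the dissimilarity vectors of
    the positive (d_pos) and negative (d_neg) training examples.
    A regularised least-squares fit to +/-1 targets stands in for the KFD solver."""
    X = np.vstack([d_pos, d_neg])
    y = np.concatenate([np.ones(len(d_pos)), -np.ones(len(d_neg))])
    K = rbf_kernel(X, X, sigma)
    b = y.mean()
    alpha = np.linalg.solve(K + reg * np.eye(len(X)), y - b)
    return X, alpha, b

def rank_database(D_all, train_X, alpha, b, sigma):
    """Score every database element (rows of D_all are dissimilarity vectors
    against the positive examples) and return indices from most to least relevant."""
    scores = rbf_kernel(D_all, train_X, sigma) @ alpha + b
    return np.argsort(-scores)
```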


Figure 10: The 1 + x class problem in feature space (left) and dissimilarity space (right), where the representation objects are two points from the central class (crosses).

6.2 Multimodal space

The video content is characterized by features corresponding to multiple modalities (e.g., visual, audio, speech). Each of them leads to a dissimilarity matrix containing pairwise distances between all documents. Let d_{f_i} denote the distance measure applied on the feature space F_i and assume that dissimilarity matrices are known for M feature spaces. We define the multimodal dissimilarity space d as the concatenation of all monomodal spaces d_{f_i}:

d = [d_{f_1}, d_{f_2}, ..., d_{f_M}].   (18)

The kernel function used in equation (17) now operates in a multimodal space. Its choice is therefore a critical issue to ensure the success of the modality fusion resulting from the resolution of equation (16). The RBF kernel k(x, y) = exp(−(x − y)^T A (x − y)) presents a convenient solution for our problem: it is able to learn semantic concepts that are locally distributed within the representation space, and the symmetric positive definite scaling matrix A permits tuning the trade-off between over-fitting and generalization. As the input space is multimodal, the scaling matrix is constructed so as to allow independent scaling for each feature space, so that A = diag[σ_{f_1}, ..., σ_{f_M}]. The vector σ_{f_i} ∈ IR^p is constant, with all values equal to the scale parameter σ_{f_i} estimated for the dissimilarity space d_{f_i}. The estimation of σ_{f_i} is based on a heuristic adapting the model to the query:

σ_{f_i} = C · median_i ( min_j || d_i^+ − d_j^- ||² ).   (19)

The scale value is thus tuned to the median of all the minimum distances between the positive and negative examples in feature space F_i. In that way, the kernel becomes tighter as the two classes become closer to each other. The parameter C has been empirically set to 2.0.
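A minimal sketch of the scale heuristic of equation (19), computed independently for each modality so that the block-diagonal scaling A = diag[σ_{f_1}, ..., σ_{f_M}] can be assembled; the function names and data layout are illustrative assumptions.

```python
import numpy as np

def modality_scale(d_pos, d_neg, C=2.0):
    """Heuristic of equation (19) for one modality: the median, over the positive
    examples, of the squared distance to the nearest negative example, scaled by C
    (empirically set to 2.0 in the experiments)."""
    d_pos, d_neg = np.asarray(d_pos), np.asarray(d_neg)
    # Squared distances between every positive and every negative example.
    d2 = ((d_pos[:, None, :] - d_neg[None, :, :]) ** 2).sum(-1)
    return C * np.median(d2.min(axis=1))

def multimodal_scales(pos_by_modality, neg_by_modality, C=2.0):
    """One scale parameter per monomodal dissimilarity space; together they define
    the diagonal scaling used in the multimodal RBF kernel."""
    return [modality_scale(p, n, C) for p, n in zip(pos_by_modality, neg_by_modality)]
```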

6.3 Experiments

Our multimodal interactive learning algorithm has been evaluated in the context of ViCoDE, the video retrieval system we have developed. The segmented video documents, their multimodal descriptions as well as manual annotations are stored in a database that keeps all data synchronized and allows large-scale evaluations of retrieval results [8].

The experiments consist in making queries corresponding to annotated concepts and measuring the average precision (ratio of relevant documents in the retrieved list, averaged over 50 queries) for retrieved lists of various lengths. The annotated positive examples are removed from the hit list so that they are not taken into account when measuring performance.
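The evaluation measure described above — precision within the first k retrieved documents, averaged over queries, with the annotated positive examples removed from the hit list — can be sketched as follows (illustrative names only, not the actual evaluation code).

```python
def precision_at_k(ranked_ids, relevant_ids, query_positives, k):
    """Precision within the first k retrieved documents, ignoring the
    positive examples that were part of the query itself."""
    hits, seen = 0, 0
    for doc_id in ranked_ids:
        if doc_id in query_positives:      # remove annotated query examples
            continue
        seen += 1
        hits += doc_id in relevant_ids     # count relevant documents
        if seen == k:
            break
    return hits / max(seen, 1)

def average_precision_at_k(runs, k):
    """Average the measure over a list of (ranked_ids, relevant_ids, query_positives)
    tuples, e.g. 50 queries for one concept."""
    return sum(precision_at_k(*run, k) for run in runs) / len(runs)
```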

6.3.1 The video database

We use a fully annotated video corpus composed of 133 hours of broadcast video. Videos are segmented into shots and every shot has been annotated with several concepts. The speech transcripts extracted by Automatic Speech Recognition (ASR) at the LIMSI laboratory [9] are also available.


We extracted the following three features from the 37,500 shots composing the corpus: a color histogram, a motion vector histogram and a word occurrence histogram (after stemming and stop-word removal). The distance measures used are Euclidean for the color and motion histograms and histogram intersection for the word occurrence histogram.

6.3.2 Results

We first evaluate how the combination of modalities can improve retrieval effectiveness. Figure 11 compares the average precision for several concepts when the query is learned in the monomodal spaces and in the multimodal space. We can observe that, even for queries (Car and Desert) where the raw features used are not well suited, the combination of the three modalities performs better than considering them separately. The precision graphs also compare the algorithm with a random retrieval (i.e. seeking hits at random within the database). This comparison illustrates the capability of the algorithm to use low-level multimodal information to create models of semantic concepts defined by the user, which dramatically improves the performance of the search.

The second experiment addresses the question of how the precision evolves as the number of positive and negative documents grows. As Figure 12 shows, the precision within the first 100 retrieved documents increases with the size of the training set.

Finally, we examined the computation time. Average response times for 20 negative and 5, 10, and 40 positive examples are respectively 1.4 s, 2 s and 7.4 s, while for 10 positive and 100 negative examples the time is 4.3 s. As the dimensionality of the representation space is equal to the number of positive examples, the response time increases with their number. Negative examples, on the other hand, have less influence since they are only involved in the learning process.

6.4 Summary

We have presented a retrieval framework for multimedia documents. Based on a multimodal dissimilarity space associated with a non-linear discriminant analysis, the algorithm is able to benefit from low-level multimodal descriptions of video documents and, as a consequence, to learn semantic queries from a limited number of input examples. The dissimilarity space has been designed so as to simplify the classification problem while building a low-dimensional representation of the data. As a result, queries on large databases are processed in near real time, which permits the use of the feedback loop as a search paradigm. Extensive evaluations on a large corpus show the efficiency and usability of the proposed techniques for retrieving documents within a large collection of videos.

References

[1] E. Y. Chang, B. Li, G. Wu, and K. Go, "Statistical learning for effective visual information retrieval," in Proceedings of the IEEE International Conference on Image Processing, 2003.

[2] X. S. Zhou and T. S. Huang, "Small sample learning during multimedia retrieval using BiasMap," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR'01, Hawaii, vol. I, pp. 11–17.

[3] E. Pekalska, P. Paclík, and R. P. W. Duin, "A generalized kernel approach to dissimilarity-based classification," Journal of Machine Learning Research, vol. 2, pp. 175–211, December 2001.

[4] T. F. Cox and M. A. A. Cox, Multidimensional Scaling, Chapman & Hall, London, 1995.

[5] B. Schölkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.


Figure 11: Average precision vs. length of the retrieved list for the monomodal (Color, Motion, ASR) and multimodal (3 modalities) dissimilarity spaces, shown for the queries 'Basketball', 'Weather News', 'Desert' and 'Car'. Each query is composed of 5 positive examples (annotated with the concept) and 20 negative examples randomly selected from the database. The "random guess" line is equal to the proportion of the concept in the database.

Figure 12: Average precision at 100 for the query 'Anchor person' as the number of positive examples (left) and the number of negative examples (right) increases.


[6] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher discriminant analysis with kernels," in Neural Networks for Signal Processing IX, Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, Eds., 1999, pp. 41–48, IEEE.

[7] S. Mika, G. Rätsch, and K.-R. Müller, "A mathematical programming approach to the kernel Fisher algorithm," in NIPS, 2000, pp. 591–597.

[8] N. Moenne-Loccoz, B. Janvier, S. Marchand-Maillet, and E. Bruno, "Managing video collections at large," in Proceedings of the First Workshop on Computer Vision Meets Databases, CVDB'04, Paris, France, 2004.

[9] J.-L. Gauvain, L. Lamel, and G. Adda, "The LIMSI broadcast news transcription system," Speech Communication, vol. 37, no. 1–2, pp. 89–108, 2002.

7 Data editing and annotation

7.1 Meeting data annotation

The purpose of annotating the IDIAP scripted meetings data was two-fold. The first aim was to create ground-truth data on which several recognisers, such as speech recognisers, hand- and body-gesture recognisers, face finders, person finders and person- and object-trackers, could be trained. In addition, these hand-labelled data also stand in for the results of the aforementioned recognisers. These simulated results could then be used as a basis for work on a higher semantic level in multimodal integration techniques, e.g. the recognition of group events, until the results of the real recognisers become available.

The annotation was made with the tool Anvil© by Michael Kipp² (cf. Figure 13). The annotation scheme comprises all single actions of the participants as well as all occurring group actions. The group actions used are:

• Discussion

• Monologue1

• Monologue2

• Monologue3

• Monologue4

• Note-taking

• Presentation

• White-board

The annotation tool produces output in XML format (cf. Figure 14) that can be read by any other program able to process XML. The annotations can be obtained from the IDIAP media file server³ in the protected area.

2 See www.dfki.de/~kipp/anvil
3 mmm.idiap.ch


Figure 13: The annotation tool Anvil


Figure 14: Excerpt of the annotation in XML format


7.2 Continuous Video Labeling Tool

The Continuous Video Labeling (CVL) tool supports the labeling of time-aligned annotation layers directly related to the signal files. Any annotation layer that consists of a labeling of non-overlapping segments of the time line can be coded using this tool. Examples include gaze directions, postures, emotions and meeting actions. The CVL tool is developed using NXT (the NITE XML Toolkit). NXT represents data for each observation (e.g. one meeting) and for each agent (e.g. one meeting participant) in separate, interrelated XML files. The structure and location of the XML files are represented in a metadata file. The CVL tool supports making new annotations for more than one agent and more than one annotation layer simultaneously. It also supports viewing the existing annotations for the various annotation layers.

Figure 1: Making annotations in the CVL tool

Making annotations. Before the actual annotation is started, the user must select at least one agent and at least one annotation layer from the Annotate menu. The main window displays annotation windows for all combinations of the selected agents and layers (see Figure 1).


Figure 2: An annotation window

The annotation window consists of three parts (see Figure 2). The top part is an annotation area that displays previously created annotations; the annotation elements are displayed as a list of the labels' text representations. The part under the annotation area shows information about the selected element or the element that is being created: it displays the start time, end time, label and annotator comments for the current annotation element. The bottom part contains the buttons that represent the labels of the current annotation layer. If a label set contains agents (e.g. the label set of the gaze layer), the annotation window displays, at the top of the button part, a combo box that lists all signals available for the current observation. The user may select from the combo box the signal that is used as an input layer for the current annotations. The agent buttons are ordered so that they match the positions of the speakers in the selected signal file. This facilitates the annotation process in the sense that the annotator does not have to be concerned about the positions of agents that are not captured in the selected video file. The CVL tool supports real-time as well as off-line annotation. The user can create annotations by clicking the label buttons while the media file is playing or by pausing the media file before clicking the label button. A segment of time-aligned annotations is an arbitrary time fragment. From the perspective of the annotation of one agent, the segments are continuous, non-overlapping and they fully cover the whole input layer. When a user clicks on a new label button, the end time of the previous segment is set to the current signal time and a new segment is automatically started at this point.

Viewing annotations. The tool supports viewing the annotations of several annotation layers without creating annotations in those layers. The layers are selected from the View menu (see Figure 3). For each of the selected layers, a view window is displayed. The view window contains tab sheets for all available agents, so the view window of an annotation layer displays, for each agent, the labelled elements in that layer.


Figure 3: Viewing annotations in the CVL tool

Playing media files. The media files are played using the NITE Media Player. The tool supports playing several media files simultaneously. A new player is created by clicking the "New" button at the bottom of the media player. At the top of a media player there is a combo box that lists all signals available for the current observation. For each of the created players, the user may choose from the combo box which signal will be played in that media player.

Synchronization with the media files. The annotation and view windows are synchronised with the media signals. Thus, while the media is playing, the annotation and view windows highlight the annotation elements that cover the current signal time (see Figure 3). The tool also allows users to select one or more annotation elements and to play the media fragment that covers those elements.

7.3 Automatic Video Editing

An off-line automatic video editing algorithm was implemented. It is able to generate a compact output video from several source videos recorded from different views. The algorithm is based on an activity evaluation and the processing of editing rules, as described in report D3.2. Information about participants' positions and speech activity is used. The output of the algorithm is a scenario for the audio-video stream editing tool. Adjustability according to the desired information (query) is included: the user can specify which participant or camera should be preferred or inhibited in the output video, and the appropriate rules are modified according to these requirements.

In addition, a basic video summarization based on the measured activity was implemented. The viewer can specify the maximum acceptable length of the produced video sequence. The intervals with low activity are then removed from the output video; the editing phase is iteratively repeated until the desired length is reached (see the sketch at the end of this section). Fixed virtual cameras, i.e. zoomed viewports of the source picture, are used for detail shots and as a means of highlighting the desired activities.

The development of a tool for editing AV streams was also started. Several audio and video source streams can be transformed into one output stream according to the scenario generated by the video editing algorithm. Various effects can be applied to the audio and video


tracks. This tool uses an in-house library for manipulating media files. A Linux version of this library was additionally implemented, and a transparent interface for MS Windows and Linux applications is now available. The library is currently used in image processing applications running on a Linux cluster.
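As a rough illustration of the summarisation step described above, the following sketch repeatedly drops the lowest-activity interval until the edited scenario fits the requested duration. The interval data structure is hypothetical and far simpler than the real editing rules of the algorithm.

```python
def summarize_by_activity(intervals, max_length):
    """Iteratively remove the lowest-activity intervals until the total duration
    of the remaining scenario is at most max_length (seconds).

    intervals: list of (start, end, activity) tuples produced by the activity
               evaluation step (hypothetical structure).
    """
    kept = sorted(intervals, key=lambda iv: iv[0])          # keep temporal order
    total = sum(end - start for start, end, _ in kept)
    while total > max_length and len(kept) > 1:
        # find and drop the interval with the lowest measured activity
        lowest = min(kept, key=lambda iv: iv[2])
        kept.remove(lowest)
        total -= lowest[1] - lowest[0]
    return kept

# Example: trim a 90 s scenario down to at most 60 s.
scenario = [(0, 30, 0.9), (30, 60, 0.2), (60, 90, 0.7)]
print(summarize_by_activity(scenario, 60))
```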


8. Conclusion

Multimodal integration and access have been fully addressed throughout the M4 project. We are now able to report on a robust framework for the generic integration and similarity-based access of multimodal data. This framework is coupled with the development of multimodal meeting features benefiting from all modalities classically encountered. We have also developed a number of approaches for managing data manually where needed, essentially for the creation of multimodal test and training data. This deliverable therefore forms a consistent report on innovative, implemented techniques for multimodal information access and integration.
