Multimodal Integration for Meeting Group Action
Segmentation and Recognition
M. Al-Hames, A. Dielmann, D. Gatica-Perez, S. Reiter, S. Renals, G. Rigoll, and D. Zhang
Institute for Human-Machine Communication, Technische Universität München
Centre for Speech Technology Research, University of Edinburgh
IDIAP Research Institute
M. Al-Hames, A. Dielmann, D. Gatica-Perez, S. Reiter, S. Renals, G. Rigoll, and D. Zhang Multimodal Integration for Meeting Group Action Segmentation and Recognition 2
Overview
1. Introduction: Structuring of meetings
2. Audio-visual features
3. Models and experimental results
   • Segmentation using semantic features (BIC-like)
   • Multi-layer HMM
   • Multi-stream DBN
   • Noise robustness with a mixed-state DBN
4. Summary
Structuring of Meetings
Meetings can be modelled as a sequence of events: group or individual actions from a set
A = {A1, A2, A3, …, AN}
(Example timeline: A2, A1, A2, A6, A3)
Structuring of Meetings
Meetings can be modelled as a sequence of events: group or individual actions from a set
A = {A1, A2, A3, …, AN}
Different sets represent different meeting views
(Example timeline under different views:
Group interest: Neutral, High, Low;
Group task: Information sharing, Decision, Information sharing;
Discussion phase: Discussion, Presentation, Discussion, Monologue;
Person 1: Idle, Writing, Speaking, Idle)
Meeting Views
Group
Discussion Phase
• Discussion
• Monologue (with ID)
• Note-taking
• Presentation
• Whiteboard
• Monologue (ID) + Note-taking
• Presentation + Note-taking
• Whiteboard + Note-taking
Interest Level
• High
• Neutral
• Low
Task
• Brainstorming
• Decision making
• Information sharing
Individual
Interest Level
• High
• Neutral
• Low
Actions
• Speaking
• Writing
• Idle
Further sets could represent the current agenda item or the topic discussed, but may require higher semantic knowledge for an automatic analysis.
Action Lexicon
Group
Discussion Phase
• Discussion
• Monologue (with ID)
• Note-taking
• Presentation
• Whiteboard
• Monologue (ID) + Note-taking
• Presentation + Note-taking
• Whiteboard + Note-taking
Action lexicon I: only single actions (8 action classes)
Action lexicon II: single actions and combinations of parallel actions (14 action classes)
M4 Meeting Corpus
• Recorded at IDIAP
• Three fixed cameras
• Lapel microphones
• Microphone array
• Four participants
• 59 videos, each 5 minutes
System Overview
(Pipeline: cameras, the microphone array, and lapel microphones feed multimodal feature extraction, followed by action segmentation and action classification.)
Visual Features
Global motion (for each position):
• Centre of motion
• Changes in motion (dynamics)
• Mean absolute deviation
• Intensity of motion

Skin-colour blobs (GMM, for each person):
• Head: orientation and motion
• Hands: position, size, orientation, and motion
• Moving blobs from background subtraction
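The global-motion features above can be illustrated with a simple frame-difference motion map. A minimal sketch, not the authors' exact implementation; the function name and the grayscale-NumPy-array input are assumptions:

```python
import numpy as np

def global_motion_features(prev_frame, frame, prev_centre=None):
    """Global-motion features from two consecutive grayscale frames.

    The absolute difference image serves as a motion map, from which the
    centre of motion, its change over time (dynamics), the mean absolute
    deviation around the centre, and the intensity of motion are derived.
    """
    diff = np.abs(frame.astype(float) - prev_frame.astype(float))
    total = diff.sum()
    if total == 0:                      # no motion at all
        centre = np.zeros(2)
        mad = 0.0
    else:
        ys, xs = np.mgrid[0:diff.shape[0], 0:diff.shape[1]]
        # centre of motion: motion-weighted mean pixel position (x, y)
        centre = np.array([(xs * diff).sum(), (ys * diff).sum()]) / total
        # mean absolute deviation of motion around its centre
        dist = np.sqrt((xs - centre[0]) ** 2 + (ys - centre[1]) ** 2)
        mad = (dist * diff).sum() / total
    # changes in motion (dynamics): shift of the centre between frames
    dynamics = centre - prev_centre if prev_centre is not None else np.zeros(2)
    intensity = diff.mean()             # intensity of motion
    return centre, dynamics, mad, intensity
```

In a real system the features would be computed per table region or seat position rather than for the full frame.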
Audio Features
Microphone array (for each position):
• For each seat (4), the whiteboard, and the projector screen
• SRP-PHAT measure to estimate speech activity
• Speech/silence segmentation

Lapel microphones (for each person, for each speech segment):
• Energy
• Pitch (SIFT algorithm)
• Speaking rate (combination of estimators)
• MFC coefficients
(Figure: speech/silence segmentation over time for the four seats S1-S4, the projector PR, and the whiteboard WH.)
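The SRP-PHAT idea can be sketched in a few lines. This is an illustrative toy (function names and the integer-delay lookup are assumptions; real array processing uses fractional delays and many microphone pairs): PHAT-weighted cross-correlations are summed at the delays expected for each candidate location, and the resulting score indicates speech activity at that location.

```python
import numpy as np

def gcc_phat(sig_a, sig_b):
    """PHAT-weighted generalized cross-correlation of two mic signals."""
    n = len(sig_a) + len(sig_b)
    X = np.fft.rfft(sig_a, n) * np.conj(np.fft.rfft(sig_b, n))
    X /= np.maximum(np.abs(X), 1e-12)   # PHAT weighting: keep only the phase
    return np.fft.irfft(X, n)           # circular correlation over all lags

def srp_phat(mic_signals, delays_per_location):
    """Steered response power with PHAT weighting.

    delays_per_location[l] maps a mic pair (i, j) to the integer sample
    delay expected if the source were at candidate location l
    (a seat, the whiteboard, or the projector screen).
    """
    scores = []
    for delays in delays_per_location:
        power = 0.0
        for (i, j), tau in delays.items():
            cc = gcc_phat(mic_signals[i], mic_signals[j])
            power += cc[tau]            # negative tau wraps around circularly
        scores.append(power)
    return scores
```

The location with the highest score is taken as the active speech position for that frame.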
Lexical Features
For each person:
• Speech transcription (or output of ASR)
• Gesture transcription (or output of an automatic gesture recognizer)
• Gesture inventory:
  – Writing
  – Pointing
  – Standing up
  – Sitting down
  – Nodding
  – Shaking head
BIC-like approach
• Strategy similar to the Bayesian information criterion (BIC)
• Segmentation and classification in one step
• Two windows of variable length are shifted over the time scale
• The inner border is shifted from left to right
• Classify each window: if the left and right windows yield different results, the inner border is considered a boundary of a meeting event
• If no boundary is detected, enlarge the whole window
• Repeat until the right border reaches the end of the meeting
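The two-window strategy can be sketched as follows. A minimal illustration with assumed window sizes; `classify` stands in for any of the window classifiers (BN, GMM, MLP, RBF, SVM):

```python
def segment_meeting(features, classify, min_win=10, step=5):
    """Return detected event boundaries (frame indices)."""
    boundaries = []
    left = 0
    right = min(2 * min_win, len(features))
    while right <= len(features):
        found = False
        # shift the inner border from left to right
        for inner in range(left + min_win, right - min_win + 1, step):
            if classify(features[left:inner]) != classify(features[inner:right]):
                boundaries.append(inner)    # differing labels -> event boundary
                left = inner                # restart behind the boundary
                right = min(inner + 2 * min_win, len(features))
                found = True
                break
        if not found:
            if right == len(features):
                break
            right = min(right + step, len(features))   # enlarge the window
    return boundaries
```

The detected boundary can land slightly before the true change point, since a right window already dominated by the new event is enough to flip the classification.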
BIC-like approach
Classifier   Insertion rate [%]   Deletion rate [%]   Accuracy [s]   Rec. error [%]
BN           14.7                 6.22                7.93           39.0
GMM          24.7                 2.33                10.8           41.4
MLP          8.61                 1.67                6.33           32.4
RBF          6.89                 3.00                5.66           31.6
SVM          17.7                 0.83                9.08           35.7
Experimental setup: 57 meetings using “Lexicon 1”
Multi-layer Hidden Markov Model
Two-layer HMM

(Figure: person-specific features feed four individual HMMs, I-HMM 1 to I-HMM 4; their outputs, together with group features, feed the group HMM.)

• Individual action layer (I-HMM): Speaking, Writing, Idle
• Group action layer (G-HMM): Discussion, Monologue, …
• Each layer is trained independently
• Simultaneous segmentation and recognition
Multi-layer Hidden Markov Model
• Smaller observation space -> stable for limited training data
• I-HMMs are person-independent -> much more training data from different persons is available
• G-HMM is less sensitive to small changes in the low-level features
• The two layers are trained independently -> different HMM combinations can be explored
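A toy sketch of the two-layer idea (all parameters and names are hypothetical; the paper's I-HMMs and G-HMMs operate on continuous audio-visual features, not the discrete symbols used here): each person's stream is decoded by a person-independent I-HMM, and the joint pattern of the four individual actions becomes the observation of the G-HMM.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path of a discrete-observation HMM.
    pi: initial probs (S,); A: transitions (S, S); B: emissions (S, O)."""
    delta = np.log(pi + 1e-12) + np.log(B[:, obs[0]] + 1e-12)
    back = np.zeros((len(obs), len(pi)), dtype=int)
    for t in range(1, len(obs)):
        scores = delta[:, None] + np.log(A + 1e-12)   # (prev state, next state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]] + 1e-12)
    path = [int(delta.argmax())]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def two_layer_decode(person_obs, i_hmm, g_hmm):
    """First layer: one person-independent I-HMM decodes each person's
    stream into individual actions (speaking/writing/idle as 0/1/2).
    Second layer: the four simultaneous actions are encoded as a single
    symbol (base 3) and decoded by the G-HMM into group actions."""
    indiv = [viterbi(o, *i_hmm) for o in person_obs]
    group_obs = [sum(a * 3 ** p for p, a in enumerate(frame))
                 for frame in zip(*indiv)]
    return viterbi(group_obs, *g_hmm), indiv
```

Because the layers are decoupled, either HMM can be swapped for a multi-stream or asynchronous variant without retraining the other layer.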
Multi-layer Hidden Markov Model
Experimental setup: 59 meetings using “Lexicon 2”
Method               AER (%)
Single-layer HMM
  Visual only        48.2
  Audio only         36.7
  Early integration  23.7
  Multi-stream       23.1
  Asynchronous       22.2
Multi-layer HMM
  Visual only        42.4
  Audio only         32.3
  Early integration  16.5
  Multi-stream       15.8
  Asynchronous       15.1
Multi-stream DBN Model
• Counter structure
  – Improves state-duration modelling
  – Action counter C is incremented after each "action transition"
• Hierarchical approach
• State-space decomposition:
  – Two feature-related sub-action nodes (S1 and S2)
  – Each sub-state corresponds to a cluster of feature vectors
  – Unsupervised training
• Independent modality processing
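The counter semantics can be illustrated directly. A toy derivation of C and a transition indicator E from a given action sequence, showing only the bookkeeping, not the DBN inference itself:

```python
def counter_variables(actions):
    """Derive the counter C and the action-transition indicator E:
    E flags each "action transition", and C is incremented after one."""
    C, E = [0], [0]
    for prev, cur in zip(actions, actions[1:]):
        e = int(cur != prev)            # 1 exactly at an action change
        E.append(e)
        C.append(C[-1] + e)             # counter accumulates transitions
    return C, E
```

In the model, conditioning on C lets the network reason about how many actions have occurred so far, which improves duration modelling compared with plain Markov state self-loops.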
(Figure: the multi-stream DBN unrolled over time: action nodes A_t with counter nodes C_t and transition-enabling nodes E_t, two sub-action nodes S_t^1 and S_t^2, and their observations Y_t^1 and Y_t^2.)
Experimental Results
Feature setup           HMM    Multi-stream HMM   Mult. + Count.
TUM (Beamf. + Video)    92.9   21.4               21.4
IDIAP (Audio + Video)   61.6   26.7               24.9
UEDIN (Turn. + Pros.)   48.4   17.1               18.9

(Chart: sub-action state spaces 7x7, 6x6, 6x6; values 21.4, 16.7, 11.0.)
Experimental setup: cross-validation over 53 meetings using “Lexicon 1”
Mixed-State DBN

(Figure: coupled system, described as a DBN: a multi-stream HMM with audio nodes H^A_t and visual nodes H^V_t emitting observations O^A_t and O^V_t, coupled to a linear dynamical system over the visual stream with hidden states X^V_t and inputs U^V_t, unrolled over time t.)
• Couple a multi-stream HMM with a linear dynamical system (LDS)
• The HMM is the driving input for the LDS
• With the HMM as driving input, disturbances in the observations can be compensated
• The complete system can be described as a DBN
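A minimal numeric sketch of the coupling idea (all matrices and values are illustrative assumptions, not the paper's parameters): the decoded HMM state supplies the driving input of a linear dynamical system, so a Kalman filter can pull disturbed observations back toward the features expected for that state.

```python
import numpy as np

def hmm_driven_kalman(obs, state_means, hmm_path, A, C, Q, R):
    """Kalman filter for x_t = A x_{t-1} + u_t + w,  y_t = C x_t + v,
    where the driving input u_t is taken from the HMM: here simply the
    mean feature vector of the decoded HMM state at time t."""
    n = A.shape[0]
    x = np.zeros(n)
    P = np.eye(n)
    filtered = []
    for y, s in zip(obs, hmm_path):
        u = state_means[s]                  # HMM state mean as driving input
        x = A @ x + u                       # predict
        P = A @ P @ A.T + Q
        S = C @ P @ C.T + R                 # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)      # Kalman gain
        x = x + K @ (y - C @ x)             # correct with disturbed observation
        P = (np.eye(n) - K @ C) @ P
        filtered.append(x)
    return np.array(filtered)
```

With a small process noise Q and a large observation noise R, the filter trusts the HMM-driven prediction and largely ignores disturbed observations, which is the intuition behind the noise robustness reported below.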
Mixed-State DBN
Experimental setup: 59 meetings using “Lexicon 1”
Recognition results (%); Audio, Array, Video: single-modal HMMs; HMM, DBN: multi-modal

Condition                   Audio   Array   Video   HMM    DBN
I    Clean test data        83.1    83.6    67.2    85.2   87.8
II   Lapel disturbed        61.1    -       -       80.9   87.0
III  Array disturbed        -       75.4    -       83.5   87.0
IV   Video disturbed        -       -       40.9    82.6   83.5
V    All streams disturbed  -       -       -       80.0   81.7
Summary

• Joint effort of TUM, IDIAP, and UEDIN
• Modelling meetings as a sequence of individual and group actions
• Large number of audio-visual features
• Four different group action modelling frameworks:
  – Algorithm based on BIC
  – Multi-layer HMM
  – Multi-stream DBN
  – Mixed-state DBN
• Good performance of the proposed frameworks
• Future work:
  – Room for further improvement, both in the feature domain and in the model structures
  – Investigate combinations of the proposed models