We acknowledge support from:
• NSF-STIMULATE program, Grant No. IRI-9618887, “Gesture, Speech, and Gaze in Discourse Segmentation”
• NSF-KDI program, Grant No. BCS-9980054, “Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Tools for Gesture, Speech, and Gaze Research”
• NSF-ITR program, Grant No. IIS-0219875, “Beyond the Talking Head and Animated Icon: Behaviorally Situated Avatars for Tutoring”
• ARDA-VACE II program, “From Video to Information: Cross-Modal Analysis of Planning Meetings”
VACE Multimodal Meeting Corpus
Lei Chen, Travis Rose, Fey Parrill, Xu Han, Jilin Tu, Zhongqian Huang, Mary Harper, Francis Quek, David McNeill, Ronald Tuttle, and Thomas Huang
Francis Quek
Professor of Computer Science
Director, Center for Human Computer Interaction
Virginia Tech
Corpus Rationale
• A quest for meaning: embodied cognition and language production drives our research
• Analysis of ‘natural’ human-human meetings
• Resource in support of research in:
  – Multimodal language analysis
  – Speech recognition and analysis
  – Vision-based communicative behavior analysis
Why Multimodal Language Analysis?
S1 you know like those ¿fireworks?
S2 well if we're trying to drive'em / out her<r>e # we need to put'em up her<r>e
S1 yeah well what I'm saying is we should*
S2 in front
S1 we should do it* we should make it a lin<n>e through the room<m>s / so that they explode like here then here then here then here
Multimodal Language Example
[Embedded video clip]
Embodied Communicative Behavior
• Constructed dynamically at the moment of speaking (thinking for speaking)
• Dependent on cultural, personal, social, and cognitive differences
• Speaker is often unaware of his or her gestures
• Reveals the contrastive foci of the language stream (Hajičová, Halliday et al.)
• Is co-expressive (co-temporal) with speech
• Is multiply determined
• Temporal synchrony is critical for analysis
In a Nutshell
Gesture/Speech Framework (McNeill 1992, 2000, 2001; Quek et al. 1999–2003)
[Framework diagram relating: discourse production, mental imagery, embodied imagery, video access, computable ‘image-bearing’ features, and inference over imagistic gesture features and their coherence with conceptual discourse units]
ARDA/VACE Program
• ARDA is to the intelligence community what DARPA is to the military
• Interest is in the exploitation of video data (Video Analysis and Content Exploitation)
• A key VACE challenge: meeting analysis
• Our key theme: multimodal communication analysis
From Video to Information: Cross-Modal Analysis for Planning Meetings
Team
Multimodal Meeting Analysis: A Cross-Disciplinary Enterprise
Overarching Approach
• Coordinated multidisciplinary research
• Corpus assembly
  – Data is transcribed and coded for relevant speech/language structure
  – War-gaming (planning) scenarios are captured to provide real planning behavior in a controlled experimental context (reducing many ‘unknowns’)
  – Meeting room is multiply instrumented with cross-calibrated video, synchronized audio/video, and motion tracking
  – All data components are time-aligned across the dataset
• Multimodal video processing research
  – Research on posture, head position/orientation, gesture tracking, hand-shape recognition, and multimodal integration
• Research in tools for analysis, coding, and interpretation
• Speech analysis research in support of multimodality
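Time-aligning heterogeneous streams (video frames, audio, motion capture) to one reference clock is central to the approach above. Below is a minimal sketch of resampling one stream onto another's timeline by linear interpolation; the sampling rates and signal are illustrative, not the corpus's actual streams.

```python
import numpy as np

def align_to_timeline(t_src, values, t_ref):
    """Resample a 1-D signal onto a reference timeline by linear interpolation."""
    return np.interp(t_ref, t_src, values)

# Illustrative rates: ~29.97 fps video, 120 Hz motion capture.
t_video = np.arange(0, 1.0, 1 / 29.97)
t_mocap = np.arange(0, 1.0, 1 / 120.0)
x_mocap = np.sin(2 * np.pi * t_mocap)        # stand-in for one marker coordinate
x_on_video = align_to_timeline(t_mocap, x_mocap, t_video)
```

In practice each stream keeps its native rate and only analysis queries are interpolated, so no recording fidelity is lost.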
Scenarios
• Each scenario has five participants
• Roles tailored to available participant expertise
• Five initial scenarios:
  – Delta II Rocket Launch
  – Foreign Material Exploitation
  – Intervention to Support a Democratic Movement
  – Humanitarian Assistance
  – Scholarship Selection
Scenarios (cont’d)
Planned scenarios (to be developed):
• Lost Aircraft Crisis Response
• Hostage Rescue
• Downed Pilot Search & Rescue
• Bomb Shelter Design
Scenario Development
Humanitarian Assistance Walkthrough
• Purpose: develop a plan for immediate military support to the Dec 04 Asian tsunami victims
• Considerable open-source information from the Internet for scenario development
• Roles: Medical Officer, Task Force Commander, Intel Officer, Operations Officer, Weather Officer
• Mission goals & priorities provided for each role
[Images: before tsunami / after tsunami]
“As intelligence officer, your role is to provide intelligence support to OPERATION UNIFIED ASSISTANCE. While the extent of damage is still unknown, early reporting indicates that coastal areas throughout South Asia have been affected. Communications have been lost with entire towns. Currently, the only means of determining the magnitude of destruction is from overhead assets. Data from the South Asia and Sri Lanka region has already been received from civilian remote sensing satellites. Although the US military will be operating in the region on a strictly humanitarian mission, the threat still exists of hostile action to US personnel by terrorist factions opposed to the US. As intel officer, you are responsible for briefing the nature of the terrorist threat in the region.”
Meulaboh, Indonesia
Corpus Assembly
Multi-modal Elicitation Experiment
Time-Aligned Multimedia Transcription
Video Processing: 10-camera calibration, vector extraction, hand tracking, gaze tracking, head modeling, head tracking, body tracking
Motion Capture Interpretation
Speech & Psycholinguistic Coding: speech transcription, psycholinguistic coding
Speech & Audio Processing: automatic transcript word/syllable alignment to audio, audio feature extraction
10-camera video & digital audio capture
3D Vicon Extraction
Data Acquisition & Processing
Meeting Room and Camera Configuration
[Room layout diagram: participants A–H seated around the table, covered by ten cameras]

Participant → cameras in which visible:
A: 4, 6, 8
B: 6, 7, 8, 10
C: 7, 10
D: 1, 7, 9, 10
E: 4, 6, 10
F: 1, 2, 3, 5
G: 2, 5
H: 2, 4, 5, 6

Camera → participants in view:
C1: D, E, F
C2: H, G, F
C3: F, E
C4: H, A
C5: F, G, H
C6: B, A, H
C7: D, C, B
C8: B, A
C9: D, E
C10: B, C, D

Stereo camera pairs (T1, T2):
Pair 1: C9–C3    Pair 7:  C7–C10
Pair 2: C1–C3    Pair 8:  C2–C5
Pair 3: C9–C1    Pair 9:  C2–C4
Pair 4: C4–C8    Pair 10: C3–C5
Pair 5: C4–C6    Pair 11: C7–C9
Pair 6: C6–C8    Pair 12: C8–C10
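Each stereo pair in the table supports 3-D reconstruction of tracked points visible to both cameras. Below is a minimal sketch of linear (DLT) two-view triangulation; the projection matrices and image points are synthetic stand-ins, not the room's actual calibration.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3-D point from a calibrated stereo pair.

    P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) image
    coordinates of the same point in each view.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null vector of A = homogeneous solution
    X = Vt[-1]
    return X[:3] / X[3]              # dehomogenize

# Synthetic check: two hypothetical cameras with a 1-unit baseline.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
Xw = np.array([0.2, -0.1, 4.0, 1.0])           # known world point
x1 = (P1 @ Xw)[:2] / (P1 @ Xw)[2]
x2 = (P2 @ Xw)[:2] / (P2 @ Xw)[2]
Xr = triangulate(P1, P2, x1, x2)
```

With noisy detections one would normalize image coordinates and refine the linear estimate, but the pair tables above determine which two views feed this computation for each participant position.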
Cam1 view:
• 48 calibration dots for camera calibration
• 18 Vicon markers for coordinate-system transformation: Y = RX + T
Global & Pairwise Camera Calibration
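The transformation Y = RX + T maps marker coordinates between the camera-calibration frame and the Vicon frame. One standard way to estimate R and T from corresponding point sets is the SVD-based (Kabsch) fit sketched below; the 18 "marker" coordinates here are synthetic, not the room's actual data.

```python
import numpy as np

def fit_rigid_transform(X, Y):
    """Estimate rotation R and translation T such that Y ≈ R @ x + T per point.

    X, Y: (N, 3) arrays of corresponding 3-D points (e.g. the same markers
    measured in the camera frame and in the Vicon frame).
    """
    cx, cy = X.mean(axis=0), Y.mean(axis=0)
    H = (X - cx).T @ (Y - cy)                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T        # proper rotation, det = +1
    T = cy - R @ cx
    return R, T

# Synthetic check: recover a known rotation and translation.
rng = np.random.default_rng(0)
X = rng.normal(size=(18, 3))                       # 18 "markers"
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
T_true = np.array([10.0, -2.0, 5.0])
Y = X @ R_true.T + T_true
R, T = fit_rigid_transform(X, Y)
```

The determinant guard matters: with noisy or near-planar marker sets a plain SVD solution can return a reflection instead of a rotation.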
Calibration error distributions over the meeting-room area (camera pairs 5–12):

Direction   Minimum     Mean        Maximum
X           0.4000 mm   0.4755 mm   0.5886 mm
Y           0.3077 mm   0.4529 mm   0.6925 mm
Z           0.3804 mm   0.4317 mm   0.5064 mm
VICON Motion Capture
• Motion-capture technology: near-IR cameras, retro-reflective markers, Datastation + PC workstation
• Vicon modes of operation: individual points (as seen in calibration), kinematic models, individual objects
VICON Motion Capture
Learning about MoCap:
• 11/03: initial installation
• 6/04: pilot scenario, using kinematic models
• 10/04: follow-up training using object models
• 11/04: rehearsed using Vicon with object models
• 1/05: data captured for FME scenario
Export position and orientation information for each participant’s head, hands, and body
Post-processing of motion-capture data: ~1 hour per minute of data for a 5-participant meeting
Incorporating MoCap into the workflow:
• Labeling of point clusters is labor-intensive
• 3 work-study students @ 20 hours/wk ≈ 60 minutes of data (1 dataset) per week
Speech Processing Tasks
• Formulate an audio workflow to support the efficient and effective construction of a large, high-quality multimodal corpus
• Implement support tools to achieve this goal
• Package time-aligned word transcriptions into appropriate data formats that can be efficiently shared and used
Audio Processing
Pipeline: Audio Recording & Meeting Metadata Annotation → Audio Segmentation → Manual Transcription → Forced Alignment → OOV Word Resolution (audio, segmentation, and transcription flow between the stages)
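The pipeline's end product is a set of time-aligned word transcriptions packaged for sharing. Below is a minimal sketch of one possible serialization; the record fields and JSON layout are illustrative assumptions, not the corpus's actual schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AlignedWord:
    speaker: str
    word: str
    start: float   # seconds from session start
    end: float

def package(words, path):
    """Write time-aligned words, sorted by onset, as JSON."""
    records = [asdict(w) for w in sorted(words, key=lambda w: w.start)]
    with open(path, "w") as f:
        json.dump(records, f, indent=2)

# Hypothetical alignment output for one speaker.
words = [
    AlignedWord("E", "intelligence", 12.48, 13.11),
    AlignedWord("E", "the", 12.31, 12.48),
]
package(words, "session.json")
```

Sorting by onset lets downstream tools merge the word stream with video and motion-capture tracks by timestamp alone.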
Corpus Integration
VACE Metadata Approach
Data Collection Status
• Pilot: June 04
  – Low audio volume; sound mixer purchased
  – Video frame drop-out; high-grade DV tapes purchased
• AFIT 02-07-05: Democratic movement assistance (two sessions)
  – Audio clipping in close-in mics; may be able to salvage data using the desktop mics
• AFIT 02-24-05: Humanitarian Assistance (Tsunami)
• AFIT 03-04-05: Humanitarian Assistance (Tsunami)
• AFIT 03-18-05: Scholarship selection
• AFIT 04-08-05: Humanitarian Assistance (Tsunami)
• AFIT 04-08-05: Card Game
• AFIT 04-25-05: Problem Solving Task (cause of deterioration of the Lincoln Memorial)
• AFIT 06-??-05: Problem Solving Task
Some Multimodal Meeting Room Results
F2 & F1 Lance Armstrong Episode
[Embedded video clip]
NIST Microcorpus, July 29, 2003. Meeting Dynamics: F1 vs F2
M1 [do you wanna pull up a site here to]
  gaze/orientation (head; instrumental): F1, M1, and M2 gaze at the screen off camera, or might be looking at F2, who is now off camera walking toward the board
00:00:46:11 F2 [washington post]
  gaze/orientation (head; interactive): F1 directs gaze at F2
  action: F1 twists in chair toward F2
M1 [/]  F2 [/]
  gaze/orientation (head; interactive): M1 and F1 gaze directed at F2; M2 gaze remains directed at screen
F1 yeah what do you what do you like]
00:00:47:14 F1 yeah washington
  action: F2 facing whiteboard
M2 [we should put them in and just vote]
  gaze/orientation (head; interactive): M2 turns head to direct gaze to M1 (?) NB: it is possible that M2’s gaze is following F2, who is walking around the table at this point, up until M2 begins speaking
M1 [/]  F2 [/]
  gaze/orientation (head; interactive): F1 and M1 gaze toward M2
M2 %chuckle
M1 vote [<uhh>]
  M1 turns head left to direct gaze at screen (possibly)
M2 we all follow the news right
  action: M2 pulls back in chair away from table
F2 [/]
  gaze/orientation (head; instrumental): F1 gaze directed down at paper on table
  action: F1 tears off a piece of paper from her pad on the table
Gaze direction tracks social patterns (interactive gaze) and engagement of objects (instrumental gaze), which may be a form of pointing as well as perception
Interactive gaze occurrences, 5-minute sample (rows: gaze source; columns: gaze target):

        F2   F1   M1   M2    Σ
F2       –    –    1    2    3
F1       3    –    4    3   10
M1       6    1    –    2    9
M2       3    –    2    –    5
Σ       12    1    7    7
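Tables like this can be generated mechanically from a coded event log. Below is a minimal sketch that tabulates a hypothetical list of (gazer, gazee) events, not the actual NIST coding.

```python
from collections import Counter

def gaze_matrix(events, people):
    """Tabulate interactive gaze: rows = gazer, columns = gazee, plus margins."""
    counts = Counter(events)
    table = {s: {t: counts[(s, t)] for t in people} for s in people}
    row_sums = {s: sum(table[s].values()) for s in people}
    col_sums = {t: sum(table[s][t] for s in people) for t in people}
    return table, row_sums, col_sums

# Hypothetical 5-minute event log of (gazer, gazee) occurrences.
events = [("F1", "M1")] * 4 + [("F1", "F2")] * 3 + [("M1", "F2")] * 6
table, rows, cols = gaze_matrix(events, ["F1", "F2", "M1", "M2"])
```

The row margins show who initiates gaze most; the column margins show who attracts it, which is the asymmetry the analysis above exploits.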
Instrumental gaze
Gaze - NIST July 29, 2003 Data
Gaze: AFIT data (rows: gazer; columns: gazee B C D E F G H; ‘----’ marks self):
B: ---- 1 3 1 3 (total 8)
C: ---- 6 1 5 (total 12)
D: 4 ---- 3 5 1 (total 13)
E: 10 ---- 1 3 (total 14)
F: 5 5 ---- 3 (total 13)
G: 7 3 1 ---- (total 11)
H: 10 1 10 1 8 ---- (total 30)
Column totals: 37 1 30 5 27 1
Roles: CO (Moderator), General’s Rep., Engineering Lead
F-formation analysis
“An F-formation arises when two or more people cooperate together to maintain a space between them to which they all have direct and exclusive [equal] access.” (A. Kendon 1977).
An F-formation is discovered from tracking gaze direction in a social group. It is not only about shared space. It reveals common ground and has an associated meaning. The cooperative property is crucial.
It is useful for detecting units of thematic content being jointly developed in a conversation.
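Given this definition, F-formation candidates can be sought by checking whether, over a window of coded gaze frames, two participants sustain either a shared target or mutual gaze. Below is a heuristic sketch on hypothetical frame data; the 80% overlap threshold is an arbitrary assumption, not a published criterion.

```python
def f_formation_candidates(frames, min_overlap=0.8):
    """Flag participant pairs whose gaze targets coincide (shared object)
    or point at each other (mutual) in at least `min_overlap` of frames.

    frames: list of dicts mapping participant -> gaze target per frame.
    """
    people = sorted({p for f in frames for p in f})
    pairs = {}
    for i, a in enumerate(people):
        for b in people[i + 1:]:
            hits = sum(
                1 for f in frames
                if a in f and b in f
                and (f[a] == f[b] or (f[a] == b and f[b] == a))
            )
            pairs[(a, b)] = hits / len(frames)
    return {p: v for p, v in pairs.items() if v >= min_overlap}

# Hypothetical window: F1 and M1 share the screen, then gaze mutually.
frames = [{"F1": "Screen", "M1": "Screen", "F2": "Notes"}] * 9 + \
         [{"F1": "M1", "M1": "F1", "F2": "Notes"}]
candidates = f_formation_candidates(frames)
```

This is only a detector of candidate configurations; the cooperative, jointly maintained character of a true F-formation still requires human judgment over the video.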
this week I believe
You're good
I don't know I was thinking about the sports story
That sports one's
You thinking Kobe Bryant
Click on sports right here
I'm thinking
on the left
Lance Armstrong
I prefer that story
I'm I'm a huge Actually I'm a huge Tour fan so
I think Lance Armstrong's
I was thinkin' Lance Armstrong
Yeah I just didn't like the bad news aspect of the
I'm sorry
Yeah the the
[Timeline tracks: F1, F2, M1, M2]
Gaze coding: F2@M2, M1@F1; F2-M1 mutual; F2@Screen, M1@F2, F2@M1, M1@Screen; F1@Screen; F1@M1; F1@F2, F2@F1; M1-F1 mutual
NIST-F-Formation Coding (76.11s–92.27s)
NIST-F-Formation Coding (92.27s–108.97s)
bike guy Lance Yeah Yeah I think so too
Yeah Lance Armstrong
Yeah Lance Armstrong Tour de France
Lance okay
Yeah
Where were you
Where did you wanna go Laura
Well there's sports right up there right near the metro
Oooo-Kaaaaay Yeah that's ...
No Never mind
[Laughs]
[Laughs]
Let's go to your page What is it you like CNN?
CNN
CNN
Yeah let's look there
Yeah you don't
[Timeline tracks: F1, F2, M1, M2]
Gaze coding: M1-F1 mutual; F2@Whiteboard; F2@Notes; F2@F1, F2@Screen; M1@F1, M1@Screen; F1@F2, F1@Screen, F1@M1, F1@Whiteboard; F1@M2, F1@Screen, F1@F2; M1+F1 shared
Summary
• Corpus collection based on sound scientific foundations
• Data includes audio, video, motion capture, speech transcription, and manual codings
• A suite of tools for visualizing and coding the co-temporal data has been developed
• Research results demonstrate multimodal discourse segmentation and meeting-dynamics analysis