We acknowledge support from:
• NSF-STIMULATE program, Grant No. IRI-9618887, “Gesture, Speech, and Gaze in Discourse Segmentation”
• NSF-KDI program, Grant No. BCS-9980054, “Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Tools for Gesture, Speech, and Gaze Research”
• NSF-ITR program, Grant No. IIS-0219875, “Beyond the Talking Head and Animated Icon: Behaviorally Situated Avatars for Tutoring”
• ARDA-VACE II program, “From Video to Information: Cross-Modal Analysis of Planning Meetings”
VACE Multimodal Meeting Corpus
Lei Chen, Travis Rose, Fey Parrill, Xu Han, Jilin Tu, Zhongqian Huang, Mary Harper, Francis Quek, David McNeill, Ronald Tuttle, and Thomas Huang
Francis Quek
Professor of Computer Science
Director, Center for Human Computer Interaction
Virginia Tech
Corpus Rationale
• A quest for meaning: embodied cognition and language production drives our research
• Analysis of ‘natural’ human-human meetings
• Resource in support of research in:
  – Multimodal language analysis
  – Speech recognition and analysis
  – Vision-based communicative behavior analysis
Why Multimodal Language Analysis?
S1 you know like those ¿fireworks?
S2 well if we're trying to drive'em / out her<r>e # we need to put'em up her<r>e
S1 yeah well what I'm saying is we should*
S2 in front
S1 we should do it* we should make it a lin<n>e through the room<m>s / so that they explode like here then here then here then here
Multimodal Language Example
[Embedded video clip]
Embodied Communicative Behavior
• Constructed dynamically at the moment of speaking (thinking for speaking)
• Dependent on cultural, personal, social, and cognitive differences
• Speaker is often unaware of his or her gestures
• Reveals the contrastive foci of the language stream (Hajičová, Halliday et al.)
• Is co-expressive (co-temporal) with speech
• Is multiply determined
• Temporal synchrony is critical for analysis
In a Nutshell
Gesture/Speech Framework (McNeill 1992, 2000, 2001; Quek et al. 1999–2003)
[Framework diagram relating: discourse production, mental imagery, embodied imagery, video access, computable ‘image-bearing’ features, and inference over imagistic gesture features and their coherence with conceptual discourse units]
ARDA/VACE Program
• ARDA is to the intelligence community what DARPA is to the military
• Interest is in the exploitation of video data (Video Analysis and Content Exploitation)
• A key VACE challenge: meeting analysis
• Our key theme: multimodal communication analysis
From Video to Information: Cross-Modal Analysis for Planning Meetings
Team
Multimodal Meeting Analysis: A Cross-Disciplinary Enterprise
Overarching Approach
• Coordinated multidisciplinary research
• Corpus assembly
  – Data is transcribed and coded for relevant speech/language structure
  – War-gaming (planning) scenarios are captured to provide real planning behavior in a controlled experimental context (reducing many ‘unknowns’)
  – Meeting room is multiply instrumented with cross-calibrated video, synchronized audio/video, and motion tracking
  – All data components are time-aligned across the dataset
• Multimodal video processing research
  – Research on posture, head position/orientation, gesture tracking, hand-shape recognition, and multimodal integration
• Research in tools for analysis, coding, and interpretation
• Speech analysis research in support of multimodality
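Time-aligning heterogeneous streams (video frames, audio, motion capture) to one reference clock is central to the approach above. Below is a minimal sketch of resampling one stream onto another's timeline by linear interpolation; the sampling rates and signal are illustrative, not the corpus's actual streams.

```python
import numpy as np

def align_to_timeline(t_src, values, t_ref):
    """Resample a 1-D signal onto a reference timeline by linear interpolation."""
    return np.interp(t_ref, t_src, values)

# Illustrative rates: ~29.97 fps video, 120 Hz motion capture.
t_video = np.arange(0, 1.0, 1 / 29.97)
t_mocap = np.arange(0, 1.0, 1 / 120.0)
x_mocap = np.sin(2 * np.pi * t_mocap)        # stand-in for one marker coordinate
x_on_video = align_to_timeline(t_mocap, x_mocap, t_video)
```

In practice each stream keeps its native rate and only analysis queries are interpolated, so no recording fidelity is lost.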
Scenarios
• Each scenario has five participants
• Roles tailored to available participant expertise
• Five initial scenarios:
  – Delta II Rocket Launch
  – Foreign Material Exploitation
  – Intervention to Support a Democratic Movement
  – Humanitarian Assistance
  – Scholarship Selection
Scenarios (cont’d)
Planned scenarios (to be developed):
• Lost Aircraft Crisis Response
• Hostage Rescue
• Downed Pilot Search & Rescue
• Bomb Shelter Design
Scenario Development
Humanitarian Assistance Walkthrough
• Purpose: develop a plan for immediate military support to the Dec 04 Asian tsunami victims
• Considerable open-source information from the Internet for scenario development
• Roles: Medical Officer, Task Force Commander, Intel Officer, Operations Officer, Weather Officer
• Mission goals & priorities provided for each role
[Images: before tsunami / after tsunami]
“As intelligence officer, your role is to provide intelligence support to OPERATION UNIFIED ASSISTANCE. While the extent of damage is still unknown, early reporting indicates that coastal areas throughout South Asia have been affected. Communications have been lost with entire towns. Currently, the only means of determining the magnitude of destruction is from overhead assets. Data from the South Asia and Sri Lanka region has already been received from civilian remote sensing satellites. Although the US military will be operating in the region on a strictly humanitarian mission, the threat still exists of hostile action to US personnel by terrorist factions opposed to the US. As intel officer, you are responsible for briefing the nature of the terrorist threat in the region.”
Meulaboh, Indonesia
Corpus Assembly
Multi-modal Elicitation Experiment
Time-Aligned Multimedia Transcription
Video Processing: 10-camera calibration, vector extraction, hand tracking, gaze tracking, head modeling, head tracking, body tracking
Motion Capture Interpretation
Speech & Psycholinguistic Coding: speech transcription, psycholinguistic coding
Speech & Audio Processing: automatic transcript word/syllable alignment to audio, audio feature extraction
10-camera video & digital audio capture
3D Vicon Extraction
Data Acquisition & Processing
Meeting Room and Camera Configuration
[Room layout diagram: participants A–H seated around the table, covered by ten cameras]

Participant → cameras in which visible:
A: 4, 6, 8
B: 6, 7, 8, 10
C: 7, 10
D: 1, 7, 9, 10
E: 4, 6, 10
F: 1, 2, 3, 5
G: 2, 5
H: 2, 4, 5, 6

Camera → participants in view:
C1: D, E, F
C2: H, G, F
C3: F, E
C4: H, A
C5: F, G, H
C6: B, A, H
C7: D, C, B
C8: B, A
C9: D, E
C10: B, C, D

Stereo camera pairs (T1, T2):
Pair 1: C9–C3    Pair 7:  C7–C10
Pair 2: C1–C3    Pair 8:  C2–C5
Pair 3: C9–C1    Pair 9:  C2–C4
Pair 4: C4–C8    Pair 10: C3–C5
Pair 5: C4–C6    Pair 11: C7–C9
Pair 6: C6–C8    Pair 12: C8–C10
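Each stereo pair in the table supports 3-D reconstruction of tracked points visible to both cameras. Below is a minimal sketch of linear (DLT) two-view triangulation; the projection matrices and image points are synthetic stand-ins, not the room's actual calibration.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3-D point from a calibrated stereo pair.

    P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) image
    coordinates of the same point in each view.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null vector of A = homogeneous solution
    X = Vt[-1]
    return X[:3] / X[3]              # dehomogenize

# Synthetic check: two hypothetical cameras with a 1-unit baseline.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
Xw = np.array([0.2, -0.1, 4.0, 1.0])           # known world point
x1 = (P1 @ Xw)[:2] / (P1 @ Xw)[2]
x2 = (P2 @ Xw)[:2] / (P2 @ Xw)[2]
Xr = triangulate(P1, P2, x1, x2)
```

With noisy detections one would normalize image coordinates and refine the linear estimate, but the pair tables above determine which two views feed this computation for each participant position.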
Cam1 view:
• 48 calibration dots for camera calibration
• 18 Vicon markers for coordinate-system transformation: Y = RX + T
Global & Pairwise Camera Calibration
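The transformation Y = RX + T maps marker coordinates between the camera-calibration frame and the Vicon frame. One standard way to estimate R and T from corresponding point sets is the SVD-based (Kabsch) fit sketched below; the 18 "marker" coordinates here are synthetic, not the room's actual data.

```python
import numpy as np

def fit_rigid_transform(X, Y):
    """Estimate rotation R and translation T such that Y ≈ R @ x + T per point.

    X, Y: (N, 3) arrays of corresponding 3-D points (e.g. the same markers
    measured in the camera frame and in the Vicon frame).
    """
    cx, cy = X.mean(axis=0), Y.mean(axis=0)
    H = (X - cx).T @ (Y - cy)                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T        # proper rotation, det = +1
    T = cy - R @ cx
    return R, T

# Synthetic check: recover a known rotation and translation.
rng = np.random.default_rng(0)
X = rng.normal(size=(18, 3))                       # 18 "markers"
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
T_true = np.array([10.0, -2.0, 5.0])
Y = X @ R_true.T + T_true
R, T = fit_rigid_transform(X, Y)
```

The determinant guard matters: with noisy or near-planar marker sets a plain SVD solution can return a reflection instead of a rotation.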
Calibration error distributions over the meeting-room area (camera pairs 5–12):

Direction   Minimum     Mean        Maximum
X           0.4000 mm   0.4755 mm   0.5886 mm
Y           0.3077 mm   0.4529 mm   0.6925 mm
Z           0.3804 mm   0.4317 mm   0.5064 mm
VICON Motion Capture
• Motion-capture technology: near-IR cameras, retro-reflective markers, Datastation + PC workstation
• Vicon modes of operation: individual points (as seen in calibration), kinematic models, individual objects
VICON Motion Capture
Learning about MoCap:
• 11/03: initial installation
• 6/04: pilot scenario, using kinematic models
• 10/04: follow-up training using object models
• 11/04: rehearsed using Vicon with object models
• 1/05: data captured for FME scenario
Export position and orientation information for each participant’s head, hands, and body
Post-processing of motion-capture data: ~1 hour per minute of data for a 5-participant meeting
Incorporating MoCap into the workflow:
• Labeling of point clusters is labor-intensive
• 3 work-study students @ 20 hours/wk ≈ 60 minutes of data (1 dataset) per week
Speech Processing Tasks
• Formulate an audio workflow to support the efficient and effective construction of a large, high-quality multimodal corpus
• Implement support tools to achieve this goal
• Package time-aligned word transcriptions into appropriate data formats that can be efficiently shared and used
Audio Processing
Pipeline: Audio Recording & Meeting Metadata Annotation → Audio Segmentation → Manual Transcription → Forced Alignment → OOV Word Resolution (audio, segmentation, and transcription flow between the stages)
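The pipeline's end product is a set of time-aligned word transcriptions packaged for sharing. Below is a minimal sketch of one possible serialization; the record fields and JSON layout are illustrative assumptions, not the corpus's actual schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AlignedWord:
    speaker: str
    word: str
    start: float   # seconds from session start
    end: float

def package(words, path):
    """Write time-aligned words, sorted by onset, as JSON."""
    records = [asdict(w) for w in sorted(words, key=lambda w: w.start)]
    with open(path, "w") as f:
        json.dump(records, f, indent=2)

# Hypothetical alignment output for one speaker.
words = [
    AlignedWord("E", "intelligence", 12.48, 13.11),
    AlignedWord("E", "the", 12.31, 12.48),
]
package(words, "session.json")
```

Sorting by onset lets downstream tools merge the word stream with video and motion-capture tracks by timestamp alone.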
Corpus Integration
VACE Metadata Approach
Data Collection Status
• Pilot: June 04
  – Low audio volume; sound mixer purchased
  – Video frame drop-out; high-grade DV tapes purchased
• AFIT 02-07-05: Democratic movement assistance (two sessions)
  – Audio clipping in close-in mics; may be able to salvage data using the desktop mics
• AFIT 02-24-05: Humanitarian Assistance (Tsunami)
• AFIT 03-04-05: Humanitarian Assistance (Tsunami)
• AFIT 03-18-05: Scholarship selection
• AFIT 04-08-05: Humanitarian Assistance (Tsunami)
• AFIT 04-08-05: Card Game
• AFIT 04-25-05: Problem Solving Task (cause of deterioration of the Lincoln Memorial)
• AFIT 06-??-05: Problem Solving Task
Some Multimodal Meeting Room Results
F2 & F1 Lance Armstrong Episode
[Embedded video clip]
NIST Microcorpus, July 29, 2003. Meeting Dynamics: F1 vs F2
M1 [do you wanna pull up a site here to]
  gaze/orientation (head; instrumental): F1, M1, and M2 gaze at the screen off camera, or might be looking at F2, who is now off camera walking toward the board
00:00:46:11 F2 [washington post]
  gaze/orientation (head; interactive): F1 directs gaze at F2
  action: F1 twists in chair toward F2
M1 [/]  F2 [/]
  gaze/orientation (head; interactive): M1 and F1 gaze directed at F2; M2 gaze remains directed at screen
F1 yeah what do you what do you like]
00:00:47:14 F1 yeah washington
  action: F2 facing whiteboard
M2 [we should put them in and just vote]
  gaze/orientation (head; interactive): M2 turns head to direct gaze to M1 (?) NB: it is possible that M2’s gaze is following F2, who is walking around the table at this point, up until M2 begins speaking
M1 [/]  F2 [/]
  gaze/orientation (head; interactive): F1 and M1 gaze toward M2
M2 %chuckle
M1 vote [<uhh>]
  M1 turns head left to direct gaze at screen (possibly)
M2 we all follow the news right
  action: M2 pulls back in chair away from table
F2 [/]
  gaze/orientation (head; instrumental): F1 gaze directed down at paper on table
  action: F1 tears off a piece of paper from her pad on the table
Gaze direction tracks social patterns (interactive gaze) and engagement of objects (instrumental gaze), which may be a form of pointing as well as perception
Interactive gaze occurrences, 5-minute sample (rows: gaze source; columns: gaze target):

        F2   F1   M1   M2    Σ
F2       –    –    1    2    3
F1       3    –    4    3   10
M1       6    1    –    2    9
M2       3    –    2    –    5
Σ       12    1    7    7
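Tables like this can be generated mechanically from a coded event log. Below is a minimal sketch that tabulates a hypothetical list of (gazer, gazee) events, not the actual NIST coding.

```python
from collections import Counter

def gaze_matrix(events, people):
    """Tabulate interactive gaze: rows = gazer, columns = gazee, plus margins."""
    counts = Counter(events)
    table = {s: {t: counts[(s, t)] for t in people} for s in people}
    row_sums = {s: sum(table[s].values()) for s in people}
    col_sums = {t: sum(table[s][t] for s in people) for t in people}
    return table, row_sums, col_sums

# Hypothetical 5-minute event log of (gazer, gazee) occurrences.
events = [("F1", "M1")] * 4 + [("F1", "F2")] * 3 + [("M1", "F2")] * 6
table, rows, cols = gaze_matrix(events, ["F1", "F2", "M1", "M2"])
```

The row margins show who initiates gaze most; the column margins show who attracts it, which is the asymmetry the analysis above exploits.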
Instrumental gaze
Gaze - NIST July 29, 2003 Data
Gaze: AFIT data (rows: gazer; columns: gazee B C D E F G H; ‘----’ marks self):
B: ---- 1 3 1 3 (total 8)
C: ---- 6 1 5 (total 12)
D: 4 ---- 3 5 1 (total 13)
E: 10 ---- 1 3 (total 14)
F: 5 5 ---- 3 (total 13)
G: 7 3 1 ---- (total 11)
H: 10 1 10 1 8 ---- (total 30)
Column totals: 37 1 30 5 27 1
Roles: CO (Moderator), General’s Rep., Engineering Lead
F-formation analysis
“An F-formation arises when two or more people cooperate together to maintain a space between them to which they all have direct and exclusive [equal] access.” (A. Kendon 1977).
An F-formation is discovered from tracking gaze direction in a social group. It is not only about shared space. It reveals common ground and has an associated meaning. The cooperative property is crucial.
It is useful for detecting units of thematic content being jointly developed in a conversation.
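Given this definition, F-formation candidates can be sought by checking whether, over a window of coded gaze frames, two participants sustain either a shared target or mutual gaze. Below is a heuristic sketch on hypothetical frame data; the 80% overlap threshold is an arbitrary assumption, not a published criterion.

```python
def f_formation_candidates(frames, min_overlap=0.8):
    """Flag participant pairs whose gaze targets coincide (shared object)
    or point at each other (mutual) in at least `min_overlap` of frames.

    frames: list of dicts mapping participant -> gaze target per frame.
    """
    people = sorted({p for f in frames for p in f})
    pairs = {}
    for i, a in enumerate(people):
        for b in people[i + 1:]:
            hits = sum(
                1 for f in frames
                if a in f and b in f
                and (f[a] == f[b] or (f[a] == b and f[b] == a))
            )
            pairs[(a, b)] = hits / len(frames)
    return {p: v for p, v in pairs.items() if v >= min_overlap}

# Hypothetical window: F1 and M1 share the screen, then gaze mutually.
frames = [{"F1": "Screen", "M1": "Screen", "F2": "Notes"}] * 9 + \
         [{"F1": "M1", "M1": "F1", "F2": "Notes"}]
candidates = f_formation_candidates(frames)
```

This is only a detector of candidate configurations; the cooperative, jointly maintained character of a true F-formation still requires human judgment over the video.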
this week I believe
You're good
I don't know I was thinking about the sports story
That sports one's
You thinking Kobe Bryant
Click on sports right here
I'm thinking
on the left
Lance Armstrong
I prefer that story
I'm I'm a huge Actually I'm a huge Tour fan so
I think Lance Armstrong's
I was thinkin' Lance Armstrong
Yeah I just didn't like the bad news aspect of the
I'm sorry
Yeah the the
[Timeline tracks: F1, F2, M1, M2]
Gaze coding: F2@M2, M1@F1; F2-M1 mutual; F2@Screen, M1@F2, F2@M1, M1@Screen; F1@Screen; F1@M1; F1@F2, F2@F1; M1-F1 mutual
NIST-F-Formation Coding (76.11s–92.27s)
NIST-F-Formation Coding (92.27s–108.97s)
bike guy Lance Yeah Yeah I think so too
Yeah Lance Armstrong
Yeah Lance Armstrong Tour de France
Lance okay
Yeah
Where were you
Where did you wanna go Laura
Well there's sports right up there right near the metro
Oooo-Kaaaaay Yeah that's ...
No Never mind
[Laughs]
[Laughs]
Let's go to your page What is it you like CNN?
CNN
CNN
Yeah let's look there
Yeah you don't
[Timeline tracks: F1, F2, M1, M2]
Gaze coding: M1-F1 mutual; F2@Whiteboard; F2@Notes; F2@F1, F2@Screen; M1@F1, M1@Screen; F1@F2, F1@Screen, F1@M1, F1@Whiteboard; F1@M2, F1@Screen, F1@F2; M1+F1 shared
Summary
• Corpus collection based on sound scientific foundations
• Data includes audio, video, motion capture, speech transcription, and manual codings
• A suite of tools for visualizing and coding the co-temporal data has been developed
• Research results demonstrate multimodal discourse segmentation and meeting-dynamics analysis