Beyond Actions: Discriminative Models for Contextual Group Activities
Tian Lan
School of Computing Science, Simon Fraser University
M.Sc. Thesis Defense, August 12, 2010
Outline
• Introduction
• Group Activity Recognition with Context
– Structure-level (latent structures)
– Feature-level (Action Context descriptor)
• Experiments
Activity Recognition
• Goal: enable computers to analyze and understand human behavior.
[Figure: example actions "Answering a phone" and "Kissing".]
Action vs. Activity
• Activity: a group of people forming a queue
• Action: standing in a queue and facing left
Activity Recognition
• Activity recognition is important: surveillance, entertainment, sport, HCI
• Activity recognition is difficult: intra-class variation, background clutter, partial occlusion, etc.
Group Activity Recognition
• Motivation: human actions are rarely performed in isolation; the actions of individuals in a group can serve as context for each other.
• Goal: explore the benefit of contextual information for group activity recognition in challenging real-world applications.
Group Activity Recognition
Context
Group Activity Recognition
• Two types of context
– group-person interaction
– person-person interaction
Latent Structured Model
[Figure: graphical model with three layers. Activity: activity class y. Action (hidden layer): action classes h1, h2, …, hn. Feature: image features x1, x2, …, xn and x0.]
Latent Structured Model
[Figure: the same graphical model, annotated with the group-person interaction and the person-person interaction; the person-person interaction is handled at the structure level and at the feature level.]
Difference from Previous Work
• Group Activity Recognition
Previous work
– Single-person action recognition (Schuldt et al., ICPR 04)
– Relatively simple activity recognition (Vaswani et al., CVPR 03)
– Datasets in controlled conditions
Our work
– Group activity recognition in realistic videos
– Two new types of contextual information
– A unified framework
Difference from Previous Work
• Latent Structured Models
Previous work
– A pre-defined structure for the hidden layer, e.g. tree (HCRF) (Quattoni et al., PAMI 07; Felzenszwalb et al., CVPR 08)
Our work
– A latent structure for the hidden layer, inferred automatically during learning and inference
Structure-level Approach
[Figure: the latent structured model, with the person-person interaction labeled at the structure level and the feature level.]
Structure-level Approach
• Latent Structure
[Figure: example image with person labels "Queue ?", "Talk", "Talk".]
Model Formulation
[Figure: graphical model over y, h1, h2, …, hn, x1, x2, …, xn and x0.]
• Potentials: Image-Action, Action-Activity, Image-Activity, Action-Action
• Input: image-label pair (x, h, y)
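The four potential names on this slide suggest a score that sums one term per potential. The equation itself is not preserved in this transcript, so the following is only a sketch of one plausible form. The weight vectors w_0 to w_3, the feature maps φ_0 to φ_3, the edge set E of the hidden-layer graph G, and the dependence of the action-action term on y are all notation assumed here.

```latex
F_w(x, h, y; G) =
    w_0^\top \phi_0(x_0, y)                          % image-activity
  + \sum_{j=1}^{n} w_1^\top \phi_1(x_j, h_j)          % image-action
  + \sum_{j=1}^{n} w_2^\top \phi_2(h_j, y)            % action-activity
  + \sum_{(j,k) \in E} w_3^\top \phi_3(h_j, h_k, y)   % action-action
```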
Inference
• Score an image x with activity label y
• Infer the latent variables
• NP-hard!
Inference
• Holding Gy fixed, infer hy: loopy BP
• Holding hy fixed, infer Gy: ILP
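The alternation between the two steps can be illustrated on a toy problem. This is a minimal sketch under stated assumptions, not the thesis implementation. Exhaustive search stands in for loopy BP, and a sign test on each edge stands in for the ILP; the sign test is exact only because this sketch places no constraints on the graph. The names `unary`, `pair`, and `alternate` are hypothetical, and all scores are for one fixed activity label y.

```python
from itertools import combinations, product


def score(h, edges, unary, pair):
    """Total score of action labels h and edge set `edges` for one fixed y."""
    total = sum(unary[j][a] for j, a in enumerate(h))
    total += sum(pair[h[j]][h[k]] for j, k in edges)
    return total


def best_actions(edges, unary, pair):
    """Holding the graph fixed, pick the action labels.

    Brute force over all labelings; this stands in for loopy BP.
    """
    n, num_actions = len(unary), len(unary[0])
    return max(product(range(num_actions), repeat=n),
               key=lambda h: score(h, edges, unary, pair))


def best_graph(h, pair):
    """Holding the labels fixed, keep every edge with a positive pairwise score.

    This stands in for the ILP and assumes an unconstrained graph.
    """
    return {(j, k) for j, k in combinations(range(len(h)), 2)
            if pair[h[j]][h[k]] > 0}


def alternate(unary, pair, max_iters=10):
    """Alternate the two steps until the score stops improving."""
    edges = set()
    h = best_actions(edges, unary, pair)
    current = score(h, edges, unary, pair)
    for _ in range(max_iters):
        edges = best_graph(h, pair)
        h = best_actions(edges, unary, pair)
        new = score(h, edges, unary, pair)
        if new <= current:
            break
        current = new
    return h, edges, current


# Toy example: 3 people, 2 actions (0 = talk, 1 = queue); values are made up.
unary = [[1.0, 0.0], [0.25, 0.0], [0.0, 0.5]]
pair = [[0.5, -1.0], [-1.0, 0.5]]
h, edges, total = alternate(unary, pair)
print(h, sorted(edges), total)
```

As with any alternating scheme, the result is a local optimum and depends on the starting point.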
Learning with Latent SVM
Optimization: Non-convex bundle method (Do & Artieres, ICML 09)
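For reference, a latent SVM objective in its usual form is sketched below. The objective shown on the original slide is not preserved in this transcript, so the notation is assumed: F_w(x, h, y; G) is the model score, Δ is the loss between a predicted and a true activity label, C is the trade-off constant, and N is the number of training examples. The sketch also assumes that the action labels h^n are observed in training and that the graph G is the latent part.

```latex
\min_{w,\; \xi \ge 0} \quad
  \frac{1}{2} \lVert w \rVert^2 + C \sum_{n=1}^{N} \xi_n
\quad \text{s.t.} \quad
  \max_{G} F_w(x^n, h^n, y^n; G)
  \;-\; \max_{h,\, G'} F_w(x^n, h, y; G')
  \;\ge\; \Delta(y, y^n) - \xi_n
  \qquad \forall n,\ \forall y
```

The max over latent structures on the ground-truth side enters with a negative sign once the constraint is folded into the objective, which is what makes the problem non-convex and motivates a non-convex optimizer.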
Feature-level Approach
[Figure: the latent structured model, with the person-person interaction labeled at the structure level and the feature level.]
Feature-level Approach
• Model
[Figure: graphical model over y, h1, h2, …, hn, x1, x2, …, xn and x0, labeled "Action Context Descriptor".]
Action Context Descriptor
[Figure: panels (a), (b), (c). Labels: focal person, context, action + action, axes τ and z.]
Action Context Descriptor
• Feature descriptor, e.g. HOG by Dalal & Triggs
• Multi-class SVM
[Figure: action-class score vectors for several people, combined by a max operation into one action-class score vector.]
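The pipeline on this slide (per-person features, a multi-class SVM, then a max over action-class scores) can be sketched as follows. This is a simplified sketch under stated assumptions: the SVM scores are taken as given, a single context region is used, and the way the region is defined in space and time (the τ and z in the figure) is left out. The function name and the zero vector for an empty context are hypothetical choices.

```python
def action_context_descriptor(focal_scores, context_scores):
    """Concatenate the focal person's action scores with the element-wise
    max of the action scores of the people in the context region.

    focal_scores: list of per-action SVM scores for the focal person.
    context_scores: list of such score lists, one per nearby person.
    """
    num_actions = len(focal_scores)
    if context_scores:
        context = [max(scores[a] for scores in context_scores)
                   for a in range(num_actions)]
    else:
        # Assumption: an empty context region contributes zeros.
        context = [0.0] * num_actions
    return list(focal_scores) + context


# Made-up scores for 3 action classes.
focal = [0.9, 0.1, -0.3]
others = [[0.2, 0.8, -0.1], [0.4, -0.5, 0.6]]
descriptor = action_context_descriptor(focal, others)
print(descriptor)
```

The first half of the descriptor describes what the focal person is doing, and the second half summarizes what the surrounding people are doing.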
Dataset
• Collective Activity Dataset (Choi et al. VS 09)
• 5 action categories: crossing, waiting, queuing, walking, talking. (per person)
• 44 video clips
Collective Activity Dataset
Dataset
• Nursing Home Dataset
• Activity categories: fall, non-fall (per image)
• 5 action categories: walking, standing, sitting, bending and falling (per person)
• In total 22 video clips (2990 frames): 8 clips for testing, the rest for training. 1/3 are labeled as fall.
Nursing Home Dataset
Baselines
• root (x0) + SVM (no structure)
• No connection
• Min-spanning tree
• Complete graph within r
[Figure: hidden-layer graphs over h1, h2, h3, h4 for the no-connection, min-spanning-tree and complete-graph-within-r baselines, and for the structure-level approach.]
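The three fixed-structure baselines differ only in which edges connect the hidden nodes. A minimal sketch of how such edge sets could be built is shown below. It assumes each person is represented by a 2D position and that distances are Euclidean; the slides do not state which distance is used, and the function names are hypothetical.

```python
import math
from itertools import combinations


def no_connection(points):
    """Baseline with no edges between people."""
    return set()


def complete_within_r(points, r):
    """Connect every pair of people whose distance is at most r."""
    return {(j, k) for j, k in combinations(range(len(points)), 2)
            if math.dist(points[j], points[k]) <= r}


def min_spanning_tree(points):
    """Minimum spanning tree over all people (Prim's algorithm)."""
    n = len(points)
    if n == 0:
        return set()
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        # Cheapest edge from the tree to a person not yet in the tree.
        j, k = min(((a, b) for a in in_tree for b in range(n)
                    if b not in in_tree),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
        edges.add((min(j, k), max(j, k)))
        in_tree.add(k)
    return edges


# Made-up positions of four people along a line.
points = [(0, 0), (1, 0), (5, 0), (6, 0)]
mst = min_spanning_tree(points)
within = complete_within_r(points, 1.5)
print(sorted(mst), sorted(within))
```

In this example the spanning tree links the two distant pairs, while the radius graph leaves them as separate components.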
System Overview
• Pipeline: Video → Person Detector → Person Descriptor → Model
• Person detector: pedestrian detection by Felzenszwalb et al.; background subtraction
• Person descriptor: HOG by Dalal & Triggs; LST by Loy et al. (CVPR 09)
Results – Collective Activity Dataset
Results – Correct Examples
Results – Incorrect Examples
[Figure labels: Crossing, Waiting, Walking, Talking, Queuing]
Results – Nursing Home Dataset
Results – Correct Examples
Results – Incorrect Examples
Conclusion
• A discriminative model for group activity recognition with context.
• Two new types of contextual information:
– group-person interaction
– person-person interaction
  • Structure-level: latent structure
  • Feature-level: Action Context descriptor
• Experimental results demonstrate the effectiveness of the proposed model
Future Work
• Modeling Complex Structures
– Temporal dependencies among actions
• Contextual Feature Descriptors
– How to encode discriminative context?
• Weakly Supervised Learning
– e.g. multiple instance learning for fall detection
Thank you!
Pairwise Weight
[Figure: pairwise weights between y, hj and hk.]
Infer the graph structures
Results – Nursing Home Dataset
• 0/1 loss: optimizes overall accuracy
Results – Nursing Home Dataset
• New loss: optimizes mean per-class accuracy
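The difference between the two objectives matters when classes are imbalanced, as with the fall and non-fall labels. The definition of the new loss is not preserved in this transcript; the sketch below assumes one loss with the stated property, which weights each error by one over the number of examples in the true class. All function names are hypothetical.

```python
from collections import Counter


def zero_one_loss(pred, true):
    """Every mistake costs the same."""
    return 0.0 if pred == true else 1.0


def make_balanced_loss(labels):
    """Assumed form of a class-balanced loss: a mistake on class c costs
    1 / (number of examples of class c)."""
    counts = Counter(labels)

    def loss(pred, true):
        return 0.0 if pred == true else 1.0 / counts[true]

    return loss


def overall_accuracy(preds, trues):
    return sum(p == t for p, t in zip(preds, trues)) / len(trues)


def mean_per_class_accuracy(preds, trues):
    counts = Counter(trues)
    correct = Counter(t for p, t in zip(preds, trues) if p == t)
    return sum(correct[c] / counts[c] for c in counts) / len(counts)


# Made-up labels: one third "fall", and a classifier that never predicts it.
trues = ["fall"] * 2 + ["non-fall"] * 4
preds = ["non-fall"] * 6
balanced = make_balanced_loss(trues)
total_balanced = sum(balanced(p, t) for p, t in zip(preds, trues))
total_zero_one = sum(zero_one_loss(p, t) for p, t in zip(preds, trues))
overall = overall_accuracy(preds, trues)
per_class = mean_per_class_accuracy(preds, trues)
print(overall, per_class, total_zero_one, total_balanced)
```

Under this assumed loss, the summed loss divided by the number of classes equals one minus the mean per-class accuracy, so minimizing it targets that metric directly.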
Person Detectors
• Collective Activity Dataset
– Pedestrian Detector (Felzenszwalb et al., CVPR 08)
• Nursing Home Dataset
– Video → Background Subtraction → Moving Regions
Person Descriptors
• Collective Activity Dataset
– HOG
• Nursing Home Dataset
– Local Spatial Temporal (LST) Descriptor (Loy et al., ICCV 09)
Results – Correct Examples
Results – Incorrect Examples
Results – Collective Activity Dataset
[Figure panels: Root+SVM, Structure-level, Feature-level]
Group Context Descriptor
[Figure: graphical model over y, h1, h2, …, hn, x1, x2, …, xn and x0.]
Learning
• Training data consists of {xn,hn,yn}
[Figure panels: Structure-level, Feature-level, No connection]
Results – Nursing Home Dataset