【bmvc2016】recognition of transitional action for short-term action prediction using...
TRANSCRIPT
Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature
Hirokatsu Kataoka, Ph.D. Computer Vision Research Group (CVRG), AIST
http://www.hirokatsukataoka.net/
Yudai Miyashita (TDU), Masaki Hayashi (Liquid Inc., Keio Univ.), Kenji Iwata, Yutaka Satoh (AIST)
Related work: Early Action Recognition
• [Ryoo, ICCV2011]
M. S. Ryoo, “Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos”, International Conference on Computer Vision (ICCV), pp.1036-1043, 2011.
Related work: Action Prediction
• [Kataoka+, VISAPP2016]
???Daytime (Time Zone)
Walking (Previous Activity)
Sitting (Current Activity)
??? (Next Activity)
xtimezone xprevious xcurrent
θ = “Using a PC”
Given Not givenTime series
H. Kataoka, Y. Aoki, K. Iwata, Y. Satoh, “Activity Prediction using a Space-Time CNN and Bayesian Framework”, in VISAPP, 2016.
Problem of related works
• Early action recognition
– Action recognition in an early frame of the action
– Enough cue is required, so almost equals to action recognition
• Action prediction
– Complete future prediction in an unstable situation
Proposal
• Transitional Action (TA): Action-class while an action is transitive
– TA contains cue of prediction: Earlier than early action recognition
– Recognition-like future action prediction: More stable prediction
[Applications] Autonomous driving, active safety and robotics
Δt
【Proposal】 Short-term action prediction recognize “cross” at time t5
【Previous works】 Early action recognition recognize “cross” at time t9
Walk straight (Action)
Cross (Action)
Walk straight – Cross (Transitional action)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12
Problem settings
Framework Problem Action Recognition
Early Action Recognition
Action Prediction
Transitional Action Recognition
f (F1...tA )→ At
f (F1...t−LA )→ At
f (F1...tA )→ At+L
f (F1...tTA )→ At+L
Difference
Framework Problem Action Recognition
Early Action Recognition
Action Prediction
Transitional Action Recognition
f (F1...tA )→ At
f (F1...t−LA )→ At
f (F1...tA )→ At+L
f (F1...tTA )→ At+L
Walk straight (Action)
Cross (Action)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12
f (F1...t−LA )→ At
A(cross)The objective action is - Early action recognition is late response
Difference
Framework Problem Action Recognition
Early Action Recognition
Action Prediction
Transitional Action Recognition
f (F1...tA )→ At
f (F1...t−LA )→ At
f (F1...tA )→ At+L
f (F1...tTA )→ At+L
Walk straight (Action)
Cross (Action)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12
f (F1...tA )→ At+L
A(cross)The objective action is - Action prediction is unstable
Difference
Framework Problem Action Recognition
Early Action Recognition
Action Prediction
Transitional Action Recognition
f (F1...tA )→ At
f (F1...t−LA )→ At
f (F1...tA )→ At+L
f (F1...tTA )→ At+L
Walk straight (Action)
Cross (Action)
Walk straight – Cross (Transitional action)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12
A(cross)The objective action is - Transitional action recognition is reasonable
f (F1...tTA )→ At+L
Details of transitional action (TA)
• Annotation for TA
– TA and normal action (NA) classes are partially overlapped each other
• Difficulty of TA
– Temporally mixed between NA and TA
Subtle Motion Descriptor (SMD)
• A discriminative temporal CNN feature
– To divide classes between NA and TA
Subtle Motion Descriptor (SMD)
• Activation feature from VGG-16
– Fully-connected layer (N = 4,096)
– Based on pooled time series (PoT) [Ryoo+, CVPR2015]
Subtle Motion Descriptor (SMD)
• Temporal difference ΔVt is calculated
– (Frame t) – (Frame t-1)
Subtle Motion Descriptor (SMD)
• Temporal pooling from ΔV t
– Plus and minus
– Zero-around values are pooled (→This is the contribution of SMD)
– TH is experimentally fixed
Datasets
• Temporal action datasets
– NTSEL [Kataoka+, ITSC2015]
• Walk (NA), cross (NA), bicycle (NA), turn (TA) with human bbox
– UTKinect-Action [Xia+, CVPRW2012]
• Ordered 10 NAs (e.g. walk, throw, sit)
• 8 TAs (excluding push/pull; next page)
• Without human bbox
– Watch-n-Patch [Wu+, CVPR2015]
• Daily 10 NAs (e.g. read, turn on monitor, leave office)
• Top frequent 10 TAs (next page)
• Without human bbox
Experimental settings (list of TAs)
• @UTKinect-Action @Watch-n-Patch
Implements
• Action recognition appraoches
– Temporal CNN models
• Pooled Time-series (PoT) [Ryoo+, CVPR2015]
• CNN accumulation
• CNN + IDT [Jain+, ECCVW2014]
– Improved dense trajectories (IDT) and with improved features
• IDT [Wang+, ICCV2013]
• IDT + cooccurrence-feature [Kataoka+, ACCV2014]
• All Features in IDT
Exploration experiment
• Parameters
– Frame accumulation
– Thresholding value TH
– Layer fc6 vs fc7
Exploration experiment
• Temporal accumulation [frames]
– Faster prediction: 3 [frames] (0.1s)
– Toward state-of-the-art: 10 [frames] (0.33s)
– Baseline should be 3 and 10 frames accumulation
Exploration experiment
• Thresholding value
– Depending on data
Exploration experiment
• Layer fc6 vs fc7
– Layer fc6 is better
Results
• SMD (ours) is state-of-the-art in transitional action recognition
Comparison of PoT
• Subtle motion is effective for transitional action recognition
– NTSEL: +2.18%, +8.63%
– UTKinect: +7.19%, +4.31%
– Watch-n-Patch: +4.82%, +5.12%
Conclulsion
• Two contribusions:
1. Definition of transitional action for short-term action prediction
2. Subtle Motion Descriptor (SMD) to classify transitional and normal actions