【bmvc2016】recognition of transitional action for short-term action prediction using...

Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature

Hirokatsu Kataoka, Ph.D. Computer Vision Research Group (CVRG), AIST

http://www.hirokatsukataoka.net/

Yudai Miyashita (TDU), Masaki Hayashi (Liquid Inc., Keio Univ.)， Kenji Iwata, Yutaka Satoh (AIST)

Related work: Early Action Recognition

•  [Ryoo, ICCV2011]

M. S. Ryoo, “Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos”, International Conference on Computer Vision (ICCV), pp.1036-1043, 2011.

Related work: Action Prediction

•  [Kataoka+, VISAPP2016]

？？？Daytime (Time Zone)

Walking (Previous Activity)

Sitting (Current Activity)

??? (Next Activity)

xtimezone xprevious xcurrent

θ = “Using a PC”

Given Not givenTime series

H. Kataoka, Y. Aoki, K. Iwata, Y. Satoh, “Activity Prediction using a Space-Time CNN and Bayesian Framework”, in VISAPP, 2016.

Problem of related works

•  Early action recognition

–  Action recognition in an early frame of the action

–  Enough cue is required, so almost equals to action recognition

•  Action prediction

–  Complete future prediction in an unstable situation

Proposal

•  Transitional Action (TA): Action-class while an action is transitive

–  TA contains cue of prediction: Earlier than early action recognition

–  Recognition-like future action prediction: More stable prediction

[Applications] Autonomous driving, active safety and robotics

Δt

【Proposal】 Short-term action prediction recognize “cross” at time t5

【Previous works】 Early action recognition recognize “cross” at time t9

Walk straight (Action)

Cross (Action)

Walk straight – Cross (Transitional action)

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12

Problem settings

Framework Problem Action Recognition

Early Action Recognition

Action Prediction

Transitional Action Recognition

f (F1...tA )→ At

f (F1...t−LA )→ At

f (F1...tA )→ At+L

f (F1...tTA )→ At+L

Difference



Action Prediction


f (F1...tA )→ At

f (F1...t−LA )→ At

f (F1...tA )→ At+L



Cross (Action)

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12

f (F1...t−LA )→ At

A(cross)The objective action is -  Early action recognition is late response

Difference



Action Prediction


f (F1...tA )→ At

f (F1...t−LA )→ At

f (F1...tA )→ At+L



Cross (Action)

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12

f (F1...tA )→ At+L

A(cross)The objective action is -  Action prediction is unstable

Difference



Action Prediction


f (F1...tA )→ At

f (F1...t−LA )→ At

f (F1...tA )→ At+L



Cross (Action)

Walk straight – Cross (Transitional action)

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12

A(cross)The objective action is -  Transitional action recognition is reasonable


Details of transitional action (TA)

•  Annotation for TA

–  TA and normal action (NA) classes are partially overlapped each other

•  Difficulty of TA

–  Temporally mixed between NA and TA

Subtle Motion Descriptor (SMD)

•  A discriminative temporal CNN feature

–  To divide classes between NA and TA


•  Activation feature from VGG-16

–  Fully-connected layer (N = 4,096)

–  Based on pooled time series (PoT) [Ryoo+, CVPR2015]


•  Temporal difference ΔVt is calculated

–  (Frame t) – (Frame t-1)


•  Temporal pooling from ΔV t

–  Plus and minus

–  Zero-around values are pooled (→This is the contribution of SMD)

–  TH is experimentally fixed

Datasets

•  Temporal action datasets

–  NTSEL [Kataoka+, ITSC2015]

•  Walk (NA), cross (NA), bicycle (NA), turn (TA) with human bbox

–  UTKinect-Action [Xia+, CVPRW2012]

•  Ordered 10 NAs (e.g. walk, throw, sit)

•  8 TAs (excluding push/pull; next page)

•  Without human bbox

–  Watch-n-Patch [Wu+, CVPR2015]

•  Daily 10 NAs (e.g. read, turn on monitor, leave office)

•  Top frequent 10 TAs (next page)

•  Without human bbox

Experimental settings (list of TAs)

•  @UTKinect-Action @Watch-n-Patch

Implements

•  Action recognition appraoches

–  Temporal CNN models

•  Pooled Time-series (PoT) [Ryoo+, CVPR2015]

•  CNN accumulation

•  CNN + IDT [Jain+, ECCVW2014]

–  Improved dense trajectories (IDT) and with improved features

•  IDT [Wang+, ICCV2013]

•  IDT + cooccurrence-feature [Kataoka+, ACCV2014]

•  All Features in IDT

Exploration experiment

•  Parameters

–  Frame accumulation

–  Thresholding value TH

–  Layer fc6 vs fc7


•  Temporal accumulation [frames]

–  Faster prediction: 3 [frames] (0.1s)

–  Toward state-of-the-art: 10 [frames] (0.33s)

–  Baseline should be 3 and 10 frames accumulation


•  Thresholding value

–  Depending on data


•  Layer fc6 vs fc7

–  Layer fc6 is better

Results

•  SMD (ours) is state-of-the-art in transitional action recognition

Comparison of PoT

•  Subtle motion is effective for transitional action recognition

–  NTSEL: +2.18%, +8.63%

–  UTKinect: +7.19%, +4.31%

–  Watch-n-Patch: +4.82%, +5.12%

Conclulsion

•  Two contribusions:

1.  Definition of transitional action for short-term action prediction

2.  Subtle Motion Descriptor (SMD) to classify transitional and normal actions

【bmvc2016】recognition of transitional action for short-term action prediction using...

Science