
Predicting the Future: A Jointly Learnt Model for Action Anticipation

Harshala Gammulle, Simon Denman, Sridha Sridharan, Clinton Fookes
Image and Video Research Lab, SAIVT, Queensland University of Technology (QUT), Australia

{pranali.gammule, s.denman, s.sridharan, c.fookes}@qut.edu.au

Abstract

Inspired by human neurological structures for action anticipation, we present an action anticipation model that enables the prediction of plausible future actions by forecasting both the visual and temporal future. In contrast to current state-of-the-art methods, which first learn a model to predict future video features and then perform action anticipation using these features, the proposed framework jointly learns to perform the two tasks, future visual and temporal representation synthesis, and early action anticipation. The joint learning framework ensures that the predicted future embeddings are informative to the action anticipation task. Furthermore, through extensive experimental evaluations we demonstrate the utility of using both visual and temporal semantics of the scene, and illustrate how this representation synthesis could be achieved through a recurrent Generative Adversarial Network (GAN) framework. Our model outperforms the current state-of-the-art methods on multiple datasets: UCF101, UCF101-24, UT-Interaction and TV Human Interaction. 1

1. Introduction

We propose an action anticipation model that uses visual and temporal data to predict future behaviour, while also predicting a frame-wise future representation to support the learning. Unlike action recognition, where recognition is carried out after the event by observing the full video sequence (Fig. 1(a)), the aim of action anticipation (Fig. 1(b)) is to predict the future action as early as possible by observing only a portion of the action [3]. Therefore, for the prediction we only have partial information in the form of a small number of frames, so the available information is scarce. Fig. 1(c) shows the intuition behind our proposed model. The action anticipation task is accomplished by jointly learning to predict the future embeddings (both visual and temporal) along with the action anticipation task, where the anticipation task provides cues to help compensate for the missing information from the unobserved frame features. We demonstrate that joint learning of the two tasks complements each other.

1 This research was supported by an Australian Research Council (ARC) Linkage grant LP140100221.

Figure 1. Action anticipation through future embedding prediction. Action recognition approaches (a) carry out the recognition task via fully observed video sequences, while typical action anticipation methods (b) predict the action from a small portion of the frames. In our proposed model (c) we jointly learn the future frame embeddings to support the anticipation task.

This approach is inspired by recent theories of how humans achieve their action-predictive ability. Recent psychology literature has shown that humans build a mental image of the future, including future actions and interactions (such as interactions between objects), before initiating muscle movements or motor controls [10, 17, 31]. These representations capture both the visual and temporal information of the expected future. Mimicking this biological process, our action anticipation method jointly learns to anticipate future scene representations while predicting the future action, and outperforms current state-of-the-art methods.

In contrast to recent works [3, 45, 50] which rely solely on visual inputs, and inspired by [10, 17, 31], we propose a joint learning process which attends to salient components of both visual and temporal streams, and builds a highly informative context descriptor for future action anticipation. In [50] the authors demonstrate that context semantics, which capture high-level action-related concepts including environmental details, objects, and historical actions and interactions, are more important when anticipating actions than the actual pixel values of future frames. Furthermore, the semantics captured through pre-trained deep learning models show robustness to background and illumination changes as they tend to capture the overall meaning of the input frame rather than simply using pixel values [8, 26]. Hence, in the proposed architecture we extract deep visual and temporal representations from the input streams and predict the future representations of those streams.

Motivated by recent advances in Generative Adversarial Networks (GANs) [1, 16, 33] and their ability to automatically learn a task-specific loss function, we employ a GAN learning framework in our approach as it provides the capability to predict a plausible future action sequence.

Although there exist individual GAN models for anticipation [32, 56], we take a step further in this work. The main contribution is the joint learning of a context descriptor for two tasks, action anticipation and representation prediction, through the joint training of two GANs.

Fig. 2 shows the architecture of our proposed Action Anticipation GAN (AA-GAN) model. The model receives the video frames and optical flow streams as the visual and temporal representations of the scene. We extract a semantic representation of the individual streams by passing them through a pre-trained feature extractor, and fuse them through an attention mechanism. This allows us to provide a varying level of focus to each stream and effectively embed the vital components for different action categories. Through this process, low-level feature representations are mapped to a high-level context descriptor which is then used by both the future representation synthesis and classification procedures. By coupling the GANs (visual and temporal synthesisers) through a common context descriptor, we optimally utilise all available information and learn a descriptor which better describes the given scene.

Our main contributions are as follows:

• We propose a joint learning framework for early action anticipation and synthesis of the future representations.

• We demonstrate how attention can efficiently determine the salient components from the multi-modal information, and generate a single context descriptor which is informative for both tasks.

• We introduce a novel regularisation method based on the exponential cosine distance, which effectively guides the generator networks in the prediction task.

• We perform evaluations on several challenging datasets, and through a thorough ablation study, demonstrate the relative importance of each component of the proposed model.

2. Previous Work

Human action recognition is an active research area that has great importance in multiple domains [7, 9, 23]. Since the inception of the field, researchers have focused on improving the applicability of methods to real-world scenarios. The aim of early works was to develop discrete action recognition methods using image [6, 25] or video inputs [15, 22, 46], and these have been extended to detect actions in fine-grained videos [29, 37]. Although these methods have shown impressive performance, they are still limited for real-world applications as they rely on fully completed action sequences. This motivates the development of action anticipation methods, which can accurately predict future actions utilising a limited number of early frames, thereby providing the ability to predict actions that are in progress.

In [50], a deep network is proposed to predict a representation of the future, and the predicted representation is used to classify future actions. However, [50] requires the progress level of the ongoing action to be provided during testing, limiting applicability [19]. Hu et al. [19] introduced a soft regression framework to predict ongoing actions; this method learns soft labels for regression on subsequences containing partial action executions. Lee et al. [30] proposed a human activity representation method, termed the sub-volume co-occurrence matrix, and developed a method to predict partially observed actions with the aid of a pre-trained CNN. The deep network approach of Aliakbarian et al. [3] used a multi-stage LSTM architecture that incorporates context-aware and action-aware features to predict classes as early as possible. The CNN-based action anticipation model of [40] predicts the most plausible future motion, and was improved via an effective loss function based on dynamic and classification losses. The dynamic loss is obtained through a dynamic image generator trained to generate class-specific dynamic images. However, performance is limited due to the hand-crafted loss function. A GAN-based model can overcome this limitation as it can automatically learn a loss function, and GANs have shown promising performance in recent research [35, 38, 54].

In our work we utilise a conditional GAN [12, 13, 36] for deep future representation generation.


Figure 2. Action Anticipation GAN (AA-GAN): The model receives RGB and optical flow streams as the visual and temporal representations of the given scene. Rather than utilising the raw streams, we extract the semantic representation of the individual streams by passing them through a pre-trained feature extractor. These streams are merged via an attention mechanism which embeds these low-level feature representations in a high-level context descriptor. This context representation is utilised by two GANs, one for future visual representation synthesis and one for future temporal representation synthesis, and the anticipated future action is obtained by utilising the context descriptor. Hence, context descriptor learning is influenced by both the future representation prediction and the action anticipation task.

A limited number of GAN approaches can be found for human action recognition [1, 33]. In [33], a GAN is used to generate masks to detect the actors in the input frame, and action classification is done via a CNN. This method is prone to difficulties with the loss function, as noted previously. Considering other GAN methods, [32, 56] require human skeletal data which is not readily available; [56] only synthesises the future skeletal representation; and [55] considers the task of synthesising future gaze points using a single generator and discriminator pair, directly extracting spatio-temporal features from a 3D CNN. In contrast to these, we analyse two modes and utilise an attention mechanism to embed the salient components of each mode into a context descriptor which can be used for multiple tasks; and we learn this descriptor through joint training of two GANs and a classifier.

The authors of [45] have adapted the model of [50] to a GAN setting, using GANs to predict the future visual feature representation. Upon training this representation, they train a classifier on the predicted features to anticipate the future action class. We argue that the approach of [45] is suboptimal, as there is no guarantee that the future action representation is well suited to predicting the action due to the two-stage learning. Our approach, which learns the tasks jointly, ensures that a rich multi-modal embedding is learnt that captures the salient information needed for both tasks. Furthermore, by extending this to a multi-modal setting, we demonstrate the importance of attending to both visual and temporal features for the action anticipation task.

3. Action Anticipation Model

Our action anticipation model is designed to predict the future while classifying future actions. The model aims to generate embeddings for future frames, to obtain a complete notion of the ongoing action and to understand how best to classify the action. In Sec. 3.1, we discuss how the context descriptor is generated using the visual and temporal input streams, while Sec. 3.2 describes the use of the GAN in the descriptor generation process. The future action classification procedure is described in Sec. 3.3, and we further improve this process with the addition of the cosine distance based regularisation method presented in Sec. 3.4.

3.1. Context Descriptor Formulation

Inputs to our model are two-fold: visual and temporal. The visual inputs are the RGB frames and the temporal inputs are the corresponding optical flow images (computed using [4]). If the number of input video frames is T, then both the visual input (I^V) and the temporal input (I^TP) can be represented as follows,

$$I^V = \{I^V_1, I^V_2, \dots, I^V_T\}, \qquad I^{TP} = \{I^{TP}_1, I^{TP}_2, \dots, I^{TP}_T\}. \tag{1}$$

These inputs are passed through a pre-trained feature extractor which extracts features θ^V and θ^TP frame-wise,

$$\theta^V = \{\theta^V_1, \theta^V_2, \dots, \theta^V_T\}, \qquad \theta^{TP} = \{\theta^{TP}_1, \theta^{TP}_2, \dots, \theta^{TP}_T\}. \tag{2}$$

Then θ^V and θ^TP are sent through separate LSTM networks to capture the temporal structure of the input features. The LSTM outputs are defined as,

$$h^V_t = \mathrm{LSTM}(\theta^V_t), \qquad h^{TP}_t = \mathrm{LSTM}(\theta^{TP}_t). \tag{3}$$

Attention values are generated for each frame such that,

$$e^V_t = \tanh\!\big(a^V [h^V_t]^{\top}\big), \qquad e^{TP}_t = \tanh\!\big(a^{TP} [h^{TP}_t]^{\top}\big), \tag{4}$$

where a^V and a^TP are multilayer perceptrons trained together with the rest of the network; the attention values are passed through a sigmoid function to get the score values,

$$\alpha^V_t = \sigma\big([e^V_t, e^{TP}_t]\big), \qquad \alpha^{TP}_t = 1 - \alpha^V_t. \tag{5}$$

Then, an attention weighted output vector is generated for each stream,

$$\mu^V_t = \alpha^V_t h^V_t, \qquad \mu^{TP}_t = \alpha^{TP}_t h^{TP}_t. \tag{6}$$

Finally, these output vectors are concatenated (denoted by [·,·]) to generate the context descriptor C_t,

$$C_t = [\mu^V_t, \mu^{TP}_t]. \tag{7}$$

C_t encodes the recent history of both inputs, and thus is used to predict future behaviour.
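For illustration, a minimal sketch of the fusion in Eqs. (3)-(7) is given below, assuming 2048-dimensional per-frame features and the 300-unit LSTMs stated in Sec. 4.2; this is not the authors' released implementation, and the single-layer realisations of a^V, a^TP and of σ are assumptions.

```python
# Sketch of the context-descriptor formulation (Eqs. (3)-(7)); sizes are assumptions.
from tensorflow.keras import layers, Model

T, FEAT_DIM, HIDDEN = 30, 2048, 300               # observed frames, feature size, LSTM units

theta_v = layers.Input(shape=(T, FEAT_DIM), name="visual_features")     # theta^V
theta_tp = layers.Input(shape=(T, FEAT_DIM), name="temporal_features")  # theta^TP

# Eq. (3): per-stream LSTMs over the observed frames
h_v = layers.LSTM(HIDDEN, return_sequences=True)(theta_v)    # h^V_t
h_tp = layers.LSTM(HIDDEN, return_sequences=True)(theta_tp)  # h^TP_t

# Eq. (4): frame-wise attention energies, here realised as single tanh units
e_v = layers.Dense(1, activation="tanh")(h_v)
e_tp = layers.Dense(1, activation="tanh")(h_tp)

# Eq. (5): a learnt sigmoid unit maps the concatenated energies to alpha^V_t,
# with alpha^TP_t = 1 - alpha^V_t
alpha_v = layers.Dense(1, activation="sigmoid")(layers.Concatenate()([e_v, e_tp]))
alpha_tp = layers.Lambda(lambda a: 1.0 - a)(alpha_v)

# Eqs. (6)-(7): attention-weighted stream outputs, concatenated into C_t
mu_v = layers.Lambda(lambda z: z[0] * z[1])([alpha_v, h_v])
mu_tp = layers.Lambda(lambda z: z[0] * z[1])([alpha_tp, h_tp])
context = layers.Concatenate(name="context_descriptor")([mu_v, mu_tp])  # C_t

context_model = Model([theta_v, theta_tp], context)
context_model.summary()
```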

3.2. Visual and Temporal GANs

GAN-based models are capable of learning an output that is difficult to discriminate from real examples. They learn a mapping from the input to this realistic output while learning a loss function to train the mapping. The context descriptor, C_t, is the input for both GANs (the visual and temporal synthesisers, see Fig. 2). The ground truth future visual and temporal frames are denoted F^V and F^TP, and are given by,

$$F^V = \{F^V_1, F^V_2, \dots, F^V_T\}, \qquad F^{TP} = \{F^{TP}_1, F^{TP}_2, \dots, F^{TP}_T\}. \tag{8}$$

We extract features for F^V and F^TP similarly to Eq. 2,

$$\beta^V = \{\beta^V_1, \beta^V_2, \dots, \beta^V_T\}, \qquad \beta^{TP} = \{\beta^{TP}_1, \beta^{TP}_2, \dots, \beta^{TP}_T\}. \tag{9}$$

These features, β^V and β^TP, are utilised during GAN training. The aim of the generator (G^V or G^TP) of each GAN is to synthesise the future deep feature sequence that is sufficiently realistic to fool the discriminator (D^V or D^TP). It should be noted that the GAN models do not learn to predict the future frames, but the deep features of the frames (visual or temporal). As observed in [50], this allows the model to recognise higher-level concepts in the present and anticipate their relationships with future actions. This is learnt through the following loss functions,

$$L_V(G^V, D^V) = \sum_{t=1}^{T} \log D^V(C_t, \beta^V) + \sum_{t=1}^{T} \log\big(1 - D^V(C_t, G^V(C_t))\big), \tag{10}$$

$$L_{TP}(G^{TP}, D^{TP}) = \sum_{t=1}^{T} \log D^{TP}(C_t, \beta^{TP}) + \sum_{t=1}^{T} \log\big(1 - D^{TP}(C_t, G^{TP}(C_t))\big). \tag{11}$$
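A minimal sketch of how the conditional adversarial objective of Eq. (10) could be computed for the visual GAN is shown below (the temporal GAN of Eq. (11) is identical with G^TP, D^TP and β^TP). This is an illustrative reading rather than the released code, and it assumes the discriminator outputs a probability for each (context, feature-sequence) pair.

```python
# Sketch of the conditional adversarial losses of Eq. (10); an assumed reading, not the released code.
import tensorflow as tf

def visual_gan_losses(G_V, D_V, context, beta_V):
    """context: (batch, T, ctx_dim) C_t; beta_V: (batch, T, feat_dim) real future features."""
    fake_V = G_V(context)                          # G^V(C_t): synthesised future features
    d_real = D_V([context, beta_V])                # D^V(C_t, beta^V)
    d_fake = D_V([context, fake_V])                # D^V(C_t, G^V(C_t))
    eps = 1e-7
    # Discriminator maximises log D(real) + log(1 - D(fake))
    d_loss = -tf.reduce_mean(tf.math.log(d_real + eps) + tf.math.log(1.0 - d_fake + eps))
    # Generator minimises log(1 - D(fake)), i.e. tries to fool the discriminator
    g_loss = tf.reduce_mean(tf.math.log(1.0 - d_fake + eps))
    return d_loss, g_loss
```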

3.3. Classification

The deep future sequences are learnt through the two GAN models as described in Sec. 3.2. A naive way to perform future action classification is to use the trained future feature predictor and pass the synthesised future features to the classifier. However, this is sub-optimal as G^V and G^TP have no knowledge of this task, and thus the features are likely sub-optimal for it. As such, in this work we investigate joint learning of the embedding prediction and future action anticipation, allowing the model to learn the salient features that are required for action anticipation. Hence, the GANs are able to support learning salient features for both processes. We perform future action classification for the action anticipation task through a classifier f_C, the input of which is C_t. The classification loss can then be defined as,

$$L_C = -\sum_{t=1}^{T} y_t \log f_C(C_t). \tag{12}$$

It is important to note that the context descriptor C_t is influenced by both the classification loss, L_C, and the GAN losses, L_V and L_TP, as G^V and G^TP utilise the context descriptor to synthesise the future representations.
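The following sketch illustrates the classification head and the loss of Eq. (12), assuming the single-LSTM classifier described in Sec. 4.2; with a sequence-summarising LSTM the frame-wise sum reduces to one cross-entropy term per clip, and all other details are assumptions.

```python
# Sketch of the anticipation classifier f_C and the loss of Eq. (12); sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(num_classes, hidden=300):
    return tf.keras.Sequential([
        layers.LSTM(hidden),                       # summarise the context sequence C_1 .. C_T
        layers.Dense(num_classes, activation="softmax"),
    ])

def classification_loss(classifier, context, y_true):
    """context: (batch, T, ctx_dim); y_true: (batch, num_classes) one-hot action label."""
    y_pred = classifier(context)                   # f_C(C_t)
    return tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))
```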

Page 5: Predicting the Future: A Jointly Learnt Model for Action ...our action anticipation method jointly learns to anticipate future scene representations while predicting the future ac-tion,

3.4. Regularisation

To stabilise GAN learning, a regularisation method such as the L2 loss is often used [20]. However, the cosine distance has been shown to be more effective when comparing deep embeddings [51, 52]. Furthermore, when generating future sequence forecasts it is more challenging to forecast representations in the distant future than the near future. However, the semantics from the distant future are more informative for the action class anticipation problem, as they carry more information about what the agents are likely to do. Hence, we propose a temporal regularisation mechanism which compares the predicted embeddings with the ground truth future embeddings using the cosine distance, and encourages the model to focus more on generating accurate embeddings for the distant future,

$$L_R = \sum_{t=1}^{T} -e^{t}\, d\big(G^V(C_t), \beta^V_t\big) + \sum_{t=1}^{T} -e^{t}\, d\big(G^{TP}(C_t), \beta^{TP}_t\big), \tag{13}$$

where d represents the cosine distance function. Motivated by [3], we introduce the exponential term e^t, encouraging more accurate prediction of distant future embeddings.

Then, the loss for the final model, which learns the context descriptor C_t and is reinforced by both deep future sequence synthesisers (the GAN models) and the future action classification, can be written as,

$$L = w_V L_V + w_{TP} L_{TP} + w_C L_C + w_R L_R, \tag{14}$$

where w_V, w_TP, w_C and w_R are hyper-parameters which control the contribution of the respective losses.
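A sketch of the regulariser of Eq. (13), reproduced with the sign convention exactly as printed, and of the combined objective of Eq. (14) with the weights reported in Sec. 4.2, is given below; tensor shapes and the exact reduction over the batch are assumptions.

```python
# Sketch of Eq. (13) (exponentially weighted cosine-distance regulariser) and Eq. (14).
import tensorflow as tf

def cosine_distance(a, b, axis=-1):
    a = tf.math.l2_normalize(a, axis=axis)
    b = tf.math.l2_normalize(b, axis=axis)
    return 1.0 - tf.reduce_sum(a * b, axis=axis)            # d(a, b) per time step

def regularisation_loss(fake_V, beta_V, fake_TP, beta_TP):
    """All inputs: (batch, T, feat_dim); later time steps receive exponentially larger weight."""
    T = tf.shape(beta_V)[1]
    w = tf.exp(tf.cast(tf.range(1, T + 1), tf.float32))      # e^t, t = 1..T
    d_v = cosine_distance(fake_V, beta_V)                     # (batch, T)
    d_tp = cosine_distance(fake_TP, beta_TP)
    # Sign convention follows Eq. (13) as printed
    return tf.reduce_mean(-w * d_v) + tf.reduce_mean(-w * d_tp)

def total_loss(L_V, L_TP, L_C, L_R, w_V=25.0, w_TP=20.0, w_C=43.0, w_R=15.0):
    # Eq. (14) with the hyper-parameter values reported in Sec. 4.2
    return w_V * L_V + w_TP * L_TP + w_C * L_C + w_R * L_R
```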

4. Evaluations

4.1. Datasets

Related works on action anticipation or early action prediction typically use discrete action datasets. The four datasets we use to evaluate our work are outlined below.

UCF101 [49] has been widely used for discrete action recognition, and in recent works for action anticipation, due to its size and variety. The dataset includes 101 action classes from 13,320 videos with an average length of 7.2 seconds. In order to compare to state-of-the-art methods, we utilise the provided three training/testing splits and report the average accuracy over the three splits.

UCF101-24 [47] is a subset of the UCF101 dataset. It is composed of 24 action classes in 3,207 videos. In order to compare action anticipation results to the state-of-the-art, we utilise only the data provided in set 1.

UT-Interaction (UTI) [43] is a human interaction dataset, which contains videos of two or more people performing interactions such as handshake, punch, etc. in a sequential and/or concurrent manner. The dataset has a total of 120 videos. For the state-of-the-art comparison we utilise 10-fold leave-one-out cross validation on each set, and the mean performance over all sets is obtained, as per [3].

The TV Human Interaction (TV-HI) [39] dataset is a collection of 300 video clips collected from 20 different TV shows. It is composed of four action classes of people performing interactions (handshake, highfive, hug and kiss), and a fifth action class called ‘none’ which does not contain any of the four actions. The provided train/test splits are utilised with 25-fold cross validation, as per [50].

4.2. Network Architecture and Training

Considering the related literature, different numbers of observed frames are used for different datasets [3, 45]. Let T be the number of observed frames; we then extract frames T+1 to T+T' as future frames, where T' is the number of future frames for embedding prediction. As the temporal input, similar to [46], we use dense optical flow displacements computed using [4]. In addition to the horizontal and vertical components, we also use the mean displacement of the horizontal and vertical flow. Both visual and temporal inputs are individually passed through a ResNet50 [18] pre-trained on ImageNet [41], and activations from the ‘activation 23’ layer are used as the input feature representation.

The generator network is composed of two LSTM layers followed by a fully connected layer. The generator is fed only with the context input, while the discriminator is fed with both the context and the real/fake feature representations. The two inputs of the discriminator are passed through separate LSTM layers, and the merged output is passed through two fully connected layers. The classifier is composed of a single LSTM layer followed by a single fully connected layer. For clarity, we provide model diagrams in the supplementary material. For all LSTMs, 300 hidden units are used. For the model training procedure we follow the approach of [20], alternating between one gradient descent pass for the discriminators, and one for the generators and the classifier, using 32 samples per mini-batch. The Adam optimiser [24] is used with a learning rate of 0.0002 and a decay of 8 × 10^-9, and the model is trained for 40 epochs. The hyper-parameters w_V, w_TP, w_C and w_R are evaluated experimentally and set to 25, 20, 43 and 15, respectively; please refer to the supplementary material for these evaluations. When training the proposed model for the UTI and TV-HI datasets, due to the limited availability of training examples, we first train the model on the UCF101 training data and fine-tune it on the training data of the specific dataset.

For the implementation of our proposed method we utilised Keras [5] with Theano [2] as the backend.
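For illustration, the generator and discriminator structures described above could be sketched in Keras as follows. Widths other than the stated 300 LSTM units (and the 600-dimensional context obtained by concatenating the two 300-unit streams) are assumptions, as are the ReLU hidden layer and sigmoid output of the discriminator.

```python
# Sketch of the generator / discriminator structure of Sec. 4.2; unstated details are assumptions.
from tensorflow.keras import layers, Model, optimizers

T, CTX_DIM, FEAT_DIM, HIDDEN = 30, 600, 2048, 300

def build_generator():
    ctx = layers.Input(shape=(T, CTX_DIM))                 # context sequence C_t
    x = layers.LSTM(HIDDEN, return_sequences=True)(ctx)    # two LSTM layers
    x = layers.LSTM(HIDDEN, return_sequences=True)(x)
    out = layers.Dense(FEAT_DIM)(x)                        # fully connected: future features per step
    return Model(ctx, out, name="generator")

def build_discriminator():
    ctx = layers.Input(shape=(T, CTX_DIM))
    feat = layers.Input(shape=(T, FEAT_DIM))               # real or generated future features
    h_ctx = layers.LSTM(HIDDEN)(ctx)                       # separate LSTMs for the two inputs
    h_feat = layers.LSTM(HIDDEN)(feat)
    x = layers.Concatenate()([h_ctx, h_feat])              # merge, then two fully connected layers
    x = layers.Dense(HIDDEN, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)         # real / fake score
    return Model([ctx, feat], out, name="discriminator")

# Training alternates one gradient step for the discriminators with one for the
# generators and the classifier, with mini-batches of 32, as stated above.
optimiser = optimizers.Adam(learning_rate=2e-4)            # the paper also applies a decay of 8e-9
```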


4.3. Performance Evaluation

4.3.1 Evaluation Protocol

To evaluate our model on each dataset, where possible we consider two settings for the number of input frames, namely the ‘Earliest’ and ‘Latest’ settings. For UCF101 and UTI, similar to [3], we consider 20% and 50% of the frames for the ‘Earliest’ and ‘Latest’ settings, respectively; following [3], we do not use more than 50 frames for the ‘Latest’ setting. For each dataset and setting, we resample the input videos such that all sequences have a constant number of frames. Due to the unavailability of baseline results, and following [45], for UCF101-24 we evaluate using 50% of the frames from each video; for the TV-HI dataset, as in [14, 27], we consider only 1 second's worth of frames.
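For concreteness, a small helper reproducing the ‘Earliest’/‘Latest’ frame selection described above (20% vs. 50% of the frames, with the ‘Latest’ setting capped at 50 frames) might look as follows; the resampling step itself is dataset-specific and omitted here.

```python
# Illustrative helper for the 'Earliest' / 'Latest' observation settings; an assumption-level sketch.
def observed_frames(video_length, setting="earliest"):
    if setting == "earliest":
        n = round(0.2 * video_length)          # 'Earliest': first 20% of frames
    else:
        n = min(round(0.5 * video_length), 50) # 'Latest': 50% of frames, at most 50
    return max(n, 1)

print(observed_frames(180, "earliest"))  # 36
print(observed_frames(180, "latest"))    # 50 (capped)
```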

4.3.2 Comparison to the state-of-the-art methods

Evaluations for the UCF101, UCF101-24, UTI and TV-HI datasets are presented in Tables 1 to 4, respectively. Considering the results, the authors of Multi-stage LSTM [3] and RED [14] have introduced new hand-engineered losses that encourage early prediction of the action class. The authors of RBF-RNN [45] use a GAN learning process where the loss function is also automatically learnt. Similar to the proposed architecture, the RBF-RNN [45] model also utilises the spatial representation of the scene through a deep CNN model and tries to predict future scene representations. However, in contrast to the proposed architecture, this method does not utilise temporal features or joint learning. We learn a context descriptor which effectively combines both spatial and temporal representations, which not only aids the action classification but also anticipates the future representations more accurately; this leads to superior results. In Tab. 2, the results for UCF101-24 show that our model is able to outperform RBF-RNN [45] by 0.9%, while in Tab. 3 we outperform [45] on the UTI dataset by 1.3% in the earliest setting.

When comparing the performance gap between the earliest and latest settings, our model has a smaller performance drop compared to the baseline models: the gap for UCF101 with our model is 1.4%, while the gap for the Multi-stage LSTM model [3] is 2.9%. G^V and G^TP synthesise the future representations of both the visual and temporal streams while considering the current context. As such, the proposed model is able to better anticipate future actions, even with fewer frames. Our evaluations on multiple benchmarks further illustrate the generalisability of the proposed architecture across varying video lengths and dataset sizes.

4.4. Ablation Experiments

To further demonstrate the proposed AA-GAN method, we conducted an ablation study by strategically removing components of the proposed system.

Method                          Earliest   Latest
Context-aware + loss in [21]    30.6       71.1
Context-aware + loss in [34]    22.6       73.1
Multi-stage LSTM [3]            80.5       83.4
Proposed                        84.2       85.6

Table 1. Action anticipation results for UCF101 considering the ‘Earliest’ 20% of frames and the ‘Latest’ 50% of frames.

Method                   Accuracy
Temporal Fusion [11]     86.0
ROAD [47]                92.0
ROAD + BroxFlow [47]     90.0
RBF-RNN [45]             98.0
Proposed                 98.9

Table 2. Action anticipation results for UCF101-24 considering 50% of frames from each video.

Method                          Earliest   Latest
S-SVM [48]                      11.0       13.4
DP-SVM [48]                     13.0       14.6
CuboidBayes [42]                25.0       71.7
CuboidSVM [44]                  31.7       85.0
Context-aware + loss in [21]    45.0       65.0
Context-aware + loss in [34]    48.0       60.0
I-BoW [42]                      65.0       81.7
BP-SVM [28]                     65.0       83.3
D-BoW [42]                      70.0       85.0
Multi-stage LSTM [3]            84.0       90.0
Future-dynamic [40]             89.2       91.9
RBF-RNN [45]                    97.0       NA
Proposed                        98.3       99.2

Table 3. Action anticipation results for UTI considering the ‘Earliest’ 20% of frames and the ‘Latest’ 50% of frames.

Method                   Accuracy
Vondrick et al. [50]     43.6
RED [14]                 50.2
Proposed                 55.7

Table 4. Action anticipation results for the TV Human Interaction dataset considering 1 second's worth of frames from each video.

We evaluated seven non-GAN based model variants and ten GAN-based variants of the proposed AA-GAN model. The non-GAN based models are further broken into two categories: models with and without future representation generators. Similarly, the GAN-based models fall into two categories: those that do and do not learn the tasks jointly. Diagrams of these ablation models are available in the supplementary material.

Non-GAN based models: These models do not utilise any future representation generators, and are only trained through the classification loss.

(a) η_{C,V}: A model trained to classify using the context feature extracted only from the visual input stream (V).

(b) η_{C,TP}: As per model (a), but using the temporal input stream (TP).

(c) η_{C,(V+TP)}: As per (a), but using both data streams to create the context embedding.

Non-GAN based models with future representation generators: Here, we add future embedding generators to the previous set of models. The generators are trained through a mean squared error loss (i.e. no discriminator and no adversarial loss), while the classification is learnt through a categorical cross-entropy loss. The purpose of these models is to show how joint learning can improve performance, and how a common embedding can serve both tasks.

(d) η_{C,V} + G^V: Model with the future visual representation generator (G^V), fed only with the visual input stream to train the classifier.

(e) η_{C,TP} + G^TP: As per (d), but receiving and predicting the temporal input stream.

(f) η_{C,(V+TP)} + G^V + G^TP: The model is composed of both generators, G^V and G^TP, and is fed with both the visual and temporal input streams.

(g) η_{C,(V+TP)} + G^V + G^TP + Att: As per (f), but with the use of attention to combine the streams.

GAN based models without joint training: These methods are based on a GAN framework that generates future representations and a classifier that anticipates the action, where the two tasks are learnt separately. We first train the GAN model using the adversarial loss; once this model is trained, the classifier anticipates the action using the generated future embeddings.

(h) η_{C,V} + GAN_V \Joint: Uses the GAN learning framework with only the visual input stream; the cosine distance based regularisation is used.

(i) η_{C,TP} + GAN_TP \Joint: As per (h), but with the temporal input stream.

(j) AA-GAN \Joint: Uses the GAN learning framework with both the visual and temporal input streams.

GAN based models with joint training: These models train the deep future representation generators adversarially. The stated model variants are obtained by removing different components from the proposed model.

(k) η_{C,V} + GAN_V \(L_R): The proposed approach with only the visual input stream and without the cosine distance based regularisation.

(l) η_{C,TP} + GAN_TP \(L_R): The proposed approach with only the temporal input stream and without the cosine distance based regularisation.

(m) η_{C,V} + GAN_V: The proposed approach with only the visual input stream. Cosine distance based regularisation is used.

(n) η_{C,TP} + GAN_TP: The proposed approach with only the temporal input stream. Cosine distance based regularisation is used.

(o) AA-GAN \(L_R): The proposed model without the cosine distance based regularisation.

(p) AA-GAN \(DR): Similar to the proposed model; however, G^V and G^TP predict pixel values for future visual and temporal frames instead of representations extracted from the pre-trained feature extractor.

Method                                      Accuracy
(a) η_{C,V}                                 45.1
(b) η_{C,TP}                                39.8
(c) η_{C,(V+TP)}                            52.0
(d) η_{C,V} + G^V                           54.7
(e) η_{C,TP} + G^TP                         52.4
(f) η_{C,(V+TP)} + G^V + G^TP               68.1
(g) η_{C,(V+TP)} + G^V + G^TP + Att         68.8
(h) η_{C,V} + GAN_V \Joint                  98.1
(i) η_{C,TP} + GAN_TP \Joint                97.9
(j) AA-GAN \Joint                           98.3
(k) η_{C,V} + GAN_V \(L_R)                  96.0
(l) η_{C,TP} + GAN_TP \(L_R)                95.4
(m) η_{C,V} + GAN_V                         98.4
(n) η_{C,TP} + GAN_TP                       98.1
(o) AA-GAN \(L_R)                           98.7
(p) AA-GAN \(DR)                            95.9
AA-GAN (proposed)                           98.9

Table 5. Ablation results for the UCF101-24 dataset for the ‘Latest’ setting, which uses 50% of the frames from each video.

The evaluation results of the ablation models on the UCF101-24 test set are presented in Tab. 5.

Non-GAN based models (a to g): Model performance clearly improves when using both data streams together over either one individually (see (c) vs. (a) and (b); and (f) vs. (d) and (e)). Hence, it is clear that the two streams provide different information cues to facilitate the prediction. Comparing the results of the models that do not utilise the future representation generators with (d), we see that generating the future representation does improve the results.

Figure 3. Projections of the discriminator hidden states for the AA-GAN (a) and ablation model (g) (see Section 4.4) in (b), before (in blue) and after (in red) training. Ground truth action classes are in brackets. Inserts indicate sample frames from the respective videos.

GAN based models without joint training (h to j): Comparing the non-GAN based methods with ablation model (h), we see that a major performance boost is achieved through the GAN learning process, denoting the importance of the automated loss function learning. Comparing the performance of the visual and temporal streams, we observe that the visual stream is dominant; however, combining both streams through the proposed attention mechanism captures complementary information.

GAN based models with joint training (k to p): Comparing models (h) and (i), which are single-modal models that do not use joint training, with models (m) and (n), which do, we can see the clear benefit offered by learning the two complementary tasks together. This contradicts the observation reported in [45], who use a classifier connected to the predicted future embeddings. We speculate that by learning a compressed context representation for both tasks, we effectively propagate the effect of the action anticipation error through the encoding mechanisms, allowing this representation to be informative for both tasks. Finally, by coupling the GAN loss with L_R, where the cosine distance based regularisation is combined with the exponential loss to encourage accurate long-term predictions, we achieve state-of-the-art results.

Furthermore, we compare the proposed AA-GAN model, where G^V and G^TP synthesise future visual and temporal representations, against ablation model (p), where G^V and G^TP synthesise pixel values for future frames. It is evident that the latter model fails to capture the semantic relationships between the low-level pixel features and the action class, leading to the derived context descriptor being less informative for action classification and reducing performance.

To demonstrate the discriminative nature of the learnt context embeddings, Fig. 3 (a) visualises the embedding space before (in blue) and after (in red) training of the proposed context descriptor, for 30 randomly selected examples from the TV-HI test set. We extracted the learned context descriptor, C_t, and applied PCA [53] to generate 2D vectors. Ground truth action classes are indicated in brackets.

This clearly shows that the proposed context descriptor learns embeddings which are informative for both future representation generation and the segregation of action classes. The inserts, which show sample frames from the videos, reveal visual similarities between the classes, hence the overlap in the embedding space before training. However, after learning, the context descriptor has been able to maximise the inter-class distance while minimising the intra-class distance. Fig. 3 (b) shows an equivalent plot for ablation model (g). Given the cluttered nature of those embeddings before and after learning, it is clear that the proposed GAN learning process makes a significant contribution to learning discriminative embeddings. 2
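The 2D projection used for Fig. 3 can be reproduced with an off-the-shelf PCA, as sketched below; variable names and plot styling are illustrative only.

```python
# Illustrative sketch of the PCA projection of the learnt context descriptors (Fig. 3).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_context_embeddings(contexts, labels):
    """contexts: (n_samples, ctx_dim) pooled context descriptors C_t; labels: action class ids."""
    xy = PCA(n_components=2).fit_transform(contexts)    # project to 2D
    plt.scatter(xy[:, 0], xy[:, 1], s=15)
    for (x, y), lab in zip(xy, labels):
        plt.annotate(f"[{lab}]", (x, y), fontsize=8)     # ground truth class in brackets
    plt.show()

# e.g. plot_context_embeddings(np.random.randn(30, 600), np.random.randint(0, 5, 30))
```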

4.5. Time Complexity

We evaluate the computational demands of the proposed AA-GAN model for the UTI dataset's ‘Earliest’ setting. The model contains 43M trainable parameters, and generates 500 predictions (including the future visual and temporal predictions and the action anticipation) in 1.64 seconds using a single core of an Intel E5-2680 2.50 GHz CPU.

5. Conclusion

In this paper we propose a framework which jointly learns to anticipate an action while also synthesising future scene embeddings. We learn a context descriptor which facilitates both of these tasks by systematically attending to the individual input streams and effectively extracting the salient features. This method exhibits traits analogous to human neurological behaviour in synthesising the future, and renders an end-to-end learning platform. Additionally, we introduced a cosine distance based regularisation method to guide the generators in the synthesis task. Our evaluations demonstrate the superior performance of the proposed method on multiple public benchmarks.

2 Additional qualitative evaluations showing generated future visual and temporal representations are in the supplementary material.


References

[1] Unaiza Ahsan, Chen Sun, and Irfan Essa. Discrimnet: Semi-supervised action recognition from videos using generative adversarial networks. arXiv preprint arXiv:1801.07230, 2018.
[2] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frederic Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 472:473, 2016.
[3] Mohammad Sadegh Aliakbarian, F. Sadat Saleh, Mathieu Salzmann, Basura Fernando, Lars Petersson, and Lars Andersson. Encouraging LSTMs to anticipate actions very early. In IEEE International Conference on Computer Vision (ICCV), volume 1, 2017.
[4] Thomas Brox, Andres Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision, pages 25-36. Springer, 2004.
[5] Francois Chollet et al. Keras. https://keras.io, 2015.
[6] Vincent Delaitre, Ivan Laptev, and Josef Sivic. Recognizing human actions in still images: a study of bag-of-features and part-based representations. In British Machine Vision Conference (BMVC), 2010. Updated version available at http://www.di.ens.fr/willow/research/stillactions/.
[7] Alan Dix. Human-computer interaction. In Encyclopedia of Database Systems, pages 1327-1331. Springer, 2009.
[8] Jeff Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[9] Ahmet Ekin, A. Murat Tekalp, and Rajiv Mehrotra. Automatic soccer video analysis and summarization. IEEE Transactions on Image Processing, 12(7):796-807, 2003.
[10] Birgit Elsner and Bernhard Hommel. Effect anticipation and action control. Journal of Experimental Psychology: Human Perception and Performance, 27(1):229, 2001.
[11] Zhaoxuan Fan, Tianwei Lin, Xu Zhao, Wanli Jiang, Tao Xu, and Ming Yang. An online approach for gesture recognition toward real-world applications. In International Conference on Image and Graphics, pages 262-272. Springer, 2017.
[12] Harshala Gammulle, Simon Denman, Sridha Sridharan, and Clinton Fookes. Multi-level sequence GAN for group activity recognition. In Asian Conference on Computer Vision, pages 331-346. Springer, 2018.
[13] Harshala Gammulle, Tharindu Fernando, Simon Denman, Sridha Sridharan, and Clinton Fookes. Coupled generative adversarial network for continuous fine-grained action segmentation. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 200-209. IEEE, 2019.
[14] Jiyang Gao, Zhenheng Yang, and Ram Nevatia. RED: Reinforced encoder-decoder networks for action anticipation. arXiv preprint arXiv:1707.04818, 2017.
[15] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[17] Peter F. Greve. The role of prediction in mental processing: A process approach. New Ideas in Psychology, 39:45-52, 2015.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[19] Jian-Fang Hu, Wei-Shi Zheng, Lianyang Ma, Gang Wang, Jian-Huang Lai, and Jianguo Zhang. Early action prediction by soft regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[20] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[21] Ashesh Jain, Avi Singh, Hema S. Koppula, Shane Soh, and Ashutosh Saxena. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 3118-3125. IEEE, 2016.
[22] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221-231, 2013.
[23] Christoph G. Keller and Dariu M. Gavrila. Will the pedestrian cross? A study on pedestrian path prediction. IEEE Transactions on Intelligent Transportation Systems, 15(2):494-506, 2014.
[24] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
[25] ByoungChul Ko, JuneHyeok Hong, and Jae-Yeal Nam. Human action recognition in still images using action poselets and a two-layer classification model. J. Vis. Lang. Comput., 28(C):163-175, June 2015.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[27] Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In European Conference on Computer Vision, pages 689-704. Springer, 2014.
[28] Kennard Laviers, Gita Sukthankar, David W. Aha, Matthew Molineaux, C. Darken, et al. Improving offensive performance through opponent modeling. In AIIDE, 2009.
[29] Colin Lea, Michael D. Flynn, Rene Vidal, Austin Reiter, and Gregory D. Hager. Temporal convolutional networks for action segmentation and detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[30] Dong-Gyu Lee and Seong-Whan Lee. Prediction of partially observed human activity based on pre-trained deep representation. Pattern Recognition, 85:198-206, 2019.


[31] Moritz Lehne and Stefan Koelsch. Toward a general psychological model of tension and suspense. Frontiers in Psychology, 6:79, 2015.
[32] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5226-5234, 2018.
[33] Xinyu Li, Yanyi Zhang, Jianyu Zhang, Yueyang Chen, Huangcan Li, Ivan Marsic, and Randall S. Burd. Region-based activity recognition using conditional GAN. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1059-1067. ACM, 2017.
[34] Shugao Ma, Leonid Sigal, and Stan Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1942-1950, 2016.
[35] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR), 2016.
[36] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[37] Bingbing Ni, Xiaokang Yang, and Shenghua Gao. Progressively parsing interactional objects for fine grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1020-1028, 2016.
[38] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536-2544, 2016.
[39] Alonso Patron-Perez, Marcin Marszalek, Ian Reid, and Andrew Zisserman. Structured learning of human interactions in TV shows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2441-2453, 2012.
[40] Cristian Rodriguez, Basura Fernando, and Hongdong Li. Action anticipation by predicting future dynamic images. In ECCV'18 Workshop on Anticipating Human Behavior, 2018.
[41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[42] Michael S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1036-1043. IEEE, 2011.
[43] Michael S. Ryoo and J. K. Aggarwal. UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA), 2010.
[44] Michael S. Ryoo, Chia-Chih Chen, J. K. Aggarwal, and Amit Roy-Chowdhury. An overview of contest on semantic description of human activities (SDHA) 2010. In Recognizing Patterns in Signals, Speech, Images and Videos, pages 270-285. Springer, 2010.
[45] Yuge Shi, Basura Fernando, and Richard Hartley. Action anticipation with RBF kernelized feature mapping RNN. In European Conference on Computer Vision, pages 305-322. Springer, 2018.
[46] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. CoRR, abs/1406.2199, 2014.
[47] Gurkirt Singh, Suman Saha, Michael Sapienza, Philip H. S. Torr, and Fabio Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In ICCV, pages 3657-3666, 2017.
[48] Khurram Soomro, Haroon Idrees, and Mubarak Shah. Online localization and prediction of actions and interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[49] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[50] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 98-106, 2016.
[51] Jiabao Wang, Yang Li, Zhuang Miao, Yulong Xu, and Gang Tao. Learning deep discriminative features based on cosine loss function. Electronics Letters, 53(14):918-920, 2017.
[52] Nicolai Wojke and Alex Bewley. Deep cosine metric learning for person re-identification. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 748-756. IEEE, 2018.
[53] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37-52, 1987.
[54] Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S. Paek, and In So Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517-532. Springer, 2016.
[55] Kuo-Hao Zeng, William B. Shen, De-An Huang, Min Sun, and Juan Carlos Niebles. Visual forecasting by imitating dynamics in natural sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 2999-3008, 2017.
[56] Mengmi Zhang, Keng Teck Ma, Joo Hwee Lim, Qi Zhao, and Jiashi Feng. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4372-4381, 2017.