arXiv:1808.07507v1 [cs.CV] 22 Aug 2018

Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition

Unaiza Ahsan
Georgia Institute of Technology
[email protected]

Rishi Madhok
Carnegie Mellon University
[email protected]

Irfan Essa
Georgia Institute of Technology
irfan@gatech.edu

Abstract

We propose a self-supervised learning method to jointly reason about spatial and temporal context for video recognition. Recent self-supervised approaches have used spatial context [9, 34] as well as temporal coherency [32] but a combination of the two requires extensive preprocessing such as tracking objects through millions of video frames [59] or computing optical flow to determine frame regions with high motion [30]. We propose to combine spatial and temporal context in one self-supervised framework without any heavy preprocessing. We divide multiple video frames into grids of patches and train a network to solve jigsaw puzzles on these patches from multiple frames. So the network is trained to correctly identify the position of a patch within a video frame as well as the position of a patch over time. We also propose a novel permutation strategy that outperforms random permutations while significantly reducing computational and memory constraints. We use our trained network for transfer learning tasks such as video activity recognition and demonstrate the strength of our approach on two benchmark video action recognition datasets without using a single frame from these datasets for unsupervised pretraining of our proposed video jigsaw network.

1. Introduction

Unsupervised representation learning of visual data is a much needed line of research as it does not require manually labeled large scale datasets. This is especially true for classification tasks in video, where the annotation process is tedious and sometimes hard to agree upon (for example, where an action begins and ends) [46]. One proposed solution to this problem is self-supervised learning, where auxiliary tasks are designed to exploit the inherent structure of unlabeled datasets and a network is trained to solve those tasks. Self-supervised tasks that exploit spatial context include predicting the location of one patch relative to another [9], solving a jigsaw puzzle of image patches [34], and predicting an image's color channels from grayscale [61, 28], among others. Self-supervision tasks on video data include video frame tuple order verification [32], sorting video frames [30], and tracking objects over time and training a Siamese network for similarity based learning [58]. Video data involves not just spatial context but also rich temporal structure in an image sequence. Attempts to combine the two have resulted in multi-task learning approaches [10] that yield some improvement over a single network. This work proposes a self-supervised task that jointly exploits spatial and temporal context in videos by dividing multiple video frames into patches and shuffling them into a jigsaw puzzle problem. The network is trained to solve this puzzle, which involves reasoning over space and time.

Several studies empirically validate that the earliest visual cues captured by infants' brains are surface motions of objects [51]; these then develop into perception involving local appearance and texture of objects [51]. Studies have also pointed out that objects' motion and their temporal transformations are important for the human visual system to learn the structure of objects [15, 60]. Motivated by these studies, there is recent work on unsupervised video representation learning via tracking objects through videos and training a Siamese network to learn a similarity metric on these object patches [58]. However, the prerequisite of this approach is to track millions of objects through videos and extract the relevant patches. Keeping this in mind, we propose to learn such a structure of objects and their transformations over time by designing a self-supervised task which solves jigsaw puzzles comprising multiple video frame patches, without needing to explicitly track objects over time.


Figure 1: Video Jigsaw Task: The first row shows a tuple of frames of the action "high jump". The second row shows how we divide each frame into a 2×2 grid of patches. The third row shows a random permutation of the 12 patches which are input to the network. The final row shows the jigsaw puzzle assembled.

Our proposed method, trained on a large scale video activity dataset, also does not require optical flow based patch mining, and we show empirically that a large unlabeled video dataset and a simple permutation sampling approach are enough to learn an effective unsupervised representation. Figure 1 shows our proposed approach, which we call video jigsaw. Our contributions in this paper are:

1. We propose a novel self-supervised task which divides multiple video frames into patches, creates jigsaw puzzles out of these patches, and trains a network to solve them.

2. Our work exploits both spatial and temporal context in one joint framework without requiring explicit object tracking in videos or optical flow based patch mining from video frames.

3. We propose a permutation strategy that constrains the sampled permutations and outperforms random permutations while being memory efficient.

4. We show via extensive experimental evaluation the feasibility and effectiveness of our approach on video action recognition.

5. We demonstrate the domain transfer capability of our proposed video jigsaw networks: our best self-supervised model is trained on Kinetics [24] video frames, and we demonstrate competitive results on the UCF101 [50] and HMDB51 [27] datasets.

2. Related Work

Unsupervised representation learning is a well studied problem in the literature for both images and videos. The goal is to learn a representation that is simpler in some way: it can be low-dimensional, sparse, and/or independent [17]. One way to learn such a representation is to use a reconstruction objective. Autoencoders [20] are neural networks designed to reconstruct their input and produce it as output. Denoising autoencoders [56] train a network to undo random corruption of the input data. Other methods that use reconstruction to estimate the latent variables that can explain the observed data include Deep Boltzmann Machines [45], stacked autoencoders [29, 5] and Restricted Boltzmann Machines (RBMs) [21, 49]. Classical work (before deep learning) involved hand-designing features and feature aggregation for applications such as object discovery in large datasets [48, 44] and mid-level feature mining [8, 47, 54].

Unsupervised learning from videos includes many learning variants such as video frame prediction [60, 33, 52, 55, 18], but we argue that predicting pixels is a much harder task, especially if the end task is to learn high level motion and appearance changes in frames for activity recognition. Other unsupervised representation learning approaches include exemplar CNNs [11], CliqueCNNs [4] and unsupervised similarity learning by clustering [3].

Unsupervised representations are generally learned to make another learning task (of interest) easier [17]. This forms the basis of another line of work that has emerged, called 'self-supervised learning' [9, 32, 61, 28, 58, 10, 59, 35]. Self-supervised learning aims to find structure in the unlabeled data by designing auxiliary tasks and pseudo labels to learn features that can explain the factors of variation in the data. These features can then be useful for the target task; in our case, video action recognition. Self-supervised learning can exploit several cues, some of which are spatial context and temporal coherency. Other self-supervised learning tasks on videos use cues like ego-motion [63, 22, 1] as a supervisory signal and other modalities beyond raw pixels such as audio [37, 36] and robot motion [2, 40, 41, 42]. We briefly cover relevant literature from the spatial, temporal and combined contextual cues for self-supervised learning.

Spatial Context: These methods typically sample patches from images or videos; supervised tasks are designed around the arrangement of these patches, and pseudo labels are constructed. Doersch et al. [9] divide an image into a 3×3 grid, sample two patches from an image and train a network to predict the location of the second patch relative to the first. This prediction task requires no labels but learns an effective image representation. Noroozi and Favaro [34] also divide an image into a 3×3 grid, but they input all patches into a Siamese-like network where the patches are shuffled and the task is to solve this jigsaw puzzle. They report that with just 100 permutations, their network is able to learn a representation such that, when finetuned on PASCAL VOC 2007 [13] for object detection and classification, it produces good results. Pathak et al. [39] devise an inpainting auxiliary task where blocks of pixels from an image are removed and the task is to predict the missing pixels. A related task is image colorization [61, 28], where the network is trained to predict the color of the image, which is available as a 'free signal' with images. Zhang et al. [62] modify the autoencoder architecture to predict raw data channels as their self-supervised task and use the learnt features for supervised tasks.

Temporal Coherency: These methods use temporal coherency as a supervisory signal to train models and use abundant unlabeled video data instead of just images. Wang and Gupta [58] use detection and tracking methods to extract object patches from videos and train a Siamese network with the prior that objects in nearby frames are similar whereas other random object patches are dissimilar. Misra et al. [32] devise a sequence verification task where tuples of video frames are shuffled and the network is trained on the binary task of discriminating between correctly ordered and shuffled frames. Fernando et al. [14] design a task where they take frames in correct temporal order and in shuffled order, encode them and pass them as input to a network which is then trained to predict the odd encoding out of the rest, odd being the temporally shuffled one. Lee et al. [30] extract high motion tuples of four frames via optical flow and shuffle them. Their network learns to predict the permutation from which the frames were sampled. Our work is highly related to approaches that shuffle video frames and train a network to learn the permutations. A key difference between our work and Lee et al. [30] is that they use only a single 80×80 patch from a video frame and shuffle it with three other patches from different frames. We sample a grid of patches from each frame and shuffle them with multiple patches from other frames. Instead of the binary task of tuple verification like Misra et al. [32], our self-supervised task is to predict the exact permutation of the patches, much like the jigsaw puzzle task of Noroozi and Favaro [34], only on videos. Some recent approaches have used temporal coherency-based self-supervision on video sequences to model fine-grained human poses and activities [31] and animal behavior [7]. Our model is not specialized for motor skill learning like [7] and we do not require bounding boxes for humans in the video frames as in [31].

Combining Multiple Cues: Since our approach combines spatial and temporal context into a single task, it is pertinent to mention recent approaches that combine multiple supervisory cues. Doersch and Zisserman [10] combine four self-supervised tasks in a multi-task training framework. The tasks include context prediction [9], colorization [61], exemplar-based learning [12] and motion segmentation [38]. Their experiments prove that naively combining different tasks does not yield improved results. They propose a lasso regularization scheme to capture only useful features from the trained network. Our work does not require a complex model for combining the spatial and temporal context prediction tasks for self-supervised learning. Wang et al. [59] train a Siamese network to recognize whether an object patch belongs to a similar category (but a different object) or to the same object, only later in time. This work attempts to combine spatial and temporal context but requires preprocessing to discover the tracked object patches. Our work constructs the spatiotemporal task from video frames automatically without requiring graph construction or visual detection and tracking. There is also recent work on using synthetic imagery and its 'free annotations' to learn visual representations [43] by combining multiple self-supervised tasks. A related approach to ours is that of [53], where the authors devise two tasks for the network to train on in a multi-task framework. One is a spatial placement task where a network learns to identify whether an image patch overlaps with a person bounding box or not. The second task is an ordering one where a network is trained to identify the correct sequence of two frames in a Siamese network setting, much like [32]. The key difference between their work and ours is that our network does not do multi-task learning and predicts a much richer set of labels (that is, the shuffled configuration of patches) as compared to binary classification.

3. The Video Jigsaw Puzzle Problem

We present the video jigsaw puzzle task in this section. Our goal is to create a task that not only forces a network to learn the part-based appearance of complex activities but also how those parts change over time. For this, we divide a video frame into a 2 × 2 grid of patches. For a tuple of three video frames, this results in 3 × (2 × 2) = 12 total patches per video. We number the patches from 1 to 12 and shuffle them. Note that there are 12! = 479,001,600 ways to shuffle these patches. We use a small but diverse subset of these patches' permutations, selecting them based on their Hamming distance from the previously sampled permutations [34]. We use two sampling strategies in our experiments, which we describe in more detail below. The network is trained to predict the correct order of patches. Our video jigsaw task is illustrated in Figure 1.
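For concreteness, the following snippet sketches the index space of this puzzle and the Hamming distance used to compare two patch arrangements. The constants follow the 3-frame, 2 × 2 setup above; the zero-based indexing and function names are our own choices.

```python
import math
import numpy as np

# The 3-frame, 2x2-grid setup yields 12 patch positions (0..11 here).
NUM_FRAMES, PATCHES_PER_FRAME = 3, 4
NUM_POSITIONS = NUM_FRAMES * PATCHES_PER_FRAME  # 12

def hamming_distance(p, q):
    """Number of positions at which two patch arrangements disagree."""
    return int(np.sum(np.asarray(p) != np.asarray(q)))

identity = np.arange(NUM_POSITIONS)
shuffled = np.random.permutation(NUM_POSITIONS)
print(hamming_distance(identity, shuffled))

# The unconstrained space of arrangements is 12! = 479,001,600.
print(math.factorial(NUM_POSITIONS))
```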

3.1. Training Video Jigsaw Network

Our training strategy follows a line of recent works on self-supervised learning on large scale image and video datasets [34, 30].


Figure 2: Our full video jigsaw network training pipeline.

Typically, the self-supervised task is constructed by defining pseudo labels, in our case the permuted order of patches. Then each patch, after undergoing preprocessing, is input to a multi-stream Siamese-like network. Each stream, up to the first fully connected layer, shares parameters and operates independently on the frame patches. After the first fully connected layer (fc6), the feature representations are concatenated and input to another fully connected layer (fc7). The final fully connected layer transforms the features to an N-dimensional output, where N is the number of permutations. A softmax over this output returns the most likely permutation the frame patches were sampled from. Our detailed training network is shown in Figure 2.
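As a rough illustration of this multi-stream design, here is a minimal PyTorch-style sketch. The small convolutional backbone and the hidden dimensions are placeholders of our own (the paper uses a CaffeNet backbone); only the 12-patch, 64 × 64 input and the N-way permutation output follow the text.

```python
import torch
import torch.nn as nn

class VideoJigsawNet(nn.Module):
    """Sketch of the multi-stream (Siamese-like) jigsaw network.

    Each of the 12 patch streams shares the same backbone and fc6 weights;
    their fc6 features are concatenated and mapped to N permutation classes.
    The backbone below is a small stand-in, not the CaffeNet of the paper.
    """
    def __init__(self, num_permutations, num_patches=12, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(            # shared across all streams
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc6 = nn.Linear(128 * 4 * 4, feat_dim)          # per-patch fc6
        self.fc7 = nn.Linear(num_patches * feat_dim, 2048)   # after concat
        self.classifier = nn.Linear(2048, num_permutations)  # N-way output

    def forward(self, patches):
        # patches: (batch, 12, 3, 64, 64), already shuffled by a permutation
        feats = []
        for i in range(patches.size(1)):
            x = self.backbone(patches[:, i])
            feats.append(torch.relu(self.fc6(x.flatten(1))))
        x = torch.relu(self.fc7(torch.cat(feats, dim=1)))
        return self.classifier(x)  # logits over the N permutations

net = VideoJigsawNet(num_permutations=1000)
logits = net(torch.randn(2, 12, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 1000])
```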

3.2. Generating Video Jigsaw Puzzles

We describe here the strategy to generate puzzles from the video frame patches. Noroozi and Favaro [34] proposed to generate permutations of 9 image patches by maximizing the Hamming distance between each sampled permutation and the subsequently sampled permutations. They iterate over all possible permutations of 9 patches until they end up with N permutations; in their case, N = 100. In our case, since each video frame is divided into 4 patches and there are 3 frames in a tuple, it is not possible to sample permutations from all possible permutations (which is 12!) due to memory constraints. To reimplement the approach of [34], we devise a computationally heavy but memory-efficient means to generate 100 permutations from the 12! possibilities. More details on how we generate these permutations are described in the supplementary material. This way, we generate the Hamming-distance based permutations as suggested by [34].
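The exact enumeration scheme is left to the supplementary material; the sketch below only illustrates the general greedy maximal-Hamming-distance selection idea, drawing candidates from a random pool instead of iterating over all 12! permutations. The pool size, seed and function names are our own assumptions.

```python
import numpy as np

def sample_permutations(num_perms, length=12, pool_size=20000, seed=0):
    """Greedy max-Hamming selection, sketched with a random candidate pool.

    This is a simplification of the exhaustive scheme described above:
    each new permutation is the candidate farthest (on average, in Hamming
    distance) from the ones already selected.
    """
    rng = np.random.default_rng(seed)
    pool = np.array([rng.permutation(length) for _ in range(pool_size)])
    selected = [pool[0]]
    for _ in range(num_perms - 1):
        sel = np.stack(selected)                              # (k, length)
        # mean Hamming distance of every candidate to the selected set
        dists = (pool[:, None, :] != sel[None, :, :]).sum(-1).mean(1)
        selected.append(pool[int(np.argmax(dists))])
    return np.stack(selected)

perms = sample_permutations(100)
print(perms.shape)  # (100, 12)
```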

The permutation sampling approach described above treats all video frame patches as one giant image; thus, a patch belonging to the first frame may get shuffled to the last frame's position (to maximize the Hamming distance between permutations). We treat this permutation sampling approach as an (expensive) baseline but propose another sampling strategy to minimize compute and memory constraints. Our proposed approach can scale to any number of permutations. We generate permutations with a 2 × 2 grid per frame. Our proposed approach forces the sampled permutations not only to obey the Hamming distance criterion but also to respect spatial coherence in video frames. This scales down computational and memory requirements dramatically while giving similar or better performance on transfer learning tasks. Our proposed permutation sampling approach is given in Algorithm 1 and visually presented in Figure 3.

Figure 3: Our proposed permutation sampling strategy. We randomly permute the patches within each frame in a tuple, then we permute the frames. Since the number of patches per frame is 4, there are 4! = 24 unique ways to shuffle these patches within a frame. We repeat this for all frames in the tuple and finally select the top N permutations based on Hamming distance. This strategy preserves spatial coherence, preserves diversity between permutations, takes a fraction of the time and memory compared to the algorithm of [34], and results in comparable or better performance on the transfer learning tasks.

Explanation of Algorithm 1: With the constraint of spatial coherence, i.e. patches within a frame constrained to stay together, the full space of hashes consists of (np!)^nf × nf! possibilities. After generating the first hash randomly (lines 2, 3 and 5), each next hash Λ_h, h ∈ {2, ..., N}, is picked by maximizing over the full space the average Hamming distance from the previously generated hashes. We divide the full space into subsets of np! hashes. Iterating through each subset Λ′′ (lines 10-11), we store the best hash from the subset into Λ′, along with its distance metric in Dmax (lines 16-20). When the full space has been traversed, the best among the good ones (Λ′) is chosen as the new hash (lines 21-22). Lines 4, 6 and 10-15 describe how each subset Λ′′ is constructed. Λ′′ contains all np! permutations of patches within the first frame but only one particular permutation of patches from each of the other frames. For memory efficiency, it is sufficient to create only one matrix λ̄_1 that holds all patch permutations within the first frame, i.e. it is not necessary to create λ̄_i, i ∈ {2, ..., nf}, as done in line 4. This is because the former is reused in every iteration, while only one row from the latter is used to create λ̄_i, the matrix of repeated rows (line 14), which can be obtained by picking the corresponding row from λ̄_1 and adding the offset np(i − 1) to each element of the row.
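To make the constrained search space concrete, the sketch below enumerates every spatially coherent arrangement (within-frame shuffles followed by a frame reordering) for the 3-frame, 2 × 2 case. It is a simplified illustration of the space Algorithm 1 traverses, not the algorithm's book-keeping itself, and the helper name is ours; the resulting pool can be fed to the same greedy maximal-Hamming selection sketched earlier.

```python
import itertools
import numpy as np

def spatially_coherent_space(nf=3, npatch=4):
    """Enumerate every arrangement that keeps each frame's patches together:
    the patches within each frame are permuted, then whole frames are
    reordered. For nf=3, npatch=4 this is (4!)^3 * 3! = 82,944 candidates,
    small enough to hold in memory (unlike the unconstrained 12!)."""
    within = list(itertools.permutations(range(npatch)))    # 24 per frame
    frame_orders = list(itertools.permutations(range(nf)))  # 6 frame orders
    space = []
    for per_frame in itertools.product(within, repeat=nf):
        for order in frame_orders:
            arrangement = []
            for f in order:
                # keep the original-frame labels, offset by the frame index
                arrangement.extend(p + f * npatch for p in per_frame[f])
            space.append(arrangement)
    return np.array(space)

space = spatially_coherent_space()
print(space.shape)  # (82944, 12)
# Feed this constrained space to the greedy max-Hamming selection sketched
# earlier to pick the final N permutations.
```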

4. Experiments

In this section we describe in detail our experiments on video action recognition using the video jigsaw network and a comprehensive ablation study, justifying our design choices and conclusions. The datasets we use for training the video jigsaw network are UCF101 [50] and Kinetics [24]. The datasets we evaluate on are UCF101 [50] and HMDB51 [27] for video action recognition.

4.1. Datasets

UCF101 [50] is a benchmark video action recognition dataset consisting of 101 action categories and 13,320 videos; around 9.5k videos are used for training and 3.5k videos for testing. HMDB51 [27] consists of around 7,000 videos of 51 action categories, of which 70% belong to the training set and 30% are in the test set. The Kinetics dataset [24] is a large scale human action video dataset consisting of 400 action categories and more than 400 videos per action category.

4.2. Video Jigsaw Network Training

Tuple Sampling Strategy For our unsupervised pretraining step on UCF101, we use the frame tuples (4 frames per tuple) provided by the authors of [30]. They extracted optical flow based regions from these frame tuples and used them in the temporal sequence sorting task [30]. We do not use the optical flow based regions from the frames but only use the tuples as a whole. For a given frame tuple (f1, f2, f3, f4), we further sample three frames in the following way: [(f1, f2, f3), (f2, f3, f4), (f1, f3, f4), (f1, f2, f4)]. Hence, we end up with around 900,000 frame tuples from the UCF101 dataset to train our video jigsaw network on. In the Kinetics dataset, each video is 10 seconds long. We create our tuples by sampling the 1st, 5th and 10th frames from each video. The reason we do not sample further (as we did in the case of UCF101) is simply that Kinetics is very large and diverse, with more than 400 videos per class; this is not true for UCF101. Note that we do not use any further preprocessing to generate the frame tuples for our video jigsaw network. Previous approaches have used expensive detection and tracking methods [58] or optical flow computation to sample high motion patches [30].
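As a small illustration, the two sampling rules above can be written directly as follows; the function names are ours and the frames are only placeholders.

```python
def ucf101_triples(four_frame_tuple):
    """Expand one 4-frame tuple (f1, f2, f3, f4) into the four 3-frame tuples
    listed above."""
    f1, f2, f3, f4 = four_frame_tuple
    return [(f1, f2, f3), (f2, f3, f4), (f1, f3, f4), (f1, f2, f4)]

def kinetics_triple(video_frames):
    """Take the 1st, 5th and 10th frame of a (roughly 10 s) Kinetics clip."""
    return (video_frames[0], video_frames[4], video_frames[9])

print(ucf101_triples(("f1", "f2", "f3", "f4")))
```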

Implementation Details We use the Caffe [23] deep learning framework for all our experiments and CaffeNet [26] as our base network, only with 12 streams for the 12 patches per tuple. Our video jigsaw puzzles are generated on the fly according to the permutation matrix Λ generated before training begins. Each row of Λ corresponds to a unique permutation of 12 patches. The video frame patches are shuffled according to the sampled permutation from Λ and input to the network. The network is trained to predict the index in Λ from which the permutation was sampled.


Algorithm 1: Sampling Permutations with Spatial Coherence

Input: number of permutations N, patches per frame np, number of frames nf
Output: permutation matrix Λ

 1: function generatePerm(N, np, nf)
 2:   for i = 1 : nf do
 3:     λ_i ← a random permutation of {np(i−1)+1, ..., np·i}
 4:     λ̄_i ← all permutations of {np(i−1)+1, ..., np·i}, stacked as [λ̄_i1, ..., λ̄_i,np!]ᵀ
 5:   Λ ← [λ_1ᵀ ... λ_nfᵀ]ᵀ with the sub-vectors λ_iᵀ rearranged in a random order
 6:   F ← all permutations of {1, ..., nf}, stacked as [F_1, ..., F_nf!]ᵀ
 7:   for h = 2 : N do
 8:     D_max ← ∅
 9:     Λ′ ← ∅
10:     for f = 1 : nf! do
11:       for i = 1 : (np!)^(nf−1) do
12:         for j = 2 : nf do
13:           k ← ⌈(((i−1) mod (np!)^(j−1)) + 1) / (np!)^(j−2)⌉
14:           λ̄_j ← [λ̄_jk ... λ̄_jk]ᵀ ∈ R^(np! × np)
15:         Λ′′ ← arrange [λ̄_1 λ̄_2 ... λ̄_nf]ᵀ in the order F_f
16:         D ← Hamming(Λ, Λ′′)
17:         D̄ ← (1/(h−1)) · 1ᵀ D
18:         D_max ← [D_max  max_k D̄_k]
19:         j ← argmax_k D̄_k
20:         Λ′ ← [Λ′  Λ′′_j]
21:     j ← argmax_k D_max(k)
22:     Λ ← [Λ  Λ′_j]
23:   return Λ

Each video frame is cropped to 224 × 224, then divided into a 2 × 2 grid. Each grid cell is 112 × 112 pixels and we randomly sample a 64 × 64 patch from it. This strategy ensures that the network cannot learn the location of the patches from low level appearance and texture details. We normalize each patch independently of the others, to have zero mean and unit standard deviation. This is also done to prevent the network from learning low-level details (also called 'network shortcuts' in the self-supervision literature). Each patch is input to the multi-stream video jigsaw network as depicted in Figure 2. We use a batch size of 150 and train the network with Stochastic Gradient Descent (SGD) using an initial learning rate of 0.000128, which is decreased by a factor of 10 every 128,000 iterations. Each layer in our network is initialized with Xavier initialization [16]. We train the network for 500,000 iterations (approximately 80 epochs) using a Titan X GPU. Our training converges in around 62 hours.
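A minimal NumPy sketch of the frame-to-patch pipeline just described (crop to 224 × 224, 2 × 2 grid of 112 × 112 cells, random 64 × 64 patch per cell, per-patch normalization); the center-crop choice and the function name are our own assumptions.

```python
import numpy as np

def extract_jigsaw_patches(frame, rng, crop=224, cell=112, patch=64):
    """Crop the frame, split it into a 2x2 grid, and draw a random 64x64
    patch from each cell so low-level cues at cell borders cannot give the
    position away. `frame` is an HxWx3 array."""
    h, w = frame.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    cropped = frame[top:top + crop, left:left + crop]
    patches = []
    for gy in range(2):
        for gx in range(2):
            cell_img = cropped[gy * cell:(gy + 1) * cell,
                               gx * cell:(gx + 1) * cell]
            y = rng.integers(0, cell - patch + 1)
            x = rng.integers(0, cell - patch + 1)
            p = cell_img[y:y + patch, x:x + patch].astype(np.float32)
            # normalize each patch independently (zero mean, unit std)
            p = (p - p.mean()) / (p.std() + 1e-6)
            patches.append(p)
    return np.stack(patches)  # (4, 64, 64, 3)

rng = np.random.default_rng(0)
print(extract_jigsaw_patches(np.random.rand(240, 320, 3), rng).shape)
```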

Progressive Training Approach We borrow principles from curriculum learning [6] to train our video jigsaw network on an easy jigsaw puzzle task first and then train it on a harder task. We define an easy jigsaw puzzle task as one with a lower N compared to a harder task, since the network has to learn fewer configurations of the patches in the video frames. So instead of starting from scratch for, say, N = 500, we initialize the network's weights with the weights of the network trained with N = 250.

Avoiding Network Shortcuts As mentioned in recent self-supervised approaches [9, 34, 35], it is imperative to deal with the self-supervised network's tendency to learn patch locations via low level details, for example due to chromatic aberration. Typical solutions to this problem are channel swapping [30], color normalization [34], leaving a gap between sampled patches, and training with a percentage of images in grayscale rather than color [35]. All these approaches aim to make the patch location learning task harder for the network. Our video jigsaw network incorporates these techniques to avoid network shortcuts. Our patch size is kept at 64 × 64, sampled from within a 112 × 112 window. Around half of the video frames are randomly projected to grayscale and we normalize each sampled patch independently. Using these techniques results in a drop in video jigsaw puzzle solving accuracy, but the transfer learning accuracy increases.
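For example, the grayscale projection step could look like the following sketch; the 0.5 probability follows the text ("around half of the video frames"), while the luma weights and function name are assumptions.

```python
import numpy as np

def maybe_grayscale(frame, rng, p=0.5):
    """With probability ~0.5, replace the RGB frame by its grayscale version
    replicated over the three channels, so chromatic aberration cannot
    reveal a patch's location."""
    if rng.random() < p:
        gray = frame @ np.array([0.299, 0.587, 0.114])  # luma conversion
        frame = np.repeat(gray[..., None], 3, axis=-1)
    return frame

rng = np.random.default_rng(0)
print(maybe_grayscale(np.random.rand(224, 224, 3), rng).shape)  # (224, 224, 3)
```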

Choice of Video Jigsaw Training Dataset As mentioned, we train video jigsaw networks using the UCF101 and Kinetics datasets. Our results using the two datasets are shown in Table 1. We show the video jigsaw task accuracy (VJ Acc) and the finetuning accuracy on UCF101 (Finetune Acc) for pretraining with both datasets. N is the number of permutations. We can note two things from the table. Using Kinetics results in worse video jigsaw solving performance, but in better generalization and transfer learning. Our finetuning results are consistently better with Kinetics pretraining than with training on UCF101. This shows that a large-scale diverse dataset like Kinetics is able to generalize to a completely different dataset (UCF101). One possible reason behind the reduced performance of the UCF101 dataset is the fact that we oversample from it. This makes it easy for the video jigsaw network to learn the low-level details of the video frame appearances and rapidly decrease the training loss; however, this does not result in good transfer learning performance.


Table 1: Comparison between UCF101 and Kinetics datasets for video jigsaw training.

Pretraining Dataset | VJ Acc (%), N=100 | Finetune Acc (%), N=100 | VJ Acc (%), N=250 | Finetune Acc (%), N=250
UCF101              | 97.6              | 44.0                    | 84.6              | 42.6
Kinetics            | 61.6              | 44.6                    | 44.0              | 49.0

To test this hypothesis, we use a reduced version of the UCF101 dataset (without any oversampling), comprising just 200,000 frame tuples, and train video jigsaw networks for N = 500 and N = 1000. The results are shown in Table 2. Even without oversampling, UCF101-based pretraining does not perform as well as the Kinetics dataset.

Table 2: Comparison between Kinetics and the original UCF101 frame tuples as pretraining dataset for the video jigsaw network.

Pretrained On            | VJ Acc (%), N=500 | Finetune Acc (%), N=500 | VJ Acc (%), N=1000 | Finetune Acc (%), N=1000
Kinetics                 | 40.3              | 49.2                    | 29.4               | 54.7
UCF101 (no oversampling) | 63.3              | 46.5                    | 58.0               | 46.4

Choice of Number of Permutations We vary the number of permutations N a video jigsaw network has to learn, starting with N = 100 and going up to N = 1000. As we increase the number of permutations (see Table 3), the network finds it harder to learn the configuration of the patches, but the generalization improves. This experiment is run with the video jigsaw network trained on the Kinetics dataset.

Permutation Generation Strategy We compare the performance of our proposed permutation strategy, which enforces spatial coherence between permuted patches (referred to as Psp), with the approach proposed by [34] (referred to as Porig). We show results for this comparison in Figure 4. As the bar chart shows, for various numbers of permutations, our spatial coherency preserving method either outperforms the original random permutation generation strategy or is comparable to it, while being many times faster to generate.

Patch Size We also compare the performance of our video jigsaw method trained on different frame patch sizes. Table 4 shows that the finetuning accuracy increases with patch size but does not improve much beyond a patch size of 80 × 80.

4.3. Finetuning for Action Recognition

Once the video jigsaw network is trained, we use the convolutional layers' weights to initialize a standard CaffeNet [26] architecture and finetune it on the UCF101 and HMDB51 datasets. For UCF101, we sample 25 equidistant frames per video and compute frame-based accuracy as our finetuning evaluation measure. For HMDB51, we sample 1 frame per second from each video and use these frames for the finetuning experiment.
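The UCF101 evaluation protocol can be sketched as follows; the rounding scheme and function names are our own.

```python
import numpy as np

def equidistant_indices(num_frames, k=25):
    """Pick k roughly equidistant frame indices from a video, as in the
    UCF101 evaluation described above."""
    return np.linspace(0, num_frames - 1, k).round().astype(int)

def frame_level_accuracy(frame_logits, labels):
    """Frame-based accuracy: every sampled frame is classified on its own
    and counted against its video's action label."""
    preds = frame_logits.argmax(axis=1)
    return float((preds == labels).mean())

print(equidistant_indices(300))  # 25 indices spread over a 300-frame video
print(frame_level_accuracy(np.random.rand(8, 101),
                           np.random.randint(0, 101, 8)))
```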

Table 3: As we increase N, the video jigsaw performance decreases but the finetuning accuracy increases.

No. of permutations | VJ Acc (%) | Finetuning Acc (%)
100                 | 61.6       | 44.6
250                 | 44.0       | 49.0
500                 | 47.6       | 48.1
1000                | 29.4       | 54.7

Table 4: As we increase the patch size, the video jigsaw finetuning accuracy on the UCF101 dataset increases.

Patch size | Finetuning Acc (%)
64         | 54.7
80         | 55.4
100        | 54.1

With our best model and parameters (pretrained on the Kinetics dataset), results are given in Table 5 for test split 1 of both the UCF101 and HMDB51 datasets.

Table 5 shows our video jigsaw pretraining approach outperforming recent unsupervised pretraining approaches when finetuning on the HMDB51 dataset. On the UCF101 dataset, our finetuning accuracy is comparable to the state of the art. The method of Fernando et al. uses a different input from ours (stacks of frame differences), whereas we use RGB frames to form the jigsaw puzzles. All other approaches operate on RGB video frames or frame patches, hence we can fairly compare with them. The methods of Lee et al. [30] and Misra et al. [32] are pretrained on the UCF101 dataset, whereas our best network is trained on the Kinetics dataset.

Figure 4: Comparison between the permutation strategy proposed by [34] (Porig) and our proposed sampling approach (Psp) on the video jigsaw task (indicated by VJ Acc) and the finetuning task on UCF101 (indicated by FN Acc) for various numbers of permutations N. Our approach consistently performs better than or comparably to the approach of [34] while saving memory and computational costs. Figure is best viewed in color.


Table 5: Finetuning results on UCF101 and HMDB51 of our proposed video jigsaw network (pretrained on the Kinetics dataset with N = 1000 permutations) compared to state-of-the-art approaches. Note that all these results are computed using the CaffeNet architecture. Our method gives superior or comparable performance to the state-of-the-art unsupervised learning + finetuning approaches that use RGB frames for training.

Pretraining                 | UCF101 Acc (%) | HMDB51 Acc (%)
random                      | 40.0           | 16.3
ImageNet (with labels)      | 67.7           | 28.0
Fernando et al. [14]        | 60.3           | 32.5
Hadsell et al. [19]         | 45.7           | 16.3
Mobahi et al. [33]          | 45.4           | 15.9
Wang and Gupta [58]         | 40.7           | 15.6
Misra et al. [32]           | 50.9           | 19.8
Lee et al. [30]             | 56.3           | 22.1
Vondrick et al. [57]        | 52.1           | -
Video Jigsaw Network (ours) | 55.4           | 27.0

This again shows the domain transfer capability of a large-scale dataset like Kinetics, compared to UCF101. Our method achieves this without doing any expensive tracking [58] or optical flow based patch or frame mining as in [32, 30]. This means that our approach requires a large-scale, diverse, unlabeled video dataset to work. We used 3 frames per video from the Kinetics dataset, hence we were only using about 400,000 tuples for our video jigsaw training. We believe that using a larger dataset would lead to better performance, given that our approach is close to the state of the art. Another point to note is that methods which perform well on UCF101, such as Lee et al. [30] and Misra et al. [32], do not perform that well on HMDB51, whereas our method generalizes well, given that it is pretrained on a completely different dataset.

Table 6: PASCAL VOC 2007 classification results compared with other methods. Other results are taken from [35] and [30].

Method                      | Supervision           | Classification
ImageNet                    | 1000 class labels     | 78.2%
Random [39]                 | none                  | 53.3%
Doersch et al. [9]          | ImageNet context      | 55.3%
Jigsaw Puzzle [34]          | ImageNet context      | 67.6%
Counting [35]               | ImageNet context      | 67.7%
Wang and Gupta [58]         | 100k videos, VOC2012  | 62.8%
Agrawal et al. [1]          | egomotion (KITTI, SF) | 54.2%
Misra et al. [32]           | UCF101 videos         | 54.3%
Lee et al. [30]             | UCF101 videos         | 63.8%
Pathak et al. [38]          | MS COCO + segments    | 61.0%
Video Jigsaw Network (ours) | Kinetics videos       | 63.6%

Figure 5: Visualization of the first 40 learned conv1 filters of our best performing video jigsaw model.

4.4. Results on PASCAL VOC 2007 Dataset

The PASCAL VOC 2007 dataset consists of 20 object classes with 5,011 images in the train set and 4,952 images in the test set. Multiple objects can be present in a single image, and the classification task is to detect whether an object is present in a given image or not. We evaluate our video jigsaw network on this dataset by initializing a CaffeNet with our video jigsaw network's trained convolutional layers' weights. The fully connected layers' weights are randomly sampled from a Gaussian distribution with zero mean and 0.001 standard deviation. Our finetuning scheme follows the one suggested by [25]. Our classification results on the PASCAL VOC 2007 test set are shown in Table 6.
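This transfer step could be sketched as below for any CaffeNet-like torch module; the "conv"-substring key matching and the function name are our own assumptions, not the authors' Caffe procedure.

```python
import torch.nn as nn

def init_for_voc(finetune_net, pretrained_state, fc_std=0.001):
    """Copy the pretrained convolutional weights and re-draw all fully
    connected weights from a zero-mean Gaussian with std 0.001."""
    # reuse only the convolutional layers from the jigsaw network
    conv_weights = {k: v for k, v in pretrained_state.items() if "conv" in k}
    finetune_net.load_state_dict(conv_weights, strict=False)
    # re-initialize every fully connected layer from N(0, 0.001)
    for module in finetune_net.modules():
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=fc_std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
    return finetune_net

# usage sketch: init_for_voc(caffenet_like_model, jigsaw_model.state_dict())
```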

Our trained network generalizes well not only across datasets but also across tasks. Our video jigsaw network is trained on Kinetics videos and not on object-centric images, yet it performs competitively against the state-of-the-art image-based self-supervised approaches and outperforms most of the video-based self-supervised methods.

4.5. Visualization Experiments

We show the first 40 conv1 filter weights of our best video jigsaw model in Figure 5; they show oriented edges learned by our model. Note that training this model does not use activity labels. We also perform a qualitative retrieval experiment on the video jigsaw model finetuned on the PASCAL VOC dataset. Results are shown in Figure 6. The retrieved images returned by the model match the query image, which qualitatively shows that our model trained on unlabeled videos is able to identify objects in still images.

5. Conclusion

We propose a self-supervised learning task where spatial and temporal contexts are exploited jointly. Our framework does not depend on heavy preprocessing steps such as object tracking or optical flow based patch mining. We demonstrate via extensive experimental evaluation that our approach performs competitively on video activity recognition, outperforming the state of the art in self-supervised video action recognition on the HMDB51 dataset.


Figure 6: Retrieval experiment on the PASCAL VOC dataset using our model.

We also propose a permutation generation strategy which respects spatial coherency and demonstrate that, even for shuffling 12 patches, diverse permutations can be generated extremely efficiently via our proposed approach.

References

[1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 37-45. IEEE, 2015.
[2] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074-5082, 2016.
[3] M. A. Bautista, A. Sanakoyeu, and B. Ommer. Deep unsupervised similarity learning using partially ordered sets. In Proceedings of IEEE Computer Vision and Pattern Recognition, 2017.
[4] M. A. Bautista, A. Sanakoyeu, E. Tikhoncheva, and B. Ommer. CliqueCNN: Deep unsupervised exemplar learning. In Advances in Neural Information Processing Systems, pages 3846-3854, 2016.
[5] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pages 153-160, 2007.
[6] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41-48. ACM, 2009.
[7] B. Brattoli, U. Buchler, A.-S. Wahl, M. E. Schwab, and B. Ommer. LSTM self-supervision for detailed behavior analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6466-6475, 2017.
[8] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discovery as discriminative mode seeking. In Advances in Neural Information Processing Systems, pages 494-502, 2013.
[9] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422-1430, 2015.
[10] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2051-2060, 2017.
[11] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734-1747, 2016.
[12] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pages 766-774, 2014.
[13] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98-136, 2015.
[14] B. Fernando, H. Bilen, E. Gavves, and S. Gould. Self-supervised video representation learning with odd-one-out networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5729-5738. IEEE, 2017.
[15] P. Foldiak. Learning invariance from transformation sequences. Neural Computation, 3(2):194-200, 1991.
[16] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[17] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
[18] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE International Conference on Computer Vision, pages 4086-4093, 2015.
[19] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1735-1742. IEEE, 2006.
[20] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[21] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1(282-317):2, 1986.
[22] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pages 1413-1421, 2015.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[24] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[25] P. Krahenbuhl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[27] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2556-2563. IEEE, 2011.
[28] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision, pages 577-593. Springer, 2016.
[29] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801-808, 2007.
[30] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang. Unsupervised representation learning by sorting sequences. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 667-676. IEEE, 2017.
[31] T. Milbich, M. Bautista, E. Sutter, and B. Ommer. Unsupervised video understanding by reconciliation of posture similarities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4394-4404, 2017.
[32] I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527-544. Springer, 2016.
[33] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 737-744. ACM, 2009.
[34] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69-84. Springer, 2016.
[35] M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5898-5906, 2017.
[36] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2405-2413, 2016.
[37] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, pages 801-816. Springer, 2016.
[38] D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan. Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2701-2710, 2017.
[39] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536-2544, 2016.
[40] L. Pinto, J. Davidson, and A. Gupta. Supervision via competition: Robot adversaries for learning tasks. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 1601-1608. IEEE, 2017.
[41] L. Pinto, D. Gandhi, Y. Han, Y.-L. Park, and A. Gupta. The curious robot: Learning visual representations via physical interactions. In European Conference on Computer Vision, pages 3-18. Springer, 2016.
[42] L. Pinto and A. Gupta. Learning to push by grasping: Using multiple tasks for effective learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2161-2168. IEEE, 2017.
[43] Z. Ren and Y. J. Lee. Cross-domain self-supervised multi-task feature learning using synthetic imagery. arXiv preprint arXiv:1711.09082, 2017.
[44] B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1605-1614. IEEE, 2006.
[45] R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 693-700, 2010.
[46] G. A. Sigurdsson, O. Russakovsky, and A. Gupta. What actions are needed for understanding human actions in videos? arXiv preprint arXiv:1708.02696, 2017.
[47] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In Computer Vision - ECCV 2012, pages 73-86. Springer, 2012.
[48] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 370-377. IEEE, 2005.
[49] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado University at Boulder, Department of Computer Science, 1986.
[50] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[51] E. S. Spelke. Principles of object perception. Cognitive Science, 14(1):29-56, 1990.
[52] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843-852, 2015.
[53] O. Sumer, T. Dencker, and B. Ommer. Self-supervised learning of pose embeddings from spatiotemporal relations in videos. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4308-4317. IEEE, 2017.
[54] J. Sun and J. Ponce. Learning discriminative part detectors for image classification and cosegmentation. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 3400-3407. IEEE, 2013.
[55] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In European Conference on Computer Vision, pages 140-153. Springer, 2010.

[56] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103. ACM, 2008.
[57] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613-621, 2016.
[58] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. arXiv preprint arXiv:1505.00687, 2015.
[59] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. arXiv preprint arXiv:1708.02901, 2017.
[60] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715-770, 2002.
[61] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649-666. Springer, 2016.
[62] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1058-1067, 2017.
[63] Y. Zhou and T. L. Berg. Temporal perception and prediction in ego-centric video. In Proceedings of the IEEE International Conference on Computer Vision, pages 4498-4506, 2015.