
Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation

Min-Hung Chen1∗   Baopu Li2   Yingze Bao2   Ghassan AlRegib1   Zsolt Kira1
1Georgia Institute of Technology   2Baidu USA

Abstract

Despite the recent progress of fully-supervised action segmentation techniques, the performance is still not fully satisfactory. One main challenge is the problem of spatio-temporal variations (e.g. different people may perform the same activity in various ways). Therefore, we exploit unlabeled videos to address this problem by reformulating the action segmentation task as a cross-domain problem with domain discrepancy caused by spatio-temporal variations. To reduce the discrepancy, we propose Self-Supervised Temporal Domain Adaptation (SSTDA), which contains two self-supervised auxiliary tasks (binary and sequential domain prediction) to jointly align cross-domain feature spaces embedded with local and global temporal dynamics, achieving better performance than other Domain Adaptation (DA) approaches. On three challenging benchmark datasets (GTEA, 50Salads, and Breakfast), SSTDA outperforms the current state-of-the-art method by large margins (e.g. for the F1@25 score, from 59.6% to 69.1% on Breakfast, from 73.4% to 81.5% on 50Salads, and from 83.6% to 89.1% on GTEA), and requires only 65% of the labeled training data for comparable performance, demonstrating the usefulness of adapting to unlabeled target videos across variations. The source code is available at https://github.com/cmhungsteve/SSTDA.

1. Introduction

The goal of action segmentation is to simultaneously segment videos by time and predict an action class for each segment, leading to various applications (e.g. human activity analyses). While action classification has shown great progress given the recent success of deep neural networks [41, 29, 28], temporally locating and recognizing action segments in long videos is still challenging. One main challenge is the problem of spatio-temporal variations of human actions across videos [17]. For example, different people may make tea in different personalized styles even if the given recipe is the same. The intra-class variations

∗Work done during an internship at Baidu USA

Figure 1: An overview of the proposed Self-Supervised Temporal Domain Adaptation (SSTDA) for action segmentation. “Source” refers to the data with labels, and “Target” refers to the data without access to labels. SSTDA effectively adapts the source model trained with standard fully-supervised learning to a target domain by diminishing the discrepancy of embedded feature spaces between the two domains caused by spatio-temporal variations. SSTDA only requires unlabeled videos from both domains under the standard transductive setting, which eliminates the need for additional labels to obtain the final target model.

cause degraded performance when a model trained on one group of people is directly deployed on a different group.

Despite significant progress made by recent methodsbased on temporal convolution with fully-supervised learn-ing [21, 7, 24, 9], the performance is still not fully satisfac-tory (e.g. the best accuracy on the Breakfast dataset is stilllower than 70%). One method to improve the performanceis to exploit knowledge from larger-scale labeled data [2].However, manually annotating precise frame-by-frame ac-tions is time-consuming and challenging. Another way is todesign more complicated architectures but with higher costsof model complexity. Thus, we aim to address the spatio-temporal variation problem with unlabeled data, which arecomparatively easy to obtain. To achieve this goal, wepropose to diminish the distributional discrepancy causedby spatio-temporal variations by exploiting auxiliary unla-beled videos with the same types of human activities per-formed by different people. More specifically, to extend theframework of the main video task for exploiting auxiliary


data [48, 20], we reformulate our main task as an unsupervised domain adaptation (DA) problem with the transductive setting [32, 6], which aims to reduce the discrepancy between the source and target domains without access to the target labels.

Recently, adversarial-based DA approaches [11, 12, 40, 47] have shown progress in reducing the discrepancy for images using a domain discriminator equipped with adversarial training. However, videos also suffer from domain discrepancy along the temporal direction [4], so using image-based domain discriminators is not sufficient for action segmentation. Therefore, we propose Self-Supervised Temporal Domain Adaptation (SSTDA), containing two self-supervised auxiliary tasks: 1) binary domain prediction, which predicts a single domain for each frame-level feature, and 2) sequential domain prediction, which predicts the permutation of domains for an untrimmed video. Through adversarial training with both auxiliary tasks, SSTDA can jointly align cross-domain feature spaces that embed local and global temporal dynamics, addressing the spatio-temporal variation problem for action segmentation, as shown in Figure 1. To support our claims, we compare our method with other popular DA approaches and show better performance, demonstrating the effectiveness of SSTDA for aligning temporal dynamics. Finally, we evaluate our approach on three datasets with high spatio-temporal variations: GTEA [10], 50Salads [37], and the Breakfast dataset [18]. By exploiting unlabeled target videos with SSTDA, our approach outperforms the current state-of-the-art methods by large margins and achieves comparable performance using only 65% of the labeled training data.

In summary, our contributions are three-fold:

1. Self-Supervised Sequential Domain Prediction: We propose a novel self-supervised auxiliary task, which predicts the permutation of domains for long videos, to facilitate video domain adaptation. To the best of our knowledge, this is the first self-supervised method designed for cross-domain action segmentation.

2. Self-Supervised Temporal Domain Adaptation (SSTDA): By integrating two self-supervised auxiliary tasks, binary and sequential domain prediction, our proposed SSTDA can jointly align local and global embedded feature spaces across domains, outperforming other DA methods.

3. Action Segmentation with SSTDA: By integrating SSTDA into action segmentation, our approach outperforms the current state-of-the-art approach by large margins, and achieves comparable performance using only 65% of the labeled training data. Moreover, different design choices are analyzed to identify the key contributions of each component.

2. Related Works

Action Segmentation methods proposed recently are built upon temporal convolution networks (TCN) [21, 7, 24, 9] because of their ability to capture long-range dependencies across frames and their faster training compared to RNN-based methods. With a multi-stage pipeline, MS-TCN [9] performs hierarchical temporal convolutions to effectively extract temporal features and achieves state-of-the-art performance for action segmentation. In this work, we utilize MS-TCN as the baseline model and integrate the proposed self-supervised modules to further boost the performance without extra labeled data.

Domain Adaptation (DA) has become popular recently, especially with the integration of deep learning. With the two-branch (source and target) framework used by most DA works, finding a common feature space between the source and target domains is the ultimate goal, and the key is to design the domain loss to achieve this goal [6].

Discrepancy-based DA [25, 26, 27] is one of the major classes of methods, where the main goal is to reduce the distribution distance between the two domains. Adversarial-based DA [11, 12] is also popular, with concepts similar to GANs [13], using domain discriminators. With carefully designed adversarial objectives, the domain discriminator and the feature extractor are optimized through min-max training. Some works further improve the performance by assigning pseudo-labels to target data [34, 44]. Furthermore, Ensemble-based DA [36, 22] incorporates multiple target branches to build an ensemble model. Recently, Attention-based DA [42, 19] assigns attention weights to different regions of images for more effective DA.

Unlike images, video-based DA is still under-explored. Most works concentrate on small-scale video DA datasets [39, 46, 15]. Recently, two larger-scale cross-domain video classification datasets, along with a state-of-the-art approach, were proposed [3, 4]. Moreover, some authors also proposed novel frameworks to utilize auxiliary data for other video tasks, including object detection [20] and action localization [48]. These works differ from ours by either targeting different video tasks [20, 3, 4] or requiring access to the labels of the auxiliary data [48].

Self-Supervised Learning has become popular in recent years for images and videos given its ability to learn informative feature representations without human supervision. The key is to design an auxiliary task (or pretext task) that is related to the main task and whose labels can be self-annotated. Most of the recent works for videos design auxiliary tasks based on the spatio-temporal order of videos [23, 43, 16, 1, 45]. Different from these works, our proposed auxiliary task predicts the temporal permutation of cross-domain videos, aiming to address the problem of spatio-temporal variations for action segmentation.


Figure 2: Illustration of the baseline model and its integration with our proposed SSTDA. The frame-level features f are obtained by applying the temporal convolution network G_f to the inputs, and are converted to the corresponding predictions ŷ using a fully-connected layer G_y to calculate the prediction loss L_y. The SSTDA module is integrated with f to calculate the local and global domain losses, L_ld and L_gd, for optimizing f during training (see details in Section 3.2). Here we only show one stage of our multi-stage model.

3. Technical Approach

In this section, the baseline model, which is the current state-of-the-art for action segmentation, MS-TCN [9], is reviewed first (Section 3.1). Then the novel temporal domain adaptation scheme consisting of two self-supervised auxiliary tasks, binary domain prediction (Section 3.2.1) and sequential domain prediction (Section 3.2.2), is proposed, followed by the final action segmentation model.

3.1. Baseline Model

Our work is built on the current state-of-the-art model for action segmentation, the multi-stage temporal convolutional network (MS-TCN) [9]. In each stage, a single-stage TCN (SS-TCN) applies a multi-layer TCN, G_f, to derive the frame-level features f = {f_1, f_2, ..., f_T}, and makes the corresponding predictions ŷ = {ŷ_1, ŷ_2, ..., ŷ_T} using a fully-connected layer G_y. Following [9], the prediction loss L_y is calculated based on the predictions ŷ, as shown in the left part of Figure 2. Finally, multiple SS-TCN stages are stacked to enlarge the temporal receptive fields, constructing the final baseline model, MS-TCN, where each stage takes the predictions from the previous stage as inputs and makes predictions for the next stage.
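To make the structure of a single stage concrete, the following is a minimal PyTorch sketch of an SS-TCN-style stage: a 1x1 input convolution, a stack of dilated residual layers playing the role of G_f, and a frame-wise classifier playing the role of G_y. Class and variable names are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a single-stage TCN (SS-TCN-style), assuming PyTorch.
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    def __init__(self, dilation, channels):
        super().__init__()
        self.conv_dilated = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, kernel_size=1)
        self.dropout = nn.Dropout()

    def forward(self, x):                      # x: (batch, channels, T)
        out = torch.relu(self.conv_dilated(x))
        out = self.dropout(self.conv_1x1(out))
        return x + out                         # residual connection

class SingleStageTCN(nn.Module):
    """G_f (dilated temporal conv stack) followed by a frame-wise classifier G_y."""
    def __init__(self, in_dim, channels, num_classes, num_layers=10):
        super().__init__()
        self.conv_in = nn.Conv1d(in_dim, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [DilatedResidualLayer(2 ** i, channels) for i in range(num_layers)])
        self.conv_out = nn.Conv1d(channels, num_classes, kernel_size=1)  # G_y

    def forward(self, x):                      # x: (batch, in_dim, T)
        f = self.conv_in(x)
        for layer in self.layers:
            f = layer(f)                       # frame-level features f
        y = self.conv_out(f)                   # frame-level predictions
        return f, y
```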

3.2. Self-Supervised Temporal Domain Adaptation

Despite the promising performance of MS-TCN on action segmentation compared to previous methods, there is still large room for improvement. One main challenge is the problem of spatio-temporal variations of human actions [17], which causes distributional discrepancy across domains [6]. For example, different subjects may perform the same action completely differently due to personalized spatio-temporal styles. Moreover, collecting annotated data for action segmentation is challenging and time-consuming. These challenges motivate the need to learn domain-invariant feature representations without full supervision. Inspired by the recent progress of self-supervised learning, which learns informative features that can be transferred to the main target tasks without external supervision (e.g. human annotation), we propose Self-Supervised Temporal Domain Adaptation (SSTDA) to diminish cross-domain discrepancy by designing self-supervised auxiliary tasks using unlabeled videos.

To effectively transfer knowledge, the self-supervised auxiliary tasks should be closely related to the main task, which is cross-domain action segmentation in this paper. Recently, adversarial-based DA approaches [11, 12] have shown progress in addressing cross-domain image problems using a domain discriminator with adversarial training, where domain discrimination can be regarded as a self-supervised auxiliary task since domain labels are self-annotated. However, directly applying image-based DA to video tasks results in sub-optimal performance because the temporal information is ignored [4]. Therefore, the question becomes: how should we design the self-supervised auxiliary tasks to benefit cross-domain action segmentation? More specifically, the answer should address both the cross-domain and action segmentation problems.

To address this question, we first apply an auxiliary task, binary domain prediction, to predict the domain for each frame, where the frame-level features are embedded with local temporal dynamics, aiming to address cross-domain problems for videos at local scales. Then we propose a novel auxiliary task, sequential domain prediction, to temporally segment domains for untrimmed videos, where the video-level features are embedded with global temporal dynamics, aiming to fully address the above question. Finally, SSTDA is achieved locally and globally by jointly applying these two auxiliary tasks, as illustrated in Figure 3.

In practice, since the key to effective video DA is to simultaneously align and learn temporal dynamics, instead of separating the two processes [4], we integrate SSTDA modules into multiple stages rather than only the last stage; the single-stage integration is illustrated in Figure 2.

3.2.1 Local SSTDA

The main goal of action segmentation is to learn frame-level feature representations that encode spatio-temporal information so that the model can exploit information from multiple frames to predict the action for each frame. Therefore, we first learn domain-invariant frame-level features with the auxiliary task binary domain prediction (Figure 3, left).


Figure 3: The two self-supervised auxiliary tasks in SSTDA: 1) binary domain prediction: discriminate the domain of a single frame; 2) sequential domain prediction: predict a sequence of domains for an untrimmed video. These two tasks contribute to local and global SSTDA, respectively.

Binary Domain Prediction: For a single stage, we feed the frame-level features from the source and target domains, f^S and f^T respectively, to an additional shallow binary domain classifier G_ld to discriminate which domain the features come from. Since temporal convolution from previous layers encodes information from multiple adjacent frames into each frame-level feature, those frames contribute to the binary domain prediction for each frame. Through adversarial training with a gradient reversal layer (GRL) [11, 12], which reverses the gradient signs during back-propagation, G_f is optimized to gradually align the feature distributions between the two domains. We denote G_ld equipped with GRL as Ĝ_ld, as shown in Figure 4.

Since this work is built on MS-TCN, integrating Ĝ_ld with the proper stages is critical for effective DA. From our investigation, the best performance is obtained when Ĝ_ld is integrated into the middle stages. See Section 4.3 for details.

The overall loss function becomes a combination of the baseline prediction loss L_y and the local domain loss L_ld with reversed sign, which can be expressed as follows:

L = \sum^{N_s} L_y - \sum^{\hat{N}_s} \beta_l L_{ld}    (1)

L_{ld} = \frac{1}{T} \sum_{j=1}^{T} L_{ld}(\hat{G}_{ld}(f_j), d_j)    (2)

where N_s is the total number of stages in MS-TCN, \hat{N}_s is the number of stages integrated with Ĝ_ld, and T is the total number of frames in a video. L_ld is a binary cross-entropy loss function, and β_l is the trade-off weight for the local domain loss L_ld, obtained by following the common strategy of [11, 12].
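As an illustration of how the adversarial part of Equations (1) and (2) can be realized, below is a hedged PyTorch sketch of a gradient reversal layer and a shallow frame-level domain classifier; the module structure and the two-layer classifier are assumptions for illustration, not the released implementation.

```python
# Sketch of adversarial binary domain prediction with a gradient reversal layer (GRL).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, beta):
        ctx.beta = beta
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back into G_f.
        return -ctx.beta * grad_output, None

class LocalDomainClassifier(nn.Module):
    """Shallow frame-level domain classifier G_ld; GRL + G_ld plays the role of hat{G}_ld."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=1))          # source vs. target

    def forward(self, f, beta):                           # f: (batch, C, T)
        return self.net(GradReverse.apply(f, beta))       # (batch, 2, T)

def local_domain_loss(domain_logits_s, domain_logits_t):
    """Frame-wise cross-entropy over the two domain classes (Eq. 2), source label 0, target label 1."""
    ce = nn.CrossEntropyLoss()
    d_s = torch.zeros(domain_logits_s.shape[0], domain_logits_s.shape[2], dtype=torch.long)
    d_t = torch.ones(domain_logits_t.shape[0], domain_logits_t.shape[2], dtype=torch.long)
    return ce(domain_logits_s, d_s) + ce(domain_logits_t, d_t)
```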

3.2.2 Global SSTDA

Although the frame-level features f are learned using context and dependencies from neighboring frames, the temporal receptive fields of f are still limited and unable to represent full videos. Solely integrating DA into f cannot fully address spatio-temporal variations for untrimmed long videos. Therefore, in addition to binary domain prediction for frame-level features, we propose a second self-supervised auxiliary task for video-level features: sequential domain prediction, which predicts a sequence of domains for video clips, as shown in the right part of Figure 3. This task is a temporal domain segmentation problem, aiming to predict the correct permutation of domains for long videos consisting of shuffled video clips from both the source and target domains. Since this goal is related to both the cross-domain and action segmentation problems, sequential domain prediction can effectively benefit our main task.

More specifically, we first divide f^S and f^T into two sets of segments F^S = {f^S_a, f^S_b, ...} and F^T = {f^T_a, f^T_b, ...}, respectively, and then learn the corresponding two sets of segment-level feature representations V^S = {v^S_a, v^S_b, ...} and V^T = {v^T_a, v^T_b, ...} with Domain Attentive Temporal Pooling (DATP). All features v are then shuffled, combined in random order, and fed to a sequential domain classifier G_gd equipped with GRL (denoted Ĝ_gd) to predict the permutation of domains, as shown in Figure 4.

Domain Attentive Temporal Pooling (DATP): The most straightforward method to obtain a video-level feature is to aggregate frame-level features with temporal pooling. However, not all frame-level features contribute equally to the overall domain discrepancy, as mentioned in [4]. Hence, we assign larger attention weights w_j (calculated using Ĝ_ld in local SSTDA) to the features with larger domain discrepancy so that we focus more on aligning those features. Finally, the attended frame-level features are aggregated with temporal pooling to generate the video-level feature v, which can be expressed as:

v = \frac{1}{T'} \sum_{j=1}^{T'} w_j \cdot f_j    (3)

where T' is the number of frames in a video segment. For more details, please refer to the supplementary material.

Sequential Domain Prediction: By separately applying DATP to the source and target segments, a set of segment-level feature representations V = {v^S_a, v^S_b, ..., v^T_a, v^T_b, ...} is obtained. We then shuffle all the features in V and concatenate them into a single feature representing a long, untrimmed video V', which contains video segments from both domains in random order. Finally, V' is fed into the sequential domain classifier Ĝ_gd to predict the permutation of domains for the video segments. For example, if V' = [v^S_a, v^T_a, v^T_b, v^S_b], the goal of Ĝ_gd is to predict the permutation as [0, 1, 1, 0].

Figure 4: The overview of the proposed Self-Supervised Temporal Domain Adaptation (SSTDA). The inputs from the two domains are first encoded with local temporal dynamics using G_f to obtain the frame-level features f^S and f^T, respectively. We apply local SSTDA to all f using binary domain prediction with Ĝ_ld. Besides, f^S and f^T are evenly divided into multiple segments to learn the segment-level features V^S and V^T with DATP, respectively. Finally, global SSTDA is applied to V', which is generated by concatenating the shuffled V^S and V^T, using sequential domain prediction with Ĝ_gd. L_ld and L_gd are the domain losses from Ĝ_ld and Ĝ_gd, respectively. w corresponds to the attention weights for DATP, which are calculated from the outputs of Ĝ_ld. Here we use 8-frame videos and 2 segments as an example for this figure. Best viewed in color.

Ĝ_gd is a multi-class classifier whose number of classes corresponds to the total number of possible permutations of domains, and the complexity of Ĝ_gd is determined by the number of segments per video (more analysis in Section 4.3). The outputs of Ĝ_gd are used to calculate the global domain loss L_gd as below:

L_{gd} = L_{gd}(\hat{G}_{gd}(V'), y_d)    (4)

where L_gd is also a standard cross-entropy loss function whose number of classes is determined by the segment number. Through adversarial training with GRL, sequential domain prediction also contributes to optimizing G_f to align the feature distributions between the two domains.
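The sketch below illustrates one way to realize sequential domain prediction: segment the source and target feature sequences, pool each segment (plain temporal pooling here instead of DATP, for brevity), shuffle the 2m segment features, and classify which of the (2m)!/(m!)^2 domain orderings was produced. All names and the pooling simplification are assumptions, not the released code; as in the local case, the classifier would be trained adversarially through a GRL.

```python
# Sketch of sequential domain prediction with m segments per domain.
import itertools
import random
import torch
import torch.nn as nn

def segment_and_pool(f, m):
    """Split frame-level features f (C, T) into m segments and mean-pool each."""
    return [chunk.mean(dim=1) for chunk in torch.chunk(f, m, dim=1)]  # m tensors of shape (C,)

class SequentialDomainClassifier(nn.Module):
    """G_gd: predicts which of the (2m)!/(m!)^2 domain orderings was used."""
    def __init__(self, channels, m=2, hidden=64):
        super().__init__()
        # All distinct orderings of m source (0) and m target (1) segments.
        self.perms = sorted(set(itertools.permutations([0] * m + [1] * m)))
        self.net = nn.Sequential(
            nn.Linear(2 * m * channels, hidden), nn.ReLU(),
            nn.Linear(hidden, len(self.perms)))

    def forward(self, f_s, f_t, m=2):
        v_s = segment_and_pool(f_s, m)                  # source segment features V^S
        v_t = segment_and_pool(f_t, m)                  # target segment features V^T
        order = [0] * m + [1] * m
        segments = v_s + v_t
        idx = list(range(2 * m))
        random.shuffle(idx)                             # random domain permutation
        v_prime = torch.cat([segments[i] for i in idx])           # shuffled V'
        label = self.perms.index(tuple(order[i] for i in idx))    # self-annotated label y_d
        return self.net(v_prime), torch.tensor(label)

# Global domain loss L_gd (Eq. 4):
#   nn.CrossEntropyLoss()(logits.unsqueeze(0), label.unsqueeze(0))
```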

Some self-supervised learning works also propose the concept of temporal shuffling [23, 45]. However, they predict temporal orders within one domain, aiming to learn general temporal information for video features. Instead, our method predicts the temporal permutation of cross-domain videos, shown with a dual-branch pipeline in Figure 4, and integrates with binary domain prediction to effectively address both the cross-domain and action segmentation problems.

3.2.3 Local-Global Joint Training

Finally, we also adopt a strategy from [42] to minimize the class entropy for the frames that are similar across domains by adding a domain attentive entropy (DAE) loss L_ae. Please refer to the supplementary material for more details.

By adding the global domain loss L_gd (Equation (4)) and the attentive entropy loss L_ae to Equation (1), the overall loss of our final proposed Self-Supervised Temporal Domain Adaptation (SSTDA) can be expressed as follows:

L = \sum^{N_s} L_y - \sum^{\hat{N}_s} (\beta_l L_{ld} + \beta_g L_{gd} - \mu L_{ae})    (5)

where β_g and µ are the weights for L_gd and L_ae, respectively.


                      GTEA   50Salads   Breakfast
subject #                4         25          52
class #                 11         17          48
video #                 28         50        1712
leave-#-subject-out      1          5          13

Table 1: The statistics of the action segmentation datasets.

4. Experiments

To validate the effectiveness of the proposed methods in reducing spatio-temporal discrepancy for action segmentation, we choose three challenging datasets: GTEA [10], 50Salads [37], and Breakfast [18], which separate the training and validation sets by different people (noted as subjects) with leave-subjects-out cross-validation for evaluation, resulting in a large domain shift due to spatio-temporal variations. Therefore, we regard the training set as the Source domain and the validation set as the Target domain under the standard transductive unsupervised DA protocol [32, 6]. See the supplementary material for more implementation details.

4.1. Datasets and Evaluation Metrics

The overall statistics of the three datasets are listed in Table 1. Three widely used evaluation metrics are chosen, following [21]: frame-wise accuracy (Acc), segmental edit score, and segmental F1 score at the IoU threshold k%, denoted as F1@k (k = {10, 25, 50}). While Acc is the most common metric, the edit and F1 scores both consider the temporal relation between predictions and ground truths, better reflecting performance for action segmentation.

4.2. Experimental Results

We first investigate the effectiveness of our approaches in utilizing unlabeled target videos for action segmentation. We choose MS-TCN [9] as the backbone model since it is the current state of the art for this task. “Source only” means the model is trained only with labeled source videos, i.e., the baseline model. Our approach is then compared to other methods under the same transductive protocol. Finally, we compare our method to the most recent action segmentation methods on all three datasets, and investigate how our method can reduce the reliance on labeled source data.

Self-Supervised Temporal Domain Adaptation: First we investigate the performance of local SSTDA by integrating the auxiliary task binary domain prediction with the baseline model. The results on all three datasets improve significantly, as shown in Table 2. For example, on the GTEA dataset, our approach outperforms the baseline by 4.3% for F1@25, 3.2% for the edit score, and 3.6% for the frame-wise accuracy. Although local SSTDA mainly works on the frame-level features, temporal information is still encoded using the context from neighboring frames, helping address the variation problem for videos across domains.

GTEA                     F1@10  F1@25  F1@50  Edit   Acc
Source only (MS-TCN)†     86.5   83.6   71.9  81.3  76.5
Local SSTDA               89.6   87.9   74.4  84.5  80.1
SSTDA‡                    90.0   89.1   78.0  86.2  79.8

50Salads                 F1@10  F1@25  F1@50  Edit   Acc
Source only (MS-TCN)†     75.4   73.4   65.2  68.9  82.1
Local SSTDA               79.2   77.8   70.3  72.0  82.8
SSTDA‡                    83.0   81.5   73.8  75.8  83.2

Breakfast                F1@10  F1@25  F1@50  Edit   Acc
Source only (MS-TCN)†     65.3   59.6   47.2  65.7  64.7
Local SSTDA               72.8   67.8   55.1  71.7  70.3
SSTDA‡                    75.0   69.1   55.2  73.7  70.2

Table 2: The experimental results for our approaches on three benchmark datasets. “SSTDA” refers to the full model, while “Local SSTDA” only contains binary domain prediction. †We achieve higher performance than reported in [9] when using the released code, so we use that as the baseline performance throughout the paper. ‡Global SSTDA requires outputs from local SSTDA, so it is not evaluated alone.

Despite the improvement from local SSTDA, integrating DA into frame-level features cannot fully address the problem of spatio-temporal variations for long videos. Therefore, we integrate our second proposed auxiliary task, sequential domain prediction, for untrimmed long videos. By jointly training with both auxiliary tasks, SSTDA can jointly align cross-domain feature spaces embedded with local and global temporal dynamics, and further improves over local SSTDA by significant margins. For example, on the 50Salads dataset, it outperforms local SSTDA by 3.8% for F1@10, 3.7% for F1@25, 3.5% for F1@50, and 3.8% for the edit score, as shown in Table 2.

One interesting finding is that local SSTDA contributes most of the frame-wise accuracy improvement of SSTDA because it focuses on aligning frame-level feature spaces. On the other hand, sequential domain prediction benefits the alignment of video-level feature spaces, contributing further improvement for the other two metrics, which consider temporal relations for evaluation.

Learning from Unlabeled Target Videos: We also compare SSTDA with other popular approaches [12, 27, 34, 44, 36, 22, 45] to validate the effectiveness of reducing spatio-temporal discrepancy with the same amount of unlabeled target videos. For a fair comparison, we integrate all these methods with the same baseline model, MS-TCN. For more implementation details, please refer to the supplementary material.

Table 3 shows that our proposed SSTDA outperforms all the other investigated DA methods in terms of the two metrics that consider temporal relations. We conjecture that the main reason is that all these DA approaches are designed for cross-domain image problems.


                         F1@10  F1@25  F1@50  Edit
Source only (MS-TCN)      86.5   83.6   71.9  81.3
VCOP [45]                 87.3   85.9   70.1  82.2
DANN [12]                 89.6   87.9   74.4  84.5
JAN [27]                  88.7   87.6   73.1  83.1
MADA [34]                 88.6   86.7   75.8  83.5
MSTN [44]                 89.9   88.2   75.9  84.7
MCD [36]                  88.1   86.3   73.4  82.7
SWD [22]                  89.0   87.3   73.8  84.4
SSTDA                     90.0   89.1   78.0  86.2

Table 3: The comparison of different methods that can learn information from unlabeled target videos (on GTEA). All the methods are integrated with the same baseline model, MS-TCN, for a fair comparison. Please refer to the supplementary material for the results on the other datasets.

Although they are integrated with frame-level features which encode local temporal dynamics, their limited temporal receptive fields prevent them from fully addressing the temporal domain discrepancy. Instead, the sequential domain prediction in SSTDA is applied directly to the whole untrimmed video, helping to globally align the cross-domain feature spaces that embed longer temporal dynamics, so that spatio-temporal variations can be reduced more effectively.

We also compare with the most recent video-based self-supervised learning method [45], which can also learn temporal dynamics from unlabeled target videos. However, its performance is even worse than that of the other DA methods, implying that temporal shuffling within a single domain does not effectively benefit cross-domain action segmentation.

Comparison with Action Segmentation Methods: Here we compare the recent methods to SSTDA trained under two settings: 1) full source labels, and 2) reduced source labels.

In the first setting, we have labels for all the frames in the source videos, and SSTDA outperforms all the previous methods on the three datasets with respect to all evaluation metrics. For example, SSTDA outperforms the current state-of-the-art fully-supervised method, MS-TCN [9], by large margins (e.g. 8.1% for F1@25, 8.6% for F1@50, and 6.9% for the edit score on 50Salads; 9.5% for F1@25, 8.0% for F1@50, and 8.0% for the edit score on Breakfast), as demonstrated in Table 4. Since no additional labeled data is used, these results indicate how our proposed SSTDA addresses the spatio-temporal variation problem with unlabeled videos to improve action segmentation performance.

The significant improvement from exploiting unlabeled target videos implies the potential to train with a smaller number of labeled frames using SSTDA, which is our second setting. In this setting, we drop labeled frames from the source domain with uniform sampling during training, and evaluate on the same validation data. Our experiment indicates that, by integrating with SSTDA, only 65% of the labeled training data is required to achieve performance comparable to MS-TCN, as shown in the “SSTDA (65%)” rows of Table 4.

GTEA            F1@10  F1@25  F1@50  Edit   Acc
LCDC [30]        75.4      -      -  72.8  65.3
TDRN [24]        79.2   74.4   62.7  74.1  70.1
MS-TCN [9]†      86.5   83.6   71.9  81.3  76.5
SSTDA (65%)      85.2   82.6   69.3  79.6  75.7
SSTDA            90.0   89.1   78.0  86.2  79.8

50Salads        F1@10  F1@25  F1@50  Edit   Acc
TDRN [24]        72.9   68.5   57.2  66.0  68.1
LCDC [30]        73.8      -      -  66.9  72.1
MS-TCN [9]†      75.4   73.4   65.2  68.9  82.1
SSTDA (65%)      77.7   75.0   66.2  69.3  80.7
SSTDA            83.0   81.5   73.8  75.8  83.2

Breakfast       F1@10  F1@25  F1@50  Edit   Acc
TCFPN [8]           -      -      -     -  52.0
GRU [35]            -      -      -     -  60.6
MS-TCN [9]†      65.3   59.6   47.2  65.7  64.7
SSTDA (65%)      69.3   62.9   49.4  69.0  65.8
SSTDA            75.0   69.1   55.2  73.7  70.2

Table 4: Comparison with the most recent action segmentation methods on all three datasets. SSTDA (65%) means training with 65% of the total labeled training data. †Results from running the official code, as explained in Table 2.

                F1@10  F1@25  F1@50  Edit   Acc
Source only      86.5   83.6   71.9  81.3  76.5
{S1}             88.6   86.2   73.6  84.2  78.7
{S2}             89.1   87.2   74.4  84.3  79.1
{S3}             89.2   87.3   72.3  83.8  78.9
{S4}             88.1   86.4   73.0  83.0  78.8
{S1, S2}         89.0   85.8   73.5  84.8  79.5
{S2, S3}         89.6   87.9   74.4  84.5  80.1
{S3, S4}         88.3   86.8   73.9  83.6  78.6

Table 5: The experimental results of the design choice for local SSTDA (on GTEA). {Sn}: add Ĝ_ld to the n-th stage of MS-TCN, where smaller n implies closer to the inputs.

For the full experiments on labeled data reduction, please refer to the supplementary material.

4.3. Ablation Study and Analysis

Design Choice for Local SSTDA: Since we develop our approaches upon MS-TCN [9], a question arises: how do we effectively integrate binary domain prediction into a multi-stage architecture? To answer this, we first integrate Ĝ_ld into each stage individually; the results show that the best performance is obtained when Ĝ_ld is integrated into the middle stages, such as S2 or S3, as shown in Table 5. S1 is not a good choice for DA because it corresponds to low-level features with less discriminability, where DA shows limited effects [25], and it has smaller temporal receptive fields for videos.


Segment #  F1@10  F1@25  F1@50  Edit   Acc
1           89.4   87.7   75.4  85.3  79.2
2           90.0   89.1   78.0  86.2  79.8
3           89.7   87.6   75.4  85.2  79.2

Table 6: The experimental results for different segment numbers in sequential domain prediction (on GTEA).

Figure 5: The visualization of temporal action segmentation for our methods with color-coding (input example: make coffee). “MS-TCN” is the baseline model without any DA method. We only highlight the action segments that differ from the ground truth for clear comparison.

However, higher stages (e.g. S4) are not always better. We conjecture that this is because the model fits more to the source data, causing difficulty for DA. In our case, integrating Ĝ_ld into S2 provides the best overall performance.

We also integrate binary domain prediction with multiple stages. However, multi-stage DA does not always guarantee improved performance. For example, {S1, S2} has worse results than {S2} in terms of F1@{10, 25, 50}. Since {S2} and {S3} provide the best single-stage DA performance, we use {S2, S3}, which performs the best, as the final model for all our approaches in all the experiments.

Design Choice for Global SSTDA: The most critical design decision for sequential domain prediction is the number of segments per video. In our implementation, we divide one source video into m segments, do the same for one target video, and then apply Ĝ_gd to predict the permutation of domains for these 2m video segments. Therefore, the number of categories of Ĝ_gd equals the number of all possible permutations, (2m)!/(m!)^2. In other words, the segment number m determines the complexity of the self-supervised auxiliary task. For example, m = 3 leads to a 20-way classifier, and m = 4 results in a 70-way classifier. Since a good self-supervised task should be neither trivial nor overly complicated [31], we choose m = 2 as our final design, which is supported by our experiments in Table 6.
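As a quick sanity check of the class counts quoted above (the chosen m = 2 yields a 6-way classifier), the number of domain orderings can be computed directly:

```python
# Number of permutation classes for the sequential domain classifier.
from math import factorial

for m in (2, 3, 4):
    n_classes = factorial(2 * m) // (factorial(m) ** 2)   # (2m)! / (m!)^2
    print(m, n_classes)   # m=2 -> 6, m=3 -> 20, m=4 -> 70 classes
```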

Figure 6: The visualization of temporal action segmentation for different DA methods (same input as Figure 5). “Source only” represents the baseline model, MS-TCN. Only the segments that differ from the ground truth are highlighted.

Segmentation Visualization: It is also common to evaluate the qualitative performance to ensure that the prediction results are aligned with human vision. First, we compare our approaches with the baseline model MS-TCN [9] and the ground truth, as shown in Figure 5. MS-TCN fails to detect some pour actions in the first half of the video, and falsely classifies close as take in the latter part of the video. With local SSTDA, our approach can detect close in the latter part of the video. Finally, with full SSTDA, our proposed method also detects all pour action segments in the first half of the video. We then compare SSTDA with other DA methods; Figure 6 shows that our result is the closest to the ground truth. The others either fail to detect some actions or classify them incorrectly. For more qualitative results, please refer to the supplementary material.

5. Conclusions and Future Work

In this work, we propose a novel approach to effectively exploit unlabeled target videos to boost performance for action segmentation without target labels. To address the problem of spatio-temporal variations for videos across domains, we propose Self-Supervised Temporal Domain Adaptation (SSTDA) to jointly align cross-domain feature spaces embedded with local and global temporal dynamics through two self-supervised auxiliary tasks, binary and sequential domain prediction. Our experiments indicate that SSTDA outperforms other DA approaches by aligning temporal dynamics more effectively. We also validate the proposed SSTDA on three challenging datasets (GTEA, 50Salads, and Breakfast), and show that SSTDA outperforms the current state-of-the-art method by large margins and requires only 65% of the labeled training data to achieve comparable performance, demonstrating the usefulness of adapting to unlabeled videos across variations. For future work, we plan to apply SSTDA to more challenging video tasks (e.g. spatio-temporal action localization [14]).


6. Appendix

In this supplementary material, we provide more details about the technical approach, implementation, and experiments.

6.1. Technical Approach Details

Domain Attentive Temporal Pooling (DATP): Temporal pooling is one of the most common methods to aggregate frame-level features into a video-level feature for each video. However, not all frame-level features contribute equally to the overall domain discrepancy. Therefore, inspired by [4, 5], we assign larger attention weights to the features with larger domain discrepancy so that we can focus more on aligning those features, achieving more effective domain adaptation.

More specifically, we utilize the entropy criterion to generate the domain attention value for each frame-level feature f_j as below:

w_j = 1 - H(\hat{d}_j)    (6)

where d̂_j is the output of the learned domain classifier Ĝ_ld used in local SSTDA, and H(p) = -\sum_k p_k \log(p_k) is the entropy function measuring uncertainty. w_j increases when H(d̂_j) decreases, which means the domains can be distinguished well. We also add a residual connection for more stable optimization. Finally, we aggregate the attended frame-level features with temporal pooling to generate the video-level feature v, which is noted as Domain Attentive Temporal Pooling (DATP), as illustrated in the left part of Figure 7, and can be expressed as:

v = \frac{1}{T'} \sum_{j=1}^{T'} (w_j + 1) \cdot f_j    (7)

where +1 refers to the residual connection, and w_j + 1 is equal to w_j in the main paper. T' is the number of frames used to generate a video-level feature.

Local SSTDA is necessary to calculate the attention weights for DATP. Without this mechanism, frames would be aggregated in the same way as plain temporal pooling without cross-domain consideration, which has been shown to be sub-optimal for cross-domain video tasks [4, 5].

Domain Attentive Entropy (DAE): Minimum entropy regularization is a common strategy to perform more refined classifier adaptation. However, we only want to minimize the class entropy for the frames that are similar across domains. Therefore, inspired by [42], we attend to the frames with low domain discrepancy, which correspond to high domain entropy H(d̂_j). More specifically, we adopt the Domain Attentive Entropy (DAE) module to calculate the attentive entropy loss L_ae, which can be expressed as follows:

L_{ae} = \frac{1}{T} \sum_{j=1}^{T} (H(\hat{d}_j) + 1) \cdot H(\hat{y}_j)    (8)

where d̂_j and ŷ_j are the outputs of Ĝ_ld and G_y, respectively, and T is the total number of frames in a video. We also apply the residual connection for stability, as shown in the right part of Figure 7.
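A compact sketch of how Equations (6)-(8) can be computed from the outputs of Ĝ_ld and G_y is given below; the tensor shapes and function names are assumptions for illustration, not the released implementation.

```python
# Sketch of DATP (Eq. 6-7) and the DAE loss (Eq. 8).
import torch.nn.functional as F

def entropy(p, eps=1e-8):                      # p: (..., K) probabilities
    return -(p * (p + eps).log()).sum(dim=-1)

def datp(f, domain_logits):
    """f: (T', C) frame features; domain_logits: (T', 2) from the local domain classifier."""
    d_hat = F.softmax(domain_logits, dim=-1)
    w = 1.0 - entropy(d_hat)                   # Eq. 6: larger weight when domains separate well
    w = w + 1.0                                # residual connection (the +1 in Eq. 7)
    return (w.unsqueeze(-1) * f).mean(dim=0)   # video-level feature v

def dae_loss(domain_logits, class_logits):
    """Attentive entropy loss L_ae (Eq. 8); class_logits: (T, num_classes) from G_y."""
    h_d = entropy(F.softmax(domain_logits, dim=-1))      # domain entropy H(d_hat)
    h_y = entropy(F.softmax(class_logits, dim=-1))       # class entropy H(y_hat)
    return ((h_d + 1.0) * h_y).mean()
```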


Figure 7: The details of DATP (left) and DAE (right). Both modules take the domain entropy H(d̂), which is calculated from the domain prediction d̂, to calculate the attention weights. With the residual connection, DATP attends to the frame-level features for aggregation into the final video-level feature v (arrow thickness represents assigned attention values), and DAE attends to the class entropy H(ŷ) to obtain the attentive entropy loss L_ae.

Full Architecture: Our method is built upon the state-of-the-art action segmentation model, MS-TCN [9], which takes input frame-level feature representations and generates the corresponding output frame-level class predictions with four stages of SS-TCN. In our implementation, we convert the second and third stages into Domain Adaptive TCN (DA-TCN) by integrating each SS-TCN with the following three parts: 1) Ĝ_ld (for binary domain prediction), 2) DATP and Ĝ_gd (for sequential domain prediction), and 3) DAE, bringing three corresponding loss functions, L_ld, L_gd and L_ae, respectively, as illustrated in Figure 8. The final loss function can be formulated as below:

L = \sum_{s=1}^{4} L_y(s) - \sum_{s=2}^{3} (\beta_l L_{ld}(s) + \beta_g L_{gd}(s) - \mu L_{ae}(s))    (9)

where β_l, β_g and µ are the weights for L_ld, L_gd and L_ae, respectively, obtained by the methods described in Section 6.2, and s is the stage index in MS-TCN.
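Assuming the per-stage losses have already been computed, Equation (9) combines them as in the short sketch below. Note that the minus sign expresses the minimax objective; in practice the gradient sign flip for the domain losses is realized by the GRL during back-propagation.

```python
# Direct transcription of Eq. (9): prediction loss over stages 1-4,
# DA terms over stages 2-3 (losses stored in dicts keyed by stage index).
def total_loss(L_y, L_ld, L_gd, L_ae, beta_l, beta_g, mu):
    loss = sum(L_y[s] for s in (1, 2, 3, 4))
    loss = loss - sum(beta_l * L_ld[s] + beta_g * L_gd[s] - mu * L_ae[s]
                      for s in (2, 3))
    return loss
```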

6.2. Experiments

Datasets and Evaluation Metrics: The detailed statistics and the evaluation protocols of the three datasets are listed in Table 7. We follow [21] and use the following three metrics for evaluation:

1. Frame-wise accuracy (Acc): Acc is one of the most typical evaluation metrics for action segmentation, but it does not consider the temporal dependencies of the prediction, causing an inconsistency between the qualitative assessment and the frame-wise accuracy. Besides, long action classes have a higher impact on this metric than shorter action classes, making it unable to reflect over-segmentation errors.


Figure 8: The overall architecture of the proposed SSTDA. By equipping the network with a local adversarial domain classifier Ĝ_ld, a global adversarial domain classifier Ĝ_gd, a domain attentive temporal pooling (DATP) module, and a domain attentive entropy (DAE) module, we convert an SS-TCN into a DA-TCN, and stack multiple SS-TCNs and DA-TCNs to build the final architecture. L_ld and L_gd are the local and global domain losses, respectively. L_y is the prediction loss and L_ae is the attentive entropy loss. The domain entropy H(d̂) is used to calculate the attention weights for DATP and DAE. An adversarial domain classifier Ĝ refers to a domain classifier G equipped with a gradient reversal layer (GRL).

                       GTEA   50Salads   Breakfast
subject #                 4         25          52
class #                  11         17          48
video #                  28         50        1712
avg. length (min.)        1        6.4         2.7
avg. action # / video    20         20           6
cross-validation     4-fold     5-fold      4-fold
leave-#-subject-out       1          5          13

Table 7: The detailed statistics of the action segmentation datasets.


2. Segmental edit score (Edit): The edit score penalizes over-segmentation errors by measuring the ordering of predicted action segments, independent of slight temporal shifts.

3. Segmental F1 score at the IoU threshold k% (F1@k): F1@k also penalizes over-segmentation errors while ignoring minor temporal shifts between the predictions and the ground truth. The scores are determined by the total number of actions but do not depend on the duration of each action instance, which is similar to mean average precision (mAP) with intersection-over-union (IoU) overlap criteria. F1@k has become popular recently since it better reflects the qualitative results.
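For reference, the following is a simplified sketch of how F1@k can be computed from frame-wise predictions, following the common definition of the metric (segment extraction, greedy IoU matching per class); it is not necessarily identical to the evaluation code used in the paper (e.g. background handling may differ).

```python
# Simplified F1@k computation for action segmentation (illustrative only).
def get_segments(frame_labels):
    """Turn per-frame labels into (label, start, end) segments."""
    segments, start = [], 0
    for t in range(1, len(frame_labels) + 1):
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            segments.append((frame_labels[start], start, t))
            start = t
    return segments

def f1_at_k(pred, gt, k=0.25):
    """pred, gt: sequences of per-frame class labels; k: IoU threshold (e.g. 0.1, 0.25, 0.5)."""
    p_segs, g_segs = get_segments(pred), get_segments(gt)
    matched = [False] * len(g_segs)
    tp = fp = 0
    for (pl, ps, pe) in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(g_segs):
            if gl != pl:
                continue                       # only match segments of the same class
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= k and not matched[best_j]:
            tp, matched[best_j] = tp + 1, True
        else:
            fp += 1
    fn = len(g_segs) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```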

Implementation and Optimization: Our implementation is based on the PyTorch [33, 38] framework. We extract I3D [2] features for the video frames and use these features as the inputs to our model. The video frame rates are the same as in [9]: for the GTEA and Breakfast datasets we use a temporal resolution of 15 frames per second (fps), while for 50Salads we downsample the features from 30 fps to 15 fps to be consistent with the other datasets. For a fair comparison, we adopt the same architecture design choices as MS-TCN [9] for our baseline model. The whole model consists of four stages, where each stage contains ten dilated convolution layers. We set the number of filters to 64 in all layers of the model, and the filter size is 3. For optimization, we utilize the Adam optimizer and a batch size equal to 1, following the official implementation of MS-TCN [9]. Since the target data size is smaller than the source data size, each target sample is loaded randomly multiple times in each epoch during training.
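The configuration described above could be summarized roughly as follows; the 2048-dimensional I3D feature size is an assumption not stated in this paper, the placeholder module only stands in for the full network, and the learning rate is left at its library default since it is not specified here.

```python
# Sketch of the training configuration described above (illustrative values only).
import torch
import torch.nn as nn

config = dict(
    num_stages=4,        # stages 2 and 3 carry the SSTDA modules
    num_layers=10,       # dilated conv layers per stage
    num_filters=64,
    kernel_size=3,
    fps=15,              # GTEA/Breakfast native; 50Salads downsampled from 30 fps
    batch_size=1,        # one video per step, following MS-TCN
    input_dim=2048,      # assumed I3D feature dimension (not stated in this paper)
)

model = nn.Conv1d(config["input_dim"], config["num_filters"], 1)   # placeholder for the full network
optimizer = torch.optim.Adam(model.parameters())                   # Adam, as in the paper
```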


50Salads    m%     F1@10  F1@25  F1@50  Edit   Acc
SSTDA       100%    83.0   81.5   73.8  75.8  83.2
SSTDA        95%    81.6   80.0   73.1  75.6  83.2
SSTDA        85%    81.0   78.9   70.9  73.8  82.1
SSTDA        75%    78.9   76.5   68.6  71.7  81.1
SSTDA        65%    77.7   75.0   66.2  69.3  80.7
MS-TCN      100%    75.4   73.4   65.2  68.9  82.1

GTEA        m%     F1@10  F1@25  F1@50  Edit   Acc
SSTDA       100%    90.0   89.1   78.0  86.2  79.8
SSTDA        65%    85.2   82.6   69.3  79.6  75.7
MS-TCN      100%    86.5   83.6   71.9  81.3  76.5

Breakfast   m%     F1@10  F1@25  F1@50  Edit   Acc
SSTDA       100%    75.0   69.1   55.2  73.7  70.2
SSTDA        65%    69.3   62.9   49.4  69.0  65.8
MS-TCN      100%    65.3   59.6   47.2  65.7  64.7

Table 8: The comparison of SSTDA trained with less labeled training data. m indicates the percentage of labeled training data used to train the model.

For the weighting of the loss functions, we follow the common strategy of [11, 12] to gradually increase β_l and β_g from 0 to 1. The weighting α for the smoothness loss is 0.15 as in [9], and µ is chosen as 1 × 10^-2 via grid search.

Less Labeled Training Data: To investigate the potential to train with a smaller number of labeled frames using SSTDA, we drop labeled frames from the source domain with uniform sampling during training, and evaluate on the same validation data. Our experiment on the 50Salads dataset shows that, with SSTDA, the performance does not drop significantly as the amount of labeled training data decreases, indicating an alleviation of the reliance on labeled training data. Finally, only 65% of the labeled training data is required to achieve performance comparable to MS-TCN, as shown in Table 8. We then evaluate the proposed SSTDA on GTEA and Breakfast with the same percentage of labeled training data, and also obtain comparable or better performance.
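One plausible implementation of the uniform label dropping is sketched below: keep an evenly spaced subset of frames for the supervised loss, while the SSTDA losses, which need no labels, still see every frame. This reading of the sampling scheme is an assumption, not a description of the released code.

```python
# Sketch of keeping ~keep_ratio of labeled source frames via uniform sampling.
import torch

def labeled_frame_mask(num_frames, keep_ratio=0.65):
    """Boolean mask selecting roughly keep_ratio of frames, evenly spaced in time."""
    keep = max(1, int(round(num_frames * keep_ratio)))
    idx = torch.linspace(0, num_frames - 1, steps=keep).round().long()
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[idx] = True
    return mask

# The supervised prediction loss would then be computed only on masked frames, e.g.:
#   loss_y = F.cross_entropy(logits[:, :, mask], labels[mask].unsqueeze(0))
```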

Table 8 also shows that these results are obtained without additional labeled training data, which would contain discriminative information that can directly boost action segmentation performance. The additional training data are all unlabeled, so they cannot be trained with the standard prediction loss. Therefore, we propose SSTDA to exploit unlabeled data to: 1) further improve the strong baseline, MS-TCN, without additional training labels, and 2) achieve comparable performance to this strong baseline using only 65% of the labels for training.

Comparison with Other Approaches: We compare our proposed SSTDA with other approaches by integrating the same baseline architecture with other popular DA methods [12, 27, 34, 44, 36, 22] and a state-of-the-art video-based self-supervised approach [45].

50Salads                F1@10  F1@25  F1@50  Edit
Source only (MS-TCN)     75.4   73.4   65.2  68.9
VCOP [45]                75.8   73.8   65.9  68.4
DANN [12]                79.2   77.8   70.3  72.0
JAN [27]                 80.9   79.4   72.4  73.5
MADA [34]                79.6   77.4   70.0  72.4
MSTN [44]                79.3   77.6   71.5  72.1
MCD [36]                 78.2   75.5   67.1  70.8
SWD [22]                 78.2   76.2   67.4  71.6
SSTDA                    83.0   81.5   73.8  75.8

Breakfast               F1@10  F1@25  F1@50  Edit
Source only (MS-TCN)     65.3   59.6   47.2  65.7
VCOP [45]                68.5   62.9   50.1  67.9
DANN [12]                72.8   67.8   55.1  71.7
JAN [27]                 70.2   64.7   52.0  70.0
MADA [34]                71.0   65.4   52.8  71.2
MSTN [44]                69.6   63.6   51.5  69.2
MCD [36]                 70.4   65.1   52.4  69.7
SWD [22]                 68.6   63.2   50.6  69.1
SSTDA                    75.0   69.1   55.2  73.7

Table 9: The comparison of different methods that can learn information from unlabeled target videos (on 50Salads and Breakfast). All the methods are integrated with the same baseline model, MS-TCN, for a fair comparison.

For a fair comparison, all the methods are integrated with the second and third stages, as in our proposed SSTDA. The single-stage integration of each method is described as follows:

1. DANN [12]: We add one discriminator, which is the same as G_ld equipped with a gradient reversal layer (GRL), to the final frame-level features f.

2. JAN [27]: We integrate Joint Maximum Mean Discrepancy (JMMD) with the final frame-level features f and the class prediction ŷ.

3. MADA [34]: Instead of a single discriminator, we add multiple discriminators according to the class number to calculate the domain loss for each class. All the class-based domain losses are weighted with the prediction probabilities and then summed to obtain the final domain loss.

4. MSTN [44]: We utilize pseudo-labels to cluster the data from the source and target domains, and calculate the class centroids for the source and target domains separately. Then we compute the semantic loss by calculating the mean squared error (MSE) between the source and target centroids. The final loss contains the prediction loss, the semantic loss, and the domain loss as in DANN [12].

5. MCD [36]: We apply another classifier G'_y and follow the adversarial training procedure of Maximum Classifier Discrepancy to iteratively optimize the generator (G_f in our case) and the classifier (G_y). The L1 distance is used as the discrepancy loss.

6. SWD [22]: The framework is similar to MCD, but we replace the L1 distance with the Wasserstein distance as the discrepancy loss.

7. VCOP [45]: We divide f into three segments and compute the segment-level features with temporal pooling. After temporally shuffling the segment-level features, pairwise features are computed and concatenated into the final feature representing the video clip order. The final features are then fed into a shallow classifier to predict the order.

The experimental results on 50Salads and Breakfast both indicate that our proposed SSTDA outperforms all these methods, as shown in Table 9.

The performance of the most recent video-based self-supervised learning method [45] on 50Salads and Breakfast also shows that temporal shuffling within a single domain, without considering the relation across domains, does not effectively benefit cross-domain action segmentation, resulting in even worse performance than the other DA methods. Instead, our proposed self-supervised auxiliary tasks make predictions on cross-domain data, leading to cross-domain temporal relation reasoning instead of predicting within-domain temporal orders, achieving significant improvement in the performance of our main task, action segmentation.

6.3. Segmentation Visualization

Here we show more qualitative segmentation results from all three datasets to compare our methods with the baseline model, MS-TCN [9]. All the results (Figure 9 for GTEA, Figure 10 for 50Salads, and Figure 11 for Breakfast) demonstrate that the improvement over the baseline by local SSTDA alone is sometimes limited. For example, local SSTDA falsely detects the pour action in Figure 9b, falsely classifies cheese-related actions as cucumber-related actions in Figure 10b, and falsely detects the stir milk action in Figure 11b. However, by jointly aligning local and global temporal dynamics with SSTDA, the model is effectively adapted to the target domain, reducing the above-mentioned incorrect predictions and achieving better segmentation.

References

[1] Unaiza Ahsan, Rishi Madhok, and Irfan Essa. Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2019. 2

[2] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 1, 10

[3] Min-Hung Chen, Zsolt Kira, and Ghassan AlRegib. Temporal attentive alignment for video domain adaptation. In CVPR Workshop on Learning from Unlabeled Videos, 2019. 2

[4] Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Woo, Ruxin Chen, and Jian Zheng. Temporal attentive alignment for large-scale video domain adaptation. In IEEE International Conference on Computer Vision (ICCV), 2019. 2, 3, 4, 9

[5] Min-Hung Chen, Baopu Li, Yingze Bao, and Ghassan AlRegib. Action segmentation with mixed temporal domain adaptation. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2020. 9

[6] Gabriela Csurka. A comprehensive survey on domain adaptation for visual applications. In Domain Adaptation in Computer Vision Applications, pages 1–35. Springer, 2017. 2, 3, 6

[7] Li Ding and Chenliang Xu. TricorNet: A hybrid temporal convolutional and recurrent network for video action segmentation. arXiv preprint arXiv:1705.07818, 2017. 1, 2

[8] Li Ding and Chenliang Xu. Weakly-supervised action segmentation with iterative soft boundary assignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 7

[9] Yazan Abu Farha and Jurgen Gall. MS-TCN: Multi-stage temporal convolutional network for action segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 1, 2, 3, 6, 7, 8, 9, 10, 11, 12

[10] Alireza Fathi, Xiaofeng Ren, and James M Rehg. Learning to recognize objects in egocentric activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011. 2, 6

[11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015. 2, 3, 4, 11

[12] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research (JMLR), 17(1):2096–2030, 2016. 2, 3, 4, 6, 7, 11

[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014. 2

[14] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 8

[15] Arshad Jamal, Vinay P Namboodiri, Dipti Deodhare, and KS Venkatesh. Deep domain adaptation in action space. In British Machine Vision Conference (BMVC), 2018. 2

[16] Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In AAAI Conference on Artificial Intelligence (AAAI), 2019. 2


Figure 9: The visualization of temporal action segmentation for our methods with color-coding on GTEA: (a) Make coffee, (b) Make honey coffee. Each panel shows, from top to bottom, Ground Truth, MS-TCN, Local SSTDA, and SSTDA, with the action classes background, take, open, scoop, pour, close, stir, and put. The video snapshots and the segmentation visualization are in the same temporal order (from left to right). We only highlight the action segments that are significantly different from the ground truth for clear comparison. "MS-TCN" represents the baseline model trained with only source data.

[17] Yu Kong and Yun Fu. Human action recognition and prediction: A survey. arXiv preprint arXiv:1806.11230, 2018. 1, 3

[18] Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 2, 6

[19] Vinod Kumar Kurmi, Shanu Kumar, and Vinay P Namboodiri. Attending to discriminative certainty for domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2

[20] Avisek Lahiri, Sri Charan Ragireddy, Prabir Biswas, and Pabitra Mitra. Unsupervised adversarial visual level domain adaptation for learning video object detectors from images. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2019. 1, 2

[21] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. Temporal convolutional networks for action segmentation and detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 1, 2, 6, 9


Figure 10: The visualization of temporal action segmentation for our methods with color-coding on 50Salads: (a) Subject 02, (b) Subject 04. Each panel shows, from top to bottom, Ground Truth, MS-TCN, Local SSTDA, and SSTDA, with action classes such as add oil, add vinegar, add pepper, mix dressing, peel cucumber, cut cucumber, cut tomato, cut cheese, cut lettuce, place the ingredients into the bowl, mix ingredients, and serve salad onto plate. The video snapshots and the segmentation visualization are in the same temporal order (from left to right). We only highlight the action segments that are different from the ground truth for clear comparison. Both examples correspond to the same activity Make salad, but they are performed by different subjects, i.e., people.

[22] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2, 6, 7, 11, 12


Figure 11: The visualization of temporal action segmentation for our methods with color-coding on Breakfast: (a) Make Cereal (background, take bowl, pour cereals, pour milk, stir cereals), (b) Make milk (background, take cup, spoon powder, pour milk, stir milk). Each panel shows, from top to bottom, Ground Truth, MS-TCN, Local SSTDA, and SSTDA. The video snapshots and the segmentation visualization are in the same temporal order (from left to right). We only highlight the action segments that are different from the ground truth for clear comparison.


[23] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In IEEE International Conference on Computer Vision (ICCV), 2017. 2, 5

[24] Peng Lei and Sinisa Todorovic. Temporal deformable residual networks for action segmentation in videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 1, 2, 7

[25] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 2015. 2, 7

[26] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems (NeurIPS), 2016. 2

[27] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning (ICML), 2017. 2, 6, 7, 11

[28] Chih-Yao Ma, Min-Hung Chen, Zsolt Kira, and Ghassan AlRegib. TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition. Signal Processing: Image Communication, 71:76–87, 2019. 1

[29] Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 1

[30] Khoi-Nguyen C Mac, Dhiraj Joshi, Raymond A Yeh, Jinjun Xiong, Rogerio S Feris, and Minh N Do. Learning motion in feature space: Locally-consistent deformable convolution networks for fine-grained action detection. In IEEE International Conference on Computer Vision (ICCV), 2019. 7

[31] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (ECCV), 2016. 8

[32] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(10):1345–1359, 2010. 2, 6

[33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems Workshop (NeurIPSW), 2017. 10

[34] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In AAAI Conference on Artificial Intelligence (AAAI), 2018. 2, 6, 7, 11

[35] Alexander Richard, Hilde Kuehne, and Juergen Gall. Weakly supervised action learning with RNN based fine-to-coarse modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 7

[36] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2, 6, 7, 11

[37] Sebastian Stein and Stephen J McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), 2013. 2, 6

[38] Benoit Steiner, Zachary DeVito, Soumith Chintala, Sam Gross, Adam Paszke, Francisco Massa, Adam Lerer, Gregory Chanan, Zeming Lin, Edward Yang, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019. 10

[39] Waqas Sultani and Imran Saleemi. Human action recognition across datasets by foreground-weighted histogram decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 2

[40] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2

[41] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 1

[42] Ximei Wang, Liang Li, Weirui Ye, Mingsheng Long, and Jianmin Wang. Transferable attention for domain adaptation. In AAAI Conference on Artificial Intelligence (AAAI), 2019. 2, 5, 9

[43] Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

[44] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), 2018. 2, 6, 7, 11

[45] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2, 5, 6, 7, 11, 12

[46] Tiantian Xu, Fan Zhu, Edward K Wong, and Yi Fang. Dual many-to-one-encoder-based transfer learning for cross-dataset human action recognition. Image and Vision Computing, 55:127–137, 2016. 2

[47] Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. Collaborative and adversarial network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

[48] Xiao-Yu Zhang, Haichao Shi, Changsheng Li, Kai Zheng, Xiaobin Zhu, and Lixin Duan. Learning transferable self-attentive representations for action recognition in untrimmed videos with weak supervision. In AAAI Conference on Artificial Intelligence (AAAI), 2019. 1, 2
