[Page 1]
Asymmetric Tri-training
for Unsupervised Domain Adaptation
Kuniaki Saito1, Yoshitaka Ushiku1 and Tatsuya Harada1,2
1: The University of Tokyo, 2:RIKEN
ICML 2017 (8/6~8/11), Sydney
[Page 2]
Background: Domain Adaptation (DA)
[Figure: images labeled "rucksack", "keyboard", "bicycle", each shown in a source and a target domain.]
• Supervised learning needs a large number of labeled samples
– Collecting samples in every domain is costly
– Classifiers suffer when the domain changes
• The purpose of DA
– Train a classifier on the source domain that works well on the target domain
• Unsupervised Domain Adaptation
– Labeled source samples and unlabeled target samples
[Page 3]
Related Work
• Applications in computer vision
– Domain transfer + Generative Adversarial Networks
(Real faces to illustrations [Taigman+, ICLR 2017]; artificial images to real images [Bousmalis+, CVPR 2017])
– This paper: a novel approach without generative models
• Training a CNN for domain adaptation
– Matching hidden features of different domains [Long+, ICML 2015] [Ganin+, ICML 2015]
[Figure: feature distributions of source and target samples (classes A and B) before ("No Adapt") and after ("Adapted") adaptation.]
[Page 4]
Theoretical Insight
Theorem [Ben-David+, Machine Learning 2010]

  R_T(h) ≤ R_S(h) + (1/2) d_{HΔH}(S, T) + λ

R_S(h): error on the source domain; R_T(h): error on the target domain
d_{HΔH}(S, T): divergence between the domains
λ = min_h [R_S(h) + R_T(h)]: how discriminative the features are for both domains

• Distribution matching approaches aim to minimize d_{HΔH}(S, T)
– Related work regards λ as being sufficiently small
– But there is no guarantee that λ is small enough
• Proposed method: minimize λ by reducing the error on target samples
– In the absence of labeled target samples
→ We propose to give pseudo-labels to target samples
[Page 5]
Proposed Architecture
[Diagram: input X → shared network F → three branches F1, F2, Ft with outputs p1, p2, pt. F1 and F2 are fed S + Tl; Ft is fed Tl only.]
S: source samples; Tl: pseudo-labeled target samples
y: label for a source sample; ŷ: pseudo-label for a target sample
F1, F2: labeling networks; Ft: target-specific network; F: shared network
[Page 6]
Proposed Architecture
[Same diagram as the previous page.]
F is updated using gradients from F1, F2, and Ft.
[Page 7]
1. Initial training
[Diagram: the same architecture, with every branch fed source samples S.]
All networks are trained using only source samples.
[Page 8]
2. Labeling target samples
[Diagram: target samples T → F → F1 and F2, producing p1 and p2.]
T: target samples
If F1 and F2 agree on their predictions, and either predicted probability exceeds a threshold, the corresponding label is given to the target sample.
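The labeling rule above can be sketched as follows (a minimal NumPy sketch; the function name and the threshold value are illustrative, not taken from the paper):

```python
import numpy as np

def assign_pseudo_labels(p1, p2, threshold=0.9):
    """Pseudo-label target samples where F1 and F2 agree and either is confident.

    p1, p2: (N, C) arrays of class probabilities from the two labeling networks.
    Returns (indices, labels) for the samples that receive a pseudo-label.
    """
    pred1 = p1.argmax(axis=1)
    pred2 = p2.argmax(axis=1)
    agree = pred1 == pred2
    # Either network being confident enough is sufficient
    confident = (p1.max(axis=1) > threshold) | (p2.max(axis=1) > threshold)
    mask = agree & confident
    return np.where(mask)[0], pred1[mask]
```

Samples where the two networks disagree, or where neither is confident, are simply left unlabeled for this round.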
[Page 9]
3. Retraining the networks using pseudo-labeled target samples
[Diagram: the same architecture; F1 and F2 receive S + Tl, Ft receives Tl.]
F1, F2: trained on source and pseudo-labeled samples
Ft: trained on pseudo-labeled samples only
F: learns from the gradients of all three
[Page 10]
3. Retraining the networks using pseudo-labeled target samples
[Same diagram as the previous page.]
Repeat the 2nd and 3rd steps until convergence!
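The three steps can be summarized in a schematic driver loop (a sketch in which the hypothetical callables `train_on` and `label_targets` stand in for actual network training; the real method repeats until convergence and controls how many pseudo-labels are kept per round):

```python
def asymmetric_tri_training(train_on, label_targets, source, target, rounds=3):
    """Schematic three-step procedure.

    train_on(dataset)      -- one round of training for F, F1, F2 (and Ft)
    label_targets(samples) -- pseudo-labeled subset chosen via F1/F2 agreement
    """
    # Step 1: initial training on source samples only
    train_on(source)
    pseudo = []
    for _ in range(rounds):
        # Step 2: give pseudo-labels to target samples using F1 and F2
        pseudo = label_targets(target)
        # Step 3: retrain F1, F2 on source + pseudo-labeled samples
        # (Ft is trained on the pseudo-labeled samples only)
        train_on(source + pseudo)
    return pseudo
```

A fixed `rounds` count replaces the convergence check purely for brevity.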
[Page 11]
Overall Objective

  L = λ |W1ᵀ W2| + L1 + L2 + L3

[Diagram: the same architecture; W1 and W2 are the weight matrices of F1 and F2, and L1, L2, L3 are attached to the outputs p1, p2, pt.]
L1, L2, L3: cross-entropy losses of F1 (on S + Tl), F2 (on S + Tl), and Ft (on Tl)
The λ |W1ᵀ W2| term forces F1 and F2 to learn from different features.
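The weight-constraint term can be sketched directly (NumPy; `lam` is an illustrative coefficient — the paper's exact weighting may differ):

```python
import numpy as np

def weight_similarity_penalty(W1, W2, lam=1.0):
    """Penalty on |W1^T W2|: the sum of absolute entries of W1^T W2.

    W1, W2: (d, k) weight matrices of F1 and F2. Penalizing their overlap
    pushes the two labeling networks to rely on different features.
    """
    return lam * float(np.abs(W1.T @ W2).sum())
```

When the columns of W1 and W2 are orthogonal, the penalty is zero, so the two classifiers are encouraged to make independent labeling mistakes.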
[Page 12]
Experiments
• Four adaptation scenarios between digits datasets
– MNIST, MNIST-M, SVHN, SYN DIGITS (synthesized digits)
• One adaptation scenario between traffic-sign datasets
– GTSRB (real traffic signs), SYN SIGNS (synthesized signs)
• Other experiments are omitted due to the time limit…
– Adaptation on Amazon Reviews
[Example images from each dataset: MNIST, MNIST-M, SVHN, SYN DIGITS, SYN SIGNS, GTSRB.]
[Page 13]
Accuracy on Target Domain
• Our method outperformed the other methods.
– The effect of BN is obvious in some settings.
– The effect of the weight constraint is not obvious.

| Method \ Source → Target | MNIST → MNIST-M | MNIST → SVHN | SVHN → MNIST | SYN DIGITS → SVHN | SYN SIGNS → GTSRB |
| --- | --- | --- | --- | --- | --- |
| Source Only (w/o BN) | 59.1 | 37.2 | 68.1 | 84.1 | 79.2 |
| Source Only (with BN) | 57.1 | 34.9 | 70.1 | 85.5 | 75.7 |
| DANN [Ganin+, ICML 2015] | 81.5 | 35.7 | 71.1 | 90.3 | 88.7 |
| MMD [Long+, ICML 2015] | 76.9 | - | 71.1 | 88.0 | 91.1 |
| DSN [Bousmalis+, NIPS 2016] | 83.2 | - | 82.7 | 91.2 | 93.1 |
| k-NN Labeling [Sener+, NIPS 2016] | 86.7 | 40.3 | 78.8 | - | - |
| Ours (w/o BN) | 85.3 | 39.8 | 79.8 | 93.1 | 96.2 |
| Ours (w/o weight constraint) | 94.2 | 49.7 | 86.0 | 92.4 | 94.0 |
| Ours | 94.0 | 52.8 | 86.8 | 92.9 | 96.2 |
[Page 14]
Summary and Future Work
• Summary
– Problem formulation for domain adaptation
– Proposal of asymmetric tri-training
– Effectiveness shown in experiments
• Future work
– Evaluate our method when fine-tuning a pre-trained model
For more details, please refer to:
Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric Tri-training for Unsupervised Domain Adaptation. International Conference on Machine Learning (ICML), 2017.
[Page 15]
Supplemental materials
[Page 16]
Relationship with Tri-training
• Tri-training [Zhou & Li, 2005]
– Uses the three classifiers symmetrically
• Two classifiers give labels to unlabeled samples
• The third classifier is trained on the newly labeled samples
• Repeat for every combination of the classifiers
• Our proposed method
– Uses the three classifiers asymmetrically
• Two fixed classifiers (F1, F2) give the labels
• One fixed classifier (Ft) is trained on the pseudo-labeled samples
[Page 17]
Accuracy during training
Blue: (correctly labeled samples) / (labeled samples). Initially this accuracy is high and gradually decreases.
Red: accuracy of the learned network; it gradually increases.
Green: the number of labeled samples.
[Page 18]
A-distance between domains
• A-distance
– Calculated from a domain classifier's error
• The proposed method does not make the divergence small.
– Minimizing the divergence is not the only way to achieve good adaptation!
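The proxy A-distance commonly reported in this literature is derived from the domain classifier's generalization error ε as d_A = 2(1 − 2ε); a one-line sketch:

```python
def proxy_a_distance(eps):
    """Proxy A-distance from a domain classifier's test error eps.

    eps = 0.5 (chance level)       -> 0.0: domains are indistinguishable.
    eps = 0.0 (perfect separation) -> 2.0: domains are maximally far apart.
    """
    return 2.0 * (1.0 - 2.0 * eps)
```

A small value therefore means the learned features of the two domains overlap heavily, which, as the slide notes, the proposed method does not require.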
[Pages 19–22]
Analysis by gradient stopping
[The architecture diagram is repeated across four slides, varying which gradients (from F1, F2, or Ft) are blocked from updating the shared network F.]