ClusterFit: Improving Generalization of Visual Representations

Xueting Yan*, Ishan Misra*, Abhinav Gupta, Deepti Ghadiyaram†, Dhruv Mahajan† CVPR 2020

STRUCT Group Seminar Presenter: Wenjing Wang

2020.05.17

OUTLINE

➤ Authorship

➤ Background

➤ Proposed Method

➤ Experimental Results

➤ Conclusion

BACKGROUND

➤ Background

➤ Overview of the proposed method

➤ Compared with existing methods

BACKGROUND

➤ Weak or self-supervision pre-training

BACKGROUND

➤ Weakly supervised learning

• Defining the proxy tasks using the associated meta-data

• Hashtag prediction

• Search query prediction

• GPS

• Word or n-gram prediction

BACKGROUND

➤ Self-supervised Learning

• Defining the proxy tasks without extra data

• Domain agnostic

• Domain-specific information, e.g. spatial structure

• Color and illumination

• Temporal structure

BACKGROUND

➤ Weak or self-supervision pre-training

• Pre-training proxy not well-aligned with the transfer tasks

• Label noise: polysemy (apple the fruit vs. Apple Inc.), linguistic ambiguity, lack of visualness of tags (#love)

• The last layer is more “aligned” with the proxy objective

➤ This paper: avoid overfitting to the proxy objective

• Smoothing the feature space learned via proxy objectives

BACKGROUND

➤ Background

➤ Overview of the proposed method

➤ Compared with existing methods

BACKGROUND

➤ Proposed method: ClusterFit

• Step 1. Cluster: feature clustering

• Step 2. Fit: predict cluster assignments

BACKGROUND

➤ Background

➤ Overview of the proposed method

➤ Compared with existing methods

➤ Cluster-based self-supervised learning

• DeepCluster [1]

• DeeperCluster [2]

[1] Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze: Deep Clustering for Unsupervised Learning of Visual Features. ECCV 2018
[2] Mathilde Caron, Piotr Bojanowski, Julien Mairal, Armand Joulin: Unsupervised Pre-Training of Image Features on Non-Curated Data. ICCV 2019

BACKGROUND

➤ Cluster-based self-supervised learning

• DeepCluster [1], DeeperCluster [2]

• Require alternating between feature clustering and network training

➤ This paper

• No alternate optimization

• More stable and computationally efficient

BACKGROUND

➤ Model Distillation

• Transferring knowledge from a teacher to a student

➤ This paper

• Distilling knowledge from a higher capacity teacher model Npre to a lower-capacity student model Ncf

OUTLINE

➤ Authorship

➤ Background

➤ Proposed Method

➤ Experimental Results

➤ Conclusion

PROPOSED METHOD

➤ ClusterFit

• Use the second-last layer of Npre to extract features

• Cluster the features into K groups using k-means

• Train a new network Ncf from scratch with the K cluster assignments as labels
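The two steps above can be sketched as follows. This is a toy NumPy sketch, not the paper's implementation: the paper clusters millions of features with a scalable k-means and trains a full ConvNet in the Fit step, both of which are simplified away here, and all names are illustrative.

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Toy k-means; the paper uses a large-scale implementation."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k randomly chosen feature vectors.
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every feature vector to its nearest center.
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its members.
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(axis=0)
    return assign

# Step 1 (Cluster): features from the penultimate layer of Npre
# (random vectors stand in for real features here).
feats = np.random.default_rng(1).normal(size=(200, 16))
pseudo_labels = kmeans(feats, k=8)

# Step 2 (Fit): the K cluster indices become classification targets for
# training a new network Ncf from scratch (training loop omitted).
```

The point of the sketch is that the Fit step sees only the K discrete cluster ids, not the original proxy labels, which is what makes the re-learned features less tied to the pre-training objective.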

PROPOSED METHOD

➤ Why?

• ClusterFit: a lossy compression scheme

• Captures the essential visual invariances in the feature space

• Gives the ‘re-learned’ network an opportunity to learn features that are less sensitive to the original pre-training objective → making them more transferable.

PROPOSED METHOD

➤ Notes

• Npre is trained on Dpre

• Ncf is trained on Dcf

• Dtar is the target dataset

PROPOSED METHOD

➤ Control Experiment using Synthetic Noise

• Adding varying amounts (p%) of uniform random label noise

• Npre: pre-trained on the noisy labels, then frozen; linear classifiers are trained on top for evaluation

• Dpre = Dcf = ImageNet-1K

• Npre = Ncf = ResNet-50

• Dtar = ImageNet-1K, ImageNet-9K, iNaturalist
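The noise-injection step itself is straightforward; a sketch (function name and the exact corruption recipe are illustrative, not taken from the paper's code):

```python
import numpy as np

def add_uniform_label_noise(labels, p, num_classes, seed=0):
    """Replace roughly p% of the labels with classes drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < p / 100.0          # which examples to corrupt
    noisy[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    return noisy

# Toy stand-in for ImageNet-1K labels, corrupted at p = 25%.
clean = np.arange(1000) % 10
noisy = add_uniform_label_noise(clean, p=25, num_classes=10)
```

Note that a "flipped" label can land on its original class by chance, so the fraction of labels actually changed is slightly below p%.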

PROPOSED METHOD

➤ Control Experiment using Synthetic Noise

• Results

OUTLINE

➤ Authorship

➤ Background

➤ Proposed Method

➤ Experimental Results

➤ Conclusion

EXPERIMENTAL RESULTS

➤ Benchmarking

➤ Analysis of ClusterFit

EXPERIMENTAL RESULTS

➤ Compared Methods

• Distillation

• A weighted average of 2 loss functions:

• (a) cross-entropy with soft targets computed using Npre

• (b) cross-entropy with labels in weakly-supervised setup

• Prototype

• Instead of random initialization, uses the label information in Dcf to initialize the cluster centers

• Longer pre-training (Npre 2×)
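The distillation baseline's objective can be sketched as a weighted sum of the two cross-entropies listed above. The weighting `alpha` and temperature `T` are illustrative hyper-parameters, not the paper's values:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, weak_labels, alpha=0.5, T=2.0):
    # (a) cross-entropy against the teacher's softened predictions
    p_teacher = softmax(teacher_logits, T)
    soft_ce = -(p_teacher * np.log(softmax(student_logits, T))).sum(axis=-1).mean()
    # (b) cross-entropy against the weak (e.g. hashtag) labels
    log_p = np.log(softmax(student_logits))
    hard_ce = -log_p[np.arange(len(weak_labels)), weak_labels].mean()
    return alpha * soft_ce + (1 - alpha) * hard_ce

rng = np.random.default_rng(0)
loss = distillation_loss(rng.normal(size=(4, 5)), rng.normal(size=(4, 5)),
                         np.array([0, 1, 2, 3]))
```

Unlike ClusterFit, this baseline keeps the original weak labels in the loss, so it remains tied to the noisy proxy objective.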

EXPERIMENTAL RESULTS

➤ Benchmarking

• Weakly-supervised images

• Weakly-supervised videos

• Self-supervised images

EXPERIMENTAL RESULTS

➤ Weakly-Supervised Images

• Dpre = Dcf = IG-ImageNet-1B

• Npre = Ncf = ResNet-50

• Dtar = ImageNet-1K, ImageNet-9K, Places365, iNaturalist

EXPERIMENTAL RESULTS

➤ Weakly-Supervised Images

• Results

• ImageNet-1K: smaller gains, since the IG-ImageNet-1B hashtags are hand-crafted to align with the ImageNet-1K labels

EXPERIMENTAL RESULTS

➤ Weakly-Supervised Videos

• Dpre = Dcf = IG-Verb-19M

• Npre = Ncf = R(2+1)D-34 [1]

• Dtar = Kinetics, Sports1M, Something-Something V1

[1] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri: A Closer Look at Spatiotemporal Convolutions for Action Recognition. CVPR 2018

EXPERIMENTAL RESULTS

➤ Weakly-Supervised Videos

• Results

EXPERIMENTAL RESULTS

➤ Self-Supervised Images

• Dpre = Dcf = ImageNet-1K, JigSaw & Rotation

• Npre = Ncf = ResNet-50

• Dtar = VOC07, ImageNet-1K, Places205, iNaturalist

EXPERIMENTAL RESULTS

➤ Self-Supervised Images

• Results

EXPERIMENTAL RESULTS

➤ Self-Supervised Images

• Layer-wise results

EXPERIMENTAL RESULTS

➤ Benchmarking

➤ Analysis of ClusterFit

EXPERIMENTAL RESULTS

➤ Relative model capacity of Npre and Ncf

• Dpre = IG-Verb-19M, Dcf = IG-Verb-62M

• Ncf = R(2+1)D-18 (33M parameters)

• Npre = R(2+1)D-18, R(2+1)D-34 (64M parameters)

EXPERIMENTAL RESULTS

➤ Relative model capacity of Npre and Ncf

• Results

EXPERIMENTAL RESULTS

➤ Unsupervised vs. Per-Label Clustering

• Per-label clustering:

• Cluster videos belonging to each label into kl clusters
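Per-label clustering can be sketched on top of any clustering routine. Here `cluster_fn` is a pluggable stand-in for k-means, and the offsetting scheme (label index times kl) is one illustrative way to keep the pseudo-labels of different weak labels disjoint:

```python
import numpy as np

def per_label_clusters(feats, labels, k_l, cluster_fn):
    """Cluster each weak label's examples into k_l groups, so examples of
    different labels never share a pseudo-label (offset by label index)."""
    out = np.empty(len(labels), dtype=int)
    for i, lab in enumerate(np.unique(labels)):
        mask = labels == lab
        out[mask] = i * k_l + cluster_fn(feats[mask], k_l)
    return out

# Dummy cluster_fn standing in for k-means: round-robin assignment.
toy_cluster = lambda f, k: np.arange(len(f)) % k
feats = np.zeros((12, 4))
labels = np.array([0] * 6 + [1] * 6)
pseudo = per_label_clusters(feats, labels, k_l=3, cluster_fn=toy_cluster)
```

This yields kl clusters per label (kl × number-of-labels pseudo-labels overall), versus a single unsupervised k-means over all features at once.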

EXPERIMENTAL RESULTS

➤ Properties of Dpre

• Number of labels

• IG-Verb-62M (438 weak verb labels)

• Label number: 10, 30, 100, 438

• Reducing the number of labels implies reduced content diversity

EXPERIMENTAL RESULTS

➤ Properties of Dpre

• Results

OUTLINE

➤ Authorship

➤ Background

➤ Proposed Method

➤ Experimental Results

➤ Conclusion

CONCLUSION

➤ ClusterFit: first cluster the original feature space, then re-learn a new model from scratch on the cluster assignments

➤ Improves the generalization of both weakly- and self-supervised representations
