TRANSCRIPT
ClusterFit: Improving Generalization of Visual Representations
Xueting Yan*, Ishan Misra*, Abhinav Gupta, Deepti Ghadiyaram†, Dhruv Mahajan†. CVPR 2020
STRUCT Group Seminar Presenter: Wenjing Wang
2020.05.17
OUTLINE
➤ Authorship
➤ Background
➤ Proposed Method
➤ Experimental Results
➤ Conclusion
BACKGROUND
➤ Background
➤ Overview of the proposed method
➤ Compared with existing methods
BACKGROUND
➤ Weakly- or self-supervised pre-training
BACKGROUND
➤ Weakly supervised learning
• Defining the proxy tasks using the associated meta-data
• Hashtag prediction
• Search query prediction
• GPS
• Word or n-gram prediction
BACKGROUND
➤ Self-supervised Learning
• Defining the proxy tasks without extra data
• Domain agnostic
• Domain-specific information, e.g. spatial structure
• Color and illumination
• Temporal structure
BACKGROUND
➤ Weakly- or self-supervised pre-training
• Pre-training proxy not well-aligned with the transfer tasks
• Label noise: polysemy (apple the fruit vs. Apple Inc.), linguistic ambiguity, lack of visualness of tags (#love)
• The last layer is more “aligned” with the proxy objective
➤ This paper: avoid overfitting to the proxy objective
• Smoothing the feature space learned via proxy objectives
BACKGROUND
➤ Background
➤ Overview of the proposed method
➤ Compared with existing methods
BACKGROUND
➤ Proposed method: ClusterFit
• Step 1. Cluster: feature clustering
• Step 2. Fit: predict cluster assignments
BACKGROUND
➤ Background
➤ Overview of the proposed method
➤ Compared with existing methods
➤ Cluster-based self-supervised learning
• DeepCluster [1]
• DeeperCluster [2]
[1] Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze: Deep Clustering for Unsupervised Learning of Visual Features. ECCV 2018.
[2] Mathilde Caron, Piotr Bojanowski, Julien Mairal, Armand Joulin: Unsupervised Pre-Training of Image Features on Non-Curated Data. ICCV 2019.
BACKGROUND
➤ Cluster-based self-supervised learning
• DeepCluster [1], DeeperCluster [2]
• Require alternate optimization
➤ This paper
• No alternate optimization
• More stable and computationally efficient
BACKGROUND
➤ Model Distillation
• Transferring knowledge from a teacher to a student
➤ This paper
• Distilling knowledge from a higher capacity teacher model Npre to a lower-capacity student model Ncf
OUTLINE
➤ Authorship
➤ Background
➤ Proposed Method
➤ Experimental Results
➤ Conclusion
PROPOSED METHOD
➤ ClusterFit
• Use the second-to-last layer of Npre to extract features
• Cluster features using k-means into K groups
• Train a new network from scratch with the K-labels
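The three steps above can be sketched as follows. This is a minimal standalone sketch, not the paper's implementation: random vectors stand in for Npre's penultimate-layer features on Dcf, K and all sizes are illustrative, and a linear model stands in for the re-trained network Ncf.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in for penultimate-layer features of N_pre on D_cf (N images x dim).
features = rng.normal(size=(1000, 128))

# Step 1 (Cluster): k-means the features into K groups.
K = 16  # illustrative; the paper sweeps much larger K
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
pseudo_labels = kmeans.labels_  # one cluster id per image

# Step 2 (Fit): train a new model from scratch on the K-way pseudo-labels.
# A linear classifier stands in for the full network N_cf here.
clf = LogisticRegression(max_iter=200).fit(features, pseudo_labels)
print(pseudo_labels.shape)
```

In the paper, Step 2 trains a full ConvNet from random initialization; the only supervision it sees is the cluster assignment of each image.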
PROPOSED METHOD
➤ Why?
• ClusterFit: a lossy compression scheme
• Captures the essential visual invariances in the feature space
• Gives the ‘re-learned’ network an opportunity to learn features that are less sensitive to the original pre-training objective, making them more transferable
PROPOSED METHOD
➤ Notation
• Npre is trained on Dpre
• Ncf is trained on Dcf
• Dtar is the target dataset
PROPOSED METHOD
➤ Control Experiment using Synthetic Noise
• Adding varying amounts (p%) of uniform random label noise
• Npre: pre-train on the noisy labels, then freeze it and train linear classifiers
• Dpre = Dcf = ImageNet-1K
• Npre = Ncf = ResNet-50
• Dtar = ImageNet-1K, ImageNet-9K, iNaturalist
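The noise-injection step of this control experiment can be sketched as below (the function name and array sizes are illustrative, not from the paper):

```python
import numpy as np

def add_uniform_label_noise(labels, p, num_classes, seed=0):
    """Replace p% of the labels with classes drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    n_noisy = int(round(len(labels) * p / 100))
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    labels[idx] = rng.integers(0, num_classes, size=n_noisy)
    return labels

clean = np.zeros(1000, dtype=int)          # toy dataset, all class 0
noisy = add_uniform_label_noise(clean, p=50, num_classes=10)
print((noisy != clean).mean())
```

Note that because the replacement is uniform over all classes, a resampled label can coincide with the original one, so the effective corruption rate is slightly below p%.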
PROPOSED METHOD
➤ Control Experiment using Synthetic Noise
OUTLINE
➤ Authorship
➤ Background
➤ Proposed Method
➤ Experimental Results
➤ Conclusion
EXPERIMENTAL RESULTS
➤ Benchmarking
➤ Analysis of ClusterFit
EXPERIMENTAL RESULTS
➤ Compared Methods
• Distillation
• A weighted average of two loss functions:
• (a) cross-entropy with soft targets computed using Npre
• (b) cross-entropy with labels in weakly-supervised setup
• Prototype
• Instead of random cluster initialization, use the label information in Dcf to initialize the cluster centers
• Longer pre-training (Npre 2×)
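The distillation baseline's objective, a weighted sum of (a) cross-entropy against the teacher's softened predictions and (b) cross-entropy against the weak labels, can be sketched as below. The weight alpha and temperature T are illustrative hyper-parameters, not values from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled, numerically stable softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, hard_labels, alpha=0.5, T=2.0):
    p_student = softmax(student_logits, T)
    p_teacher = softmax(teacher_logits, T)
    # (a) cross-entropy with soft targets from the teacher N_pre
    soft_ce = -(p_teacher * np.log(p_student + 1e-12)).sum(-1).mean()
    # (b) cross-entropy with the weak (hard) labels
    rows = np.arange(len(hard_labels))
    hard_ce = -np.log(softmax(student_logits)[rows, hard_labels] + 1e-12).mean()
    return alpha * soft_ce + (1 - alpha) * hard_ce

rng = np.random.default_rng(0)
loss = distill_loss(rng.normal(size=(4, 5)), rng.normal(size=(4, 5)),
                    np.array([0, 1, 2, 3]))
print(loss)
```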
EXPERIMENTAL RESULTS
➤ Benchmarking
• Weakly-supervised images
• Weakly-supervised videos
• Self-supervised images
EXPERIMENTAL RESULTS
➤ Weakly-Supervised Images
• Dpre = Dcf = IG-ImageNet-1B
• Npre = Ncf = ResNet-50
• Dtar = ImageNet-1K, ImageNet-9K, Places365, iNaturalist
EXPERIMENTAL RESULTS
➤ Weakly-Supervised Images
• Results
• ImageNet-1K: gains are limited by the hand-crafted alignment of IG-ImageNet-1B labels with ImageNet-1K classes
EXPERIMENTAL RESULTS
➤ Weakly-Supervised Videos
• Dpre = Dcf = IG-Verb-19M
• Npre = Ncf = R(2+1)D-34 [1]
• Dtar = Kinetics, Sports1M, Something-Something V1
[1] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri: A Closer Look at Spatiotemporal Convolutions for Action Recognition. CVPR 2018
EXPERIMENTAL RESULTS
➤ Weakly-Supervised Videos
• Results
EXPERIMENTAL RESULTS
➤ Self-Supervised Images
• Dpre = Dcf = ImageNet-1K; proxy tasks: Jigsaw & Rotation
• Npre = Ncf = ResNet-50
• Dtar = VOC07, ImageNet-1K, Places205, iNaturalist
EXPERIMENTAL RESULTS
➤ Self-Supervised Images
• Results
EXPERIMENTAL RESULTS
➤ Self-Supervised Images
• Layer-wise results
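Layer-wise evaluation of this kind works by freezing the network, taking the features after each layer, and fitting a separate linear classifier per layer. A toy sketch (the random frozen layers and synthetic labels below are purely illustrative, standing in for a pre-trained ResNet-50 and a real target task):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 32))
y = (X[:, 0] > 0).astype(int)  # toy 2-class target task

# Three frozen "layers": random linear maps followed by ReLU.
weights = [rng.normal(size=(32, 32)) * 0.3 for _ in range(3)]

scores = []
h = X
for W in weights:
    h = np.maximum(h @ W, 0)  # frozen forward pass through one layer
    clf = LogisticRegression(max_iter=500).fit(h, y)  # linear probe
    scores.append(clf.score(h, y))
print(scores)
```

The per-layer accuracies then show which depth of the network is most useful for the target task; in the paper, this is how the layer-wise results are obtained.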
EXPERIMENTAL RESULTS
➤ Benchmarking
➤ Analysis of ClusterFit
EXPERIMENTAL RESULTS
➤ Relative model capacity of Npre and Ncf
• Dpre = IG-Verb-19M, Dcf = IG-Verb-62M
• Ncf = R(2+1)D-18 (33M parameters)
• Npre = R(2+1)D-18, R(2+1)D-34 (64M parameters)
EXPERIMENTAL RESULTS
➤ Relative model capacity of Npre and Ncf
• Results
EXPERIMENTAL RESULTS
➤ Unsupervised vs. Per-Label Clustering
• Per-label clustering:
• Cluster the videos belonging to each label into k_l clusters
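Per-label clustering can be sketched as below: instead of one global k-means, each weak label's examples are clustered separately, giving up to num_labels × k_l pseudo-classes. All sizes, the number of labels, and k_l are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 16))        # stand-in video features
weak_labels = rng.integers(0, 3, size=300)   # 3 illustrative weak labels
k_l = 4                                      # clusters per label

pseudo = np.empty(300, dtype=int)
for label in range(3):
    mask = weak_labels == label
    # k-means only over the examples carrying this weak label
    km = KMeans(n_clusters=k_l, n_init=10, random_state=0).fit(features[mask])
    # offset the cluster ids so pseudo-classes of different labels don't collide
    pseudo[mask] = label * k_l + km.labels_

print(len(np.unique(pseudo)))  # up to 3 * 4 = 12 pseudo-classes
```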
EXPERIMENTAL RESULTS
➤ Properties of Dpre
• Number of labels
• IG-Verb-62M (438 weak verb labels)
• Label number: 10, 30, 100, 438
• Reducing the number of labels implies reduced content diversity
EXPERIMENTAL RESULTS
➤ Properties of Dpre
• Results
OUTLINE
➤ Authorship
➤ Background
➤ Proposed Method
➤ Experimental Results
➤ Conclusion
CONCLUSION
➤ ClusterFit: first cluster the original feature space, then re-learn a new model on the cluster assignments
➤ Improves the generalization of weakly- and self-supervised representations