Source: speech.ee.ntu.edu.tw/~tlkagk/courses/ml_2016/lecture/semi (v3).pdf
Semi-supervised Learning
Introduction
• Supervised learning: $\{(x^r, \hat{y}^r)\}_{r=1}^{R}$
  • E.g. $x^r$: image, $\hat{y}^r$: class label
• Semi-supervised learning: $\{(x^r, \hat{y}^r)\}_{r=1}^{R}$ plus a set of unlabeled data $\{x^u\}_{u=R+1}^{R+U}$, usually with $U \gg R$
  • Transductive learning: the unlabeled data is the testing data
  • Inductive learning: the unlabeled data is not the testing data
• Why semi-supervised learning?
  • Collecting data is easy, but collecting "labelled" data is expensive
  • We do semi-supervised learning in our own lives
Why does semi-supervised learning help?
(Figure: a few labelled images, "cat" and "dog", next to a large pool of unlabeled images of cats and dogs.)
The distribution of the unlabeled data tells us something, usually under some assumptions. Who knows?
Outline
Semi-supervised Learning for Generative Model
Low-density Separation Assumption
Smoothness Assumption
Better Representation
Semi-supervised Learning for Generative Model
Supervised Generative Model
• Given labelled training examples $x^r \in C_1, C_2$, look for the most likely prior probability $P(C_i)$ and class-dependent probability $P(x|C_i)$
• $P(x|C_i)$ is a Gaussian parameterized by $\mu^i$ and a shared $\Sigma$
(Figure: two classes of points modelled by Gaussians $(\mu^1, \Sigma)$ and $(\mu^2, \Sigma)$.)
$$P(C_1|x) = \frac{P(x|C_1)\,P(C_1)}{P(x|C_1)\,P(C_1) + P(x|C_2)\,P(C_2)}$$
With $P(C_1)$, $P(C_2)$, $\mu^1$, $\mu^2$, $\Sigma$, we can compute the posterior and draw the decision boundary.
Semi-supervised Generative Model
• Given labelled training examples $x^r \in C_1, C_2$, look for the most likely prior probability $P(C_i)$ and class-dependent probability $P(x|C_i)$
• $P(x|C_i)$ is a Gaussian parameterized by $\mu^i$ and a shared $\Sigma$
• The unlabeled data $x^u$ help re-estimate $P(C_1)$, $P(C_2)$, $\mu^1$, $\mu^2$, $\Sigma$, which changes the decision boundary
(Figure: with the unlabeled points taken into account, the fitted Gaussians $(\mu^1, \Sigma)$ and $(\mu^2, \Sigma)$ shift.)
Semi-supervised Generative Model
• Initialization: $\theta = \{P(C_1), P(C_2), \mu^1, \mu^2, \Sigma\}$
• Step 1 (E): compute the posterior probability of the unlabeled data, $P_\theta(C_1|x^u)$, which depends on the model $\theta$
• Step 2 (M): update the model
$$P(C_1) = \frac{N_1 + \sum_{x^u} P(C_1|x^u)}{N} \qquad \mu^1 = \frac{\sum_{x^r \in C_1} x^r + \sum_{x^u} P(C_1|x^u)\,x^u}{N_1 + \sum_{x^u} P(C_1|x^u)} \qquad \ldots$$
($N$: total number of examples; $N_1$: number of examples belonging to $C_1$)
• Back to step 1
The algorithm converges eventually, but the initialization influences the result.
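Below is a minimal NumPy sketch of this EM loop, assuming two classes that share one covariance matrix; all names (`semi_supervised_gmm`, `x_lab`, ...) are illustrative, not from the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_gmm(x_lab, y_lab, x_unlab, n_iter=50):
    n_classes, n_lab, n_unlab = 2, len(x_lab), len(x_unlab)
    # Initialization: theta = {P(C1), P(C2), mu1, mu2, Sigma} from labelled data
    prior = np.array([(y_lab == i).mean() for i in range(n_classes)])
    mu = np.stack([x_lab[y_lab == i].mean(axis=0) for i in range(n_classes)])
    cov = np.cov(x_lab.T)
    for _ in range(n_iter):
        # Step 1 (E): posterior P(C_i | x^u) of each unlabeled example
        lik = np.stack([multivariate_normal.pdf(x_unlab, mu[i], cov)
                        for i in range(n_classes)], axis=1)
        post = prior * lik
        post /= post.sum(axis=1, keepdims=True)
        # Step 2 (M): update the model with labelled counts plus soft counts
        new_cov = np.zeros_like(cov)
        for i in range(n_classes):
            n_i = (y_lab == i).sum() + post[:, i].sum()
            prior[i] = n_i / (n_lab + n_unlab)
            mu[i] = (x_lab[y_lab == i].sum(axis=0) + post[:, i] @ x_unlab) / n_i
            d_lab = x_lab[y_lab == i] - mu[i]
            d_unlab = x_unlab - mu[i]
            new_cov += d_lab.T @ d_lab + (post[:, i, None] * d_unlab).T @ d_unlab
        cov = new_cov / (n_lab + n_unlab)  # shared covariance
    return prior, mu, cov
```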
Why?
• Maximum likelihood with labelled data only (closed-form solution):
$$\log L(\theta) = \sum_{x^r} \log P_\theta(x^r, \hat{y}^r), \qquad P_\theta(x^r, \hat{y}^r) = P_\theta(x^r|\hat{y}^r)\,P(\hat{y}^r)$$
• Maximum likelihood with labelled + unlabeled data (solved iteratively):
$$\log L(\theta) = \sum_{x^r} \log P_\theta(x^r, \hat{y}^r) + \sum_{x^u} \log P_\theta(x^u)$$
$$P_\theta(x^u) = P_\theta(x^u|C_1)\,P(C_1) + P_\theta(x^u|C_2)\,P(C_2)$$
($x^u$ can come from either $C_1$ or $C_2$)
where $\theta = \{P(C_1), P(C_2), \mu^1, \mu^2, \Sigma\}$.
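For reference, the closed-form solution in the labelled-data-only case is just the per-class maximum-likelihood estimate (a standard result, stated here for completeness):
$$P(C_1) = \frac{N_1}{N}, \qquad \mu^1 = \frac{1}{N_1}\sum_{x^r \in C_1} x^r, \qquad \Sigma = \frac{1}{N}\sum_{x^r} \bigl(x^r - \mu^{\hat{y}^r}\bigr)\bigl(x^r - \mu^{\hat{y}^r}\bigr)^T$$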
Semi-supervised Learning: Low-density Separation
"Black-or-white"
Self-training
• Given: labelled data set $\{(x^r, \hat{y}^r)\}_{r=1}^{R}$, unlabeled data set $\{x^u\}_{u=R+1}^{R+U}$
• Repeat:
  • Train model $f^*$ from the labelled data set (this step is independent of the model; see the sketch after this list)
  • Apply $f^*$ to the unlabeled data set to obtain pseudo-labels $\{(x^u, y^u)\}_{u=R+1}^{R+U}$
  • Remove a set of data from the unlabeled data set and add it to the labelled data set (how to choose this set remains an open question; you can also give each example a weight)
• Why doesn't this help for regression? The pseudo-label is the model's own output, so retraining on it leaves $f^*$ unchanged.
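A minimal sketch of the loop above, assuming a logistic-regression model and a confidence threshold as the (open) selection rule; all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(x_lab, y_lab, x_unlab, threshold=0.95, max_rounds=10):
    model = LogisticRegression()
    for _ in range(max_rounds):
        # Train model f* from the labelled data set
        model.fit(x_lab, y_lab)
        if len(x_unlab) == 0:
            break
        # Apply f* to the unlabeled data set to obtain pseudo-labels
        prob = model.predict_proba(x_unlab)
        conf, pseudo = prob.max(axis=1), prob.argmax(axis=1)
        picked = conf >= threshold  # one possible way to choose the data set
        if not picked.any():
            break
        # Move the chosen examples into the labelled data set
        x_lab = np.vstack([x_lab, x_unlab[picked]])
        y_lab = np.concatenate([y_lab, pseudo[picked]])
        x_unlab = x_unlab[~picked]
    return model
```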
Self-training
• Similar to semi-supervised learning for the generative model
• Hard label v.s. Soft label
Consider using a neural network: let $\theta^*$ (the network parameters) be learned from the labelled data, and suppose $f^{\theta^*}(x^u) = [0.7, 0.3]^T$ (70% class 1, 30% class 2).
• Hard label: the new target for $x^u$ is $[1, 0]^T$. "It looks like class 1, then it is class 1."
• Soft label: the new target for $x^u$ is $[0.7, 0.3]^T$. This doesn't work: the network already outputs exactly this target, so training on it changes nothing.
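An illustrative PyTorch check (not from the slides) of why the soft label is a fixed point; the numbers mirror the 0.7/0.3 example:

```python
import torch
import torch.nn.functional as F

# If the target equals the network's own output, the cross-entropy gradient
# with respect to the logits is zero, so training makes no progress.
logits = torch.tensor([[0.847, 0.0]], requires_grad=True)  # softmax -> ~[0.7, 0.3]
soft_target = F.softmax(logits.detach(), dim=1)            # the model's own output
loss = -(soft_target * F.log_softmax(logits, dim=1)).sum()
loss.backward()
print(logits.grad)  # ~zero: nothing to learn from the soft label
# A hard label (class 1, i.e. target [1, 0]) gives a non-zero gradient instead.
```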
Entropy-based Regularization
Let $y^u = f^{\theta^*}(x^u)$ be the network's output distribution for an unlabeled example, over 5 classes.
(Figure: three example distributions over classes 1-5. A spike on a single class: Good! A spike on a different single class: Good! A uniform distribution: Bad!)
The entropy of $y^u$ evaluates how concentrated the distribution $y^u$ is:
$$E(y^u) = -\sum_{m=1}^{5} y_m^u \ln y_m^u$$
For the two spiked distributions $E(y^u) = 0$; for the uniform one $E(y^u) = -\ln\frac{1}{5} = \ln 5$. We want the entropy on unlabeled data to be as small as possible, so add it to the loss:
$$L = \sum_{x^r} C(y^r, \hat{y}^r) + \lambda \sum_{x^u} E(y^u)$$
(the first sum is over the labelled data, the second over the unlabeled data)
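A minimal PyTorch sketch of this combined loss; `lam` and the tensor names are illustrative, with `logits_lab` / `logits_unlab` standing for the network outputs on the labelled and unlabeled batches:

```python
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits_lab, targets, logits_unlab, lam=0.1):
    # C(y^r, yhat^r): cross entropy on the labelled data
    supervised = F.cross_entropy(logits_lab, targets)
    # E(y^u) = -sum_m y_m^u ln y_m^u on the unlabeled data
    p = F.softmax(logits_unlab, dim=1)
    entropy = -(p * torch.log(p + 1e-12)).sum(dim=1).mean()
    return supervised + lam * entropy
```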
Outlook: Semi-supervised SVM
• Enumerate all possible labels for the unlabeled data
• Find the boundary that provides the largest margin and the least error
Thorsten Joachims, "Transductive Inference for Text Classification using Support Vector Machines", ICML, 1999
Semi-supervised Learning: Smoothness Assumption
"You are known by the company you keep"
Smoothness Assumption
• Assumption: "similar" $x$ has the same $\hat{y}$
• More precisely:
  • $x$ is not uniformly distributed.
  • If $x^1$ and $x^2$ are close in a high-density region (connected by a high-density path), then $\hat{y}^1$ and $\hat{y}^2$ are the same.
(Figure: a pinwheel-shaped distribution; $x^1$ and $x^2$ are connected by a high-density path, $x^3$ lies on a different arm. Source: http://hips.seas.harvard.edu/files/pinwheel.png)
• $x^1$ and $x^2$ have the same label; $x^2$ and $x^3$ have different labels, even though $x^2$ may be closer to $x^3$ in straight-line distance.
Smoothness Assumption
• Two examples can be "indirectly" similar, connected by stepping stones: directly they may not look similar at all, but a chain of unlabeled examples in between links them through high-density regions.
(The example is from the tutorial slides of Xiaojin Zhu. Image source: http://www.moehui.com/5833.html/5/)
Smoothness Assumption
• Classify astronomy vs. travel articles: unlabeled documents that share vocabulary act as stepping stones connecting each labelled article to others of the same class.
(The example is from the tutorial slides of Xiaojin Zhu.)
Cluster and then Label
(Figure: the data falls into Cluster 1, 2, and 3. Cluster 1 contains the labelled Class 1 examples and is labelled Class 1; Clusters 2 and 3 contain the labelled Class 2 examples and are labelled Class 2.)
Then use all the data to learn a classifier as usual.
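A minimal sketch of this procedure, assuming K-means clustering and a majority vote inside each cluster; all names and the cluster count are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(x_lab, y_lab, x_unlab, n_clusters=3):
    x_all = np.vstack([x_lab, x_unlab])
    cluster = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(x_all)
    y_all = np.zeros(len(x_all), dtype=int)
    for c in range(n_clusters):
        members = cluster == c
        votes = y_lab[members[:len(x_lab)]]  # labelled members of this cluster
        if len(votes):
            y_all[members] = np.bincount(votes).argmax()  # majority label
    return x_all, y_all  # then learn a classifier on all the data as usual
```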
Graph-based Approach
• How do we know that $x^1$ and $x^2$ are close in a high-density region (connected by a high-density path)?
• Represent the data points as a graph
• Graph representation is sometimes natural, e.g. hyperlinks between webpages, citations between papers
• Sometimes you have to construct the graph yourself
Graph-based Approach: Graph Construction
• Define the similarity $s(x^i, x^j)$ between $x^i$ and $x^j$
• Add edges:
  • K Nearest Neighbor
  • ε-Neighborhood
• Edge weight is proportional to $s(x^i, x^j)$
• Gaussian Radial Basis Function: $s(x^i, x^j) = \exp\left(-\gamma\,\|x^i - x^j\|^2\right)$
(The image is from the tutorial slides of Amarnag Subramanya and Partha Pratim Talukdar.)
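A minimal sketch of graph construction, combining K-nearest-neighbor edges with the Gaussian RBF similarity above (`k` and `gamma` are illustrative values):

```python
import numpy as np

def build_graph(x, k=5, gamma=1.0):
    n = len(x)
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # ||x^i - x^j||^2
    s = np.exp(-gamma * d2)                              # RBF similarity
    w = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]  # k nearest neighbors, excluding i
        w[i, nbrs] = s[i, nbrs]
    return np.maximum(w, w.T)  # symmetrize: keep an edge if either end picked it
```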
Graph-based Approach
(Figure: a labelled Class 1 example propagates its label through the connected component of the graph; points with no path to it are unaffected.)
Propagate through the graph: the labelled data influence their neighbors, and the neighbors' neighbors.
Graph-based Approach
• Define the smoothness of the labels on the graph, for all data (no matter labelled or not):
$$S = \frac{1}{2}\sum_{i,j} w_{i,j}\,\bigl(y^i - y^j\bigr)^2$$
A smaller $S$ means a smoother labelling.
Example: a 4-node graph with edge weights $w_{1,2}=2$, $w_{1,3}=3$, $w_{2,3}=1$, $w_{3,4}=1$.
• With $y^1=1,\ y^2=1,\ y^3=1,\ y^4=0$: only the $x^3$-$x^4$ edge connects different labels, so $S = 1$ (smooth).
• With $y^1=0,\ y^2=1,\ y^3=1,\ y^4=0$: $S = 2 + 3 + 1 = 6$ (not smooth).
Graph-based Approach
• The smoothness can be written compactly as
$$S = \frac{1}{2}\sum_{i,j} w_{i,j}\,\bigl(y^i - y^j\bigr)^2 = \mathbf{y}^T L\,\mathbf{y}$$
where $\mathbf{y} = \bigl[\cdots\, y^i \,\cdots\, y^j \,\cdots\bigr]^T$ is an (R+U)-dimensional vector over all data and $L = D - W$ is the (R+U) x (R+U) Graph Laplacian, with $W$ the edge-weight matrix and $D$ the diagonal degree matrix. For the example graph:
$$W = \begin{bmatrix} 0 & 2 & 3 & 0 \\ 2 & 0 & 1 & 0 \\ 3 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad D = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
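A quick NumPy check of $S = \mathbf{y}^T L\,\mathbf{y}$ on this example graph:

```python
import numpy as np

W = np.array([[0, 2, 3, 0],
              [2, 0, 1, 0],
              [3, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))  # degree matrix
L = D - W                   # Graph Laplacian

for y in (np.array([1., 1., 1., 0.]), np.array([0., 1., 1., 0.])):
    print(y, "S =", y @ L @ y)  # prints S = 1.0, then S = 6.0
```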
Graph-based Approach
• Use the smoothness as a regularization term. Since the predictions $y$ depend on the network parameters, add $S$ to the loss:
$$L = \sum_{x^r} C(y^r, \hat{y}^r) + \lambda S$$
• The smoothness can be computed on the output, or on the embeddings at any hidden layer: each layer's representation should be smooth on the graph.
J. Weston, F. Ratle, and R. Collobert, "Deep learning via semi-supervised embedding," ICML, 2008
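A minimal PyTorch sketch of this regularized loss; `L_graph` is the graph Laplacian over the batch (as a torch tensor), `outputs_all` holds the network outputs for labelled + unlabeled examples, and all names and `lam` are illustrative:

```python
import torch
import torch.nn.functional as F

def graph_regularized_loss(logits_lab, targets, outputs_all, L_graph, lam=0.1):
    supervised = F.cross_entropy(logits_lab, targets)
    # S = y^T L y, summed over every output dimension
    smooth = torch.einsum('id,ij,jd->', outputs_all, L_graph, outputs_all)
    return supervised + lam * smooth
```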
Semi-supervised Learning: Better Representation
"Remove the dross and keep the essence; simplify the complicated"
Looking for Better Representation
• Find the latent factors behind the observation
• The latent factors (usually simpler) are better representations
(Figure: the observation mapped to a better representation, the latent factor. Covered in the unsupervised learning part of the course.)
Reference
Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (eds.), Semi-Supervised Learning, MIT Press: http://olivier.chapelle.cc/ssl-book/
Acknowledgement
• Thanks to the students who pointed out typos on the slides.