TRANSCRIPT
Applied Machine Learning
Annalisa Marsico
OWL RNA Bioinformatics group
Max Planck Institute for Molecular Genetics / Free University of Berlin
27 May, SoSe 2015
Supervised vs Unsupervised Learning
Typical Scenario
We have an outcome, quantitative (the price of a stock, a risk factor, ...) or categorical (heart attack: yes or no), that we want to predict based on some features. We have a training set of data and build a prediction model, a learner, able to predict the outcome of new, unseen objects.
- Supervised learning: the presence of the outcome variable guides the learning process
- Unsupervised learning: we have only features, no outcome
  - The task is rather to describe the data
Supervised Learning
• Find the connection between two sets of observations: the input set and the output set
• Given {(x_n, y_n)}, where x_n ∈ ℝ^P (the feature space), find a function f (from a set of hypotheses H) such that ∀n ∈ [1, …, N]:
y_n = f(x_n)
f can be a linear function, a polynomial function, a classifier (e.g. logistic regression), a neural network, an SVM, a random forest, …
Unsupervised learning
• Find a structure in the data
• Given measurements / observations / features X = {x_n}, find a model M such that p(M|X) is maximized, i.e. find the process that is most likely to have generated the data
Gaussian Mixture Models
p(x) = Σ_k π_k N(x | μ_k, Σ_k)
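To make the density concrete, it can be evaluated directly by summing weighted Gaussian components. A minimal NumPy sketch (the function names and the two-component 2-D example values are our own illustrations, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x | mu, cov)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def mixture_density(x, priors, means, covs):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi_k * gaussian_pdf(x, mu_k, cov_k)
               for pi_k, mu_k, cov_k in zip(priors, means, covs))

# Two equally weighted, well-separated 2-D components (illustrative values)
priors = [0.5, 0.5]
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
p = mixture_density(np.zeros(2), priors, means, covs)
```

At a component mean the density is dominated by that component's term, weighted by its prior π_k.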
Gaussian Mixture models
• Choose a number of clusters K
• Initialize the K priors π_k, the K means μ_k and the covariances Σ_k
• Repeat until convergence:
  • compute the probability p(k|x_n) of each data point x_n to belong to cluster k
  • update the parameters of the model (cluster priors π_k, means μ_k and covariances Σ_k) by taking the weighted average number / location / variance of all points, where the weight of point x_n is p(k|x_n)
EM algorithm
E-step: compute the responsibilities p(k|x_n) given the current parameters
M-step: re-estimate the priors, means and covariances from the responsibility-weighted data
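The E/M loop above can be sketched end to end in NumPy. A hedged illustration (the farthest-point initialization and the small diagonal term added to the covariances for numerical stability are our own choices, not from the slides):

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture with full covariances.
    Returns priors pi, means mu, covariances Sigma, responsibilities R."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: farthest-point means (our choice), shared covariance,
    # uniform priors.
    mu = [X[rng.integers(n)]]
    for _ in range(1, K):
        dists = np.min([np.sum((X - m) ** 2, axis=1) for m in mu], axis=0)
        mu.append(X[np.argmax(dists)])
    mu = np.array(mu)
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities p(k | x_n) ∝ pi_k N(x_n | mu_k, Sigma_k)
        R = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            mahal = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma[k]), diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma[k]))
            R[:, k] = pi[k] * np.exp(-0.5 * mahal) / norm
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted averages; the weight of point x_n is p(k | x_n)
        Nk = R.sum(axis=0)
        pi = Nk / n                          # cluster priors
        mu = (R.T @ X) / Nk[:, None]         # weighted means
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, R
```

On two well-separated clusters this recovers the component means and near-equal priors.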
Semi-supervised learning
• Middle-ground between supervised and unsupervised learning
• We are in the following situation:
– A set of instances X = {x_n} drawn from some unknown probability distribution p(X)
– We wish to learn a target function f : X → Y (a learner) from a set H of possible hypotheses, given:
  L = {(x_1, y_1), …, (x_l, y_l)} labeled examples
  U = {x_{l+1}, …, x_n} unlabeled examples
  Usually l ≪ n
– We wish to find the hypothesis (function) with the lowest error:
  f̂ = argmin_{h ∈ H} E(h), where E(h) ≡ P[h(x) ≠ f(x)]
Why the name?
Supervised learning
• Classification, regression: {(x_{1:l}, y_{1:l})}
Semi-supervised classification / regression: {(x_{1:l}, y_{1:l})}, {x_{l+1:n}}
Semi-supervised clustering: {(x_{1:l}, y_{1:l})}, {x_{l+1:n}}
Unsupervised learning
• Clustering, mixture models: {x_{1:n}}
Why bother?
• There are cases where labeled examples are really few
• People want better performance ‘for free’
• Unlabeled data are cheap; labeled data might be hard to get
  – Drug design, e.g. the effect of a drug on protein activity
  – Genomics: real binding sites vs. non-real binding sites
  – SNP data for disease classification
Hard-to-get labels
Image categorization of “eclipse”: one could in principle label 1000+ images manually
Hard-to-get labels
Nonetheless, the number of images to classify might be huge.
We will show how to improve classification by using the unlabeled examples
Semi-supervised mixture models
• First approach: modify the posterior probability p(k|x_n) of each data point in the E-step (set p(k|x_n) = 1 if the point is labeled as positive, 0 otherwise). Example on the blackboard (miRNA promoters)
• Second approach: train a classifier on the labeled examples and estimate the parameters (M-step). Compute the ‘expected’ class for the unlabeled examples (E-step). Example on the blackboard.
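A minimal sketch of the first approach's E-step modification: run the usual E-step, then clamp the responsibilities of labeled points (the function name and the `-1` convention for unlabeled points are our own assumptions):

```python
import numpy as np

def clamp_responsibilities(R, labels):
    """First approach from the slide: after the usual E-step, overwrite the
    posterior p(k | x_n) for labeled points, so the M-step treats them as
    certain members of their known component.
    labels[n] is the known component index, or -1 if x_n is unlabeled
    (the -1 convention is an assumption of this sketch)."""
    R = R.copy()
    for i, y in enumerate(labels):
        if y >= 0:
            R[i, :] = 0.0       # p(k | x_n) = 0 for all other components
            R[i, y] = 1.0       # p(k | x_n) = 1 for the labeled component
    return R
```

The rest of the EM loop is unchanged; the clamped rows simply enter the weighted averages of the M-step with weight 1 for their known cluster.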
How can unlabeled data ever help?
This is only one of the many ways to use unlabeled data
Semi-supervised regression / classification
Goal
Use both labeled and unlabeled data to build better learners than using either kind alone
– Boost the performance of a learning algorithm when only a small set of labeled data is available
Example 1 – web page classification
We want a system which automatically classifies web pages into faculty (academic) web pages vs other pages
– Labeled examples: pages labeled by hand
– Unlabeled examples: millions of pages on the web
– Nice feature: web pages can have multiple representations (they can be described by different kinds of information)
Example 1: Redundantly sufficient features
[Figure: a web-page example; the anchor text “my advisor” links to the page of Dr. Bernhard Renard]
Setting: redundant features, i.e. description of each example can be partitioned into distinct views
Co-training, multi-view learning, Co-regularization – Main idea
In some settings data features are redundant
– We can train different classifiers on disjoint feature sets
– The classifiers should agree on unlabeled examples
– We can use unlabeled data to constrain the joint training of both classifiers
– Different algorithms differ in the way they constrain the classifiers
The Co-training algorithm – Variant 1
Assumptions:
– Either view of the features is sufficient for the learning task
– Compatibility assumption (a strong one): the classifiers in each view agree on the labels of most unlabeled examples
– Independence assumption: the views are independent given the class labels (conditional independence)
The Co-training algorithm – Variant 1
Given:
• Labeled data L
• Unlabeled data U
Loop:
• Train g1 (the hyperlink classifier) using L
• Train g2 (the page classifier) using L
• Sample N1 points from U; let g1 label p1 positives and n1 negatives
• Sample N2 points from U; let g2 label p2 positives and n2 negatives
• Add the self-labeled (N1 + N2) examples to L
A. Blum & T. Mitchell, 1998, ‘Combining Labeled and Unlabeled Data with Co-training’
The Co-training algorithm – Variant 2
Given:
• Labeled data L
• Unlabeled data U
Loop:
• Train g1 (the hyperlink classifier) using L
• Train g2 (the page classifier) using L
• Sample N1 points from U; let g1 label p1 positives and n1 negatives
• Sample N2 points from U; let g2 label p2 positives and n2 negatives
• Add the self-labeled n1 < N1 examples (those where g1 is most confident) and n2 < N2 examples (those where g2 is most confident) to L
A. Blum & T. Mitchell, 1998, ‘Combining Labeled and Unlabeled Data with Co-training’
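To make the loop concrete, here is a toy sketch of Variant 2. This is not the Blum & Mitchell implementation: the nearest-centroid classifiers standing in for g1/g2, the margin-based confidence, and all names and parameters are our own illustrative choices.

```python
import numpy as np

class CentroidClassifier:
    """Toy stand-in for g1/g2: nearest class centroid, with the margin
    between the two centroid distances as a confidence score."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self
    def margin(self, X):
        d0 = np.linalg.norm(X - self.c0, axis=1)
        d1 = np.linalg.norm(X - self.c1, axis=1)
        return d0 - d1                    # > 0 -> class 1; |margin| = confidence
    def predict(self, X):
        return (self.margin(X) > 0).astype(int)

def co_train(X1, X2, y, labeled, n_rounds=5, per_round=2):
    """Co-training, Variant 2: g1 sees only view X1, g2 only view X2.
    Each round both are retrained on the labeled pool L; then each one
    self-labels the unlabeled points it is most confident about and adds
    them to L."""
    y = y.copy()                          # only entries listed in `labeled` are trusted
    labeled = set(labeled)
    unlabeled = set(range(len(y))) - labeled
    for _ in range(n_rounds):
        L = sorted(labeled)
        g1 = CentroidClassifier().fit(X1[L], y[L])
        g2 = CentroidClassifier().fit(X2[L], y[L])
        for g, X in ((g1, X1), (g2, X2)):
            U = sorted(unlabeled)
            if not U:
                break
            conf = np.abs(g.margin(X[U]))
            for i in (U[j] for j in np.argsort(conf)[-per_round:]):
                y[i] = g.predict(X[i:i + 1])[0]   # self-labeled example
                labeled.add(i)
                unlabeled.discard(i)
    L = sorted(labeled)
    return CentroidClassifier().fit(X1[L], y[L]), CentroidClassifier().fit(X2[L], y[L])
```

Starting from a single labeled example per class, the self-labeled pool grows round by round, and each view's classifier ends up trained on far more examples than were originally labeled.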
The Co-training algorithm – Results
Web-page classification task
– Experiment: 1051 web pages
– 263 (25%) left out (randomly) for testing
– 12 of the remaining 788 pages are labeled (L)
– 5 experiments conducted using different test + training splits
A. Blum & T. Mitchell, 1998
[Figure: plot of the error on the test set vs. the number of iterations; thanks to co-training, the errors converge after a while]
The Co-training algorithm – Intuition and limitations
• If the hyperlink classifier finds a page it is highly confident about, then it can share it with the other classifier (the page classifier). This is how they benefit from each other.
• Starting from a set of labeled examples, the two classifiers co-train each other at each iteration (this has been described as a ‘greedy’ EM)
• Limitation: even though each classifier picks the examples it is most confident about, it ignores the other view when doing so
Multi-view learning (Co-regularization)- Motivation
• It relaxes the compatibility assumption
  – If the classifiers do not agree on the unlabeled examples, then we have noisy data
  – What is a way to reduce noise (variance) in the data?
Multi-view learning (Co-regularization)
A framework where a classifier is learnt in each view through some form of multi-view regularization
– We are in the case where g1(x) ≠ g2(x)
– Joint regularization to minimize the disagreement between them
⟨θ1, θ2⟩ ← argmin_(θ1, θ2) [ Σ_{i ∈ L} (y_i − g1(x_i; θ1))²
            + Σ_{i ∈ L} (y_i − g2(x_i; θ2))²
            + λ Σ_{i ∈ U} (g1(x_i; θ1) − g2(x_i; θ2))² ]
The last sum is the penalization term; λ weights the penalty for disagreement on the unlabeled points.
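For linear models, this joint objective can be minimized by plain gradient descent. A sketch under assumptions the slide leaves open: each view's model is linear, g_v(x; θ_v) = x·θ_v, the losses are squared errors as written, and λ is the weight of the penalization term.

```python
import numpy as np

def co_regularized_fit(X1_l, X2_l, y, X1_u, X2_u, lam=1.0, lr=0.01, n_iter=2000):
    """Gradient descent on
        sum_L (y_i - g1(x_i))^2 + sum_L (y_i - g2(x_i))^2
        + lam * sum_U (g1(x_i) - g2(x_i))^2
    for linear views g_v(x) = x @ theta_v (the model class and the optimizer
    are our choices; the slide fixes neither)."""
    t1 = np.zeros(X1_l.shape[1])
    t2 = np.zeros(X2_l.shape[1])
    for _ in range(n_iter):
        r1 = X1_l @ t1 - y                 # labeled residuals, view 1
        r2 = X2_l @ t2 - y                 # labeled residuals, view 2
        d = X1_u @ t1 - X2_u @ t2          # disagreement on unlabeled points
        t1 -= lr * 2 * (X1_l.T @ r1 + lam * X1_u.T @ d)
        t2 -= lr * 2 * (X2_l.T @ r2 - lam * X2_u.T @ d)
    return t1, t2
```

Increasing λ penalizes disagreement on the unlabeled points more strongly, pulling the two view-specific models toward each other.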
Co-regularization – Intuition and Limitations
• Builds on the idea that instead of making g1 and g2 agree on examples afterwards, their agreement is built into the objective function we are optimizing
• The unlabeled data are incorporated into the regularization term
• The algorithms are not greedy
• One can decide how much compatibility to enforce, i.e. how much to penalize disagreement. Any idea how?
• Limitation: no good implementations so far
Essential ingredients for Co-training / Co-regularization - Summary
• A large number of unlabeled examples
• Multiple views (particularly suitable for biology)
• Conditional independence
• A good way to solve your optimization problem when you have joint regularization
Semi-supervised learning in Biology
• Regulatory element prediction (e.g. promoters)
• Protein function prediction
• Diseases hard to diagnose (few labels)
• Long non-coding RNA function prediction
• miRNA target prediction, …