Efficient Learning with Active Learning, Auxiliary Information and Multiple
Annotators
PhD Proposal
Quang Nguyen
Committee: Dr. Milos Hauskrecht (Advisor)
Dr. Janyce Wiebe, Dr. Gregory Cooper
Dr. Jingtao Wang
Outline
• Introduction
• Learning with auxiliary information
– Framework
– Experiments
• Learning with multiple annotators
– Framework
– Experiments
• Future work
Supervised Learning Framework
[Figure: supervised learning pipeline: unlabeled data (patients) is annotated (patient labels), a model (classifier) is learned, and the model makes predictions (disease or not) for new examples (new patients)]
• Objective: build an efficient learning framework that
– Generalizes well on future data
– Works with a limited amount of annotated training data
Data Annotation
• Labeling often requires human experts → time-consuming and costly
• How to reduce the number of examples to label?
→ Learning with auxiliary information (first part of the thesis)
• One annotator may not label all examples
→ Multi-annotator learning (second part of the thesis)
[Figure: patient records (labs, medications, notes, ...) are labeled with diagnoses (disease/no disease class labels) to form the training data]
Outline
• Introduction
• Learning with auxiliary information
– Framework
– Experiments
• Learning with multiple annotators
– Framework
– Experiments
• Future work
Learning with Auxiliary Information
• How to reduce the number of examples to label?
• Active Learning: select the most informative examples to label
• Can we obtain more useful information from selected examples?
• Our solution: ask a human expert to provide, in addition to class labels, his/her certainty in the label decision and incorporate this information into the learning process
• Certainty can be represented in terms of
– Probability: e.g. probability of having disease p = 0.85
– Ordinal category: e.g. strong, medium or weak belief in disease
• We study and propose methods to work with each type of certainty information: probability and ordinal categories
Learning with Auxiliary Information (cont’d)
• Cost of auxiliary information is insignificant compared to the overall labeling cost
– Example: 5 minutes to review an electronic health record (EHR), few seconds to give the auxiliary label
• Orthogonal to Active Learning: AL selects examples to label, we obtain more useful information from those selected
Opportunity: combine these approaches (proposed future work)
Traditional Classification Problem
[Figure: training pairs (x1, y1), ..., (xN, yN), where each patient record x (labs, medications, etc.) carries a class label y = 1/0 (disease/no disease), are fed to a learner that outputs a classifier]
Learning with Auxiliary Information
[Figure: training pairs (x1, y1 + p1), ..., (xN, yN + pN), where each patient record x carries a class label (disease/no disease) plus a certainty label (certainty in disease), are fed to a learner that outputs a classifier]
Learning with Auxiliary Probabilistic Information: Linear Regression
[Figure: training pairs (x1, p1), ..., (xN, pN), where each patient record x (labs, medications, etc.) is paired with a probability score p (certainty in disease), are fed to a regression learner f: X → p]
Problem: the predicted probability p may not be in [0,1]
LinRaux: linear regression with auxiliary information
Learning with Auxiliary Probabilistic Information: Logistic Transformation
[Figure: training pairs (x1, p1), ..., (xN, pN) are fed to a regression learner f: X → t(p), where the logistic transformation keeps predictions in [0,1]]
t(p) = \log\frac{p}{1-p}, \qquad p = \frac{1}{1+e^{-t}} \in [0,1]
LogRaux: logistic regression with auxiliary information
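A minimal sketch of the LogRaux idea in Python (an illustrative assumption; the proposal does not prescribe an implementation, and scikit-learn's Lasso stands in for the L1-regularized regression used in the experiments):

import numpy as np
from sklearn.linear_model import Lasso

def fit_lograux(X, p, alpha=0.01, eps=1e-6):
    # Regress on the logit-transformed certainty scores t(p) = log(p / (1 - p))
    p = np.clip(p, eps, 1 - eps)   # keep the logit finite at p = 0 or 1
    t = np.log(p / (1 - p))
    return Lasso(alpha=alpha).fit(X, t)

def predict_proba(model, X):
    # Map predictions back to [0, 1] through the sigmoid p = 1 / (1 + e^{-t})
    return 1.0 / (1.0 + np.exp(-model.predict(X)))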
Learning with Auxiliary Information: Noise Issue
• Human certainty estimates are often noisy
– certainty score p may be inconsistent
• Regression relies on exact values of p
→ sensitive to noise
Learning with Auxiliary Information: Noise Issue
[Figure: learning curves for LogR (logistic regression with binary class labels), LinRaux (linear regression with certainty labels) and LogRaux (logistic regression with certainty labels)]
No noise: LinRaux and LogRaux clearly outperform LogR
With noise: LinRaux and LogRaux are no better than LogR
Solution?
Modeling pairwise orders
• Observation: Certainty scores let us order examples
• Idea: build a discriminant projection f(x) that respects this order
• Minimize the number of violated pairwise order constraints
• Modeling pairwise orders instead of relying on exact values of p
Hypothesis: learning is less sensitive to noise
[Figure: examples ordered along the projection f(x)]
Learning with Class and Pairwise Order Constraints
• Modeling pairwise orders: adapt SVM Rank (Herbrich 2000)
• Combining class and certainty information
– Optimize:
\min_{\mathbf{w},b}\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i,j:\, p_i > p_j} \xi_{ij} + B \sum_{i=1}^{N} \eta_i
(the C term penalizes violated pairwise order constraints; the B term penalizes violated class constraints)
– Pairwise order constraints:
\forall i,j:\ p_i > p_j:\quad \mathbf{w}^T(\mathbf{x}_i - \mathbf{x}_j) \ge 1 - \xi_{ij}, \qquad \xi_{ij} \ge 0
– Class constraints:
\forall i:\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \eta_i, \qquad \eta_i \ge 0
Note: (1) constants B and C regularize the trade-off between class and auxiliary information; (2) number of constraints = O(N^2)
SVMCombo
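A sketch of the SVMCombo optimization in its hinge-loss form. The use of cvxpy is an assumption for illustration, not part of the proposal; for large N the O(N^2) pairwise terms would need subsampling:

import numpy as np
import cvxpy as cp

def fit_svm_combo(X, y, p, B=1.0, C=1.0):
    # X: (N, d) features; y: (N,) labels in {-1, +1}; p: (N,) certainty scores
    N, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    # Class constraints as hinge losses: max(0, 1 - y_i (w^T x_i + b))
    class_loss = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))
    # Pairwise order constraints: for p_i > p_j we want w^T (x_i - x_j) >= 1
    pairs = [(i, j) for i in range(N) for j in range(N) if p[i] > p[j]]
    order_loss = sum(cp.pos(1 - (X[i] - X[j]) @ w) for i, j in pairs)
    objective = 0.5 * cp.sum_squares(w) + C * order_loss + B * class_loss
    cp.Problem(cp.Minimize(objective)).solve()
    return w.value, b.value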
Experimental Setup: UCI Data
• 5 UCI data sets with continuous outputs
• Ailerons, Concrete, Bank8, Housing, Pol
• Generated labels
• Certainty labels: by normalizing continuous outputs
• Binary class labels: by setting a threshold on certainty labels
• Ratios of positive examples
• 10%, 25% and 50%
• Noise added to certainty labels
• 4 levels of noise-to-signal ratio (none, weak, moderate, strong), generated as 0%, 5%, 15%, 30% × N(0,1), respectively
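A short sketch of this label-generation protocol; the min-max normalization and quantile threshold are assumptions consistent with the description above:

import numpy as np

def make_labels(z, pos_ratio=0.25, noise_level=0.05, seed=0):
    # z: continuous outputs of a UCI regression data set
    rng = np.random.default_rng(seed)
    p = (z - z.min()) / (z.max() - z.min())      # certainty labels in [0, 1]
    threshold = np.quantile(p, 1 - pos_ratio)    # e.g. top 25% become positive
    y = np.where(p > threshold, 1, -1)           # binary class labels
    # Add noise at the chosen level: 0%, 5%, 15% or 30% of N(0, 1)
    p_noisy = np.clip(p + noise_level * rng.standard_normal(len(p)), 0, 1)
    return y, p_noisy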
Experimental Setup
• Models
– Trained with only class labels
• LogR: logistic regression with lasso (L1) regularization
• SVM: standard linear SVM
– Trained with only certainty labels
• LinRaux: linear regression with L1 regularization
• LogRaux: logistic regression with L1 regularization
– Trained with both class and certainty labels
• SVM-Combo: SVM with 2 hinge losses for class and pairwise order constraints
• Evaluation
– Training examples were randomly sampled from the training set
– The training/testing process was repeated 100 times
– Average AUC and 95% confidence intervals were recorded
Result: ‘Concrete’ data with Weak Noise
• Methods trained with auxiliary labels (svmCombo, LinRaux, LogRaux) consistently outperform standard binary classifiers (SVM, LogR)
• Regression methods (LinRaux, LogRaux) are comparable with svmCombo
Result: ‘Concrete’ data with Moderate Noise
• svmCombo is robust to noise and outperforms other methods
• LinRaux, LogRaux start to suffer from noise, but still better than standard binary classifiers (SVM, LogR)
Result: ‘Concrete’ data with Strong Noise
• svmCombo is very robust to noise and still consistently outperforms other methods
• Regression methods (LinRaux, LogRaux) suffer from strong noise
Experimental Results: UCI Data (Cont’d)
• Auxiliary information helps to learn better models with less training data
• svmCombo is more robust to noise than regression methods
Experimental Results: UCI Data (Cont’d)
• Auxiliary information helps to learn better models with less training data
• svmCombo is more robust to noise than regression methods
Experiments: Unbalanced Data
• Challenge: in many applications data are often unbalanced (e.g. in medicine positive examples are usually rare)
Does certainty information help?
• Auxiliary information is especially useful when data are unbalanced
[Figure: results for positive-example ratios of 50%, 25% and 10%]
Experiments: Unbalanced Data (cont’d)
Auxiliary information is especially useful when data are unbalanced
Experiments: Unbalanced Data (cont’d)
Auxiliary information is especially useful when data are unbalanced
Learning with Auxiliary Ordinal Categories
• Certainty labels can be expressed in terms of
– Probability: e.g. probability of having disease p = 0.85
– Ordinal categories: e.g. strong, medium or weak belief in having disease
• Regression methods do not work with ordinal categories. What to do?
• Can we reduce the number of constraints for SVMCombo? (O(N^2), slow if N is large)
Application: HIT Alert
Heparin-induced thrombocytopenia (HIT):
• A life-threatening condition that may develop when patients are treated with heparin
Labeling:
• For each patient case we asked the expert 3 questions
– Do you agree with raising an alert on HIT or not? Yes/No => used as binary class label
– How strongly does the clinical evidence indicate that the patient has HIT? Score from 0 to 100 => used as auxiliary probability
– How strongly do you agree with the alert? 4 categories: strongly-disagree, weakly-disagree, weakly-agree and strongly-agree => used as auxiliary categories
Regression with Local Search
• 4 categories: strongly-disagree, weakly-disagree, weakly-agree and strongly-agree
• Regression methods require numeric values as input
Idea: search for a mapping of 4 categories to 4 numeric values, that maximizes AUC when applying regression (e.g. LinRaux)
• Local search algorithm:
– Initiate a set of mapping values for categories, e.g. 0, 1, 2, 3
– Repeat: (1) move mapped points left/right by a distance d; (2) train LinRaux on the local mapping solution; (3) keep the local solution that maximizes AUC
– Until: a maximum number of iterations n is reached, or AUC improves by less than ε
[Figure: four category points mapped onto the number line at 0, 1, 2, 3]
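A minimal sketch of the local search above; scoring candidate mappings by training AUC is a simplification (a held-out set would be used in practice), and the default values of d, n and ε are illustrative:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import roc_auc_score

def local_search(X, cats, y, d=0.25, n=50, eps=1e-4):
    # cats: (N,) ordinal categories coded 0..3; y: (N,) binary class labels
    vals = np.array([0.0, 1.0, 2.0, 3.0])    # initial mapping of the 4 categories
    best_auc = -np.inf
    for _ in range(n):
        improved = False
        for c in range(len(vals)):
            for step in (-d, d):             # move each mapped point left/right
                cand = vals.copy()
                cand[c] += step
                scores = Lasso(alpha=0.01).fit(X, cand[cats]).predict(X)
                auc = roc_auc_score(y, scores)
                if auc > best_auc + eps:     # keep the mapping that maximizes AUC
                    best_auc, vals, improved = auc, cand, True
        if not improved:                     # AUC no longer improves by > eps
            break
    return vals, best_auc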
SVM with ordinal categories
[Figure: examples x1..x4, one per category, projected onto f(x) = w^T x with category boundaries b1, b2, b3 and margins b_j ± 1]
Idea is based on SVM regression (Chu et al. '05): one constraint for each pair of an example and a boundary between categories
Augmented dataset:
(x1,-1), (x2,1), (x3,1), (x4,1)
SVM with ordinal categories
[Figure: boundary b2 with margins b2 ± 1 separates categories 1-2 from categories 3-4]
Augmented dataset:
(x1,-1), (x2,1), (x3,1), (x4,1)
(x1,-1), (x2,-1), (x3,1), (x4,1)
SVM with ordinal categories
[Figure: boundary b3 with margins b3 ± 1 separates categories 1-3 from category 4]
Augmented dataset:
(x1,-1), (x2,1), (x3,1), (x4,1)
(x1,-1), (x2,-1), (x3,1), (x4,1)
(x1,-1), (x2,-1), (x3,-1), (x4,1)
SVM with ordinal categories
[Figure: examples from Category 1 (strongly-disagree) through Category 4 (strongly-agree) projected onto f(x) = w^T x with boundaries b1, b2, b3 and margins b_j ± 1]
\min_{\mathbf{w},c,b_j,\eta_i,\xi_{jk}}\; \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \eta_i + B \sum_{j=1}^{r-1} \sum_{k=1}^{n_j} \xi_{jk}
Class constraints:
\forall i = 1..N:\quad y_i(\mathbf{w}^T\mathbf{x}_i + c) \ge 1 - \eta_i
Ordinal category constraints, for each boundary j = 1..r-1:
\forall \mathbf{x}_i \in \text{categories } 1..j:\quad \mathbf{w}^T\mathbf{x}_i \le b_j - 1 + \xi_{jk}
\forall \mathbf{x}_i \in \text{categories } j+1..r:\quad \mathbf{w}^T\mathbf{x}_i \ge b_j + 1 - \xi_{jk}
where r = number of categories and n_j = number of examples in category j
Note: number of constraints = O(rN) => linear in N
SVMCombo_cat
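The augmented data sets shown on the previous slides can be built mechanically; a small sketch (function and variable names are illustrative):

import numpy as np

def augment_by_boundary(X, cats, r=4):
    # X: (N, d) features; cats: (N,) ordinal categories in {1, ..., r}
    # Boundary j turns the data into a binary problem: categories 1..j vs j+1..r
    datasets = []
    for j in range(1, r):
        y_aug = np.where(cats <= j, -1, 1)   # e.g. (x1,-1),(x2,1),(x3,1),(x4,1) for j=1
        datasets.append((X, y_aug))
    return datasets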
Experiments: HIT Data
• Data
– 50 features derived from time series of labs, medications and procedures
– 377 patient instances labeled by an expert
• Models
– Trained with only class labels
• LogR: logistic regression
• SVM: linear SVM
– Trained with only certainty labels
• LinRaux: linear regression with auxiliary probability
• LogRaux: logistic regression with auxiliary probability
• LinRaux_localsearch: linear regression with auxiliary categories
– Trained with both class and certainty labels
• svmCombo: SVM with 2 hinge losses for class and order constraints
• svmCombo_cat: svmCombo with auxiliary categories
Experimental Result: HIT with auxiliary probability
• Methods trained with auxiliary probability (LinRaux, LogRaux, svmCombo) outperform standard binary classifiers (LogR, SVM) when the training set is small
Experimental Result: HIT with auxiliary categories
• Methods trained with auxiliary categories (LinRaux, LinRaux_localsearch, svmCombo) outperform standard binary classifiers (LogR, SVM)
• Regression with local search performs very well
Summary
• Auxiliary certainty information:
– Helps to learn better classification models with smaller numbers of examples
– Especially useful when data are unbalanced
– Can be obtained with little additional cost
• Human subjective certainty assessments are noisy
– We proposed a method that is robust to noise
• Certainty assessments can be expressed in terms of probability or ordinal categories
– We proposed efficient methods to work with these cases
Outline
• Introduction
• Learning with auxiliary information
– Framework
– Experiments
• Learning with multiple annotators
– Framework
– Experiments
• Future work
Multiple-Annotator Learning
• Traditional supervised learning assumes one annotator labels all examples
• In practice, for complicated tasks (e.g. disease diagnosis), labeling is difficult and time consuming
→ Typically, a group of annotators works on the labeling
→ Different from traditional supervised learning. New solutions?
[Figure: patient records (labs, medications, notes, ...) are labeled with diagnoses (disease/no disease class labels) by multiple annotators]
Problem and Objective
• Given: a set of training examples labeled by multiple annotators
• Objective:
– Learn a consensus classifier that can predict future unseen examples
– Learn annotator-specific models, i.e. how they predict future examples
– Learn different characteristics of annotators (e.g. expertise, bias, consistency, etc.)
• Challenges:
– Different annotators have different understanding of examples => disagreements/contradictions in the labeling
– How to model and combine all these disagreements to learn a good classifier?
Existing Approaches
• Majority vote
– Assume all annotators are equal
– Simple, most widely used
• Two main directions
– Estimating a consensus label representing annotators’ labels (Dawid’79, Smyth’95, Whitehill’09, Donmez’09, Welinder’10)
– Learning a consensus model to predict future data (Yan’10, Raykar’10)
• State-of-the-art:
– Welinder '10: models annotator bias and example difficulty; does not produce a consensus model
– Raykar '10: models annotator bias and reliability
Proposed Approach
• Modeling different annotator characteristics that lead to disagreements in labeling
– Different bias associated with false positive/negative costs
• Example: a physician may be very conservative in diagnosing a condition as positive
– Different knowledge and understanding
• Example: a physician may be more experienced or knowledgeable than another
– Different consistency level
• Example: an annotator may be inconsistent if he has little time, or is distracted or tired during the labeling process
Modeling Annotators: Example
[Figure: decision boundaries of annotator models w1, w2, w3 compared with the consensus model u; random mistakes are circled]
• Different bias associated with false positive/negative costs
– Annotator 2 is very conservative, e.g. rarely gives positive labels
• Different knowledge and understanding
– Annotator 3's knowledge is the most similar to the consensus model
• Different consistency level
– Annotator 1 is less consistent with himself, e.g. makes more random mistakes (circled points)
Graphical Model
• Each annotator model w_k is generated from u based on a density function with parameter β_k: p(\mathbf{w}_k \mid \mathbf{u}, \beta_k) = \mathcal{N}(\mathbf{u}, \beta_k^{-1} I_d)
• β_k models the consistency between annotator model w_k and consensus model u
[Plate diagram: hidden consensus u generates each annotator model w_k with precision β_k, one plate per annotator; hidden variables are empty circles, observed ones are filled]
Graphical Model (cont’d)
• Each annotator k labels example x_i using model w_k with some noise reflected by α_k: p(y \mid \mathbf{x}, \mathbf{w}_k, \alpha_k) = \mathcal{N}(y \mid \mathbf{w}_k^T\mathbf{x}, \alpha_k^{-1})
• α_k models the inconsistency of annotator k within his own model w_k
[Plate diagram: for each annotator k and example i, the observed label y_i^k depends on x_i^k, w_k and α_k; hidden variables are empty circles, observed ones are filled]
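A generative sketch of this model; all numeric settings are illustrative, not values from the proposal:

import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 3, 100                       # features, annotators, examples each
u = rng.standard_normal(d)                 # hidden consensus model
beta = np.array([4.0, 1.0, 9.0])           # model consistency per annotator
alpha = np.array([10.0, 2.0, 5.0])         # self-consistency per annotator

for k in range(m):
    # w_k ~ N(u, beta_k^{-1} I_d): annotator k's model drifts from the consensus
    w_k = u + rng.standard_normal(d) / np.sqrt(beta[k])
    X = rng.standard_normal((n, d))
    # y ~ N(w_k^T x, alpha_k^{-1}); its sign gives the observed binary label
    y = np.sign(X @ w_k + rng.standard_normal(n) / np.sqrt(alpha[k]))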
Objective Function
\min_{\mathbf{u},\mathbf{w},b,\xi,\alpha,\beta}\; \frac{\eta}{2}\|\mathbf{u}\|^2 + \frac{1}{2}\sum_{k=1}^{m} \beta_k \|\mathbf{w}_k - \mathbf{u}\|^2 + \frac{1}{2}\sum_{k=1}^{m} \alpha_k \sum_{i=1}^{n_k} \xi_i^k - \frac{1}{2}\sum_{k=1}^{m} \ln(\beta_k) - \frac{1}{2}\sum_{k=1}^{m} n_k \ln(\alpha_k)
s.t.\quad y_i^k(\mathbf{w}_k^T\mathbf{x}_i^k + b_k) \ge 1 - \xi_i^k, \qquad \xi_i^k \ge 0, \qquad k = 1..m,\; i = 1..n_k
(the \|\mathbf{w}_k - \mathbf{u}\|^2 term measures model consistency with the consensus model \mathbf{u}; the slack term measures self-consistency; the intercept b_k captures bias)
• Model consistency: how consistent (similar) the annotator model is with the consensus model
• Self-consistency: how consistent the annotator is with his own model
• Bias: how biased the annotator is toward positive labels
SVMCrowd
Learning
• Fix α_k and β_k to learn models w_k and u
– This step reduces to learning an SVM
• Fix models w_k and u to learn α_k and β_k
– This step has a closed-form solution for α_k and β_k
• Repeat until convergence
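A sketch of the second step: with the models fixed, setting the objective's derivatives with respect to α_k and β_k to zero gives the closed-form updates below (the guard against division by zero is an implementation assumption):

import numpy as np

def update_precisions(W, u, slacks):
    # W: (m, d) stacked annotator models w_k; u: (d,) consensus model
    # slacks[k]: array of hinge slacks xi_i^k for annotator k's examples
    beta = np.array([1.0 / max(np.sum((w_k - u) ** 2), 1e-8) for w_k in W])
    alpha = np.array([len(s) / max(np.sum(s), 1e-8) for s in slacks])
    return alpha, beta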
Experiments: UCI Data
• 5 data sets with binary labels: parkinsons, sonar, wdbc, ionosphere, vertebral
• Generate data:
– Learn a consensus model u using SVM and “true” binary labels from the data sets
– Generate Gaussian model noise zk from N(0,variance*) and add to u to obtain wk => study model consistency
– Vary the percentage of positive examples using Gaussian deviations N(0, variance*) => study bias
– Randomly flip the labels of a fraction of examples => study self-consistency
*Note: variance values vary from 0 to 0.5
Experimental Setup
• Models
– Majority: majority vote
– Raykar: Raykar et al. ‘10
– SVMCrowd: our method
• Evaluation
– Randomly split data: 2/3 for training and 1/3 for testing
– Repeat 100 times
– Report average AUC and 95% confidence interval
Experimental Result
• Results with the default setting: 3 annotators, model noise from N(0,0.1), bias deviation from N(0,0.1), flipping noise from N(0,0.3)
• Our method SVMCrowd outperforms the baselines
Result: AUC vs Number of reviewers
• Our method (SVMCrowd) significantly outperforms baselines
Result: AUC vs Model Noise
• Our method (SVMCrowd) significantly outperforms baselines
Result: AUC vs Flipping Noise
• Our method (SVMCrowd) significantly outperforms baselines
Result: AUC vs Bias Deviation
• Our method (SVMCrowd) significantly outperforms baselines
Result: Example Overlapping Effect
• Our method (SVMCrowd) significantly outperforms baselines
Experiments: HIT data
• We asked 3 junior physicians to label each patient case
• We also asked a senior physician to label each case
– These labels are unseen by the learner
– They are used only for evaluation, as the true labels
• Data were divided into 2/3 training and 1/3 test
– Trained on the training set
– Reported AUC on test set
HIT Result: AUC vs Training Size
• Our method (SVMCrowd) significantly outperforms baselines, especially when the training set is small
HIT Result: Example Overlapping Effect
• Our method (SVMCrowd) significantly outperforms baselines and is robust to the amount of overlapping examples
• Raykar and Majority perform better when examples are more diverse (less overlapped)
Learning reviewer-specific models
• Objective: learn a reviewer model, i.e. how he predicts future examples
• Hypothesis: learning reviewer models jointly is better than learning each reviewer individually
– Intuition: one would work/study more efficiently if he collaborates with others
• Experiment: SVMCrowd vs SVM for individual reviewers
• Results show that learning reviewers together is indeed better than learning each of them individually
Learning Reviewer Characteristics
Notice the strong correlation between an annotator's model consistency and self-consistency and how much the annotator agrees with the "true" labels
Summary
• In practice, examples may be labeled by multiple annotators
• We proposed a multi-annotator learning framework:
– Learn a consensus model that can predict future examples
– Learn annotator-specific models
– Learn different characteristics of annotators, e.g. expertise (model-consistency), self-consistency and bias
• Experimental results on UCI datasets and our medical data showed the benefits of our method
– Our method significantly outperforms baselines
– It provides a meaningful evaluation of annotators
Outline
• Introduction
• Learning with auxiliary information
– Framework
– Experiments
• Learning with multiple annotators
– Framework
– Experiments
• Future work
Future Work
• A learning framework that combines active learning and auxiliary information
• A learning framework that combines multi-annotator learning and auxiliary information
Active Learning with Auxiliary Information: Motivation
• Active Learning: select the most informative examples to label
• Our solution: ask a human expert to provide, in addition to class labels, his/her certainty in the label decision and incorporate this information into the learning process
Our solution is orthogonal to AL: AL selects examples to label, we obtain more useful information from those selected
Can we combine the strengths of these two?
→ A new problem!
Active Learning with Auxiliary Information: Challenges
• How active learning works (sketched below):
– Inspect unlabeled examples
– Query the k most informative examples to label
– Retrain the current model with the newly labeled examples
• For the new problem setting, we need to study
– What query strategy should be used?
– What learning model should be employed?
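A generic sketch of this loop with uncertainty sampling as the query strategy; the oracle callable stands in for the human expert, and all names here are illustrative rather than the proposal's method:

import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, oracle, k=5, rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), size=k, replace=False))  # seed set
    y = {i: oracle(i) for i in labeled}       # assumes both classes appear
    model = None
    for _ in range(rounds):
        model = LogisticRegression().fit(X_pool[labeled], [y[i] for i in labeled])
        probs = model.predict_proba(X_pool)[:, 1]
        # Query the k most informative (most uncertain) unlabeled examples
        unlabeled = [i for i in range(len(X_pool)) if i not in y]
        queries = sorted(unlabeled, key=lambda i: abs(probs[i] - 0.5))[:k]
        for i in queries:
            y[i] = oracle(i)
        labeled.extend(queries)
    return model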
Active Learning with Auxiliary Information: Challenges
• What query strategy?
– Traditional active learning strategies were designed only for classification (query class labels) or regression (query continuous values)
– We have both class and auxiliary certainty labels
• What learning model?
– Traditional active learning assumes annotator labels are the gold standard
– We showed that human subjective certainty assessments may be noisy, so a solution is needed
Multi-annotator Learning with Auxiliary Information
• Our current multi-annotator learning method works with binary class labels => modify it to incorporate auxiliary information
• Challenges
– Modeling expertise, self-consistency and bias with auxiliary certainty labels may be complicated
– Can we assume that an annotator who is good/consistent with class labels will also be good/consistent with certainty labels?
– Technical issue: adding more parameters makes the framework harder to optimize and more prone to overfitting
Time Line
• January – May 2013: Investigate the framework that combines active learning and auxiliary information
• May – August 2013: Investigate the framework that combines multi-annotator learning and auxiliary information
• August – September 2013: Write the thesis
• End of September: Thesis defense
Thank You!
Q & A