Efficient Learning with Active Learning, Auxiliary Information and Multiple
Annotators
PhD Proposal
Quang Nguyen
Committee: Dr. Milos Hauskrecht (Advisor)
Dr. Janyce Wiebe, Dr. Gregory Cooper
Dr. Jingtao Wang
Outline
• Introduction
• Learning with auxiliary information
– Framework
– Experiments
• Learning with multiple annotators
– Framework
– Experiments
• Future work
Supervised Learning Framework
[Figure: supervised learning pipeline: unlabeled data (patients) is annotated (patient labels), a model (classifier) is learned, and the model makes predictions (disease or not) for new examples (new patients)]
• Objective: build an efficient learning framework that
– Generalizes well on future data
– Works with a limited amount of annotated training data
Data Annotation
• Labeling often requires human experts → time-consuming and costly
• How to reduce the number of examples to label?
→ Learning with auxiliary information (first part of the thesis)
• One annotator may not label all examples
→ Multi-annotator learning (second part of the thesis)
[Figure: patient records (labs, medications, notes, ...) are labeled with diagnoses (disease/no disease class labels) to form the training data]
Outline
• Introduction
• Learning with auxiliary information
– Framework
– Experiments
• Learning with multiple annotators
– Framework
– Experiments
• Future work
Learning with Auxiliary Information
• How to reduce the number of examples to label?
• Active Learning: select the most informative examples to label
• Can we obtain more useful information from selected examples?
• Our solution: ask a human expert to provide, in addition to class labels, his/her certainty in the label decision and incorporate this information into the learning process
• Certainty can be represented in terms of
– Probability: e.g. probability of having disease p = 0.85
– Ordinal category: e.g. strong, medium or weak belief in disease
• We study and propose methods to work with each type of certainty information: probability and ordinal categories
Learning with Auxiliary Information (cont’d)
• Cost of auxiliary information is insignificant compared to the overall labeling cost
– Example: 5 minutes to review an electronic health record (EHR), few seconds to give the auxiliary label
• Orthogonal to Active Learning: AL selects examples to label, we obtain more useful information from those selected
Opportunity: combine these approaches (proposed future work)
Traditional Classification Problem
[Figure: training pairs (x1, y1), ..., (xN, yN), where each patient record x (labs, medications, etc.) carries a class label y = 1/0 (disease/no disease), are fed to a learner that outputs a classifier]
Learning with Auxiliary Information
[Figure: training pairs (x1, y1 + p1), ..., (xN, yN + pN), where each patient record x carries a class label (disease/no disease) plus a certainty label (certainty in disease), are fed to a learner that outputs a classifier]
Learning with Auxiliary Probabilistic Information: Linear Regression
[Figure: training pairs (x1, p1), ..., (xN, pN), where each patient record x (labs, medications, etc.) is paired with a probability score p (certainty in disease), are fed to a regression learner f: X → p]
Problem: the predicted probability p may not be in [0,1]
LinRaux: linear regression with auxiliary information
Learning with Auxiliary Probabilistic Information: Logistic Transformation
[Figure: training pairs (x1, p1), ..., (xN, pN) are fed to a regression learner f: X → t(p), where the logistic transformation keeps predictions in [0,1]]
t(p) = \log\frac{p}{1-p}, \qquad p = \frac{1}{1+e^{-t}} \in [0,1]
LogRaux: logistic regression with auxiliary information
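A minimal sketch of the LogRaux idea in Python (an illustrative assumption; the proposal does not prescribe an implementation, and scikit-learn's Lasso stands in for the L1-regularized regression used in the experiments):

import numpy as np
from sklearn.linear_model import Lasso

def fit_lograux(X, p, alpha=0.01, eps=1e-6):
    # Regress on the logit-transformed certainty scores t(p) = log(p / (1 - p))
    p = np.clip(p, eps, 1 - eps)   # keep the logit finite at p = 0 or 1
    t = np.log(p / (1 - p))
    return Lasso(alpha=alpha).fit(X, t)

def predict_proba(model, X):
    # Map predictions back to [0, 1] through the sigmoid p = 1 / (1 + e^{-t})
    return 1.0 / (1.0 + np.exp(-model.predict(X)))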
Learning with Auxiliary Information: Noise Issue
• Human certainty estimates are often noisy
– certainty score p may be inconsistent
• Regression relies on exact values of p
→ sensitive to noise
Learning with Auxiliary Information: Noise Issue
[Figure: learning curves for LogR (logistic regression with binary class labels), LinRaux (linear regression with certainty labels) and LogRaux (logistic regression with certainty labels)]
No noise: LinRaux and LogRaux clearly outperform LogR
With noise: LinRaux and LogRaux are no better than LogR
Solution?
Modeling pairwise orders
• Observation: Certainty scores let us order examples
• Idea: build a discriminant projection f(x) that respects this order
• Minimize the number of violated pairwise order constraints
• Modeling pairwise orders instead of relying on exact values of p
Hypothesis: learning is less sensitive to noise
[Figure: examples ordered along the projection f(x)]
Learning with Class and Pairwise Order Constraints
• Modeling pairwise orders: adapt SVM Rank (Herbrich 2000)
• Combining class and certainty information
– Optimize:
\min_{\mathbf{w},b}\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i,j:\, p_i > p_j} \xi_{ij} + B \sum_{i=1}^{N} \eta_i
(the C term penalizes violated pairwise order constraints; the B term penalizes violated class constraints)
– Pairwise order constraints:
\forall i,j:\ p_i > p_j:\quad \mathbf{w}^T(\mathbf{x}_i - \mathbf{x}_j) \ge 1 - \xi_{ij}, \qquad \xi_{ij} \ge 0
– Class constraints:
\forall i:\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \eta_i, \qquad \eta_i \ge 0
Note: (1) constants B and C regularize the trade-off between class and auxiliary information; (2) number of constraints = O(N^2)
SVMCombo
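A sketch of the SVMCombo optimization in its hinge-loss form. The use of cvxpy is an assumption for illustration, not part of the proposal; for large N the O(N^2) pairwise terms would need subsampling:

import numpy as np
import cvxpy as cp

def fit_svm_combo(X, y, p, B=1.0, C=1.0):
    # X: (N, d) features; y: (N,) labels in {-1, +1}; p: (N,) certainty scores
    N, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    # Class constraints as hinge losses: max(0, 1 - y_i (w^T x_i + b))
    class_loss = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))
    # Pairwise order constraints: for p_i > p_j we want w^T (x_i - x_j) >= 1
    pairs = [(i, j) for i in range(N) for j in range(N) if p[i] > p[j]]
    order_loss = sum(cp.pos(1 - (X[i] - X[j]) @ w) for i, j in pairs)
    objective = 0.5 * cp.sum_squares(w) + C * order_loss + B * class_loss
    cp.Problem(cp.Minimize(objective)).solve()
    return w.value, b.value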
Experimental Setup: UCI Data
• 5 UCI data sets with continuous outputs
• Ailerons, Concrete, Bank8, Housing, Pol
• Generated labels
• Certainty labels: by normalizing continuous outputs
• Binary class labels: by setting a threshold on certainty labels
• Ratios of positive examples
• 10%, 25% and 50%
• Noise added to certainty labels
• 4 levels of noise-to-signal ratio (none, weak, moderate, strong), generated as 0%, 5%, 15%, 30% × N(0,1), respectively
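A short sketch of this label-generation protocol; the min-max normalization and quantile threshold are assumptions consistent with the description above:

import numpy as np

def make_labels(z, pos_ratio=0.25, noise_level=0.05, seed=0):
    # z: continuous outputs of a UCI regression data set
    rng = np.random.default_rng(seed)
    p = (z - z.min()) / (z.max() - z.min())      # certainty labels in [0, 1]
    threshold = np.quantile(p, 1 - pos_ratio)    # e.g. top 25% become positive
    y = np.where(p > threshold, 1, -1)           # binary class labels
    # Add noise at the chosen level: 0%, 5%, 15% or 30% of N(0, 1)
    p_noisy = np.clip(p + noise_level * rng.standard_normal(len(p)), 0, 1)
    return y, p_noisy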
Experimental Setup
• Models
– Trained with only class labels
• LogR: logistic regression with lasso (L1) regularization
• SVM: standard linear SVM
– Trained with only certainty labels
• LinRaux: linear regression with L1 regularization
• LogRaux: logistic regression with L1 regularization
– Trained with both class and certainty labels
• SVM-Combo: SVM with 2 hinge losses for class and pairwise order constraints
• Evaluation
– Training examples were randomly sampled from the training set
– The training/testing process was repeated 100 times
– Average AUC and 95% confidence intervals were recorded
Result: ‘Concrete’ data with Weak Noise
• Methods trained with auxiliary labels (svmCombo, LinRaux, LogRaux) consistently outperform standard binary classifiers (SVM, LogR)
• Regression methods (LinRaux, LogRaux) are comparable with svmCombo
Result: ‘Concrete’ data with Moderate Noise
• svmCombo is robust to noise and outperforms other methods
• LinRaux, LogRaux start to suffer from noise, but still better than standard binary classifiers (SVM, LogR)
Result: ‘Concrete’ data with Strong Noise
• svmCombo is very robust to noise and still consistently outperforms other methods
• Regression methods (LinRaux, LogRaux) suffer from strong noise
Experimental Results: UCI Data (Cont’d)
• Auxiliary information helps to learn better models with less training data
• svmCombo is more robust to noise than regression methods
Experimental Results: UCI Data (Cont’d)
• Auxiliary information helps to learn better models with less training data
• svmCombo is more robust to noise than regression methods
Experiments: Unbalanced Data
• Challenge: in many applications data are often unbalanced (e.g. in medicine positive examples are usually rare)
Does certainty information help?
• Auxiliary information is especially useful when data are unbalanced
[Figure: results for positive-example ratios of 50%, 25% and 10%]
Experiments: Unbalanced Data (cont’d)
Auxiliary information is especially useful when data are unbalanced
Experiments: Unbalanced Data (cont’d)
Auxiliary information is especially useful when data are unbalanced
Learning with Auxiliary Ordinal Categories
• Certainty labels can be expressed in terms of
– Probability: e.g. probability of having disease p = 0.85
– Ordinal categories: e.g. strong, medium or weak belief in having disease
• Regression methods do not work with ordinal categories. What to do?
• Can we reduce the number of constraints for SVMCombo? (O(N^2), slow if N is large)
Application: HIT Alert
Heparin-induced thrombocytopenia (HIT):
• A life-threatening condition that may develop when patients are treated with heparin
Labeling:
• For each patient case we asked the expert 3 questions
– Do you agree with raising an alert on HIT or not? Yes/No => used as binary class label
– How strongly does the clinical evidence indicate that the patient has HIT? Score from 0 to 100 => used as auxiliary probability
– How strongly do you agree with the alert? 4 categories: strongly-disagree, weakly-disagree, weakly-agree and strongly-agree => used as auxiliary categories
Regression with Local Search
• 4 categories: strongly-disagree, weakly-disagree, weakly-agree and strongly-agree
• Regression methods require numeric values as input
Idea: search for a mapping of 4 categories to 4 numeric values, that maximizes AUC when applying regression (e.g. LinRaux)
• Local search algorithm:
– Initiate a set of mapping values for categories, e.g. 0, 1, 2, 3
– Repeat: (1) move mapped points left/right by a distance d; (2) train LinRaux on the local mapping solution; (3) keep the local solution that maximizes AUC
– Until: a maximum number of iterations n is reached, or AUC improves by less than ε
[Figure: four category points mapped onto the number line at 0, 1, 2, 3]
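A minimal sketch of the local search above; scoring candidate mappings by training AUC is a simplification (a held-out set would be used in practice), and the default values of d, n and ε are illustrative:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import roc_auc_score

def local_search(X, cats, y, d=0.25, n=50, eps=1e-4):
    # cats: (N,) ordinal categories coded 0..3; y: (N,) binary class labels
    vals = np.array([0.0, 1.0, 2.0, 3.0])    # initial mapping of the 4 categories
    best_auc = -np.inf
    for _ in range(n):
        improved = False
        for c in range(len(vals)):
            for step in (-d, d):             # move each mapped point left/right
                cand = vals.copy()
                cand[c] += step
                scores = Lasso(alpha=0.01).fit(X, cand[cats]).predict(X)
                auc = roc_auc_score(y, scores)
                if auc > best_auc + eps:     # keep the mapping that maximizes AUC
                    best_auc, vals, improved = auc, cand, True
        if not improved:                     # AUC no longer improves by > eps
            break
    return vals, best_auc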
SVM with ordinal categories
[Figure: examples x1..x4, one per category, projected onto f(x) = w^T x with category boundaries b1, b2, b3 and margins b_j ± 1]
Idea is based on SVM regression (Chu et al. '05): one constraint for each pair of an example and a boundary between categories
Augmented dataset:
(x1,-1), (x2,1), (x3,1), (x4,1)
SVM with ordinal categories
[Figure: boundary b2 with margins b2 ± 1 separates categories 1-2 from categories 3-4]
Augmented dataset:
(x1,-1), (x2,1), (x3,1), (x4,1)
(x1,-1), (x2,-1), (x3,1), (x4,1)
SVM with ordinal categories
[Figure: boundary b3 with margins b3 ± 1 separates categories 1-3 from category 4]
Augmented dataset:
(x1,-1), (x2,1), (x3,1), (x4,1)
(x1,-1), (x2,-1), (x3,1), (x4,1)
(x1,-1), (x2,-1), (x3,-1), (x4,1)
SVM with ordinal categories
[Figure: examples from Category 1 (strongly-disagree) through Category 4 (strongly-agree) projected onto f(x) = w^T x with boundaries b1, b2, b3 and margins b_j ± 1]
\min_{\mathbf{w},c,b_j,\eta_i,\xi_{jk}}\; \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \eta_i + B \sum_{j=1}^{r-1} \sum_{k=1}^{n_j} \xi_{jk}
Class constraints:
\forall i = 1..N:\quad y_i(\mathbf{w}^T\mathbf{x}_i + c) \ge 1 - \eta_i
Ordinal category constraints, for each boundary j = 1..r-1:
\forall \mathbf{x}_i \in \text{categories } 1..j:\quad \mathbf{w}^T\mathbf{x}_i \le b_j - 1 + \xi_{jk}
\forall \mathbf{x}_i \in \text{categories } j+1..r:\quad \mathbf{w}^T\mathbf{x}_i \ge b_j + 1 - \xi_{jk}
where r = number of categories and n_j = number of examples in category j
Note: number of constraints = O(rN) => linear in N
SVMCombo_cat
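The augmented data sets shown on the previous slides can be built mechanically; a small sketch (function and variable names are illustrative):

import numpy as np

def augment_by_boundary(X, cats, r=4):
    # X: (N, d) features; cats: (N,) ordinal categories in {1, ..., r}
    # Boundary j turns the data into a binary problem: categories 1..j vs j+1..r
    datasets = []
    for j in range(1, r):
        y_aug = np.where(cats <= j, -1, 1)   # e.g. (x1,-1),(x2,1),(x3,1),(x4,1) for j=1
        datasets.append((X, y_aug))
    return datasets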
Experiments: HIT Data
• Data
– 50 features derived from time series of labs, medications and procedures
– 377 patient instances labeled by an expert
• Models
– Trained with only class labels
• LogR: logistic regression
• SVM: linear SVM
– Trained with only certainty labels
• LinRaux: linear regression with auxiliary probability
• LogRaux: logistic regression with auxiliary probability
• LinRaux_localsearch: linear regression with auxiliary categories
– Trained with both class and certainty labels
• svmCombo: SVM with 2 hinge losses for class and order constraints
• svmCombo_cat: svmCombo with auxiliary categories
Experimental Result: HIT with auxiliary probability
• Methods trained with auxiliary probability (LinRaux, LogRaux, svmCombo) outperform standard binary classifiers (LogR, SVM) when the training set is small
Experimental Result: HIT with auxiliary categories
• Methods trained with auxiliary categories (LinRaux, LinRaux_localsearch, svmCombo) outperform standard binary classifiers (LogR, SVM)
• Regression with local search performs very well
Summary
• Auxiliary certainty information:
– Helps to learn better classification models with smaller numbers of examples
– Especially useful when data are unbalanced
– Can be obtained with little additional cost
• Human subjective certainty assessments are noisy
– We proposed a method that is robust to noise
• Certainty assessments can be expressed in terms of probability or ordinal categories
– We proposed efficient methods to work with these cases
Outline
• Introduction
• Learning with auxiliary information
– Framework
– Experiments
• Learning with multiple annotators
– Framework
– Experiments
• Future work
Multiple-Annotator Learning
• Traditional supervised learning assumes one annotator labels all examples
• In practice, for complicated tasks (e.g. disease diagnosis), labeling is difficult and time consuming
→ Typically, a group of annotators works on the labeling
→ Different from traditional supervised learning. New solutions?
[Figure: patient records (labs, medications, notes, ...) are labeled with diagnoses (disease/no disease class labels) by multiple annotators]
Problem and Objective
• Given: a set of training examples labeled by multiple annotators
• Objective:
– Learn a consensus classifier that can predict future unseen examples
– Learn annotator-specific models, i.e. how they predict future examples
– Learn different characteristics of annotators (e.g. expertise, bias, consistency, etc.)
• Challenges:
– Different annotators have different understanding of examples => disagreements/contradictions in the labeling
– How to model and combine all these disagreements to learn a good classifier?
Existing Approaches
• Majority vote
– Assume all annotators are equal
– Simple, most widely used
• Two main directions
– Estimating a consensus label representing annotators’ labels (Dawid’79, Smyth’95, Whitehill’09, Donmez’09, Welinder’10)
– Learning a consensus model to predict future data (Yan’10, Raykar’10)
• State-of-the-art:
– Welinder '10: models annotator bias and example difficulty; does not produce a consensus model
– Raykar '10: models annotator bias and reliability
Proposed Approach
• Modeling different annotator characteristics that lead to disagreements in labeling
– Different bias associated with false positive/negative costs
• Example: a physician may be very conservative in diagnosing a condition as positive
– Different knowledge and understanding
• Example: a physician may be more experienced or knowledgeable than another
– Different consistency level
• Example: an annotator may be inconsistent if he has little time, or is distracted or tired during the labeling process
Modeling Annotators: Example
[Figure: decision boundaries of annotator models w1, w2, w3 compared with the consensus model u; random mistakes are circled]
• Different bias associated with false positive/negative costs
– Annotator 2 is very conservative, e.g. rarely gives positive labels
• Different knowledge and understanding
– Annotator 3's knowledge is the most similar to the consensus model
• Different consistency level
– Annotator 1 is less consistent with himself, e.g. makes more random mistakes (circled points)
Graphical Model
• Each annotator model w_k is generated from u based on a density function with parameter β_k: p(\mathbf{w}_k \mid \mathbf{u}, \beta_k) = \mathcal{N}(\mathbf{u}, \beta_k^{-1} I_d)
• β_k models the consistency between annotator model w_k and consensus model u
[Plate diagram: hidden consensus u generates each annotator model w_k with precision β_k, one plate per annotator; hidden variables are empty circles, observed ones are filled]
Graphical Model (cont’d)
• Each annotator k labels example x_i using model w_k with some noise reflected by α_k: p(y \mid \mathbf{x}, \mathbf{w}_k, \alpha_k) = \mathcal{N}(y \mid \mathbf{w}_k^T\mathbf{x}, \alpha_k^{-1})
• α_k models the inconsistency of annotator k within his own model w_k
[Plate diagram: for each annotator k and example i, the observed label y_i^k depends on x_i^k, w_k and α_k; hidden variables are empty circles, observed ones are filled]
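A generative sketch of this model; all numeric settings are illustrative, not values from the proposal:

import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 3, 100                       # features, annotators, examples each
u = rng.standard_normal(d)                 # hidden consensus model
beta = np.array([4.0, 1.0, 9.0])           # model consistency per annotator
alpha = np.array([10.0, 2.0, 5.0])         # self-consistency per annotator

for k in range(m):
    # w_k ~ N(u, beta_k^{-1} I_d): annotator k's model drifts from the consensus
    w_k = u + rng.standard_normal(d) / np.sqrt(beta[k])
    X = rng.standard_normal((n, d))
    # y ~ N(w_k^T x, alpha_k^{-1}); its sign gives the observed binary label
    y = np.sign(X @ w_k + rng.standard_normal(n) / np.sqrt(alpha[k]))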
Objective Function
\min_{\mathbf{u},\mathbf{w},b,\xi,\alpha,\beta}\; \frac{\eta}{2}\|\mathbf{u}\|^2 + \frac{1}{2}\sum_{k=1}^{m} \beta_k \|\mathbf{w}_k - \mathbf{u}\|^2 + \frac{1}{2}\sum_{k=1}^{m} \alpha_k \sum_{i=1}^{n_k} \xi_i^k - \frac{1}{2}\sum_{k=1}^{m} \ln(\beta_k) - \frac{1}{2}\sum_{k=1}^{m} n_k \ln(\alpha_k)
s.t.\quad y_i^k(\mathbf{w}_k^T\mathbf{x}_i^k + b_k) \ge 1 - \xi_i^k, \qquad \xi_i^k \ge 0, \qquad k = 1..m,\; i = 1..n_k
(the \|\mathbf{w}_k - \mathbf{u}\|^2 term measures model consistency with the consensus model \mathbf{u}; the slack term measures self-consistency; the intercept b_k captures bias)
• Model consistency: how consistent (similar) the annotator model is with the consensus model
• Self-consistency: how consistent the annotator is with his own model
• Bias: how biased the annotator is toward positive labels
SVMCrowd
Learning
• Fix α_k and β_k to learn models w_k and u
– This step reduces to learning an SVM
• Fix models w_k and u to learn α_k and β_k
– This step has a closed-form solution for α_k and β_k
• Repeat until convergence
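A sketch of the second step: with the models fixed, setting the objective's derivatives with respect to α_k and β_k to zero gives the closed-form updates below (the guard against division by zero is an implementation assumption):

import numpy as np

def update_precisions(W, u, slacks):
    # W: (m, d) stacked annotator models w_k; u: (d,) consensus model
    # slacks[k]: array of hinge slacks xi_i^k for annotator k's examples
    beta = np.array([1.0 / max(np.sum((w_k - u) ** 2), 1e-8) for w_k in W])
    alpha = np.array([len(s) / max(np.sum(s), 1e-8) for s in slacks])
    return alpha, beta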
Experiments: UCI Data
• 5 data sets with binary labels: parkinsons, sonar, wdbc, ionosphere, vertebral
• Generate data:
– Learn a consensus model u using SVM and “true” binary labels from the data sets
– Generate Gaussian model noise zk from N(0,variance*) and add to u to obtain wk => study model consistency
– Vary the percentage of positive examples using Gaussian deviations N(0, variance*) => study bias
– Randomly flip the labels of a fraction of examples => study self-consistency
*Note: variance values vary from 0 to 0.5
Experimental Setup
• Models
– Majority: majority vote
– Raykar: Raykar et al. ‘10
– SVMCrowd: our method
• Evaluation
– Randomly split data: 2/3 for training and 1/3 for testing
– Repeat 100 times
– Report average AUC and 95% confidence interval
Experimental Result
• Results with the default setting: 3 annotators, model noise from N(0,0.1), bias deviation from N(0,0.1), flipping noise from N(0,0.3)
• Our method SVMCrowd outperforms the baselines
Result: AUC vs Number of reviewers
• Our method (SVMCrowd) significantly outperforms baselines
Result: AUC vs Model Noise
• Our method (SVMCrowd) significantly outperforms baselines
Result: AUC vs Flipping Noise
• Our method (SVMCrowd) significantly outperforms baselines
Result: AUC vs Bias Deviation
• Our method (SVMCrowd) significantly outperforms baselines
Result: Example Overlapping Effect
• Our method (SVMCrowd) significantly outperforms baselines
Experiments: HIT data
• We asked 3 junior physicians to label each patient case
• We also asked a senior physician to label each case
– These labels are unseen by the learner
– They are used only for evaluation, as the true labels
• Data were divided into 2/3 training and 1/3 test
– Trained on the training set
– Reported AUC on test set
HIT Result: AUC vs Training Size
• Our method (SVMCrowd) significantly outperforms baselines, especially when the training set is small
HIT Result: Example Overlapping Effect
• Our method (SVMCrowd) significantly outperforms baselines and is robust to the amount of overlapping examples
• Raykar and Majority perform better when examples are more diverse (less overlapped)
Learning reviewer-specific models
• Objective: learn a reviewer model, i.e. how he predicts future examples
• Hypothesis: learning reviewer models jointly is better than learning each reviewer individually
– Intuition: one would work/study more efficiently if he collaborates with others
• Experiment: SVMCrowd vs SVM for individual reviewers
• Results show that learning reviewers together is indeed better than learning each of them individually
Learning Reviewer Characteristics
Notice the strong correlation between an annotator's model consistency and self-consistency and how much the annotator agrees with the "true" labels
Summary
• In practice, examples may be labeled by multiple annotators
• We proposed a multi-annotator learning framework:
– Learn a consensus model that can predict future examples
– Learn annotator-specific models
– Learn different characteristics of annotators, e.g. expertise (model-consistency), self-consistency and bias
• Experimental results on UCI datasets and our medical data showed the benefits of our method
– Our method significantly outperforms baselines
– It provides a meaningful evaluation of annotators
Outline
• Introduction
• Learning with auxiliary information
– Framework
– Experiments
• Learning with multiple annotators
– Framework
– Experiments
• Future work
Future Work
• A learning framework that combines active learning and auxiliary information
• A learning framework that combines multi-annotator learning and auxiliary information
Active Learning with Auxiliary Information: Motivation
• Active Learning: select the most informative examples to label
• Our solution: ask a human expert to provide, in addition to class labels, his/her certainty in the label decision and incorporate this information into the learning process
Our solution is orthogonal to AL: AL selects examples to label, we obtain more useful information from those selected
Can we combine the strengths of these two?
→ A new problem!
Active Learning with Auxiliary Information: Challenges
• How active learning works (sketched below):
– Inspect unlabeled examples
– Query the k most informative examples to label
– Retrain the current model with the newly labeled examples
• For the new problem setting, we need to study
– What query strategy should be used?
– What learning model should be employed?
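A generic sketch of this loop with uncertainty sampling as the query strategy; the oracle callable stands in for the human expert, and all names here are illustrative rather than the proposal's method:

import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, oracle, k=5, rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), size=k, replace=False))  # seed set
    y = {i: oracle(i) for i in labeled}       # assumes both classes appear
    model = None
    for _ in range(rounds):
        model = LogisticRegression().fit(X_pool[labeled], [y[i] for i in labeled])
        probs = model.predict_proba(X_pool)[:, 1]
        # Query the k most informative (most uncertain) unlabeled examples
        unlabeled = [i for i in range(len(X_pool)) if i not in y]
        queries = sorted(unlabeled, key=lambda i: abs(probs[i] - 0.5))[:k]
        for i in queries:
            y[i] = oracle(i)
        labeled.extend(queries)
    return model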
Active Learning with Auxiliary Information: Challenges
• What query strategy?
– Traditional active learning strategies were designed only for classification (query class labels) or regression (query continuous values)
– We have both class and auxiliary certainty labels
• What learning model?
– Traditional active learning assumes annotator labels are the gold standard
– We showed that human subjective certainty assessments may be noisy, so a solution is needed
Multi-annotator Learning with Auxiliary Information
• Our current multi-annotator learning method works with binary class labels => modify it to incorporate auxiliary information
• Challenges
– Modeling expertise, self-consistency and bias with auxiliary certainty labels may be complicated
– Can we assume that an annotator who is good/consistent with class labels will also be good/consistent with certainty labels?
– Technical issue: adding more parameters makes the framework harder to optimize and more prone to overfitting
Time Line
• January – May 2013: Investigate the framework that combines active learning and auxiliary information
• May – August 2013: Investigate the framework that combines multi-annotator learning and auxiliary information
• August – September 2013: Write the thesis
• End of September: Thesis defense
Thank You!
Q & A