Semi-supervised Learning with Generalized Expectation Criteria
Andrew McCallum, Computer Science Department, University of Massachusetts Amherst
Joint work with Gideon Mann and Greg Druck.

Page 1: Semi-supervised Learning with Generalized Expectation Criteria (mccallum/talks/xrce-ge.pdf)

Semi-supervised Learning with Generalized Expectation Criteria

Andrew McCallum

Computer Science Department, University of Massachusetts Amherst

Joint work with Gideon Mann and Greg Druck.

Page 2

Successful Applications of ML

• sentiment classification:
 – > 80% accuracy classifying positive and negative reviews

• sequence labeling:
 – > 99% accuracy labeling research paper references

• dependency parsing:
 – > 90% accuracy on English news text

Page 3

Substantial Human Annotation Required

• sentiment classification:
 – > 80% accuracy classifying positive and negative reviews
 – with 2000 labeled reviews

• sequence labeling:
 – > 99% accuracy labeling research paper references
 – with 500 labeled references

• dependency parsing:
 – > 90% accuracy on English news text
 – with the Penn Treebank, which took more than 3 years to annotate

Page 4

Goal

• Problem: To apply machine learning to a new problem, a substantial amount of human annotation effort is required.

• Goal:
 – Reduce the amount of human effort required to learn an accurate model for a new task.
 – Provide a natural way to inject human domain knowledge.

People have domain knowledge. They need tools to naturally and safely incorporate that knowledge.

Page 5

Supervised Learning

Decision boundary

Creation of labeled instances requires extensive human effort

Page 6

What if limited labeled data?

Small amount of labeled data

Page 7

Semi-Supervised Learning: Labeled & Unlabeled Data

Large amount of unlabeled data

Small amount of labeled data

Augment limited labeled data by using unlabeled data

Page 8

More Semi-Supervised Algorithms than Applications

[Bar chart: number of papers per year, 1998-2006, on semi-supervised algorithms vs. applications; algorithm papers greatly outnumber application papers. Compiled from [Zhu, 2007]]

Page 9

Weaknesses of Many Semi-Supervised Algorithms

Difficult to implement: significantly more complicated than their supervised counterparts

Fragile: meta-parameters hard to tune

Lacking in scalability: O(n²) or O(n³) in the amount of unlabeled data

Page 10

“EM will generally degrade [tagging] accuracy, except when only a limited amount of hand-tagged text is available.” [Merialdo, 1994]

“When the percentage of labeled data increases from 50% to 75%, the performance of [Label Propagation with Jensen-Shannon divergence] and SVM become almost same, while [Label Propagation with cosine distance] performs significantly worse than SVM.” [Niu, Ji, Tan, 2005]

Page 11

Families of Semi-Supervised Learning

1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions

Page 12

Family 1 : Expectation Maximization

[Dempster, Laird, Rubin, 1977]

Fragile -- often worse than supervised

Page 13

Family 2: Graph-Based Methods

[Zhu, Ghahramani, 2002]

[Szummer, Jaakkola, 2002]

Lacking in scalability, Sensitive to choice of metric

Page 14

Family 3: Auxiliary-Task Methods [Ando and Zhang, 2005]

Complicated to find appropriate auxiliary tasks

Page 15

Family 4: Decision Boundary in Sparse Region

Page 16

Family 4: Decision Boundary in Sparse Region

Transductive SVMs [Joachims, 1999]: sparsity measured by margin
Entropy Regularization [Grandvalet and Bengio, 2005]: sparsity measured by label entropy
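To make the entropy-regularization idea above concrete, here is a minimal sketch (not from the talk itself; all data and names are illustrative assumptions) of the quantity such methods penalize: the average entropy of the model's predictions on unlabeled data, which is low when the decision boundary passes through a sparse region.

```python
import numpy as np

# Minimal sketch of the entropy-regularization idea: prefer parameter settings
# whose predictions on unlabeled data have low entropy (confident labelings),
# i.e. decision boundaries in sparse regions. Toy probabilities are assumptions.

def prediction_entropy(probs):
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=1).mean())

confident = [[0.99, 0.01], [0.02, 0.98]]   # boundary in a sparse region
uncertain = [[0.6, 0.4], [0.5, 0.5]]       # boundary through dense data
assert prediction_entropy(confident) < prediction_entropy(uncertain)
```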

Page 17

Minimal Entropy Solution!

Page 18

How do we know the minimal entropy solution is wrong?

We suspect at least some of the data is in the second class!

[Bar chart: relative class proportions, 0.8 vs. 0.2.]

In fact we often have prior knowledge of the relative class proportions: 0.8 Student, 0.2 Professor.

Page 19

How do we know the minimal entropy solution is wrong?

We suspect at least some of the data is in the second class!

[Bar chart: relative class proportions, 0.1 vs. 0.9.]

In fact we often have prior knowledge of the relative class proportions: 0.1 Gene Mention, 0.9 Background.

Page 20

How do we know the minimal entropy solution is wrong?

We suspect at least some of the data is in the second class!

[Bar chart: relative class proportions, 0.6 vs. 0.4.]

In fact we often have prior knowledge of the relative class proportions: 0.6 Person, 0.4 Organization.

Page 21

Families of Semi-Supervised Learning

1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
5. Generalized Expectation

Page 22

Family 5: Generalized Expectation

[Bar chart: prior class proportions; the decision boundary falls in a low-density region.]

Favor decision boundaries that match the prior.

Page 23

Generalized Expectation

Simple: Easy to implement

Robust: Meta-parameters need little or no tuning

Scalable: Linear in number of unlabeled examples

Page 24

Generalized Expectation: Special Cases

• Label Regularization: p(y)

• Expectation Regularization: p(y | feature)

• Generalized Expectation: E[ f(x,y) ] (general case)

Page 25

Label Regularization (LR)

Objective: log-likelihood plus an LR term, the KL-divergence between a prior distribution and the expected label distribution over the unlabeled data.
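The LR term above can be sketched in a few lines. This is a minimal illustration, not the talk's actual implementation: the softmax model, data, and variable names are all assumptions.

```python
import numpy as np

# Minimal sketch of the label-regularization (LR) term: the KL-divergence
# between a prior class distribution and the model's expected class
# distribution over unlabeled data, for a toy softmax classifier.

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def label_regularization(theta, X_unlabeled, prior):
    """KL(prior || expected label distribution under the model)."""
    p = softmax(X_unlabeled @ theta)      # p(y|x; theta) for each instance
    expected = p.mean(axis=0)             # expected distribution over unlabeled data
    return float(np.sum(prior * np.log(prior / expected)))

rng = np.random.default_rng(0)
theta = rng.normal(size=(5, 2))           # 5 features, 2 classes (toy values)
X = rng.normal(size=(100, 5))             # unlabeled instances
prior = np.array([0.8, 0.2])              # e.g. 0.8 Student, 0.2 Professor
kl = label_regularization(theta, X, prior)
assert kl >= 0.0                          # KL-divergence is non-negative
```

In training, this term would be subtracted (scaled by a weight) from the labeled-data log-likelihood and minimized by gradient methods.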

Page 26

LR Results for Classification

Secondary Structure Prediction: accuracy by number of labeled examples

                                    2        100      1000
SVM (supervised)                    n/a      55.41%   66.29%
Cluster Kernel SVM                  n/a      57.05%   65.97%
QC Smartsub                         n/a      57.68%   59.16%
Naïve Bayes (supervised)            52.42%   57.12%   64.47%
Naïve Bayes EM                      50.79%   57.34%   57.60%
Logistic Regression (supervised)    52.42%   56.74%   65.43%
Logistic Regression + Entropy Reg.  48.56%   54.45%   58.28%
Logistic Regression + GE            57.08%   58.51%   65.44%

Page 27

XR Results for Classification: Sliding Window Model

CoNLL03 Named Entity Recognition Shared Task

Page 28

XR Results for Classification: Sliding Window Model 2

BioCreativeII 2007 Gene/Gene Product Extraction

Page 29

Noise in Prior Knowledge

What happens when users’ estimates of the class proportions are in error?

Page 30

Noisy Prior Distribution

20% change in the probability of the majority class

CoNLL03 Named Entity Recognition Shared Task

Page 31

Generalized Expectation: Special Cases

• Label Regularization: p(y)

• Expectation Regularization: p(y | feature)

• Generalized Expectation: E[ f(x,y) ] (general case)

p ( BASEBALL | “homerun” ) = 0.95

Page 32

An Alternative Style of Supervision: Classifying Baseball versus Hockey

Traditional: human labeling effort → (semi-)supervised training via maximum likelihood

Generalized Expectation: brainstorm a few keywords (hockey: puck, ice, stick; baseball: ball, field, bat), e.g. p(HOCKEY | “puck”) = 0.9 → semi-supervised training via generalized expectation

Page 33

Labeling Features

[Figure: word clouds of labeled baseball features (e.g. HR, Mets, ball, Sox, runs, batting, base) and hockey features (e.g. Oilers, Pens, Pittsburgh Penguins, Edmonton Oilers, goal, Buffalo, Toronto Maple Leafs, puck, Lemieux, NHL, Bruins), with accuracy rising through 85%, 92%, 94.5%, and 96% as more features are labeled, using ~1000 unlabeled examples.]

Page 34

Human Annotation Experiments

• Three annotators labeled 100 features and 100 documents.

• baseball vs. hockey

[Plot: testing accuracy vs. labeling time in seconds (0-800), comparing GE (feature labeling) with ER (document labeling).]

~2 minutes, 100 features labeled or skipped: 82% accuracy

~15 minutes, 100 documents labeled or skipped: 78% accuracy

Page 35

Human Annotation Experiments

• words that all annotators labeled

Page 36

Generalized Expectation: Special Cases

• Label Regularization: p(y)

• Expectation Regularization: p(y | feature)

• Generalized Expectation: E[ f(x,y) ] (general case)

Page 37

Generalized Expectation (GE) criteria

• Definition: a parameter estimation objective function that expresses preferences on expectations of the model.

• Sometimes in the same equivalence class as:
 – Moment matching
 – Maximum likelihood
 – Maximum entropy

Objective = Score( E[ f(x,y) ] )

Distinctions: not just moments; not necessarily matching a single target value; not necessarily p(data); preferences on a subset of model factors; based on constraints and expectations, but the parameterization is not required to match the constraints.

[McCallum, Mann, Druck 2007]

Page 38

• Generalized Expectation Criteria are terms in a parameter estimation objective function that express a preference on the value of the model expectation of some function. [Mann and McCallum 07, Mann and McCallum 08, Druck et al. 08, Druck et al. 09]

Generalized Expectation (GE)

G(θ) = S(E_{p̃(x)}[E_{p(y|x;θ)}[G(x, y)]])

Page 39

• Generalized Expectation Criteria are terms in a parameter estimation objective function that express a preference on the value of the model expectation of some function. [Mann and McCallum 07, Mann and McCallum 08, Druck et al. 08, Druck et al. 09]

Generalized Expectation (GE)

G(θ) = S(E_{p̃(x)}[E_{p(y|x;θ)}[G(x, y)]])

G: constraint function

Page 40

• Generalized Expectation Criteria are terms in a parameter estimation objective function that express a preference on the value of the model expectation of some function. [Mann and McCallum 07, Mann and McCallum 08, Druck et al. 08, Druck et al. 09]

Generalized Expectation (GE)

G(θ) = S(E_{p̃(x)}[E_{p(y|x;θ)}[G(x, y)]])

G: constraint function
p(y|x; θ): model distribution (CRF)

Page 41

• Generalized Expectation Criteria are terms in a parameter estimation objective function that express a preference on the value of the model expectation of some function. [Mann and McCallum 07, Mann and McCallum 08, Druck et al. 08, Druck et al. 09]

Generalized Expectation (GE)

G(θ) = S(E_{p̃(x)}[E_{p(y|x;θ)}[G(x, y)]])

G: constraint function
p(y|x; θ): model distribution (CRF)
p̃(x): empirical distribution (unlabeled data)

Page 42

• Generalized Expectation Criteria are terms in a parameter estimation objective function that express a preference on the value of the model expectation of some function. [Mann and McCallum 07, Mann and McCallum 08, Druck et al. 08, Druck et al. 09]

Generalized Expectation (GE)

G(θ) = S(E_{p̃(x)}[E_{p(y|x;θ)}[G(x, y)]])

G: constraint function
p(y|x; θ): model distribution (CRF)
p̃(x): empirical distribution (unlabeled data)
S: score function

Page 43

GE Score Functions

• In this proposal, S measures some distance to a target expectation G̃.

• Model expectation: G̃_θ = E_{p̃(x)}[E_{p(y|x;θ)}[G(x, y)]]

• squared difference (L2): S_sq(G̃_θ) = −(G̃ − G̃_θ)²

• KL divergence: S_kl(G̃_θ) = G̃ log G̃_θ

• in some cases, a set of G̃_θ define a probability distribution

• the sum of S_kl over all such G̃_θ defines a negative cross-entropy (the entropy of G̃ is constant)

Page 44

Estimating Parameters with GE

• Objective function: O = Σ_G G(θ) + log p(θ)

• Maximize using gradient methods.

• Partial derivatives with respect to model parameters:

∂/∂θ_j G_kl(θ) = (G̃ / G̃_θ) · E_{p̃(x)}[ E_{p(y|x;θ)}[F_j(x,y) G(x,y)] − E_{p(y|x;θ)}[F_j(x,y)] E_{p(y|x;θ)}[G(x,y)] ]

∂/∂θ_j G_sq(θ) = 2(G̃ − G̃_θ) · E_{p̃(x)}[ E_{p(y|x;θ)}[F_j(x,y) G(x,y)] − E_{p(y|x;θ)}[F_j(x,y)] E_{p(y|x;θ)}[G(x,y)] ]

Page 45

Estimating Parameters with GE

• Objective function: O = Σ_G G(θ) + log p(θ)

• Maximize using gradient methods.

• Partial derivatives with respect to model parameters:

∂/∂θ_j G_kl(θ) = (G̃ / G̃_θ) · E_{p̃(x)}[ E_{p(y|x;θ)}[F_j(x,y) G(x,y)] − E_{p(y|x;θ)}[F_j(x,y)] E_{p(y|x;θ)}[G(x,y)] ]

∂/∂θ_j G_sq(θ) = 2(G̃ − G̃_θ) · E_{p̃(x)}[ E_{p(y|x;θ)}[F_j(x,y) G(x,y)] − E_{p(y|x;θ)}[F_j(x,y)] E_{p(y|x;θ)}[G(x,y)] ]

The bracketed term is the predicted covariance between the constraint function G and the model feature function F_j.
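The covariance form of the gradient can be checked numerically. Below is a sketch under assumptions (a toy softmax model, a constraint G(x,y) = 1[y = 0], and made-up data; none of it is from the talk) that compares the analytic GE gradient for the squared-difference score against finite differences.

```python
import numpy as np

# Verify the GE gradient: d/dtheta_j of the squared score equals
# 2*(target - expectation) times the predicted covariance between the
# constraint function G and the model features F_j. Toy model and data.

rng = np.random.default_rng(1)
C, D, N = 3, 4, 50
theta = rng.normal(size=(C, D))
X = rng.normal(size=(N, D))
g = np.array([1.0, 0.0, 0.0])       # constraint G(x, y) = 1[y == 0]
target = 0.6                        # target expectation (an assumption)

def probs(theta):
    z = X @ theta.T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expectation(theta):
    return float((probs(theta) @ g).mean())   # E_x[E_y[G]]

def score(theta):
    return -(target - expectation(theta)) ** 2

# Analytic gradient via the covariance term E_y[F G] - E_y[F] E_y[G]
p = probs(theta)
eyg = p @ g                                            # E_y[G] per instance
cov = ((p * (g - eyg[:, None]))[:, :, None] * X[:, None, :]).mean(axis=0)
grad = 2 * (target - expectation(theta)) * cov         # shape (C, D)

# Finite-difference check
eps = 1e-6
num = np.zeros_like(theta)
for c in range(C):
    for d in range(D):
        tp = theta.copy(); tp[c, d] += eps
        tm = theta.copy(); tm[c, d] -= eps
        num[c, d] = (score(tp) - score(tm)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```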

Page 46

GE in Practice

• Active Learning with GE

• GE for Dependency Parsing

Page 47

Active Learning

Page 48

Active Learning by Labeling Features

Page 49

Active Learning by Labeling Features

[Screenshot: queries for feature labels; the algorithm allows skipping queries.]

Page 50

Query Selection: Expected Information Gain

• ideal criterion: the expected reduction in model uncertainty after labeling feature k:

Φ_EIG(q_k) = E_{p(g|q_k)}[ E_{p̃(x)}[ H(p(y|x; θ)) − H(p(y|x; θ_g)) ] ]

where θ_g are the new parameter estimates after receiving a particular labeling g.

• drawback: this requires re-training the model with every possible labeling for every feature.

• solution: approximate the expected reduction in uncertainty.

Page 51

Marginal Uncertainty Query Selection

• Approximation: the reduction in uncertainty is ≈ the uncertainty of the marginal distributions at positions where the feature occurs.

• total uncertainty: Φ_TU(q_k) = Σ_i Σ_j q_k(x_i, j) H(p(y_j | x_i; θ))

• weighted uncertainty: Φ_WU(q_k) = log(C_k) Φ_TU(q_k) / C_k, where C_k is the count of feature k
 – addresses bias towards features that are too frequent or too infrequent

• diverse uncertainty: Φ_DU(q_k) = Φ_TU(q_k) · (1/|C|) Σ_{j∈C} (1 − sim(q_k, q_j))
 – chooses uncertain features that appear in diverse contexts
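The uncertainty scores can be sketched on toy data. This is an illustrative reading of the formulas, not the talk's code: q_k(x_i, j) indicates whether candidate feature k fires at position j of instance x_i, and H is the entropy of the model's marginal there. All data below are assumptions.

```python
import numpy as np

# Sketch of uncertainty-based feature query selection on toy sequences.

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

# marginals[i][j] = model marginal distribution over labels at position j
marginals = [
    np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]),
    np.array([[0.6, 0.4], [0.4, 0.6]]),
]
# fires[i][j] = 1 if the candidate feature occurs at position j of instance i
fires = [np.array([1, 1, 0]), np.array([0, 1])]

def total_uncertainty(fires, marginals):
    return sum(float(f[j]) * entropy(m[j])
               for f, m in zip(fires, marginals)
               for j in range(len(f)))

tu = total_uncertainty(fires, marginals)       # total uncertainty
ck = sum(int(f.sum()) for f in fires)          # feature count C_k
wu = np.log(ck) * tu / ck                      # weighted uncertainty
assert tu > 0 and wu > 0
```

Diverse uncertainty would further multiply `tu` by the feature's average dissimilarity (1 − sim) to previously labeled features.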

Page 52

Other Query Selection Baselines

• Methods that are active but do not require re-training:
 – coverage (dissimilarity from other labeled features): Φ_cov(q_k) = C_k · (1/|C|) Σ_{j∈C} (1 − sim(q_k, q_j))
 – similarity (similarity to other labeled features): Φ_sim(q_k) = C_k · max_{j∈C} sim(q_k, q_j)

• Passive baselines: random, frequency, LDA (top words in each topic) [Druck et al. 08]

Page 53

Related Work

• Tandem Learning [Raghavan & Allan 07]

• does not generalize to structured outputs in a straightforward way

• similarity query selection inspired by their method, but performs poorly

• Active Measurement Selection [Liang et al. 09]

• query selection closely related to EIG, but too slow for real experiments

• does not consider skipping queries; limited evaluation; no human experiments

• Dual Supervision [Sindhwani et al. 09]

• does not generalize to structured outputs in a straightforward way

• uses certainty query selection (because a method similar to expectation uncertainty does not work well)

Page 54

“Simulated” Annotation Experiments

• Simulated annotator labeling instances:

• provide the true labels

• Simulated annotator labeling features:

• skip or label?

• labels a feature if the entropy of its distribution over labels is ≤ 0.7

• skips otherwise

• which labels to assign?

• assign the max-probability label, as well as any label whose probability is at least half as large [Druck et al. 08]

Page 55

Experimental Setup

• instance active learning

• query selection: random (rand), uncertainty sampling [Lewis and Gale 94] (US), information density [Settles and Craven 08] (ID)

• training: maximum likelihood + entropy regularization [Jiao 06] (ER)

• feature active learning

• query selection: random, frequency, LDA, coverage (cov), similarity (sim), expectation uncertainty (EU), total uncertainty (TU), weighted uncertainty (WU), diverse uncertainty (DU)

• training: maximum marginal likelihood (MML), GE

• limit candidate set to 500 most frequent features

Page 56

Experimental Setup

• data sets:

• apartments: 300 Craigslist apartment postings, 11 labels

• cora reference extraction: 500 research paper references, 13 labels

• setup:

• each experiment simulates 10 minutes of annotation time

• Measured annotation times for labeling actions (seconds):

[Bar chart, 0-4 second scale: label token, skip feature, label feature.]

Page 57

Simulated Experiments Results

• Active learning with labeled features using GE training outperforms:

• passive or active learning with labeled features using MML

• passive learning with labeled features using GE

• active and passive learning with instances

• Uncertainty based query selection methods generally outperform others.

Page 58

Simulated Experiments Results

• on cora, GE + weighted uncertainty outperforms random after 5 minutes of annotation

Page 59

“Grid” Interface

• Feature queries are concise and easy to browse.

• Suggests new interfaces in which, rather than answering one query at a time, the user is shown groups of queries.

• “Grid” interface:

• displays small groups of related (distributional similarity) features

• may reduce annotation time because features in the same group are likely to have the same label

• within groups, features sorted by query selection metric

• groups that are more uncertain displayed closer to the top

Page 60

Feature Active Learning Grid Interface

Page 61

Human Active Learning Experiments

• human active learning experiments:

• labeling instances (with fast labeling interface)

• labeling features with the serial interface

• labeling features with the grid interface

• Five two-minute sessions per annotator per experiment.

Page 62

Human Annotation Experiments: User 1

Page 63

Human Annotation Experiments: User 2

Page 64

Human Annotation Experiments: User 3

Page 65

Conclusions

• Developed an active learning method in which the user is asked to “label features” instead of labeling instances.

• Outperforms:

• active and passive learning with instances

• passive learning with labeled features

• Suggests new user interfaces that may allow more efficient annotation.

Page 66

Dependency Parsing

Page 67

Problem and Motivation

• Suppose we are given, for some language:

• How do we efficiently learn a dependency parser?

• Why is this important?

• There are low-density languages and sub-domains of languages for which there are no syntactically annotated corpora.

[Diagram: inputs for some language: raw text plus prior knowledge of dependency syntax.]

Page 68

Supervised Solution

• Supervised solution:

• Problem: Syntactic annotation is costly.

• Penn English Treebank: 6 years of development (1989 - 1995)

• Chinese Treebank: 9 years of development (1998 - 2007)

• Arabic Treebank: 2 years of development (2001-2003)

[Diagram: raw text → treebank annotation → supervised learning → parser.]

Page 69

Traditional Semi-Supervised Solution

• Semi-supervised solution:

• Problem: Costs of developing annotation guidelines may dominate total annotation time early in development.

[Diagram: raw text → seed treebank annotation → semi-supervised learning → parser.]

entropy regularization [Smith & Eisner 07], Brown clustering [Koo et al. 08], self-training [McClosky et al. 06], convex loss on unlabeled data [Wang et al. 08]

Page 70

Unsupervised Solution

• Unsupervised solution:

• Possible approaches:

• Dependency model with Valence (DMV) + EM [Klein & Manning 04]

• Contrastive Estimation (CE) [Smith & Eisner 05]

• Others: [Smith & Eisner 06], [Bod 06], [Seginer 07]

[Diagram: raw text + prior knowledge of dependency syntax → unsupervised learning → parser.]

Page 71

Unsupervised Solution

• Unsupervised solution:

• Problem: Requires some limited prior knowledge, but incorporates this information in a cumbersome way.

• developing model structure, tweaking learning algorithm, clever parameter initialization, hyperparameter tuning, devising neighborhood function [Smith & Eisner 05]


Page 72


Our Approach

• Our approach:

• Example constraints:

• A DT should attach to a NN directly to the right 90% of the time.

• The parent of a VBD is the ROOT 75% of the time.

• How to estimate model parameters with such expectation constraints?

Encode prior knowledge directly with model expectation constraints, and use them to learn a feature-rich parser.
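The idea can be made concrete with a penalty term: the training objective pays a cost whenever a model expectation diverges from its target. A minimal sketch, assuming a simple squared-loss form for illustration (the talk's exact criterion may differ, and all names here are invented):

```python
# Minimal sketch of one Generalized Expectation (GE) penalty term.
# Squared loss is an illustrative assumption, not the talk's exact criterion.

def ge_term(model_expectation: float, target_expectation: float) -> float:
    """Penalty added to the objective for one expectation constraint,
    e.g. 'a DT attaches to the NN directly to its right 90% of the time'."""
    return (model_expectation - target_expectation) ** 2

# Example: the model currently produces this attachment 60% of the time,
# while prior knowledge says the target is 90%.
penalty = ge_term(0.6, 0.9)
print(penalty)
```

Summing such terms over all constraints, and subtracting the sum from the (regularized) likelihood, gives an objective that pulls model expectations toward the expert-provided targets.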

Page 73:

Non-Projective Dependency Tree CRF

• x is the input sentence, i.e. xi is the word at position i

• y is a non-projective tree represented as a vector of parent indices, i.e. yi is the index of the parent of word i

• CRF that models the probability of tree y given sentence x

• θ are model parameters

• fj are edge-factored feature functions, i.e. they consider the entire input x and a single edge yi → i

• Zx is the partition function, or the sum of the scores of all possible trees for x

$$p(\mathbf{y} \mid \mathbf{x}; \theta) = \frac{1}{Z_{\mathbf{x}}} \exp\left( \sum_{i=1}^{n} \sum_{j} \theta_j f_j(x_i, x_{y_i}, \mathbf{x}) \right)$$
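For a tiny sentence, this distribution can be computed by brute-force enumeration of parent vectors, which makes the edge-factored structure concrete. The words, features, and weights below are invented for illustration; real implementations compute Z_x with the Matrix-Tree theorem rather than enumeration:

```python
import itertools
import math

# Brute-force sketch of the edge-factored CRF over dependency trees.
# Toy sentence; index 0 is the artificial ROOT.
words = ["ROOT", "the", "dog", "barks"]

def features(head_idx, child_idx):
    # Toy edge-factored features: (head word, child word, direction).
    direction = "R" if head_idx < child_idx else "L"
    return [(words[head_idx], words[child_idx], direction)]

# Toy weights theta; unlisted features get weight 0.
theta = {("ROOT", "barks", "R"): 2.0,
         ("barks", "dog", "L"): 1.5,
         ("dog", "the", "L"): 1.0}

def edge_score(head_idx, child_idx):
    return sum(theta.get(f, 0.0) for f in features(head_idx, child_idx))

def is_tree(parents):
    # parents[i-1] is the head of word i; valid if every word reaches ROOT
    # without revisiting a node (i.e. no cycles).
    for child in range(1, len(words)):
        seen, node = set(), child
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = parents[node - 1]
    return True

def tree_score(parents):
    return math.exp(sum(edge_score(parents[i - 1], i)
                        for i in range(1, len(words))))

trees = [p for p in itertools.product(range(len(words)), repeat=len(words) - 1)
         if is_tree(p)]
Z = sum(tree_score(p) for p in trees)   # partition function Z_x
best = max(trees, key=tree_score)       # highest-scoring tree
print(best, tree_score(best) / Z)
```

Here the highest-scoring parent vector is (2, 3, 0): "the" attaches to "dog", "dog" to "barks", and "barks" to ROOT.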

Page 74:

Experiments

Page 75:

Simulated “Oracle” Constraints

• In some experiments, prior knowledge is simulated using an “oracle” that looks at labeled data.

• Oracle constraint selection uses a few simple statistics:

• count: $c(g) = \sum_{\mathbf{x} \in D} \sum_{i} \sum_{j} g(x_i, x_j, \mathbf{x})$

• edge count: $c_{\text{edge}}(g) = \sum_{(\mathbf{x},\mathbf{y}) \in D} \sum_{i} g(x_i, x_{y_i}, \mathbf{x})$

• edge probability: $p(\text{edge} \mid g) = c_{\text{edge}}(g) / c(g)$

• Target expectations are the true edge probabilities, binned into the set {0, 0.1, 0.25, 0.5, 0.75, 1}
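These statistics are simple counts over a labeled corpus. A minimal sketch with an invented toy corpus and an invented constraint function g (a DT with an NN immediately to its right):

```python
# Toy illustration of the oracle constraint statistics.
# Each sentence: (POS tags, parents), where parents[i] is the index of the
# head of word i and -1 marks attachment to the artificial ROOT.
corpus = [
    (["DT", "NN", "VBD"], [1, 2, -1]),
    (["DT", "NN", "VBD", "NN"], [2, 2, -1, 2]),  # toy: DT attaches to VBD
]

def g(tags, i, j):
    # Invented constraint feature: word i is a DT and the candidate head j
    # is an NN directly to its right.
    return tags[i] == "DT" and j == i + 1 and tags[j] == "NN"

# count c(g): all (i, j) pairs in all sentences where g fires
count = sum(g(tags, i, j)
            for tags, _ in corpus
            for i in range(len(tags))
            for j in range(len(tags)))

# edge count c_edge(g): pairs where g fires AND (j, i) is a true edge
edge_count = sum(g(tags, parents_i := i, parents[i])
                 for tags, parents in corpus
                 for i in range(len(tags))
                 if parents[i] >= 0)

edge_prob = edge_count / count

# Bin the true edge probability to the nearest target expectation.
bins = [0, 0.1, 0.25, 0.5, 0.75, 1]
target = min(bins, key=lambda b: abs(b - edge_prob))
print(count, edge_count, edge_prob, target)
```

In this toy corpus the DT–NN pattern occurs twice but is a true edge only once, so the binned target expectation is 0.5.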

Page 76:

Comparison with Unsupervised Methods

Corpus:

• WSJ10: WSJ portion of Penn Treebank, stripped of punctuation, sentences of 10 words or fewer, only POS tags (unlexicalized)

Models:

• DMV [Klein and Manning 04]: does not model distance, can model arity and sibling relationships.

• Restricted CRF: only features of type (parent-POS ∧ child-POS ∧ direction). Weaker than DMV (ignoring projective vs non-projective).

• CRF: [McDonald et al. 05] features; models distance, but still unlexicalized.

• baseline: assigns target expectations as scores to edges covered by constraints, then runs MST with the resulting scores

Page 77:

Comparison with Unsupervised Methods

• parameter estimation methods:

• DMV (results from [Smith 06]):

• expectation maximization (EM)

• contrastive estimation (CE)

• restricted CRF / CRF:

• supervised maximum likelihood (upper bound)

• generalized expectation (GE)

Page 78:

Oracle Expectation Constraints

• constraint selection: sort functions (parent-POS ∧ child-POS ∧ direction) with count ≥ 200 by edge probability

• first 20 constraints selected:

• POS tags in sentence order, head → modifier, grouped by head
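The selection rule above amounts to a filter-and-sort. A minimal sketch; the statistics dictionary below is fabricated for illustration (real counts come from the treebank):

```python
# Sketch of oracle constraint selection: keep feature functions of type
# (parent-POS, child-POS, direction) with count >= 200, sort by edge
# probability, take the top 20. All numbers here are invented.

# (parent-POS, child-POS, direction) -> (count, edge_count)
stats = {
    ("NN", "DT", "L"): (5000, 4500),
    ("VBD", "NN", "L"): (3000, 1800),
    ("IN", "NN", "R"): (150, 140),      # dropped: count < 200
    ("ROOT", "VBD", "R"): (1000, 750),
}

candidates = [(g, edge_count / count)
              for g, (count, edge_count) in stats.items()
              if count >= 200]
selected = sorted(candidates, key=lambda t: -t[1])[:20]
print(selected)
```

With these invented counts, the DT-attaches-left-to-NN constraint ranks first with edge probability 0.9.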

Page 79:

Human Provided Expectation Constraints

• Constraints extracted from grammar fragments in [Haghighi & Klein 06]

• Target expectations provided using (limited!) prior knowledge

• Based on the output, we refined target expectations and added new constraints

Page 80:

GE vs. Supervised & Baseline

• GE outperforms the baseline

• human constraints provide accuracy comparable to oracle

• GE performs much better in conjunction with a feature-rich model

[Plot: accuracy (10–90) vs. number of constraints (10–60); series: constraint baseline, supervised CRF (restricted), supervised CRF, GE CRF (restricted), GE CRF, GE CRF (human constraints)]

Page 81:

GE vs. Unsupervised

• GE outperforms DMV EM with 10 (CRF) or 20 (restricted CRF) constraints.

• GE outperforms DMV CE with 20 (CRF) or 50 (restricted CRF) constraints.

[Plot: accuracy (10–80) vs. number of constraints (10–60); series: attach-right baseline, DMV EM, DMV CE, GE CRF (restricted), GE CRF, GE CRF (human constraints)]

Page 82:

Sensitivity of Unsupervised Methods

• sensitivity of DMV EM to initialization [Smith 06]:

• reported results use best of three parameter initialization methods, the method of [Klein & Manning 04]

• the other initialization methods give accuracies below 32%

• sensitivity of DMV CE to neighborhood function [Smith 06]:

• reported results use the best of eight neighborhood functions, DEL1ORTRANS1

• DEL1ORTRANS2 gives 51.2% accuracy

• the other six give accuracy of less than 50%

Page 83:

Generalized Expectation Criteria: Easy Communication with Domain Experts

• Inject domain knowledge into parameter estimation

• Like an “informative prior”...

• ...but rather than the “language of parameters” (difficult for humans to understand)...

• ...GE uses the “language of expectations” (natural for humans)

Page 84:

Use of Domain Knowledge

• “Expectations” are a natural language in which to express expertise.

• GE translates expectations into parameter estimation objective.

• The expert has the knowledge. We must provide ML tools to integrate it safely.