Rapid Training of Information Extraction with Local and Global Data Views

Dissertation Defense
Ang Sun, Computer Science Department, New York University
April 30, 2012

Committee: Prof. Ralph Grishman, Prof. Satoshi Sekine, Prof. Heng Ji, Prof. Ernest Davis, Prof. Lakshminarayanan Subramanian


Page 1: Rapid Training of Information Extraction with Local and Global Data Views

Rapid Training of Information Extraction with Local and Global Data Views

Dissertation Defense
Ang Sun

Computer Science Department
New York University

April 30, 2012

Committee: Prof. Ralph Grishman, Prof. Satoshi Sekine, Prof. Heng Ji, Prof. Ernest Davis, Prof. Lakshminarayanan Subramanian

Page 2: Rapid Training of Information Extraction with Local and Global Data Views

Outline

I. Introduction

II. Relation Type Extension: Active Learning with Local and Global Data Views

III. Relation Type Extension: Bootstrapping with Local and Global Data Views

IV. Cross-Domain Bootstrapping for Named Entity Recognition

V. Conclusion

Page 3: Rapid Training of Information Extraction with Local and Global Data Views

Part I

Introduction

Page 4: Rapid Training of Information Extraction with Local and Global Data Views

Tasks

1. Named Entity Recognition (NER)

2. Relation Extraction (RE)
   i. Relation Extraction between Names
   ii. Relation Mention Extraction

Page 5: Rapid Training of Information Extraction with Local and Global Data Views

NER

Bill Gates, born October 28, 1955 in Seattle, is the former chief executive officer (CEO) and current chairman of Microsoft.

NER output:

Name | Type
Bill Gates | PERSON
Seattle | LOCATION
Microsoft | ORGANIZATION

Page 6: Rapid Training of Information Extraction with Local and Global Data Views

RE: i. Relation Extraction between Names

Adam, a data analyst for ABC Inc.

NER finds the names; RE extracts Employment(Adam, ABC Inc.)

Page 7: Rapid Training of Information Extraction with Local and Global Data Views

RE: ii. Relation Mention Extraction

Entity Extraction

Adam, a data analyst for ABC Inc.

Entity Mention | Entity
Adam | {Adam, a data analyst}
a data analyst | {Adam, a data analyst}
ABC Inc. | {ABC Inc.}

Page 8: Rapid Training of Information Extraction with Local and Global Data Views

RE: ii. Relation Mention Extraction

Adam, a data analyst for ABC Inc.

RE extracts Employment(a data analyst, ABC Inc.)

Page 9: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Supervised Learning

• Learn with labeled data {(x_i, y_i)}

  – <Bill Gates, PERSON>
  – <<Adam, ABC Inc.>, Employment>

Page 10: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Supervised Learning

Token-by-token annotation of PERSON names (P = person token, O = other), built up one line at a time on the slide:

O.  J.  Simpson  was
P   P   P        O
arrested  and  charged  with
O         O    O        O
murdering  his  ex-wife  ,
O          O    O        O
Nicole  Brown  Simpson  ,
P       P      P        O
and  her  friend  Ronald
O    O    O       P
Goldman  in  1994  .
P        O   O     O

Expensive!

Page 11: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Supervised Learning

• Expensive
• A trained model is typically domain-dependent
  – Porting it to a new domain usually involves annotating data from scratch

Page 12: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Supervised Learning

O.  J.  Simpson  was
P   P   P        O
arrested  and  charged  with
O         O    O        O
murdering  his  ex-wife  ,
O          O    O        O
Nicole  Brown  Simpson  ,
P       P      P        O
and  her  friend  Ronald
O    O    O       P
Goldman  in  1994  .
P        O   O     O

Annotation is tedious! (15 minutes … 1 hour … 2 hours)

Page 13: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Semi-supervised Learning

• Learn with both
  – labeled data {(x_i, y_i)} (small)
  – unlabeled data {x_i} (large)

• The learning is an iterative process
  1. Train an initial model with labeled data
  2. Apply the model to tag unlabeled data
  3. Select good tagged examples as additional training examples
  4. Re-train the model
  5. Repeat from Step 2
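The five steps above can be sketched as a generic self-training loop. The `train`, `tag`, and `select` helpers are hypothetical placeholders supplied by the caller, not the thesis implementation:

```python
def self_train(labeled, unlabeled, train, tag, select, iterations=10):
    """Generic self-training: grow the labeled set with confidently tagged examples.

    train/tag/select are caller-supplied (hypothetical) helpers:
      train(labeled)        -> model
      tag(model, unlabeled) -> [(example, label, confidence), ...]
      select(tagged)        -> [(example, label), ...] worth keeping
    """
    model = train(labeled)                      # Step 1: initial model
    for _ in range(iterations):
        tagged = tag(model, unlabeled)          # Step 2: tag unlabeled data
        good = select(tagged)                   # Step 3: keep good examples
        if not good:
            break                               # nothing left to promote
        labeled.extend(good)
        promoted = {example for example, _ in good}
        unlabeled = [x for x in unlabeled if x not in promoted]
        model = train(labeled)                  # Step 4: re-train; Step 5: repeat
    return model
```

The two problems discussed on the next slides (semantic drift, no stopping criterion) both live in the `select` step and the loop bound of this sketch.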

Page 14: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Semi-supervised Learning

• Problem 1: Semantic Drift

Example 1: a learner for PERSON names ends up learning flower names, because women's first names intersect with names of flowers (Rose, ...)

Example 2: a learner for LocatedIn relation patterns ends up learning patterns for other relations (birthPlace, governorOf, ...)

Page 15: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Semi-supervised Learning

• Problem 2: Lacks a good stopping criterion

• Most systems
  – either use a fixed number of iterations
  – or use a labeled development set to detect the right stopping point

Page 16: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Unsupervised Learning

• Learn with only unlabeled data {x_i}

• Unsupervised Relation Discovery
  – Context-based clustering
  – Group pairs of named entities with similar context into the same relation cluster

Page 17: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Unsupervised Learning

• Unsupervised Relation Discovery (Hasegawa et al., 2004)

Page 18: Rapid Training of Information Extraction with Local and Global Data Views

Prior Work – Unsupervised Learning

• Unsupervised Relation Discovery
  – The semantics of clusters are usually unknown
  – Some clusters are coherent, so we can consistently label them
  – Some are mixed, containing different topics, and are difficult to label

Page 19: Rapid Training of Information Extraction with Local and Global Data Views

Part II
Relation Type Extension: Active Learning with Local and Global Data Views

Page 20: Rapid Training of Information Extraction with Local and Global Data Views

Relation Type Extension

• Extend a relation extraction system to new types of relations

ACE 2004 Relations:

Type | Example
EMP-ORG | the CEO of Microsoft
PHYS | a military base in Germany
GPE-AFF | U.S. businessman
PER-SOC | his ailing father
ART | US helicopters
OTHER-AFF | Cuban-American people

Multi-class Setting:
  Target relation: one of the ACE relation types
  Labeled data: 1) a few labeled examples of the target relation (possibly by random selection); 2) all labeled auxiliary relation examples
  Unlabeled data: all other examples in the ACE corpus

Page 21: Rapid Training of Information Extraction with Local and Global Data Views

Relation Type Extension

• Extend a relation extraction system to new types of relations

ACE 2004 Relations:

Type | Example
EMP-ORG | the CEO of Microsoft
PHYS | a military base in Germany
GPE-AFF | U.S. businessman
PER-SOC | his ailing father
ART | US helicopters
OTHER-AFF | Cuban-American people

Binary Setting:
  Target relation: one of the ACE relation types
  Labeled data: a few labeled examples of the target relation (possibly by random selection)
  Unlabeled data: all other examples in the ACE corpus

Page 22: Rapid Training of Information Extraction with Local and Global Data Views

LGCo-Testing

• LGCo-Testing := co-testing with local and global views

• The general idea

1. Train one classifier based on the local view(the sentence that contains the pair of entities)

2. Train another classifier based on the global view(distributional similarities between relation instances)

3. Reduce annotation cost by only requesting labels of contention data points

Page 23: Rapid Training of Information Extraction with Local and Global Data Views

• The local view

<e1>President Clinton</e1> traveled to <e2>the Irish border</e2> for an evening ceremony.

Words before entity 1: {NIL}
Words between: {travel, to}
Words after entity 2: {for, an}
# words between: 2
Token pattern coupled with entity types: PERSON_traveled_to_LOCATION

Token Sequence

Syntactic Parse Tree

Path of phrase labels connecting E1 and E2, augmented with the head word of the top phrase: NP--S--traveled--VP--PP

Page 24: Rapid Training of Information Extraction with Local and Global Data Views

• The local view

<e1>President Clinton</e1> traveled to <e2>the Irish border</e2> for an evening ceremony.

Dependency Parse Tree

Shortest path connecting the two entities, coupled with entity types: PER_nsubj'_traveled_prep_to_LOC
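The token-sequence part of the local view can be sketched as follows. The span-based input format is an assumption for illustration; the actual system also uses the parse-based features listed above:

```python
def local_view_features(tokens, e1, e2, type1, type2):
    """Token-sequence local-view features for a relation candidate.

    tokens: the sentence as a token list
    e1, e2: (start, end) token spans of the two entities, e1 before e2
    type1, type2: entity types (e.g. PERSON, LOCATION)
    """
    before = tokens[max(0, e1[0] - 2):e1[0]]   # words before entity 1
    between = tokens[e1[1]:e2[0]]              # words between the entities
    after = tokens[e2[1]:e2[1] + 2]            # words after entity 2
    return {
        "words_before_e1": before or ["NIL"],
        "words_between": between,
        "words_after_e2": after,
        "num_words_between": len(between),
        # token pattern coupled with entity types
        "token_pattern": "_".join([type1] + between + [type2]),
    }
```

On the slide's example sentence this reproduces the feature values shown above (two words between, pattern PERSON_traveled_to_LOCATION).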

Page 25: Rapid Training of Information Extraction with Local and Global Data Views

• The local view

• The local view classifier
  – Binary Setting: MaxEnt binary classifier
  – Multi-class Setting: MaxEnt multi-class classifier

Page 26: Rapid Training of Information Extraction with Local and Global Data Views

• The global view

Corpus of 2,000,000,000 tokens (7-grams: * * * * * * *)

1. Compile corpus to database of 7-grams

2. Represent each relation instance as a relational phrase

3. Compute distributional similarities between phrases in the 7-grams database

4. Build a relation classifier based on the K-nearest neighbor idea

The General Idea

Relation Instance | Relational Phrase
<e1>Clinton</e1> traveled to <e2>the Irish border</e2> for … | traveled to
… <e2><e1>his</e1> brother</e2> said that … | his brother

Page 27: Rapid Training of Information Extraction with Local and Global Data Views

• Compute distributional similarities

<e1>President Clinton</e1> traveled to <e2>the Irish border</e2> for an evening ceremony.

Queries for traveled to in the 7-gram database:
* * * * * traveled to
traveled to * * * * *
* * * * traveled to *
* traveled to * * * *
* * * traveled to * *
* * traveled to * * *

> * * * traveled to * *
3  's headquarters here traveled to the U.S.
4  laundering after he traveled to the country
3  , before Paracha traveled to the United
3  have never before traveled to the United
3  had in fact traveled to the United
4  two Cuban grandmothers traveled to the United
3  officials who recently traveled to the United
6  President Lee Teng-hui traveled to the United
4  1996 , Clinton traveled to the United
4  commission members have traveled to the United
4  De Tocqueville originally traveled to the United
4  Fernando Henrique Cardoso traveled to the United
3  Elian 's grandmothers traveled to the United

Page 28: Rapid Training of Information Extraction with Local and Global Data Views

• Compute distributional similarities

<e1>Ang</e1> arrived in <e2>Seattle</e2> on Wednesday.

> * * * arrived in * *
4  Arafat , who arrived in the other
5  of sorts has arrived in the new
5  inflation has not arrived in the U.S.
3  Juan Miguel Gonzalez arrived in the U.S.
6  it almost certainly arrived in the New
7  4 to Brazil , arrived in the country
4  said Annan had arrived in the country
21 he had just arrived in the country
5  had not yet arrived in the country
3  when they first arrived in the country
3  day after he arrived in the country
5  children who recently arrived in the country
4  Iraq Paul Bremer arrived in the country
3  head of counterterrorism arrived in the country
3  election monitors have arrived in the country

Page 29: Rapid Training of Information Extraction with Local and Global Data Views

• Compute distributional similarities
  – Represent each phrase as a feature vector of contextual tokens
  – Compute cosine similarity between two feature vectors
  – Feature weight?

President Clinton traveled to the Irish border

<L2_President, L1_Clinton, R1_the, R2_Irish, R3_Border>
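The position-tagged context representation above can be sketched as follows; raw counts stand in for the feature weights here (the following slides replace them with tf-idf):

```python
import math

def context_vector(left, right):
    """Sparse vector of position-tagged context tokens around a phrase.

    left/right: tokens to the left/right of the phrase occurrence,
    keyed L1_, L2_, ... (nearest first) and R1_, R2_, ...
    """
    vec = {}
    for i, w in enumerate(reversed(left), 1):      # nearest left token is L1
        key = f"L{i}_{w}"
        vec[key] = vec.get(key, 0) + 1
    for i, w in enumerate(right, 1):               # nearest right token is R1
        key = f"R{i}_{w}"
        vec[key] = vec.get(key, 0) + 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

In the full system a phrase's vector aggregates the contexts of all its corpus occurrences; the sketch shows a single occurrence.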

Page 30: Rapid Training of Information Extraction with Local and Global Data Views

Features for traveled to (sorted by frequency)

1-10 11-20 21-30 31-40

R1_the L1_have R4_to R3_in

L2_, R2_and R1_Washington L2_and

R2_to R2_in R1_New L1_He

L1_who L3_. R4_, R1_a

R2_, L1_and R2_on L1_also

L1_, R3_to R4_the R3_a

L1_had L2_who R1_China R2_with

L1_he R2_for L4_, L3_the

L3_, L4_. L2_the L2_when

L1_has R3_the R3_, L1_then

Features for arrived in (sorted by frequency)

1-10 11-20 21-30 31-40

R1_the R1_Beijing R2_in R3_a

R2_on L1_had R3_, R4_a

L1_who R2_to R2_for R4_the

L2_, R3_on R3_for L3_,

L1_, L4_. R2_from R1_a

L3_. R3_the R4_for L1_they

R2_, R1_New R3_to R4_to

L1_he R2_. L3_the R1_Moscow

L1_has L2_when R3_capital L5_.

L1_have R4_, L2_the L3_The

Feature weight: use raw frequency?

Page 31: Rapid Training of Information Extraction with Local and Global Data Views

Feature weight: use tf-idf

tf: the number of corpus instances of P having feature f, divided by the number of instances of P

idf: the total number of phrases in the corpus divided by the number of phrases with at least one instance with feature f
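The two ratios above transcribe directly into code. The slide gives plain ratios with no logarithm on idf, so none is applied here:

```python
def tfidf(n_inst_with_f, n_inst, n_phrases, n_phrases_with_f):
    """tf-idf weight of feature f for phrase P, as defined on the slide.

    n_inst_with_f:    corpus instances of P having feature f
    n_inst:           corpus instances of P
    n_phrases:        total number of phrases in the corpus
    n_phrases_with_f: phrases with at least one instance having feature f
    """
    tf = n_inst_with_f / n_inst            # fraction of P's instances with f
    idf = n_phrases / n_phrases_with_f     # inverse phrase frequency of f
    return tf * idf
```

A feature shared by most phrases (large `n_phrases_with_f`, e.g. R1_the) gets a small idf, which is exactly what demotes the stopword features in the re-sorted tables that follow.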

Page 32: Rapid Training of Information Extraction with Local and Global Data Views

Feature weight: use tf-idf

Features for traveled to (sorted by tf-idf)

1-10 11-20 21-30 31-40

L1_had L1_He L1_then R1_Beijing

L1_who R1_New L1_she R1_London

L1_he L2_who L1_also R2_for

L1_has R1_China R2_York R2_in

L1_have R2_, R1_Afghanistan L2_when

R2_to L1_recently L1_Zerhouni R1_Baghdad

L1_, R1_Thenia L1_Clinton R1_Mexico

R1_the L1_and L1_they L2_He

R1_Washington R1_Europe L3_Nouredine R4_to

L2_, R1_Cuba R2_and R2_United

Features for arrived in (sorted by tf-idf)

1-10 11-20 21-30 31-40

L1_who R1_Baghdad L1_they R1_Seoul

R2_on R1_Moscow R2_Sunday L5_.

R1_Beijing L1_delegation R2_Tuesday R1_Damascus

L1_has R3_capital R1_Washington R2_,

L1_he R1_New R3_Monday R3_Wednesday

L1_have L3_. R2_Wednesday R3_Thursday

L1_had L2_, R3_Sunday R2_from

L1_, L1_He R2_Thursday R1_Amman

R1_Cairo L2_when R3_Tuesday L3_Minister

R1_the R2_Monday R2_York R1_Belgrade

Page 33: Rapid Training of Information Extraction with Local and Global Data Views

• Compute distributional similarities

Sample of similar phrases:

Phrases similar to traveled to: visited (0.779), arrived in (0.763), worked in (0.751), lived in (0.719), served in (0.686), consulted with (0.672), played for (0.670)

Phrases similar to his family: his staff (0.792), his brother (0.789), his friends (0.780), his children (0.769), their families (0.753), his teammates (0.746), his wife (0.725)

Page 34: Rapid Training of Information Extraction with Local and Global Data Views

• The global view classifier

Phrases similar to traveled to: visited (0.779), arrived in (0.763), worked in (0.751), lived in (0.719), served in (0.686), consulted with (0.672), played for (0.670)

Phrases similar to his family: his staff (0.792), his brother (0.789), his friends (0.780), his children (0.769), their families (0.753), his teammates (0.746), his wife (0.725)

k-nearest neighbor classifier: classify an unlabeled example based on the closest labeled examples

<e1>President Clinton</e1> traveled to <e2>the Irish border</e2>  PHYS-LocatedIn (labeled)

<e1>Ang Sun</e1> arrived in <e2>Seattle</e2> on Wednesday.  ? → PHYS-LocatedIn

… <e2><e1>his</e1> brother</e2> said that …  PER-SOC (labeled)

sim(arrived in, traveled to) = 0.763
sim(arrived in, his brother) = 0.012
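The global-view classifier above is a small k-NN over precomputed phrase similarities; a minimal sketch, with the similarity function passed in:

```python
from collections import Counter

def knn_classify(phrase, labeled, sim, k=1):
    """k-nearest-neighbor relation classifier over relational phrases.

    phrase:  relational phrase of the unlabeled instance
    labeled: list of (phrase, relation) pairs
    sim:     similarity function over phrases (e.g. distributional cosine)
    """
    nearest = sorted(labeled, key=lambda p: sim(phrase, p[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

With the slide's numbers, "arrived in" lands on PHYS-LocatedIn because its nearest labeled phrase is "traveled to" (0.763), not "his brother" (0.012).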

Page 35: Rapid Training of Information Extraction with Local and Global Data Views

• LGCo-Testing Procedure in Detail

Use KL-divergence to quantify the disagreement between the two classifiers

KL-divergence:
  – 0 for identical distributions
  – maximal when the distributions are peaked and prefer different class labels

Rank instances in descending order of KL-divergence
Pick the top 5 instances to request human labels during a single iteration
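The selection step can be sketched as follows. The slide does not specify a direction or smoothing for the divergence, so plain KL(p || q) with a small epsilon is an assumption:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two class distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_queries(instances, local_probs, global_probs, top=5):
    """Rank unlabeled instances by local/global classifier disagreement
    and return the top ones to send to the human annotator."""
    ranked = sorted(
        instances,
        key=lambda i: kl_divergence(local_probs[i], global_probs[i]),
        reverse=True,                 # highest disagreement first
    )
    return ranked[:top]
```

Instances where both views agree contribute near-zero divergence and are never queried, which is what saves annotation.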

Page 36: Rapid Training of Information Extraction with Local and Global Data Views

Active Learning Baselines

• RandomAL

• UncertaintyAL
  – Local view classifier
  – Sample selection: entropy h(p) = -Σ_i p(c_i) log p(c_i)

• UncertaintyAL+
  – Local view classifier (with phrase cluster features)
  – Sample selection: entropy h(p) = -Σ_i p(c_i) log p(c_i)

• SPCo-Testing
  – Co-Testing (sequence view classifier and parsing view classifier)
  – Sample selection: KL-divergence

Page 37: Rapid Training of Information Extraction with Local and Global Data Views

Annotation speed: 4 instances per minute, 200 instances per hour (annotator takes a 10-minute break each hour)

Supervised: 36K instances, 180 hours
LGCo-Testing: 300 instances, 1.5 hours

[Figure: learning curves for PER-SOC, F1 (0-80) vs. # labeled instances (0-1000), successively adding RandomAL, UncertaintyAL, UncertaintyAL+, SPCo-Testing, and LGCo-Testing against the supervised baseline]

Results for PER-SOC (Multi-class Setting)

Results for other types of relations have similar trends (in both binary and multiclass settings)

Page 38: Rapid Training of Information Extraction with Local and Global Data Views

Precision-recall Curve of LGCo-Testing (Multi-class setting)

[Figure: precision (0-90) vs. recall (15-80) curves for EMP-ORG, PER-SOC, ART, OTHER-AFF, GPE-AFF, and PHYS]

Page 39: Rapid Training of Information Extraction with Local and Global Data Views

Comparing LGCo-Testing with the Two Settings

[Figure: F1 difference (-50 to 0) vs. # labels (0-1000) for GPE-AFF and OTHER-AFF, in both binary and multi-class settings]

F1 difference (in percentage) = F1 of active learning minus F1 of supervised learning

The reduction of annotation cost from incorporating auxiliary types is more pronounced in early learning stages (#labels < 200) than in later ones.

Page 40: Rapid Training of Information Extraction with Local and Global Data Views

Part III
Relation Type Extension: Bootstrapping with Local and Global Data Views

Page 41: Rapid Training of Information Extraction with Local and Global Data Views

Basic Idea

• Consider a bootstrapping procedure to discover semantic patterns for extracting relations between named entities

Page 42: Rapid Training of Information Extraction with Local and Global Data Views

Basic Idea

• It starts from some seed patterns, which are used to extract named entity (NE) pairs, which in turn result in more semantic patterns learned from the corpus.

Page 43: Rapid Training of Information Extraction with Local and Global Data Views

Basic Idea

• Semantic drift occurs because
  1) a pair of names may be connected by patterns belonging to multiple relations
  2) the bootstrapping procedure looks at the patterns in isolation

Named Entity 1 | Pattern | Named Entity 2
Bill Clinton | visit, born in, fly to, governor of, arrive in, campaign in, … | Arkansas

Page 44: Rapid Training of Information Extraction with Local and Global Data Views

Unguided Bootstrapping: the NE Pair Ranker uses local evidence, looking at the patterns in isolation

Guided Bootstrapping: the NE Pair Ranker uses global evidence, taking into account the clusters (C_i) of patterns

Page 45: Rapid Training of Information Extraction with Local and Global Data Views

Unguided Bootstrapping

• Initial Settings:
  – The seed patterns for the target relation R have precision 1, and all other patterns 0
  – All NE pairs have confidence 0

Page 46: Rapid Training of Information Extraction with Local and Global Data Views

Unguided Bootstrapping

• Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs
  – If many of the k patterns connecting the two names are high-precision patterns,
  – then the name pair should have a high confidence.

  – The confidence of NE pairs is estimated as

    Conf(N_i) = 1 - ∏_{j=1}^{k} (1 - Prec(p_j))

  – Problem: over-rates NE pairs which are connected by patterns belonging to multiple relations
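The noisy-or confidence above transcribes directly: a pair is confident as soon as at least one of the k patterns connecting it is high-precision.

```python
def pair_confidence(pattern_precisions):
    """Noisy-or confidence of an NE pair:
    Conf(N_i) = 1 - prod_j (1 - Prec(p_j)) over the patterns matching it."""
    none_fire = 1.0
    for prec in pattern_precisions:
        none_fire *= (1.0 - prec)    # probability that no pattern is "right"
    return 1.0 - none_fire
```

This is also where the over-rating problem lives: a pair like <Clinton, Arkansas> connected by many moderately precise patterns from *different* relations still scores high, because the product ignores which relation each pattern expresses.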

Page 47: Rapid Training of Information Extraction with Local and Global Data Views

Unguided Bootstrapping

• Step 2: Use NE pairs to search for new patterns and rank patterns
  – Similarly, for a pattern p,
  – if many of the NE pairs it matches are very confident,
  – then p has many supporters and should have a high ranking

  – Estimation of the confidence of patterns:

    Conf(p) = (Sup(p) / |H|) · log Sup(p)

    where |H| is the number of unique NE pairs matched by p, and Sup(p) is the sum of the support from the |H| pairs

Page 48: Rapid Training of Information Extraction with Local and Global Data Views

Unguided Bootstrapping

• Step 2: Use NE pairs to search for new patterns and rank patterns
  – Sup(p) is the sum of the support it can get from the |H| pairs:

    Sup(p) = Σ_{j=1}^{|H|} Conf(N_j)

  – The precision of p is given by the average confidence of the NE pairs matched by p:

    Prec(p) = Sup(p) / |H|

  • This normalizes the precision to range from 0 to 1
  • As a result the confidence of each NE pair is also normalized to between 0 and 1
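The three pattern-ranking quantities fit in a few lines. The transcript's confidence formula is partly garbled, so the Snowball-style form Prec(p) · log Sup(p) is an assumption reconstructed from the surrounding definitions:

```python
import math

def pattern_scores(pair_confidences):
    """Support, precision, and confidence of a pattern p.

    pair_confidences: Conf(N_j) for the |H| unique NE pairs matched by p
    (must be non-empty, with positive total support for the log).
    """
    sup = sum(pair_confidences)          # Sup(p) = sum of pair confidences
    h = len(pair_confidences)            # |H| = number of unique pairs
    prec = sup / h                       # Prec(p) in [0, 1]
    conf = prec * math.log(sup)          # Conf(p) = Prec(p) * log Sup(p)
    return sup, prec, conf
```

The log term rewards patterns with many supporters, so a pattern matched by two perfect pairs outranks one matched by a single perfect pair even though both have precision 1.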

Page 49: Rapid Training of Information Extraction with Local and Global Data Views

Unguided Bootstrapping

• Step 3: Accept patterns
  – accept the K top-ranked patterns from Step 2

• Step 4: Loop or stop
  – The procedure now decides whether to repeat from Step 1 or to terminate
  – Most systems simply do NOT know when to stop

Page 50: Rapid Training of Information Extraction with Local and Global Data Views

Guided Bootstrapping

• Pattern Clusters -- Clustering steps:

I. Extract features for patterns

II. Compute the tf-idf value of extracted features

III. Compute the cosine similarity between patterns

IV. Build a pattern hierarchy by complete linkage

Sample features for “X visited Y” as in “Jordan visited China”

Page 51: Rapid Training of Information Extraction with Local and Global Data Views

Guided Bootstrapping

• Pattern Clusters
  – We use 0.005 to cut the pattern hierarchy into clusters
  – This cutoff is decided by
    • trying a series of thresholds
    • searching for the maximal one that is capable of placing the seed patterns for each relation into a single cluster
  – We define the target cluster C_t as the one containing the seeds

Page 52: Rapid Training of Information Extraction with Local and Global Data Views

Guided Bootstrapping

• Pattern cluster example– Top 15 patterns in the Located-in Cluster

Page 53: Rapid Training of Information Extraction with Local and Global Data Views

Guided Bootstrapping

• Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs

Global_Conf(N_i) measures the degree of association between N_i and C_t: the number of times patterns in C_t match N_i, divided by the total number of pattern instances matching N_i

Page 54: Rapid Training of Information Extraction with Local and Global Data Views

Guided Bootstrapping

• Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs

Why does this give a better confidence estimate?

<Clinton, Arkansas> for the Located-in relation:
  – Local_Conf(N_i) is very high
  – Global_Conf(N_i) is very low (less than 0.1)
  – Conf(N_i) is low: the high Local_Conf(N_i) is discounted by the low Global_Conf(N_i)
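The discounting above can be sketched as follows. The slide does not spell out how the two scores combine, so a simple product is assumed:

```python
def guided_pair_confidence(local_conf, cluster_matches, total_matches):
    """Guided confidence of an NE pair N_i (assumed combination: product).

    local_conf:      noisy-or confidence from the patterns in isolation
    cluster_matches: times patterns in the target cluster C_t match N_i
    total_matches:   total pattern instances matching N_i
    """
    global_conf = cluster_matches / total_matches   # association with C_t
    return local_conf * global_conf
```

A pair like <Clinton, Arkansas>, matched mostly by patterns from other relations (governor of, born in, ...), keeps a high local score but a tiny cluster fraction, so the combined confidence collapses.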

Page 55: Rapid Training of Information Extraction with Local and Global Data Views

Guided Bootstrapping

• Step 2: Use NE pairs to search for new patterns and rank patterns
  – All the measurement functions are the same as those used in the unguided bootstrapping
  – However, with better ranking of NE pairs in Step 1, the patterns are also ranked better

• Step 3: Accept patterns
  – We also accept the K top-ranked patterns

Page 56: Rapid Training of Information Extraction with Local and Global Data Views

Guided Bootstrapping

• Step 4: Loop or stop

Since each pattern in our corpus has a cluster membership, we can
  – monitor the semantic drift easily
  – and stop naturally
    • it drifts when the procedure tries to accept patterns which do not belong to the target cluster
    • we can stop when the procedure tends to accept more patterns outside of the target cluster

Page 57: Rapid Training of Information Extraction with Local and Global Data Views

Experiments

• Pattern clusters:
  – Computed from a corpus of 1.3 billion tokens

• Evaluation data:
  – ACE 2004 training data (no relation annotation between each pair of names)
  – We take advantage of entity co-reference information to automatically re-annotate the relations
  – Annotation was reviewed by hand

• Evaluation method:
  – direct evaluation
  – strict pattern match

Page 58: Rapid Training of Information Extraction with Local and Global Data Views

Experiments

Red: guided bootstrapping; Blue: unguided bootstrapping

drift: the percentage of false positives belonging to ACE relations other than the target relation


Page 61: Rapid Training of Information Extraction with Local and Global Data Views

Experiments

• Guided bootstrapping terminates when the precision is still high while maintaining a reasonable recall

• It also effectively prevents semantic drift

Page 62: Rapid Training of Information Extraction with Local and Global Data Views

Part IV
Cross-Domain Bootstrapping for Named Entity Recognition

Source Domain NER → (semi-supervised learning) → Target Domain NER

Page 63: Rapid Training of Information Extraction with Local and Global Data Views

NER Model: Maximum Entropy Markov Model (McCallum et al., 2000)

Split a name type into two classes:
  B_PER (beginning of PERSON)
  I_PER (continuation of PERSON)

Goal: estimate P(S_1, ..., S_n | T_1, ..., T_n)

MEMM factorization:

  P(S_1, ..., S_n | T_1, ..., T_n) = ∏_{i=1}^{n} P(S_i | S_{i-1}, T_i)

Each local distribution P(S_i | S_{i-1}, T_i) is a Maximum Entropy classifier.

Tokens:  U.S.   Defense  Secretary  Donald  H.     Rumsfeld
         T1     T2       T3         T4      T5     T6
Classes: B_GPE  B_ORG    O          B_PER   I_PER  I_PER
         S1     S2       S3         S4      S5     S6

Decoding with the Viterbi algorithm:

  δ_t(j) = max_{1≤i≤N} δ_{t-1}(i) · P(s_j | s_i, o_t)
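The Viterbi recurrence above can be sketched in a few lines, with the MEMM's local distributions supplied as a function `prob(state, prev_state, obs)` (the toy model below is illustrative, not the trained classifier):

```python
def viterbi(observations, states, prob, init_state="START"):
    """Most likely state sequence under delta_t(j) = max_i delta_{t-1}(i) * P(s_j | s_i, o_t)."""
    # initialize from a dummy start state
    delta = {s: prob(s, init_state, observations[0]) for s in states}
    backpointers = []
    for o in observations[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            # best previous state for landing in s at this step
            best_prev = max(states, key=lambda sp: delta[sp] * prob(s, sp, o))
            new_delta[s] = delta[best_prev] * prob(s, best_prev, o)
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    # trace back from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

This explores the N^L paths implicitly in O(L·N²) time, which is the dynamic-programming point made on the next slide.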

Page 64: Rapid Training of Information Extraction with Local and Global Data Views

NER Model

Estimate the name class of each individual token ti

Extract a feature vector from the local context window (ti-2, ti-1, ti, ti+1, ti+2)

Learn feature weights using Maximum Entropy model

U.S. Defense Secretary Donald H. Rumsfeld

B_GPE B_ORG O B_PER I_PER I_PER

Feature | Value
currentToken | Donald
wordType_currentToken | initial_capitalized
previousToken_-1 | Secretary
previousToken_-1_class | O
previousToken_-2 | Defense
nextToken_+1 | H.

… …

Page 65: Rapid Training of Information Extraction with Local and Global Data Views

NER Model

Estimate the name classes of the token sequence
  – Search for the most likely path (argmax)
  – Use dynamic programming (N^L possible paths)

N := number of name classes
L := length of the token sequence

U.S. Defense Secretary Donald H. Rumsfeld
B_GPE B_ORG O B_PER I_PER I_PER

Page 66: Rapid Training of Information Extraction with Local and Global Data Views

Domain Adaptation Problems

Source domain (news articles): George Bush, Donald H. Rumsfeld, …, Department of Defense

Target domain (reports on terrorism): Abdul Sattar al-Rishawi, Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud, …, Al-Qaeda in Iraq

Q (target domain): What is the weight of the feature currentToken=Abdul?
A (source domain): Sorry, I don't know. I've never seen this guy in my training data.

Page 67: Rapid Training of Information Extraction with Local and Global Data Views

Domain Adaptation Problems

Source domain (news articles): George Bush, Donald H. Rumsfeld, …, Department of Defense

Target domain (reports on terrorism): Abdul Sattar al-Rishawi, Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud, …, Al-Qaeda in Iraq

1. Many words are out-of-vocabulary
2. Naming conventions are different:
   a. Length: short vs. long
   b. Capitalization: weaker in target
3. Name variation occurs often in target: Shaikh, Shaykh, Sheikh, Sheik, …

We want to automatically adapt the source-domain tagger to the target domain without annotating target-domain data.

Page 68: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Feature Generalization

Q (target domain): What is the weight of the feature currentToken=Abdul?
A (source domain): Sorry, I don't know. I've never seen this guy in my training data.

Bit string | Examples
110100011 | John, James, Mike, Steven
11010011101 | Abdul, Mustafa, Abi, Abdel
11010011111 | Shaikh, Shaykh, Sheikh, Sheik
111111110 | Qaeda, Qaida, qaeda, QAEDA
00011110000 | FBI, FDA, NYPD
000111100100 | Taliban

Global Data View Comes to the Rescue!

Build a word hierarchy from a 10M-word corpus (source + target), using the Brown word clustering algorithm.

Page 69: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Feature Generalization

• Add an additional layer of features that include word clusters
  • currentToken = John
  • currentPrefix3 = 100
  • currentPrefix3 = 100 fires also for target words!

• To avoid commitment to a single cluster: cut the word hierarchy at different levels
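The prefix features can be sketched as follows; the prefix lengths are illustrative, and the bit strings come from the cluster table above:

```python
def cluster_features(word, bitstrings, prefix_lengths=(4, 6, 10)):
    """Lexical feature plus Brown-cluster prefix features for one token.

    bitstrings: word -> Brown cluster bit string; prefix features are
    emitted at several cut levels to avoid committing to one cluster.
    """
    features = [f"currentToken={word}"]
    bits = bitstrings.get(word)
    if bits:
        for n in prefix_lengths:
            features.append(f"currentPrefix{n}={bits[:n]}")
    return features
```

Because "John" (110100011) and "Abdul" (11010011101) share a bit-string prefix, a weight learned on source-domain sentences containing "John" now fires for the unseen target-domain word "Abdul", which is exactly the generalization the Q/A exchange above asks for.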

Page 70: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Feature Generalization

Performance on the target domain:

Model | P | R | F1
Source_Model | 70.02 | 61.86 | 65.69
Source_Model + Word Clusters | 72.82 | 66.61 | 69.58

Page 71: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Instance Selection

Cross-domain Bootstrapping Algorithm:
1. Train a tagger from labeled source data
2. Tag all unlabeled target data with the current tagger
3. Select good tagged words and add these to the labeled data
4. Re-train the tagger

[Diagram: labeled source data → trained tagger (with feature generalization) → unlabeled target data → instance selection with multiple criteria → promoted instances such as "President Assad"]

Page 72: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Instance Selection

• Multiple criteria
  – Criterion 1: Novelty - prefer target-specific instances
    • Promote Abdul instead of John
  – Criterion 2: Confidence - prefer confidently labeled instances

Page 73: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Instance Selection

Criterion 2: Confidence - prefer confidently labeled instances

Local confidence, based on local features:

  LocalConf(I) = 1 + Σ_{c_i} p(c_i | v) log p(c_i | v)

  1) maximum when one name class is predicted with probability 1, e.g., p(c_i | v) = 1
  2) minimum when the predictions are evenly distributed over all the name classes
  3) The higher the value, the more confident the instance is

I := instance
v := feature vector for I
c_i := name class i
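Both the local and the global confidence above share the same "1 minus entropy" shape, so one helper covers both. The slide does not state the log base; base = number of classes is assumed here so that the score lands in [0, 1]:

```python
import math

def entropy_confidence(probs):
    """1 minus normalized entropy of a class distribution.

    1.0 when one class has probability 1; 0.0 when the distribution is
    uniform over all classes (log base = number of classes, an assumption).
    """
    k = len(probs)
    if k <= 1:
        return 1.0
    entropy = -sum(p * math.log(p, k) for p in probs if p > 0)
    return 1.0 - entropy
```

For LocalConf the probabilities are the classifier's p(c_i | v); for GlobalConf they are corpus-level class fractions such as P(Abdul is a PER) = 0.9 from the table on the next slide.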

Page 74: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Instance Selection

Criterion 2: Confidence Global confidence: based on corpus statistics

1. Prime Minister Abdul Karim Kabariti  PER
2. warlord General Abdul Rashid Dostum  PER
3. President A.P.J. Abdul Kalam will  PER
4. President A.P.J. Abdul Kalam has  PER
5. Abdullah bin Abdul Aziz ,  PER
6. at King Abdul Aziz University  ORG
7. Nawab Mohammed Abdul Ali ,  PER
8. Dr Ali Abdul Aziz Al  PER
9. Nayef bin Abdul Aziz said  PER
10. leader General Abdul Rashid Dostum  PER

P(Abdul is a PER) = 0.9

Page 75: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Instance Selection

Criterion 2: Confidence – Global confidence

Combined confidence: product of local and global confidence

GlobalConf(I) = 1 + Σ_{ci} p(ci) log p(ci)

where p(ci) is the corpus-level probability that the instance belongs to name class ci, e.g., p(Abdul is a PER) = 0.9.

The higher the value, the more confident the instance is.
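A sketch of the global and combined confidence: estimate p(ci) for a word from how often the current tagger assigns it each name class across the corpus, apply the same entropy-based score as the local confidence, and multiply the two. The counts below mirror the "Abdul" example (9 PER vs. 1 ORG); the log base is an assumption.

```python
import math
from collections import Counter

def global_conf(assigned_tags):
    """assigned_tags: name classes the tagger gave this word across the corpus."""
    counts = Counter(assigned_tags)
    total = sum(counts.values())
    base = max(len(counts), 2)   # need a log base > 1 (assumption)
    return 1.0 + sum((c / total) * math.log(c / total, base)
                     for c in counts.values())

g = global_conf(["PER"] * 9 + ["ORG"])
print(round(g, 3))               # -> 0.531  (most of the mass on PER)

# Combined confidence: the product of local and global confidence.
local = 0.8                      # e.g., an instance with local confidence 0.8
print(round(local * g, 3))       # -> 0.425
```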

Page 76: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Instance Selection

Criterion 3: Density - prefer representative instances which can be seen as centroid instances

Density(i) = (1 / (N - 1)) × Σ_{j=1, j≠i}^{N} Sim(i, j)

Density(i) := average similarity between i and all other instances j
Sim(i, j) := Jaccard similarity between the feature vectors of the two instances
N := the total number of instances in the corpus

Page 77: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Instance Selection

Criterion 4: Diversity - prefer a set of diverse instances instead of similar instances

Example: “, said * in his”
• Highly confident instance
• High density, representative instance
• BUT continuing to promote such an instance would not gain additional benefit

diff(i, j) = Density(i) - Density(j)

diff(i, j) := the density difference between instances i and j. Using a small threshold for diff(i, j), dense instances still have a higher chance of being selected while a certain degree of diversity is achieved at the same time.
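A sketch of the diversity criterion: walk the candidates in descending density order, accept the densest one first, and accept a later candidate only if its density differs from every accepted instance's by at least a small threshold (the threshold value here is an assumption).

```python
# Diversity selection over density-ranked candidates. A small threshold
# keeps dense instances likely to be selected while still enforcing some
# diversity among the selections.

def select_diverse(ranked_candidates, densities, threshold=0.05):
    """ranked_candidates: ids in descending density order; densities: id -> density."""
    selected = []
    for c in ranked_candidates:
        if all(abs(densities[c] - densities[s]) >= threshold for s in selected):
            selected.append(c)
    return selected

densities = {"a": 0.90, "b": 0.88, "c": 0.70}
print(select_diverse(["a", "b", "c"], densities))   # -> ['a', 'c'] ("b" is too close to "a")
```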

Page 78: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Instance Selection

Putting all criteria together:
1. Novelty: filter out source-dependent instances
2. Confidence: rank instances by confidence; the top-ranked instances form a candidate set
3. Density: rank instances in the candidate set in descending order of density
4. Diversity: accept the first instance (with the highest density) in the candidate set, then select further candidates based on the diff measure
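The four criteria can be chained into one selection pass, sketched below; the predicate and scoring functions are simplified stand-ins for the components defined on the previous slides, and the threshold and candidate-set size are assumptions.

```python
# One selection pass combining novelty, confidence, density, and diversity.

def select(instances, is_source_like, confidence, density,
           diff_threshold=0.05, top_k=100):
    # 1. Novelty: filter out source-dependent instances.
    pool = [i for i in instances if not is_source_like(i)]
    # 2. Confidence: the top-ranked instances form the candidate set.
    pool = sorted(pool, key=confidence, reverse=True)[:top_k]
    # 3. Density: order the candidate set by descending density.
    pool = sorted(pool, key=density, reverse=True)
    # 4. Diversity: accept the densest, then only sufficiently different ones.
    selected = []
    for c in pool:
        if all(abs(density(c) - density(s)) >= diff_threshold for s in selected):
            selected.append(c)
    return selected
```

For example, with a novelty predicate that rejects names already frequent in the source domain, "Abdul" would be promoted while "John" is filtered out.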

Page 79: Rapid Training of Information Extraction with Local and Global Data Views

The Benefits of Incorporating Global Data View -- Instance Selection

Results

Compared configurations (+/- := with/without):
• + Novelty + CombinedConf + Diversity
• + Novelty + CombinedConf + Density
• + Novelty + CombinedConf
• + Novelty + LocalConf
• Generalized seed model (SourceModel + WordCluster)
• - Novelty + LocalConf

[Figure: F1 (68–74) vs. bootstrapping iteration (0–35) for each configuration]

Page 80: Rapid Training of Information Extraction with Local and Global Data Views

Part V

Conclusion

Page 81: Rapid Training of Information Extraction with Local and Global Data Views

Contribution
• The main contribution is the use of both local and global evidence for fast system development

• The co-testing procedure reduced annotation cost by 97%

• The use of pattern clusters as the global view in bootstrapping
– not only greatly improved the quality of learned patterns
– but also contributed to a natural stopping criterion

• Feature generalization and instance selection in the cross-domain bootstrapping were able to improve the source model's performance on the target domain by 7% F1 without annotating any target domain data

Page 82: Rapid Training of Information Extraction with Local and Global Data Views

Future Work

• Active Learning for Relation Type Extension
– conduct real-world active learning
– combine semi-supervised learning with active learning to further reduce annotation cost

• Semi-supervised Learning for Relation Type Extension
– better seed selection strategy

• Cross-domain Bootstrapping for Named Entity Recognition
– extract dictionary-based features to further generalize lexical features
– combine with distantly annotated data to further improve performance

Page 83: Rapid Training of Information Extraction with Local and Global Data Views

Thanks!

Page 84: Rapid Training of Information Extraction with Local and Global Data Views

?

Page 85: Rapid Training of Information Extraction with Local and Global Data Views

• Backup slides

Page 86: Rapid Training of Information Extraction with Local and Global Data Views

Experimental Setup for Active Learning

• ACE 2004 data
– 4.4K relation instances
– 45K non-relation instances

• 5-fold cross validation
– Roughly 36K unlabeled instances (45K ÷ 5 × 4)
– Random initialization (repeated 10 times)
– 50 runs in total
– Each iteration selects 5 instances for annotation
– 200 iterations are performed

Page 87: Rapid Training of Information Extraction with Local and Global Data Views

[Figure: F1 vs. # labeled instances (0–1000) for five relation types: EMP-ORG, ART, OTHER-AFF, PHYS, and GPE-AFF. Curves: LGCo-Testing, SPCo-Testing, UncertaintyAL, Supervised, RandomAL, UncertaintyAL+]