1 discriminative training of clustering functions theory and experiments with entity identification...
TRANSCRIPT
1
Discriminative Training ofClustering Functions
Theory and Experiments with EntityIdentification
Xin Li & Dan Roth
University of Illinois, Urbana-Champaign
Page 2
Outline
Clustering Current approaches Some problems
Making Clustering a Learning problem
Supervised Discriminative Clustering Framework
The Reference Problem: Entity Identification
within & across documents.
Page 3
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding ``no persuasive evidence'' to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was ``probably'' assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission 's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I do not speak for my church on public matters, and the church does not speak for me.'‘
Document 3: David Kennedy was born in Leicester, England in 1959. …Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
Kennedy
The Reference Problem
Page 4
Entity Identification in Text
Goal: Given names, within or across documents, identify real-world entities behind them.
Problem Definition: Given a set of names and their semantic types, [people], [locations] [Organizations] partition them into groups that refer to different entities.
Approaches: A generative Model [Li, Morie, Roth, NAACL’04] A discriminative approach [Li, Morie, Roth, AAAI’04]Other works [on citation, and
more: Milche et. al; Bilenko et. al.,…]
6
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding ``no persuasive evidence'' to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was ``probably'' assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedyconfronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I do not speak for my church on public matters, and the church does not speak for me.'‘
Document 3: David Kennedy was born in Leicester, England in 1959. …Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
Kennedy
Intuitively, a discriminative approach, requires using some similarity measure between names, followed up by clustering into clusters that represent entities.
Page 5
Clustering
An optimization procedure that takes
A collection of data elements A distance (similarity) measure on the space of
data elements A Partition Algorithm
Attempts to:
Optimize some quality with respect to the given distance metric.
Page 6
Clustering: Example (k=4)
Page 7
An Optimization Problem:
Data: X = {x1,x2,…} Cluster Names: C = {1,2,3,…,K} The Euclidean Distance: d(x1,x2) = [ (x1-x2)T(x1-x2)]1/2
Find a mapping f: X C That minimizes:
j x 2 Cj
d(x,j )2
Where j = 1/m x 2 Cj x mean of elements in the k-th
cluster
Example: K-means Clustering
Page 8
Many NLP Applications
Class-based language models: group similar words together based on their semantics
(Dagan et. al 99, Lee et. al ; Pantel and Lin, 2002).
Document categorization; and topic identification (Karypis, Han 99,02).
Co-reference resolution – build coreference chain of noun phrases (Cardie,
Wagstaff 99).
In all cases – fixed metric distance; tuned for the application and the data. (and the algorithm?)
Page 9
Clustering: Metric and Algorithm
d1(x,x’) = [(f1 - f1’) 2+(f2 - f2’) 2]1/2 d2(x,x’) = |(f1+ f2)-(f1’+f2’)|
(a) Single-Linkage with Euclidean
(b) K-Means with Euclidean
(c) K-Means with a Linear Metric
There is no ‘universal’ distance metric that is appropriate for all clustering algorithms
How do we make sure we have an appropriate one, that reflects the task/designer intentions?
Page 10
Additional information in Clustering
Page 11
Traditional Clustering Framework
Typically, unsupervised; no learning. More recently: work on metric learning with
supervision: [Bilenko&Mooney 03, 04, Xing et. al.’03, Schultz & Joachims’03,
Bach & Jordan03] Learning a metric; then cluster Learning while clustering (algorithm specific)
A partition function
h(S) = Ad(S)
unlabeled data set S
partition h(S)
distance metric d
clustering algorithm A+
K-meansX = {x1,x2,…}, C = {c1,c2,
…,ck}
Euclidean Distance:d(x, x’) = [(x- x’)T(x- x’)]1/2
Page 12
labeled data set S
supervisedlearner
Training Stage:
Goal: h*=argmin errS(h,p)
distance metric d
clustering algorithm A+
unlabeled data set S’
partition h(S’)
Application Stage: h(S’ )
A partition function
h(S) = Ad(S)
Supervised Discriminative Clustering (SDC)
Incorporates supervision directly into metric training process; Training is driven by true clustering error Computed via the chosen data partition algorithm.
Page 13
Elements of SDC: Partition Function and Error
Goal: A partition function h maps a set of data points S to a partition h(S) of S. (outcome of a clustering
algorithm) [Note difference from multi-class classification]
The partition function h is a function of the parameterized
distance metric: d(x1,x2) = wi |xi1- xi
2| Error: Given a labeled data set S; p(S) = {(xi,ci)}1
m, the correct partition, and
a fixed clustering algorithm A, the training process attempts to find d*, minimizing the
clustering error:
d*= argmind errS(h,p), where h(S)=Ad(S).Optimal (given) PartitionLearned Partition
Page 14
A Supervised Clustering Error
errS(h,p) = 1/|S|2 ij [d(xi,xj)*Aij +(D-d(xi,xj))*Bij]
(as opposed to a quality function that depends only on the distance)
Two types of errors in pairwise prediction:(xi,xj) h `together’ or ‘apart’
False negative: Aij = I [p(xi)=p(xj) & h(xi)h(xj)],
False positive: Bij = I [p(xi) p(xj) & h(xi)=h(xj)],
D = maxij d(xi,xj ) . (See paper for a comparison with other error functions)
Page 15
Training the distance function
Initialize the distance Metric d
Cluster S using algorithm h=Ad
Evaluate ErrS(h,p)
Update d
S
A gradient descent based algorithm
Page 16
Training the distance function
Gradient descent Alg.
Learns a metric Iteratively by adjusting the parameter vector by a small amount in the direction that would most reduce the error.
Page 17
Entity Identification in Text
Goal: Given names, within or across documents, identify real-world entities behind them.
Problem Definition: Given a set of names and their semantic types, [people], [locations] [Organizations] partition them into groups that refer to different entities.
Approaches: A generative Model [Li, Morie, Roth,
NAACL’04] A discriminative approach [Li, Morie, Roth, AAAI’04]
6
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding ``no persuasive evidence'' to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was ``probably'' assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedyconfronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I do not speak for my church on public matters, and the church does not speak for me.'‘
Document 3: David Kennedy was born in Leicester, England in 1959. …Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
Kennedy
Page 18
Feature Extraction: (John F. Kennedy, President Kennedy )= (1, 2 , …)
1. Fixed distance: distance (similarity) metric d for names. d (John F. Kennedy, President Kennedy) 0.6 d (Chicago Cubs, Cubs) 0.6 d (United States, USA) 0.7
2. A learned distance function parameterized as a Linear function over features (kernelized):
d(John F. Kennedy, President Kennedy ) = wi i Make it a pairwise classifier:
h (John F. Kennedy, President Kennedy ) = `together’ iff wi i <= 0.5
The distance function can be trained separately, to optimize partition quality, or via SDC, to minimize Error. (via gradient descent)
Parameterized Distance Metrics for Name Matching
John F. Kennedy President Kennedy?
Page 19
Features
Relational features that are extracted from a pair of strings, taking into account relative positions of tokens, substring relations, etc.
Page 20
Experimental Setting Names of people, locations and organizations.
John F. Kennedy, Bush, George W. Bush U.S.A, United States, and America University of Illinois, U. of I., IBM, International Business
Machines. 300 randomly picked New York Times news articles. 8,600 names annotated by a named entity tagger
and manually verified. Training sets contain names labeled with its global
entity.John F. Kennedy Kennedy1
President Kennedy Kennedy1,
David Kennedy Kennedy2.
Data is available from http://l2r.cs.uiuc.edu/~cogcomp/
Page 21
Gain from Metric Learning while Clustering
SoftTFIDF (Cohen et. al): Fixed metric LMR (Li, Morie, Roth, AAAI’04) learned metric via a pairwise classifier;
relational features extracted from pairs of strings; feedback from pairwise labels; SDC: trains a linear weighted distance metric for the single-link clustering
algorithm with labeled pairs of 600 names.
Page 22
Influence of Data Size
Page 23
Different Clustering Algorithms
Difference across clustering algorithm is not as significant as difference obtained from learning a good metric via SDC.
Page 24
Summary
A framework for Metric Learning for Clustering that is guided by global supervision with clustering as part of the feedback loop.
A parameterized distance metric is learned in a way that depends on the specific clustering algorithm used.
Significant improvement shown on the Reference Problem: Entity Identification Across documents.
Page 25
K=16
Intuition behind SDC
d
Page 26
Relational features: do not depend on specific tokens in the two names, but depend on some abstraction over tokens.
Honorific Equal: Mr., Mrs., President, Prof. Nickname: Thomas, Tom Edit Distance.
Relational Features
John DavisKennedy
2
1
1
2 3
John Kennedy
Toward Concept-Based Text Understanding and Mining
Page 27
How to employ transitivity between names ?
Toward Concept-Based Text Understanding and Mining
Michael Jordan
Michael Jordan,
Clustering: splitting a set of names. Distance Metrics: Edit distance, SoftTFIDF, Jora-Winkler
Clustering Algorithms: Single-Link, Complete-Link, K-means, graph cut.
Page 28
Outline
Clustering Current approaches Some problems
Making Clustering a Learning problem Supervised Discriminative Clustering Framework
The Reference Problem: Entity Identification in within & across document.
Page 29
Entity Identification in Text
Goal: Given names, within or across documents, identify real-world entities behind them.
Problem Definition: Given a set of names and their semantic types, [people], [locations] [Organizations] partition them into groups that refer to different entities.
Approaches: A generative Model [Li, Morie, Roth,
NAACL’04] A discriminative approach [Li, Morie, Roth, AAAI’04]
6
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding ``no persuasive evidence'' to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was ``probably'' assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedyconfronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I do not speak for my church on public matters, and the church does not speak for me.'‘
Document 3: David Kennedy was born in Leicester, England in 1959. …Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
Kennedy
Challenge: millions of entities in the world, but in training, we can only see names of a limited number of entities.