

The LODIE team at TAC-KBP2015

LODIE Team Participation Summary

Method Details

The LODIE¹ team's participation in the TAC2015 Entity Discovery task of the Cold Start KBP track

The Task: Entity Discovery of Cold Start KBP
•  Cold Start KBP aims to build a KB from scratch using a given corpus and a predefined schema for the entities and relations that will compose the KB
•  Entity Discovery (ED, new in 2015):
   •  create a KB node for each person (PER), organization (ORG) and geo-political entity (GPE) mention in the document collection
   •  cluster all KB nodes that refer to the same entity

Challenge
•  Scale-up: millions of name mentions are extracted and clustered
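The ED output described above can be sketched as a minimal data structure: one KB node per name mention, grouped into entity clusters. The class and field names below are illustrative assumptions for this sketch, not the official TAC KBP schema.

```python
from dataclasses import dataclass
from collections import defaultdict

# Illustrative KB node: one per PER/ORG/GPE name mention in the collection.
# Field names and the string entity_id label are assumptions for this sketch.
@dataclass(frozen=True)
class KBNode:
    mention: str    # surface form of the name mention
    etype: str      # "PER", "ORG" or "GPE"
    doc_id: str     # source document
    entity_id: str  # nodes sharing this label refer to the same entity

def to_entity_clusters(nodes):
    """Group KB nodes by their assigned entity id."""
    clusters = defaultdict(list)
    for node in nodes:
        clusters[node.entity_id].append(node)
    return dict(clusters)

nodes = [
    KBNode("Barack Obama", "PER", "doc1", "E1"),
    KBNode("Obama", "PER", "doc2", "E1"),
    KBNode("London", "GPE", "doc1", "E2"),
]
clusters = to_entity_clusters(nodes)
```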

Ziqi Zhang, Jie Gao, and Anna Lisa Gentile

1. LODIE stands for the Linked Open Data for Information Extraction project team. Contact: Ziqi Zhang, [email protected]

Method Overview – a cross-document coreference approach
•  State-of-the-art Named Entity Recognition (NER)
•  Clustering within each type of Named Entities (NEs):
   •  a non-deterministic string similarity clustering process to split data into macro-clusters
   •  agglomerative clustering within each macro-cluster (that contains NEs from different documents)

Performance Overview
•  63.2 CEAF mention F-measure (ranked #3) on the 2015 Cold Start KBP evaluation dataset

Evaluation

1. NER

Module                     NE Types       Text type
Stanford NER (standard)    PER, ORG, GPE  All
Stanford NER (re-trained)  PER, ORG, GPE  Colloquial
Gazetteer                  GPE            All
Ad-hoc rules               PER            Colloquial

Merge by heuristics
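The merging step is only named on the poster; below is a plausible sketch, assuming a simple longest-span-wins heuristic for overlapping mentions from different modules (the actual heuristics are not specified).

```python
def merge_mentions(candidates):
    """Merge NE mentions proposed by several NER modules into one list.
    Each candidate is (start, end, etype) with end exclusive; overlapping
    spans are resolved by keeping the longer one (an assumed heuristic)."""
    # prefer longer spans; break ties by earlier start position
    ordered = sorted(set(candidates), key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    for start, end, etype in ordered:
        if not any(start < e and s < end for s, e, _ in kept):
            kept.append((start, end, etype))
    return sorted(kept)
```

For example, if one module proposes (0, 12, "PER") and another the shorter overlapping (0, 5, "PER"), only the longer span survives.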

2.1. String similarity clustering
•  string similarity between entity names
•  non-deterministic
•  used to split data into smaller macro-clusters
•  focuses on conflation of entity names
•  can over-cluster, e.g., ‘David Miliband’ & ‘Ed Miliband’ = 0.8
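A minimal sketch of this macro-clustering step. The poster does not name the similarity metric, so difflib's ratio is used here as a stand-in (it happens to score the Miliband pair at 0.8 as well); the single-pass scheme makes the outcome depend on input order, which is one sense in which the process is non-deterministic.

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    """String similarity between two entity names; difflib's ratio is an
    illustrative stand-in for the poster's unnamed metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def macro_cluster(names, threshold=0.7):
    """Single-pass split into macro-clusters: each name joins the first
    cluster whose representative (first member) is similar enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if name_sim(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Over-clustering example from the poster: the two names score 0.8 under
# this metric, so at threshold 0.7 they share a macro-cluster.
sim = name_sim('David Miliband', 'Ed Miliband')
groups = macro_cluster(['David Miliband', 'Ed Miliband', 'Barack Obama'], 0.7)
```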

2.2. Agglomerative clustering
Applied to each macro-cluster that contains NEs from different documents (hypothesizing ‘one-sense-per-name’ within an individual document).

b. Clustering
•  standard group-average agglomerative clustering (Murtagh, 1985) with L1 distance
•  determine a natural cluster number in the data:
   o  silhouette coefficient to evaluate clustering quality
   o  a non-greedy iterative algorithm that searches for a local optimum as an approximation
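The clustering step (b) can be sketched as below: a minimal pure-Python group-average agglomerative clustering under L1 distance that picks the cluster number by the best silhouette score over the merge history. The poster's non-greedy iterative search is only named, so this exhaustive scan is a stand-in approximation, and the toy feature vectors are hypothetical.

```python
from itertools import combinations

def l1(p, q):
    # Manhattan (L1) distance between two feature vectors
    return sum(abs(a - b) for a, b in zip(p, q))

def group_average(points, c1, c2):
    # group-average linkage: mean pairwise L1 distance between two clusters
    dists = [l1(points[i], points[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)

def silhouette(points, clusters):
    # mean silhouette coefficient over all points (needs >= 2 clusters)
    scores = []
    for ci, cluster in enumerate(clusters):
        for i in cluster:
            same = [l1(points[i], points[j]) for j in cluster if j != i]
            if not same:
                scores.append(0.0)  # singleton clusters score 0 (Rousseeuw)
                continue
            a = sum(same) / len(same)
            b = min(sum(l1(points[i], points[j]) for j in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

def agglomerative_l1(points):
    # merge greedily, keeping the partition with the best silhouette score
    clusters = [[i] for i in range(len(points))]
    best = (-2.0, None, None)
    while len(clusters) > 1:
        if len(clusters) < len(points):  # skip the all-singleton partition
            score = silhouette(points, clusters)
            if score > best[0]:
                best = (score, len(clusters), [list(c) for c in clusters])
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: group_average(points, clusters[p[0]], clusters[p[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return best[1], best[2]

# Two well-separated toy groups: the natural cluster number found is 2.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
k, partition = agglomerative_l1(points)
```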

a. Featurization
•  Contextual tokens: previous and following n tokens
•  Contextual NEs: previous and following n NEs
•  Surface tokens (‘Mr Blair’ => ‘mr’, ‘blair’)
•  Word embedding based:
   o  train word & phrase embeddings using Mikolov et al. (2013)
   o  compute OOV vectors based on additive compositionality, i.e., vec(London Tower) = vec(London) + vec(Tower)

Feature combination by weighting was also experimented with.
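The additive OOV composition above, as a toy sketch; the embedding values are made up, standing in for trained word and phrase embeddings (Mikolov et al., 2013).

```python
# Toy vectors standing in for trained word/phrase embeddings; the numbers
# are illustrative, not real embedding weights.
embeddings = {
    'london': [1.0, 2.0, -1.0],
    'tower':  [0.5, -0.5, 2.0],
}

def phrase_vector(phrase, embeddings):
    """OOV phrase vector via additive compositionality:
    vec(London Tower) = vec(London) + vec(Tower).
    Component words missing from the vocabulary are skipped."""
    vecs = [embeddings[w] for w in phrase.lower().split() if w in embeddings]
    if not vecs:
        return None
    return [sum(dims) for dims in zip(*vecs)]

vec = phrase_vector('London Tower', embeddings)
```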

Training the Stanford NER for colloquial text:
-  the training dataset of the TAC2014 English Entity Discovery and Linking (EDL) task
Training the word embeddings:
-  comprehensive English source corpora, 2013–14

Computing resources: single-thread NER; agglomerative clustering parallelized on 16 cores, max. 64GB memory
Evaluation measures: standard Precision, Recall, F1 for NER; mention CEAF for clustering

Settings: string similarity thresholds of 0.7 (ss0.7) and 0.9 (ss0.9), combined with:
•  previous and following 5 tokens (tok5)
•  previous and following 3 entity mentions (ne3)
•  surface tokens (sf)
•  word embedding based (dvec)

CEAF (P, R, F) on TAC2014 EDL evaluation dataset

Ceiling CEAF (P, R, F) on TAC2014 EDL evaluation dataset

Final results (CEAF) on TAC 2015 evaluation dataset

ss0.9+sf+dvec; ss0.9 only; ss0.9+dvec; ss0.7+sf+dvec; ss0.7+dvec

NER on TAC2014 EDL datasets

1. NER results

2. Clustering results (CEAF mention)

3. Clustering results, using NER ground truth

4. Final results on TAC2015