Collaborative Ranking: A Case Study on Entity Ranking (EMNLP 2011 reading group)

EMNLP 2011 Reading Group

Collaborative Ranking: A Case Study on Entity Ranking (D11-1071)

2011-12-23 Yoshihiko Suhara @sleepy_yoshi

Today's paper

• Collaborative Ranking: A Case Study on Entity Ranking

– by Zheng Chen, Heng Ji

One-slide summary

• TAC-KBP2010 Entity Linking task
  – Answer each query with an entity
  – The answer is selected by ranking the generated candidates

• Proposes Collaborative Ranking
  – (1) query-level collaboration
    • Micro collaborative ranking
  – (2) ranker-level collaboration
    • Macro collaborative ranking

Background

History of Named Entity Recognition

[McNamee+ 10]

Knowledge Base Population (KBP) Track

• TAC > KBP > Entity Linking

[Ji+ 10]

KB entry

[Chen+ 10]

Example query

[Figure: entity linking pipeline]

• Query: Michael Jordan, "England Youth International goalkeeper"
• Entity Linking System: candidate generation
  – Michael Jordan (mycologist)
  – Michael Jordan (footballer)
  – ...
• Answer

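[Figure: the same pipeline (candidate generation, then answer), with the part covered in this talk highlighted]

This talk covers this part.

INPUT: a query and its set of candidate answer entities
OUTPUT: the top-ranked entity, or NIL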

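Queries and entity candidates

• Query: $q = (q.\mathit{id}, q.\mathit{string}, q.\mathit{text})$

• KB entry candidates for query $q$: $o^{(q)} = \{o_1^{(q)}, \dots, o_{n^{(q)}}^{(q)}\}$

• Information in KB entry $o_i^{(q)}$
  – KB title
  – KB infobox
    • attribute-value pairs (e.g., per:alternate_names, per:date_of_birth, ...)
  – KB text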
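The query triple and the KB entry fields can be written down as plain data classes. This is only a minimal sketch: the class names `Query` and `KBEntry`, the query id, and the infobox value are illustrative assumptions, not taken from the paper or the TAC-KBP data.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Query:
    """A query q = (q.id, q.string, q.text)."""
    id: str       # query identifier
    string: str   # the name string to be linked
    text: str     # the document in which the name string appears


@dataclass
class KBEntry:
    """One candidate KB entry o_i^(q)."""
    title: str
    infobox: Dict[str, str] = field(default_factory=dict)  # attribute-value pairs
    text: str = ""


# Illustrative values only (hypothetical id and infobox content).
q = Query(id="q-example", string="Michael Jordan",
          text="... England Youth International goalkeeper ...")
candidates = [
    KBEntry(title="Michael Jordan (mycologist)"),
    KBEntry(title="Michael Jordan (footballer)",
            infobox={"per:alternate_names": "(illustrative value)"}),
]
```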

Introduction

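Ranking and its applications

• Many NLP problems can be formulated as ranking problems
  – Parsing
    • ranking parse trees
  – Machine translation
    • ranking translation candidates
  – Anaphora resolution
  – etc.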

Limitations of existing methods

• No single learning method works well on all data

⇒ Let's build a collaborative model! (= collaborative ranking)

cf. collaborative filtering

• Unrelated

An easy-to-understand diagram?!

I don't get it!

Key points of Collaborative Ranking

• (1) Improve accuracy by artificially increasing the number of queries
  – query-level collaboration

• (2) Improve accuracy by effectively combining multiple rankers
  – ranker-level collaboration

• (3) A combination of (1) and (2)

Collaborative Ranking

• Three proposed methods
  – (1) Micro Collaborative Ranking (MiCR)
  – (2) Macro Collaborative Ranking (MaCR)
  – (3) Micro-Macro Collaborative Ranking (MiMaCR)

(1) Micro Collaborative Ranking (MiCR)

Micro Collaborative Ranking

• (1) Select k collaborators for query q
  – the selection criterion is described later

• (2) Rank with the collaborators taken into account

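How to choose collaborators

• Solved as a clustering problem
  – Given query $q$, retrieve up to 300 documents containing q.string from the corpus
  – Apply a clustering algorithm
  – Select collaborators from the cluster that contains q.text

Hierarchical (agglomerative) clustering and spectral (graph) clustering are used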
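The three steps (retrieve, cluster, pick the cluster containing q.text) can be sketched with a toy agglomerative clusterer. Everything beyond the 300-document cap is an assumption made for illustration: whitespace tokenization, bag-of-words cosine similarity, single-link merging with a fixed threshold, and the function names. The spectral (graph) variant is not shown.

```python
import math
from collections import Counter

MAX_DOCS = 300  # retrieval cap stated on the slide


def bow(text):
    """Bag-of-words term counts (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_link_clusters(vectors, threshold):
    """Agglomerative clustering: merge any two clusters that contain a pair
    of members with similarity >= threshold, until no such pair is left."""
    clusters = [[i] for i in range(len(vectors))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any(cosine(vectors[i], vectors[j]) >= threshold
                       for i in clusters[a] for j in clusters[b]):
                    clusters[a].extend(clusters.pop(b))
                    merged = True
                    break
            if merged:
                break
    return clusters


def select_collaborators(q_string, q_text, corpus, k, threshold=0.5):
    """Return up to k documents from the cluster that contains q_text."""
    docs = [d for d in corpus if q_string in d and d != q_text][:MAX_DOCS]
    texts = [q_text] + docs                 # index 0 is the query document
    vectors = [bow(t) for t in texts]
    for cluster in single_link_clusters(vectors, threshold):
        if 0 in cluster:
            return [texts[i] for i in cluster if i != 0][:k]
    return []
```

The threshold controls how aggressively documents are grouped with the query document, so in practice it would need tuning on development data.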

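$x_j^{(q)} = \phi(q, o_j^{(q)})$, $x_j^{(cq_1)} = \phi(cq_1, o_j^{(cq_1)})$

[Slide annotation, partly lost in extraction: of two variants of the $x_j$ notation, the one involving $cq_1$ is marked × and the one involving $q$ is marked ○, with the note "probably".]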

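Recap: queries and entity candidates

• Query: $q = (q.\mathit{id}, q.\mathit{string}, q.\mathit{text})$

• KB entry candidates for query $q$: $o^{(q)} = \{o_1^{(q)}, \dots, o_{n^{(q)}}^{(q)}\}$

• Information in KB entry $o_i^{(q)}$
  – KB title
  – KB infobox
    • attribute-value pairs (e.g., per:alternate_names, per:date_of_birth, ...)
  – KB text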

How MiCR computes $g_1(\cdot)$

→ average (this one works well)
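A minimal sketch of micro collaborative scoring, assuming the combined score applies $g_1$ to the base ranker's scores on the query and on each of its k collaborators. The toy feature function `phi` (token overlap), the identity ranker `f`, and the function names are assumptions made for illustration, not the paper's features or rankers.

```python
from statistics import mean

# Candidate choices for g1: average, max, min.
G1 = {"average": mean, "max": max, "min": min}


def micr_score(f, phi, query_text, collaborator_texts, candidate, g1="average"):
    """Combine the base ranker's scores over the query and its k collaborators."""
    contexts = [query_text, *collaborator_texts]           # k + 1 contexts
    scores = [f(phi(ctx, candidate)) for ctx in contexts]
    return G1[g1](scores)


def micr_rank(f, phi, query_text, collaborator_texts, candidates, g1="average"):
    """Return candidates sorted by collaborative score, best first."""
    return sorted(
        candidates,
        key=lambda o: micr_score(f, phi, query_text, collaborator_texts, o, g1),
        reverse=True,
    )


# Toy stand-ins (assumptions): phi counts shared tokens, f is the identity.
def phi(context, candidate):
    return len(set(context.split()) & set(candidate.split()))


def f(x):
    return float(x)


query_text = "jordan goalkeeper"
collaborator_texts = ["jordan goalkeeper england youth", "jordan football club"]
candidates = ["jordan mycologist fungi", "jordan footballer goalkeeper england"]
ranking = micr_rank(f, phi, query_text, collaborator_texts, candidates)
```

Here the collaborators add context ("england") that the short query lacks, which is the intuition behind query-level collaboration.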

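(2) Macro Collaborative Ranking (MaCR)

Macro Collaborative Ranking

• Prepare multiple rankers $F^* = \{f_1, \dots, f_m\}$ and compute the score with a function $g_2(\cdot)$ that combines their outputs

How MaCR computes $g_2(\cdot)$

[Slide annotation: "this one works well"; the option it points to is in the original figure]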
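Two simple ways of combining m rankers, voting and averaging, can be sketched as follows. The exact definitions are assumptions: here each ranker votes for its own top candidate, and averaging presumes the rankers' scores are on a comparable scale. The function names and toy scores are illustrative.

```python
from collections import Counter
from statistics import mean


def macr_vote(rankers, candidates):
    """Each ranker votes for its top-scoring candidate; the most-voted wins."""
    votes = Counter(max(candidates, key=f) for f in rankers)
    return votes.most_common(1)[0][0]


def macr_average(rankers, candidates):
    """Pick the candidate with the highest mean score across rankers."""
    return max(candidates, key=lambda o: mean(f(o) for f in rankers))


# Toy rankers given as score tables (illustrative numbers only).
f1 = {"A": 0.51, "B": 0.49}.get
f2 = {"A": 0.51, "B": 0.49}.get
f3 = {"A": 0.00, "B": 1.00}.get
rankers = [f1, f2, f3]
candidates = ["A", "B"]

voted = macr_vote(rankers, candidates)        # two of three rankers prefer A
averaged = macr_average(rankers, candidates)  # one confident ranker pulls B ahead
```

The toy numbers are chosen so that the two combiners disagree, which shows why the choice of $g_2$ matters.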

(3) Micro-Macro Collaborative Ranking (MiMaCR)

Micro-Macro Collaborative Ranking

• MiCR + MaCR
  – voting over the m rankers
  – averaging over the k+1 queries (the query and its k collaborators)
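The combination can be sketched by nesting the two steps: each of the m rankers first averages its score over the k+1 contexts, and the rankers then vote. This is a sketch under assumptions: the order of the two steps, the voting rule (one vote per ranker for its top candidate), the toy `phi`, and the toy rankers are all illustrative.

```python
from collections import Counter
from statistics import mean


def mimacr_answer(rankers, phi, query_text, collaborator_texts, candidates):
    """Micro step: average over k + 1 contexts. Macro step: vote over m rankers."""
    contexts = [query_text, *collaborator_texts]
    votes = Counter()
    for f in rankers:
        def micro(o, f=f):
            # Average of this ranker's scores over the query and collaborators.
            return mean(f(phi(ctx, o)) for ctx in contexts)
        votes[max(candidates, key=micro)] += 1
    return votes.most_common(1)[0][0]


# Toy stand-ins (assumptions): token-overlap features and three simple rankers.
def phi(context, candidate):
    return len(set(context.split()) & set(candidate.split()))


rankers = [
    lambda x: float(x),          # reasonable ranker
    lambda x: 2.0 * x + 1.0,     # same ordering, different scale
    lambda x: -float(x),         # deliberately bad ranker
]
query_text = "jordan goalkeeper"
collaborator_texts = ["jordan goalkeeper england youth", "jordan football club"]
candidates = ["jordan mycologist fungi", "jordan footballer goalkeeper england"]
answer = mimacr_answer(rankers, phi, query_text, collaborator_texts, candidates)
```

Because voting only looks at each ranker's top choice, the differing score scales of the rankers do not matter in the macro step.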

Experiments

Dataset

• TAC-KBP2009 dataset
  – 75% training data, 25% development data

• TAC-KBP2010 dataset
  – test data

• Reference KB
  – Oct. 2008 dump of English Wikipedia
  – 818,741 entries

• Source text corpus
  – mostly Newswire and Web Text
  – 1,777,888 documents in 5 genres

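Baseline Rankers (1/2)

• Unsupervised
  – Naive ($f_1$)
    • returns NIL for every query
  – Entity ($f_2$)
    • weighted similarity of the named entities extracted from q.text and KB text
  – TFIDF ($f_3$)
    • cosine similarity between q.text and KB text with TF-IDF weighting
  – Profile ($f_4$)
    • profile similarity between q.text and KB text [Chen+ 10]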
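As an illustration of the TFIDF ranker idea, the sketch below scores each candidate's KB text by TF-IDF-weighted cosine similarity with q.text. The tokenization, the smoothed IDF formula, and computing document frequencies over just the query and its candidates are assumptions; the slides do not specify these details.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build TF-IDF vectors for a small collection of texts."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption; other variants are equally plausible).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def tfidf_rank(q_text, kb_texts):
    """Return (candidate indices sorted best first, raw scores)."""
    vecs = tfidf_vectors([q_text, *kb_texts])
    q_vec, kb_vecs = vecs[0], vecs[1:]
    scores = [cosine(q_vec, v) for v in kb_vecs]
    order = sorted(range(len(kb_texts)), key=scores.__getitem__, reverse=True)
    return order, scores


order, scores = tfidf_rank(
    "goalkeeper for the england youth team",
    ["english footballer who plays as a goalkeeper",
     "american mycologist who studies fungi"],
)
```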

Baseline Rankers (2/2)

• Supervised
  – Maxent ($f_5$)
    • Maximum entropy model (pointwise ranker)
  – SVM ($f_6$)
    • SVM (pointwise ranker)
  – RankSVM ($f_7$)
    • Ranking SVM (pairwise ranker)
  – ListNet ($f_8$)
    • ListNet (listwise ranker)

• Features
  – 1. surface features [Dredze+ 10][Zheng+ 10]
  – 2. document features [Dredze+ 10][Zheng+ 10]
  – 3. profiling features [Chen+ 11]
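The difference between pointwise, pairwise, and listwise rankers lies mainly in how training instances are built from one query's candidate list. The sketch below shows the three views for a single query; the function names and feature values are illustrative assumptions, and no actual learner is trained.

```python
def pointwise_examples(features, gold_index):
    """One (feature vector, label) pair per candidate: 1 for the gold entry."""
    return [(x, 1 if i == gold_index else 0) for i, x in enumerate(features)]


def pairwise_examples(features, gold_index):
    """One difference vector per (gold, other) pair, labelled as 'gold is better'."""
    gold = features[gold_index]
    return [
        ([g - o for g, o in zip(gold, x)], 1)
        for i, x in enumerate(features)
        if i != gold_index
    ]


def listwise_target(n_candidates, gold_index):
    """A target distribution over the whole candidate list (all mass on gold)."""
    return [1.0 if i == gold_index else 0.0 for i in range(n_candidates)]


# Illustrative feature vectors for three candidates; candidate 0 is correct.
features = [[1.0, 0.0], [0.25, 0.5], [0.0, 1.0]]
point = pointwise_examples(features, gold_index=0)
pairs = pairwise_examples(features, gold_index=0)
target = listwise_target(len(features), gold_index=0)
```

A pointwise learner such as Maxent or SVM would train on `point`, a pairwise learner such as Ranking SVM on `pairs`, and a listwise learner such as ListNet would compare its score distribution over the list with `target`.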

Evaluation

• Evaluated with micro-averaged accuracy
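Micro-averaging counts every query equally. The sketch below contrasts it with a macro-average taken per gold entity, assuming accuracy is the underlying measure. The function names and the toy gold/predicted answers are illustrative assumptions.

```python
from collections import defaultdict


def micro_accuracy(gold, predicted):
    """Fraction of queries answered correctly, pooled over all queries."""
    correct = sum(predicted.get(qid) == answer for qid, answer in gold.items())
    return correct / len(gold)


def macro_accuracy(gold, predicted):
    """Mean of per-entity accuracies (each gold entity counts equally)."""
    per_entity = defaultdict(list)
    for qid, answer in gold.items():
        per_entity[answer].append(predicted.get(qid) == answer)
    return sum(sum(hits) / len(hits) for hits in per_entity.values()) / len(per_entity)


# Toy data: three queries about entity E1 and one NIL query.
gold = {"q1": "E1", "q2": "E1", "q3": "E1", "q4": "NIL"}
predicted = {"q1": "E1", "q2": "E1", "q3": "E1", "q4": "E2"}

micro = micro_accuracy(gold, predicted)  # 3 of 4 queries correct
macro = macro_accuracy(gold, predicted)  # E1 fully correct, NIL fully wrong
```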

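Comparison of baseline rankers

• Supervised rankers are generally better

Evaluation of MiCR

• Experimental setup
  – ranker: TFIDF ($f_3$)
  – $g_1$: three variants (average, max, min)
  – collaborator search: two variants (graph, agglomerative)

[Figure: results for average, max, and min]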

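Evaluation of MaCR (1/2)

• Experimental setup
  – $g_2$: voting and average

Note: rankers are added in order of their performance on the development set

Evaluation of MaCR (2/2)

• MaCR applied to the top-10 KBP2009 entity linking systems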

Evaluation of MiMaCR (1/2)

• Experimental setup
  – micro-ranking ($g_1(\cdot)$)
    • graph clustering
    • 5 rankers (TFIDF, entity, Maxent, SVM, ListNet)
      – average for TFIDF, entity
      – supervised versions for Maxent, SVM, ListNet
  – macro-ranking ($g_2(\cdot)$)
    • voting

Evaluation of MiMaCR (2/2)

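Summary and impressions

• Proposes collaborative ranking to rank entity candidates accurately in the entity linking task
  – query-level collaboration [new!]
    • but it depends on the collaborator selection criterion and on how $g_1(\cdot)$ is computed
  – ranker-level collaboration

• Impression: the tuning seems strongly task-dependent
  – Would it be equally effective on other tasks?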

References

• [McNamee+ 10] P. McNamee, J. C. Mayfield and C. D. Piatko, “Processing Named Entities in Text”, Johns Hopkins APL Technical Digest, Vol.30(1), pp.31-40, 2011.

• [Chen+ 10] Z. Chen, S. Tamang, A. Lee, X. Li, W.-P. Lin, M. Snover, J. Artiles, M. Passantino and H. Ji, “CUNYBLENDER TAC-KBP2010 Entity Linking and Slot Filling System Description”, In Proc. TAC2010, 2010.

• [Ji+ 10] H. Ji, R. Grishman, H. T. Dang and K. Griffitt, “An Overview of the TAC2010 Knowledge Base Population Track”, In Proc. TAC2010, 2010.

• [Dredze+ 10] M. Dredze, P. McNamee, D. Rao, A. Gerber and T. Finin, “Entity Disambiguation for Knowledge Base Population”, In Proc. COLING2010, 2010.

• [Zheng+ 10] Z. Zheng, F. Li, M. Huang and X. Zhu, “Learning to Link Entities with Knowledge Base”, In Proc. HLT-NAACL2010, 2010.

• [Chen+ 11] Z. Chen, S. Tamang, A. Lee and H. Ji, “A Toolkit for Knowledge Base Population”, In Proc. SIGIR2011, 2011.
