Mining Wiki Resources for Multilingual Named Entity Recognition. Alexander E. Richman & Patrick Schone. Reporter: Chia-Ying Lee. Advisor: Prof. Hsin-Hsi Chen. Department of Defense. ACL 2008.


Mining Wiki Resources for Multilingual Named Entity Recognition

Alexander E. Richman & Patrick Schone

Reporter: Chia-Ying Lee

Advisor: Prof. Hsin-Hsi Chen

Department of Defense

ACL 2008

Introduction

Use the multilingual Wikipedia to automatically create an annotated corpus of text in any given language.

Languages: French, Ukrainian, Spanish, Polish, Russian, and Portuguese.

The approach does not use any non-English linguistic resources outside the Wikimedia domain, nor any semantic resources such as WordNet or a POS tagger.

The NER engine is an internally modified variant of BBN's IdentiFinder (Bikel et al., 1999) called “PhoenixIDF,” modified specifically to emphasize fast text processing.

Related Work

Toral and Muñoz (2006) used Wikipedia to create lists of named entities. Their approach relies on WordNet and needs a manual supervision step.

Kazama and Torisawa (2007) used Wikipedia to build entity dictionaries. Their approach relies on a POS tagger.

Cucerzan (2007) used Wikipedia primarily for named entity disambiguation, following the path of Bunescu and Pasca (2006). The approach uses categories but is specific to English.

Wikipedia

A multilingual, collaborative encyclopedia on the Web that is freely available.

As of October 2007, there were over 2 million articles in English, 30 languages with at least 50,000 articles, and another 40 with at least 10,000 articles.

Wikipedia - features

Article links: links from one article to another in the same language.

Category links: links from an article to special “Category” pages.

Interwiki links: links from an article to a presumably equivalent article in another language.

Redirect pages: short pages that often provide equivalent names for an entity.

Disambiguation pages: pages with little content that link to multiple similarly named articles.

Example: http://en.wikipedia.org/wiki/FBI
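
The five features above can be read out of raw wikitext with a few patterns. The sketch below is only an illustration, assuming MediaWiki markup as it was commonly written around 2007 (inline `[[fr:...]]` interwiki links, `#REDIRECT [[...]]`, and a `{{disambig}}` template). The function name `classify_links`, the sample text, and the language-code heuristic are assumptions and not the authors' code.

```python
import re

# Matches [[target]] or [[target|label]] and captures only the target.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\[\]|]+)", re.IGNORECASE)
# Assumption: a 2-3 letter lowercase prefix marks an interwiki (language) link.
INTERWIKI_RE = re.compile(r"^([a-z]{2,3}):(.+)$")


def classify_links(wikitext):
    """Sort the links in one page's wikitext by the feature types listed above."""
    page = {"article": [], "category": [], "interwiki": {},
            "redirect": None, "disambiguation": False}
    redirect = REDIRECT_RE.match(wikitext)
    if redirect:
        # A redirect page only names its target.
        page["redirect"] = redirect.group(1).strip()
        return page
    page["disambiguation"] = "{{disambig" in wikitext.lower()
    for target in LINK_RE.findall(wikitext):
        target = target.strip()
        interwiki = INTERWIKI_RE.match(target)
        if target.startswith("Category:"):
            page["category"].append(target[len("Category:"):])
        elif interwiki:
            page["interwiki"][interwiki.group(1)] = interwiki.group(2)
        else:
            page["article"].append(target)
    return page


# Hypothetical sample page, loosely modeled on the FBI example.
sample = (
    "The '''FBI''' is an agency of the "
    "[[United States Department of Justice|Department of Justice]].\n"
    "[[Category:Federal Bureau of Investigation]]\n"
    "[[fr:Federal Bureau of Investigation]]\n"
)
info = classify_links(sample)
print(info)
```

Keeping the link types in separate fields makes the later steps simple: category links feed the type decision, and interwiki links provide the bridge to English.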

Training Data Generation

1. Initial Set-up

2. English Language Categorization

3. Multilingual Categorization

4. The Full System

Initial Set-up

ACE Named Entity types: PERSON, GPE (Geo-Political Entities), ORGANIZATION, VEHICLE, WEAPON, LOCATION, FACILITY, DATE, TIME, MONEY, and PERCENT.

MUC-style tags, e.g. <ENAMEX TYPE="GPE">Place Name</ENAMEX>

Process:

1. Identify words and phrases that might represent entities.

2. Use category links and/or interwiki links to associate each phrase with an English language phrase or set of Categories.

3. Determine the appropriate type of the English language data and assume that the original phrase is of the same type.
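
The three steps can be sketched for the simplest case, where the candidate phrases are the explicit article links in the text. This is a minimal illustration, not the authors' implementation: the `TITLE_TYPES` table stands in for the outcome of steps 2 and 3, and both it and the function name `tag_links` are hypothetical.

```python
import re

# Hypothetical result of steps 2-3: article title -> entity type.
# In the described system this is derived from category and interwiki links.
TITLE_TYPES = {"Bergen": "GPE", "Edvard Grieg": "PERSON"}

# Matches [[target]] or [[target|label]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def tag_links(wikitext, title_types):
    """Replace each article link with its surface text, wrapped in a
    MUC-style ENAMEX tag when the link target has a known entity type."""
    def replace(match):
        target = match.group(1).strip()
        surface = match.group(2) or target
        entity_type = title_types.get(target)
        if entity_type is None:
            return surface  # not an entity we can type: keep plain text
        return '<ENAMEX TYPE="{}">{}</ENAMEX>'.format(entity_type, surface)
    return LINK_RE.sub(replace, wikitext)


tagged = tag_links(
    "[[Edvard Grieg|Grieg]] was born in [[Bergen]] and played the [[piano]].",
    TITLE_TYPES,
)
print(tagged)
# <ENAMEX TYPE="PERSON">Grieg</ENAMEX> was born in <ENAMEX TYPE="GPE">Bergen</ENAMEX> and played the piano.
```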

English Language Categorization (1)

Wiki Useful Category => Key Category Phrase => Disambiguation Pages? => Wiktionary

Useful Category: “Category:Living People”: PERSON; “Category:Cities in Norway”: GPE

Useless Category: “Category:1912 Establishments”, which includes articles on Fenway Park (a facility), the Republic of China (a GPE), and the Better Business Bureau (an organization).
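
A rough sketch of how key category phrases could drive the type decision follows. The two phrases come from the examples above; the `KEY_PHRASES` table as a whole, the function names, and the majority vote used to combine several categories are assumptions made for illustration, since the slide does not spell out how the evidence is combined.

```python
from collections import Counter

# Hypothetical key phrases. The slide only gives "Living People" -> PERSON
# and "Cities in ..." -> GPE as examples of useful categories.
KEY_PHRASES = {"living people": "PERSON", "cities in": "GPE"}


def category_type(category):
    """Return the entity type suggested by one category name, or None
    when the category is not useful (e.g. "1912 Establishments")."""
    name = category.lower()
    for phrase, entity_type in KEY_PHRASES.items():
        if phrase in name:
            return entity_type
    return None


def article_type(categories):
    """Assumed combination rule: majority vote over the useful categories."""
    votes = Counter(t for t in map(category_type, categories) if t is not None)
    if not votes:
        # No useful category: other evidence would be needed here,
        # such as disambiguation pages or Wiktionary.
        return None
    return votes.most_common(1)[0][0]


print(article_type(["Category:Cities in Norway", "Category:1912 Establishments"]))
print(article_type(["Category:1912 Establishments"]))
```

A category such as “1912 Establishments” casts no vote at all, which is the behavior the Fenway Park example calls for.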

English Language Categorization (2)

Multilingual Categorization

Not all articles have an English equivalent, but many of the most useful categories do.

French: “Catégorie:Commune des Côtes-d'Armor,” “Catégorie:Ville portuaire de France,” “Catégorie:Port de plaisance,” and “Catégorie:Station balnéaire.”

English: “Category:Communes of Côtes-d'Armor,” UNKNOWN, “Category:Marinas,” and “Category:Seaside resorts.”
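
The French example can be written as a lookup through category-level interwiki links. In this sketch the `FR_TO_EN` table and the function name are hypothetical; the entries simply mirror the example above, with the category that has no English counterpart mapped to UNKNOWN.

```python
# Hypothetical interwiki map from French category pages to English ones.
# "Catégorie:Ville portuaire de France" is deliberately absent.
FR_TO_EN = {
    "Catégorie:Commune des Côtes-d'Armor": "Category:Communes of Côtes-d'Armor",
    "Catégorie:Port de plaisance": "Category:Marinas",
    "Catégorie:Station balnéaire": "Category:Seaside resorts",
}


def to_english_categories(categories, interwiki):
    """Map each non-English category to its English equivalent,
    or to "UNKNOWN" when no interwiki link exists."""
    return [interwiki.get(category, "UNKNOWN") for category in categories]


french = [
    "Catégorie:Commune des Côtes-d'Armor",
    "Catégorie:Ville portuaire de France",
    "Catégorie:Port de plaisance",
    "Catégorie:Station balnéaire",
]
english = to_english_categories(french, FR_TO_EN)
print(english)
```

The English categories that survive the mapping can then be typed in the same way as the categories of an English article.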

The Full System

The first pass uses the explicit article links within the text.

We then search an associated English language article, if available, for additional information.

A second pass checks for multi-word phrases that exist as titles of Wikipedia articles.

We look for certain types of person and organization instances.

We perform additional processing for alphabetic or space-separated languages, including a third pass looking for single-word Wikipedia titles.

We use regular expressions to locate additional entities such as numeric dates.
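
Two of these steps are easy to illustrate: the second pass over multi-word titles and the regular-expression pass for numeric dates. The sketch below is written under stated assumptions: the `TITLES` dictionary, the greedy longest-match strategy, and the three date shapes are all hypothetical choices, as the slide names the passes without giving their details.

```python
import re

# Hypothetical title dictionary: tokenized article title -> entity type.
TITLES = {("New", "York", "City"): "GPE", ("United", "Nations"): "ORGANIZATION"}


def find_title_phrases(tokens, titles):
    """Greedy longest-match scan for multi-word titles.
    Returns (start, end, type) token spans with an exclusive end."""
    max_len = max((len(title) for title in titles), default=0)
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; stop at 2 tokens (multi-word only).
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            entity_type = titles.get(tuple(tokens[i:i + n]))
            if entity_type:
                spans.append((i, i + n, entity_type))
                i += n
                break
        else:
            i += 1  # no title starts here
    return spans


# Assumed date shapes: 12/31/2007, 31.12.2007, 2007-12-31.
DATE_RE = re.compile(r"\b(?:\d{1,2}[/.]\d{1,2}[/.]\d{4}|\d{4}-\d{2}-\d{2})\b")


def find_numeric_dates(text):
    """Return (start, end, "DATE") character spans for numeric dates."""
    return [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]


tokens = "The United Nations met in New York City".split()
print(find_title_phrases(tokens, TITLES))

text = "Signed on 18.12.2007 and 2007-10-01."
print([text[start:end] for start, end, _ in find_numeric_dates(text)])
```

Trying the longest candidate first keeps “New York City” from being split into a shorter title plus a leftover token.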

Evaluation – All Wiki test set

Three human-annotated newswire test sets: Spanish, French, and Ukrainian.

F-score   Spanish  French  Ukrainian  Polish  Portuguese  Russian
ALL       .846     .844    .807       .859    .804        .802
DATE      .925     .910    .848       .891    .861        .822
GPE       .877     .868    .887       .916    .826        .867
ORG       .701     .718    .657       .785    .706        .712
PERSON    .821     .823    .690       .836    .802        .751
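
For reference, an F-score is conventionally the harmonic mean of precision and recall. The slide does not say which variant was used, so the balanced F1 below is an assumption, and the counts in the usage line are made up purely to show the arithmetic.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F1 from entity-level counts (assumed variant)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 correct entities, 20 spurious, 20 missed.
print(round(f_score(80, 20, 20), 3))
```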

Evaluation – Spanish (1)

The Spanish Wikipedia is substantial and well developed, with more than 290,000 articles as of October 2007.

Newswire: 25,000 words from the ACE 2007 test set, manually modified to extended MUC-style standards.

Wiki test set: 335,000 words.

Evaluation – Spanish (2)

Either Wikipedia is relatively poor in organizations, or PhoenixIDF underperforms when identifying organizations relative to other categories, or both.

Traditional Training: trained PhoenixIDF on ACE 2007 data converted to MUC-style tags.

Evaluation – French

The French Wikipedia is one of the largest, with more than 570,000 articles as of October 2007.

Newswire: 25,000 words from Agence France Presse.

Wiki test set: 920,000 words.

Results are similar to Spanish.

Evaluation – Ukrainian (1)

The Ukrainian Wikipedia is medium-sized, with 74,000 articles as of October 2007.

The typical article is shorter and less well linked to other articles than in the French or Spanish versions.

Newswire: approximately 25,000 words from various online news sites, covering primarily political topics.

Wiki test set: 395,000 words.

Traditional Training: trained PhoenixIDF

Newswire data

Evaluation – Ukrainian (2)

The Ukrainian newswire contained a much higher proportion of organizations than the French or Spanish versions.

The Ukrainian language Wikipedia contains very few articles on organizations relative to other types.

Conclusion

Wikipedia can be used to create an NER system with performance comparable to one developed from human-annotated newswire, without requiring any linguistic expertise.

This level of performance can likely be obtained currently in 20-40 languages.

A Wikipedia-derived system could be used as a supplement to other systems for many more languages.

An automatically generated entity dictionary is embedded in our system.

Future Work

Automatically generate the list of key words and phrases for useful English language categories.

The authors also believe performance could be improved by using higher order non-English categories and better disambiguation.

Lists of organizations might be particularly useful, and “List of” pages are common in many languages.

Thank you!