Mining the Web to Create Minority Language Corpora

Rayid Ghani, Accenture Technology Labs - Research

Rosie Jones, Carnegie Mellon University

Dunja Mladenic, J. Stefan Institute, Slovenia

http://www.accenture.com/xd/xd.asp?it=enWeb&xd=index.xml

Who Needs a Language Specific Corpus?

Language Technology Applications: Language Modeling, Speech Recognition, Machine Translation

Linguistic and Socio-Linguistic Studies

Multilingual Retrieval

What Corpora are Available?

Explicit, marked-up corpora: Linguistic Data Consortium -- 20 languages [Liberman and Cieri 1998]

Search Engines -- implicit language-specific corpora, European languages, Chinese and Japanese
Excite - 12 languages
Google - 25 languages
AltaVista - 25 languages
Lycos - 25 languages

BUT what about Slovenian? Or Tagalog? Or Tatar?

You’re just out of luck!

The Human Solution

Start from Yahoo->Slovenia…

Crawl www.*.si

Search on the web, look at documents, modify query, analyze documents, modify query,…

Repetitive, time-consuming, requires reasonable familiarity with the language

Task

Given:
1 Document in Target Language
1 Other Document (negative example)
Access to a Web Search Engine

Create a Corpus of the Target Language quickly with no human effort

Algorithm

[Figure: system diagram. Top-level components: Seed Docs, Query Generator, WWW, Language Filter. Detailed loop labels: Initial Docs, Word Statistics, Build Query, Web, Filter, Relevant, Non-Relevant, Learning.]
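The loop in the diagram can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the system described in these slides: `TOY_WEB`, `toy_search`, `looks_like_target`, and `build_corpus` are hypothetical names, the "web" is a small in-memory list, the language filter is a crude vocabulary-overlap test, and each query uses a single frequent inclusion term and a single exclusion term.

```python
from collections import Counter

# Hypothetical stand-in for the Web: a tiny in-memory document collection.
TOY_WEB = [
    "dober dan kako si danes",
    "danes je lep dan v mestu",
    "the weather is nice today",
    "kako je vreme danes v mestu",
    "search engines index the web",
]


def toy_search(include, exclude):
    """Hypothetical search engine: docs with all include terms and no exclude terms."""
    hits = []
    for doc in TOY_WEB:
        words = set(doc.split())
        if all(t in words for t in include) and not any(t in words for t in exclude):
            hits.append(doc)
    return hits


def looks_like_target(doc, rel_vocab, non_vocab):
    """Crude language filter (an assumption): compare vocabulary overlap."""
    words = set(doc.split())
    return len(words & rel_vocab) > len(words & non_vocab)


def build_corpus(seed_relevant, seed_nonrelevant, rounds=5):
    relevant, nonrelevant = [seed_relevant], [seed_nonrelevant]
    seen = {seed_relevant, seed_nonrelevant}
    used_terms = set()
    for _ in range(rounds):
        # Word statistics over the documents collected so far.
        rel_counts = Counter(w for d in relevant for w in d.split())
        non_counts = Counter(w for d in nonrelevant for w in d.split())
        rel_vocab, non_vocab = set(rel_counts), set(non_counts)
        # Build query: most frequent unused word seen only in relevant docs,
        # excluding the most frequent word of the non-relevant docs.
        candidates = [w for w, _ in rel_counts.most_common()
                      if w not in non_vocab and w not in used_terms]
        if not candidates:
            break
        include = [candidates[0]]
        exclude = [w for w, _ in non_counts.most_common(1)]
        used_terms.add(candidates[0])
        # Retrieve, filter, and update the two document sets (the learning step).
        for doc in toy_search(include, exclude):
            if doc in seen:
                continue
            seen.add(doc)
            if looks_like_target(doc, rel_vocab, non_vocab):
                relevant.append(doc)
            else:
                nonrelevant.append(doc)
    return relevant, nonrelevant


relevant, nonrelevant = build_corpus("dober dan kako si danes",
                                     "the weather is nice today")
print(len(relevant), "relevant docs collected")
```

Each round recomputes word statistics from everything collected so far, which is what lets later queries draw on vocabulary that the single seed document did not contain.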

Query Generation

Examine current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to non-relevant ones

A Query consists of m inclusion terms and n exclusion terms, e.g. +intelligence +web -military
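As a small illustration of the query format, the sketch below joins inclusion and exclusion terms into a single query string. The function name `format_query` is hypothetical, and the `+`/`-` prefix syntax is taken from the example on this slide.

```python
def format_query(inclusion_terms, exclusion_terms):
    """Join m inclusion terms and n exclusion terms into a '+term -term' string."""
    parts = [f"+{t}" for t in inclusion_terms]
    parts += [f"-{t}" for t in exclusion_terms]
    return " ".join(parts)


# Reproduces the example from the slide.
print(format_query(["intelligence", "web"], ["military"]))
```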

Query Term Selection Methods

Uniform (UN) – select k words randomly from the current vocabulary

Term-Frequency (TF) – select top k words ranked according to their frequency

Probabilistic TF (PTF) – k words with probability proportional to their frequency
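The three frequency-based methods can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, word counts are assumed to be held in a `collections.Counter`, and sampling without replacement for PTF is an assumption (the slides do not say whether a word may be drawn twice).

```python
import random
from collections import Counter


def select_uniform(counts, k, rng):
    """UN: k words drawn uniformly at random from the current vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))


def select_tf(counts, k):
    """TF: the k most frequent words."""
    return [w for w, _ in counts.most_common(k)]


def select_ptf(counts, k, rng):
    """PTF: k distinct words, each draw weighted by word frequency.

    Drawing without replacement is an assumption of this sketch.
    """
    pool = dict(counts)
    chosen = []
    while pool and len(chosen) < k:
        words = list(pool)
        weights = [pool[w] for w in words]
        pick = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


counts = Counter("dan dan dan danes danes mesto".split())
rng = random.Random(0)  # fixed seed so runs are repeatable
print(select_tf(counts, 2))
print(select_ptf(counts, 2, rng))
print(select_uniform(counts, 2, rng))
```

TF is deterministic and always returns the same head of the frequency list, while UN and PTF trade some precision for variety across successive queries.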

Query Term Selection Methods

RTFIDF – top k words according to their RTFIDF scores

Odds-Ratio (OR) – top k words according to their odds-ratio scores

Probabilistic OR (POR) – select k words with probability proportional to their Odds-Ratio scores
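A possible reading of the two odds-ratio methods is sketched below. The slides do not give the formula, so this uses the common definition log(P(w|rel)(1 - P(w|nonrel)) / ((1 - P(w|rel)) P(w|nonrel))) as an assumption, with document-level probabilities and add-half smoothing (also an assumption) to avoid division by zero. All function names are hypothetical, and restricting POR to positively scored words is a choice made for this sketch. RTFIDF is not sketched because the slides do not define it.

```python
import math
import random


def doc_prob(word, docs):
    """Smoothed fraction of docs containing word (smoothing is an assumption)."""
    df = sum(1 for d in docs if word in d.split())
    return (df + 0.5) / (len(docs) + 1.0)


def odds_ratio(word, relevant, nonrelevant):
    """Log odds ratio of a word between relevant and non-relevant docs."""
    p = doc_prob(word, relevant)
    q = doc_prob(word, nonrelevant)
    return math.log(p * (1 - q) / ((1 - p) * q))


def score_vocab(relevant, nonrelevant):
    vocab = sorted({w for d in relevant + nonrelevant for w in d.split()})
    return {w: odds_ratio(w, relevant, nonrelevant) for w in vocab}


def select_or(relevant, nonrelevant, k):
    """OR: the k words with the highest odds-ratio scores."""
    scores = score_vocab(relevant, nonrelevant)
    return sorted(scores, key=lambda w: scores[w], reverse=True)[:k]


def select_por(relevant, nonrelevant, k, rng):
    """POR: k distinct words drawn with probability proportional to score.

    Only positively scored words are eligible (an assumption of this sketch).
    """
    pool = {w: s for w, s in score_vocab(relevant, nonrelevant).items() if s > 0}
    chosen = []
    while pool and len(chosen) < k:
        words = list(pool)
        pick = rng.choices(words, weights=[pool[w] for w in words], k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


relevant = ["dober dan", "dan je lep"]
nonrelevant = ["good day", "day is nice"]
print(select_or(relevant, nonrelevant, 1))
print(select_por(relevant, nonrelevant, 2, random.Random(0)))
```

Exclusion terms could be chosen the same way by swapping the roles of the two document sets, though the slides do not state how this was done.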

Evaluation

Goal: Collect as many relevant documents as possible while minimizing the cost

Cost:
Number of total documents retrieved from the Web
Number of distinct Queries issued to the Search Engine

Evaluation Measures:
Percentage of retrieved documents that are relevant
Number of relevant documents retrieved per unique query
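The two evaluation measures reduce to simple ratios, sketched below. The function names are hypothetical and the numbers in the usage lines are made up for illustration; they are not results from the experiments.

```python
def percent_relevant(num_relevant, num_retrieved):
    """Percentage of retrieved documents that are relevant."""
    if num_retrieved == 0:
        return 0.0
    return 100.0 * num_relevant / num_retrieved


def relevant_per_query(num_relevant, queries):
    """Relevant documents retrieved per unique query issued."""
    unique = len(set(queries))
    if unique == 0:
        return 0.0
    return num_relevant / unique


# Illustrative numbers only: 30 relevant out of 120 retrieved,
# using three issued queries of which two are distinct.
issued = ["+dan -the", "+dan -the", "+danes -the"]
print(percent_relevant(30, 120))
print(relevant_per_query(30, issued))
```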

Experimental Setup

Language: Slovenian

Initial documents: 1 web page in Slovenian, 1 in English

Search engine: AltaVista

Results

Results – Precision at 3000

[Figure: chart of the percentage of target docs after 3000 docs retrieved (axis 0 to 100), for query lengths 1, 3, 5, and 10 and term-selection methods OR, POR, TF, PTF, and UN.]

Results – Docs Per Query

[Figure: chart of docs per query (axis 0 to 1.8), for query lengths 1, 3, 5, and 10 and term-selection methods OR, POR, TF, PTF, and UN.]

Results - Summary

In terms of documents: For lengths 1-3, Odds-Ratio works best

In terms of queries: Odds-Ratio is consistently better than others

Long queries are usually very precise but do not result in a lot of documents (low recall)

Further Experiments

Comparison to AltaVista’s “More Like This”: better performance than AltaVista’s feature

Keywords: similar results when initializing with keywords instead of documents

Other Languages: similar results with Croatian, Czech and Tagalog

Conclusions

Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines

Not sensitive to initial “seed” documents

System and Corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder

Ideas for Future Work

Explore other Term-Selection methods

From Language specific corpus to Topic Specific corpus as an alternative to focused spidering

Finding documents matching a user profile – Personal Agent
