Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

Lixin Shi and Jian-Yun Nie

Dept. d'Informatique et de Recherche Opérationnelle, Université de Montréal

1. Motivation

2. Related Work

3. Using Different Indexing and Translation Units

4. Experiments

5. Conclusion and Future Work

1. Motivation

The difference between East-Asian and most European languages

The common problem in East-Asian languages (Chinese, Japanese) processing is the lack of natural word boundaries.

For information retrieval, we have to determine the index unit first.
— Using word segmentation
— Cutting the sentence into n-grams



Word segmentation

Based on rule and/or statistical approaches

Problems in information retrieval
— Segmentation ambiguity: if a document and a query are segmented into different words, they cannot be matched with each other. For example, “发展中国家” can be segmented as:
   发展中 (developing) / 国家 (country)
   发展 (development) / 中 (middle) / 国家 (country)
   发展 (development) / 中国 (China) / 家 (family)
— Two different words may have the same or a related meaning, especially when they share some common characters: 办公室 (office) ↔ 办公楼 (office building)



Cutting the sentence into n-grams

No linguistic resource is needed

The use of unigrams and bigrams has been investigated in several previous studies.
— As effective as word segmentation

Limitations of previous studies
— Only used in monolingual IR
— It is not clear whether bigrams and characters can be used in CLIR



We focus on

Using words and n-grams as index units for monolingual IR within the LM framework.

Comparing words and n-grams as translation units in CLIR
— We only tested English-Chinese CLIR

2. Related work


Cross-Language IR

Translation between query and document languages

Basic approach: query translation
— MT system
— Bilingual dictionary
— Parallel corpus

Source-language query → retrieval on the parallel corpus → retrieved documents → target-language query → IR (Davis & Ogden ’97, Yang et al. ’98)

Train a probabilistic translation model (TM) from a parallel corpus, then use the TM for CLIR (Nie et al. ’99, Gao et al. ’01, ’02, Jin & Chai ’05)



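LM approach to IR

Query-likelihood retrieval model: rank documents by the probability of document D generating query Q (Ponte & Croft ’98, Croft ’03)

$$P(Q \mid D) = \prod_{q_i \in Q} P(q_i \mid D)$$

KL-divergence (relative entropy between the query language model and the document language model) (Lafferty & Zhai ’01, ’02)

$$Score(D, Q) = -KL(\theta_Q \,\|\, \theta_D) = -\sum_{w \in V} P(w \mid \theta_Q) \log \frac{P(w \mid \theta_Q)}{P(w \mid \theta_D)}$$

Smoothing of the document model with the collection model C:

$$P(w \mid \theta_D) = \lambda P(w \mid D) + (1 - \lambda) P(w \mid C)$$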
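The query-likelihood and KL-divergence scores can be illustrated with a minimal Python sketch. It assumes a maximum-likelihood query model, character tokens, and an interpolation weight `lam` of 0.5. The function names, the weight, and the two toy documents are hypothetical and do not come from the slides.

```python
import math
from collections import Counter


def smoothed_doc_model(doc_tokens, collection_counts, collection_len, lam=0.5):
    """Return a function w -> lam * P(w|D) + (1 - lam) * P(w|C)."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)

    def prob(w):
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_col = collection_counts[w] / collection_len
        return lam * p_doc + (1 - lam) * p_col

    return prob


def query_log_likelihood(query_tokens, doc_prob):
    """log P(Q|D) = sum_i log P(q_i|D)."""
    return sum(math.log(doc_prob(q)) for q in query_tokens)


def neg_kl_score(query_tokens, doc_prob):
    """-KL(theta_Q || theta_D) with a maximum-likelihood query model.

    Terms with P(w|theta_Q) = 0 contribute nothing, so only query terms
    are summed. Every query term is assumed to occur in the collection.
    """
    q_counts = Counter(query_tokens)
    q_len = len(query_tokens)
    score = 0.0
    for w, c in q_counts.items():
        p_q = c / q_len
        score -= p_q * math.log(p_q / doc_prob(w))
    return score


# Toy collection (hypothetical), indexed by single characters.
docs = {"d1": list("国企研发投资"), "d2": list("发展中国家")}
collection_counts = Counter(tok for toks in docs.values() for tok in toks)
collection_len = sum(collection_counts.values())

query = list("国企投资")
models = {d: smoothed_doc_model(toks, collection_counts, collection_len)
          for d, toks in docs.items()}
ranking = sorted(docs, key=lambda d: neg_kl_score(query, models[d]), reverse=True)
```

With a maximum-likelihood query model, the KL score equals the length-normalized query log-likelihood plus the query entropy, which is constant for a given query, so both functions rank documents identically.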



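LM approach to CLIR

Statistical translation model: a query is viewed as a distillation or translation of a document (Berger & Lafferty ’99)

$$P(Q \mid D) = \prod_{q_i \in Q} \sum_{w} t(q_i \mid w)\, P(w \mid D)$$

KL-divergence model for CLIR (Kraaij et al. ’03)

$$\begin{aligned}
P(w \mid \theta_Q) = P(t_i \mid \theta_{Q_s}) &= \sum_j P(s_j, t_i \mid \theta_{Q_s}) \\
&= \sum_j P(t_i \mid s_j, \theta_{Q_s})\, P(s_j \mid \theta_{Q_s}) \\
&\approx \sum_j t(t_i \mid s_j)\, P(s_j \mid \theta_{Q_s})
\end{aligned}$$

where t is a term in the document (target) language and s is a term in the query (source) language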
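As a rough illustration of the statistical translation model of Berger & Lafferty, the sketch below computes log P(Q|D) by summing, for each query term, over the document terms it could be a translation of. The translation table `t`, the document model and all probability values are hypothetical toy numbers.

```python
import math


def translated_query_log_likelihood(query_terms, doc_model, t):
    """log P(Q|D) = sum_i log sum_w t(q_i|w) * P(w|D).

    doc_model maps a document-language term w to P(w|D).
    t maps (q, w) to the translation probability t(q|w).
    """
    total = 0.0
    for q in query_terms:
        p = sum(t.get((q, w), 0.0) * p_w for w, p_w in doc_model.items())
        if p == 0.0:
            return float("-inf")  # no document term can generate q
        total += math.log(p)
    return total


# Hypothetical translation probabilities t(english | chinese unit).
t = {
    ("history", "历史"): 0.8,
    ("history", "史文"): 0.1,
    ("civilization", "文明"): 0.9,
}
# Hypothetical (already smoothed) document model.
doc_model = {"历史": 0.5, "文明": 0.5}

score = translated_query_log_likelihood(["history", "civilization"], doc_model, t)
```

Here P(history|D) = 0.8 × 0.5 and P(civilization|D) = 0.9 × 0.5, so the score is log(0.4 × 0.45).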



Chinese IR

Segment Chinese sentences into index units
— N-gram (Chien ’95, Liang & Lee ’96)
— Word (Kwok ’97)
— Word + n-gram (Luk et al. ’02, Nie et al. ’96, ’00)

For Chinese CLIR, words are used as the translation units

3. Using different indexing and translation units


Using different indexing units

Single index
— Unigram (U)
— Bigram (B)
— Word (W)
— Word+Unigram (WU)
— Bigram+Unigram (BU)

$$Score(D, Q) = -KL(\theta_Q \,\|\, \theta_D)$$

Example: “国企研发投资”

U: 国 / 企 / 研 / 发 / 投 / 资

B: 国企 / 企研 / 研发 / 发投 / 投资

W: 国企 / 研发 / 投资

WU: 国企 / 研发 / 投资 / 国 / 企 / 研 / 发 / 投 / 资

BU: 国企 / 企研 / 研发 / 发投 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
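The five index-unit schemes in the example can be reproduced with a short Python sketch. The function names are hypothetical, and the word list for the W and WU schemes is assumed to come from an external word segmenter, which is not implemented here.

```python
def unigrams(text):
    """Split a string into single characters."""
    return list(text)


def bigrams(text):
    """Return all overlapping character bigrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def index_units(text, scheme, words=None):
    """Return the index units of `text` for scheme U, B, W, WU or BU.

    `words` is the output of an external word segmenter (an assumption of
    this sketch); it is required for the W and WU schemes.
    """
    if scheme in ("W", "WU") and words is None:
        raise ValueError("word-based schemes need a pre-segmented word list")
    if scheme == "U":
        return unigrams(text)
    if scheme == "B":
        return bigrams(text)
    if scheme == "W":
        return list(words)
    if scheme == "WU":
        return list(words) + unigrams(text)
    if scheme == "BU":
        return bigrams(text) + unigrams(text)
    raise ValueError(f"unknown scheme: {scheme}")


sentence = "国企研发投资"
segmented = ["国企", "研发", "投资"]  # assumed segmenter output
units = {s: index_units(sentence, s, words=segmented)
         for s in ("U", "B", "W", "WU", "BU")}
```

In the mixed schemes (WU, BU) both kinds of unit go into the same index, so a document and a query can match on either unit.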



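Multiple indexes
— Unigram
— Bigram
— Word

Combining different scores:

$$Score(D, Q) = \sum_i \alpha_i\, Score_i(D, Q)$$

where $\alpha_i$ is the weight given to index $i$.

[Figure: the query Q and the document D are each represented in two index units, (Q1, D1) and (Q2, D2); the scores KL(Q1, D1) and KL(Q2, D2) are then combined.]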
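The combination of scores from separate indexes amounts to a weighted sum per document. The sketch below uses the 0.3B+0.7U weighting that appears in the experiments; the function name and the per-index scores are hypothetical toy values.

```python
def combine_scores(scores_by_index, weights):
    """Score(D, Q) = sum_i weight_i * Score_i(D, Q).

    scores_by_index maps an index name to {doc_id: score}. Only documents
    scored by every index are kept, which is a simplification of this sketch.
    """
    shared = set.intersection(*(set(s) for s in scores_by_index.values()))
    return {
        doc: sum(weights[name] * scores_by_index[name][doc] for name in weights)
        for doc in shared
    }


# Hypothetical per-index scores (for example, negative KL-divergences).
scores_by_index = {
    "B": {"d1": -2.0, "d2": -3.0},
    "U": {"d1": -1.5, "d2": -1.0},
}
weights = {"B": 0.3, "U": 0.7}

combined = combine_scores(scores_by_index, weights)
ranking = sorted(combined, key=combined.get, reverse=True)
```

In this toy case the bigram index alone prefers d1 and the unigram index alone prefers d2; with the weights above, the combined score ranks d2 first.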



Using different translation units

Parallel corpus: “history and civilization” || “历史文明” …

— English side: English words
— Chinese side: Chinese words, Chinese unigrams, Chinese bigrams, or Chinese bigrams & unigrams

Example with bigrams: … history / and / civilization || 历史 / 史文 / 文明 …

GIZA++ training

TM: p(历史|history), p(史文|history), p(文明|history), …
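GIZA++ itself is not reproduced here. As a stand-in, the sketch below runs a few EM iterations of IBM Model 1, the simplest of the alignment models that GIZA++ trains, without a NULL word, on a toy parallel corpus whose Chinese side is cut into bigrams. The three sentence pairs and the function names are hypothetical.

```python
from collections import defaultdict


def bigrams(text):
    """Return all overlapping character bigrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train_ibm1(pairs, iterations=10):
    """Estimate t[(c, e)] ~ p(c|e) with IBM Model 1 EM (no NULL word).

    pairs is a list of (english_words, chinese_units).
    """
    c_vocab = {c for _, cs in pairs for c in cs}
    t = defaultdict(lambda: 1.0 / len(c_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each Chinese unit over the English words.
        for es, cs in pairs:
            for c in cs:
                norm = sum(t[(c, e)] for e in es)
                for e in es:
                    frac = t[(c, e)] / norm
                    count[(c, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per English word.
        t = defaultdict(
            float, {(c, e): count[(c, e)] / total[e] for (c, e) in count}
        )
    return t


# Toy parallel corpus (hypothetical); the Chinese side is cut into bigrams.
pairs = [
    (["history", "and", "civilization"], bigrams("历史文明")),
    (["history"], bigrams("历史")),
    (["civilization"], bigrams("文明")),
]
tm = train_ibm1(pairs)
history_row = {c: p for (c, e), p in tm.items() if e == "history"}
best_for_history = max(history_row, key=history_row.get)
```

Changing the translation unit only changes how the Chinese side is tokenized before training (words, unigrams, bigrams, or bigrams plus unigrams); the training procedure stays the same.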



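Using different translation units

English query → Chinese documents

— Using the best translation and index unit
— Combine multiple index units in the same way as in monolingual IR

$$Q_u:\ \sum_j t(u_i \mid e_j)\, P(e_j \mid Q)$$

$$Q_b:\ \sum_j t(b_i \mid e_j)\, P(e_j \mid Q)$$

[Figure: the English query Q is translated into a unigram query model Q_u and a bigram query model Q_b, which are matched against the unigram index D_u and the bigram index D_b of each Chinese document D.]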
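The two query-translation formulas have the same shape and differ only in the translation table used. A minimal sketch, assuming P(e_j|Q) is the relative frequency of e_j in the English query; the function name and both translation tables are hypothetical toy values.

```python
from collections import Counter, defaultdict


def translate_query_model(english_query, tm):
    """P(c_i|Q) = sum_j t(c_i|e_j) * P(e_j|Q).

    tm maps an English word e to {chinese_unit: t(chinese_unit|e)}.
    P(e_j|Q) is estimated as the relative frequency of e_j in the query.
    """
    e_counts = Counter(english_query)
    n = len(english_query)
    model = defaultdict(float)
    for e, cnt in e_counts.items():
        for c, p in tm.get(e, {}).items():
            model[c] += p * cnt / n
    return dict(model)


# Hypothetical translation tables for unigram and bigram units.
tm_u = {
    "history": {"历": 0.5, "史": 0.5},
    "civilization": {"文": 0.5, "明": 0.5},
}
tm_b = {
    "history": {"历史": 0.9, "史文": 0.1},
    "civilization": {"文明": 1.0},
}

query = ["history", "civilization"]
q_u = translate_query_model(query, tm_u)  # matched against the unigram index
q_b = translate_query_model(query, tm_b)  # matched against the bigram index
```

Each translated query model is then scored against the document index built with the same unit, and the resulting scores can be combined as in the monolingual case.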



Combine bilingual dictionary and parallel corpus

Combining a translation model with a bilingual dictionary can greatly improve the CLIR effectiveness (Nie&Simard’01)

Our experiments show that treating a bilingual dictionary as a parallel text and training a TM on it works much better than using the dictionary entries directly.

As a dictionary is transformed into a TM, it can be combined with a parallel corpus by combining their TMs.

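$$Score_{M_C, M_D}(D, Q) = \lambda\, Score_{M_C}(D, Q) + (1 - \lambda)\, Score_{M_D}(D, Q)$$

where $M_C$ is the TM trained on the parallel corpus and $M_D$ is the TM trained on the dictionary.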

4. Experiments


Experiment Setting

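| Language | NTCIR3/4 collections | #docs (thousands) | NTCIR5/6 collections | #docs (thousands) |
|---|---|---|---|---|
| Cn | CIRB011, CIRB020 | 381 | CIRB040r | 901 |
| Jp | Mainichi98/99, Yomiuri98+99 | 594 | Mainichi00/01r, Yomiuri00+01 | 858 |
| Kr | Chosunilbo98/99, Hankookilbo | 254 | Chosunilbo00/01, Hankookilbo00/01 | 220 |

| | NTCIR3 | NTCIR4 | NTCIR5 | NTCIR6 |
|---|---|---|---|---|
| Number of topics | 50 | 60 | 50 | 50 |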



Chinese Linguistic Resource

The English-Chinese parallel corpus is mined from the Web by our automatic mining tool.
— From 6 websites: United Nations, Hong Kong, Taiwan, and Mainland China
— About 4,000 pairs of pages
— After sentence alignment, we have 281,000 parallel sentence pairs

English-Chinese bilingual dictionaries are from LDC. The final merged dictionary contains 42,000 entries.



Using different indexing units for C/J/K monolingual IR on NTCIR4/5: Mean Average Precision (MAP)

| Run | U Rigid | U Relax | B Rigid | B Relax | W Rigid | W Relax | BU Rigid | BU Relax | WU Rigid | WU Relax | 0.3B+0.7U Rigid | 0.3B+0.7U Relax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C-C-T-N4 | .1929 | .2370 | .1670 | .2065 | .1679 | .2131 | .1928 | .2363 | .1817 | .2269 | .1979 | .2455 |
| C-C-T-N5 | .3302 | .3589 | .2713 | .3300 | .2676 | .3315 | .2974 | .3554 | .3017 | .3537 | .3300 | .3766 |
| J-J-T-N4 | .2377 | .2899 | .2768 | .3670 | − | − | .2807 | .3722 | − | − | .2873 | .3664 |
| J-J-T-N5 | .2376 | .2730 | .2471 | .3273 | − | − | .2705 | .3458 | − | − | .2900 | .3495 |
| K-K-T-N4 | .2004 | .2147 | .3873 | .4195 | − | − | .4084 | .4396 | − | − | .3608 | .3889 |
| K-K-T-N5 | .2603 | .2777 | .3699 | .3996 | − | − | .3865 | .4178 | − | − | .3800 | .4001 |

For Chinese, unigrams (U) outperform both bigrams (B) and words (W).

Interpolating unigram and bigram (B+U) has the best performance for Chinese and Japanese.

However, BU and B are best for Korean.



The results of CJK monolingual IR on NTCIR6

Our submission: Chinese & Japanese: U+B; Korean: K-K-T: BU, K-K-D: U

Our submitted results are lower than the average MAPs of NTCIR6. After applying a simple pseudo-relevance feedback, the results become more comparable to the average MAPs.

| Run-id | RALI without pseudo feedback (Rigid) | RALI without pseudo feedback (Relax) | RALI with pseudo feedback (Rigid) | RALI with pseudo feedback (Relax) | Average MAP of all NTCIR6 runs (Rigid) | Average MAP of all NTCIR6 runs (Relax) |
|---|---|---|---|---|---|---|
| C-C-T | .2139 | .3022 | .2330 | .3303 | .2269 | .3141 |
| C-C-D | .1671 | .2376 | .2031 | .2907 | .2354 | .3294 |
| J-J-T | .2426 | .3171 | .2576 | .3343 | .2707 | .3427 |
| J-J-D | .1877 | .2485 | .2292 | .3052 | .2480 | .3214 |
| K-K-T | .3332 | .3939 | .3460 | .4130 | .3833 | .4644 |
| K-K-D | .2623 | .2970 | .3287 | .3945 | .3892 | .4678 |



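Interpolating parallel corpus and bilingual dictionary

| Run | Translation resource | U (Relax) | B (Relax) | W (Relax) | BU (Relax) |
|---|---|---|---|---|---|
| E-C-T-N4 | Corpus | .0962 | .0804 | .0703 | .1066 |
| E-C-T-N4 | Dict. | .0862 | .0548 | .0712 | .0870 |
| E-C-T-N4 | Int(C,D) | .1060 | .1004 | .0897 | .1194 |
| E-C-D-N4 | Corpus | .0898 | .0749 | .0735 | .0981 |
| E-C-D-N4 | Dict. | .0739 | .0482 | .0647 | .0723 |
| E-C-D-N4 | Int(C,D) | .1021 | .0897 | .0893 | .1076 |
| E-C-T-N5 | Corpus | .1686 | .1306 | .1305 | .1852 |
| E-C-T-N5 | Dict. | .1187 | .0694 | .0898 | .0910 |
| E-C-T-N5 | Int(C,D) | .1727 | .1512 | .1566 | .1970 |
| E-C-D-N5 | Corpus | .1558 | .1172 | .0943 | .1743 |
| E-C-D-N5 | Dict. | .1140 | .0550 | .1114 | .0920 |
| E-C-D-N5 | Int(C,D) | .1792 | .1369 | .1492 | .1844 |

Int(C,D) = 0.7*Corpus + 0.3*Dict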



English to Chinese CLIR result on NTCIR 3/4/5

Two-step interpolation of the retrieval scores: Dict(B+U) + Corpus(B+U)

Using bigrams and unigrams as translation units is a reasonable alternative to words. Combinations of bigrams and unigrams usually produce higher effectiveness.

| Run | U Rigid | U Relax | B Rigid | B Relax | W Rigid | W Relax | BU Rigid | BU Relax | 0.3B+0.7U Rigid | 0.3B+0.7U Relax |
|---|---|---|---|---|---|---|---|---|---|---|
| E-C-T-N3 | .0928 | .1106 | .0805 | .0985 | .0898 | .1080 | .0938 | .1102 | .1021 | .1170 |
| E-C-D-N3 | .0900 | .1149 | .1037 | .1333 | .1163 | .1315 | .1116 | .1370 | .1226 | .1439 |
| E-C-T-N4 | .0935 | .1060 | .0872 | .1004 | .0746 | .0897 | .1042 | .1194 | .1018 | .1180 |
| E-C-D-N4 | .0921 | .1021 | .0774 | .0897 | .0727 | .0893 | .0935 | .1076 | .1017 | .1173 |
| E-C-T-N5 | .1533 | .1727 | .1245 | .1512 | .1317 | .1566 | .1632 | .1970 | .1655 | .1916 |
| E-C-D-N5 | .1676 | .1792 | .1158 | .1369 | .1254 | .1492 | .1629 | .1844 | .1776 | .1946 |

5. Conclusion and future work


Conclusion

We focused on the comparison between words, n-grams, and their combinations, both for indexing and for query translation.

Our experimental results showed that n-grams are generally as effective as words for monolingual IR and CLIR in Chinese. For Japanese and Korean, n-gram approaches are close to the average results of NTCIR6.

We tested the approach that creates different types of index separately, then combines them during the retrieval process. We found this approach slightly more effective for Chinese and Japanese.

Overall, n-grams can be an interesting alternative or complement to words as indexing and translation units.


Future work

We noticed that the best index unit depends on the query: one index unit works better for some queries, while another works better for others.

— How to determine the correct combination?

Combine words and n-grams by semantic information

Thanks
