
Page 1: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

Lixin Shi and Jian-Yun Nie

Dept. d'Informatique et de Recherche Opérationnelle Université de Montréal

Page 2: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR


1. Motivation

2. Related Work

3. Using Different Indexing and Translation Unit

4. Experiments

5. Conclusion and Future Work

Page 3: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

1. Motivation

The difference between East-Asian and most European languages

The common problem in processing East-Asian languages (Chinese, Japanese) is the lack of natural word boundaries.

For information retrieval, we have to determine the index unit first.
— Using word segmentation
— Cutting the sentence into n-grams (sketched below)
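The n-gram option needs no segmenter; a minimal Python sketch (our illustration, not the authors' code) of cutting a sentence into character n-grams:

```python
def char_ngrams(text, n):
    """Cut a string into overlapping character n-grams, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    return [''.join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

print(char_ngrams("发展中国家", 1))  # ['发', '展', '中', '国', '家']
print(char_ngrams("发展中国家", 2))  # ['发展', '展中', '中国', '国家']
```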

Page 4: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

1. Motivation

Word segmentation

Based on rule-based and/or statistical approaches

Problems in information retrieval
— Segmentation ambiguity: if a document and a query are segmented into different words, it is impossible to match them with each other. For example, “发展中国家” can be segmented as:
发展中 (developing) / 国家 (country)
发展 (development) / 中 (middle) / 国家 (country)
发展 (development) / 中国 (China) / 家 (family)
— Two different words may have the same or related meanings, especially when they share common characters: 办公室 (office) ↔ 办公楼 (office building)

Page 5: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

1. Motivation

Cutting the sentence into n-grams

No need for any linguistic resources

The utilization of unigrams and bigrams has been investigated in several previous studies.
— As effective as using word segmentation

The limitations of previous studies
— Only tested in monolingual IR
— It is not clear whether bigrams and characters can be used in CLIR

Page 6: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

1. Motivation

We focus on

Utilization of words and n-grams as index units for monolingual IR within the LM framework

Comparing the utilization of words and n-grams as translation units in CLIR
— We only tested English-Chinese CLIR

Page 7: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

2. Related work

Page 8: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

2. Related Work

Cross-Language IR

Translation between query and document languages

Basic approach: query translation
— MT system
— Bilingual dictionary
— Parallel corpus

Query_src → retrieval on the parallel corpus → aligned Doc_trg → Query_trg → IR (Davis&Ogden'97, Yang et al'98)

Train a probabilistic translation model (TM) from a parallel corpus, then use the TM for CLIR (Nie et al'99, Gao et al'01,'02, Jin&Chai'05)

Page 9: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

2. Related Work

LM approach to IR

Query-likelihood retrieval model: rank by the probability of document D generating query Q (Ponte&Croft'98, Croft'03):

P(Q|D) = \prod_{q_i \in Q} P(q_i|D)

KL-divergence: relative entropy between the query language model and the document language model (Lafferty&Zhai'01,'02):

Score(D,Q) = -KL(\theta_Q \,\|\, \theta_D) \propto \sum_{w \in V} P(w|\theta_Q) \log P(w|\theta_D)

Document model smoothed with the collection model:

P_\lambda(w|D) = \lambda P(w|D) + (1-\lambda) P(w|C)
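As a concrete illustration of query-likelihood scoring with Jelinek-Mercer smoothing (a sketch under assumed inputs, not the authors' implementation; λ = 0.7 is an arbitrary choice, and the tokens may be unigrams, bigrams, or words):

```python
import math
from collections import Counter

def score_query_likelihood(query_tokens, doc_tokens, coll_tf, coll_len, lam=0.7):
    """log P(Q|D) = sum_i log P(q_i|D), with Jelinek-Mercer smoothing:
    P(w|D) = lam * P_ml(w|D) + (1 - lam) * P(w|C)."""
    doc_tf = Counter(doc_tokens)
    doc_len = len(doc_tokens) or 1
    log_p = 0.0
    for w in query_tokens:
        p = lam * doc_tf[w] / doc_len + (1 - lam) * coll_tf.get(w, 0) / coll_len
        log_p += math.log(p) if p > 0 else math.log(1e-12)  # floor for unseen terms
    return log_p
```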

Page 10: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

2. Related Work

LM approach to CLIR

Statistical translation model: a query as a distillation or translation of a document (Berger&Lafferty'99):

P(Q|D) = \prod_{q_i \in Q} \sum_{w} t(q_i|w) P(w|D)

KL-divergence model for CLIR (Kraaij et al'03):

P(t_i|\theta_Q) = \sum_j P(t_i, s_j|\theta_Q) = \sum_j P(t_i|s_j, \theta_Q) P(s_j|\theta_Q) \approx \sum_j t(t_i|s_j) P(s_j|\theta_Q)

where t_i is a term in the document (target) language and s_j is a term in the query (source) language
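A minimal sketch of the translated query model P(t_i|θ_Q) = Σ_j t(t_i|s_j) P(s_j|θ_Q) (the dictionary structures and probability values are assumptions for illustration; a real table would come from GIZA++ training):

```python
from collections import defaultdict

def translate_query_model(query_lm, trans_table):
    """P(t|theta_Q) = sum_s t(t|s) * P(s|theta_Q).
    query_lm:    {source_term: P(s|theta_Q)}
    trans_table: {source_term: {target_term: t(t|s)}}"""
    target_lm = defaultdict(float)
    for s, p_s in query_lm.items():
        for t, p_ts in trans_table.get(s, {}).items():
            target_lm[t] += p_ts * p_s
    return dict(target_lm)

# Illustrative numbers only:
table = {"history": {"历史": 0.6, "史文": 0.1, "文明": 0.05}}
print(translate_query_model({"history": 1.0}, table))
```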

Page 11: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

2. Related Work

Chinese IR

Segment the Chinese sentence into index units
— N-gram (Chien'95, Liang&Lee'96)
— Word (Kwok'97)
— Word + n-gram (Luk et al'02, Nie et al'96,'00)

For Chinese CLIR, words are used as the translation units

Page 12: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

3. Using different indexing and translation units

Page 13: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

3. Using different indexing and translation units

Using different indexing units

Single index

— Unigram
— Bigram
— Word
— Word+Unigram
— Bigram+Unigram

Score(D,Q) = -KL(\theta_Q \,\|\, \theta_D)

“国企研发投资”

U: 国 / 企 / 研 / 发 / 投 / 资

B: 国企 / 企研 / 研发 / 发投 / 投资

W: 国企 / 研发 / 投资

WU: 国企 / 研发 / 投资 / 国 / 企 / 研 / 发 / 投 / 资

BU: 国企 / 企研 / 研发 / 发投 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
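A small sketch (our own, with a hand-segmented word list standing in for a real segmenter) that builds the five token streams above:

```python
def char_ngrams(text, n):
    chars = [c for c in text if not c.isspace()]
    return [''.join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def index_units(text, words=None):
    """Token streams for the five single-index schemes: U, B, W, WU, BU.
    `words` must come from a word segmenter (here given by hand)."""
    u, b = char_ngrams(text, 1), char_ngrams(text, 2)
    streams = {"U": u, "B": b, "BU": b + u}
    if words is not None:
        streams["W"] = list(words)
        streams["WU"] = list(words) + u
    return streams

print(index_units("国企研发投资", words=["国企", "研发", "投资"]))
```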

Page 14: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

3. Using different indexing and translation units

Multiple indexes
— Unigram
— Bigram
— Word

Score(D,Q) = \sum_i \lambda_i Score_i(D,Q)

[Figure: combining different scores — the query Q and document D are represented separately in each index (Q_1 with D_1, Q_2 with D_2); each pair is scored by KL(Q_1, D_1), KL(Q_2, D_2), and the scores are combined.]
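The score combination itself is a simple weighted sum; a sketch (the weights are illustrative, matching the 0.3B+0.7U mixture used later in the experiments):

```python
def combine_scores(scores, weights):
    """Score(D,Q) = sum_i lambda_i * Score_i(D,Q)."""
    return sum(lam * s for lam, s in zip(weights, scores))

score_b, score_u = -12.3, -10.8  # illustrative per-index KL scores
print(combine_scores([score_b, score_u], [0.3, 0.7]))  # 0.3B + 0.7U
```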

Page 15: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

3. Using different indexing and translation units

Using different translation units

[Figure: from a parallel corpus (e.g., “history and civilization” || “历史文明”), the English side is cut into words and the Chinese side into words, unigrams, bigrams, or bigrams&unigrams; GIZA++ training then yields a translation model for each unit type. For example, history / and / civilization || 历史 / 史文 / 文明 produces TM entries p(历史|history), p(史文|history), p(文明|history), …]

Page 16: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

3. Using different indexing and translation units

Using different translation units

[Figure: an English query is translated into Chinese bigram and unigram query models, which are matched against the bigram and unigram document indexes \theta_D^b and \theta_D^u:]

\theta_Q^b: P(b_i|\theta_Q^b) = \sum_j t(b_i|e_j) P(e_j|Q)

\theta_Q^u: P(u_i|\theta_Q^u) = \sum_j t(u_i|e_j) P(e_j|Q)

— Using the best translation and index unit
— Combine multiple index units in the same way as in monolingual IR
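Putting the pieces together for one document, under assumed toy models (all probability values are illustrative; real models come from the translation step above and from smoothed document indexes):

```python
import math

def kl_score(query_lm, doc_lm, eps=1e-12):
    """Cross-entropy core of -KL(theta_Q || theta_D):
    sum_w P(w|theta_Q) * log P(w|theta_D)."""
    return sum(p * math.log(doc_lm.get(w, eps)) for w, p in query_lm.items())

theta_q_b = {"历史": 0.4, "文明": 0.3}                      # translated bigram model
theta_q_u = {"历": 0.2, "史": 0.2, "文": 0.15, "明": 0.15}  # translated unigram model
doc_b = {"历史": 0.010, "文明": 0.008}                      # smoothed bigram index
doc_u = {"历": 0.020, "史": 0.020, "文": 0.015, "明": 0.010}

score = 0.3 * kl_score(theta_q_b, doc_b) + 0.7 * kl_score(theta_q_u, doc_u)
print(score)
```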

Page 17: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

3. Using different indexing and translation units

Combine bilingual dictionary and parallel corpus

Combining a translation model with a bilingual dictionary can greatly improve CLIR effectiveness (Nie&Simard'01)

Our experiments show that treating a bilingual dictionary as a parallel text and using the TM trained from it is much better than directly using the dictionary's entries.

As a dictionary is transformed into a TM, it can be combined with a parallel corpus by combining their TMs.

Score(D,Q) = \alpha \cdot Score_{MC}(D,Q) + (1-\alpha) \cdot Score_{MD}(D,Q)

where MC is the TM trained on the parallel corpus and MD the TM trained on the dictionary
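The interpolation itself is one line (α = 0.7 here only because the experiments below use Int(C,D) = 0.7*Corpus + 0.3*Dict):

```python
def interpolate(score_mc, score_md, alpha=0.7):
    """alpha * corpus-TM score + (1 - alpha) * dictionary-TM score."""
    return alpha * score_mc + (1 - alpha) * score_md
```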

Page 18: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

4. Experiments

Page 19: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

4. Experiments

Experiment Setting

     NTCIR3/4                                 NTCIR5/6
     Collections                    #doc(K)   Collections                        #doc(K)
Cn   CIRB011, CIRB020               381       CIRB040r                           901
Jp   Mainichi98/99, Yomiuri98+99    594       Mainichi00/01r, Yomiuri00+01       858
Kr   Chosunilbo98/99, Hankookilbo   254       Chosunilbo00/01, Hankookilbo00/01  220

                   NTCIR3   NTCIR4   NTCIR5   NTCIR6
Number of topics     50       60       50       50

Page 20: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR 4. Experiments

20

Chinese Linguistic Resource

The English-Chinese parallel corpus was mined from the Web by our automatic mining tool.
— From 6 websites: United Nations, Hong Kong, Taiwan, and Mainland China
— About 4,000 pairs of pages
— After sentence alignment, 281,000 parallel sentence pairs

English-Chinese bilingual dictionaries are from LDC. The final merged dictionary contains 42,000 entries.

Page 21: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

4. Experiments

Mean Average Precision (MAP) using different indexing units for C/J/K monolingual IR on NTCIR4/5:

Run        U              B              W              BU             WU             0.3B+0.7U
           Rigid  Relax   Rigid  Relax   Rigid  Relax   Rigid  Relax   Rigid  Relax   Rigid  Relax
C-C-T-N4   .1929  .2370   .1670  .2065   .1679  .2131   .1928  .2363   .1817  .2269   .1979  .2455
C-C-T-N5   .3302  .3589   .2713  .3300   .2676  .3315   .2974  .3554   .3017  .3537   .3300  .3766
J-J-T-N4   .2377  .2899   .2768  .3670   −      −       .2807  .3722   −      −       .2873  .3664
J-J-T-N5   .2376  .2730   .2471  .3273   −      −       .2705  .3458   −      −       .2900  .3495
K-K-T-N4   .2004  .2147   .3873  .4195   −      −       .4084  .4396   −      −       .3608  .3889
K-K-T-N5   .2603  .2777   .3699  .3996   −      −       .3865  .4178   −      −       .3800  .4001

U outperforms B and W for Chinese.

Interpolating bigrams and unigrams (0.3B+0.7U) gives the best performance for Chinese and Japanese.

However, BU and B are best for Korean.

Page 22: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

4. Experiments

The results of CJK monolingual IR on NTCIR6

Our submission used U+B for Chinese and Japanese; for Korean, BU for K-K-T and U for K-K-D. Our submitted results are below the average MAPs of NTCIR6; after applying a simple pseudo-relevance feedback (sketched below), the results become more comparable to the averages.

Run-id   RALI w/o pseudo feedback   RALI with pseudo feedback   Avg MAP of all NTCIR6 runs
         Rigid    Relax             Rigid    Relax              Rigid    Relax
C-C-T    .2139    .3022             .2330    .3303              .2269    .3141
C-C-D    .1671    .2376             .2031    .2907              .2354    .3294
J-J-T    .2426    .3171             .2576    .3343              .2707    .3427
J-J-D    .1877    .2485             .2292    .3052              .2480    .3214
K-K-T    .3332    .3939             .3460    .4130              .3833    .4644
K-K-D    .2623    .2970             .3287    .3945              .3892    .4678
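The slides do not specify the feedback method; one simple pseudo-relevance feedback in the LM framework (an assumption, not necessarily what was used) mixes the query model with a model estimated from the top-ranked documents:

```python
from collections import Counter

def prf_expand(query_lm, top_docs, n_terms=20, beta=0.5):
    """Mix the query model with a feedback model estimated from the
    term frequencies of the top-ranked documents (simple PRF sketch)."""
    fb_tf = Counter()
    for doc_tokens in top_docs:
        fb_tf.update(doc_tokens)
    top = fb_tf.most_common(n_terms)
    z = sum(c for _, c in top) or 1
    fb_lm = {w: c / z for w, c in top}
    new_lm = {w: (1 - beta) * p for w, p in query_lm.items()}
    for w, p in fb_lm.items():
        new_lm[w] = new_lm.get(w, 0.0) + beta * p
    return new_lm
```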

Page 23: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

4. Experiments

Interpolating parallel corpus and bilingual dictionary

MAP (Relax) for English-Chinese CLIR, by translation unit:

                      U       B       W       BU
E-C-T-N4   Corpus     .0962   .0804   .0703   .1066
           Dict.      .0862   .0548   .0712   .0870
           Int(C,D)   .1060   .1004   .0897   .1194
E-C-D-N4   Corpus     .0898   .0749   .0735   .0981
           Dict.      .0739   .0482   .0647   .0723
           Int(C,D)   .1021   .0897   .0893   .1076
E-C-T-N5   Corpus     .1686   .1306   .1305   .1852
           Dict.      .1187   .0694   .0898   .0910
           Int(C,D)   .1727   .1512   .1566   .1970
E-C-D-N5   Corpus     .1558   .1172   .0943   .1743
           Dict.      .1140   .0550   .1114   .0920
           Int(C,D)   .1792   .1369   .1492   .1844

Int(C,D) = 0.7*Corpus + 0.3*Dict

Page 24: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

4. Experiments

English to Chinese CLIR results on NTCIR 3/4/5

Two-step interpolation of the retrieval scores: Dict(B+U) + Corpus(B+U). Using bigrams and unigrams as translation units is a reasonable alternative to words; combinations of bigrams and unigrams usually produce higher effectiveness.

             U              B              W              BU             0.3B+0.7U
             Rigid  Relax   Rigid  Relax   Rigid  Relax   Rigid  Relax   Rigid  Relax
E-C-T-N3     .0928  .1106   .0805  .0985   .0898  .1080   .0938  .1102   .1021  .1170
E-C-D-N3     .0900  .1149   .1037  .1333   .1163  .1315   .1116  .1370   .1226  .1439
E-C-T-N4     .0935  .1060   .0872  .1004   .0746  .0897   .1042  .1194   .1018  .1180
E-C-D-N4     .0921  .1021   .0774  .0897   .0727  .0893   .0935  .1076   .1017  .1173
E-C-T-N5     .1533  .1727   .1245  .1512   .1317  .1566   .1632  .1970   .1655  .1916
E-C-D-N5     .1676  .1792   .1158  .1369   .1254  .1492   .1629  .1844   .1776  .1946

Page 25: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

5. Conclusion and future work

Page 26: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

5. Conclusion and Future Work

Conclusion

We focused on the comparison between words, n-grams, and their combinations, both for indexing and for query translation.

Our experimental results showed that n-grams are generally as effective as words for monolingual IR and CLIR in Chinese. For Japanese and Korean, our n-gram approaches are close to the average results of NTCIR6.

We tested an approach that creates different types of index separately, then combines them during retrieval. We found this approach slightly more effective for Chinese and Japanese.

Overall, n-grams can be an interesting alternative or complement to words as indexing and translation units.

Page 27: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

5. Conclusion and Future Work

Future work

We noticed that index units are query-sensitive: one indexing unit is better for some queries, while another is better for others.

— How to determine the correct combination?

Combining words and n-grams using semantic information

Page 28: Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

Thanks