Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR
Lixin Shi and Jian-Yun Nie
Dept. d'Informatique et de Recherche Opérationnelle Université de Montréal
1. Motivation
2. Related Work
3. Using Different Indexing and Translation Unit
4. Experiments
5. Conclusion and Future Work
1. Motivation
The difference between East-Asian and most European languages
A common problem in processing East-Asian languages (Chinese, Japanese) is the lack of natural word boundaries.
For information retrieval, we first have to determine the index unit:
- Using word segmentation
- Cutting the sentence into n-grams
Word segmentation
Based on rule-based and/or statistical approaches
Problems in information retrieval
- Segmentation ambiguity: if a document and a query are segmented into different words, they cannot be matched with each other. For example, “发展中国家” (developing country) can be segmented as:
  发展中 (developing) / 国家 (country)
  发展 (development) / 中 (middle) / 国家 (country)
  发展 (development) / 中国 (China) / 家 (family)
- Two different words may have the same or related meanings, especially when they share some common characters: 办公室 (office) ↔ 办公楼 (office building)
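The segmentation ambiguity above can be reproduced with a small sketch that enumerates every way to split a string into words from a lexicon. The function name `segmentations` and the toy lexicon are illustrative assumptions, not part of the authors' system.

```python
def segmentations(text, lexicon):
    """Return every split of `text` into words that all belong to `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Toy lexicon (assumption): just the words glossed on the slide.
lexicon = {"发展", "发展中", "中", "中国", "国家", "家"}
for seg in segmentations("发展中国家", lexicon):
    print(" / ".join(seg))
```

With this lexicon the function finds the three segmentations listed on the slide, so a query segmented one way can fail to match a document segmented another way.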
Cutting the sentence into n-grams
No need for any linguistic resource
The use of unigrams and bigrams has been investigated in several previous studies.
- As effective as using word segmentation
The limitations of previous studies
- Only used in monolingual IR
- It is not clear whether bigrams and characters can be used in CLIR
We focus on
Using words and n-grams as index units for monolingual IR within the LM framework.
Comparing words and n-grams as translation units in CLIR
- We only tested English-Chinese CLIR
2. Related work
Cross-Language IR
Translation between query and document languages
Basic approach: query translation
- MT system
- Bilingual dictionary
- Parallel corpus
Source-language query → retrieval on the parallel corpus → retrieved documents → target-language query → IR (Davis&Ogden’97, Yang et al’98)
Train a probabilistic translation model (TM) from a parallel corpus, then use the TM for CLIR (Nie et al’99, Gao et al’01,’02, Jin&Chai’05)
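LM approach to IR
Query-likelihood retrieval model: rank documents by the probability that document D generates query Q (Ponte&Croft’98, Croft’03)
$$P(Q \mid D) = \prod_i P(q_i \mid D)$$
KL-divergence model: relative entropy between the query language model and the document language model (Lafferty&Zhai’01,’02)
$$Score(D, Q) = -KL(\theta_Q \,\|\, \theta_D) = -\sum_{w \in V} P(w \mid \theta_Q) \log \frac{P(w \mid \theta_Q)}{P(w \mid \theta_D)}$$
The document model is smoothed with the collection model, where P(w|D) is the relative frequency of w in D and P(w|C) its relative frequency in the collection C:
$$P(w \mid \theta_D) = \lambda\, P(w \mid D) + (1 - \lambda)\, P(w \mid C)$$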
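As a minimal sketch of KL-divergence ranking with a document model linearly interpolated with the collection model, the code below scores two toy documents indexed by character unigrams. The function names, the value `lam=0.5`, the toy documents, and the choice to skip terms unseen in the whole collection are assumptions made for illustration.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram model: relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_score(query_tokens, doc_tokens, collection_model, lam=0.5):
    """Return -KL(theta_Q || theta_D) using a linearly smoothed document model."""
    query_model = mle(query_tokens)
    doc_ml = mle(doc_tokens)
    score = 0.0
    for w, p_q in query_model.items():
        # P(w|theta_D) = lam * P(w|D) + (1 - lam) * P(w|C)
        p_d = lam * doc_ml.get(w, 0.0) + (1 - lam) * collection_model.get(w, 0.0)
        if p_d == 0.0:
            continue  # assumption: terms unseen in the whole collection are skipped
        score -= p_q * math.log(p_q / p_d)
    return score


# Toy collection of two documents, indexed by character unigrams.
docs = {"d1": list("国企投资"), "d2": list("研发中心")}
collection_model = mle([tok for toks in docs.values() for tok in toks])
query = list("投资")
ranking = sorted(docs, key=lambda d: kl_score(query, docs[d], collection_model),
                 reverse=True)
print(ranking)  # ['d1', 'd2']
```

Because the query-only term of the KL divergence is the same for every document, ranking by this score is equivalent to ranking by the cross entropy between the query model and the smoothed document model.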
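LM approach to CLIR
Statistical translation model: a query is viewed as a distillation or translation of a document (Berger&Lafferty’99)
$$P(Q \mid D) = \prod_{q_i \in Q} \sum_{w} t(q_i \mid w)\, P(w \mid D)$$
KL-divergence model for CLIR (Kraaij et al’03): the query model P(w|θ_Q) is replaced by its translation into the document language
$$P(t_i \mid \theta_Q) = \sum_{s_j} P(s_j, t_i \mid \theta_Q) = \sum_{s_j} P(t_i \mid s_j, \theta_Q)\, P(s_j \mid \theta_Q) \approx \sum_{s_j} t(t_i \mid s_j)\, P(s_j \mid \theta_Q)$$
where t is a term in the document (target) language and s is a term in the query (source) language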
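The translated query model, the sum over source terms of t(t_i|s_j) P(s_j|θ_Q), can be sketched as a nested dictionary lookup. The function name `translate_query_model` and all probabilities below are made-up toy values, not figures from the trained TM.

```python
def translate_query_model(source_model, tm):
    """Map a source-language query model to the target language.

    source_model: {s: P(s | theta_Q)}
    tm:           {s: {t: t(t | s)}}  translation probabilities
    returns:      {t: sum_s t(t | s) * P(s | theta_Q)}
    """
    target_model = {}
    for s, p_s in source_model.items():
        for t, p_t_given_s in tm.get(s, {}).items():
            target_model[t] = target_model.get(t, 0.0) + p_t_given_s * p_s
    return target_model


# Toy English query model and toy translation model (assumed values).
source_model = {"history": 0.5, "civilization": 0.5}
tm = {
    "history": {"历史": 0.8, "史文": 0.2},
    "civilization": {"文明": 0.9, "史文": 0.1},
}
target_model = translate_query_model(source_model, tm)
for term, prob in sorted(target_model.items(), key=lambda kv: -kv[1]):
    print(term, round(prob, 3))
```

If each row of the TM sums to one and the source model sums to one, the translated model is again a probability distribution over target-language terms, so it can be plugged directly into the KL-divergence score.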
Chinese IR
Segment Chinese sentences into index units
- N-gram (Chien’95, Liang&Lee’96)
- Word (Kwok’97)
- Word + n-gram (Luk et al’02, Nie et al’96,’00)
For Chinese CLIR, words are used as the translation units
3. Using different indexing and translation units
Using different indexing units
Single index
- Unigram (U)
- Bigram (B)
- Word (W)
- Word+Unigram (WU)
- Bigram+Unigram (BU)
$$Score(D, Q) = -KL(\theta_Q \,\|\, \theta_D)$$
Example: “国企研发投资” (R&D investment of state-owned enterprises)
U: 国 / 企 / 研 / 发 / 投 / 资
B: 国企 / 企研 / 研发 / 发投 / 投资
W: 国企 / 研发 / 投资
WU: 国企 / 研发 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
BU: 国企 / 企研 / 研发 / 发投 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
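The five single-index schemes can be sketched in a few lines of Python. The helper names are illustrative, and the word list for W and WU is assumed to come from an external segmenter that is not shown here.

```python
def unigrams(text):
    """Split text into single characters."""
    return list(text)


def bigrams(text):
    """Split text into overlapping character bigrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def index_units(text, scheme, words=None):
    """Return the index units of `text` under scheme U, B, W, WU or BU.

    `words` is the output of an external word segmenter (assumption);
    it is required for the W and WU schemes.
    """
    if scheme in ("W", "WU") and words is None:
        raise ValueError("schemes W and WU need a segmented word list")
    if scheme == "U":
        return unigrams(text)
    if scheme == "B":
        return bigrams(text)
    if scheme == "W":
        return list(words)
    if scheme == "WU":
        return list(words) + unigrams(text)
    if scheme == "BU":
        return bigrams(text) + unigrams(text)
    raise ValueError("unknown scheme: " + scheme)


text = "国企研发投资"
words = ["国企", "研发", "投资"]  # segmentation taken from the slide
for scheme in ("U", "B", "W", "WU", "BU"):
    print(scheme + ":", " / ".join(index_units(text, scheme, words)))
```

Only W and WU depend on a linguistic resource; U, B and BU are derived from the raw character string alone.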
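Multiple indexes
- Unigram
- Bigram
- Word
Combining different scores: the query Q and the document D are each represented under several index units (Q1 and D1, Q2 and D2, ...). Each pair is scored separately, KL(Q1, D1), KL(Q2, D2), and the scores are combined with weights α_i:
$$Score(D, Q) = \sum_i \alpha_i\, Score_i(D, Q)$$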
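A minimal sketch of combining per-index scores by a weighted sum is given below. The weights 0.3 for bigrams and 0.7 for unigrams follow the "0.3B+0.7U" run label used in the experiments; the function names and the document scores are made-up illustrations.

```python
def combine_scores(scores, weights):
    """Weighted sum of the per-index scores of one document."""
    return sum(weights[name] * scores[name] for name in weights)


def rank(per_index_scores, weights):
    """per_index_scores: {doc_id: {index_name: score}} -> doc ids, best first."""
    return sorted(per_index_scores,
                  key=lambda d: combine_scores(per_index_scores[d], weights),
                  reverse=True)


weights = {"B": 0.3, "U": 0.7}
per_index_scores = {
    "d1": {"B": -2.0, "U": -1.0},  # toy scores from the bigram and unigram indexes
    "d2": {"B": -1.0, "U": -2.0},
}
print(rank(per_index_scores, weights))  # ['d1', 'd2']
```

Each index is built and queried independently, so the weights can be tuned without re-indexing the collection.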
Using different translation units
Parallel corpus: “history and civilization” || “历史文明” …
English side: words
Chinese side: words, unigrams, bigrams, or bigrams and unigrams
Example with Chinese bigrams: … history / and / civilization || 历史 / 史文 / 文明 …
GIZA++ training
Resulting TM: p(历史|history), p(史文|history), p(文明|history), …
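Using different translation units
English Query → Chinese Documents
The English query Q is translated into a Chinese unigram query model Q_u and a Chinese bigram query model Q_b, which are matched against the unigram index D_u and the bigram index D_b of document D:
$$Q_u:\ \sum_j t(u_i \mid e_j)\, P(e_j \mid \theta_Q)$$
$$Q_b:\ \sum_j t(b_i \mid e_j)\, P(e_j \mid \theta_Q)$$
Using the best translation and index unit
Combine multiple index units in the same way as in monolingual IR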
Combine bilingual dictionary and parallel corpus
Combining a translation model with a bilingual dictionary can greatly improve CLIR effectiveness (Nie&Simard’01)
Our experiments show that treating a bilingual dictionary as a parallel text and training a TM on it works much better than using the dictionary entries directly.
Once the dictionary is transformed into a TM, it can be combined with the parallel corpus by combining their TMs:
$$Score_{MC+MD}(D, Q) = \lambda\, Score_{MC}(D, Q) + (1 - \lambda)\, Score_{MD}(D, Q)$$
where MC and MD denote the translation models trained on the parallel corpus and on the dictionary.
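One way to treat a bilingual dictionary as a parallel text is to turn every (entry, translation) pair into a one-line "sentence pair" that a TM trainer can consume. The slides do not specify the authors' exact procedure, so the function name, the entry format, and the optional tokenization of the Chinese side below are all assumptions.

```python
def bigrams(text):
    """Overlapping character bigrams; a single character is kept as is."""
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def dictionary_to_parallel(entries, tokenize=lambda s: [s]):
    """Turn dictionary entries into aligned source and target lines.

    entries:  iterable of (english_word, [chinese_translations])  (assumed format)
    tokenize: splits a Chinese translation into translation units
    returns:  (source_lines, target_lines), aligned line by line
    """
    source_lines, target_lines = [], []
    for english, translations in entries:
        for chinese in translations:
            source_lines.append(english)
            target_lines.append(" ".join(tokenize(chinese)))
    return source_lines, target_lines


entries = [("office", ["办公室", "办公楼"])]
src, tgt = dictionary_to_parallel(entries, tokenize=bigrams)
for s, t in zip(src, tgt):
    print(s, "||", t)
```

The resulting pairs can then be trained in the same way as the mined parallel sentences, giving a dictionary TM that is interpolated with the corpus TM as in the formula above.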
4. Experiments
Experiment Setting
      NTCIR3/4                                            NTCIR5/6
      Collections                     #doc (thousands)    Collections                          #doc (thousands)
Cn    CIRB011, CIRB020                381                 CIRB040r                             901
Jp    Mainichi98/99, Yomiuri98+99     594                 Mainichi00/01r, Yomiuri00+01         858
Kr    Chosunilbo98/99, Hankookilbo    254                 Chosunilbo00/01, Hankookilbo00/01    220

                    NTCIR3   NTCIR4   NTCIR5   NTCIR6
Number of topics    50       60       50       50
Chinese Linguistic Resource
The English-Chinese parallel corpus was mined from the Web by our automatic mining tool.
- From 6 websites: United Nations, Hong Kong, Taiwan, and Mainland China
- About 4,000 pairs of pages
- After sentence alignment, we have 281,000 parallel sentence pairs
The English-Chinese bilingual dictionaries are from LDC. The final merged dictionary contains 42,000 entries.
Using different indexing units for C/J/K monolingual IR on NTCIR4/5
Mean Average Precision (MAP)
Run        U               B               W               BU              WU              0.3B+0.7U
           Rigid   Relax   Rigid   Relax   Rigid   Relax   Rigid   Relax   Rigid   Relax   Rigid   Relax
C-C-T-N4   .1929   .2370   .1670   .2065   .1679   .2131   .1928   .2363   .1817   .2269   .1979   .2455
C-C-T-N5   .3302   .3589   .2713   .3300   .2676   .3315   .2974   .3554   .3017   .3537   .3300   .3766
J-J-T-N4   .2377   .2899   .2768   .3670   -       -       .2807   .3722   -       -       .2873   .3664
J-J-T-N5   .2376   .2730   .2471   .3273   -       -       .2705   .3458   -       -       .2900   .3495
K-K-T-N4   .2004   .2147   .3873   .4195   -       -       .4084   .4396   -       -       .3608   .3889
K-K-T-N5   .2603   .2777   .3699   .3996   -       -       .3865   .4178   -       -       .3800   .4001
U > B and W in Chinese
Interpolating unigram and bigram (B+U) has the best performance for Chinese and Japanese.
However, BU and B are best for Korean.
The results of CJK monolingual IR on NTCIR6
Our submission: Chinese and Japanese: U+B; Korean: K-K-T: BU, K-K-D: U
Our submitted results are lower than the average MAPs of NTCIR6. After applying a simple pseudo-relevance feedback, the results become more comparable to the average MAPs.
Run-id   RALI without pseudo feedback   RALI with pseudo feedback   Average MAP of all NTCIR6 runs
         Rigid   Relax                  Rigid   Relax               Rigid   Relax
C-C-T    .2139   .3022                  .2330   .3303               .2269   .3141
C-C-D    .1671   .2376                  .2031   .2907               .2354   .3294
J-J-T    .2426   .3171                  .2576   .3343               .2707   .3427
J-J-D    .1877   .2485                  .2292   .3052               .2480   .3214
K-K-T    .3332   .3939                  .3460   .4130               .3833   .4644
K-K-D    .2623   .2970                  .3287   .3945               .3892   .4678
Interpolating parallel corpus and bilingual dictionary
Run                   U       B       W       BU
                      Relax   Relax   Relax   Relax
E-C-T-N4   Corpus     .0962   .0804   .0703   .1066
           Dict.      .0862   .0548   .0712   .0870
           Int(C,D)   .1060   .1004   .0897   .1194
E-C-D-N4   Corpus     .0898   .0749   .0735   .0981
           Dict.      .0739   .0482   .0647   .0723
           Int(C,D)   .1021   .0897   .0893   .1076
E-C-T-N5   Corpus     .1686   .1306   .1305   .1852
           Dict.      .1187   .0694   .0898   .0910
           Int(C,D)   .1727   .1512   .1566   .1970
E-C-D-N5   Corpus     .1558   .1172   .0943   .1743
           Dict.      .1140   .0550   .1114   .0920
           Int(C,D)   .1792   .1369   .1492   .1844
Int(C,D) = 0.7*Corpus + 0.3*Dict
English to Chinese CLIR result on NTCIR 3/4/5
Two-step interpolation of the retrieval scores: Dict(B+U) + Corpus(B+U)
Run        U               B               W               BU              0.3B+0.7U
           Rigid   Relax   Rigid   Relax   Rigid   Relax   Rigid   Relax   Rigid   Relax
E-C-T-N3   .0928   .1106   .0805   .0985   .0898   .1080   .0938   .1102   .1021   .1170
E-C-D-N3   .0900   .1149   .1037   .1333   .1163   .1315   .1116   .1370   .1226   .1439
E-C-T-N4   .0935   .1060   .0872   .1004   .0746   .0897   .1042   .1194   .1018   .1180
E-C-D-N4   .0921   .1021   .0774   .0897   .0727   .0893   .0935   .1076   .1017   .1173
E-C-T-N5   .1533   .1727   .1245   .1512   .1317   .1566   .1632   .1970   .1655   .1916
E-C-D-N5   .1676   .1792   .1158   .1369   .1254   .1492   .1629   .1844   .1776   .1946
Using bigrams and unigrams as translation units is a reasonable alternative to words.
Combinations of bigrams and unigrams usually produce higher effectiveness.
5. Conclusion and future work
Conclusion
We focused on the comparison between words, n-grams, and their combinations, both for indexing and for query translation.
Our experimental results showed that n-grams are generally as effective as words for monolingual IR and CLIR in Chinese. For Japanese and Korean, the n-gram approaches are close to the average results of NTCIR6.
We tested an approach that creates different types of indexes separately, then combines them during the retrieval process. We found this approach slightly more effective for Chinese and Japanese.
Overall, n-grams can be an interesting alternative or complement to words as indexing and translation units.
Future work
We noticed that the best index unit depends on the query: one index unit is better for some queries, while another is better for others.
- How to determine the correct combination?
Combine words and n-grams using semantic information
Thanks