Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval
Wen-Hsiang Lu (盧文祥 )
Department of Computer Science and Information Engineering,
National Cheng Kung University
2004/11/24
References:
• Wen-Hsiang Lu (advisors: Lee-Feng Chien and Hsi-Jian Lee) (2003). Term Translation Extraction Using Web Mining Techniques. PhD thesis, Department of Computer Science and Information Engineering, National Chiao Tung University.
Outline
I. Background & Research Problems
II. Anchor Text Mining for Term Translation Extraction
III. Transitive Translation for Multilingual Translation
IV. Web Mining for Cross-Language Information Retrieval and Web Search Applications
Part I: Background & Research Problems
Motivation
• Demands on multilingual translation lexicons
  – Machine translation (MT)
  – Cross-language information retrieval (CLIR)
  – Information exchange in electronic commerce (EC)
• Web mining
  – Explore multilingual and wide-scoped hypertext resources on the Web
Research Problems
• Difficulties in automatic construction of multilingual translation lexicons
  – Techniques: parallel/comparable corpora
  – Bottlenecks: lack of diverse/multilingual resources
• Difficulties in query translation for cross-language information retrieval (CLIR) [Fig1]
  – Techniques: bilingual dictionary / machine translation / parallel corpora
  – Bottlenecks: multiple-sense/short/diverse/unknown queries [Fig2]
Cross-Language Information Retrieval
[Figure: CLIR pipeline – Source Query → Query Translation → Target Translation → Information Retrieval → Target Documents]
• Query in source language and retrieve relevant documents in target languages
• E.g., "Hussein" has multiple Chinese transliterations: 海珊 / 侯賽因 / 哈珊 / 胡笙 (Traditional Chinese); 侯赛因 / 海珊 / 哈珊 (Simplified Chinese)
Difficulties in Query Translation Using Machine Translation Systems
• English source query: National Palace Museum
• Chinese translation: 全國宮殿博物館 (a word-by-word literal translation, not the museum's established Chinese name)
Research Paradigm
[Figure: overview of the research paradigm]
• Web Mining (new approach)
  – Anchor-Text Mining: multilingual anchor texts & hyperlink structure on the Internet
  – Search-Result Mining: language-mixed texts in search-result pages
• Term-Translation Extraction → Live Translation Lexicon
• Applications: Cross-Language Information Retrieval, Cross-Language Web Search
Research Results
• Anchor text mining for term translation extraction
  – ACM SIGIR'01 (poster), IEEE ICDM'01, ACM Trans. on Asian Language Information Processing 2002
  – Reviewers' encouraging comments:
• “… the approach seems to be quite novel. To my knowledge, there has not been a proposal of uses of anchor texts like this one.”
• Transitive translation for multilingual translation
  – COLING'02, ACM Trans. on Information Systems (first paper from Taiwan since 1986), ACL'04
  – Reviewers' encouraging comments:
• “This is a nicely written, technically sound paper that pursues a clever and original idea …”
• “… the idea of using anchor texts from the Web to learn cross-lingual information retrieval algorithms is very good …”
• “I enjoyed the paper and thought the underlying work was interesting and valuable …”
Research Results (cont.)
• Web mining for cross-language Web search
  – ROCLING'03, ACM SIGIR'04
  – Improved mean average precision from 0.207 (dictionary-based) to 0.241 on the NTCIR-2 Chinese-English CLIR evaluation task
  – Reviewers' encouraging comments:
• “It gives us insight into the value of the Web as a dynamic information source. Although the experiments are restricted to Chinese-English documents, also developers for other languages may find this work stimulating.”
• “The idea is interesting, and is relatively new. It may give inspiration to other researchers working in the same area.”
• LiveTrans: experimental CLWS system [LiveTrans]
  – http://livetrans.iis.sinica.edu.tw/lt.html
  – Mirror: http://wmmks.csie.ncku.edu.tw/lt.html
• System functions
  – Query-translation suggestion
  – Retrieval of Web pages and images
  – Multilingual search: English, traditional Chinese, simplified Chinese, Japanese, or Korean
  – Gloss translation for retrieved page titles
  – Fusion of retrieval results
LiveTrans: Cross-Language Web Search System
• Summary of contributions
  – Present an innovative approach
    • Significantly reduces the difficulty of unknown-term translation
    • Improves CLIR, especially for short queries
  – Develop a practical cross-language Web search engine
    • Works without relying on a translation dictionary
    • Builds a live dictionary with a significant number of multilingual term translations
  – Present a new problem for further investigation in Web mining
Research Results (cont.)
Related Research
• Automatic extraction of multilingual translations
  – Statistical translation model (Brown 1993)
  – Parallel corpus (Melamed 2000; Wu & Chang 2003)
  – Non-parallel/comparable corpus (Fung 1998; Rapp 1999)
  – Web mining
    • Parallel corpus collection (Nie 1999; Resnik 1999)
    • Comparable corpus collection: anchor texts and search-result pages (Lu et al. 2002, 2003)
    • Strength: huge amounts of Web data with link structure
Related Research (cont.)
• Query translation for cross-language information retrieval
  – Dictionary-/MT-based approach (Ballesteros & Croft 1997; Hull & Grefenstette 1996)
  – Corpus-based approach (Dumais 1997; Nie 1999)
  – Combined approach (Chen & Bian 1999; Kwok 2001)
  – Improving techniques
    • Query expansion and phrase translation (Ballesteros & Croft 1997)
    • Translation disambiguation (Ballesteros & Croft 1998; Chen & Bian 1999)
    • Proper name transliteration (Chen et al. 1998; Lin & Chang 2003)
    • Probabilistic retrieval/language models (Hiemstra & de Jong 1999; Lavrenko 2002)
    • Unknown query translation (Lu et al. 2002, 2003)
Related Research (cont.)
• Cross-language Web search (CLWS)
  – Practical CLWS services have not lived up to expectations
    • Keizai (Ogden et al. 1999): English query / Japanese, Korean Web news
    • MTIR (Bian & Chen 1999): Chinese query / English pages / translation
    • MuST: Multilingual Summarization and Translation (Hovy & Lin 1998): English/Indonesian/Spanish/Arabic/Japanese, Web news summarization or translation
    • TITAN (Hayashi et al. 1997): English-Japanese retrieval / translated page titles
• Challenges: Web queries are often
  – Short: 2-3 words (Silverstein et al. 1998)
  – Diverse: wide-scoped topics
  – Unknown (out of vocabulary): 74% are unavailable in the CEDICT Chinese-English electronic dictionary (23,948 entries)
    • E.g., proper names: 愛因斯坦 (Einstein), 海珊 (Hussein); new terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (nosocomial infection)
Part II: Anchor Text Mining for Term Translation Extraction
Anchor-Text Set
• Anchor text (link text)
  – The descriptive text of a link on a Web page
• Anchor-text set
  – The set of anchor texts pointing to the same page (URL)
  – Often contains multilingual translations, e.g., Yahoo / 雅虎 / 야후; America / 美国 / アメリカ
• Anchor-text-set corpus
  – A collection of anchor-text sets
[Figure: anchor texts from pages in Taiwan, China, Japan, and Korea – "Yahoo Search Engine", 美国雅虎 (Yahoo! America), 雅虎搜尋引擎 (Yahoo search engine), "Yahoo! America", 야후-USA, アメリカの Yahoo! – all linking to http://www.yahoo.com]
Processing of Term Translation Extraction
[Figure: processing pipeline]
1. Web spider: collect Web pages from the Internet.
2. Anchor-text extraction: build up the anchor-text-set corpus.
3. Term extraction: extract key terms as translation candidates.
4. Term similarity estimation: given a source query term, compute the similarity of each candidate using a probabilistic inference model, and output the target translations into the translation lexicon.
Example of Term Translation Extraction
[Figure: Chinese-English anchor-text-set corpus]
• Set u1: anchor texts pointing to www.yahoo.com (#in-links = 187), e.g., "Yahoo - in USA", 雅虎, 搜尋引擎 (search engine)
• Set u2: anchor texts pointing to www.yahoo.com.tw (#in-links = 21), e.g., "Taiwan - Yahoo", 台灣 (Taiwan), 雅虎
• s: source query term (e.g., "Yahoo"); t: target translation (e.g., 雅虎)
• Term translation extraction exploits both co-occurrence within anchor-text sets and page authority (in-link counts).
Probabilistic Inference Model
• Conventional (asymmetric) translation model:

  P(t \mid s) = \frac{P(s \cap t)}{P(s)} \approx \frac{\sum_{i=1}^{n} P(s \mid u_i)\, P(t \mid u_i)\, P(u_i)}{\sum_{i=1}^{n} P(s \mid u_i)\, P(u_i)}

• Symmetric model:

  S(s,t) = \frac{P(s \cap t)}{P(s \cup t)} \approx \frac{\sum_{i=1}^{n} P(s \mid u_i)\, P(t \mid u_i)\, P(u_i)}{\sum_{i=1}^{n} \left[ P(s \mid u_i) + P(t \mid u_i) - P(s \mid u_i)\, P(t \mid u_i) \right] P(u_i)}

  where u_i is the i-th anchor-text set (URL) and the product P(s|u_i) P(t|u_i) models co-occurrence within a set.
• Link information (page authority):

  P(u_i) = \frac{L(u_i)}{\sum_{j=1}^{n} L(u_j)}, \quad L(u_j) = \text{the number of } u_j\text{'s in-links}

  Without link information, P(u_i) is uniform over the anchor-text sets.
Experimental Environment
• Anchor-text-set corpora
  – 109,416 traditional-Chinese-English sets (from 1,980,816 pages)
  – 157,786 simplified-Chinese-English sets (from 2,179,171 pages)
• Test query set
  – Query logs:
    • Dreamer log: 228,566 unique query terms
    • GAIS log: 114,182 unique query terms
  – Core terms: 9,709 most popular query terms with frequency > 10 in the two logs
  – Test set: 622 English terms selected from the core terms
• Average top-n inclusion rate (ATIR):

  ATIR_n = (number of correct translations within the first n extracted translations) / (total number of test queries)
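The ATIR metric can be computed as below. This is an illustrative sketch; the example rankings and gold translations are invented, not taken from the experiments.

```python
def top_n_inclusion_rate(results, gold, n):
    """Fraction of queries whose correct translation appears in the top-n list."""
    hits = sum(1 for q, ranked in results.items()
               if any(t in gold[q] for t in ranked[:n]))
    return hits / len(results)

# Hypothetical extraction output (ranked) and gold translations.
results = {"yahoo": ["雅虎", "台灣"], "nike": ["運動", "耐吉"]}
gold = {"yahoo": {"雅虎"}, "nike": {"耐吉"}}
print(top_n_inclusion_rate(results, gold, 1))  # 0.5: only "yahoo" hits at top-1
print(top_n_inclusion_rate(results, gold, 2))  # 1.0: both hit within top-2
```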
Performance with Different Estimation Models
• Using different models
  – MA: Asymmetric model
– MAL: Asymmetric model with link information
– MS: Symmetric model
– MSL: Symmetric model with link information
• The symmetric inference model with link information was useful to improve the translation accuracy.
Type of model Top-1 Top-10
MA 41% 81%
MAL 44% 83%
MS 51% 84%
MSL 53% 85%
Performance with Different Term Extraction Methods and Query-Log-Set Sizes
• The query-log-based term extraction method achieved the best Top-1 performance.
• The medium-sized query-log set achieved the best Top-1 performance.
Type of term extraction Top-1 Top-10
PAT-tree-based 49% 94%
Query-log-based 53% 85%
Tagger-based 49% 94%
Size of query-log set Top-1 Top-10
#9,709 53% 85%
#19,124 57% 91%
#228,566 53% 94%
Performance Comparison
• Example: test term "sakura"
  – Query-log set (9,709 terms), top 5 extracted translations: 台灣櫻花, 櫻花, 蜘蛛網, 純愛, 螢幕保護
  – Query-log set (228,566 terms), top 10 extracted translations: 庫洛魔法使, 櫻花建設, 模仿, 櫻花大戰, 美夕, 台灣櫻花, 櫻花, 蜘蛛網, 純愛, 螢幕保護
• Test results of 9,709 core terms [TTE9709]

Source term (English) | Traditional Chinese | Simplified Chinese
Yahoo                 | 雅虎                | 雅虎
Nike                  | 耐吉                | 耐克
Ericsson              | 易利信              | 爱立信
Stanford              | 史丹佛              | 斯坦福
Sydney                | 雪梨                | 悉尼
Star Wars             | 星際大戰            | 星球大战
internet              | 網際網路            | 互联网

• Promising results
Part III: Transitive Translation for Multilingual Translation
Transitive Translation for Multilingual Translation
• Problem
  – Insufficient anchor-text-set corpora for certain language pairs, e.g., Chinese-Japanese, Chinese-French
• Goal
  – A generalized model for multilingual translation
• Idea
  – Transitive translation model: extract translations via an intermediate (third) language, e.g., English (Borin 2000; Gollins & Sanderson 2001)
  – To reduce interference errors, integrate a competitive linking algorithm
Transitive Translation: Combining Direct and Indirect Translation
• Notation: s: source term; t: target translation; m: intermediate translation
  – E.g., s = 新力 (Traditional Chinese), m = Sony (English), t = ソニー (Japanese)
• Direct translation model:

  S_{direct}(s, t) = P(s, t), the probabilistic inference model

• Indirect translation model (via the intermediate language):

  S_{indirect}(s, t) = \sum_{m} \frac{P(s, m)\, P(m, t)}{P(m)}

  where P(m) is the occurrence probability of m in the corpus.
• Transitive translation model:

  S_{trans}(s, t) = \begin{cases} S_{direct}(s, t), & \text{if } S_{direct}(s, t) > \theta \\ S_{indirect}(s, t), & \text{otherwise} \end{cases}

  where θ is a predefined threshold value.
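The transitive model above can be sketched as a small fallback routine: use the direct score when it is reliable, otherwise sum over intermediate-language bridges. This is an illustrative sketch; the probability tables and threshold below are invented toy values, not corpus estimates.

```python
def s_indirect(s, t, s_to_mid, mid_to_t, p_mid):
    """S_indirect(s,t) = sum_m P(s,m) * P(m,t) / P(m)."""
    return sum(s_to_mid.get(m, 0.0) * mid_to_t.get(m, {}).get(t, 0.0) / p
               for m, p in p_mid.items() if p > 0)

def s_transitive(s, t, s_direct, s_to_mid, mid_to_t, p_mid, theta=0.1):
    """Fall back to indirect translation when the direct score is below theta."""
    direct = s_direct.get((s, t), 0.0)
    return direct if direct > theta else s_indirect(s, t, s_to_mid, mid_to_t, p_mid)

# Toy example mirroring the slide: 新力 -> (Sony) -> ソニー.
s_direct = {("新力", "ソニー"): 0.02}    # sparse Chinese-Japanese corpus
s_to_mid = {"Sony": 0.3}                 # P(新力, Sony)
mid_to_t = {"Sony": {"ソニー": 0.4}}     # P(Sony, ソニー)
p_mid = {"Sony": 0.5}                    # P(Sony) in the corpus
print(s_transitive("新力", "ソニー", s_direct, s_to_mid, mid_to_t, p_mid))  # 0.24
```

With the direct score (0.02) under the threshold, the indirect bridge through "Sony" supplies the translation evidence instead.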
Promising Results for Automatic Construction of Multilingual Translation Lexicons

Source term (Traditional Chinese) | English     | Simplified Chinese | Japanese
新力                              | Sony        | 索尼               | ソニー
耐吉                              | Nike        | 耐克               | ナイキ
史丹佛                            | Stanford    | 斯坦福             | スタンフォード
雪梨                              | Sydney      | 悉尼               | シドニー
網際網路                          | internet    | 互联网             | インターネット
網路                              | network     | 网络               | ネットワーク
首頁                              | homepage    | 主页               | ホームページ
電腦                              | computer    | 计算机             | コンピューター
資料庫                            | database    | 数据库             | データベース
資訊                              | information | 信息               | インフォメーション
Indirect Association Problem
• Indirect association error (Melamed 2000)
  – A wrong candidate t1 may co-occur with s more often than the correct translation t does
  – E.g., for s = 思科 (Cisco), the candidate "system" (similarity 0.11) outscores the correct translation "Cisco" (0.07)
0.11
Competitive Linking Algorithm
• Concepts of the competitive linking (CL) algorithm (Melamed 2000)
  – Determine the most probable translation pairs between the source and target sets
  – Assumption: each term has only one translation
  – Method:
    • Greedily select the most probable edges
    • Select less probable edges only when they do not conflict with previous selections
• Integration of anchor-text mining and the CL algorithm
  1. Build a bipartite graph using the proposed translation model.
  2. Use the extended CL algorithm to filter out indirect association errors.
Bipartite Graph Construction
[Figure: in Step 1, source terms (e.g., 思科) and candidate target terms (e.g., Cisco, system, 系統, 資訊, 網路, 電腦) form a bipartite graph G = (S ∪ T, E); in Step 2, edges are weighted by the proposed translation model.]
Extended Competitive Linking Algorithm
• Pick the k most probable translations for a source term
[Figure: edges between 思科 and its candidates (Cisco: 0.07, system: 0.11) compete with edges from other source terms (e.g., 系統-system: 0.23); selecting the strongest edges first removes "system" from 思科's candidates.]
Direct_Translation_with_CL (s, U, Vt)
Input: source term s; Web pages of concern U; translation vocabulary set Vt
Output: target translation set R
1. Construct the bipartite graph G = (S ∪ T, E); compute each edge weight w_ij; R = ∅.
2. Sort the weights w_ij and choose the edge e_i*j* with the highest weight.
3. If s_i* = s:
     R = R ∪ {t_j*}; if |R| = k, return R;
     remove all edges linking to t_j* and re-estimate w_ij for the remaining edges.
   Otherwise:
     remove all edges linking to s_i* or t_j* and re-estimate w_ij for the remaining edges.
4. If |E| = 0, return R; otherwise go to step 2.
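The greedy loop of the flowchart can be sketched as follows. This is a simplified illustration, not the thesis code: edge re-estimation is omitted (weights are kept as-is), and the edge weights echo the 思科/Cisco/system example.

```python
def direct_translation_with_cl(s, edges, k):
    """Greedy competitive linking. edges: {(source, target): weight}.
    Returns up to k target translations for source term s."""
    edges = dict(edges)
    r = []
    while edges and len(r) < k:
        (si, tj), _ = max(edges.items(), key=lambda kv: kv[1])
        if si == s:
            r.append(tj)
            # t_j* is consumed; s keeps competing for further translations
            edges = {e: w for e, w in edges.items() if e[1] != tj}
        else:
            # (s_i*, t_j*) win each other, so t_j* can no longer be linked
            # to s: this is what filters out indirect associations
            edges = {e: w for e, w in edges.items() if e[0] != si and e[1] != tj}
    return r

edges = {("思科", "Cisco"): 0.07, ("思科", "system"): 0.11, ("系統", "system"): 0.23}
print(direct_translation_with_cl("思科", edges, k=1))  # ['Cisco']
```

Without competitive linking, "system" (0.11) would beat "Cisco" (0.07); because the stronger edge 系統-system (0.23) claims "system" first, the correct translation survives.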
Performance of Proposed Models with CL Algorithm
Model Top-1 Top-2 Top-3 Top-4 Top-5
Direct + CL 38.0% 43.8% 47.3% 49.6% 51.2%
Indirect + CL (k=1) 48.0% 57.0% 59.4% 60.1% 60.9%
Indirect + CL (k=3) 48.7% 58.1% 60.8% 62.0% 63.1%
Transitive + CL (k=1) 52.7% 60.1% 62.5% 63.1% 63.9%
Transitive + CL (k=3) 52.7% 61.6% 63.9% 64.3% 65.1%
Model Top-1 Top-2 Top-3 Top-4 Top-5
Direct 35.7% 43.0% 46.9% 49.6% 51.2%
Indirect (k=1) 44.2% 55.1% 58.0% 59.7% 60.5%
Indirect (k=3) 46.5% 57.0% 60.4% 62.0% 62.8%
Transitive (k=1) 49.2% 58.1% 60.9% 61.6% 62.0%
Transitive (k=3) 50.0% 60.1% 62.8% 63.9% 64.3%
• Test query set: 258 terms (from the 9,709 core terms)
• Anchor-text-set corpora
  – Traditional Chinese-Simplified Chinese: 4,516 sets
  – Traditional Chinese-English: 109,416 sets
  – Simplified Chinese-English: 157,786 sets
• Source/target/intermediate languages: Traditional Chinese / Simplified Chinese / English
Effective Translation Using the CL Algorithm

Source term (Traditional Chinese): 藍鳥 (Bluebird)
  – Direct: not available
  – Transitive: 视点 (focus), 电影 (movie), 蓝鸟 (Bluebird)*, 试点 (test point), 快车 (express)
  – Transitive with CL: 蓝鸟 (Bluebird)*, 视点 (focus), 电影 (movie), 试点 (test point), 快车 (express)

Source term (Traditional Chinese): 迪士尼 (Disney)
  – Direct: 乐园 (amusement park), 迪士尼 (Disney)*, 狮子王 (Lion King), 狄斯尼 (Disney)*, 世界 (world)
  – Transitive: 乐园 (amusement park), 迪士尼 (Disney)*, 狮子王 (Lion King), 狄斯尼 (Disney)*, 世界 (world)
  – Transitive with CL: 迪士尼 (Disney)*, 乐园 (amusement park), 狄斯尼 (Disney)*, 世界 (world), 动画 (anime)

(* correct translation)
Part IV: Web Mining for Cross-Language Information Retrieval and Web Search Applications

• Goal: Web mining to benefit CLIR and CLWS
  – Mining query translations from the Web
• Idea: integrated Web mining approach
  – Anchor-text-mining approach
    • Probabilistic inference model
    • Transitive translation model
  – Search-result-mining approach
    • Chi-square test
    • Context-vector analysis
Search-Result-Mining Approach
• Goal: enhance translation coverage for diverse queries
• Idea
  – Comparable corpus: language-mixed texts in search-result pages
  – Utilize co-occurrence relations and context information
    • Chi-square test
    • Context-vector analysis
• Procedure of query translation based on search-result mining
  1. Corpus collection: collect m search results from search engines.
  2. Translation candidate extraction: segment the collected corpus and extract the k most frequent target terms as candidates.
  3. Translation selection: compute similarity based on the chi-square test or context-vector analysis.
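Step 2 (candidate extraction) can be sketched on plain snippet strings. This is an illustrative simplification: real processing would segment Chinese text properly, whereas here contiguous Han-character runs stand in for segmented terms, and the snippets are invented.

```python
from collections import Counter
import re

def extract_candidates(snippets, k):
    """Return the k most frequent target-language (here, CJK) terms
    found in language-mixed search-result snippets."""
    cjk = re.compile(r"[\u4e00-\u9fff]+")   # contiguous Han-character runs
    counts = Counter(t for s in snippets for t in cjk.findall(s))
    return [term for term, _ in counts.most_common(k)]

snippets = [
    "Yahoo 雅虎 search portal",
    "雅虎 Yahoo Taiwan 台灣",
    "Yahoo news 雅虎",
]
print(extract_candidates(snippets, 2))  # ['雅虎', '台灣']
```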
Chi-Square Test
• Idea
  – Make good use of all co-occurrence relations between the source and target terms
• 2-way contingency table:

        t    ~t
  s     a    b
  ~s    c    d

  a: # of pages containing both terms s and t
  b: # of pages containing term s but not t
  c: # of pages containing term t but not s
  d: # of pages containing neither term s nor t
  N: the total number of pages, i.e., N = a + b + c + d

• Similarity measure (Gale & Church 1991):

  S_{\chi^2}(s, t) = \frac{N (ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}

  which is the χ² statistic \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} computed over the table, with expected counts estimated from the marginals (e.g., E(s, t) = N \cdot P(s) \cdot P(t)).
Context-Vector Analysis
• Idea
  – Take co-occurring context terms as feature vectors of the source/target terms: s: w_{s1}, w_{s2}, …, w_{sm}; t: w_{t1}, w_{t2}, …, w_{tm}
• Weighting scheme (TF*IDF):

  w_{t_i} = \max_j f(t_i, d_j) \times \log\left(\frac{N}{n_{t_i}}\right)

  where f(t_i, d_j) is the frequency of t_i in search-result page d_j, N is the total number of Web pages, and n_{t_i} is the number of pages containing t_i.
• Similarity measure (cosine):

  S_{CV}(s, t) = \frac{\sum_{i=1}^{m} w_{s_i} w_{t_i}}{\sqrt{\sum_{i=1}^{m} w_{s_i}^2 \times \sum_{i=1}^{m} w_{t_i}^2}}
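The vector construction and cosine comparison can be sketched as below. This is a toy illustration: the snippets, vocabulary, document frequencies, and page total are all invented, and whitespace/substring counting stands in for proper term extraction.

```python
import math

def context_vector(term, snippets, vocab, total_pages, df):
    """w_i = max_j f(t_i, d_j) * log(N / n_{t_i}) over this term's snippets."""
    return [max((d.count(w) for d in snippets), default=0)
            * math.log(total_pages / df[w]) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["search", "portal", "weather"]
df = {"search": 100, "portal": 50, "weather": 200}   # document frequencies n_{t_i}
N = 1000                                             # assumed total page count
s_vec = context_vector("yahoo", ["yahoo search portal", "yahoo portal"], vocab, N, df)
t_vec = context_vector("雅虎", ["雅虎 search portal"], vocab, N, df)
print(cosine(s_vec, t_vec))
```

Here the two terms share the same context features ("search", "portal"), so the cosine is maximal; a candidate with unrelated contexts would score near zero.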
Translation Selection Based on Chi-Square Test and Context-Vector Analysis
• For each candidate t
  – Chi-square test
    1. Retrieve page frequencies by submitting the Boolean queries "s∩t", "~s∩t", and "s∩~t" to search engines.
    2. Compute the similarity S_χ2(s, t).
  – Context-vector analysis
    1. Retrieve the top m search results by submitting t to search engines, and generate its feature vector.
    2. Compute the similarity S_CV(s, t).
Integrated Web Mining Approach
• Idea: exploit the complementary advantages of both approaches
  – Anchor-text mining: good precision rate
  – Search-result mining: good coverage rate
• Combined similarity measure:

  S_{Combined}(s, t) = \sum_{m} \alpha_m \times \frac{1}{R_m(s, t)}

  where α_m is an assigned weight for each similarity measure S_m, and R_m(s, t) is the similarity ranking of t with respect to s under S_m.
Test Bed
• Test query sets
  – 430 popular Chinese/English query terms
    • Terms without translations filtered out (from the 9,709 core terms)
    • OOV: 64% (274/430) are out of vocabulary
  – 200 random Chinese query terms
    • Randomly selected from the top 19,124 terms in the Dreamer log
    • OOV: 82.5% (165/200)
  – 50 scientist names (proper names)
    • Randomly selected from 256 scientists (Science/People in the Yahoo! Directory)
    • OOV: 76% (38/50)
  – 50 disease names (technical terms)
    • Randomly selected from 664 diseases (Health/Diseases and Conditions in the Yahoo! Directory)
    • OOV: 72% (36/50)
Examples of Proper Names and Technical Terms

• Scientist names (English query → extracted Chinese translations)
  – Aldrin, Buzz (Astronaut): 艾德林
  – Hadfield, Chris (Astronaut): 哈德菲爾德
  – Galilei, Galileo (Astronomer): 伽利略 / 伽里略 / 加利略
  – Ptolemy, Claudius (Astronomer): 托勒密
  – Tibbets, Paul (Aviator): 第貝茲 / 迪貝茨
  – Crick, Francis (Biologist): 克立克 / 克里克
  – Drake, Edwin Laurentine (Earth Scientist): 德拉克
  – Aryabhata (Mathematician): 阿耶波多 / 阿利耶波多
  – Kepler, Johannes (Mathematician): 克卜勒 / 開普勒 / 刻卜勒
  – Dalton, John (Physicist): 道爾頓 / 道耳吞 / 道耳頓
  – Feynman, Richard (Physicist): 費曼
• Disease names (English query → extracted Chinese translations)
  – Ganglion Cyst: 腱鞘囊腫
  – Gestational Diabetes: 妊娠糖尿病
  – Hypoplastic Left Heart Syndrome: 左心發育不全症候群
  – Lactose Intolerance: 乳糖不耐症
  – Legionnaires' Disease: 退伍軍人症
  – Muscular Dystrophy: 肌肉萎縮症
  – Nosocomial Infections: 院內感染
  – Shingles: 帶狀皰疹 / 帶狀庖疹
  – Stockholm Syndrome: 斯德哥爾摩症候群
  – Sudden Infant Death Syndrome (SIDS): 嬰兒猝死症
Performance of Web Mining for Popular Queries

Approach | Query type | Top-1 | Top-3 | Top-5 | Coverage
CV       | Dic        | 56.4% | 70.5% | 74.4% | 80.1%
CV       | OOV        | 56.2% | 66.1% | 69.3% | 85.0%
CV       | All        | 56.3% | 67.7% | 71.2% | 83.3%
χ2       | Dic        | 40.4% | 61.5% | 67.9% | 80.1%
χ2       | OOV        | 54.7% | 65.0% | 68.2% | 85.0%
χ2       | All        | 49.5% | 63.7% | 68.1% | 83.3%
AT       | Dic        | 67.3% | 78.2% | 80.8% | 89.1%
AT       | OOV        | 66.1% | 74.5% | 76.6% | 83.9%
AT       | All        | 66.5% | 75.8% | 78.1% | 85.8%
Combined | Dic        | 68.6% | 82.1% | 84.6% | 92.3%
Combined | OOV        | 66.8% | 85.8% | 88.0% | 94.2%
Combined | All        | 67.4% | 84.4% | 86.7% | 93.5%
Performance of Web Mining for Random Queries / Proper Names / Technical Terms

Table 5.5 Coverage and inclusion rates for random queries.

Approach | Top-1 | Top-3 | Top-5 | Coverage
CV       | 25.5% | 45.5% | 50.5% | 60.5%
χ2       | 26.0% | 44.5% | 50.5% | 60.5%
AT       | 19.0% | 28.0% | 28.5% | 29.0%
Combined | 33.5% | 53.5% | 60.5% | 67.5%

Table 5.6 Inclusion rates for proper names and technical terms using the combined approach.

Query type     | Top-1 | Top-3 | Top-5
Scientist name | 40.0% | 52.0% | 60.0%
Disease name   | 44.0% | 60.0% | 70.0%
CLIR on NTCIR-2 Evaluation Task
• The test collection (Chen & Chen 2001)
  – 132,173 Chinese news documents (200 MB)
  – 50 English query topics
• Title queries (title section only)
  – Short: average 3.8 English words
  – Low performance: 55% of monolingual performance (Kwok 2001)
  – Difficulty: CLIR may fail if any key word in a short query cannot be translated correctly
• Can Web mining solve short-query translation?
Table 5.1 Examples of title queries in NTCIR-2.

ID  | English Title Query                                   | Chinese Title Query
Q06 | Kosovar refugees                                      | 科索沃難民潮
Q12 | Michael Jordan's retirement                           | 麥可喬登退休
Q23 | Disneyland                                            | 迪士尼樂園
Q28 | Cutting down the timber of Chinese cypress in Chilan  | 棲蘭檜木砍伐
Q30 | El Nino and infectious diseases                       | 聖嬰現象與傳染病
Q34 | Side effects of Viagra                                | 威而鋼之副作用
Q45 | Cloud Gate Dance Theatre of Taiwan                    | 雲門舞集
Q46 | Ma Yo-yo cello recital                                | 馬友友演奏會
Q47 | Jin Yong kung-fu novels                               | 金庸武俠小說
Integration of Web Mining and Probabilistic Retrieval Model
• Probabilistic retrieval model (Xu 2001; Hiemstra & de Jong 1999):

  P(Q \mid D) = \prod_{e \in Q} P(e \mid D) = \prod_{e \in Q} \left[ (1 - \lambda)\, P(e) + \lambda \sum_{c} P(e \mid c)\, P(c \mid D) \right]

  Q: English query; D: Chinese document; e: English query term; c: Chinese translation;
  P(e): background probability; P(e|c): translation probability; P(c|D): generation probability;
  λ: smoothing (interpolation) parameter
• Estimating the translation probability P(e|c)
  – The Web mining approach: P(e|c) = P_web(e|c) ≈ S_Combined(e, c)
  – The dictionary-based approach: P(e|c) = P_dic(e|c) ≈ 1/n_e, where n_e is the number of translations of c
  – The hybrid approach: P(e|c) = [P_web(e|c) + P_dic(e|c)] / 2
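The retrieval scoring can be sketched as below. This is an illustrative sketch under toy assumptions: the translation table, background probabilities, and document language models are all invented, and λ = 0.7 is an arbitrary smoothing choice.

```python
def query_likelihood(query, doc_lm, trans, background, lam=0.7):
    """P(Q|D) = prod_e [(1-lam) * P(e) + lam * sum_c P(e|c) * P(c|D)]."""
    p = 1.0
    for e in query:
        translated = sum(p_ec * doc_lm.get(c, 0.0)
                         for c, p_ec in trans.get(e, {}).items())
        p *= (1 - lam) * background.get(e, 1e-6) + lam * translated
    return p

trans = {"disneyland": {"迪士尼樂園": 0.8, "迪斯尼": 0.2}}   # toy P(e|c) table
background = {"disneyland": 0.001}                           # toy P(e)
doc_a = {"迪士尼樂園": 0.05}   # language model P(c|D) of a relevant document
doc_b = {"新聞": 0.05}         # language model of an irrelevant document
score_a = query_likelihood(["disneyland"], doc_a, trans, background)
score_b = query_likelihood(["disneyland"], doc_b, trans, background)
print(score_a > score_b)  # True: the relevant document ranks higher
```

The background term keeps score_b nonzero, so untranslatable query words do not zero out a whole document, while the translation sum rewards documents that generate the query's Chinese translations.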
Performance of Query Translation and CLIR for NTCIR-2 English-Chinese Retrieval Task
Table 5.9 Top-n inclusion rates with Web mining approach for traditional Chinese translations of 178 English title query terms.
Type Number Top-1 Top-2 Top-3 Top-4 Top-5
Terms existing in LDC 156 60.3% 73.7% 77.6% 82.1% 83.3%
Terms not included in LDC 22 68.1% 77.2% 81.8% 86.3% 86.3%
Total 178 61.2% 74.2% 78.1% 82.6% 83.7%
Table 5.10 The MAP values with three different approaches of query translation to the NTCIR-2 English-Chinese retrieval task.
Query translation approach Mean average precision
Dictionary-based approach 0.207
Web mining approach 0.241
The hybrid approach 0.271
Performance Analysis for Query Translation & CLIR
• Query translation
  – Effective for:
    • Local place names: "Chilan" (棲蘭), "Meinung" (美濃)
    • Foreign names: "Jordan" (喬登, 喬丹), "Kosovar" (科索沃), "Carter" (卡特)
    • Aliases/synonyms: "Disney" (迪士尼, 迪斯尼, 迪斯奈, 狄斯奈, 狄士尼)
  – Ineffective for:
    • Common terms: "victim" (受難者), "abolishment" (廢止)
    • Native Chinese names: "Bai Xiao-yan" (白曉燕), "Bai-feng bean" (白鳳豆)
    • Multiple senses, e.g., title query Q01: "The assembly parade law and freedom of speech"
      – "assembly" => 組合語言 (assembly language; error), 集會 (correct)
      – "speech" => 演講, 語音 (errors), 言論 (correct)
• CLIR
  – Effective examples:
    • Q23 "Disneyland": MAP (mean average precision) from 0 to 0.721
    • Q46 "Ma Yo-yo cello recital": MAP from 0.205 to 0.446
Conclusion
• Practical CLWS services have not lived up to expectations, owing to the lack of multilingual translations for diverse unknown queries.
• The Web mining approach combines the anchor-text-mining and search-result-mining approaches, which are complementary in precision and coverage for query translation.
• Anchor texts and search-result pages are useful comparable corpora for query translation, contributed continuously by a huge number of volunteers (page authors) around the world.
• LiveTrans generates translation suggestions and provides a practical CLWS service for the retrieval of both Web pages and images.
Future Work
• Currently, the LiveTrans system cannot fully operate in real time; a more efficient way to reduce the computation cost is needed.
• Employ more language-processing techniques to improve accuracy in phrase translation, word segmentation, unknown-word extraction, and proper-name transliteration.
• Develop an automatic way to collect and exploit other Web resources, such as bilingual/multilingual Web pages.
• Enhance the LiveTrans system to handle more Asian and European languages, such as Japanese, Korean, and French.
• Apply the Web-mining translation techniques to enhance current machine translation and to design a computer-aided English writing system.