![Page 1: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/1.jpg)
Cross-Language Retrieval
LBSC 796/CMSC 828o
Douglas W. Oard and Jianqiang Wang
Session 10, April 5, 2004
![Page 2: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/2.jpg)
The Grand Plan
• Until Now: What makes up an IR system?
  – Character-coded text as an example
• Starting Now: Beyond English text
  – Foreign languages, audio, video, …
![Page 3: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/3.jpg)
Agenda
• Questions
• Overview
• Cross-Language Search
• User Interaction
![Page 4: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/4.jpg)
User Needs Assessment
• Who are the potential users?
• What goals do we seek to support?
• What language skills must we accommodate?
![Page 5: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/5.jpg)
Global Internet Users
(Pie chart; native speakers, Global Reach projection for 2004, as of Sept, 2003: English 31%, Chinese 18%, Japanese 9%, Spanish 7%, German 7%, Korean 5%, French 4%, Portuguese 3%, Italian 3%, Russian 2%, Other 11%)
![Page 6: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/6.jpg)
Global Internet Users
(Two pie charts; native speakers, Global Reach projection for 2004, as of Sept, 2003. One repeats the distribution above; the other is led by a 68% share.)
![Page 7: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/7.jpg)
Global Languages
(Bar chart, speakers in millions: Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese)
Source: http://www.g11n.com/faq.html
![Page 8: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/8.jpg)
Global Trade
(Bar chart, exports and imports in billions of US dollars (1999): USA, Germany, Japan, China, France, UK, Canada, Italy, Netherlands, Belgium, Korea, Mexico, Taiwan, Singapore, Spain)
Source: World Trade Organization 2000 Annual Report
![Page 9: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/9.jpg)
Who needs Cross-Language Search?
• When users can read several languages
  – Eliminate multiple queries
  – Query in most fluent language
• Monolingual users can also benefit
  – If translations can be provided
  – If it suffices to know that a document exists
  – If text captions are used to search for images
![Page 10: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/10.jpg)
European Web Size Projection
(Chart, billions of words on a log scale from 0.1 to 10,000: English vs. Other European)
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
![Page 11: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/11.jpg)
European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
Bilingual: 25%
Other: 8%
Native Tongue Only: 42%
English Only: 11%
English Only, UK & Ireland: 14%
![Page 12: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/12.jpg)
The Problem Space
• Retrospective search
  – Web search
  – Specialized services (medicine, law, patents)
  – Help desks
• Real-time filtering
  – Email spam
  – Web parental control
  – News personalization
• Real-time interaction
  – Instant messaging
  – Chat rooms
  – Teleconferences
Key capability: map across languages
  – For human understanding
  – For automated processing
![Page 13: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/13.jpg)
A Little (Confusing) Vocabulary
• Multilingual document
  – Document containing more than one language
• Multilingual collection
  – Collection of documents in different languages
• Multilingual system
  – Can retrieve from a multilingual collection
• Cross-language system
  – Query in one language finds document in another
• Translingual system
  – Queries can find documents in any language
![Page 14: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/14.jpg)
(Diagram: translingual search supports selecting via a query, translingual browsing supports examining a document, and translation supports information use; together they span information access and information use)
![Page 15: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/15.jpg)
Early Work
• 1964 International Road Research
  – Multilingual thesauri
• 1970 SMART
  – Dictionary-based free-text cross-language retrieval
• 1978 ISO Standard 5964 (revised 1985)
  – Guidelines for developing multilingual thesauri
• 1990 Latent Semantic Indexing
  – Corpus-based free-text translingual retrieval
![Page 16: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/16.jpg)
Multilingual Thesauri
• Build a cross-cultural knowledge structure
  – Cultural differences influence indexing choices
• Use language-independent descriptors
  – Matched to language-specific lead-in vocabulary
• Three construction techniques
  – Build it from scratch
  – Translate an existing thesaurus
  – Merge monolingual thesauri
![Page 17: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/17.jpg)
(Concept map: Multilingual Information Access draws on Information Science (cross-language retrieval, indexing languages, machine-assisted indexing, information retrieval, multilingual metadata, digital libraries, international information flow, diffusion of innovation, information use, automatic abstracting), Natural Language Processing (machine translation, information extraction, text summarization), Artificial Intelligence (multilingual ontologies, ontological engineering, textual data mining, knowledge discovery, machine learning), and other fields (localization, information visualization, human-computer interaction, web internationalization, the World-Wide Web, topic detection and tracking, speech processing, multilingual OCR, document image understanding))
![Page 18: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/18.jpg)
Free Text CLIR
• What to translate?
  – Queries or documents
• Where to get translation knowledge?
  – Dictionary or corpus
• How to use it?
![Page 19: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/19.jpg)
The Search Process
(Diagram: an author's concepts lead to selected document-language terms; a searcher's query leads to inferred concepts and chosen terms for query-document matching. A monolingual searcher chooses document-language terms directly, while a cross-language searcher chooses query-language terms.)
![Page 20: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/20.jpg)
Translingual Retrieval Architecture
(Diagram: a Chinese query undergoes Chinese term selection and feeds monolingual Chinese retrieval (scores 1: 0.72, 2: 0.48) as well as, via English term selection, cross-language retrieval (scores 3: 0.91, 4: 0.57, 5: 0.36); language identification routes each document to the right index)
![Page 21: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/21.jpg)
Evidence for Language Identification
• Metadata
  – Included in HTTP and HTML
• Word-scale features
  – Which dictionary gets the most hits?
• Subword features
  – Character n-gram statistics
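The character n-gram idea can be sketched in a few lines. The rank-profile comparison below follows the classic Cavnar-Trenkle "out-of-place" measure; the two toy training sentences stand in for real training corpora, so this is an illustration of the mechanics, not a usable identifier.

```python
# Toy character-trigram language identifier: rank trigrams by frequency in
# training text, then pick the language whose profile is "closest" in rank
# order to the profile of the unknown text.
from collections import Counter

def trigram_profile(text, top_k=300):
    """Map each of the top_k most frequent trigrams to its rank."""
    padded = f"  {text.lower()}  "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top_k))}

def out_of_place(profile, model):
    """Sum of rank differences; missing trigrams get the maximum penalty."""
    max_rank = len(model)
    return sum(abs(r - model.get(g, max_rank)) for g, r in profile.items())

def identify(text, models):
    return min(models, key=lambda lang: out_of_place(trigram_profile(text), models[lang]))

models = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog and then the dog sleeps"),
    "de": trigram_profile("der schnelle braune fuchs springt über den faulen hund und schläft dann"),
}
print(identify("the dog jumps", models))
```

With realistic training data this subword approach works even on very short inputs, which is why it beat the dictionary-hit heuristic in practice.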
![Page 22: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/22.jpg)
Query-Language Retrieval
(Diagram: document translation maps English document terms into Chinese; monolingual Chinese retrieval then matches the Chinese query terms, producing scores 3: 0.91, 4: 0.57, 5: 0.36)
![Page 23: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/23.jpg)
Example: Modular use of MT
• Select a single query language
• Translate every document into that language
• Perform monolingual retrieval
![Page 24: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/24.jpg)
Is Machine Translation Enough?
(Chart: TDT-3 Mandarin Broadcast News, Systran vs. balanced 2-best translation)
![Page 25: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/25.jpg)
Document-Language Retrieval
MonolingualEnglish
Retrieval
3: 0.91 4: 0.575: 0.36
QueryTranslation
ChineseQueryTerms
EnglishDocument
Terms
![Page 26: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/26.jpg)
Query vs. Document Translation
• Query translation
  – Efficient for short queries (not relevance feedback)
  – Limited context for ambiguous query terms
• Document translation
  – Rapid support for interactive selection
  – Need only be done once (if query language is same)
• Merged query and document translation
  – Can produce better effectiveness than either alone
![Page 27: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/27.jpg)
Interlingual Retrieval
(Diagram: query translation maps Chinese query terms, and document translation maps English document terms, into a shared representation where interlingual retrieval produces scores 3: 0.91, 4: 0.57, 5: 0.36)
![Page 28: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/28.jpg)
(Example of translation ambiguity and segmentation errors in a Chinese query: one term could translate as oil / petroleum or as restrain, and another as probe / survey / take samples. Which translation should be chosen? Some terms have no translation at all, and wrong segmentation can produce spurious readings such as cymbidium goeringii.)
![Page 29: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/29.jpg)
What’s a “Term?”
• Granularity of a “term” depends on the task
  – Long for translation, more fine-grained for retrieval
• Phrases improve translation two ways
  – Less ambiguous than single words
  – Idiomatic expressions translate as a single concept
• Three ways to identify phrases
  – Semantic (e.g., appears in a dictionary)
  – Syntactic (e.g., parse as a noun phrase)
  – Co-occurrence (appear together unexpectedly often)
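The co-occurrence route can be sketched with pointwise mutual information over adjacent word pairs: pairs that appear together far more often than chance are phrase candidates. The tiny corpus and the thresholds below are invented for illustration.

```python
# Score adjacent word pairs by PMI; high-PMI, frequent pairs are phrase
# candidates (e.g. "hot dog" below, which recurs as a unit).
import math
from collections import Counter

def phrase_candidates(tokens, min_pmi=1.5, min_count=2):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = {}
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        pmi = math.log2((c / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
        if pmi >= min_pmi:
            scored[(a, b)] = pmi
    return scored

corpus = ("hot dog stands sell hot dog buns while the dog sleeps "
          "and a hot day makes the hot dog vendor tired").split()
print(sorted(phrase_candidates(corpus)))
```

Unlike the semantic and syntactic tests, this needs no dictionary or parser, only corpus statistics, which makes it attractive for languages with few resources.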
![Page 30: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/30.jpg)
Learning to Translate
• Lexicons
  – Phrase books, bilingual dictionaries, …
• Large text collections
  – Translations (“parallel”)
  – Similar topics (“comparable”)
• Similarity
  – Similar pronunciation
• People
![Page 31: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/31.jpg)
Types of Lexical Resources
• Ontology
  – Organization of knowledge
• Thesaurus
  – Ontology specialized to support search
• Dictionary
  – Rich word list, designed for use by people
• Lexicon
  – Rich word list, designed for use by a machine
• Bilingual term list
  – Pairs of translation-equivalent terms
![Page 32: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/32.jpg)
Dictionary-Based Query Translation
Original query: El Nino and infectious diseases
Term selection: “El Nino” infectious diseases
Term translation: (dictionary coverage: “El Nino” is not found)
Translation selection:
Query formulation / structure:
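A minimal sketch of this pipeline, assuming a hypothetical two-entry English-to-Spanish term list and a made-up OR/AND query syntax; out-of-vocabulary terms such as “El Nino” pass through untranslated, as on the slide.

```python
# Hypothetical English -> Spanish bilingual term list (illustrative only).
TERM_LIST = {
    "infectious": ["infeccioso", "contagioso"],
    "diseases": ["enfermedades"],
}

def translate_query(selected_terms):
    """Translate each selected term; keep OOV terms (e.g. "el nino") as-is,
    and OR together the alternative translations of each term."""
    clauses = []
    for term in selected_terms:
        alternatives = TERM_LIST.get(term, [term])   # OOV passthrough
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)

# Term selection has already dropped the stopword "and".
print(translate_query(["el nino", "infectious", "diseases"]))
```

Keeping the alternatives grouped per query term, rather than flattening them into one bag of words, is what makes the structured-query weighting discussed later possible.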
![Page 33: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/33.jpg)
Four-Stage Backoff
• Tralex might contain stems, surface forms, or some combination of the two.
• Match the document term against the translation lexicon in four stages:
  1. surface form ↔ surface form
  2. document stem ↔ lexicon surface form
  3. document surface form ↔ lexicon stem
  4. stem ↔ stem
• Example: French mangez (stem mange) can match lexicon entries such as mangez, mange, or mangent to find translations like eat or eats
French stemmer: Oard, Levow, and Cabezas (2001); English: Inquery’s kstem
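The four stages amount to successive lookups; the one-entry tralex and the crude suffix-stripping "stemmer" below are stand-ins for a real lexicon and the stemmers cited on the slide.

```python
# Four-stage backoff lookup: surface/surface, then stem the document side,
# then stem the lexicon side, then stem both. Returns (translations, stage).
def stem_fr(word):
    """Toy French stemmer: strip a few common verb endings (illustrative)."""
    for suffix in ("ez", "ent", "er", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def backoff_translate(doc_term, lexicon):
    if doc_term in lexicon:                                  # stage 1
        return lexicon[doc_term], 1
    if stem_fr(doc_term) in lexicon:                         # stage 2
        return lexicon[stem_fr(doc_term)], 2
    stemmed_lex = {stem_fr(k): v for k, v in lexicon.items()}
    if doc_term in stemmed_lex:                              # stage 3
        return stemmed_lex[doc_term], 3
    if stem_fr(doc_term) in stemmed_lex:                     # stage 4
        return stemmed_lex[stem_fr(doc_term)], 4
    return None, 0

tralex = {"mangez": ["eat", "eats"]}
print(backoff_translate("mangez", tralex))    # exact surface match, stage 1
print(backoff_translate("mangent", tralex))   # only both-sides-stemmed matches, stage 4
```

Each later stage is noisier than the one before it, which is why the lookup stops at the first stage that produces a match.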
![Page 34: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/34.jpg)
Results
Condition                                   Mean Average Precision
STRAND corpus tralex (N=1)                  0.2320
STRAND corpus tralex (N=2)                  0.2440
STRAND corpus tralex (N=3)                  0.2499
Merging by voting                           0.2892
Baseline: downloaded dictionary             0.2919
Backoff from dictionary to corpus tralex    0.3282  (+12% relative, p < .01)
![Page 35: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/35.jpg)
Results Detail
(Chart: mean average precision)
![Page 36: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/36.jpg)
Exploiting Part-of-Speech (POS)
• Constrain translations by part-of-speech
  – Requires POS tagger and POS-tagged lexicon
• Works well when queries are full sentences
  – Short queries provide little basis for tagging
• Constrained matching can hurt monolingual IR
  – Nouns in queries often match verbs in documents
![Page 37: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/37.jpg)
The Short Query Challenge
(Chart: number of terms per query, 1997–1999, for English vs. other European languages (German, French, Italian, Dutch, Swedish))
Source: Jack Xu, Excite@Home, 1999
![Page 38: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/38.jpg)
“Structured Queries”
• Weight of term a in document i depends on:
  – TF(a,i): frequency of term a in document i
  – DF(a): how many documents term a occurs in
• Build pseudo-terms from alternate translations
  – TF(syn(a,b),i) = TF(a,i) + TF(b,i)
  – DF(syn(a,b)) = |{docs with a} ∪ {docs with b}|
• Downweight terms with any common translation
  – Particularly effective for long queries
![Page 39: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/39.jpg)
Computing Weights
(Query term 1 has translations t1 and t2; query term 2 has the single translation t3)
• Unbalanced: (1/3)[TF1/DF1 + TF2/DF2 + TF3/DF3]
  – Overweights query terms that have many translations
• Balanced (#sum): (1/2)[(1/2)(TF1/DF1 + TF2/DF2) + TF3/DF3]
  – Sensitive to rare translations
• Pirkola (#syn): (1/2)[(TF1 + TF2)/(DF1 + DF2) + TF3/DF3]
  – Deemphasizes query terms with any common translation
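A worked example of the three schemes, using invented (TF, DF) pairs for a query whose first term has translations t1 and t2 and whose second term has only t3:

```python
# Three ways to weight a translated query term, as on the slide.
def unbalanced(translations):
    # Average over all translated terms: a query term with many
    # translations contributes many addends.
    return sum(tf / df for tf, df in translations) / len(translations)

def balanced(groups):
    # Average translations within each query term, then across query terms.
    per_term = [sum(tf / df for tf, df in g) / len(g) for g in groups]
    return sum(per_term) / len(per_term)

def pirkola(groups):
    # Pool TF and DF within each query term (a "synonym" pseudo-term).
    per_term = [sum(tf for tf, _ in g) / sum(df for _, df in g) for g in groups]
    return sum(per_term) / len(per_term)

t1, t2, t3 = (4, 100), (1, 5), (3, 50)   # invented (TF, DF) pairs
groups = [[t1, t2], [t3]]
print(unbalanced([t1, t2, t3]))  # (1/3)(0.04 + 0.20 + 0.06) = 0.1
print(balanced(groups))          # (1/2)((1/2)(0.04 + 0.20) + 0.06) = 0.09
print(pirkola(groups))           # (1/2)(5/105 + 0.06) ~ 0.054
```

Note how Pirkola's #syn pools document frequencies: the common translation t1 (DF = 100) drags down the whole first query term, which is exactly the de-emphasis described above.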
![Page 40: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/40.jpg)
Measuring Coverage Effects
(Diagram: 33 English queries (TD) are run through an English/English translation lexicon and ranked retrieval over 113,000 CLEF English news stories; the ranked lists are evaluated against CLEF relevance judgments to produce a measure of effectiveness)
![Page 41: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/41.jpg)
35 Bilingual Term Lists
• Chinese (193, 111)
• German (103, 97, 89, 6)
• Hungarian (63)
• Japanese (54)
• Spanish (35, 21, 7)
• Russian (32)
• Italian (28, 13, 5)
• French (20, 17, 3)
• Esperanto (17)
• Swedish (10)
• Dutch (10)
• Norwegian (6)
• Portuguese (6)
• Greek (5)
• Afrikaans (4)
• Danish (4)
• Icelandic (3)
• Finnish (3)
• Latin (2)
• Welsh (1)
• Indonesian (1)
• Old English (1)
• Swahili (1)
• Eskimo (1)
![Page 42: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/42.jpg)
Size Effect
(Chart: string matching vs. stem matching; 7% OOV)
![Page 43: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/43.jpg)
Out-of-Vocabulary Distribution
![Page 44: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/44.jpg)
Measuring Named Entity Effect
(Diagram: an English query and English documents are indexed by computing term weights with a translation lexicon, with named entities either removed or added; document scores are then computed and sorted into a ranked list)
![Page 45: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/45.jpg)
(Chart: retrieval effectiveness with named entities removed, named entities from the term list, named entities added, and the full query)
![Page 46: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/46.jpg)
Hieroglyphic
Egyptian Demotic
Greek
![Page 47: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/47.jpg)
Types of Bilingual Corpora
• Parallel corpora: translation-equivalent pairs
  – Document pairs
  – Sentence pairs
  – Term pairs
• Comparable corpora: topically related
  – Collection pairs
  – Document pairs
![Page 48: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/48.jpg)
Exploiting Parallel Corpora
• Automatic acquisition of translation lexicons
• Statistical machine translation
• Corpus-guided translation selection
• Document-linked techniques
![Page 49: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/49.jpg)
Learning From Document Pairs
(Table: term-by-document frequency counts for English terms E1–E5 and Spanish terms S1–S4 over linked documents Doc 1–Doc 5; terms used in similar ways show similar count patterns across the pairs)
![Page 50: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/50.jpg)
Similarity “Thesauri”
• For each term, find most similar in other language
  – Terms E1 & S1 (or E3 & S4) are used in similar ways
• Treat top related terms as candidate translations
  – Applying dictionary-based techniques
• Performed well on comparable news corpus
  – Automatically linked based on date and subject codes
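A sketch of the mechanics, with an invented count matrix whose usage patterns mimic the claim that E1 pairs with S1 and E3 with S4: each term is described by its frequency vector over the linked document pairs, and the most similar cross-language vector becomes the candidate translation.

```python
# Find candidate translations as the cross-language terms whose
# per-document usage vectors are most similar (cosine similarity).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented counts over five linked document pairs (Doc 1 .. Doc 5).
english = {"E1": [4, 8, 2, 2, 4], "E3": [0, 1, 3, 5, 0]}
spanish = {"S1": [4, 8, 2, 2, 4], "S2": [1, 2, 1, 1, 0], "S4": [0, 1, 3, 4, 1]}

def best_translation(e_term):
    return max(spanish, key=lambda s: cosine(english[e_term], spanish[s]))

print(best_translation("E1"))   # S1: identical usage pattern
print(best_translation("E3"))   # S4: nearly identical pattern
```

No dictionary is consulted at all; everything is learned from the document links, which is why the quality of the linking (dates, subject codes) matters so much.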
![Page 51: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/51.jpg)
Generalized Vector Space Model
• “Term space” of each language is different
  – Document links define a common “document space”
• Describe documents based on the corpus
  – Vector of similarities to each corpus document
• Compute cosine similarity in document space
• Very effective in a within-domain evaluation
![Page 52: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/52.jpg)
Latent Semantic Indexing
• Cosine similarity captures noise with signal
  – Term choice variation and word sense ambiguity
• Signal-preserving dimensionality reduction
  – Conflates terms with similar usage patterns
• Reduces term choice effect, even across languages
• Computationally expensive
![Page 53: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/53.jpg)
Statistical Machine Translation
Señora Presidenta , había pedido a la administración del Parlamento que garantizase
Madam President , I had asked the administration to ensure that
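Underneath a statistical MT system is a translation model learned from sentence pairs like the one above. The miniature IBM Model 1 EM loop below, over three invented Spanish-English pairs, shows how co-occurrence alone is enough to pin down word translations.

```python
# Miniature IBM Model 1: EM over sentence pairs learns t(e|f), the
# probability that foreign word f translates to English word e.
from collections import defaultdict
from itertools import product

pairs = [
    (["la", "casa"], ["the", "house"]),
    (["la", "flor"], ["the", "flower"]),
    (["casa", "verde"], ["green", "house"]),
]

f_vocab = {f for fs, _ in pairs for f in fs}
e_vocab = {e for _, es in pairs for e in es}
t = {(e, f): 1.0 / len(e_vocab) for e, f in product(e_vocab, f_vocab)}

for _ in range(20):                                  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in pairs:
        for e in es:
            z = sum(t[(e, f)] for f in fs)           # E step: soft alignment
            for f in fs:
                count[(e, f)] += t[(e, f)] / z
                total[f] += t[(e, f)] / z
    for (e, f) in t:                                 # M step: renormalize
        t[(e, f)] = count[(e, f)] / total[f]

print(max(e_vocab, key=lambda e: t[(e, "casa")]))    # "house" dominates
```

Because "casa" co-occurs with "house" in every pair but with "the" and "green" only sometimes, EM concentrates the probability mass on the right translation, the same pigeonhole logic that full-scale SMT systems exploit on millions of pairs.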
![Page 54: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/54.jpg)
Evaluating Corpus-Based Techniques
• Within-domain evaluation (upper bound)
  – Partition a bilingual corpus into training and test
  – Use the training part to tune the system
  – Generate relevance judgments for evaluation part
• Cross-domain evaluation (fair)
  – Use existing corpora and evaluation collections
  – No good metric for degree of domain shift
![Page 55: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/55.jpg)
Ranked Retrieval Effectiveness: Title Queries
(Chart: mean average precision, 0.00–0.20, vs. threshold, 0.1–1.0, for Pirkola, Kwok, MDF, WTF, WDF, WTF/DF, and a baseline)
English queries, Arabic documents
![Page 56: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/56.jpg)
Exploiting Comparable Corpora
• Blind relevance feedback
  – Existing CLIR technique + collection-linked corpus
• Lexicon enrichment
  – Existing lexicon + collection-linked corpus
• Dual-space techniques
  – Document-linked corpus
![Page 57: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/57.jpg)
Blind Relevance Feedback
• Augment a representation with related terms
  – Find related documents, extract distinguishing terms
• Multiple opportunities:
  – Before doc translation: enrich the vocabulary
  – After doc translation: mitigate translation errors
  – Before query translation: improve the query
  – After query translation: mitigate translation errors
• Short queries get the most dramatic improvement
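A minimal sketch of the idea: run the query, assume the top-ranked documents are relevant, and add their most distinguishing terms to the query. The five-document collection, the naive overlap scorer, and the cutoffs are all invented for illustration.

```python
# Blind relevance feedback over a toy collection.
from collections import Counter

docs = [
    "russian troops leave latvia after long talks",
    "russia agrees to withdraw troops from the baltic states",
    "baltic leaders welcome the russian withdrawal",
    "stock markets rally as rates fall",
    "new recipe for baltic herring",
]

def score(query_terms, doc):
    """Crude relevance score: count of query terms present in the document."""
    return len(set(query_terms) & set(doc.split()))

def expand(query_terms, top_k=2, new_terms=2):
    """Add frequent non-query terms from the top_k retrieved documents."""
    ranked = sorted(docs, key=lambda d: score(query_terms, d), reverse=True)
    pool = Counter(w for d in ranked[:top_k] for w in d.split()
                   if w not in query_terms and len(w) > 3)
    return list(query_terms) + [w for w, _ in pool.most_common(new_terms)]

print(expand(["russian", "troops", "latvia"]))
```

Run before translation against a source-language collection, this enriches the query's vocabulary; run after translation against the document-language collection, it can wash out translation errors, which is the pre/post distinction above.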
![Page 58: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/58.jpg)
Query Expansion Opportunities
Source query (Italian): Ritiro delle truppe russe dalla Lettonia (“Withdrawal of Russian troops from Latvia”)
(Diagram: pre-translation expansion against La Stampa adds terms such as baltic, russe, estone; the expanded query is translated (russia, troops, estone), and post-translation expansion against the LA Times adds terms such as russia, radar, troops, moscow before the final retrieval)
![Page 59: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/59.jpg)
Query Expansion Effect
(Chart: mean average precision, 0.00–0.35, vs. unique Dutch terms, 0–15,000, for the Both, Post, Pre, and None expansion conditions)
Paul McNamee and James Mayfield, SIGIR-2002
![Page 60: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/60.jpg)
Indexing Time: Doc Translation
(Chart: index time in seconds vs. thousands of documents, 0–500, for monolingual vs. cross-language indexing)
![Page 61: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/61.jpg)
Post-Translation “Document Expansion”
(Diagram: each Mandarin Chinese document to be indexed is automatically segmented, translated term-to-term, and run as a query against an English corpus; selective terms from the top 5 retrieved English documents are added to the single document before indexing, and English queries then search the expanded index)
![Page 62: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/62.jpg)
Why Document Expansion Works
• Story-length objects provide useful context
• Ranked retrieval finds signal amid the noise
• Selective terms discriminate among documents
  – Enrich index with low DF terms from top documents
• Similar strategies work well in other applications
  – CLIR query translation
  – Monolingual spoken document retrieval
![Page 63: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/63.jpg)
Lexicon Enrichment
… Cross-Language Evaluation Forum …
… Solto Extunifoc Tanixul Knadu …
?
![Page 64: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/64.jpg)
Lexicon Enrichment
• Use a bilingual lexicon to align “context regions”
  – Regions with high coincidence of known translations
• Pair unknown terms with unmatched terms
  – Unknown: language A, not in the lexicon
  – Unmatched: language B, not covered by translation
• Treat the most surprising pairs as new translations
![Page 65: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/65.jpg)
Cognate Matching
• Dictionary coverage is inherently limited
  – Translation of proper names
  – Translation of newly coined terms
  – Translation of unfamiliar technical terms
• Strategy: model derivational translation
  – Orthography-based
  – Pronunciation-based
![Page 66: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/66.jpg)
Matching Orthographic Cognates
• Retain untranslatable words unchanged
  – Often works well between European languages
• Rule-based systems
  – Even off-the-shelf spelling correction can help!
• Character-level statistical MT
  – Trained using a set of representative cognates
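The spelling-correction observation suggests a simple sketch: match untranslated terms to document-language vocabulary by edit distance, accepting matches under a length-relative threshold. The vocabulary and the threshold below are invented.

```python
# Orthographic cognate matching by Levenshtein edit distance.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cognate_match(term, vocabulary, max_relative_dist=0.35):
    """Return the closest vocabulary word if it is close enough, else None."""
    best = min(vocabulary, key=lambda v: edit_distance(term, v))
    dist = edit_distance(term, best)
    return best if dist / max(len(term), len(best)) <= max_relative_dist else None

english_vocab = ["telephone", "president", "chocolate", "window"]
print(cognate_match("telefono", english_vocab))   # Spanish "telefono"
print(cognate_match("ventana", english_vocab))    # no close cognate
```

This only helps for orthographically related language pairs; across scripts (e.g. Chinese and English) the phonetic route on the next slide is needed instead.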
![Page 67: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/67.jpg)
Matching Phonetic Cognates
• Forward transliteration
  – Generate all potential transliterations
• Reverse transliteration
  – Guess source string(s) that produced a transliteration
• Match in phonetic space
![Page 68: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/68.jpg)
Leveraging Cognates
(Diagram: two written forms can be compared directly by string comparison, or each can be mapped, by alphabetic or phonetic transliteration or from its spoken form, to a pronunciation for phonetic comparison; both paths yield similarity scores)
![Page 69: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/69.jpg)
Cross-Language “Retrieval”
(Diagram: Query → Query Translation → Translated Query → Search → Ranked List)
![Page 70: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/70.jpg)
Interactive Translingual Search
(Diagram: Query Formulation → Query Translation → Translated Query → Search → Ranked List → Selection → Document → Examination → Use, with query reformulation feeding back from later stages; translated “headlines” and English definitions support selection, and MT supports examination)
![Page 71: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/71.jpg)
Selection
• Goal: Provide information to support decisions
• May not require very good translations
  – e.g., term-by-term title translation
• People can “read past” some ambiguity
  – May help to display a few alternative translations
![Page 72: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/72.jpg)
Language-Specific Selection
Query in English: Swiss bank
German query: (Swiss) (Bankgebäude, bankverbindung, bank)
English results:
  1 (0.72) Swiss Bankers Criticized, AP / June 14, 1997
  2 (0.48) Bank Director Resigns, AP / July 24, 1997
German results:
  1 (0.91) U.S. Senator Warpathing, NZZ / June 14, 1997
  2 (0.57) [Bankensecret] Law Change, SDA / August 22, 1997
  3 (0.36) Banks Pressure Existent, NZZ / May 3, 1997
![Page 73: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/73.jpg)
Translingual Selection
Query in English: Swiss bank
German query: (Swiss) (Bankgebäude, bankverbindung, bank)
Merged results:
  1 (0.91) U.S. Senator Warpathing, NZZ / June 14, 1997
  2 (0.57) [Bankensecret] Law Change, SDA / August 22, 1997
  3 (0.52) Swiss Bankers Criticized, AP / June 14, 1997
  4 (0.36) Banks Pressure Existent, NZZ / May 3, 1997
  5 (0.28) Bank Director Resigns, AP / July 24, 1997
![Page 74: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/74.jpg)
Merging Ranked Lists
• Types of evidence
  – Rank
  – Score
• Evidence combination
  – Weighted round robin
  – Score combination
• Parameter tuning
  – Condition-based
  – Query-based
Example lists:
  List 1: 1 voa4062 .22, 2 voa3052 .21, 3 voa4091 .17, …, 1000 voa4221 .04
  List 2: 1 voa4062 .52, 2 voa2156 .37, 3 voa3052 .31, …, 1000 voa2159 .02
  Merged: 1 voa4062, 2 voa3052, 3 voa2156, …, 1000 voa4201
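Both combination options can be sketched directly; the CombSUM-style score combination min-max normalizes each list before summing, the round robin here is the simple unweighted variant, and the tiny input lists echo the example above.

```python
# Merge ranked lists by score combination or by round robin.
from itertools import chain, zip_longest

def normalize(scores):
    """Min-max normalize a {doc: score} mapping to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def comb_sum(*lists):
    """Sum normalized scores per document, rank by combined score."""
    merged = {}
    for scores in lists:
        for doc, s in normalize(scores).items():
            merged[doc] = merged.get(doc, 0.0) + s
    return sorted(merged, key=merged.get, reverse=True)

def round_robin(*lists):
    """Interleave the lists rank by rank, skipping duplicates."""
    ranked = [sorted(l, key=l.get, reverse=True) for l in lists]
    order, seen = [], set()
    for doc in chain.from_iterable(zip_longest(*ranked)):
        if doc is not None and doc not in seen:
            seen.add(doc)
            order.append(doc)
    return order

list_1 = {"voa4062": 0.22, "voa3052": 0.21, "voa4091": 0.17}
list_2 = {"voa4062": 0.52, "voa2156": 0.37, "voa3052": 0.31}
print(comb_sum(list_1, list_2))
print(round_robin(list_1, list_2))
```

Normalization matters because the two systems' raw scores live on different scales (compare .22 and .52 for the same top document); without it, the higher-scoring list would dominate.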
![Page 75: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/75.jpg)
Examination Interface
• Two goals
  – Refine document delivery decisions
  – Support vocabulary discovery for query refinement
• Rapid translation is essential
  – Document translation retrieval strategies are a good fit
  – Focused on-the-fly translation may be a viable alternative
![Page 76: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/76.jpg)
Uh oh…
![Page 77: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/77.jpg)
Translation for Assessment
Indonesian City of Bali in October last year in the bomb blast in the case of imam accused India of the sea on Monday began to be averted. The attack on getting and its plan to make the charges and decide if it were found guilty, he death sentence of May. Indonesia of the police said that the imam sea bomb blasts in his hand claim to be accepted. A night Club and time in the bomb blast in more than 200 people were killed and several injured were in which most foreign nationals. …
![Page 78: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/78.jpg)
MT in a Month
(Chart: percent human cased NIST r3n4 score, 50–90, for best competing, ISI public, ISI public+, ISI unrestricted, ISI late, Human 6, and Human 5)
![Page 79: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/79.jpg)
![Page 80: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/80.jpg)
Experiment Design
(Table: four participants in a counterbalanced design with two systems, A and B; task order alternates between Topic11, Topic17 / Topic13, Topic29 and the reverse. Narrow topics: 11, 13; broad topics: 17, 29)
![Page 81: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/81.jpg)
Maryland Experiments
• MT is almost always better
  – Significant overall and for narrow topics alone (one-tailed t-test, p < 0.05)
• F measure is less insightful for narrow topics
  – Always near 0 or 1
(Chart: average F0.8 on two topics for participants umd01–umd04, MT vs. GLOSS, over broad topics and narrow topics)
![Page 82: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/82.jpg)
iCLEF 2002 Evaluation
(Chart: F-measure, 0–0.9, for topics 1–4 and their average, Automatic vs. User-Assisted; English queries, German documents, 20 minutes per topic)
![Page 83: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/83.jpg)
Better Mental Process Models
iCLEF 2003, 10 minute sessions, each bar averages 4 searchers
(Chart: number of queries)
![Page 84: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/84.jpg)
Delivery
• Use may require high-quality translation
  – Machine translation quality is often rough
• Route to best translator based on:
  – Acceptable delay
  – Required quality (language and technical skills)
  – Cost
![Page 85: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/85.jpg)
Where Things Stand
• Ranked retrieval works well across languages
  – Bonus: easily extended to text classification
  – Caveat: mostly demonstrated on news stories
• Machine translation is okay for niche markets
  – Keep an eye on this: accuracy is improving fast
• Building explainable systems seems possible
![Page 86: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/86.jpg)
Recap: Finding What You Can’t Read
• Three key challenges
  – Segmentation, coverage, evidence combination
• Segmentation objectives differ
  – Translation: favor precision over coverage
  – Retrieval: balance precision and recall
• Multiple coverage enhancement techniques
  – Expansion, backoff translation, cognate matching
• Translating evidence beats translating weights
![Page 87: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/87.jpg)
Research Opportunities
(Chart: perceived opportunities vs. past investment for user interaction, translation selection, transliteration, comparable corpora, parallel corpora, dictionaries, term selection, segmentation & phrase indexing, and lexical coverage)
![Page 88: Cross-Language Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062423/568148b4550346895db5cb7b/html5/thumbnails/88.jpg)
One Minute Paper
What was the muddiest point in the part that Jianqiang taught?