An Effective Approach for Searching Closest Sentence Translations from The Web
Ju Fan, Guoliang Li, and Lizhu Zhou
Database Research Group, Tsinghua University
DASFAA 2011 – Apr. 23, Hong Kong
Outline
• Introduction
• Overview of Our Approach
• Phrase-Based Similarity Model
• Phrase Selection
• Experiments
• Conclusion
04/24/23 2SCST@DASFAA 2011
Background
• Parallel sentences on the Web
  ▪ Sentences with a well-translated counterpart
  ▪ An English-to-Chinese example:
    Obama said he hopes to get Congress to approve it next year
    奥巴马总统说他争取让希望国会明年批准该协议。 -- blog.hjenglish.com
• A rich source for translation
• Commercial systems
  ▪ Parallel sentences, e.g., "The result is good" / 结果很好
Background
[Figure: sentence-level translation aid. Parallel sentences are discovered and extracted from the Web into a parallel sentence database (Sen 1 (E-C), Sen 2 (E-C), Sen 3 (E-C), …, Sen n (E-C)). A query sentence (English) goes through sentence matching against the database, which returns the closest sentences with their translations.]
• Research issue: an effective similarity model between sentences in the source language (e.g., English sentences)
Motivation
• Existing approaches and their drawbacks:
  ▪ Word-based, e.g., translation model, edit distance, …: cannot capture the order of words
  ▪ Gram-based, e.g., N-gram, V-gram: do not consider syntactic information
  ▪ All subsequences of a sentence: too expensive
• We propose a phrase-based similarity model that uses:
  1. Syntactic information
  2. Frequency information
  3. Lengths of phrases
Problem Definition
• Data: a database of parallel sentences
• Query: a query sentence (English)
• Answer: sentences with their translations
  ▪ Sentence 1: English - Chinese
  ▪ Sentence 2: English - Chinese
  ▪ Sentence 3: English - Chinese
  ▪ …
[Figure label: Translator]
Phrase-Based Sentence Matching
[Figure: matching framework. Offline: phrase selection runs over the parallel sentences and fills a phrase database. Online: the query q (phrases f1, f2, …, fn) and a sentence s (phrases f'1, f'2, …, f'n) are compared by the similarity model.]
Phrase-Based Similarity Model
[Figure: the offline/online matching framework again: parallel sentences, phrase selection, and phrase database (offline); q with phrases f1, f2, …, fn, s with phrases f'1, f'2, …, f'n, and the similarity model (online).]
Similarity Model
$$\mathrm{sim}(q, s) = \sum_{f \in F_q \cap F_s} \phi(q, f) \, \phi(s, f) \, w(f)$$
• $q$: the query sentence, with phrase set $F_q = \{f_1, f_2, f_3, \ldots, f_m\}$
• $s$: a sentence in the DB, with phrase set $F_s = \{f'_1, f'_2, f'_3, \ldots, f'_n\}$
• Shared phrases: $f \in F_q \cap F_s$
• $\phi(q, f)$: syntactic importance of $f$ to $q$
• $\phi(s, f)$: syntactic importance of $f$ to $s$
• $w(f)$: weight of $f$ (IDF)
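A minimal Python sketch of the scoring formula above, assuming the phrase sets and the syntactic-importance scores are already available as dictionaries, and assuming w(f) enters each summand as a multiplicative IDF-style weight. The function names, the smoothed IDF form, and the toy numbers are illustrative assumptions, not taken from the talk.

```python
import math

def idf_weight(phrase, sentence_count, phrase_freq):
    """IDF-style weight: rarer phrases get larger weights.

    The smoothed log form is an assumption; the slides only say "IDF".
    """
    return math.log(1 + sentence_count / (1 + phrase_freq.get(phrase, 0)))

def sim(phi_q, phi_s, weight):
    """Sum phi(q, f) * phi(s, f) * w(f) over the phrases shared by q and s.

    phi_q / phi_s map each phrase of q / s to its syntactic importance.
    """
    shared = phi_q.keys() & phi_s.keys()
    return sum(phi_q[f] * phi_s[f] * weight(f) for f in shared)

# Toy scores (hypothetical): only "he eat apple" is shared.
phi_q = {"he eat apple": 0.5, "he have": 0.2}
phi_s = {"he eat apple": 0.4, "red apple": 0.3}
score = sim(phi_q, phi_s, lambda f: 2.0)
```

Only shared phrases contribute, so sentences with no phrase in common score zero regardless of the weights.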
Syntactic Importance of Phrases
$$\phi(q, f) = \prod_{m} \alpha_m \prod_{g} \beta_g$$
• $\alpha_m$: syntactic weight of a matched term
• $\beta_g$: penalty (constant) for a term in the gap
• Example: sentence q = "He has eaten an apple", phrase f = "he eaten apple", gap = "has", "an"
• Dependency tree of q: "eaten" at the root has weight $\alpha_0$; "he", "apple", and "has" on the next level have weight $d \cdot \alpha_0$; "an" one level further down has weight $d^2 \cdot \alpha_0$
• $d$: a decay factor
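The example on this slide can be traced with a short Python sketch. It assumes the weight of a matched term is α0 · d^depth (as the tree annotations suggest) and that every sentence term lying between the first and last matched term, but not in the phrase, is a gap term. The parameter values α0 = 1.0, d = 0.5, β = 0.9 and the function name are illustrative assumptions.

```python
def syntactic_importance(sentence, phrase, alpha0=1.0, d=0.5, beta=0.9):
    """phi(q, f): product of matched-term weights times one penalty per gap term.

    sentence: list of (term, depth in the dependency tree), in word order.
    phrase:   list of terms, matched left to right as a subsequence.
    """
    if not phrase:
        return 0.0
    matched = []
    start = 0
    for term in phrase:
        pos = next((i for i in range(start, len(sentence))
                    if sentence[i][0] == term), None)
        if pos is None:
            return 0.0  # the phrase does not occur in the sentence
        matched.append(pos)
        start = pos + 1
    score = 1.0
    for pos in matched:
        score *= alpha0 * d ** sentence[pos][1]  # alpha_m = alpha0 * d^depth
    # Terms inside the matched span that are not part of the phrase.
    gap_count = (matched[-1] - matched[0] + 1) - len(matched)
    return score * beta ** gap_count

# "He has eaten an apple": eaten is the root, "an" is two levels down.
q = [("he", 1), ("has", 1), ("eaten", 0), ("an", 2), ("apple", 1)]
phi = syntactic_importance(q, ["he", "eaten", "apple"])
```

With these assumed values the matched terms contribute 0.5 · 1.0 · 0.5 and the two gap terms contribute 0.9², so `phi` is about 0.2025.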
Features of the Similarity Model
• More general
  ▪ Subsumes Jaccard, cosine similarity, …
• Syntactic information
  ▪ Weight of matched terms
  ▪ Weight of terms in the gap
• Frequency information
  ▪ Weight of phrases
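One way to see the "more general" claim for cosine similarity, offered as a sketch under assumed settings rather than as the authors' derivation: treat every phrase as a set element, ignore the phrase weight (set it to 1), and choose the per-sentence normalization $\phi(q, f) = 1/\sqrt{|F_q|}$. The sum over shared phrases then collapses to the cosine of the two binary set vectors.

```latex
\mathrm{sim}(q, s)
  = \sum_{f \in F_q \cap F_s} \frac{1}{\sqrt{|F_q|}} \cdot \frac{1}{\sqrt{|F_s|}} \cdot 1
  = \frac{|F_q \cap F_s|}{\sqrt{|F_q| \, |F_s|}}
```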
High-Quality Phrase Selection
[Figure: the offline/online matching framework again: parallel sentences, phrase selection, and phrase database (offline); q with phrases f1, f2, …, fn, s with phrases f'1, f'2, …, f'n, and the similarity model (online).]
High-Quality Phrase
• Extend grams by allowing discontinuous terms
• A heuristic for selecting phrases
  ▪ Gap constraint: syntactic relationship of discontinuous terms
  ▪ Frequency constraint: infrequent (large IDF)
  ▪ Maximum constraint: 1) not a prefix; 2) max. length
• Example: sentence q = "He has eaten an apple", phrase "he eaten apple" (terms linked by syntactic relationships)
• Frequency: # of sentences in the DB having the phrase
Phrase Selection
• Selecting phrases with gap and maximum constraints
• Example: sentence s = "He ate a red apple", terms "he eat red apple"
• Sentence graph
  1) Sequential relationship
  2) Syntactic relationship
• Longest path from a node = a phrase satisfying
  ▪ Gap constraint
  ▪ Maximum constraint
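The sentence-graph idea can be sketched in Python as follows. This is one reading of the slide, not the authors' algorithm: nodes are term positions, sequential edges link adjacent terms, syntactic edges link terms related in the dependency tree, all edges point left to right, and a phrase is a path from a start node that cannot be extended (no successor, or the length cap reached). The function names and the dependency pairs in the example are assumptions.

```python
def build_sentence_graph(terms, syntactic_pairs):
    """Directed graph over term positions with left-to-right edges."""
    edges = {i: set() for i in range(len(terms))}
    for i in range(len(terms) - 1):
        edges[i].add(i + 1)          # sequential relationship
    for a, b in syntactic_pairs:
        lo, hi = min(a, b), max(a, b)
        if lo != hi:
            edges[lo].add(hi)        # syntactic relationship
    return edges

def phrases_from(start, terms, edges, max_len):
    """Paths from `start` that cannot grow: dead end or max_len terms."""
    results = []

    def walk(path):
        successors = sorted(edges[path[-1]])
        if len(path) == max_len or not successors:
            results.append(" ".join(terms[i] for i in path))
            return
        for nxt in successors:
            walk(path + [nxt])

    walk([start])
    return results

# "He ate a red apple" reduced to the terms shown on the slide.
terms = ["he", "eat", "red", "apple"]
# Assumed dependency links: he-eat, eat-apple, red-apple.
edges = build_sentence_graph(terms, [(0, 1), (1, 3), (2, 3)])
from_he = phrases_from(0, terms, edges, max_len=3)
```

In this sketch "he eat apple" skips "red" only because "eat" and "apple" are syntactically related, which is how the gap constraint is enforced; the cap of three terms stands in for the maximum constraint.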
Phrase Selection
• Select phrases with frequency constraint (Threshold = 2)
• Sentences in the DB: "He has an apple"; "He ate a red apple"; "He has a pencil"; "He has"; …
• Use a frequency trie
• Prune frequent phrases
[Figure: frequency trie. Node labels with counts in parentheses: N0(8), N1(4), N2(3), N27(1), N4(1), N28(0), N5(0), N9(1), N11(1), N15(1), N13(1), N14(0), N29(0), …; edge labels: he, have, pencil, apple, eat, red, #.]
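A small Python sketch of a frequency trie, under stated assumptions: each node counts the distinct sentences whose inserted phrases pass through it, and a phrase counts as frequent when that number reaches the threshold. The exact counting scheme behind the node numbers in the figure is not recoverable from the transcript, so the class and method names, and the counting rule, are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.sentence_ids = set()  # sentences containing this phrase prefix

class FrequencyTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase_terms, sentence_id):
        node = self.root
        for term in phrase_terms:
            node = node.children.setdefault(term, TrieNode())
            node.sentence_ids.add(sentence_id)

    def frequency(self, phrase_terms):
        node = self.root
        for term in phrase_terms:
            node = node.children.get(term)
            if node is None:
                return 0
        return len(node.sentence_ids)

    def infrequent_phrases(self, threshold):
        """Phrases with frequency below threshold; frequent ones are pruned."""
        kept = []

        def visit(node, prefix):
            for term, child in sorted(node.children.items()):
                phrase = prefix + (term,)
                if len(child.sentence_ids) < threshold:
                    kept.append(phrase)
                visit(child, phrase)

        visit(self.root, ())
        return kept

trie = FrequencyTrie()
trie.insert(["he", "have", "apple"], 0)        # He has an apple
trie.insert(["he", "eat", "red", "apple"], 1)  # He ate a red apple
trie.insert(["he", "have", "pencil"], 2)       # He has a pencil
kept = trie.infrequent_phrases(threshold=2)
```

With a threshold of 2, "he" and "he have" occur in too many sentences and are pruned, while "he have apple" and "he eat red" survive.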
Experiment Setup
• Data sets
  ▪ DI: 520,899 parallel sentences from ICIBA
  ▪ DC: 800,000 parallel sentences from CNKI
• Baseline methods
  ▪ Jaccard coefficient, edit distance, cosine similarity
  ▪ Translation model methods (TM)
  ▪ Cosine similarity with VGRAM
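For orientation, two of the baselines can be sketched in a few lines of Python. This assumes word-level tokens and the textbook definitions; how the experiments tokenized or normalized sentences is not stated on the slide, and the function names are illustrative.

```python
def jaccard(a_words, b_words):
    """Jaccard coefficient over word sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a_words), set(b_words)
    return len(a & b) / len(a | b) if a | b else 1.0

def edit_distance(a_words, b_words):
    """Word-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(b_words) + 1))
    for i, wa in enumerate(a_words, start=1):
        curr = [i]
        for j, wb in enumerate(b_words, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

q = "he has eaten an apple".split()
s = "he has a pencil".split()
j = jaccard(q, s)        # 2 shared words out of 7 distinct words
e = edit_distance(q, s)  # 2 substitutions + 1 deletion
```

Because Jaccard works on sets, "the dog bit the man" and "the man bit the dog" score 1.0, which illustrates why a purely word-based measure can miss word order.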
Experiment Setup
• Evaluation metrics
  ▪ BLEU
    ◦ A well-known metric for machine translation
    ◦ Example: for the query q: "He has eaten an apple" (reference translation: 他吃了一个苹果) and the returned sentence s: "He has a pencil" (translation: 他有一支铅笔), BLEU is computed between the translation and the reference translation
  ▪ Precision
    ◦ A user study to label whether the translations are useful
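To make the metric concrete, here is a simplified sentence-level BLEU in Python. It is a sketch under assumptions: a single reference, n-grams up to length 2 (standard BLEU usually goes to 4), no smoothing, and Chinese text tokenized character by character. The function names are illustrative, and the evaluation in the talk may have used a different configuration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Unsmoothed BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = list("他吃了一个苹果")    # reference translation of q
translation = list("他有一支铅笔")    # translation attached to s
score = simple_bleu(translation, reference)
```

In this unsmoothed variant the slide's pair shares two characters but no character bigram, so the score is 0; a smoothed BLEU would give a small positive value instead.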
Effects of Phrase Selection
[Figures: effect of max. length on DI; effect of freq. threshold on DC]
Comparison with Similarity Models
[Figure: comparison on the DI data set]
Comparison with Existing Methods
[Figure: comparison on the DC data set]
User Studies
• Methods used in commercial systems
[Figure: comparison on the DI data set]
Conclusion
• Searching closest sentence translations from the Web
• A phrase-based sentence similarity model
• High-quality phrase selection methods
• Extensive experiments and user studies
Thanks
My Homepage: http://dbgroup.cs.tsinghua.edu/fanju
Frequency Constraint
• Index structures
  ▪ Phrase → Sentence
• Frequent phrases → large inverted index
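The phrase-to-sentence index can be sketched as a plain inverted index in Python. This is a minimal illustration assuming phrases are stored as strings and sentences are identified by integer ids; the function names and toy data are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(sentence_phrases):
    """Map each phrase to the sorted ids of the sentences that contain it."""
    index = defaultdict(set)
    for sentence_id, phrases in sentence_phrases.items():
        for phrase in phrases:
            index[phrase].add(sentence_id)
    return {phrase: sorted(ids) for phrase, ids in index.items()}

def candidates(index, query_phrases):
    """Ids of sentences sharing at least one phrase with the query."""
    found = set()
    for phrase in query_phrases:
        found.update(index.get(phrase, []))
    return sorted(found)

sentence_phrases = {
    0: ["he have", "have apple"],
    1: ["he eat", "red apple"],
    2: ["he have", "have pencil"],
}
index = build_inverted_index(sentence_phrases)
hits = candidates(index, ["he have", "red apple"])
```

A frequent phrase such as "he have" accumulates a long posting list, which is the cost the frequency constraint is meant to avoid.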