
Page 1:

Interface for Finding Close Matches from Translation Memory

Nipun Edara - 10010119

Priyatham Bollimpalli - 10010148

G Sharath Reddy - 10010174

P V S Dileep - 10010180

Page 2:

IR Search Engine

• First retrieves the top relevant sentences.

• Then filters these results down to meaning-equivalent sentences.

Page 3:

Reasons for our own Search Engine

• Difficulty in customizing the ranking function. Simple ranking based on BM25 may not give optimal results, since BM25 is essentially a TF-IDF based ranking system and does not consider phrasal searches or proximity measures (a minimal BM25 sketch follows this list).

• Flexibility in index size. Whoosh builds an index that is larger than a conventional one, since it assumes the user needs all of its features. In contrast, building our own index reduced the size by 50%.

• Flexibility in the query model. Whoosh has a strict query model in which terms can be combined with only a single AND/OR operator.
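
As a reference point for the ranking discussion above, here is a minimal sketch of a standard Okapi BM25 scorer. Nothing here is taken from the project's code; the function name and the defaults k1=1.5, b=0.75 are conventional choices, not values from the slides.

    import math

    def bm25(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=1.5, b=0.75):
        """Score one document against a query with the Okapi BM25 formula.

        doc_freq: term -> number of documents containing the term
        num_docs: total number of documents in the collection
        avgdl:    average document length in tokens
        """
        score = 0.0
        dl = len(doc_terms)
        for term in set(query_terms):          # each query term counted once
            n = doc_freq.get(term, 0)
            tf = doc_terms.count(term)
            if n == 0 or tf == 0:
                continue
            idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        return score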

Page 4:

Preprocessing Stage: Indexing

Preprocessing computes corpus-wide parameters of the dataset, such as the average document length and term frequencies. A sketch of this pass follows.
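
A minimal sketch of the preprocessing pass, assuming documents arrive as token lists (the function name and representation are illustrative, not from the slides):

    from collections import Counter

    def corpus_stats(documents):
        """One pass over tokenized documents to collect overall parameters:
        average document length, term frequencies, and document frequencies."""
        term_freq, doc_freq = Counter(), Counter()
        total_tokens = 0
        for tokens in documents:
            total_tokens += len(tokens)
            term_freq.update(tokens)
            doc_freq.update(set(tokens))   # count each term once per document
        avgdl = total_tokens / len(documents) if documents else 0.0
        return avgdl, term_freq, doc_freq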

Page 5:

Conventional Indexing
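
This slide carries only a diagram in the original. For orientation, a minimal sketch of what a conventional inverted index stores (term -> postings of document id and term frequency); the project's actual index layout is not given in the slides:

    from collections import defaultdict

    def build_index(documents):
        """Map each term to {doc_id: term frequency in that document}."""
        index = defaultdict(dict)
        for doc_id, tokens in enumerate(documents):
            for token in tokens:
                index[token][doc_id] = index[token].get(doc_id, 0) + 1
        return index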

Page 6:

Proximity
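
This slide is also a diagram in the original and no formula is transcribed. One common proximity measure, sketched here under the assumption of a positional index, is the minimum distance between occurrences of two query terms in a document:

    def min_pair_distance(pos_a, pos_b):
        """Smallest gap between any occurrence of term A and term B.

        pos_a, pos_b: sorted token positions of the two terms in one
        document, as stored in a positional index. Smaller means closer.
        """
        best, i, j = float("inf"), 0, 0
        while i < len(pos_a) and j < len(pos_b):
            best = min(best, abs(pos_a[i] - pos_b[j]))
            if pos_a[i] < pos_b[j]:   # advance the pointer that lags behind
                i += 1
            else:
                j += 1
        return best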

Page 7:

Query Expansion

Every query, as well as every sentence in the documents during indexing, is subjected to the following steps (a pipeline sketch follows the list):

• Converting to lower case

• Tokenization and normalization. For example, "it's" is converted to "it is".

• Removing punctuation.

• Stemming. The Porter stemmer is used.

• Synonym expansion using WordNet.
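
A sketch of this pipeline using NLTK, which provides a Porter stemmer and a WordNet interface; the library choice and the contraction table are assumptions, since the slides name only the techniques. (Requires the NLTK 'punkt' and 'wordnet' data packages.)

    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer
    from nltk.corpus import wordnet

    CONTRACTIONS = {"it's": "it is"}       # illustrative normalization table
    stemmer = PorterStemmer()

    def preprocess(sentence):
        """Lowercase, normalize, tokenize, strip punctuation, and stem."""
        text = sentence.lower()
        for short, full in CONTRACTIONS.items():
            text = text.replace(short, full)
        # drop pure punctuation tokens
        tokens = [t for t in word_tokenize(text) if any(c.isalnum() for c in t)]
        return [stemmer.stem(t) for t in tokens]

    def synonyms(word):
        """All single-word WordNet synonyms of a word, for query expansion."""
        return {lemma.name() for syn in wordnet.synsets(word)
                for lemma in syn.lemmas() if "_" not in lemma.name()}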

Page 8:

Architectural Overview

Page 9:

Closest matching

Two main problems in applying EBMT (example-based machine translation) are:

• As sentences grow longer, the number of retrieved similar sentences decreases sharply. This often results in no output when translating long sentences.

• The other problem arises from differences in style between the input sentences and the example corpus.

Page 10:

Meaning Equivalent Sentence

• A sentence that shares the main meaning with the input sentence despite lacking some unimportant information, and that contains no information beyond what is in the input sentence.

Features:

• Content Words: Words categorized as a noun, pronoun, adjective, adverb, or verb are recognized as content words. Interrogatives are also included.

• Functional Words: Words such as particles, auxiliary verbs, conjunctions, and interjections are recognized as functional words. (A POS-based sketch of this split follows.)
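
A sketch of the content/functional split using NLTK's part-of-speech tagger; this is an assumption, since the slides do not say how words are categorized. Penn Treebank tag prefixes cover the listed classes, with wh-tags for interrogatives:

    import nltk   # requires 'punkt' and 'averaged_perceptron_tagger' data

    # Nouns (NN*), pronouns (PRP*), adjectives (JJ*), adverbs (RB*),
    # verbs (VB*), and interrogatives (W*: WDT, WP, WP$, WRB).
    CONTENT_PREFIXES = ("NN", "PRP", "JJ", "RB", "VB", "W")

    def split_words(sentence):
        """Return (content_words, functional_words) for a sentence."""
        content, functional = [], []
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            if tag.startswith(CONTENT_PREFIXES):
                content.append(word)
            else:
                functional.append(word)
        return content, functional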

Page 11:

Matching and Ranking:

• # of identical content words

• # of synonymous words

• # of common functional words

• # of different functional words

• # of different content words

Page 12:

Algorithm

• Given a query and a sentence, the matching score is calculated as follows:

• Get content words of the query (A)

• Get functional words of the query (B)

• Get synonyms of the content words of the query (C)

• Get content words of the sentence (D)

• Get functional words of the sentence (E)

• E1) Identical content words = Number of matching words between A and D

• E2) Identical synonymous words = Number of matching words between C and D

• E3) Identical functional words = Number of matching words between B and E

• Different content words = #(A) + #(D) - 2*(E1)

• Different functional words = #(B) + #(E) - 2*(E3)

• Weights are assigned to the above quantities and the total score is computed as their weighted sum (sketched below).
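
Putting the steps together, a sketch of the scorer. split_words() and synonyms() are the hypothetical helpers sketched on the earlier slides, and the weights are illustrative placeholders; the slides do not give values.

    def matching_score(query, sentence, weights=(1.0, 0.8, 0.3, 0.3, 1.0)):
        """Weighted combination of the five quantities listed above."""
        qc, qf = split_words(query)          # A, B
        sc, sf = split_words(sentence)       # D, E
        A, B, D, E = set(qc), set(qf), set(sc), set(sf)
        C = {s for w in A for s in synonyms(w)}   # synonyms of query content words

        e1 = len(A & D)                      # identical content words
        e2 = len(C & D)                      # synonymous words
        e3 = len(B & E)                      # common functional words
        diff_content = len(A) + len(D) - 2 * e1
        diff_functional = len(B) + len(E) - 2 * e3

        w1, w2, w3, w4, w5 = weights         # w4, w5 penalize differences
        return (w1 * e1 + w2 * e2 + w3 * e3
                - w4 * diff_functional - w5 * diff_content)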

Page 13:

Sequence Matcher

• An improvement of the 'gestalt pattern matching' approach proposed by Ratcliff and Obershelp.

• Idea: find the longest contiguous matching subsequence that contains no "junk" elements, then recursively apply the same procedure to the parts to the left and right of the matching subsequence.

• Although it does not yield minimal edit sequences, it tends to yield matches that "look right" to people.

• Time complexity: cubic in the length of the strings in the worst case. (A usage sketch follows.)
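
The description above matches Python's difflib.SequenceMatcher, which implements this extended Ratcliff/Obershelp matching; assuming that is the matcher in use, a short usage sketch:

    from difflib import SequenceMatcher

    a = "where can i buy a ticket to the station"
    b = "where can we buy tickets to the train station"

    sm = SequenceMatcher(None, a, b)   # None disables the junk-element filter
    print(sm.find_longest_match(0, len(a), 0, len(b)))   # anchor block
    print(round(sm.ratio(), 3))        # 2*M / (len(a) + len(b)), in [0, 1]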

Page 14:

PATTERN MATCHING: THE GESTALT APPROACH

• Gestalt - describes how people can recognize a pattern as a functional unit that has properties not derivable by summation of its parts.

Page 15:

Example for Gestalt Approach

Page 16:

The Ratcliff/Obershelp Pattern-Matching Algorithm

• Works along the same lines as the example mentioned above.

• First, locate the largest group of characters in common.

• Using this as an anchor, recursively find the largest group of common characters in the parts of both strings to the left of the anchor, and likewise in the parts to the right.

Page 17:

The Ratcliff/Obershelp Pattern-Matching Algorithm (contd.)

• Returns a score reflecting the percentage match.

• Score = 2*(# of characters matched) / [len(string1) + len(string2)]

A higher score implies a higher matching percentage.
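
For completeness, a from-scratch sketch of this scoring under the two steps described above; difflib's production version adds junk handling and caching, which are omitted here:

    def longest_common_block(a, b):
        """Longest contiguous block common to a and b: (start_a, start_b, length)."""
        best = (0, 0, 0)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                    k += 1
                if k > best[2]:
                    best = (i, j, k)
        return best

    def matched_chars(a, b):
        """Anchor on the longest common block, then recurse on both sides."""
        i, j, k = longest_common_block(a, b)
        if k == 0:
            return 0
        return (k + matched_chars(a[:i], b[:j])
                  + matched_chars(a[i + k:], b[j + k:]))

    def ratcliff_obershelp(a, b):
        """Score = 2 * matches / (len(a) + len(b)); 1.0 means identical."""
        return 2.0 * matched_chars(a, b) / (len(a) + len(b)) if a or b else 1.0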