
Page 1:

Interface for Finding Close Matches from Translation Memory

Nipun Edara - 10010119

Priyatham Bollimpalli - 10010148

G Sharath Reddy - 10010174

P V S Dileep - 10010180

Page 2:

IR Search Engine

• First retrieves the top relevant sentences.

• Then filters these results down to meaning-equivalent sentences.

Page 3:

Reasons for our own Search Engine

• Difficulty in customizing the ranking function. Simple ranking based on BM25 may not give optimal results, since BM25 is essentially a TF-IDF based ranking system and does not consider phrasal searches or proximity measures (a minimal BM25 sketch follows this list).

• Flexibility in index size. Whoosh builds an index that is larger than a conventional one, since it assumes the user needs all of its features. In contrast, building our own index reduced the size by 50%.

• Flexibility in the query model. Whoosh has a strict query model in which terms can be combined with only a single AND/OR operator.
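
As a reference point for the ranking discussion above, here is a minimal sketch of a standard Okapi BM25 scorer. Nothing here is taken from the project's code; the function name and the defaults k1=1.5, b=0.75 are conventional choices, not values from the slides.

    import math

    def bm25(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=1.5, b=0.75):
        """Score one document against a query with the Okapi BM25 formula.

        doc_freq: term -> number of documents containing the term
        num_docs: total number of documents in the collection
        avgdl:    average document length in tokens
        """
        score = 0.0
        dl = len(doc_terms)
        for term in set(query_terms):          # each query term counted once
            n = doc_freq.get(term, 0)
            tf = doc_terms.count(term)
            if n == 0 or tf == 0:
                continue
            idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        return score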

Page 4:

Preprocessing Stage: Indexing

Preprocessing computes corpus-wide parameters of the dataset, such as the average document length and term frequencies. A sketch of this pass follows.
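
A minimal sketch of the preprocessing pass, assuming documents arrive as token lists (the function name and representation are illustrative, not from the slides):

    from collections import Counter

    def corpus_stats(documents):
        """One pass over tokenized documents to collect overall parameters:
        average document length, term frequencies, and document frequencies."""
        term_freq, doc_freq = Counter(), Counter()
        total_tokens = 0
        for tokens in documents:
            total_tokens += len(tokens)
            term_freq.update(tokens)
            doc_freq.update(set(tokens))   # count each term once per document
        avgdl = total_tokens / len(documents) if documents else 0.0
        return avgdl, term_freq, doc_freq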

Page 5:

Conventional Indexing
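
This slide carries only a diagram in the original. For orientation, a minimal sketch of what a conventional inverted index stores (term -> postings of document id and term frequency); the project's actual index layout is not given in the slides:

    from collections import defaultdict

    def build_index(documents):
        """Map each term to {doc_id: term frequency in that document}."""
        index = defaultdict(dict)
        for doc_id, tokens in enumerate(documents):
            for token in tokens:
                index[token][doc_id] = index[token].get(doc_id, 0) + 1
        return index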

Page 6:

Proximity
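
This slide is also a diagram in the original and no formula is transcribed. One common proximity measure, sketched here under the assumption of a positional index, is the minimum distance between occurrences of two query terms in a document:

    def min_pair_distance(pos_a, pos_b):
        """Smallest gap between any occurrence of term A and term B.

        pos_a, pos_b: sorted token positions of the two terms in one
        document, as stored in a positional index. Smaller means closer.
        """
        best, i, j = float("inf"), 0, 0
        while i < len(pos_a) and j < len(pos_b):
            best = min(best, abs(pos_a[i] - pos_b[j]))
            if pos_a[i] < pos_b[j]:   # advance the pointer that lags behind
                i += 1
            else:
                j += 1
        return best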

Page 7:

Query Expansion

Every query, as well as every sentence in the documents during indexing, is subjected to the following steps (a pipeline sketch follows the list):

• Converting to lower case

• Tokenization and normalization. For example, "it's" is converted to "it is".

• Removing punctuation.

• Stemming. The Porter stemmer is used.

• Synonym expansion using WordNet.
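
A sketch of this pipeline using NLTK, which provides a Porter stemmer and a WordNet interface; the library choice and the contraction table are assumptions, since the slides name only the techniques. (Requires the NLTK 'punkt' and 'wordnet' data packages.)

    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer
    from nltk.corpus import wordnet

    CONTRACTIONS = {"it's": "it is"}       # illustrative normalization table
    stemmer = PorterStemmer()

    def preprocess(sentence):
        """Lowercase, normalize, tokenize, strip punctuation, and stem."""
        text = sentence.lower()
        for short, full in CONTRACTIONS.items():
            text = text.replace(short, full)
        # drop pure punctuation tokens
        tokens = [t for t in word_tokenize(text) if any(c.isalnum() for c in t)]
        return [stemmer.stem(t) for t in tokens]

    def synonyms(word):
        """All single-word WordNet synonyms of a word, for query expansion."""
        return {lemma.name() for syn in wordnet.synsets(word)
                for lemma in syn.lemmas() if "_" not in lemma.name()}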

Page 8:

Architectural Overview

Page 9:

Closest matching

Two main problems in applying EBMT (example-based machine translation) are:

• As sentences grow longer, the number of retrieved similar sentences decreases sharply. This often results in no output when translating long sentences.

• The other problem arises from differences in style between the input sentences and the example corpus.

Page 10:

Meaning Equivalent Sentence

• A sentence that shares the main meaning with the input sentence despite lacking some unimportant information, and that contains no information beyond what is in the input sentence.

Features:

• Content Words: Words categorized as a noun, pronoun, adjective, adverb, or verb are recognized as content words. Interrogatives are also included.

• Functional Words: Words such as particles, auxiliary verbs, conjunctions, and interjections are recognized as functional words. (A POS-based sketch of this split follows.)
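
A sketch of the content/functional split using NLTK's part-of-speech tagger; this is an assumption, since the slides do not say how words are categorized. Penn Treebank tag prefixes cover the listed classes, with wh-tags for interrogatives:

    import nltk   # requires 'punkt' and 'averaged_perceptron_tagger' data

    # Nouns (NN*), pronouns (PRP*), adjectives (JJ*), adverbs (RB*),
    # verbs (VB*), and interrogatives (W*: WDT, WP, WP$, WRB).
    CONTENT_PREFIXES = ("NN", "PRP", "JJ", "RB", "VB", "W")

    def split_words(sentence):
        """Return (content_words, functional_words) for a sentence."""
        content, functional = [], []
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            if tag.startswith(CONTENT_PREFIXES):
                content.append(word)
            else:
                functional.append(word)
        return content, functional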

Page 11:

Matching and Ranking:

• # of identical content words

• # of synonymous words

• # of common functional words

• # of different functional words

• # of different content words

Page 12:

Algorithm

• Given a query and a sentence, the matching score is calculated as follows:

• Get content words of the query (A)

• Get functional words of the query (B)

• Get synonyms of the content words of the query (C)

• Get content words of the sentence (D)

• Get functional words of the sentence (E)

• E1) Identical content words = Number of matching words between A and D

• E2) Identical synonymous words = Number of matching words between C and D

• E3) Identical functional words = Number of matching words between B and E

• Different content words = #(A) + #(D) - 2*(E1)

• Different functional words = #(B) + #(E) - 2*(E3)

• Weights are assigned to the above quantities and the total score is computed as their weighted sum (sketched below).
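
Putting the steps together, a sketch of the scorer. split_words() and synonyms() are the hypothetical helpers sketched on the earlier slides, and the weights are illustrative placeholders; the slides do not give values.

    def matching_score(query, sentence, weights=(1.0, 0.8, 0.3, 0.3, 1.0)):
        """Weighted combination of the five quantities listed above."""
        qc, qf = split_words(query)          # A, B
        sc, sf = split_words(sentence)       # D, E
        A, B, D, E = set(qc), set(qf), set(sc), set(sf)
        C = {s for w in A for s in synonyms(w)}   # synonyms of query content words

        e1 = len(A & D)                      # identical content words
        e2 = len(C & D)                      # synonymous words
        e3 = len(B & E)                      # common functional words
        diff_content = len(A) + len(D) - 2 * e1
        diff_functional = len(B) + len(E) - 2 * e3

        w1, w2, w3, w4, w5 = weights         # w4, w5 penalize differences
        return (w1 * e1 + w2 * e2 + w3 * e3
                - w4 * diff_functional - w5 * diff_content)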

Page 13:

Sequence Matcher

• An improvement of the 'gestalt pattern matching' approach proposed by Ratcliff and Obershelp.

• Idea: find the longest contiguous matching subsequence that contains no "junk" elements, then recursively apply the same procedure to the parts to the left and right of the matching subsequence.

• Although it does not yield minimal edit sequences, it tends to yield matches that "look right" to people.

• Time complexity: cubic in the length of the strings in the worst case. (A usage sketch follows.)
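
The description above matches Python's difflib.SequenceMatcher, which implements this extended Ratcliff/Obershelp matching; assuming that is the matcher in use, a short usage sketch:

    from difflib import SequenceMatcher

    a = "where can i buy a ticket to the station"
    b = "where can we buy tickets to the train station"

    sm = SequenceMatcher(None, a, b)   # None disables the junk-element filter
    print(sm.find_longest_match(0, len(a), 0, len(b)))   # anchor block
    print(round(sm.ratio(), 3))        # 2*M / (len(a) + len(b)), in [0, 1]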

Page 14:

PATTERN MATCHING: THE GESTALT APPROACH

• Gestalt - describes how people can recognize a pattern as a functional unit that has properties not derivable by summation of its parts.

Page 15:

Example for Gestalt Approach

Page 16:

The Ratcliff/Obershelp Pattern-Matching Algorithm

• Works along the same lines as the example mentioned above.

• First, locate the largest group of characters in common.

• Using this as an anchor, recursively find the largest group of common characters in the parts of both strings to the left of the anchor, and likewise in the parts to the right.

Page 17:

The Ratcliff/Obershelp Pattern-Matching Algorithm (contd.)

• Returns a score reflecting the percentage match.

• Score = 2*(# of characters matched) / [len(string1) + len(string2)]

A higher score implies a higher matching percentage.
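
For completeness, a from-scratch sketch of this scoring under the two steps described above; difflib's production version adds junk handling and caching, which are omitted here:

    def longest_common_block(a, b):
        """Longest contiguous block common to a and b: (start_a, start_b, length)."""
        best = (0, 0, 0)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                    k += 1
                if k > best[2]:
                    best = (i, j, k)
        return best

    def matched_chars(a, b):
        """Anchor on the longest common block, then recurse on both sides."""
        i, j, k = longest_common_block(a, b)
        if k == 0:
            return 0
        return (k + matched_chars(a[:i], b[:j])
                  + matched_chars(a[i + k:], b[j + k:]))

    def ratcliff_obershelp(a, b):
        """Score = 2 * matches / (len(a) + len(b)); 1.0 means identical."""
        return 2.0 * matched_chars(a, b) / (len(a) + len(b)) if a or b else 1.0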