Name Conflict Resolution for Company Registration 8/30/2013

Upload: gaurav-goyal

Post on 12-Jul-2015



Introduction

• A system to automate the company registration process

• Compares company names using string matching algorithms

• Ranks names according to their similarity percentage

• Rejects a name if its similarity score is 100%

Objectives

• To develop a system that resolves naming conflicts.

• To find existing names similar to the name proposed by the user.

• To rank the existing names by how closely they match the proposed name.

Methodology

• Building Base Dictionary

• Keyword Generation

• Finding Possible Matches

• Finding Duplicates

• Finding Ranks

System Design


User Input

“Centre Nepal Metals Industries”

Preprocessing Engine


Downcasting

• The act of converting uppercase letters in the input to lowercase

“Centre Nepal Metals Industries” → “centre nepal metals industries”

Transformation

• Conversion of British English words to American English words

“centre nepal metals industries” → “center nepal metals industries”
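The downcasting and transformation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the entries of `BRITISH_TO_AMERICAN` are assumptions, since the deck only states that 144 British-American word pairs are used.

```python
# Hypothetical excerpt of the British-to-American word list; the deck's actual
# 144 word pairs are not shown, so these entries are assumptions.
BRITISH_TO_AMERICAN = {
    "centre": "center",
    "colour": "color",
    "labour": "labor",
}


def downcast(name: str) -> str:
    """Convert every uppercase letter in the name to lowercase."""
    return name.lower()


def transform(name: str) -> str:
    """Replace each British spelling that has an entry in the word list."""
    return " ".join(BRITISH_TO_AMERICAN.get(word, word) for word in name.split())


print(transform(downcast("Centre Nepal Metals Industries")))
# prints: center nepal metals industries
```

Downcasting runs first so that the lookup table only needs lowercase keys.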

Stopword Removal

• The process of removing predefined stopwords from the string literal

“center nepal metals industries” → “center nepal metals”

Tokenization

• The process of reducing a large string to a set of tokens

“center nepal metals” → center, nepal, metals
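A rough sketch of stopword removal and tokenization is given below. The stopword list is an assumption (only "industries" is confirmed by the example above), and splitting on whitespace and hyphens follows the rule stated on the Limitations slide.

```python
import re

# Hypothetical stopword list; only "industries" is confirmed by the slides.
STOPWORDS = {"industries", "pvt", "ltd", "limited"}


def tokenize(name: str) -> list:
    """Split on runs of whitespace and hyphens, dropping empty pieces."""
    return [piece for piece in re.split(r"[\s\-]+", name) if piece]


def remove_stopwords(name: str) -> str:
    """Drop every token that appears in the stopword list."""
    return " ".join(t for t in tokenize(name) if t not in STOPWORDS)


cleaned = remove_stopwords("center nepal metals industries")
print(cleaned)            # prints: center nepal metals
print(tokenize(cleaned))  # prints: ['center', 'nepal', 'metals']
```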

Stemming

• The process of reducing a word to a root, or simpler, form

Tokens    Stemmed Tokens
metals    metal
center    center
nepal     nepal
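The deck does not say which stemmer is used. As a stand-in, the sketch below strips common English plural suffixes; `naive_stem` is a hypothetical helper, far simpler than a real stemmer such as Porter's. It also reproduces the kind of error noted under Limitations: the Nepali word "bikas" (development) is wrongly cut to "bika".

```python
def naive_stem(token: str) -> str:
    """Strip simple English plural suffixes; a stand-in for a real stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


print([naive_stem(t) for t in ["center", "nepal", "metals"]])
# prints: ['center', 'nepal', 'metal']
print(naive_stem("bikas"))
# prints: bika
```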

Translation

• Conversion of the meaning of a source-language text by means of an equivalent target-language text

Stemmed Tokens: metal, center, nepal

Translated Tokens: [not preserved in the transcript]

Transliteration

• Conversion of a text from one script to another

Translated Tokens: [not preserved in the transcript]

Transliterated Tokens: dhatu, kendra, nepal

Final Token List

dhatu, metal, nepal, center, kendra
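One way the final token list could be assembled is sketched below. The dictionary `EN_NE` is a hypothetical two-entry excerpt of the English-Nepali dictionary. As a simplification, the romanized form is stored next to the Devanagari form instead of being produced by a script-conversion routine.

```python
# Hypothetical excerpt of the English-Nepali dictionary:
# English word -> (Devanagari translation, romanized transliteration).
EN_NE = {
    "metal": ("धातु", "dhatu"),
    "center": ("केन्द्र", "kendra"),
}


def expand_token(token: str) -> list:
    """Return the token plus its transliterated translation, if one is known."""
    alternatives = [token]
    entry = EN_NE.get(token)
    if entry is not None:
        _devanagari, romanized = entry
        if romanized != token:
            alternatives.append(romanized)
    return alternatives


def final_token_list(stemmed_tokens: list) -> dict:
    """Map each stemmed token to all spellings that should be searched for."""
    return {token: expand_token(token) for token in stemmed_tokens}


print(final_token_list(["center", "nepal", "metal"]))
# prints: {'center': ['center', 'kendra'], 'nepal': ['nepal'], 'metal': ['metal', 'dhatu']}
```

Tokens without a dictionary entry, such as "nepal" here, are kept unchanged.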

Database Query using Final Token List

•nepal medical centre pvt. ltd.

•nepal dhatu company

•metal nepal pvt. ltd.

•enter nepal

•nepal metal industries

•dhatu sankalan kendra

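The query step might look like the sketch below, which uses an in-memory SQLite database. The table name `companies`, its single `name` column, and the substring match with `LIKE` are all assumptions; the deck does not describe the schema or the query. The first three rows are taken from the results listed above and the fourth is an invented non-matching name.

```python
import sqlite3


def find_candidates(conn: sqlite3.Connection, tokens: list) -> list:
    """Return every stored name that contains at least one of the tokens."""
    if not tokens:
        return []
    clause = " OR ".join("name LIKE ?" for _ in tokens)
    params = [f"%{token}%" for token in tokens]
    rows = conn.execute(
        f"SELECT name FROM companies WHERE {clause} ORDER BY name", params
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?)",
    [
        ("nepal dhatu company",),
        ("metal nepal pvt. ltd.",),
        ("dhatu sankalan kendra",),
        ("himalayan tea traders",),  # invented row that should not match
    ],
)

tokens = ["center", "kendra", "nepal", "metal", "dhatu"]
for name in find_candidates(conn, tokens):
    print(name)
```

The tokens are passed as bound parameters, so a proposed name cannot inject SQL into the query.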

From Database Query Result

Nepal Medical Centre pvt. ltd.


Database Generated Keywords

nepal, medical, center

Permutation

•kendra nepal dhatu

•kendra nepal metal

•center nepal dhatu

•center nepal metal

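The four combinations listed above are the Cartesian product of the alternative spellings of each token. A minimal sketch, assuming the alternatives are supplied as one list per token position (`token_combinations` is a hypothetical helper name):

```python
from itertools import product


def token_combinations(alternatives: list) -> list:
    """Build every name obtained by picking one spelling per token position."""
    return [" ".join(choice) for choice in product(*alternatives)]


combos = token_combinations([["kendra", "center"], ["nepal"], ["dhatu", "metal"]])
for combo in combos:
    print(combo)
# prints:
# kendra nepal dhatu
# kendra nepal metal
# center nepal dhatu
# center nepal metal
```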

Comparison

3 steps

1. Levenshtein Distance Calculation

2. Optimized Maximal Similarity using Hungarian Algorithm

3. Sørensen Index to Calculate Similarity %

Levenshtein Distance Calculation

Rows: proposed tokens. Columns: database generated keywords.

          nepal    medical    center
center    5        6          0
nepal     0        4          5
metal     2        3          4
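The distances for the proposed tokens (center, nepal, metal) against the keywords (nepal, medical, center) can be reproduced with the standard dynamic-programming form of the Levenshtein distance. The sketch below is a generic textbook version, not necessarily the code the authors used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


proposed = ["center", "nepal", "metal"]
keywords = ["nepal", "medical", "center"]
distances = [[levenshtein(p, k) for k in keywords] for p in proposed]
print(distances)
# prints: [[5, 6, 0], [0, 4, 5], [2, 3, 4]]
```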

Similarity Weight Calculation

Rows: proposed tokens. Columns: database generated keywords.

          nepal    medical    center
center    1        1          6
nepal     5        3          1
metal     3        4          2
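The slides do not state the weight formula. Every weight shown is, however, consistent with taking the length of the longer token minus the Levenshtein distance, so the sketch below assumes that rule. The function name `similarity_weight` is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,
                    current[j - 1] + 1,
                    previous[j - 1] + (char_a != char_b),
                )
            )
        previous = current
    return previous[-1]


def similarity_weight(a: str, b: str) -> int:
    """Assumed rule: length of the longer token minus the edit distance."""
    return max(len(a), len(b)) - levenshtein(a, b)


proposed = ["center", "nepal", "metal"]
keywords = ["nepal", "medical", "center"]
weights = [[similarity_weight(p, k) for k in keywords] for p in proposed]
print(weights)
# prints: [[1, 1, 6], [5, 3, 1], [3, 4, 2]]
```

Under this rule an identical pair scores its full length and an entirely different pair scores close to zero.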

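Bipartite Graph

[Figure: bipartite graph connecting the proposed tokens (center, nepal, metal) to the database generated keywords (nepal, medical, center), with the similarity weights as edge labels]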
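The Hungarian algorithm finds the pairing of tokens that maximizes the total similarity weight. For a 3 x 3 matrix the same optimum can be found by trying every pairing, which is what the sketch below does. It is a brute-force stand-in and not the Hungarian algorithm itself; for larger inputs a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def best_assignment(weights: list) -> tuple:
    """Exhaustively search for the row-to-column pairing with the largest total."""
    rows, cols = len(weights), len(weights[0])
    if rows > cols:
        raise ValueError("expects no more rows than columns")
    best_total, best_pairs = -1, []
    for perm in permutations(range(cols), rows):
        total = sum(weights[row][col] for row, col in enumerate(perm))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(perm))
    return best_total, best_pairs


# Similarity weights from the slides: rows are center, nepal, metal;
# columns are nepal, medical, center.
weights = [[1, 1, 6], [5, 3, 1], [3, 4, 2]]
total, pairs = best_assignment(weights)
print(total)  # prints: 15
print(pairs)  # prints: [(0, 2), (1, 0), (2, 1)]
```

The optimal pairing matches center with center, nepal with nepal, and metal with medical.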

Sørensen Similarity
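The Sørensen index of two sets A and B is 2|A ∩ B| / (|A| + |B|). How the deck applies it to token weights is not preserved in the transcript. The sketch below assumes that the matched weight from the assignment step plays the role of the overlap and that the total character count of each name plays the role of the set size. The resulting figure is illustrative only and is not taken from the slides.

```python
def sorensen_similarity(matched_weight: int, tokens_a: list, tokens_b: list) -> float:
    """Assumed adaptation of the Sørensen index: 2 * overlap / (size_a + size_b)."""
    size_a = sum(len(token) for token in tokens_a)
    size_b = sum(len(token) for token in tokens_b)
    if size_a + size_b == 0:
        return 0.0
    return 2 * matched_weight / (size_a + size_b)


proposed = ["center", "nepal", "metal"]   # 16 characters in total
keywords = ["nepal", "medical", "center"]  # 18 characters in total
score = sorensen_similarity(15, proposed, keywords)
print(round(score * 100, 1))
# prints: 88.2
```

With this adaptation two names made of the same tokens score exactly 100%, which agrees with the rule that a name is rejected at a similarity of 100%.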

Final Ranked List

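Producing the ranked list amounts to sorting the candidates by similarity percentage and rejecting the proposed name if any candidate reaches 100%. In the sketch below the candidate names and scores are invented placeholders, not results from the deck, and `rank_candidates` is a hypothetical helper.

```python
def rank_candidates(scores: dict) -> tuple:
    """Sort candidates by descending similarity and flag an exact conflict."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    rejected = any(score >= 100.0 for _, score in ranked)
    return ranked, rejected


# Placeholder scores in percent; these values are made up for illustration.
example_scores = {"candidate a": 62.5, "candidate b": 100.0, "candidate c": 88.0}
ranked, rejected = rank_candidates(example_scores)
for name, score in ranked:
    print(f"{score:6.1f}%  {name}")
print("rejected" if rejected else "accepted")
```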

Dataset

• 111160 Registered Company Names

• 106299 Unique Reg. ID / Company Names

• 16326 Words in English-Nepali Dictionary

• 144 British-American Words for Transformation

Result Analysis

Number of Tokens vs. Computation Time (sec)

Number of Tokens    I5 CPU    Dual Core CPU
1 Token             1.664     8.959
2 Tokens            2.204     13.315
3 Tokens            11.952    39.994
4 Tokens            37.743    107.498

Limitations

• Stemming sometimes produces incorrect results if the input contains a Nepali word

• The English-Nepali dictionary does not contain enough words

• Tokenization is based on whitespace and hyphens only

• Comparison is not phonetic-based

Future Enhancements

• Use of a taxonomy for classifying the tokens

• Use of weighting measures to assign weights to tokens

• Implementation of faster searching methods

• Integration of phonetic-based similarity measures

Thank You

Gaurav Kumar Goyal 16214

Janardan Chaudhary 16216

Nimesh Mishra 16221

Sanat Maharjan 16230