tools for arabic people names processing and retrieval - ali salhi & adnan yahya


Upload: arabicontology

Post on 25-Jun-2015


Page 1: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

TOOLS FOR ARABIC PEOPLE NAMES PROCESSING AND RETRIEVAL A STATISTICAL APPROACH

By

Ali Salhi

Adnan Yahya

October 30, 2011

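اللغة العربية بين الأتمتة والفلسفة في جامعة بيرزيت (The Arabic Language between Automation and Philosophy, at Birzeit University)

دراسات منطقية وفلسفية وحاسوبية في اللغة العربية (Logical, Philosophical and Computational Studies in the Arabic Language)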

Page 2: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

OUTLINE

Motivation and Background.

What are we trying to build?

Names Tools Resources and Construction.

Names Tables Filtration.

Names Methods and Tools.

Conclusions.

Page 3: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

MOTIVATION AND BACKGROUND

One of the many problems in Arabic content processing is related to people names.

People names are used in user profiles, registrations, articles, forms, etc., and they come with many problems, such as differing spellings and translated forms.

Other problems: Name expansion and correction in search queries.

Names are frequent in searches/retrieval.

Part of “named entities” (locations, etc.).

Page 4: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

WHAT ARE WE TRYING TO BUILD?

Different names tables with frequency, gender and translation attributes: Infrastructure.

Arabic people names processing tools such as:

Names Gender Detector.

Names Translation tool.

Names Correction tool.

Names Auto Suggestion tool.

Names Extraction tool.

Page 5: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES TOOLS RESOURCES AND CONSTRUCTION

For the names processing tools we employ a statistical/corpus-based approach.

We built tables for names use, gender and translation.

Data were obtained from two sources with different formats (with privacy precautions):

Palestinian General High School Certificate Exam (Tawjihi) student lists for the years 2005 and 2007-2010.

Birzeit University student and employee records (2003-2010).

Page 6: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES TOOLS RESOURCES AND CONSTRUCTION: DIFFERENT FORMATS OF SOURCE DATA

The Palestinian Tawjihi list was obtained from the Palestinian Ministry of Education in “xls” (Microsoft Excel) format, with student names (first, second, third, last/family), city, school and score attributes.

The Birzeit list was obtained from Birzeit University with the following features:

The list contains a bag of student name tuples.

Each tuple may be a first name, father name, grandfather name or a family name.

Each tuple (repetition allowed) holds a translation to English as well as the “gender” (Male, Female or Family).

Page 7: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES TABLES AND FILTRATION

The following tables were obtained:

Male Names Table.

Female Names Table.

Family Names Table.

Names Translation Table.

General Names Table.

Page 8: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES TABLES AND FILTRATION: MALE NAMES TABLE

This table holds male names only. It was built by selecting all names with gender equal to male from the Birzeit list, then adding the male names from the Tawjihi list.

Since the Tawjihi list lacks gender, we first did the following:

Parsed the 2nd and 3rd fields of the student names. (Those are considered male by default, as father/grandfather names.)

The male names thus obtained were used to filter male names out of the 1st names (a 1st name can be male or female): a 1st name that also appears in the 2nd or 3rd field is taken as male.

Remainder: 1st names \ (2nd names + 3rd names), i.e. the 1st names that never appear in the 2nd or 3rd field.
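The filtering step above can be sketched with plain set operations. This is a minimal illustration, not the authors' code: the function name `split_first_names`, the tuple layout and the sample rows are assumptions made for the example.

```python
def split_first_names(records):
    """records: iterable of (first, second, third, family) name tuples."""
    firsts, fathers = set(), set()
    for first, second, third, _family in records:
        firsts.add(first)
        # 2nd and 3rd fields are father/grandfather names: male by default.
        fathers.update((second, third))
    male_firsts = firsts & fathers   # 1st names also seen as a 2nd/3rd name
    remainder = firsts - fathers     # 1st names never seen as a 2nd/3rd name
    return fathers, male_firsts, remainder


# Made-up rows for illustration only; not taken from the real lists.
sample = [
    ("رامي", "محمد", "خالد", "حمدان"),
    ("هند", "رامي", "محمد", "حمدان"),
]
fathers, male_firsts, remainder = split_first_names(sample)
```

Here `male_firsts` holds the first names confirmed as male, and `remainder` holds the first names that still need a decision.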


Page 9: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

MALE NAMES TABLE (CONT …)

Problem: Names such as نور, جهاد, ضياء may be female or male (multi-classification).

To give a fair judgment about such names we assumed the following:

Any name is considered to have the “male” classification if it appears as a 2nd or 3rd name in the Tawjihi list (regardless of whether it appears in the 1st field of that list).

If a name appears in the 1st field only (as a first name), it is considered to be a female name.

Multi-classification names: any 2nd or 3rd field name that is considered female in the Birzeit list [and is found as a 1st name in the Tawjihi list] is considered a multi-classification name.

The number of distinct male names processed is 3570.
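One way to read the three assumptions above as code is shown below. It is only a sketch: the function `classify`, its set-valued arguments and the sample sets are hypothetical names introduced for illustration.

```python
def classify(name, tawjihi_first, tawjihi_fathers, birzeit_female):
    """Return (gender, is_multi) for a name under the assumptions above."""
    if name in tawjihi_fathers:
        # Seen as a 2nd/3rd name: male, regardless of the 1st field.
        is_multi = name in birzeit_female and name in tawjihi_first
        return "male", is_multi
    if name in tawjihi_first:
        # Seen in the 1st field only: treated as female.
        return "female", False
    return None, False


# Made-up sets for illustration only.
tawjihi_first = {"نور", "هند", "رامي"}
tawjihi_fathers = {"نور", "رامي", "محمد"}
birzeit_female = {"نور", "هند"}

result_nour = classify("نور", tawjihi_first, tawjihi_fathers, birzeit_female)
result_hind = classify("هند", tawjihi_first, tawjihi_fathers, birzeit_female)
```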


Page 10: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

MALE NAMES TABLE (CONT …)

Male names are processed to have a table of unique names with frequencies.

Top 20 Male Names

Item Name Frequency Item Name Frequency

1 محمد 41280 11 مصطفى 5031

2 محمود 15662 12 موسى 4649

3 أحمد 11752 13 خالد 4199

4 ابرهيم 9287 14 سليمان 4042

5 حسن 8359 15 سعيد 3897

6 علي 8008 16 عبد الله 3893

7 يوسف 7965 17 جمال 3442

8 احمد 7714 18 اسماعيل 3438

9 خليل 5483 19 صالح 3431

10 حسين 5341 20 عمر 3093

Page 11: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

FEMALE NAMES TABLE.

How was the table built?

Selecting female names from Birzeit list with gender “female”.

Adding all female names found in Tawjihi lists.

Adding all multi classification names with female gender in Birzeit list and in 1st field in Tawjihi list.

Number of distinct female names processed is 2633.


Page 12: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

FEMALE NAMES TABLE (CONT …)

Female names are processed to only save unique names with a frequency counter.

Top 20 Female Names

Item Name Frequency Item Name Frequency

1 ايمان 2177 11 هبة 1178

2 دعاء 2034 12 نداء 1065

3 الاء 1998 13 سماح 1037

4 ولاء 1673 14 روان 1030

5 حنين 1663 15 هديل 1015

6 اسماء 1506 16 مريم 946

7 اسراء 1297 17 حنان 943

8 فداء 1268 18 فاطمة 912

9 ياسمين 1218 19 صابرين 875

10 عبير 1190 20 اماني 871

Page 13: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

FAMILY NAMES TABLE

How was the table built?

Merge all names with gender equal to family in Birzeit list, with:

All 4th (family) field names in Tawjihi lists, then

Subtract male/female names (from respective tables).

The total number of distinct family names is 11209.
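The merge-then-subtract construction above might look like the following sketch. It assumes that "subtract" means dropping any name already present in the male or female tables, which is one possible reading; the function name and the sample data are made up for illustration.

```python
from collections import Counter


def build_family_table(birzeit_family, tawjihi_family, male_names, female_names):
    """Merge both family-name sources, then drop known male/female names."""
    counts = Counter(birzeit_family)
    counts.update(tawjihi_family)          # 4th (family) field of Tawjihi rows
    for name in set(male_names) | set(female_names):
        counts.pop(name, None)             # assumed meaning of "subtract"
    return counts


# Made-up inputs for illustration only.
family_table = build_family_table(
    birzeit_family=["حمدان", "النجار"],
    tawjihi_family=["حمدان", "علي", "مصري"],
    male_names={"علي"},
    female_names=set(),
)
```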

Page 14: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

FAMILY NAMES TABLE (CONT …)

Family names are processed to only save unique names with a frequency counter.

Top 20 Family Names

Item Name Frequency Item Name Frequency

1 تكروري 952 11 مصري 268

2 حلواني 940 12 جرار 208

3 النجار 450 13 حروب 208

4 عاصي 438 14 الشاعر 203

5 دراغمه 356 15 ربايعة 198

6 بشارات 335 16 رجوب 181

7 جرادات 319 17 سويطي 177

8 دويكات 318 18 صالحات 175

9 المصري 308 19 شويكي 170

10 ابو الرب 280 20 صوافطه 162

Page 15: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

ENGLISH TRANSLATION TABLE

This table holds Arabic names, their different translated forms, and a frequency counter for each translated form of a given name.

Example: The name محمد and its top 20 different English translated forms:

Item Name Freq Item Name Freq

1 Mohammad 5513 11 Mohmad 8

2 Muhammad 783 12 Moh'd 8

3 Mohammed 181 13 Mohamd 5

4 Mohamad 168 14 Mohmmed 5

5 Mohummad 157 15 Mouhamad 4

6 Mohamed 44 16 Mouhammad 4

7 Mohmmad 20 17 Mhamad 4

8 Mohammd 12 18 Mhammed 3

9 Muhamad 11 19 Mhmmad 3

10 Muhammed 11 20 Mhmmed 3

Page 16: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

GENERAL NAMES TABLE

This is a large table holding all names appearing in Birzeit and Tawjihi lists and is a merge of the male, female and family tables.

For each name one gender is assigned as well as a frequency of appearance.

If a name has more than one classification/gender then the frequencies of occurrences in all classifications are summed and assigned to the classification with the highest frequency.

Page 17: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

GENERAL NAMES TABLE (CONT …)

Example: The name نور can be used as a male name and as a female name. As a male name نور has a frequency of 32, and as a female name it has a frequency of 847.

Both frequencies are summed (32+847=879) and the gender female is given to the name (is this always fair?).

A multi-classification flag is added where needed to indicate that the name has more than one classification.
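The merging rule for the general table can be written in a few lines. The sketch below uses the frequencies from the example above; the function name `merge_classifications` and the dictionary layout are assumptions.

```python
def merge_classifications(freq_by_gender):
    """Sum all frequencies and assign them to the most frequent gender."""
    total = sum(freq_by_gender.values())
    gender = max(freq_by_gender, key=freq_by_gender.get)
    is_multi = len(freq_by_gender) > 1     # multi-classification flag
    return gender, total, is_multi


merged = merge_classifications({"male": 32, "female": 847})
# merged == ("female", 879, True)
```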


Page 18: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES METHODS AND TOOLS

Names Correction.

Name Gender Detector (NGD).

Names Translation.

Names Auto Suggestion.

Names extractor.


Page 19: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

ERROR CORRECTION IN NAMES

Two types of errors are common:

Errors in names with multiple forms.

Errors in compound names.

Page 20: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

ERROR CORRECTION (CONT…): NAMES WITH DIFFERENT FORMS ERRORS

Fixing common misspellings/errors means looking for the best form to represent a name.

For us, the best form is the one with the highest frequency (is this fair? Democracy!).

Example: The name احمد has three different forms (أحمد, احمد, إحمد); however, أحمد is the one with the highest frequency, so أحمد is considered to be the correct form. The frequencies of occurrence of احمد and إحمد are summed and added to the frequency of أحمد.

Page 21: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

ERROR CORRECTION (CONT…): NAMES WITH DIFFERENT FORMS ERRORS

Levenshtein Distance (LD): a measure of how far names (or forms) are from each other (# of Edits).

We use LD to select “the correct form” from a group of possible names.

The group is a list of names with n-letter difference (built using LD and our general table).

Based on studies of common errors, we use the commonly confused letters (أ, ا, إ, آ, و, ي/ى, ه, ة) to find which name must be selected (or given preference).
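For reference, the Levenshtein distance and the grouping step can be sketched as follows. This is the standard dynamic-programming formulation; the helper `group_for` and the sample name list are illustrative assumptions, not the authors' implementation.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def group_for(key, names, n=1):
    """All table names within n edits of the key."""
    return [name for name in names if levenshtein(key, name) <= n]


# Made-up slice of a names table for illustration only.
names = ["ديمه", "ديما", "ريمه", "ديمة", "سيمه", "محمد"]
group = group_for("ديمه", names)
```

The sketch compares code points, so it assumes names are stored without diacritics and in precomposed form.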

Page 22: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

ERROR CORRECTION: SIMPLE EXAMPLE

Assume a group of (ديمه, ديما, ريمه, ديمة, سيمه) and the group key (incorrect entry) ديمه.

سيمه and ريمه can be dropped (they differ from the key in letters that are not common error letters (أ, ا, إ, آ, و, ي/ى, ه, ة)).

We end up with (ديما, ديمة, ديمه); then ديما is selected for having the highest frequency, and the frequencies of the other two forms are summed and added to the frequency of ديما.
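The simple example can be reproduced with the sketch below. Two simplifications are assumed: any common error letter is treated as interchangeable with any other, and only same-length forms are compared. The frequencies are invented for the demonstration and are not the real table values.

```python
COMMON_ERROR_LETTERS = set("أاإآويىهة")


def differs_only_by_common_letters(key, candidate):
    """True if every mismatching position involves two common error letters."""
    if len(key) != len(candidate):
        return False
    return all(a == b or (a in COMMON_ERROR_LETTERS and b in COMMON_ERROR_LETTERS)
               for a, b in zip(key, candidate))


def pick_correct_form(key, group, freq):
    """Keep plausible forms, select the most frequent, and sum the frequencies."""
    kept = [n for n in group if differs_only_by_common_letters(key, n)]
    if not kept:
        return None, 0
    best = max(kept, key=lambda n: freq.get(n, 0))
    return best, sum(freq.get(n, 0) for n in kept)


# Invented frequencies for illustration only.
freq = {"ديما": 50, "ديمة": 20, "ديمه": 5, "ريمه": 7, "سيمه": 1}
group = ["ديمه", "ديما", "ريمه", "ديمة", "سيمه"]
best, total = pick_correct_form("ديمه", group, freq)
```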


Page 23: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

ERROR CORRECTION: COMPLEX EXAMPLE

Assume a name like اسامه, which differs by 2 from the name أسامة and so won’t be in the same group.

To fix that, we rejoined groups that have common elements with potential common errors.

اسامه has a group of four forms after joining two smaller groups, (اسامه, أسامه, اسامة) and (أسامة, أسامه).

أسامة has the largest frequency of appearance, so it is selected, and its final frequency will be the sum of all frequencies.
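Rejoining groups that share an element is a small set-merging problem. The following sketch shows one straightforward way to do it; `join_groups` is a hypothetical helper, and the third group is added only to show that unrelated groups stay separate.

```python
def join_groups(groups):
    """Merge groups that share at least one name into a single group."""
    merged = []
    for group in groups:
        group = set(group)
        # Absorb every already-merged group that overlaps with this one.
        for other in [m for m in merged if m & group]:
            group |= other
            merged.remove(other)
        merged.append(group)
    return merged


groups = [
    ["اسامه", "أسامه", "اسامة"],
    ["أسامة", "أسامه"],
    ["ديما", "ديمة"],
]
joined = join_groups(groups)
```

After joining, the most frequent member of each merged group can be selected as in the simple example.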

Page 24: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

ERROR CORRECTION: COMPLEX EXAMPLE

Consider عدير: two possible corrections are عبير and غدير.

The choice depends on 3 components:

• Enhanced Names Table (our dictionary).

• Levenshtein distance.

• Ranking system: given the error sources, we used a ranking system based on a combination of 4 elements:

1. Frequency of Appearance (in our lists).

2. Shape Similarity (of letters).

3. Location Measurement (keyboard).

4. Soundex Function (sound similarity).

Page 25: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES CORRECTION TOOL (CONT…): FORMS RANKING

General Ranking Equation:

Rank(word) = A*Frequency + B*ShapeSimilarity + C*LetterLocation + D*Soundex.

A, B, C and D are percentage weights that sum to 100%.

Consider A = 0.50, B = 0.20, C = 0.25 and D = 0.05.

Frequency, ShapeSimilarity, LetterLocation and Soundex are parameters of the forms of a name, with the obvious interpretation.

The chosen values for A, B, C and D are not necessarily the best. They are based on experimentation and thus need more testing to decide the best range (or values).
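The ranking equation is a weighted sum, sketched below with the weights quoted above. The slides do not say how each component is scaled, so the sketch assumes every component score is already normalized to the range 0 to 1; the candidate scores are invented purely to exercise the function.

```python
# Weights A, B, C, D from the slide; they sum to 1.0 (100%).
WEIGHTS = {"frequency": 0.50, "shape": 0.20, "location": 0.25, "soundex": 0.05}


def rank(scores, weights=WEIGHTS):
    """Weighted sum of the four component scores (assumed to be in [0, 1])."""
    return sum(weights[k] * scores[k] for k in weights)


def rank_candidates(candidates):
    """Candidate names ordered from best to worst rank."""
    return sorted(candidates, key=lambda name: rank(candidates[name]), reverse=True)


# Invented component scores, for illustration only.
candidates = {
    "عبير": {"frequency": 1.0, "shape": 0.2, "location": 0.3, "soundex": 0.5},
    "غدير": {"frequency": 0.4, "shape": 0.9, "location": 0.2, "soundex": 0.8},
}
ordered = rank_candidates(candidates)
```

Because Frequency carries half of the total weight, a very frequent name can outrank a candidate that is closer in shape or sound.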


Page 26: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES CORRECTION TOOL (CONT…)

Some test samples:

# | Input | Output(s)

1 | ديم | ريم, ديما, كيم, نديم

2 | شوشن | سوسن, شوكت, سوزان, روان

3 | خاقلين | تالين, جاكلين, مارلين, كاثلين, مادلين

4 | اقراجيم | إبراهيم

5 | اية | راية, آية

6 | نوزالدين | نور الدين

7 | رمري | رمزي, رازي

8 | غبير | عبير, غدير

Page 27: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES CORRECTION TOOL (CONT…): TEST RESULTS

General test (each test consists of 100 misspelled names):

# | Test Type | Pass Percentage

1 | Speed Writing (test 1) | 87%

1 | Speed Writing (test 2) | 84%

1 | Speed Writing (test 3) | 85%

2 | Auto-generated errors: one error | 91%

2 | Auto-generated errors: two errors | 79%

2 | Auto-generated errors: three errors | 70%

Page 28: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES METHODS AND TOOLS: NAME GENDER DETECTOR (NGD)

NGD: A tool that classifies an input name as Male, Female or Family.

How does it work?

The NGD tool receives the name and issues a query to check the existence of the name in the enhanced names table.

If found, NGD returns the gender and its percentage of the whole names lists.

If not, it returns a null result (no results found), and the tool pushes the input string to the correction tool to check whether the “not found” result was due to a spelling/common error.

It can work in reverse: given the gender, limit the correction/suggestion to names of that gender.
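The lookup-then-fallback flow of the NGD can be sketched as below. The table layout (name mapped to a gender and a frequency), the `correct` callback and the reading of "percentage" as the name's share of all table frequencies are assumptions; the two table entries reuse frequencies quoted earlier in these slides, but the table itself is a toy.

```python
def detect_gender(name, table, correct=None):
    """Return ((gender, percentage), []) if found, else (None, suggestions)."""
    entry = table.get(name)
    if entry is None:
        # Not found: hand the input to a correction step, if one is supplied.
        suggestions = correct(name) if correct else []
        return None, suggestions
    gender, freq = entry
    total = sum(f for _gender, f in table.values())
    return (gender, 100.0 * freq / total), []


# Toy table for illustration only.
toy_table = {"محمد": ("male", 41280), "نور": ("female", 879)}

found, _ = detect_gender("نور", toy_table)
missing = detect_gender("نوز", toy_table, correct=lambda n: ["نور"])
```

The reverse mode (restricting suggestions to a given gender) is not covered by this sketch.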


Page 29: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES METHODS AND TOOLS: NAMES TRANSLATION TOOL

The names translation tool finds the correct (or widely accepted) English translation of a given name.

Many Arabic names have different equivalent English forms, as seen in the following table:

# Arabic Name English Translation Freq

1 سمير Samir 299

1 سمير Sameer 85

2 نورا Noura 19

2 نورا Nora 7

2 نورا Nura 5

2 نورا Noora 3

3 رياض Riyad 148

3 رياض Riad 24

3 رياض Reyad 8

4 أحمد Ahmad 1875

4 أحمد Ahmed 48

4 أحمد Ahamad 6

5 مؤيد Mo'ayad 10

5 مؤيد Mu'ayad 9

5 مؤيد Moayad 5

5 مؤيد Mu'ayyad 5

5 مؤيد Mo'ayyad 3

5 مؤيد Muayad 3

Page 30: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES TRANSLATION TOOL (CONT…)

The translation tool searches in the English Translation table and builds a table that holds all possible translations sorted in a descending order of frequency of use.

Usually we output the top 3 translations, to give the user a choice if needed, with the default being the most frequent form.
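The top-3 lookup amounts to sorting one name's translated forms by frequency. A minimal sketch, using the نورا counts from the translation table above; the nested-dictionary layout and the function name are assumptions.

```python
def top_translations(name, table, k=3):
    """The k most frequent English forms of a name, most frequent first."""
    forms = table.get(name, {})
    return sorted(forms, key=forms.get, reverse=True)[:k]


translations = {"نورا": {"Noura": 19, "Nora": 7, "Nura": 5, "Noora": 3}}
choices = top_translations("نورا", translations)
default = choices[0] if choices else None
```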


Page 31: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

AUTO SUGGESTION TOOL

A general autosuggestion tool for names, which can be used in applications where name entry is needed.

It suggests names while typing (completion? Not quite).

The challenge is to guess intended names even when users start incorrectly. For example, a user wants to enter أحمد but starts with ا, not أ, and thus will never reach أحمد by completion.

Solution: a modification of the user input is needed; the tool automatically considers the possibility of changing the first letter (ا to أ, آ or إ) and then waits for the next letter.

The same applies to a middle letter. For example, for مؤيد the user might enter the name as مويد, and the tool will consider the possibility of ؤ while typing.
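One simple way to get this behavior is to fold the hamza variants to a base letter on both the typed prefix and the table names before prefix matching. This is an assumed technique that produces the same effect as trying each variant; it is not necessarily how the authors' tool works. The name list is made up and is assumed to be pre-sorted by frequency.

```python
# Fold hamza/madda variants onto their base letters before comparing.
FOLD = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ؤ": "و"})


def suggest(prefix, names, limit=5):
    """Names whose folded form starts with the folded prefix."""
    folded = prefix.translate(FOLD)
    matches = [n for n in names if n.translate(FOLD).startswith(folded)]
    return matches[:limit]


# Made-up list, assumed to be ordered by descending frequency.
names = ["محمد", "أحمد", "احسان", "مؤيد", "موسى"]
for_alef = suggest("اح", names)
for_waw = suggest("موي", names)
```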


Page 32: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES EXTRACTION TOOL

Names extraction is a method to isolate people names (full and single) from an Arabic text.

Since names may be misspelled, the reference table in use is the general table (which has all forms of all the names).

How does it work?

The extraction function first parses the text, comparing words with the general names table entries.

If the table has the word, then the function directly parses and checks three words ahead (word+1, word+2, word+3) to detect full names and single names.

The series of words is compared with predefined name types (<male[0], male[1], … male[i] || family>, <female, male[1], … male[i] || family>).

Page 33: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES EXTRACTION TOOL (CONT…)

Examples:

رامي محمد حمدان matches <male, male, family>.

رامي هند محمد حمدان doesn’t match anything and needs splitting.

هند محمد حمدان matches <female, male, family>.

رامي | هند محمد حمدان matches <male>, <female, male, family>.

رامي | هند محمد حمدان matches: single name, full name.
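The pattern check can be sketched as a greedy scan over a four-word window (the word plus three words ahead). The function names, the `gender_of` dictionary and the greedy strategy are assumptions made for illustration; the sample words come from the examples above.

```python
def match_length(window, gender_of):
    """Length of the name starting at window[0], or 0 if it is not a name."""
    tags = [gender_of.get(w) for w in window]
    if not tags or tags[0] not in ("male", "female"):
        return 0
    n = 1                                   # a single name so far
    while n < len(tags) and tags[n] == "male":
        n += 1                              # father/grandfather names
    if n < len(tags) and tags[n] == "family":
        n += 1                              # optional closing family name
    return n


def extract_names(words, gender_of):
    """Scan the text and return each detected name as a list of words."""
    names, i = [], 0
    while i < len(words):
        n = match_length(words[i:i + 4], gender_of)
        if n:
            names.append(words[i:i + n])
            i += n
        else:
            i += 1
    return names


gender_of = {"رامي": "male", "محمد": "male", "هند": "female", "حمدان": "family"}
split_result = extract_names("رامي هند محمد حمدان".split(), gender_of)
```

In this sketch the four-word sequence is split into a single name followed by a full name, which mirrors the last example above.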

Page 34: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

NAMES EXTRACTION TOOL(CONT…)

Not every string matching a name is considered a name. For example, the name جميل might be an adjective, not a person name.

To consider a word to be a single name (when it can double as a name), one of the following rules must apply:

It appears more than N times in the text (currently N = 3).

It appears in a full name in the text; for example, given جميلة سمير النتشة, then جميلة and its other form (جميله) and سمير will be detected.

It appears in an “anded” series: علي و ذكي و سامي و انس ذهبوا إلى الجامعة (Ali, Zaki, Sami and Anas went to the university).

It appears after a defining term (such as السيد, الدكتور, الآنسة, etc.).
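The four rules can be combined into a single predicate, sketched below. Several simplifications are assumed: the conjunction و is a separate token (as it is written in the example above), any neighbor of و counts as part of an "anded" series, and `full_name_words` is the set of words already found inside full names. All names in the code are hypothetical.

```python
from collections import Counter

TITLES = {"السيد", "الدكتور", "الآنسة"}


def is_single_name(word, words, full_name_words, n=3):
    """Apply the four rules to a word that also exists in the names table."""
    if Counter(words)[word] > n:            # rule 1: frequent in the text
        return True
    if word in full_name_words:             # rule 2: part of a full name
        return True
    for i, w in enumerate(words):
        if w != word:
            continue
        prev = words[i - 1] if i > 0 else None
        nxt = words[i + 1] if i + 1 < len(words) else None
        if prev == "و" or nxt == "و":       # rule 3: "anded" series
            return True
        if prev in TITLES:                  # rule 4: after a defining term
            return True
    return False


text = "علي و ذكي و سامي و انس ذهبوا إلى الجامعة".split()
```

A stricter version would also require the word on the other side of و to be a known name.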


Page 35: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

POSSIBLE USES OF DEVELOPED TOOLS

Our tools should be useful in form filling and data entry, as well as for batch processing of existing name lists (say, correction, translation).

They can be incorporated into search tools/engines to make sure that misspelled occurrences of a name and multilingual forms are accounted for.

Reporting on individuals by detecting name occurrences in documents.

The statistical basis can be overridden by expert knowledge in the field of correct spelling.


Page 36: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

CONCLUSIONS

We presented some useful tools that can help in processing people names in digital documents and web content.

Our work aimed to design and deploy query/forms pre-processing name tools able to efficiently process and identify Arabic people names in queries and documents.

We employed a statistical/corpus-based approach and constructed databases that contain names from different resources.

The source data give the results some regional accent; this may be rectified.

Testing is promising, though more is needed.

Page 37: Tools For Arabic People Names Processing And Retrieval - Ali Salhi & Adnan Yahya

Thank you
