Page 1: [IEEE 2010 International Conference on Emerging Technologies (ICET) - Islamabad, Pakistan (2010.10.18-2010.10.19)] 2010 6th International Conference on Emerging Technologies (ICET)

Knowledge Representation of Urdu Text Using Predicate Logic

Amjad Ali, Mohammad Abid Khan
Department of Computer Science, University of Peshawar
Peshawar, Pakistan

Abstract—Knowledge representation is a key area of research in artificial intelligence that deals with the proper storage and retrieval of knowledge for various applications. This paper shows that knowledge can be represented easily and efficiently in predicate logic. The algorithm presented here splits Urdu text/sentences into phrases/constituents and then represents these in predicate logic. It also regenerates the original sentences from the representation in order to check the accuracy of the representation. The algorithm has been tested on real Urdu text/sentences and has achieved an accuracy of 88%. Because the algorithm works on a pre-tagged input file, it achieves a high level of accuracy when the tagging is done correctly. Proper rules are therefore required for correctly tagging the input text into phrases/constituents. Given correctly tagged text, the algorithm accurately represents it in predicate logic and accurately retrieves the original text/sentences from that representation.

Keywords-Simple discourse units; Phrases; Syntax; Semantics

I. INTRODUCTION

Knowledge consists of facts, theories, concepts, procedures and relationships [1]. Knowledge is information that has been organized and analyzed to make it understandable. It is used in decision making and problem solving [1]. Knowledge representation is a key area of research in Artificial Intelligence. It is used in key areas of computational linguistics such as machine translation, question-answering, information retrieval and information extraction. Early researchers in knowledge representation focused mainly on the data/knowledge storage capabilities of a representation and gave little importance to its efficiency. Later, it was realized that knowledge should be represented in a computer in such a manner that it can be processed and retrieved accurately, easily and efficiently. The basic problem of knowledge representation is the development of a sufficiently precise notation for representing knowledge. Such a notation is referred to as a knowledge representation scheme or technique [2]. Using such a scheme, one can specify a knowledge base consisting of facts. Knowledge representation should be language independent [3].

Our basic objective is the design of a natural language analysis and generation system which encodes the syntactic, semantic and pragmatic capabilities of the system in an easily comprehensible and extensible form. This encoding should also be capable of supporting efficient algorithms for parsing and for the logical representation of a text. Achieving this objective requires a careful structural separation of the system into modules. These modules specify the possible constituent structure (syntax) of a sentence and the representation of constituents in logical form (part of semantics) [4].

II. KNOWLEDGE REPRESENTATION OF URDU TEXT AND PREDICATE LOGIC

Knowledge representation is a core area of research in artificial intelligence, especially in computational linguistics. If knowledge is represented in a computer in an easy and efficient manner, then one can obtain an efficient, comprehensible and extensible knowledge representation and generation system. It is therefore extremely important to use a knowledge representation scheme in which knowledge can be represented easily and efficiently. In predicate logic, knowledge can be represented and retrieved easily, efficiently and accurately as compared to other knowledge representation schemes [5]. It can serve as the basis of an efficient knowledge understanding and generation system for Urdu text. It is also effectively used in machine translation and question-answering. Predicate logic employs the notions of constant, variable, function, predicate, logical connectives and quantifiers to represent facts [6]. For representation in predicate logic, Urdu sentences can be split into words (e.g. nouns, verbs and adjectives) or into phrases. As the number of words or phrases in the Urdu language is mostly finite, one can easily store words or phrases for representing knowledge, e.g. in the form of text. The question then arises whether word-level storage and representation is better than phrase-level storage and representation, or vice versa.

In word-level storage and representation, all the words of the Urdu language can be stored uniquely in a dictionary, along with their grammatical categories, in files. Ambiguous words can be stored in a separate file. As the words are stored in the dictionary without any duplication, only a small dictionary or storage space is needed to store all words. This saving of storage space is the main advantage of the word-level approach. When an Urdu text or sentence is input, in this approach it is split into words and a tagger assigns a grammatical category to each word. If there is an ambiguous word, then some time overhead is involved in assigning the correct grammatical category to that word. This time consumption is a disadvantage of the word-level approach.

978-1-4244-8058-6/10/$26.00 ©2010 IEEE

In word-level representation, not all the words of the Urdu text are represented, so it becomes extremely difficult to retrieve the original text/sentences from the representation of lengthy sentences because of the missing words. Take the example of “[Urdu text not preserved]”. This sentence can be represented as

[Urdu text not preserved]

In the above simple sentence, the missing/supporting words can easily be retrieved from the representation alone, if one specifies a pattern representation rule for the specific grammar rule or tense. But if lengthy sentences are represented, word-level representation and the retrieval of missing/supporting words from the representation become very difficult and complex. Consider the example.

[Urdu text not preserved]

The representation and retrieval of the above sentence is very difficult and complex in the word-level approach, which is the main disadvantage of this approach.

In phrase-level storage and representation, phrases of the Urdu language are stored in a repository. Because the same word may be stored repeatedly as part of different phrases, a larger dictionary and more storage space are needed than in the word-level approach, which is the main disadvantage of the phrase-level approach. As whole phrases are stored in the repository, there is no ambiguity and no extra overhead is involved. This is the advantage of the phrase-level approach. If knowledge is represented in phrases/chunks, then its accuracy, precision and recall values are very high [7]. The representation of word sequences (phrases) instead of single words is effectively used in machine translation. It provides machine translation with robustness in word selection and local word reordering. A significant improvement in English-Danish machine translation has been achieved in this way [8]. Representing the source language at the phrase level, converting the source language phrases into target language phrases and reordering the target language phrases generates fluent target output [9]. Phrase-based representation and translation yields better Chinese-English, Arabic-English and Urdu-English machine translation systems [9]. Representation of text/sentences in the form of phrases produces efficient translation [10].

Consider the following examples, in which Urdu sentences are split into phrases/constituents according to some grammar rules and then represented in predicate logic according to various logical patterns. The abbreviations used are shown in TABLE I below.

TABLE I. LIST OF ABBREVIATIONS

Abbreviation    Meaning
Det             Determiner
Adj             Adjective
Pron            Pronoun
Num             Numerals
Conj            Conjunction
Neg             Negation
Posp            Postposition
Adv             Adverb
V               Verb
VC              Verb Command
N               Noun
NP              Noun Phrase
VP              Verb Phrase
AP              Adjective Phrase
NPP             Noun Postposition Phrase
VPP             Verb Postposition Phrase
APP             Adjective Postposition Phrase
PP              Postposition Phrase

Example 1

[Is] [din] [h m ri:][m k d s][s rz mi:n] [p r] [h m r ] [p r m] [peIhli:] [b r] [fIz ] [meIn] [s rb l ñd] [h : ].

“That day for the first time, our flag was hoisted on our holy land”.

The tagged version of the above Urdu sentence is given below:

(NP) (PP) (NP) (PP) (V)

Grammar Rule 1: NP PP NP PP V.

Logical Pattern 1: V (PP, NP, PP, NP).

Logical Representation:

[Urdu text not preserved]

In this example, only one predicate (V) is used.

Example 2

[p rk] [meIn][t kri:b n] [eIk] [gh nt ] [g :z rneI] [keI] [b d] [d n :n] [d st :n] [neI] [ pneI] [ pneI] [gh (r)][k ] [r :kh] [kiy ].

“After having been in the park for almost an hour, both friends started going towards their respective homes”.

The tagged version of the above Urdu sentence is given below:

(PP) (V) (NP) (PP) (PP) (PP) (V)

Grammar Rule 2: PP V NP PP PP PP V.

Logical Pattern 2: V (PP, PP, PP, NP, V (PP)).

Logical Representation:

[Urdu text not preserved]

In this example, two predicates (both V) are used.

Example 3

[ b] [teIl] [ r] [p ni:] [k ] [am l] [h t ].

“Oil and water would react now”.

The tagged version of the above Urdu sentence is given below:

(VC) (PP) (V)

Grammar Rule 3: VC PP V.

Logical Pattern 3: V (VC (PP)).

Logical Representation:

[Urdu text not preserved]

In this example, only one argument (PP) and two predicates (VC and V) are used.

Example 4

[k bhi:] [y h] [ry s t] [keI] [n v b] [k ] [g rm i] [m k m] [th ].

“Sometime in the past, it was the summer capital of the Nawab”.

The tagged version of the above Urdu sentence is given below:

(VC) (PP) (PP) (NP) (V)

Grammar Rule 4: VC PP PP NP V.

Logical Pattern 4: V (NP, PP, VC (PP)).

Logical Representation:

[Urdu text not preserved]

In this example, more than one argument and two predicates (VC and V) are used.

Example 5

[meIheIl] [keI] [s bz z r] [meIn][bhi][s ngeIm rm r] [seI] [baneI] [h :eI] [t kht] [b heI] [h :eI][h eñ].

“The lawn of the palace was carpeted with marble thrones”.

The tagged version of the above Urdu sentence is given below:

(PP) (PP) (V) (PP) (V) (NP) (V)

Grammar Rule 5: PP PP V PP V NP V.

Logical Pattern 5: V (NP, V (PP, V (PP, PP))).

Logical Representation:

[Urdu text not preserved]

In this example, three predicates (all V) are used.

Because all the knowledge or text is represented, with no missing/supporting words, the whole text can be retrieved easily from this representation. The main advantage of the phrase-level approach is its accuracy.

In predicate logic, Urdu text/sentences are split into phrases/constituents according to certain rules as shown in TABLE II. Then these phrases/constituents are represented in predicate logic.

Now, the following is a real Urdu text which is first tagged into phrases/constituents and then represented in predicate logic.

The Text

“Today was the last day of the month of Ramazan. Shawal’s crescent has been sighted. Since morning, most of the people were of the opinion that it would be Eid tomorrow. Today was my 29th fasting day as well. I have observed all fasting days. Everyone else over at home also fasted. My father and Asad also observed the fasting days. Fasting in the month of Ramazan is obligatory upon all Muslims. I feel very happy today. Along with my prayers, I have observed the fasting days of the whole month of Ramazan. Last year I was too young to fast. My mother made me fast for two or four days. My elder sister, Faiza, observed all fasting days. Now Asad can also fast for two or four days. My father thinks that he should observe all fasting days of Ramazan next year. Every Muslim should offer prayers and fast regularly”.

The Tagged Version

(PP) (NP) (V) (PP) (NP) (V) (PP) (PP) (V)

(NP) (V) (NP) (V) (NP) (V) (PP) (NP) (V)

(PP) (NP) (V) (PP) (NP) (V) (PP) (NP) (PP) (NP) (V)

(NP) (V) (PP) (PP) (PP) (NP) (V) (NP)

(AP) (V) (PP) (NP) (V) (PP) (V) (PP) (NP) (V) (NP) (PP) (V) (PP)

(V) (NP) (NP) (V) (PP) (PP) (AP) (V)

Representation in Predicate Logic

[Logical representations of the tagged sentences: Urdu text not preserved]


From the above examples, it is clear that phrases are represented in sequence as arguments and that the predicate V is placed at the end of its arguments in the logical representation. If a predicate comes before its argument(s), then it is represented as VC and takes only the first argument that follows it (see Logical Patterns 3 and 4). If an additional predicate comes after the argument(s), then an additional opening bracket is used for each such predicate at the start of the logical representation. To retrieve the text/sentence from such a representation, the arguments are first retrieved in sequence from right to left and then the predicate is retrieved after its argument(s). But if the predicate is a command verb (VC), then it is retrieved before its argument.
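The pattern-building rules read off the five examples can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `logical_pattern` and the decision to work on tag names alone (without the Urdu phrases) are assumptions made for the example. Arguments are emitted in reverse order to mirror the right-to-left layout of the printed patterns.

```python
from typing import List


def logical_pattern(tags: List[str]) -> str:
    """Build a logical pattern string from a left-to-right phrase tag sequence.

    Hypothetical helper that follows the rules as read off Logical Patterns 1-5.
    """
    pending: List[str] = []  # arguments collected since the last predicate
    i = 0
    while i < len(tags):
        tag = tags[i]
        if tag == "VC":
            # A predicate that precedes its arguments takes only the next phrase.
            if i + 1 >= len(tags):
                raise ValueError("VC must be followed by an argument")
            pending.append(f"VC({tags[i + 1]})")
            i += 2
        elif tag == "V":
            # A predicate that follows its arguments wraps everything collected
            # so far; arguments are listed in reverse (right-to-left) order.
            args = ", ".join(reversed(pending))
            pending = [f"V({args})" if args else "V"]
            i += 1
        else:
            pending.append(tag)  # NP, PP or AP: a plain argument
            i += 1
    return ", ".join(reversed(pending))


print(logical_pattern(["NP", "PP", "NP", "PP", "V"]))              # V(PP, NP, PP, NP)
print(logical_pattern(["PP", "PP", "V", "PP", "V", "NP", "V"]))    # V(NP, V(PP, V(PP, PP)))
```

Up to spacing, the two printed strings match Logical Patterns 1 and 5; the same function also reproduces Patterns 2, 3 and 4 from Grammar Rules 2, 3 and 4.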

TABLE II. RULES FOR THE FORMATION OF DIFFERENT PHRASES/CONSTITUENTS OF URDU LANGUAGE

(Alternative right-hand sides are separated by "|". The Examples column is not preserved.)

S. No   Phrase   Rules
1       NP       NP = N | Det Adj N | Det N | Pron | Pron N | Num N | Num N N | N Conj N | N N | Det N N | Adj N | Pron N N | N Pron | Adj Adj N | Adj N N | N Num | Adj Adj Conj Adj N | Det Adj N Conj Pron | N Pron N | N Conj N N | Num | Adj Adj Adj N | N Conj Pron | Adj Conj Adj N | Adj N N Conj N | Adj Pron | Adj N Conj N | Adj Conj Adj Adj N | Num Num N | Num N N N | Num N Pron N | Pron Pron N | N Conj Num N | N N Conj N N | Pron N Conj Pron N
2       NPP      NPP = NP Posp | NP Posp Posp | NP Posp Posp Posp
3       AP       AP = Adj | Adj Adj | Adj Conj Adj | Adj Adj Conj Adj
4       APP      APP = AP Posp | AP Posp Posp
5       V        V = V | V V | V V Conj V | V V V | V V V V | Adv | Adv V V | V Neg V | Adv V | V V V V V | Adv V V | Adv V V V | V V Adv Adv | Adv Adv V V | Adv Adv | Adv Neg V | V Adv Neg V | Adv V Neg V
6       VPP      VPP = V Posp | V Posp Posp | V V Posp | Adv V V Posp
7       PP       PP = NPP | APP | VPP
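To illustrate how rules of the kind listed in TABLE II could drive a sentence splitter, the following sketch groups a sequence of word-level tags into phrase tags. It is a simplified assumption, not the paper's splitter: only a small subset of the rules is included, the names `RULES`, `MAX_POSP` and `chunk` are hypothetical, and the greedy longest-match strategy is an assumed choice, since the paper does not state how competing rules are resolved.

```python
from typing import Dict, List, Set, Tuple

# A small, illustrative subset of the TABLE II rules (right-hand sides only).
RULES: Dict[str, Set[Tuple[str, ...]]] = {
    "NP": {("N",), ("Det", "N"), ("Det", "Adj", "N"), ("Pron",), ("Pron", "N"),
           ("Num", "N"), ("Adj", "N"), ("N", "N"), ("N", "Conj", "N")},
    "AP": {("Adj",), ("Adj", "Adj"), ("Adj", "Conj", "Adj")},
    "V": {("V",), ("V", "V"), ("Adv",), ("Adv", "V"), ("V", "Neg", "V")},
}
# Simplification of the NPP, APP and VPP rules: maximum trailing postpositions.
MAX_POSP = {"NP": 3, "AP": 2, "V": 2}


def chunk(pos_tags: List[str]) -> List[str]:
    """Group word-level tags into phrase tags (NP, AP, V or PP)."""
    phrases: List[str] = []
    i = 0
    while i < len(pos_tags):
        # Assumed strategy: pick the longest rule that matches at position i.
        best_len, best_label = 0, ""
        for label, rules in RULES.items():
            for rule in rules:
                n = len(rule)
                if n > best_len and tuple(pos_tags[i:i + n]) == rule:
                    best_len, best_label = n, label
        if best_len == 0:
            raise ValueError(f"no rule matches at position {i}: {pos_tags[i]}")
        # Absorb trailing postpositions; any postposition turns the phrase into a PP.
        j = i + best_len
        posp = 0
        while (j < len(pos_tags) and pos_tags[j] == "Posp"
               and posp < MAX_POSP[best_label]):
            j += 1
            posp += 1
        phrases.append("PP" if posp else best_label)
        i = j
    return phrases


print(chunk(["Det", "N", "Adj", "N", "Posp", "Adv", "V"]))  # ['NP', 'PP', 'V']
```

The input tag sequence in the usage line is made up for illustration and does not correspond to any of the paper's example sentences.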

III. PROPOSED APPROACH FOR KNOWLEDGE REPRESENTATION OF URDU TEXT

The word-level approach and the phrase-level approach can be combined into a word-phrase-level approach that removes the disadvantages of both approaches while retaining their advantages. The main components of this approach are as follows:

1. Urdu Dictionary.

2. Urdu Sentence splitter module.

3. Module for converting Urdu phrases/constituents into logical form.

4. Module for converting logical form into phrases/constituents and then back to Urdu sentences.

In this approach, Urdu words are stored in a dictionary along with their grammatical categories, so that relatively little storage space is occupied. The sentence splitter module uses the dictionary to split the input Urdu text/sentences into phrases/constituents according to stored rules, and these phrases/constituents are then represented in logical form by the logic representation module. From the logical representation, the original input text/sentences are retrieved by the text/sentence retrieval module. Thus we can obtain an efficient system for knowledge representation and retrieval of Urdu text, which makes processing easy, accurate and efficient.

IV. ALGORITHM

The following algorithm splits the Urdu text/sentences into phrases/constituents and then represents these in predicate logic. The algorithm also retrieves the original text from such a representation.

1. Tag the Urdu text/sentences to identify V, NP, AP and PP from the parser.

2. Represent phrases in sequence as arguments from right to left and use V as a predicate at the end of the arguments.

3. If the predicate V comes before its argument(s), then replace V with VC in the representation.

4. If an additional predicate comes after the arguments, then use an additional opening bracket for each predicate at the start of the representation.

5. To retrieve text/sentences from the predicate logic representation,

i) If the predicate is V, then

a) Retrieve the arguments in sequence from right to left.

b) Retrieve the predicate.

ii) If the predicate is VC, then

a) Retrieve the predicate.

b) Retrieve the remaining arguments in sequence from right to left.

The above algorithm relies on the correct identification of phrases/constituents in the text. It can serve as the basis of an efficient and accurate knowledge understanding and generation system for Urdu text. It can also be extended to machine translation and question-answering.
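Step 5 of the algorithm (retrieval) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' VB.Net code: the logical form is given as a string over tag names such as `V(NP, PP, VC(PP))`, and the helper names `parse` and `retrieve` are hypothetical. Arguments are read from right to left; a V predicate is emitted after its arguments and a VC predicate before its argument.

```python
import re
from typing import List, Tuple

Node = Tuple[str, list]  # (symbol, arguments in printed order)


def parse(form: str) -> Node:
    """Parse a logical form such as 'V(NP, PP, VC(PP))' into a nested node."""
    tokens = re.findall(r"[^\s(),]+|[(),]", form)
    pos = 0

    def term() -> Node:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in {"(", ")", ","}:
            raise ValueError("expected a symbol")
        name = tokens[pos]
        pos += 1
        args: list = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1
            while True:
                args.append(term())
                if pos < len(tokens) and tokens[pos] == ",":
                    pos += 1
                elif pos < len(tokens) and tokens[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError("expected ',' or ')'")
        return (name, args)

    node = term()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return node


def retrieve(node: Node) -> List[str]:
    """Recover the left-to-right phrase sequence from a parsed logical form."""
    name, args = node
    if not args:
        return [name]
    # Arguments are retrieved in sequence from right to left.
    inner = [sym for arg in reversed(args) for sym in retrieve(arg)]
    # VC is retrieved before its argument; V is retrieved after its arguments.
    return [name] + inner if name == "VC" else inner + [name]


print(" ".join(retrieve(parse("V(NP, PP, VC(PP))"))))        # VC PP PP NP V
print(" ".join(retrieve(parse("V(NP, V(PP, V(PP, PP)))"))))  # PP PP V PP V NP V
```

The two printed sequences are Grammar Rules 4 and 5, recovered from Logical Patterns 4 and 5. In a full system each symbol would carry its Urdu phrase, so that the same traversal yields the original sentence.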

V. IMPLEMENTATION AND EVALUATION

The algorithm is implemented in VB.Net. It is tested on Urdu sentences which are manually tagged into V, NP, AP, and PP, and it has shown a success rate of 88%. The algorithm relies on the identification of phrases/constituents in the input text. If the phrases/constituents are not accurately identified in the input text/sentences, these errors propagate to the logical representation and retrieval of the text, and the accuracy of the algorithm decreases. It is therefore necessary to identify the phrases/constituents in the given text accurately for accurate and efficient representation and retrieval. Initially, only a very limited set of rules was used for splitting the input text into phrases/constituents, and the accuracy of the algorithm was low. As more rules were added, the accuracy increased. When the number of rules reached 66, the accuracy became 88%. These results are summarized in TABLE III.

TABLE III. RESULTS OF KNOWLEDGE REPRESENTATION

No. of Rules    Knowledge Representation Accuracy
20              50%
35              66%
45              75%
66              88%

From the above table, it is clear that the accuracy increases as the number of rules for splitting the input text into phrases/constituents increases. The algorithm for knowledge representation gave 12% errors. These errors are due to the lack of proper rules for splitting the input text into phrases/constituents. A parser should therefore be developed which can accurately identify phrases/constituents in the input text. The algorithm represents and retrieves accurately tagged text accurately.
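A short calculation on the TABLE III figures makes the trend concrete. The sketch below only does arithmetic on the reported values; the per-rule gain it prints is a derived quantity that the paper does not report, and the names `results` and `marginal_gains` are hypothetical.

```python
# (number of rules, accuracy in percent), copied from TABLE III
results = [(20, 50), (35, 66), (45, 75), (66, 88)]


def marginal_gains(points):
    """Accuracy gained per added rule between consecutive rows of the table."""
    return [round((a2 - a1) / (r2 - r1), 2)
            for (r1, a1), (r2, a2) in zip(points, points[1:])]


error_rate = 100 - results[-1][1]  # remaining error with 66 rules

print(marginal_gains(results))  # [1.07, 0.9, 0.62]
print(error_rate)               # 12
```

On these four data points the gain per added rule shrinks from about 1.07 to about 0.62 percentage points, while the remaining error of 12% matches the figure stated in the text.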

VI. CONCLUSION

This research paper presents an algorithm for representing Urdu text/sentences in predicate logic. The algorithm uses syntactic information about Urdu text/sentences and achieves an accuracy of 88%. Several factors that are important for accurate knowledge representation in predicate logic are identified and discussed in detail.

REFERENCES

[1] E. Louis and J. Frenzel, “Crash course in artificial intelligence and expert system”, Howard W. Sams & Company, Indianapolis, USA, 1987.

[2] P. J. Hayes, “Some Problems and Non-Problems in Representation Theory”, Proceedings AISB Summer Conference, Essex University, July 1974.

[3] S. Sawai, H. Fukushima, M. Sugimoto, and N. Ukai, “Knowledge representation and machine translation”, COLING 1982, Prague, pp. 351-356.

[4] L. K. Schubert and F. J. Pelletier, “From English to logic: context-free computation of conventional logical translation”, American Journal of Computational Linguistics, vol. 8, no. 1, 1982.

[5] A. Ali and M. A. Khan, “Selecting predicate logic for knowledge representation by comparative study of knowledge representation schemes”, ICET 2009, Pakistan.

[6] J. Mylopoulos, “An overview of knowledge representation”, ACM Press, New York, NY, USA, 1981.

[7] J. Veenstra, “Memory-Based Text Chunking”, Proceedings of EACL’99, pages 118-125, Bergen, Norway, 1999.

[8] J. Elming, “Syntactic reordering integrated with phrase-based SMT”, Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 209-216, Manchester, August 2008.

[9] A. Zollmann, A. Venugopal, F. Och and J. Ponte, “A systematic comparison of phrase-based, hierarchical and syntax-augmented statistical MT”, Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1145-1152, Manchester, August 2008.

[10] D. Chiang, “Hierarchical phrase-based translation”, Computational Linguistics, pages 201-228, 2007.