a domain specific automatic text summarization using fuzzy logic

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-

6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 4, July-August (2013), © IAEME

A DOMAIN-SPECIFIC AUTOMATIC TEXT SUMMARIZATION USING FUZZY LOGIC

1 Mrs. A. R. Kulkarni
Assistant Professor, Computer Science & Engg Department,
Walchand Institute of Technology, Solapur

2 Dr. Mrs. S. S. Apte
Head, Computer Science & Engg Department,
Walchand Institute of Technology, Solapur

ABSTRACT

The amount of information on the World Wide Web has increased enormously, and with it the need for text summarization. Text summarization creates a summary of a document that consists of its important sentences. The summary helps readers decide whether or not to read the whole document and thus saves time. The techniques that researchers have proposed for text summarization can be broadly classified into two types: extraction and abstraction. This paper focuses on text summarization by extraction using fuzzy logic. Many automatic text summarization techniques have used either statistics or linguistics; very few works have used a combination of both. Our paper combines statistical and linguistic methods. This hybrid approach has been applied to a dataset of news articles in the domain of technical news, and we have evaluated its performance using precision and recall. We found that this method generates summaries of good quality.

Keywords: Summarization, Statistics, Linguistics, Fuzzifier, Defuzzifier, Rule-Base, Extraction.

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 4, Issue 4, July-August (2013), pp. 449-461 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com

INTRODUCTION

“Text summarization” is the process of creating a shorter version of an original text that contains its important information. The amount of information on the web is growing day by day, and a considerable amount of time is wasted in searching for relevant documents. Text summarization techniques therefore came into existence; they create a short summary of a text document by choosing its important sentences. Automatic text summarization works very well on structured documents such as news articles, research publications and reports.

Text summarization has two approaches: extraction and abstraction. Extraction selects sentences of high relevance (rank) from the document based on word and sentence features and puts them together to generate a summary. It uses mostly statistical methods. Abstraction examines the text, interprets it and generates a summary using different sentences. It uses linguistic methods. This paper focuses on extractive summarization. It applies a combination of statistical and linguistic methods to a fusion of various features to generate a better-quality summary.

RELATED WORK

Text summarization has been an important research area since the late 1950s. The first automatic text summarizer was created by Luhn in 1958 [1] and was based on term frequency. In 1961, G. J. Rath, A. Resnick and T. R. Savage [2] presented evidence of the problems in generating summaries using the term frequency feature. Both studies are characterized by surface-level approaches. In the late 60s, entity-level approaches appeared; the first of its kind, proposed by Climenson [3], used syntactic analysis. This was followed by Edmundson’s work [4], which used term frequency, location features and cue words. The earliest research on summarization was done on scientific documents, followed by work in other domains, mostly on newswire data. In the 1990s, with the advent of machine learning techniques in Natural Language Processing, many publications appeared that used statistical techniques to produce document summaries, combining appropriate features with learning algorithms. Other approaches have used hidden Markov models [5] and log-linear models to improve extractive summarization.

Recently, neural networks have been used to generate summaries of single documents by extraction [6]. Very little work has been done on automatic text summarization based on Artificial Intelligence and evolutionary techniques. M. S. Binwahlan [7] designed an automatic text summarizer using an integrated hybrid model, which applies diversity-based methods and swarm-based methods followed by fuzzy logic. Experimental results have shown that this model produces summaries of good quality.

Ladda Suanmali [8] used sentence weight, a numerical measure assigned to each sentence, and selected sentences for the summary in descending order of their sentence weight.

L. Antiqueira [9] proposed a method for extractive summarization using the concept of complex networks and their metrics. The method was shown to capture important text features as expected.

For MEDLINE citations, an automatic summarization system was introduced by Marcelo Fiszman [10]. It is a domain-specific abstractive summarizer that outperformed the baseline summarizer considerably.

A lot of work has been done on single-document and multi-document summarization using statistical methods. Many researchers are trying to apply this technology to a variety of new and challenging areas, including multilingual summarization and multimedia news broadcast.

SURVEY ON NEED AND SCOPE OF TEXT SUMMARIZATION

Text summarization is increasingly being used in the commercial sector, for example:

• In the telephone communication industry, e.g. BT’s ProSum.

• In data mining of text databases, e.g. Oracle’s Context.

• In filters for web-based information retrieval, e.g. Inxight’s summarizer used in Alta Vista Discovery.

• In word processing tools, e.g. Microsoft’s AutoSummarize.

• In a variety of new applications such as multilingual summarization, multimedia news broadcast and audio scanning services for the blind.

• To summarize news to SMS or WAP format for mobile phones.

These approaches differ in the manner of their problem formulations.

A BETTER APPROACH TO SUMMARIZATION

This approach uses both statistical and linguistic methods [11] to improve the quality of the generated summary. It uses fuzzy logic for effective text summarization [12]. A decision module, designed as a fuzzy inference system, determines the degree of importance of each sentence based on its rated features.

This approach is illustrated in Figure 4.1.

The text summarization approach consists of the following stages:

• Preprocessing

• Feature Extraction

• Fuzzy logic scoring

• Sentence selection and Assembly

PREPROCESSING

Preprocessing has 4 steps:

Segmentation: divides a given document into sentences.

Removal of stop words: Stop words are frequently occurring words such as ‘a’, ‘an’ and ‘the’ that carry little meaning and add noise. The stop words are predefined and stored in an array.

Tokenization and POS tagging: A standard parser cum tagger is used to generate tokens and tag them with the proper parts of speech, such as nouns (NN), verbs (VBZ), adjectives (JJ), adverbs (ADVB), determiners (DT) and coordinating conjunctions (CC). It also groups syntactically correlated words into phrases such as noun phrases, verb phrases and adjective phrases.

Word stemming: converts every word into its root form by removing its prefix and suffix so that it can be compared with other words.
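The preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the parser cum tagger used in this work: the stop-word list is a tiny sample, the stemmer is a crude suffix stripper (prefixes are not handled), and POS tagging is left out because it needs a trained tagger. The names `segment`, `tokenize`, `stem` and `preprocess` are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word list; the paper only states that
# the stop words are predefined and stored in an array.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "and", "to"}

# Assumption: a crude suffix stripper standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def segment(document):
    """Split a document into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lower-case a sentence and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(document):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in segment(document):
        tokens = [t for t in tokenize(sentence) if t not in STOP_WORDS]
        result.append([stem(t) for t in tokens])
    return result
```

For instance, `preprocess("The cats are running. A dog barked!")` yields two token lists, one per sentence, with "the" and "a" removed and the remaining words reduced by the suffix stripper.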

[Figure 4.1: block diagram of the approach. Boxes: Document, Preprocessing, Feature selection, Fuzzification, Rule Base, Defuzzification, Selection of sentences & assembly, Summary.]

FEATURE EXTRACTION

The text document is represented by the set D = {S1, S2, ..., Sk}, where Si signifies a sentence contained in the document D. The document is subjected to feature extraction, for which the important word and sentence features are chosen first. This work uses the following features: title word, sentence length, sentence position, numerical data, thematic words, sentence-to-sentence similarity, term weight and proper nouns.

1. Title word: A high score is given to a sentence if it contains words occurring in the title, as the main content of the document is expressed via the title words. This feature is computed as follows: if Nt is the number of words in the sentence that occur in the title and Ntotal is the total number of words in the title, then

$$F_1 = \frac{N_t}{N_{total}}$$

2. Sentence Length: We eliminate sentences that are too short, such as datelines or author names. For every sentence, the normalized sentence length is calculated as

$$F_2 = \frac{\text{No. of words in the sentence}}{\text{No. of words in the longest sentence}}$$

3. Sentence Position: The sentences occurring first in a paragraph have the highest score. If a paragraph has n sentences, the score of the i-th sentence for this feature is F3(Si) = (n - i + 1)/n. For example, for n = 5:

F3(S1) = 5/5; F3(S2) = 4/5; F3(S3) = 3/5; F3(S4) = 2/5; F3(S5) = 1/5.

4. Numerical data: Sentences containing numerical data can reflect important statistics of the document and may be selected for the summary. The score is calculated as

$$F_4(S) = \frac{\text{No. of numerical data in the sentence } S}{\text{Sentence length}}$$

5. Thematic words: These are domain-specific words with maximum possible relativity. The score for this feature is the ratio of the number of thematic words that occur in a sentence to the maximum number of thematic words in a sentence:

$$F_5(S) = \frac{\text{No. of thematic words in the sentence } S}{\text{Max(No. of thematic words)}}$$

6. Sentence to Sentence Similarity: For each sentence S, the similarity between S and every other sentence is computed by the method of token matching. An N × N matrix is formed, where N is the total number of sentences in the document. The diagonal elements of the matrix are set to zero, as a sentence should not be compared with itself. The similarity score of each sentence is calculated as follows:

$$F_6 = \frac{\sum [\mathrm{Sim}(S_i, S_j)]}{\mathrm{Max}[\mathrm{Sim}(S_i, S_j)]}$$

where i = 1 to N and j = 1 to N.

7. Term weight: The score of this feature is the ratio of the sum of the term frequencies of all terms in a sentence to the maximum of these sums over all sentences in the document:

$$F_7 = \frac{\sum_{i=1}^{n} TF_i}{\mathrm{Max}\left(\sum_{i=1}^{n} TF_i\right)}$$

where n is the number of terms in the sentence.

8. Proper Nouns: Sentences that contain more proper nouns are considered more important. The score is given by

$$F_8 = \frac{\text{No. of proper nouns in the sentence } s}{\text{Sentence length}(s)}$$
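The eight feature scores can be computed together once the sentences have been tokenized. The sketch below is one possible reading of the definitions above, with several assumptions: token-matching similarity is taken as the number of distinct tokens two sentences share, F6 is normalized by the largest row sum of the similarity matrix so that it stays between 0 and 1, numerical data are all-digit tokens, sentence length is the token count, the whole document is treated as one paragraph for F3, and the sets of thematic words and proper nouns are supplied by the caller. All function and parameter names are hypothetical.

```python
from collections import Counter


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def position_scores(n):
    """F3 for a paragraph of n sentences: n/n, (n-1)/n, ..., 1/n."""
    return [(n - i) / n for i in range(n)]


def token_overlap(a, b):
    """Assumed token-matching similarity: number of shared distinct tokens."""
    return len(set(a) & set(b))


def feature_vectors(sentences, title, thematic, proper_nouns):
    """Return the list [F1, ..., F8] for every sentence.

    sentences: non-empty list of token lists; title: token list;
    thematic, proper_nouns: sets of tokens (assumed to be given).
    """
    n = len(sentences)
    title_set = set(title)
    tf = Counter(t for s in sentences for t in s)  # document term frequencies
    longest = max(len(s) for s in sentences)
    theme_counts = [sum(t in thematic for t in s) for s in sentences]
    # Row sums of the N x N similarity matrix, diagonal excluded.
    sim_sums = [
        sum(token_overlap(s, o) for j, o in enumerate(sentences) if j != i)
        for i, s in enumerate(sentences)
    ]
    tf_sums = [sum(tf[t] for t in s) for s in sentences]
    f3 = position_scores(n)  # assumption: the document is a single paragraph

    vectors = []
    for i, s in enumerate(sentences):
        length = len(s)
        vectors.append([
            safe_div(sum(t in title_set for t in s), len(title)),  # F1
            safe_div(length, longest),                             # F2
            f3[i],                                                 # F3
            safe_div(sum(t.isdigit() for t in s), length),         # F4
            safe_div(theme_counts[i], max(theme_counts)),          # F5
            safe_div(sim_sums[i], max(sim_sums)),                  # F6
            safe_div(tf_sums[i], max(tf_sums)),                    # F7
            safe_div(sum(t in proper_nouns for t in s), length),   # F8
        ])
    return vectors
```

Every score falls between 0 and 1, which makes the features directly usable as inputs to a membership function defined on that interval.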

Thus each sentence is associated with a vector of 8 features. From these 8 feature scores, the score of each sentence is derived using the fuzzy logic method, which uses fuzzy rules and a triangular membership function. The fuzzy rules are in IF-THEN form. The triangular membership function fuzzifies each feature score into one of 3 values: LOW, MEDIUM or HIGH. We then apply the fuzzy rules to determine whether a sentence is unimportant, average or important. Converting this fuzzy output into a sentence score is known as defuzzification.

For example: IF (F1 is H) and (F2 is M) and (F3 is H) and (F4 is M) and (F5 is M) and (F6 is M) and (F7 is H) and (F8 is H) THEN (sentence is important).
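A minimal sketch of this scoring step is given below. Only the first rule is taken from the text; the other two rules, the break points of the triangular membership functions, the use of min for "and" and the weighted-average defuzzification are all assumptions made for illustration, since the full rule base is not listed here. A sentence that fires no rule gets a score of 0, so a practical rule base would have to cover the input space far more densely.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet a and c and peak b (a < b < c)."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Assumed break points on the [0, 1] feature scale.
LEVELS = {
    "L": (-0.5, 0.0, 0.5),
    "M": (0.0, 0.5, 1.0),
    "H": (0.5, 1.0, 1.5),
}

# Assumed crisp centres of the three output classes.
OUTPUT = {"unimportant": 0.0, "average": 0.5, "important": 1.0}

# One letter per feature F1..F8. The first rule is the example from the text;
# the other two are hypothetical placeholders.
RULES = [
    ("HMHMMMHH", "important"),
    ("MMMMMMMM", "average"),
    ("LLLLLLLL", "unimportant"),
]


def fuzzify(x):
    """Return the degrees of membership of x in LOW, MEDIUM and HIGH."""
    return {name: triangular(x, *abc) for name, abc in LEVELS.items()}


def rule_strength(features, pattern):
    """Firing strength of an IF-THEN rule, using min for 'and'."""
    return min(fuzzify(x)[level] for x, level in zip(features, pattern))


def sentence_score(features, rules=RULES):
    """Defuzzify as the strength-weighted average of the output centres."""
    fired = [(rule_strength(features, p), OUTPUT[label]) for p, label in rules]
    total = sum(w for w, _ in fired)
    return sum(w * c for w, c in fired) / total if total else 0.0
```

With these assumptions, a feature vector that matches the example rule exactly receives the maximum score, while a vector lying halfway between LOW and MEDIUM on every feature receives a score between "unimportant" and "average".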

All the sentences of a document are ranked in descending order of their scores. The top n sentences are extracted as the document summary, where n depends on the compression rate. Finally, the sentences in the summary are arranged in the order in which they occur in the original document.
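The selection and assembly step can be sketched as follows. The sketch assumes that the compression rate is the fraction of sentences to keep and that at least one sentence is always returned; the function name `select_summary` is hypothetical.

```python
import math


def select_summary(sentences, scores, compression_rate):
    """Pick the top-scoring sentences and return them in document order."""
    # Assumption: compression_rate is the fraction of sentences to keep.
    n = max(1, math.ceil(len(sentences) * compression_rate))
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore the original document order among the chosen sentences.
    chosen = sorted(ranked[:n])
    return [sentences[i] for i in chosen]
```

Sorting the chosen indices before assembling the summary is what keeps the extracted sentences in their original order, even when a later sentence scores higher than an earlier one.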

EVALUATION METHODOLOGY

The evaluation of the summaries is based on the two factors mentioned in Fig. 5. We used 2 documents from news articles belonging to the technical domain as input to the system. Human-generated summaries, which we received for our input documents from different experts, are used as reference summaries for the evaluation of our results, since humans can capture and relate the deep meanings of a text better than machines. We call the summaries of the fuzzy summarizer, online summarizer 1 and online summarizer 2 the candidate summaries.

The performance of the proposed approach is evaluated using precision, recall and F-measure [12]. Precision is the proportion of sentences in the summary that are correct, whereas recall is the proportion of relevant sentences that are included in the summary. The higher the precision, the better the system is at omitting irrelevant sentences. Similarly, the higher the recall, the more successful the system is at fetching the relevant sentences. The harmonic mean of precision and recall is called the F-measure. The formulas for precision, recall and F-measure are shown below.

$$\text{Precision} = \frac{|\{\text{Retrieved sentences}\} \cap \{\text{Relevant sentences}\}|}{|\{\text{Retrieved sentences}\}|}$$

$$\text{Recall} = \frac{|\{\text{Retrieved sentences}\} \cap \{\text{Relevant sentences}\}|}{|\{\text{Relevant sentences}\}|}$$

$$\text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
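The three measures can be computed directly from sets of sentence identifiers. In the sketch below the candidate summary plays the role of the retrieved sentences and the human reference summary that of the relevant sentences; the function name `evaluate` and the use of sentence indices as identifiers are assumptions.

```python
def evaluate(candidate, reference):
    """Return (precision, recall, f_measure) for two collections of sentence ids."""
    retrieved, relevant = set(candidate), set(reference)
    overlap = len(retrieved & relevant)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, a candidate summary made of sentences 1, 2, 3 and 4 compared with a reference summary made of sentences 2, 4 and 5 shares two sentences with it, giving a precision of 2/4, a recall of 2/3 and an F-measure of 4/7.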

EXPERIMENTAL RESULTS

The two sports news articles, their manual summaries, the summaries generated by our algorithm and the summaries generated by two online summarizers are shown below, followed by charts comparing the results of the online summarizers with those of our proposed summarizer.

Original Document 1

Original Document 2

Manual summary for Document 1

Manual summary for Document 2

For document 1, the summary generated by our algorithm is:

For document 2, the summary generated by our algorithm is:

Online Summarizer 1

• Document 1

• Document 2

Online Summarizer 2

• Document 1

• Document 2

Comparison Graphs

• Document 1

[Bar chart: precision, recall and f-measure for Summarizer 1, Summarizer 2 and Our Summarizer; vertical axis from 0 to 0.8]

• Document 2

[Bar chart: precision, recall and f-measure for Summarizer 1, Summarizer 2 and Our Summarizer; vertical axis from 0 to 1]

CONCLUSION

Automatic summarization is a complex task that consists of several sub-tasks, each of which directly affects the ability to generate high-quality summaries. In extraction-based summarization, the important part of the process is the identification of the important, relevant sentences of the text. Using fuzzy logic for this sub-task improved the quality of the summary, as the comparison graphs show: our algorithm gives better results than the two online summarizers.

FUTURE SCOPE

The quality of the summary can still be improved by using topic segmentation and semantic analysis of the text in addition to the features considered above. We applied our method to single-document summarization; it could be extended to multi-document summarization.

REFERENCES

1. Luhn, H. P. (1958). “Automatic Creation of Literature Abstracts”, IBM Journal of Research & Development, 2, April, p. 159.

2. G. J. Rath, A. Resnick and T. R. Savage, “The formation of abstracts by selection of sentences”, IBM Foundation, Yorktown Heights, New York.

3. Climenson, W. D., Hardwick, N. H. and Jacobson, S. N. (1961). “Automatic Syntax Analysis in Machine Indexing and Abstracting”.

4. Edmundson, H. P. (1969). “New Methods in Automatic Extracting”.

5. M. Suneeta and S. Sameen Fatima, “Corpus based Automatic Text Summarization System with HMM Tagger”, IJSCE, ISSN: 2231-2307, Volume 1, Issue 3, July 2011.

6. Kaikhah, K., “Automatic Text Summarization using neural networks”, Intelligent Systems, 2004. Proceedings. 2004 2nd International IEEE Conference, Volume 1.

7. Binwahlan, M. S., Salim, N. and Suanmali, L. (2009), “Fuzzy Swarm based text summarization”, Journal of Computer Science, 5(5), 338-346.

8. Ladda Suanmali, Naomie Salim and Mohammed Salem Binwahlan, “Fuzzy Logic Based Method for Improving Text Summarization”, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 2, No. 1, 2009.

9. L. Antiqueira, O. N. Oliveira Jr., L. F. Costa and M. G. V. Nunes, “A complex network approach to text summarization”, Information Sciences 179(5):584-599 (2009).

10. Marcelo Fiszman, Thomas C. Rindflesch and Halil Kilicoglu, “Abstraction summarization for managing the biomedical research literature”, CLS '04 Proceedings of the HLT-NAACL Workshop on Computational Lexical Semantics, pages 76-83, Association for Computational Linguistics, Stroudsburg, PA, USA, 2004.

11. Rushdi Shams, M. M. A. Hashem, Afrina Hossain, Suraiya Rumana Akter and Monika Gope, “A corpus based web document summarization using statistical & Linguistic approach”,

12. Ladda Suanmali, Naomie Salim and Mohammed Salem Binwahlan, “Improving Text Summarization using Fuzzy Logic”, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 2, No. 1, 2009.

13. Meghana N. Ingole, M. S. Bewoor and S. H. Patil, “Context Sensitive Text Summarization using Hierarchical Clustering Algorithm”, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 322 - 329, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.

14. Roma V J, M S Bewoor and Dr. S. H. Patil, “Automation Tool for Evaluation of the Quality of NLP Based Text Summary Generated Through Summarization and Clustering Techniques by Quantitative and Qualitative Metrics”, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 77 - 85, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.

15. V. Sujatha, K. Sriraman, K. Ganapathi Babu and B. V. R. R. Nagrajuna, “Testing and Test Case Generation by using Fuzzy Logic and N.L.P Techniques”, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 531 - 538, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.