
Applied Mathematical Sciences, Vol. 8, 2014, no. 3, 99 - 115 HIKARI Ltd, www.m-hikari.com

http://dx.doi.org/10.12988/ams.2014.36319

An Optimization Text Summarization Method

Based on Naïve Bayes and Topic Word

for Single Syllable Language

Ha Nguyen Thi Thu

Department of Computer Science, Vietnam Electric Power University

235 Hoang Quoc Viet, Tu Liem, Hanoi, Vietnam

Copyright © 2014 Ha Nguyen Thi Thu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Text summarization dates back to the late 1950s, when a simple technique based on term frequency was applied to summarizing technical documents at IBM. After more than 50 years of development, text summarization is still a hot topic that attracts many researchers and scholars in data mining and natural language processing, who continue to propose new summarization systems. For English, several automatic summarization systems have been built, such as SUMMARIST and SweSum. The situation is different for single syllable languages such as Chinese, Vietnamese, Japanese, Thai, Mongolian, and other languages native to Southeast and East Asia. Speakers of single syllable languages account for more than 60% of language users worldwide, so processing these languages is very necessary. It is also a complex language processing problem, because words and terms are very hard to determine from whitespace alone, and no current word segmentation tool reaches 100% accuracy. In this paper, we propose a text summarization method based on the Naïve Bayes algorithm and a topic word set. Experiments with 320 Vietnamese texts (equivalent to 11,670 Vietnamese sentences) show that our method is really effective: the resulting summaries are readable, understandable, and closer to human-written summaries.

Keywords: Single syllable language, text summarization, dimensionality reduction, Naïve Bayes, supervised learning, natural language processing, topic word


1 INTRODUCTION

The rapid increase in the amount of data on the Internet has made it convenient for users to retrieve and search for information, but it has also made reading difficult, because texts are often too long. Text summarization is a subfield of natural language processing: it returns a short text (shorter than the original) while maintaining the gist of the information in the original text [4], [5].

During more than 50 years of text summarization development, many methods have been proposed for building automatic summarizers that satisfy users' requirements. Among the many approaches, two are dominant: supervised learning and unsupervised learning. Recent research has often taken the unsupervised approach, which reduces the cost of building a data set, the time and cost of the training phase, and the computational complexity of the learning model. However, the quality of the resulting summaries is not high, because sentences are scored by a linear combination of features and the highest-scoring sentences are selected [4], [5], [6], [19], [26].

Text summarization based on Naïve Bayes was first applied to English text by Kupiec and colleagues in 1995, using simple features such as sentence length and word frequency [21]. In 1999, Aone gave a method based on term frequency (tf) and inverse document frequency (idf); he used a vocabulary and WordNet to strengthen the natural language processing in the summarization system. This method was also used to build the automatic summarization system Dimsum and is considered the foundation on which successful automatic text summarization systems were developed [1].

For single syllable languages such as Chinese, Japanese, Vietnamese, and Thai, building an automatic summarization, classification, or retrieval system usually requires a word segmentation tool as one phase of the system, because words in these languages cannot be determined from whitespace. For example, "preparation" (English) translates into Vietnamese as "chuẩn bị" and into Chinese as 准备 (a compound). Moreover, word segmentation tools are not highly accurate: for Chinese, the accuracy rate is only about 85% [17], [19], [34], [38], and for Vietnamese approximately 80% [37]. Therefore, if this problem is resolved, the quality of summaries can be better.

To enhance the quality and accuracy of summaries and reduce computing time for single syllable languages, we chose a supervised learning approach based on Naïve Bayes, aiming to improve summary quality through the selection and preparation of training data by experts; we also use a tagger tool to extract the topic word set in the training phase. In the summarization phase, no segmentation tool is needed, which reduces the computation time (Figure 1).


Figure 1. Process of single syllable language text summarization.

The rest of this paper is structured as follows: Section 2 reviews related work in text summarization. Section 3 introduces the construction of the topic word set and how it reduces computational complexity. The summarization methodology based on Naïve Bayes and topic words is presented in Section 4. Section 5 reports the experiments and results, which use Vietnamese text. The final section concludes.

2 RELATED WORKS

Early research on text summarization was usually based on simple characteristics for identifying and extracting the important sentences in a document; frequently used features are sentence length, sentence position, appearance in the title, and the frequency of keywords (terms). While some summarization methods are based on unsupervised learning techniques [2], [7], [10], [24], [38], there are also published works based on supervised learning, such as Edmundson's, which used a training data set of 400 technical documents [6].

Since 1990, the development of computer science and of machine learning techniques for natural language processing has given researchers a new view of text summarization, and machine learning models have been applied to the task, including Naïve Bayes, artificial neural networks, decision trees, SVMs, and hidden Markov models.

Kupiec et al. (1995) described a summarization method that inherits Edmundson's ideas through a training data set. A Naïve Bayes classifier is used to divide sentences into two classes, extracted or not, with features such as heading words, sentence length, and word frequency [21]. In 1999, Aone also proposed a method using Naïve Bayesian classification to develop an automatic summarization system called Dimsum; tf×idf is used as the main feature, and the WordNet dictionary and a vocabulary are used to increase the quality of summaries [1].

Text summarization using neural networks has been studied since the 2000s, with typical work by Lin, Nenkova, and Svore; these studies used standard data sets of more than 1,000 documents labeled by experts [23], [33], [36].


Text summarization based on the Hidden Markov Model (HMM) was announced by Conroy and O'Leary in 2001. The method is based on the idea that the features are dependent, and three features were used: the position of the sentence in the document (built into the state structure of the HMM), the number of terms in the sentence, and the likeliness of the sentence terms given the document terms [18].

For single syllable languages, the proposed methods mainly use unsupervised learning techniques, and word segmentation tools are often part of the proposals. For Chinese text, several automatic summarization systems rely on surface information of the context: Jun-Jie Li and Key-Sun Choi proposed a corpus-based method that uses a segmentation tool and a new full-text indexing scheme called the Natural Hierarchical Network (with 6 levels) to score sentences through two key features, important words and a small number of sub-sentences [19]; Yanmin Chen, Xiaolong Wang, and Yi Guan proposed a Chinese text summarization method that computes lexical chains based on the HowNet knowledge base [38]. For Vietnamese text, Thanh Le Ha used word frequency, sentence position, sentence length, and title words, with a word segmentation tool to separate words from the document and compute tf×idf [37], and Nguyen Le Minh proposed a Vietnamese summarization method based on SVM [30].

3 FEATURE REDUCTION

3.1 Text Representation

Text representation methods describe the content or characteristics of a text. When representing a text as a vector, it is common for the vector components to correspond to features, namely the words (terms); the value of each component, called the weight of the word (term), describes the frequency with which that word (term) appears in the text.

Suppose we have a document d; Figure 2 illustrates its representation, in which w_{i,j} is the weight of word (term) t_{i,j}, i indexes the i-th sentence of d, and j is the position of the word (term) in the i-th sentence, counted from left to right.

$$T = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} \end{bmatrix}$$

Figure 2. Matrix of text representation.

Example 1: An original Vietnamese text consisting of 34 words.

“Các nhà nghiên cứu thuộc trường Đại học Michigan vừa tạo ra một nguyên mẫu đầu tiên cho hệ thống tính toán quy mô nhỏ, có thể chứa dữ liệu một tuần khi tích hợp chúng vào trong những bộ phận rất nhỏ như mắt người.”

Page 5: An Optimization Text Summarization Method Based on Naïve Bayes

Optimization text summarization method 103

Translated into Chinese:

“密歇根大学的研究者已经创建了一个小规模计算的第一个原型系统,它可以存储一个星期的资料,还可以集成在很小的部分如人眼睛。”

Pinyin: "michigen da xue de yan jiu zhe yi jing chuang jian le yi ge gui mo ji suan de di yi ge yuan xing xi tong, ta ke yi cun chu yi ge xing qi de zi liao hai ke yi ji cheng zai hen xiao de bu fen ru ren yan jing"

Translated into English:

"Researchers at the University of Michigan have created a first prototype system for small-scale computing, which can contain data for a week while integrating them into very small parts as the human eyes."

For this document, we must calculate weights for 34 words, and the representation matrix has 1 row and 34 columns, as below:

T = {t_{1,1}, t_{1,2}, …, t_{1,34}} (1)

In the general case, the original text often includes multiple sentences, each consisting of many words; if represented as above, the weight matrix of the text becomes very large and requires more computation time. For ordinary Vietnamese text, we must first separate all the words (using a word segmentation tool), and computing the weight of each word then requires still more processing time.
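As a minimal illustration of this representation, the following Python sketch builds the sentence-by-term matrix T for a small document. The helper is ours, not part of the proposed system: it assumes the text has already been segmented into word tokens (whitespace splitting stands in for the segmenter) and uses raw in-sentence term counts for the weights w_{i,j}, since the weighting scheme is left open here.

    from collections import Counter

    def weight_matrix(sentences):
        """One row per sentence, one column per term; entries are term counts."""
        vocab = []                          # vocabulary in first-seen order
        for s in sentences:
            for tok in s.split():
                if tok not in vocab:
                    vocab.append(tok)
        T = []
        for s in sentences:
            counts = Counter(s.split())
            T.append([counts.get(t, 0) for t in vocab])
        return vocab, T

    vocab, T = weight_matrix(["a b a c", "b c d"])
    # vocab == ['a', 'b', 'c', 'd'];  T == [[2, 1, 1, 0], [0, 1, 1, 1]]

For a one-sentence, 34-word document such as Example 1, T has 1 row and up to 34 columns; the feature reduction in Section 3.2 is designed to shrink exactly this cost.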

3.2 Methodology of Feature Reduction

Feature selection is one of the key topics in machine learning and related fields. Real-life datasets are often characterized by many irrelevant or redundant features that may significantly hamper model accuracy and learning speed if they are not properly excluded. Feature selection involves finding a subset of features that improves prediction accuracy, or that decreases the size of the model without significantly decreasing the prediction accuracy of a classifier built using only the selected features.

To overcome the disadvantages of a large feature vector and the cost of building a large term set for each topic, we propose in this paper a method that reduces the computational complexity of a large feature set by using a word segmentation tool to separate the words into two sets: a set of nouns (called topic words) and a set of all other words. In any text, the nouns carry the information of the text, so extracting only the nouns from the text reduces the large feature set remarkably.

In Example 1, we separate document d into two sets, one of which contains the nouns and is called the nouns set; for the Vietnamese text, the nouns set is:

Nouns set T’ = {nhà_nghiên_cứu, trường, đại_học, Michigan, nguyên_mẫu, hệ_thống, quy_mô, dữ_liệu, tuần, bộ_phận, mắt, người}.

If we separate the words of the Chinese text instead, the nouns set T'' is:

Noun set T’’ = {密歇根,大学, 研究者, 规模, 原型,系统, 星期, 资料,部分,人,眼睛}.

Page 6: An Optimization Text Summarization Method Based on Naïve Bayes

104 Ha Nguyen Thi Thu

Pinyin: T'' = {mi_ke_gen, da_xue, yan_jiu_zhe, gui_mo, yuan_xing, xi_tong, xing_qi, zi_liao, bu_fen, ren, yan_jing}.

Separating the text into these two sets reduces the size of the matrix T: for the original text in Example 1, instead of a matrix T with one row and 34 columns, we only need the matrix T' with one row and 12 columns:

T' = {t_{1,1}, t_{1,2}, …, t_{1,12}} (2)

So we have reduced the feature vector from 34 dimensions to 12.
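A minimal sketch of this reduction step, assuming the tokens already carry POS tags from a tagger that labels nouns "N" (as the Vietnamese tagger in Section 5.2 does); the (word, tag) pair format and all names here are our assumptions:

    def topic_words(tagged_tokens):
        """Keep only the nouns: the topic word set T'."""
        return [word for word, tag in tagged_tokens if tag == "N"]

    tagged = [("researchers", "N"), ("created", "V"), ("prototype", "N"),
              ("system", "N"), ("small-scale", "A"), ("computing", "N")]
    print(topic_words(tagged))   # ['researchers', 'prototype', 'system', 'computing']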

4 SINGLE SYLLABLE TEXT SUMMARIZATION

4.1 Feature Selection

In previous studies we also used this feature reduction method, with the aims of reducing processing time, reducing computational complexity, and enhancing summary quality. In the present method, we use three key features to calculate the weight of a sentence: information significance, amount of information, and position of the sentence; a small sketch of all three follows the list below.

• Information significance (topic word information significance): expresses the importance of a topic word in the sentences; it is calculated as the number of sentences in which the word appears divided by the number of all sentences in the training set.

• Amount of information in a sentence: other methods usually take sentence length as a key feature, but we use the amount of information instead; it is calculated as the total number of topic words in the sentence.

• Position of sentence: sentences in a text differ in importance; commonly, a sentence in the first paragraph, or in the title, has high value. The position score is calculated as Pos(si) = 1/i, with i = 1 when si is at the top of its paragraph.
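The sketch below computes the three features under the definitions above (function and variable names are ours, not the paper's; each training sentence is assumed to be a list of tokens):

    def information_significance(word, training_sentences):
        """Fraction of training sentences in which the topic word appears."""
        hits = sum(1 for s in training_sentences if word in s)
        return hits / len(training_sentences)

    def amount_of_information(sentence_tokens, topic_words):
        """Total number of topic words in the sentence."""
        return sum(1 for t in sentence_tokens if t in topic_words)

    def position_score(i):
        """Pos(s_i) = 1/i, with i = 1 at the top of the paragraph."""
        return 1.0 / i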

4.2 Naïve Bayes classification

Naïve Bayes classification is used to calculate the probability that a sentence s with k features F1, F2, ..., Fk belongs to the extract set S, by the following formula:

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, F_2, \ldots, F_k)} \quad (3)$$

Assuming that the features are independent, formula (3) becomes:

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_j P(F_j \mid s \in S)\, P(s \in S)}{\prod_j P(F_j)} \quad (4)$$

Applying the logarithm rule, (4) turns into:

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \log P(s) + \sum_j \log P(F_j \mid s) \quad (5)$$

In which:

Page 7: An Optimization Text Summarization Method Based on Naïve Bayes

Optimization text summarization method 105

P(s) = C(s)/C(w), where C(s) is the number of sentences in the training set that belong to class C, and C(w) is the total number of sentences in the training set.

P(Fj | s) = C(Fj, s)/C(s), where C(Fj, s) is the number of occurrences of Fj in the sentences of class C.

We use Naïve Bayes to classify into two classes (the extracted class and the non-extracted class), so we compute the probability for each case, $P(s \in S \mid F_j)$ and $P(s \notin S \mid F_j)$. A sentence is selected for extraction when $P(s \in S \mid F_j) > P(s \notin S \mid F_j)$.
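A sketch of this two-class decision in log space (all names are ours; the per-feature likelihoods are assumed to have been estimated from the labeled training set as described above):

    import math

    def extract(features, prior_pos, prior_neg, lik_pos, lik_neg):
        """True iff log P(+ | F1..Fk) > log P(- | F1..Fk).

        lik_pos[f] and lik_neg[f] hold P(f | class) estimated from the
        (+)- and (-)-labeled sentences of the training set."""
        score_pos = math.log(prior_pos) + sum(math.log(lik_pos[f]) for f in features)
        score_neg = math.log(prior_neg) + sum(math.log(lik_neg[f]) for f in features)
        return score_pos > score_neg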

4.3 Using Naïve Bayes classification for single syllable text summarization

We use Naïve Bayes classification for single syllable text in two phases: training and summarization. In the training phase, we train from the data, with human participation, to build a set of extracted sentences.

- Training Phase:

• Step 1: Build the training data set D consisting of n documents, D = {d1, d2, ..., dn}.

• Step 2: Use the POS tagging tool to create the set of nouns V.

• Step 3: A human selects the most important sentences in each ds to create its text summary.

• Step 4: Divide the sentences into two classes: selected sentences are labeled (+) and non-selected sentences are labeled (-). Save the positions of both the (+)-labeled and the (-)-labeled sentences in the text ds.

• Step 5: Calculate the topic word information significance values from both the (+)-labeled and the (-)-labeled sentences, and save these values to a database table.

• Figure 3 below illustrates the 5 steps of the training phase.


FEATURE SCORE (WEIGHT)
Input: S: set of sentences; V: dictionary of nouns
Output: W': list of sentences and their corresponding values
1. Initialization
   W' = ∅; I = ∅; P = ∅; M = ∅; x = 0; y = 0;
2. For i = 1 to count(S) do
   2.1 For k = 1 to length(si) do
       2.1.1 If (si[k] is noun)
             2.1.1.1 x = x + I(si[k]);
       2.1.2 I[k] ← x;
       2.1.3 x = 0;
   2.2 For k = 1 to length(si) do
       2.2.1 If (si[k] is noun)
             2.2.1.1 y++;
       2.2.2 M[k] ← y;
       2.2.3 y = 0;
   2.3 P[k] ← MATCH(si, d);

Figure 3. Calculating value of topic word in S and S'.
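A compact Python rendering of the FEATURE SCORE procedure in Figure 3, under our reading of the pseudocode (I_of(w) stands for the stored significance value of topic word w, and match(s, d) for the MATCH position lookup; both names are assumptions):

    def feature_scores(S, V, I_of, match, d):
        """Per sentence: summed significance I, topic word count M, position P."""
        I, M, P = [], [], []
        for s in S:                                      # s is a list of tokens
            I.append(sum(I_of(w) for w in s if w in V))  # steps 2.1.x
            M.append(sum(1 for w in s if w in V))        # steps 2.2.x
            P.append(match(s, d))                        # step 2.3
        return I, M, P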

- Extraction Phase:

• Step 1: Take the original text T (the text to be summarized).

• Step 2: Split T into a set of sentences S = {s1, s2, ..., sn}.

• Step 3: For each sentence si, calculate the value of each feature Fj, then use the Naïve Bayes classification to compute the probabilities of si for both classes (+) and (-).

• Step 4: If the probability of si with class (+) is greater than the probability of si with class (-), si is selected for the summary.

• Figure 4 below describes the algorithm of the extraction phase.


EXTRACTION ALGORITHM (TSBN)
Input:
- C: original text;
- K: table of topic words and their values;
- n: number of sentences labeled (+);
- n': number of sentences labeled (-);
Output:
- T: text summary
Initialization
T = ∅; S = Split(C); // split sentences from C
F = ∅; F' = ∅; m = 0;
1. For each sentence si in C do
   1.1 For j = 1 to length(si) do
       1.1.1 If w(j) ∈ V then
             1.1.1.1 match_hash(w(j), K) // match w(j) against K
             1.1.1.2 F(k) ← n(j) // frequency of w(j) in labeled (+) sentences
             1.1.1.3 F'(k) ← n'(j) // frequency of w(j) in labeled (-) sentences
             1.1.1.4 m = m + 1; // count topic words
             1.1.1.5 W(si) = W(si) + log(F(k)/n); // information significance in labeled (+) sentences
             1.1.1.6 W'(si) = W'(si) + log(F'(k)/n'); // information significance in labeled (-) sentences
   1.2 W(si) = W(si) + log(n/(n + n')) + log(m/n) + log(Pos(si)/n); // probability of si with the labeled (+) class
   1.3 W'(si) = W'(si) + log(n'/(n + n')) + log(m/n') + log(Pos(si)/n'); // probability of si with the labeled (-) class
   1.4 If W(si) > W'(si) then
       1.4.1 T = T ∪ {si};
   1.5 m = 0;

Figure 4. Sentence extraction algorithm.
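The following Python sketch mirrors the TSBN loop above (a sketch under our reconstruction of steps 1.2 and 1.3; the table K is modeled as two dictionaries, and the get(..., 1) fallback stands in for smoothing of unseen counts, which the algorithm does not specify):

    import math

    def tsbn_extract(sentences, V, F_pos, F_neg, n_pos, n_neg):
        """Select sentences whose (+)-class score beats their (-)-class score.

        sentences: token lists from Split(C)
        V:         set of topic words (nouns)
        F_pos[w], F_neg[w]: frequency of w in (+)- and (-)-labeled sentences
        n_pos, n_neg:       number of (+)- and (-)-labeled sentences
        """
        summary = []
        for i, s in enumerate(sentences, start=1):
            w_pos = w_neg = 0.0
            m = 0
            pos = 1.0 / i                                 # Pos(s_i)
            for tok in s:
                if tok in V:                              # topic word found in K
                    m += 1
                    w_pos += math.log(F_pos.get(tok, 1) / n_pos)   # step 1.1.1.5
                    w_neg += math.log(F_neg.get(tok, 1) / n_neg)   # step 1.1.1.6
            if m == 0:
                continue                                  # no topic words to score
            w_pos += math.log(n_pos / (n_pos + n_neg)) + math.log(m / n_pos) \
                     + math.log(pos / n_pos)              # step 1.2
            w_neg += math.log(n_neg / (n_pos + n_neg)) + math.log(m / n_neg) \
                     + math.log(pos / n_neg)              # step 1.3
            if w_pos > w_neg:                             # step 1.4
                summary.append(s)
        return summary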


5 EXPERIMENTS

5.1 Corpus

For the experiments we used Vietnamese text, and with the proposed algorithms we also developed an automatic text summarization system to perform summarization effectively and conveniently. During our study of Vietnamese text, we built a Vietnamese text corpus for summarization experiments by selecting documents from Vietnamese websites such as http://thongtincongnghe.com, http://echip.com, http://vnexpress.net, http://vietnamnet.vn, http://tin247.com, and others. The corpus currently includes more than 300 Vietnamese texts with 11,670 sentences.

Figure 5 below shows the documents, stored as .txt files, in the database.

Figure 5. Stored training set in database system.

For each document in the training set, a human classified the sentences into the two classes; in total there are 1,200 labeled (+) sentences. Figure 6 below illustrates the labeling process of the training phase: a sentence assigned the value 1 carries the label (+), and a sentence assigned the value 0 carries the label (-).


Figure 6. Labeling sentence phase.

5.2 Building topic words dictionary

To build the topic words dictionary, we used a Vietnamese POS tagging tool. Figure 7 illustrates a text after POS tagging.

Figure 7. Text after tagging.

Next, we extracted all the nouns (labeled "N") from the tagged files. All nouns are stored in the database. Figure 8 illustrates the topic words dictionary, with its score updated for each class, (+) and (-).


Figure 8. Topic words dictionary.

5.3 Results and evaluation

With the proposed method, we built a test system. The input (original) text, after preprocessing, is saved as a .txt file in the system. Figure 9 shows an original text.

Trong những năm qua, botnet đã trở thành nguồn thu nhập ổn định của các nhóm tội phạm mạng bởi chi phí bỏ ra thấp trong khi việc kiểm soát chúng ngày càng trở nên dễ dàng hơn. Maria Garnayera và Alexey Kadiev, hai chuyên gia của Kaspersky Lab, đã đề cập những phát hiện của mình về mô hình hoạt động và kinh doanh của các botnet phổ biến trong bài viết “Botnet quảng cáo”, đăng trên website uy tín chuyên về bảo mật securelist.com. Việc tạo ra lượng truy cập giả mang đến nguồn lợi nhuận thường xuyên cho botnet. Các nhà quảng cáo muốn gia tăng lượng người truy cập vào website của mình bằng cách đặt nhiều liên kết đến trang web hiển thị trên những website thuộc mạng lưới quảng cáo. Họ phải trả tiền cho chủ sở hữu mạng lưới quảng cáo một mức phí cho mỗi lần có người dùng kết nối trực tiếp đến website của mình hoặc gián tiếp đi từ website của hệ thống quảng cáo. Tuy nhiên, mô hình kinh doanh này sẽ trở nên bất hợp pháp nếu người tham gia trong mạng lưới quảng cáo sử dụng một botnet để tạo lượng truy cập giả vì lợi nhuận. Một ví dụ điển hình của botnet quảng cáo là mạng ma Artro nổi lên từ năm 2008 và không ngừng phát triển. Tháng 1/2011 Kaspersky Lab phát hiện ra Trojan Dowloader.Win32.CodecPack được tìm thấy ít nhất một lần trên máy tính của khoảng 140.000 người dùng. Theo số liệu từ những tập tin đăng nhập, máy tính tại 235 quốc gia trên toàn cầu đã bị lây nhiễm trojan này. Các bot Artro liên tục chuyển hướng để người dùng nhấp chuột vào liên kết dẫn đến website đang được quảng cáo, tạo ra lượng truy cập đáng kể. Tội phạm mạng cũng sử dụng malware download từ bên thứ ba lây nhiễm cho máy tính cá nhân để kiếm tiền. Số lượng máy tính bị lây nhiễm và lượng người download khó kiểm soát khiến botnet CodecPack hay Artro trở thành công cụ hết sức nguy hiểm trong tay các nhóm tội phạm mạng. Qua thống kê số lần nhấp chuột trung bình một ngày mà mỗi module thực hiện trên các liên kết, mức thanh toán trung bình từ dịch vụ quảng cáo cho mỗi cú nhấp chuột, số lượng bot tối thiểu trên mạng ma, lợi ích khi bán lượng truy cập, hai chuyên gia Kaspersky Lab đã ước tính thu nhập của tội phạm mạng sở hữu botnet có thể đạt được từ 1.000 đến 2.000 USD một ngày.

Figure 9. Original text.


Figure 10 below shows the result of the system: the original text has 13 sentences, and the text summary (target text) has 6 sentences.

Maria Garnayera và Alexey Kadiev, hai chuyên gia của Kaspersky Lab, đã đề cập những phát hiện của mình về mô hình hoạt động và kinh doanh của các botnet phổ biến trong bài viết “Botnet quảng cáo”, đăng trên website uy tín chuyên về bảo mật securelist.com. Họ phải trả tiền cho chủ sở hữu mạng lưới quảng cáo một mức phí cho mỗi lần có người dùng kết nối trực tiếp đến website của mình hoặc gián tiếp đi từ website của hệ thống quảng cáo. Một ví dụ điển hình của botnet quảng cáo là mạng ma Artro nổi lên từ năm 2008 và không ngừng phát triển. Theo số liệu từ những tập tin đăng nhập, máy tính tại 235 quốc gia trên toàn cầu đã bị lây nhiễm trojan này. Tội phạm mạng cũng sử dụng malware download từ bên thứ ba lây nhiễm cho máy tính cá nhân để kiếm tiền. Qua thống kê số lần nhấp chuột trung bình một ngày mà mỗi module thực hiện trên các liên kết, mức thanh toán trung bình từ dịch vụ quảng cáo cho mỗi cú nhấp chuột, số lượng bot tối thiểu trên mạng ma, lợi ích khi bán lượng truy cập, hai chuyên gia Kaspersky Lab đã ước tính thu nhập của tội phạm mạng sở hữu botnet có thể đạt được từ 1.000 đến 2.000 USD một ngày.

Figure 10. Target text.

At present there is no automatic evaluation system for Vietnamese text summarization. Therefore, we use the precision measure and compare our method with several others to assess its effectiveness. Besides a baseline method, we compare against an automatic Vietnamese summarization system available at http://labs.baomoi.com/demoTS.aspx (called VTSonline) and another automatic summarizer at http://www.textcompactor.com/ (called Textcompactor).

Because the other methods often work with a summary rate, we implement one to make the comparison possible. The summary rate r is defined by the following formula:

$$r = \frac{\text{length of target text}}{\text{length of original text}} \times 100\% \quad (6)$$
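For instance, if length is counted in sentences, the summary in Figure 10 (6 sentences) of the original in Figure 9 (13 sentences) gives r = 6/13 ≈ 46%; this is only our illustrative reading, since the formula does not state whether length is counted in sentences, words, or characters.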

Table 1 below is a comparison of these methods.

Table 1. Experimental results (precision at each summary rate).

Method          80%    60%    40%    20%
Ours            0.88   0.86   0.82   0.60
HLT             0.82   0.75   0.69   0.54
Baseline        0.81   0.80   0.84   0.63
Textcompactor   0.85   0.82   0.65   0.57
VTSonline       0.72   0.68   0.51   0.48

The results are shown in the chart below (Figure 11).

Figure 11. Comparison with some methods.

6 CONCLUSION

The fundamental problem when processing a single syllable language is word processing. When building automatic text summarization, text classification, or text retrieval systems, a word segmentation tool is usually embedded in the system, so the processing time spent on words is long, especially when working with many texts. In addition, segmentation tools are not highly accurate, and they do not address the problem of noise in texts.

In this paper, we presented a method that uses Naïve Bayes and a topic word set to solve the single syllable language text summarization problem. The experimental results show that the proposed method resolves some of the problems of single syllable text, reduces processing time and computational complexity, and yields higher-quality summaries.

With this approach, we expect to extend the method to cross-language summarization for some single syllable languages and to build an effective automatic cross-language summarizer in the future.


REFERENCES

[1]. Aone, C., Gorlinsky, J., Larsen, B., and Okurowski, M. E., A Trainable Summarizer with Knowledge Acquired from Robust NLP Techniques, in Mani, I. and Maybury, M. (eds.), Advances in Automatic Text Summarization, MIT Press, Cambridge, MA, pp. 71-80, 1999.

[2]. Baxendale, P. Machine-made index for technical literature - an experiment, IBM Journal of Research Development, 2(4):354-361, 1958.

[3]. C. Lopez, V. Prince, and M. Roche, Text titling application (demonstration session, to appear), in Proceedings of EKAW’10, 2010.

[4]. Daniel Jurafsky & James, Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition, Prentice Hall, 2008.

[5]. Dipanjan Das, Andre F.T. Martins, A Survey on Automatic Text Summarization, November 21, 2007.

[6]. Edmundson, H. P., New methods in automatic extracting, Journal of the Association for Computing Machinery, 16(2), pp. 264-285, 1969.

[7]. Ha. N.T.T, Quynh. N. H, Tao. N. Q, A new method for calculating weight of sentence based on amount of information and linguistic score, International Journal of Advanced Computer Engineering, Vol.4 No.2, pp. 91-95, 2011.

[8]. Ha. N.T.T, Quynh. N. H, Khanh N.T.H, Hung L.M, Optimization for Vietnamese Text classification problem by reducing feature set, Proc of 6th International Conference on New Trends in Information Science, Service Science and Data Mining, pp 209-212, 2012.

[9]. Ha. N.T.T, Quynh. N. H, Tu. N.N, A Supervised Learning Method Combine with Dimensionality Reduction in Vietnamese Text Summarization, Proc IEEE of Computer, communication and application 2013, pp 69-73, 2013

[10]. Hovy, E. and Lin, C. , Automated text summarization and the summarist system, TIPSTER '98 Proceedings of a workshop on held at Baltimore, Maryland: October 13-15, 1998, pp.197–214, 1998.

[11]. Hsin-Hsi Chen, Jen-Chang Lee, Identification and Classification of Proper Nouns in Chinese Text in COLING’96, Aug, 1996, Vol.1, pp 222-229.

[12]. Hu, P., He, T., Ji, D., and Wang, M. A Study of Chinese Text Summarization Using Adaptive Clustering of Paragraphs, Proc. 4th Intern. Conf. Computers and Information Technology (CIT’04), Wuhan, China, September 14– 16, 2004, pp. 1159–1164.

[13]. http://labs.baomoi.com/demoTS.aspx

[14]. http://swesum.nada.kth.se/index-eng.html


[15]. http://vlsp.vietlp.org:8080/demo/

[16]. http://www.textcompactor.com/

[17]. Jian-Zhou Liu, Ting-Ting He, and Dong-Hong Ji. 2003. Extracting Chinese term based on open corpus. In Proceedings of the 20th International, Conference on Computer Processing of Oriental Languages, pp 43-49. ACM, New York.

[18]. John M. Conroy, Judith D. Schlesinger, and Dianne P. O'Leary, Using HMM and Logistic Regression to Generate Extract Summaries for DUC, DUC 2001.

[19]. Jun-Jie Li, Key-Sun Choi, Corpus-Based Chinese Text Summarization System, International Conference on Research on Computational Linguistics, Serial 10, Taiwan, pp. 237-241, 1997.

[20]. Karel Jezek, Josef Steinberger, Automatic Text Summarization (The state of the art 2007 and new challenges), Znalosti, 2008, ISBN 978-80-227-2827-0.

[21]. Kupiec, J., Pedersen, J., and Chen, F., A trainable document summarizer, in Proceedings of SIGIR '95, pp. 68-73, New York, NY, USA, 1995.

[22]. Kuang-hua Chen and Hsin-Hsi Chen, Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation, Proceedings of the 32nd ACL Annual Meeting, pp. 234-241, 1994.

[23]. Lin, C.-Y. Training a selection function for extraction. In Proceedings of CIKM '99, pages 55-62, New York, NY, USA, 1999.

[24]. Luhn, H. P, The Automatic Creation of Literature Abstracts, In: I. Mani, & M. T.Maybury (eds.): Advances in automatic text summarization. Cambridge, MA: The MIT Press , pp. 15–21, 1999

[25]. Luhn, H. P. The automatic creation of literature abstracts. IBM Journal of Research Development, 2(2):159 -165, 1958.

[26]. Mani, I. and Maybury, M., Advances in Automatic Text Summarization, MIT Press, Cambridge, MA, 1999.

[27]. Maria Soledad Pera, Yiu-Kai Ng, A Naïve Bayes classifier for web document summaries created by using word similarity and significant factors, International Journal on Artificial Intelligence Tools, Vol. 19, No. 4, pp. 465-486, 2010.

[28]. M. Osborne, Using Maximum Entropy for Sentence Extraction, in Proceedings of the ACL Workshop on Automatic Summarization, Philadelphia, Pennsylvania, USA, July 11-13, 2002.

[29]. M.L. Nguyen, Statistical Machine Learning Approach to Cross Language Text Summarization, Doctoral Thesis. Japan Advance Institute of Science and Technology, 2004.

[30]. M.L. Nguyen, Shimazu, Akira, Xuan, Hieu Phan, Tu, Bao Ho, Horiguchi, Susumu, Sentence Extraction with Support Vector Machine Ensemble, Proceedings of the First World Congress of the International Federation for Systems Research : The New Roles of Systems Sciences For a Knowledge-based Society 2005-11.


[31]. Makoto Hirohata, Yousuke Shinnaka, Koji Iwano and Sadaoki Furui, Sentence extraction – based presentation summarization techniques and evaluation metrics, ICASSP 2005, pp. I – 1065- I – 1068, 2005.

[32]. Mark Andrews, Gabriella Vigliocco, The Hidden Markov Topic Model: A Probabilistic Model of Semantic Representation, Topics in Cognitive Science 2 101–113, 2010.

[33]. Nenkova, A. Automatic text summarization of newswire: Lessons learned from the document understanding conference. In Proceedings of AAAI 2005,Pittsburgh, USA.

[34]. Po Hu, Tingting He, Donghong Ji, Chinese Text Summarization Based on Thematic Area Detection, ACL’04 workshop: Text Summarization Branches Out, Barcelona, Spain, Association for Computational Linguistics, pp.112-119, 2004.

[35]. Stanislaw Osinski, Dimensionality Reduction Techniques for search result clustering, Msc thesis, Department of Computer Science, The University of Sheffield, UK, 2004.

[36]. Svore, K., Vanderwende, L., and Burges, C. Enhancing single-document summarization by combining RankNet and third-party sources. In Proceedings of the EMNLP-CoNLL, pages 448-457, 2007.

[37]. Thanh, Le Ha; Quyet, Thang Huynh; Chi, Mai Luong, A Primary Study on Summarization of Documents in Vietnamese, Proceedings of the First World Congress of the International Federation for Systems Research : The New Roles of Systems Sciences For a Knowledge-based Society 2005-11.

[38]. Yanmin Chen, Xiaolong Wang, and Yi Guan, Automatic Text Summarization Based on Lexical Chains, ICNC 2005, LNCS 3610, pp. 947-951.

[39]. Xiao-yu Jiang, Xiao zhong Fan, Kang Chen, Chinese Text Classification Based on Summarization Technique, Third International Conference on Semantics, Knowledge and Grid, 2007, pp. 362-365.

Received: June 15, 2013