[ieee 2014 ieee/acis 13th international conference on computer and information science (icis) -...

Machine Learning Tool and Meta-heuristic Based on Genetic Algorithms for Plagiarism Detection over Mail Service

Hadj Ahmed Bouarara (1), Reda Mohamed Hamou (2), Amine Rahmani (3), Abdelmalek Amine (4)

GeCode Laboratory, Department of Computer Science, Tahar Moulay University of Saida, Algeria

Abstract—Plagiarism is one of the most pressing problems that computer science is trying to solve. In this article we present a new approach for automatic plagiarism detection in the context of mail services. Our system represents texts with character n-grams and uses TF-IDF weighting to measure the importance of each term in the corpus. We combine two machine learning methods to decide whether a document is plagiarized or not, and we use the PAN 09 corpus to build and evaluate the prediction model. We then simulate a meta-heuristic method based on genetic algorithms, varying its parameters, to see whether it can improve the results. The main objective of our work is to protect intellectual property and improve the efficiency of plagiarism detection systems.

Keywords—Plagiarism Detection, Machine Learning, Email Service, Meta-heuristics.

I. INTRODUCTION AND PROBLEMATIC

We live in a world where information is everywhere, but it is sometimes difficult to find and identify its original author. In the digital society it is easy to find text to plagiarize; such texts may come from the Internet, publishers and other content providers. The Internet contains readily available texts that people can reuse in their own writing by simple copy and paste, and many websites offer ready-made articles and documents. These sites are ideally suited to plagiarists, as are the means of communication over the Internet that allow information to be exchanged in different formats (text, audio, video, images) through specific tools.

A few years ago, the only way to detect plagiarism was to examine each document manually, which is usually a slow process. For example, in an international conference, candidates submit their papers by e-mail in PDF format and each article must be analysed manually by a domain expert to decide whether it is plagiarized or not. This analysis can take a long time, especially when the number of documents is very large. For this reason it is important to introduce a plagiarism detector and integrate it into the e-mail service.

II. STATE OF THE ART

In this section we present the background of our work. E-mail dates from the 1960s, and over the following years it has become the most popular internet application in the world thanks to its fast transmission of messages, the ability to send messages even when the recipient is not connected, and the archiving of messages. Each user has a mail quota that represents the maximum disk space allocated for storing e-mails in his mailbox on the mail server [1].

With the arrival of the World Wide Web, the mass of available information has increased, especially textual information. With the availability of handwritten documents in electronic format, unstructured information has become predominant on the web and the reuse of text has become much easier, which has led over time to a very large number of plagiarism cases. Plagiarism is the act of stealing the ideas, expressions or images of someone else or of an external source, in the same language or in a different language [3]. There are different types of plagiarism: verbatim plagiarism, when the plagiarist copies words or sentences from an external source as they are; paraphrase, when the words or the syntax of the copied sentences are changed; and finally the cases that are the most difficult to detect, plagiarism of ideas and plagiarism with translation.

Since 1990, the availability of more powerful hardware, such as large-capacity disks and faster processors, has given birth to mechanisms and tools for the automatic processing of information, such as data mining for structured data and text mining for unstructured and semi-structured data. For this reason a lot of research has been launched in the domain of automatic plagiarism detection, and two major techniques have emerged. External plagiarism detection compares a suspect document against collections of external source documents in order to identify plagiarized passages. Internal plagiarism detection, on the other hand, identifies changes of style within a document without taking external sources as reference [2].

978-1-4799-4860-4/14/$31.00 copyright 2014 IEEE ICIS 2014, June 4-6, 2014, Taiyuan, China

III. RELATED WORKS

This section is devoted to existing work on automatic plagiarism detection. This domain has attracted considerable attention from research and industry, and various papers have been published on the subject. For example, N. Akiva (2011) [4] uses an internal plagiarism detection method based on clustering to identify outlier chunks of text, and M. Zechner, M. Muhr and R. Kern (2009) [2] use the vector space model to determine the similarity between two documents, based on an analysis of outlier values, to detect the plagiarized passages. There are also approaches proposed by Y. Palkovskii, A. Belov and I. Muzyka (2011) [10] and by Uzuner, Katz and Nahnsen (2005) [8], based on the semantic similarity between the words of the documents and on linguistic information, using WordNet and electronic dictionaries.

We observed in the literature that many papers are based on pertinent characters and n-gram text representation (Bloomfield (2004); C. Basile (2009); J. Kasprzak, M. Brandejs and M. Kripac (2009)) [6] [3] [14], where the degree of similarity is calculated from the number of identical characters between the suspicious and source documents. In addition, S. Meyer zu Eissen, B. Stein and M. Kulig (2007) and E. Stamatatos (2009) [11] [7] proposed new systems using character n-gram profiles without reference collections, based on the changes of style within a text, under the assumption that each document has a dominant style.

Some researchers, such as A. Barron-Cedeno, P. Rosso, D. Pinto and A. Juan (2011), hybridize the internal and external techniques. A method based on information retrieval was introduced by A. Ghosh, P. Bhaskar and S. Pal (2011) [5].

IV. OUR APPROACH

E-mail services now include cloud storage functionality, providing the ability to store, retrieve and send data (documents, pictures, ...). To use an e-mail service (such as Gmail, Yahoo or Hotmail), a client must have an Internet connection and an e-mail account on the mail server protected by a user name and password.

We are interested in the textual documents saved by the customer, which are stored in his mail quota on the server. When he sends a document to someone, this document is stored on the recipient's mail server until the recipient retrieves his messages. Most servers have a storage limit that cannot be exceeded, and this limit varies from one server to another: it is usually around 100-500 MB for some enterprise servers and virtually unlimited for some e-mail providers such as Gmail (which offers more than 7600 MB), Yahoo (unlimited) and GMX. Asking a human to analyze every document in a set manually to decide whether it is plagiarized is a process that can take a long time. To save time and to limit the transfer of plagiarized documents, our solution is to integrate a plagiarism detector located between the customer and his mail server: each sent document is verified by the system, and the server informs the recipient of the outcome of the analysis, as shown in figure 1.

E-mail is one of the most popular means of communication, through which hundreds of documents are transferred. To reduce plagiarism and to warn e-mail users of the status of every document stored or transferred, we had the idea of integrating a plagiarism detector as an e-mail service. Once the document is parsed, if it is identified as plagiarism it is stored in a box called plagiarism; otherwise it is stored as a normal message in the inbox (in the same way that messages are sorted into spam and ham).

Fig. 1. The different steps of our system.

Our system can be used by all users of the e-mail service, for example when a teacher asks students to send their work or reports by e-mail, when a university asks students to send their theses by e-mail, when a conference receives researchers' articles by e-mail, or when a user wants to store his paper. Once the documents (presentations, end-of-study theses and articles) are received, the system automatically files them either as plagiarism messages or as normal messages.

Our main contribution is a plagiarism detector that uses a pseudo-boosting algorithm based on two learning algorithms; we then try to optimize the obtained results using genetic algorithms (GAs).

V. ARCHITECTURE OF THE SYSTEM

We implement a plagiarism detector over the e-mail service based on two principal modes: an authentication mode, which verifies the identity of the user, and a storage/mailing mode, which handles how a document is transferred or stored and checks whether the content of a textual document can be considered as copy-paste or not.

A. Authentication Mode

This mode appears at the start of the system and contains two mechanisms: the first is the inscription (registration) mechanism, reserved for new users who do not yet have an account in our system, and the second is the connection mechanism, which represents the passage from the first mode to the second.

Fig. 2. Inscription process.

Figure 2 illustrates the inscription mechanism of our system, which works as follows. When a new user decides to use the system he must register. A form appears asking for the user's information (first name, last name, username and password), which is then sent via an HTTP request. The system checks whether the username exists in the users database. If it exists, it is already used by another user and an error message is returned. If it does not exist, the system hashes the password so that it is not stored in clear text, which limits what an attacker can recover, for example through an SQL injection attack, and stores the information in the users database after creating a path named after the username in every datacenter of the service. A user who already has an account must log in to use the storage, mailing and plagiarism detection functionality [12].

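The hash code used in this case is the one provided by the Java language, which uses the following formula:

$$H(s) = \sum_{i=0}^{n-1} s_i \cdot 31^{\,n-1-i} \qquad (1)$$

where $s_i$ is the code of the i-th character of the string $s$ and $n$ is its length.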
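As an illustration of formula (1), the following Python sketch emulates the arithmetic of Java's `String.hashCode()`. The function name `java_string_hash` is hypothetical and is not part of the described system; the sketch assumes well-formed input text and reproduces Java's 32-bit signed overflow behaviour explicitly, since Python integers do not overflow.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode(): sum of s[i] * 31^(n-1-i) in 32-bit signed arithmetic."""
    data = s.encode("utf-16-be")  # Java strings are sequences of UTF-16 code units
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]   # rebuild one 16-bit code unit
        h = (31 * h + unit) & 0xFFFFFFFF      # Horner's rule, keeping only the low 32 bits
    # Reinterpret the unsigned 32-bit value as a signed Java int
    return h - (1 << 32) if h >= (1 << 31) else h


print(java_string_hash("abc"))  # 97*31^2 + 98*31 + 99 = 96354
```

Note that this function is a general-purpose, non-cryptographic hash: distinct strings such as "Aa" and "BB" produce the same value.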

B. Storage/Mailing Mode

Once the customer is connected, his user name and password have been verified by the mail server. When a client wants to send or store a document, the document is first transferred from the user to the plagiarism detector for analysis. Once the analysis is completed, the system attaches the result to the document and transfers it to the server, as shown in figure 1. Depending on the customer's choice, the document is either stored on the mail server or sent to a destination; in the latter case it is transferred to the recipient's mail server so that the recipient can consult it later [13].

The plagiarism detection problem will be treated in two ways as detailed in the rest of this section:

C. Machine Learning Tool

We are facing a two-class problem, so we chose to combine two supervised learning algorithms: the decision tree (C4.5), which produces a model in the form of decision rules, and the K-Nearest Neighbour (KNN), which does not produce a model.

1) Decision Tree (C4.5)

The C4.5 algorithm is an improvement of ID3 used to produce classification procedures that are understandable to the user: a decision tree graphically represents a set of rules and is easily interpretable. It can process all types of data (quantitative and qualitative), classifies quickly compared to other algorithms (neural networks, SVM) and is robust to noise and missing values. The method has two phases: the first constructs the decision tree recursively by dividing the training set using the gain ratio, and the second is a pruning phase used to address over-fitting [9].
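To make the splitting criterion concrete, here is a minimal Python sketch of the gain ratio for a single categorical attribute. It is only an illustration under simplifying assumptions: the function names and the toy data are hypothetical, and a full C4.5 implementation also handles continuous attributes, missing values and pruning, none of which are shown.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in items."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of a categorical attribute (values) with respect to the class labels."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Expected entropy of the class after splitting on the attribute
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalizes attributes with many distinct values
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["plagiarized", "plagiarized", "original", "original"]
informative = ["high", "high", "low", "low"]   # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]           # tells nothing about the class
print(gain_ratio(informative, labels), gain_ratio(uninformative, labels))
```

At each node, the tree-building phase would pick the attribute with the highest gain ratio, here the first one.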

2) K-Nearest Neighbour (KNN)

K-nearest neighbour is among the simplest supervised classification methods. It classifies new examples on the basis of their similarity to the labelled examples of the learning base, which are stored in memory. It is easy to implement and easy to understand [9].

Algorithm

Input:
    A set of learning data D and the number of neighbours K.
    A similarity function to compare the new case with the cases already classified.
    A new example n.

Begin
    For each (x′, c) ∈ D do
        Calculate the distance dis(n, x′)
    End for
    For each x′ ∈ kNN(n) do
        Count the number of occurrences of each class
    End for
    Assign to n the most frequent class;
End.

In our case we chose K = 1 and the Euclidean distance as the function used to calculate the similarity.
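The algorithm above can be sketched in a few lines of Python. This is an illustrative version, not the authors' Java implementation: the function names and the two-dimensional toy vectors are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training, new_example, k=1):
    """training is a list of (vector, label) pairs; returns the majority label of the k nearest."""
    neighbours = sorted(training, key=lambda pair: euclidean(pair[0], new_example))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


training = [
    ([0.9, 0.1], "plagiarized"),
    ([0.8, 0.2], "plagiarized"),
    ([0.1, 0.9], "original"),
]
print(knn_classify(training, [0.2, 0.8], k=1))  # original
```

With K = 1 the vote reduces to copying the label of the single closest training document, which is the setting used here.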

Our idea for detecting plagiarism using data mining is composed of three principal steps:

a. Preprocessing and indexing

• Eliminate the special characters, punctuation and numbers.

• Use character n-grams to represent the documents of our corpus, varying the parameter N (2, 3, 4 and 5-gram characters). We chose this method because it is multilingual, does not depend on stop-word removal or lemmatization, is not sensitive to misspellings and can represent texts that have no word separator.

• Build a list of terms. Each document is represented by a vector whose components give the importance of each term in the corpus, calculated by TF-IDF weighting.
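The preprocessing and indexing steps can be sketched as follows. The text does not state which TF-IDF variant is used, so this example assumes relative term frequency multiplied by log(N / document frequency); the function names are hypothetical, and the way spaces are handled inside n-grams is also an assumption.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lower-case the text, drop digits/punctuation/special characters, return character n-grams."""
    kept = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    cleaned = " ".join(kept.split())  # collapse runs of spaces
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def tfidf_vectors(documents, n=3):
    """Return (vocabulary, vectors): one TF-IDF vector per document over the n-gram vocabulary."""
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    doc_freq = Counter(term for c in counts for term in c)  # number of documents containing each term
    num_docs = len(documents)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append([(c[t] / total) * math.log(num_docs / doc_freq[t]) for t in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(["abcd", "abce"], n=3)
print(vocabulary)   # ['abc', 'bcd', 'bce']
print(vectors[0])   # 'abc' occurs in every document, so its weight is 0
```

An n-gram shared by all documents gets weight zero, while n-grams specific to one document receive the highest weights, which is what lets the classifiers separate documents.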

b. Learning and Testing Phase 1

In this step we use 10% of our PAN 09 corpus as the learning set, which is the input of our first algorithm, the C4.5 decision tree, to build our first learning model, and 90% of the documents for the evaluation and testing phase. We then calculate the performance measures: precision, recall, f-measure and entropy.
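For reference, precision, recall and f-measure can be computed from the confusion counts as in the sketch below. It uses the standard definitions, with the f-measure as the harmonic mean of precision and recall; the counts in the example are made up for illustration and are not results from this study. Entropy is left out because its exact definition is not given in the text.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, f-measure) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


# Illustrative counts only: 90 plagiarized documents found, 10 false alarms, 90 missed
print(precision_recall_f(tp=90, fp=10, fn=90))
```

In this example precision is 0.9 and recall is 0.5, and the f-measure (about 0.64) sits closer to the weaker of the two, which is why it penalizes classifiers that trade one measure for the other.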

c. Learning and Testing Phase 2

The documents correctly classified by the first prediction model are used as a new learning base for the second algorithm, KNN, to construct our second model, and the misclassified documents are reclassified with it, as shown in figure 3.

Fig. 3. Architecture of our plagiarism detector system.
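The two phases can be read as the following pipeline, sketched in Python under stated assumptions. The first-stage model is passed in as a plain function (here a hypothetical threshold rule standing in for the trained C4.5 tree), the second stage is 1-NN with Euclidean distance, and all names and vectors are illustrative. As in the description, the true labels of the evaluated documents are needed to tell which ones the first stage misclassified.

```python
import math


def nearest_label(training, vector):
    """1-NN with Euclidean distance over (vector, label) pairs."""
    return min(training, key=lambda pair: math.dist(pair[0], vector))[1]


def two_stage_classify(first_stage, labelled_docs):
    """labelled_docs is a list of (vector, true_label); returns the final predicted labels."""
    first_pass = [first_stage(vec) for vec, _ in labelled_docs]
    # Documents the first model classified correctly become the learning base of the second model
    knn_base = [(vec, label)
                for (vec, label), pred in zip(labelled_docs, first_pass) if pred == label]
    final = []
    for (vec, label), pred in zip(labelled_docs, first_pass):
        if pred == label or not knn_base:
            final.append(pred)                           # keep the first-stage decision
        else:
            final.append(nearest_label(knn_base, vec))   # reclassify the misclassified document
    return final


# Hypothetical stand-in for the decision tree: a single threshold on the first feature
def stub_tree(vec):
    return "plagiarized" if vec[0] > 0.5 else "original"


docs = [([0.9, 0.1], "plagiarized"), ([0.1, 0.9], "original"), ([0.45, 0.2], "plagiarized")]
print(two_stage_classify(stub_tree, docs))
```

In the example the third document is missed by the threshold rule but recovered by the nearest-neighbour step, because it lies closer to the correctly classified plagiarized document.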

D. Meta-heuristic Method

Looking at the problem we are facing, we recognize a combinatorial problem. For this reason we turned to a meta-heuristic method using GAs, to see whether a bio-inspired approach based on randomness can improve the quality of our system.

A meta-heuristic is an optimization algorithm used to solve difficult problems (often from the fields of operations research, engineering or artificial intelligence) for which no more efficient classical method is known, and which guides the search towards an optimal solution. These methods do not guarantee optimality, and can include mechanisms to avoid getting trapped in local regions of the search space. There are different meta-heuristic methods, such as ant colony optimization, tabu search and simulated annealing. To see whether a meta-heuristic method can give better results than those obtained by the machine learning methods, we use genetic algorithms to simulate an optimization of the results [15].

Genetic algorithms combine the theory of evolution and modern genetics, and are based on the principles of natural evolution. We start with an arbitrarily chosen initial population of potential solutions (chromosomes) and evaluate their performance; a new population of potential solutions is then created using reproduction operators [16]. We integrate this approach as follows:

• Initial population: we start with a randomly initialized population of individuals (chromosomes), each individual being made up of 6 parameters: true positives (TP), true negatives (TN), false positives (FP), false negatives (FN), f-measure and entropy.

• Coding: we use binary encoding because it gave the best results compared to the other types.

• Evaluation: we evaluate the performance of each individual with a fitness function. We tried different combinations of the parameters until we found the following fitness function, which gave the best results:

F(x) = f-measure / entropy (2)

For example, given two individuals, if F(1) > F(2) then individual 1 is better adapted than individual 2. Based on these results, we create a new population of potential solutions using simple evolutionary operators: selection, crossover and/or mutation.

• Stop criterion: repeat this cycle for a fixed number of iterations.
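The cycle described above can be sketched as a small generic genetic algorithm in Python. This is a hedged illustration rather than the authors' implementation: individuals are plain bit lists, selection is by tournament, and one elite individual is copied unchanged into each generation (an addition made here so that the best fitness never decreases). The fitness used in the example is a stand-in that counts the 1 bits; in the system described here it would be replaced by F(x) = f-measure / entropy computed from the decoded individual. The default crossover probability 0.4 and mutation probability 0.001 follow values reported in this paper, while the population size, chromosome length and iteration count are arbitrary small values.

```python
import random


def tournament(population, fitness, rng, size=2):
    """Pick `size` individuals at random and return the fittest one."""
    return max(rng.sample(population, size), key=fitness)


def evolve(fitness, length=16, pop_size=20, iterations=50, pc=0.4, pm=0.001, seed=0):
    """Evolve binary chromosomes and return the best individual of the last generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(iterations):  # stop criterion: fixed number of iterations
        best = max(population, key=fitness)
        next_gen = [best[:]]  # elitism (assumption of this sketch): keep the current best
        while len(next_gen) < pop_size:
            a = tournament(population, fitness, rng)
            b = tournament(population, fitness, rng)
            child = a[:]
            if rng.random() < pc:  # single-point crossover
                cut = rng.randrange(1, length)
                child = a[:cut] + b[cut:]
            # Bit-flip mutation, applied independently to each bit
            child = [bit ^ 1 if rng.random() < pm else bit for bit in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


best = evolve(fitness=sum)  # stand-in fitness: number of 1 bits
print(best, sum(best))
```

Swapping `tournament` for a rank-based or elitist selection function is enough to compare the selection methods studied in the experiments.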

For the implementation of our system we used the Java programming language and SQLite.

VI. CORPUS

The PAN 09 corpus (PAN-PC-09) is a corpus for plagiarism detection used in the PAN 09 international plagiarism detection competition. It contains a set of raw plagiarized documents, into which plagiarism has been inserted manually, and non-plagiarized text documents (source documents).

The corpus is built from 22,874 documents that are in the public domain, so it is available to other researchers as a benchmark. 50% of the documents are small (1-10 pages), 35% are medium-sized (10-100 pages) and 15% are large (100-1000 pages). 90% of the documents are monolingual English. 50% of the documents are designated as suspicious documents and 50% as source documents. The length of a plagiarism case is equally distributed between 50 and 5000 words. In our case we selected only 1000 documents: 600 plagiarized papers and 400 non-plagiarized papers [18].

VII. RESULTS AND DISCUSSION

A. Data Mining Tool Results

To discuss the quality of the results we rely on the experimental protocol. This part reports the results obtained after applying the different classifiers (decision tree C4.5 alone, and C4.5+KNN) to our PAN 09 corpus with different text representations (2, 3, 4 and 5-gram characters). To compare the performance of the various techniques we use the evaluation measures precision, recall, f-measure (F) and entropy (E); the results are detailed in the following table.

TABLE 1. RESULT OF F-MEASURE AND ENTROPY WITH 2, 3, 4 AND 5-GRAM CHARACTER.

Algorithm     F-measure
              2-gram    3-gram    4-gram    5-gram
C4.5          0.771     0.882     0.821     0.795
C4.5+Knn      0.820     0.891     0.847     0.851

Fig. 4. F-measure of C4.5 and C4.5+Knn with variety of N-gram.

Table 1 and figure 4 show that the method C4.5+Knn with 3-gram characters for the representation of texts gives a better f-measure = 0.889 compared with the use of the C4.5 algorithm alone (yellow box), and the worst one is given using 2-gram characters with C4.5 (red box). Because the f-measure is the harmonic mean of precision and recall, it automatically penalizes text representations and algorithms that degrade precision and recall, or one to the detriment of the other [17].

The use of the f-measure alone cannot give an effective judgment. For this reason we use the entropy to calculate the loss of information and the information gain, as exhibited in table 2 and figure 5. As in the previous analysis, C4.5+KNN gives the best result (yellow box) with the 3-gram representation, while C4.5 with the 5-gram representation has the lowest performance in terms of entropy (red box).

TABLE 2. RESULT OF ENTROPY WITH 2, 3, 4 AND 5-GRAM CHARACTER.

Algorithm     Entropy
              2-gram    3-gram    4-gram    5-gram
C4.5          0.181     0.101     0.169     0.182
C4.5+Knn      0.16      0.10      0.124     0.116

Fig. 5. Entropy of C4.5 and C4.5+Knn with 2, 3, 4 and 5-gram character text representation.

By analyzing the results in the preceding tables and figures, we clearly see that the N parameter greatly influences the quality of the results obtained. The best performance is given with N = 3, with f-measure 0.891 and entropy 0.10 (yellow boxes), which is the ideal representation for the PAN 09 corpus, and the combination of C4.5 + KNN gives better results than C4.5 alone.

B. Genetic Algorithms Results

The following table summarizes the best results obtained after the application of genetic algorithms with the variation of the following parameters:

• Selection method (1/2 elitism, tournament, rank).

• Probability of crossover (PC) between 0.1 and 0.9.

• Initial population (IP) between 100 and 1000.

• Number of iterations (NI) between 100 and 1000.

• Mutation probability = 0.001.

TABLE 3. THE BEST RESULTS OF THE GENETIC ALGORITHMS WITH DIFFERENT PARAMETERS.

Selection method   NI    PC    IP    TP    FP    TN    FN    F       E
1/2 Elitist        700   0.3   600   400   223   250   127   0.944   0.147
Tournament         900   0.4   700   596   27    220   157   0.973   0.133
Rank               500   0.2   500   252   325   73    350   0.909   0.115

Fig. 6. Best results of the Genetic Algorithms with different parameters.

Observing table 3 and figure 6, we can clearly note that the tournament method gives better results than the two other selection methods, with f-measure = 0.973 and entropy = 0.133 (yellow boxes), using the following parameters: number of iterations = 900, crossover probability = 0.4 and initial population = 700.

VIII. CONCLUSION

We conclude with findings that can be used by other researchers in the future. Using C4.5 + KNN gives better results than C4.5 alone, and the 3-gram character representation gives the best results compared to 2, 4 and 5-grams. In our experiments, genetic algorithms based on randomness gave better results than the traditional learning methods in the domain of plagiarism detection. Parameters such as the number of iterations, the initial population, the crossover probability and the selection method strongly influence the quality of the result, and the best results are given by the tournament selection method.

Automatic plagiarism detection is a relatively new subject that attracts the attention of many researchers. It is a very difficult problem to solve because of the many languages in the world and the ambiguity of human language, which is the major problem. We can further improve our approach by integrating an automatic translator into our system in order to address multilingual plagiarism. We can also apply other meta-heuristic methods based on bio-inspired techniques, such as artificial fish swarm optimization, the firefly algorithm and other evolutionary algorithms.

REFERENCES

[1] K. Hamlen, M. Kantarcioglu, L. Khan, B. Thuraisingham, “Security Issues for Cloud Computing”, International Journal of Information Security and Privacy, 2010, pp. 39-51.

[2] M. Zechner, M. Muhr, R. Kern, “External and Intrinsic Plagiarism Detection Using Vector Space Model”, Proceedings of the SEPLN ’09 PAN 09 3rd Workshop and 1st International Competition on Plagiarism, San Sebastian (Donostia), Spain, vol. 502, pp. 47-55, 10 September 2009, ISSN 1613-0073.

[3] C. Basile, “A Plagiarism Detection Procedure in Three Steps: Selection, Matches and Squares”, Proceedings of the SEPLN ’09 PAN 09 3rd Workshop and 1st International Competition on Plagiarism, San Sebastian (Donostia), Spain, vol. 502, pp. 19-23, 10 September 2009, ISSN 1613-0073.

[4] N. Akiva, “Using Clustering to Identify Outlier Chunks of Text”, CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation, Amsterdam, September 2011, pp. 19-22.

[5] J. Kasprzak, M. Brandejs and M. Kripac, “Finding Plagiarism by Evaluating Document Similarities”, Proceedings of the SEPLN ’09 PAN 09 3rd Workshop and 1st International Competition on Plagiarism, San Sebastian (Donostia), Spain, vol. 502, pp. 24-28, 10 September 2009, ISSN 1613-0073.

[6] A. Ghosh, P. Bhaskar, S. Pal, S. Bandyopadhyay, “Rule Based Plagiarism Detection using Information Retrieval”, CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation, Amsterdam, September 2011, pp. 19-22.

[7] E. Stamatatos, “Intrinsic Plagiarism Detection Using Character n-gram Profiles”, Proceedings of the SEPLN ’09 PAN 09 3rd Workshop and 1st International Competition on Plagiarism, San Sebastian (Donostia), Spain, vol. 502, pp. 38-46, 10 September 2009, ISSN 1613-0073.

[8] O. Uzuner, B. Katz and T. Nahnsen, “Using Syntactic Information to Identify Plagiarism”, Proceedings of the 2nd Workshop on Building Educational Applications Using NLP, June 2005, pp. 37-44.

[9] http://www.saedsayad.com/k_nearest_neighbors.htm, accessed November 20, 2013, 23:17.

[10] Y. Palkovskii, A. Belov, I. Muzyka, “Using WordNet-based Semantic Similarity Measurement in External Plagiarism Detection”, CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation, Amsterdam, September 2011, pp. 19-22.

[11] S. Meyer zu Eissen, B. Stein and M. Kulig, “Plagiarism Detection without Reference Collections”, 30th Annual Conference of the German Classification Society (GfKl), Berlin, 2007, pp. 359-366, ISBN 978-3-540-70980-0.

[12] V. Goyal, A. Sahai, “Attribute-Based Encryption for Fine-Grained Access Control of Encrypted Data”, Proceedings of the 13th ACM Conference on Computer and Communications Security, 2006.

[13] B. Chor, O. Goldreich, E. Kushilevitz and M. Sudan, “Private Information Retrieval”, 36th IEEE Conference on the Foundations of Computer Science (FOCS), October 1995, pp. 41-50.

[14] E. V. Balaguer, “Putting Ourselves in SME’s Shoes: Automatic Detection of Plagiarism by the WCopyFind Tool”, Proceedings of the SEPLN ’09 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, San Sebastian (Donostia), Spain, vol. 502, pp. 34-35, 10 September 2009, ISSN 1613-0073.

[15] N. Samatha, K. V. Chandu, P. R. S. Reddy, “Query Optimization Issues for Data Retrieval in Cloud Computing”, International Journal of Computational Engineering Research, 2009, ISSN 2250-3005.

[16] P. R. Rao, S. V. Sridhar, V. R. Krishna, “An Optimistic Approach for Query Construction and Execution in Cloud Computing Environment”, International Journal of Advanced Research in Computer Science and Software Engineering, 2013, ISSN 2277-128X.

[17] M. Cebrian, A. Ortega and M. Alfonseca, “Towards the Validation of Plagiarism Detection Tools by Means of Grammar Evolution”, IEEE Transactions on Evolutionary Computation, June 2009, pp. 477-485, ISSN 1089-778X.

[18] http://www.webis.de/research/events/pan-09, accessed September 17, 2012, 13:43.

[19] R. M. Hamou, A. Abdelmalek and A. C. Lokbani, “Study of Sensitive Parameters of PSO Application to Clustering of Texts”, International Journal of Applied Evolutionary Computation (IJAEC), 2013, vol. 4, no. 2, pp. 41-55.

[20] R. M. Hamou, A. Abdelmalek and A. C. Lokbani, “The Social Spiders in the Clustering of Texts: Towards an Aspect of Visual Classification”, International Journal of Artificial Life Research (IJALR), 2012, vol. 3, no. 3, pp. 1-14.