

[IEEE 2010 International Conference on Management and Service Science (MASS 2010) - Wuhan, China (2010.08.24-2010.08.26)]

Research on a Grid-Based Anti-Spam System

Jiang Liu, Xia Yan, Lv Hai Hua

Department of Information Engineering Shenyang Institute of Engineering

Shenyang, China

Abstract: Based on research and analysis of grid services, and building on heterogeneous networks and resource sharing, this paper proposes a design model for an E-mail anti-spam system based on a service grid and implements some of its functions. Through GT containers, the system pools the resources of anti-spam servers and filters E-mail across the entire network using a distributed statistical method.

Keywords: E-mail; Anti-spam; Service grid; Distributed statistics; Bayesian

1 Introduction

With the rapid spread of the Internet, E-mail, being convenient, fast, and cheap, has gradually replaced traditional means of communication and become one of the main means of communication in modern society. In recent years, some companies, organizations, and individuals have, for commercial gain and without the consent of E-mail users, used E-mail to send large volumes of commercial advertising and other unwanted information. This has had a very bad influence and has created a serious spam problem.

To reduce the damage caused by junk E-mail, anti-spam technology has become a focus of research in many fields and has developed rapidly. However, anti-spam servers operate independently and cannot share information [1], so a single server finds it difficult to filter spam effectively and in time; the servers must cooperate and share resources to do so. The grid is one of the most popular technologies for fully sharing computing, storage, data, information, knowledge, and expert resources. This paper therefore uses a distributed statistical technique and grid services to build a collaborative anti-spam system. The main features and innovations of the system are: (1) the use of a service grid to exploit system resources; (2) distributed filtering with a statistical algorithm across the whole network, to filter spam more effectively; (3) an E-mail system design that takes account of the HTTP proxies now in widespread use; (4) the use of a scheduler to make effective use of anti-spam server capacity.

2 Grid Architecture and Anti-Spam System

2.1 Grid Architecture

A grid integrates the entire Internet into one giant supercomputer in order to achieve comprehensive sharing of computing, storage, data, information, knowledge, and expert resources. It is characterized by autonomy and multiplicity of management: resource owners retain the ability to manage their own resources, while this self-management is also subject to unified management.

Two important grid architecture models have been proposed: the five-layer hourglass structure and the Open Grid Services Architecture (OGSA).

2.1.1 Five-Layer Hourglass Structure of the Computational Grid

One of the most important ideas of the five-layer hourglass structure is that it is centered on protocols, and it places strong emphasis on services, APIs (Application Programming Interfaces), and SDKs (Software Development Kits). The five-layer hourglass structure does not provide strict specifications; instead it defines the common requirements of the components in each part of the structure and organizes these components into layers. The components within a layer share the same characteristics, and an upper-layer component can be built on any lower-layer component. The structure is an abstraction hierarchy whose main feature is its overall hourglass shape, which arises because the number of protocols differs from layer to layer. The neck of the hourglass, the core of the structure, defines a small set of core abstractions and protocols and is formed by the resource layer and the connectivity layer together. Many different high-level behaviors (the top of the hourglass) are mapped onto this core, and the core itself can be mapped onto many different underlying technologies (the bottom of the hourglass). The structure thereby achieves a higher degree of sharing.

2.1.2 Open Grid Services Architecture (OGSA)

The Open Grid Services Architecture (OGSA) is a development of the five-layer hourglass structure and a product of the integration of grid technology with Web services technology. OGSA extends grid computing from scientific and engineering computation to a wider range of commercial applications whose main feature is distributed system and service integration, and it establishes the basic concept of grid services. OGSA uses the Web services, WSDL, and SOAP specifications. Systems that follow the OGSA standard can be linked together, so users can easily integrate and share all kinds of system functions, which saves development costs and improves development efficiency [3].

978-1-4244-5326-9/10/$26.00 ©2010 IEEE

OGSA is divided into four layers: the physical and logical resources layer, the Web services layer, the OGSA-based architecture services layer, and the grid service application layer. The structure is shown in Figure 1 [2]:

Figure 1: OGSA Architecture Structure

2.2 Anti-Spam System

Spam generally refers to bulk E-mail that is forced into users' mailboxes without their permission.

Common spam includes money-making schemes, adult advertising, advertising for commercial or personal websites, e-magazines, chain letters, and so on. Junk E-mail generally has the following characteristics: the same content is sent repeatedly; the same sender shows abnormal traffic within a specific time period; the address is not legitimate; or the connection comes from an IP address on a public international RBL list. Much of the literature also treats viruses distributed by E-mail as spam; this study does not include virus E-mail within the scope of junk E-mail. Although both viruses and spam are sent in large quantities to unknown users, junk E-mail has strongly advertising-oriented content and generally no attachments, whereas virus E-mail has almost no content and usually carries an infected attachment. Virus E-mail is difficult to identify by content-based filtering, but scanning the attachments for virus signatures can detect it effectively.

2.3 Grid in Anti-Spam System

The global spread of spam requires anti-spam technology to share resources and work together to resist junk E-mail. At this stage, the grid is one of the most effective means of achieving this. The grid is a practical technology for solving real business problems through collaboration and more efficient use of resources. In addition, grid technology allows users to build longer-lived and more powerful infrastructure, improving efficiency and reducing costs, and it should be able to provide an immediate return on investment.

In view of these features of the grid, experts have proposed the concept of an anti-spam grid, and various related theories and studies have followed. This research is significant in three respects. First, the technology enables existing anti-spam techniques to collaborate and fight spam more effectively. Second, research on this technology deepens the understanding of grid technology and accumulates experience for its future development and application. Third, building a prototype anti-spam framework provides an opportunity to put a resource-sharing, collaborative grid system into practice.

3 System Design and Construction

3.1 Grid anti-spam key technologies

The application of grid technology to anti-spam is based on the following two basic principles:

(1) Because spam is sent to addresses across the entire network, each message can be assigned a value called CopyRank, which indicates how many people have received that E-mail. Whether an E-mail is spam can then be judged from its CopyRank value. The principle of CopyRank is similar to that of Google's PageRank [4].

Brightmail follows the same idea. It maintains a number of fake E-mail addresses; any message sent to these addresses is definitely spam, so when a user receives the same message it can be filtered out. The system calculates a signature for each message. Different E-mails should have different signatures, so comparing two signatures shows whether two E-mails are the same message. Unfortunately, spammers can add random elements to their E-mails so that the signature of each message is different. As a result, this method can capture only 50-70% of such E-mail.

To avoid this situation, a fuzzy CopyRank can be used instead of an exact value. The E-mail is first purified by detecting and removing most of the machine-generated interference. An algorithm then generates a checksum such that similar E-mails have similar values, and the CopyRank of an E-mail is calculated from the messages whose checksums are close. Although the checksum is long enough to minimize the chance that different messages share the same checksum, this chance still exists and grows with the number of messages, so the checksum is delayed. In this way, E-mails with added random content, as described above, can to a large extent be identified.
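The paper does not name the checksum algorithm. As one illustration of a checksum that gives similar messages similar values, the sketch below applies SimHash to a purified message body. SimHash is a substitute chosen for this example, and the function names, the 64-bit width, the rule that digit-bearing tokens are random filler, and the distance threshold are all assumptions rather than the authors' implementation.

```python
import hashlib
import re


def purify(text: str) -> list[str]:
    """Lowercase the text and drop tokens that look machine-generated.

    Assumption for this sketch: any token containing a digit is random filler.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if not any(c.isdigit() for c in t)]


def simhash(tokens: list[str], bits: int = 64) -> int:
    """Return a SimHash fingerprint: similar token lists differ in few bits."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def fuzzy_fingerprint(text: str) -> int:
    """Purify a message body, then fingerprint what remains."""
    return simhash(purify(text))


def same_campaign(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two messages as copies if their fingerprints are close."""
    distance = hamming_distance(fuzzy_fingerprint(a), fuzzy_fingerprint(b))
    return distance <= max_distance
```

With this scheme, two copies of a message that differ only in an appended random string reduce to the same token list and therefore the same fingerprint, so both count toward the same fuzzy CopyRank.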

(2) Information collected from many computers is more accurate and complete than information collected from a single machine. A distributed Bayesian algorithm runs the Bayesian learning process on hundreds of client computers, then collects the results from all active clients and disseminates them in a timely manner. Typically, the server also sets up a large number of fake E-mail accounts and analyzes the mail they receive. Any E-mail sent to these fake addresses is spam; these messages are added to the database so that the same message can be treated directly as junk E-mail when it is encountered later.

The Bayesian algorithm can also be improved so that it weights not only words and phrases but also certain special features of an E-mail. For example, about 95% of spam contains links to web pages, and these links cannot be hidden; if a message containing links also has a large CopyRank value, this information carries very heavy weight in the Bayesian model. Rules used by other, rule-based filters can generally also be added to the Bayesian algorithm, for example whether the user has previously received legitimate E-mail from this sender, or whether the images attached to different E-mails are the same [5].
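The idea of feeding non-word features into the Bayesian model can be sketched as follows. This is a minimal single-node naive Bayes filter, not the distributed algorithm of the paper: it scores only the features present in a message, uses add-one smoothing, and treats "contains a link" and "has a high CopyRank" as two extra pseudo-tokens. The class name, the pseudo-token names, and the CopyRank threshold of 100 are assumptions made for illustration.

```python
import math
from collections import Counter


class FeatureBayes:
    """Naive Bayes over words plus special E-mail features (illustrative)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def features(text, copy_rank, high_rank=100):
        """Return the set of word tokens plus special pseudo-tokens."""
        lowered = text.lower()
        feats = set(lowered.split())
        if "http://" in lowered or "https://" in lowered:
            feats.add("__HAS_LINK__")
        if copy_rank >= high_rank:  # hypothetical threshold
            feats.add("__HIGH_COPYRANK__")
        return feats

    def train(self, label, text, copy_rank):
        """Record one labelled message; label is 'spam' or 'ham'."""
        self.docs[label] += 1
        self.counts[label].update(self.features(text, copy_rank))

    def spam_probability(self, text, copy_rank):
        """Combine prior and per-feature likelihood ratios in log space."""
        total = self.docs["spam"] + self.docs["ham"]
        log_odds = math.log((self.docs["spam"] + 1) / (total + 2))
        log_odds -= math.log((self.docs["ham"] + 1) / (total + 2))
        for feat in self.features(text, copy_rank):
            p_spam = (self.counts["spam"][feat] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][feat] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Because the special features are ordinary entries in the same count tables, their weight is learned from the training data: if nearly all spam carries a link and a high CopyRank, those two pseudo-tokens end up with large likelihood ratios without any hand-tuned rule.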

3.2 The principle of anti-spam network system

To avoid the loopholes of some common filtering algorithms, a new anti-spam filtering model has been developed. It includes not only mature, internationally used anti-spam algorithms such as the Bayesian algorithm but also newer technologies such as grid computing and cluster computing. Trials show that it filters large-scale spam very effectively. The operating mechanism of the anti-spam grid is based on the following considerations:

(1) Junk mail is sent to the entire Internet, so an overall infrastructure is needed to collect information about junk E-mail;

(2) A fingerprint must be calculated for each message, so a distributed computing environment is needed;

(3) The system is highly dynamic, since all servers and E-mail clients are constantly being updated, so a flexible platform that can adapt to change is needed [6].

The anti-spam grid architecture consists of clients, anti-spam filtering servers, and scheduling servers. Whenever a new message is received, the client automatically generates a digital signature for it and sends the signature to one of the filtering servers in the grid. The server consults the global virtual database to determine how many times the signature has been seen and returns that number to the client. From this number the client knows how many times the message has been sent: the higher the count, the more likely the message is spam. Combined with the distributed Bayesian algorithm, this identifies spam more accurately and reduces the probability of false positives to close to zero.

The system reflects the true idea of the grid: each user enrolled in the system is both a recipient of the service and a fully functional node that contributes distributed statistical information. As the system expands, the accuracy of its spam filtering also increases. Filtering spam with large-scale statistical methods is more mature in practice than filtering with artificial intelligence methods; it is not prone to misjudgment and is very practical. The distributed Bayesian method combines the traditional Bayesian method with the grid environment: it distributes the single-point learning process and makes it collaborative, which shortens learning time and shares what has been learned. The combination of these two methods refines and improves on the current mainstream anti-spam methods and has practical value [7].
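The message flow just described (client computes a signature, a scheduler chooses a filtering server, the server returns the repeat count from the shared database) can be sketched in a single process as follows. The class and function names, the SHA-256 signature over whitespace-normalized text, the least-loaded scheduling rule, and the spam threshold of 3 are all assumptions for illustration; a shared in-memory `Counter` stands in for the global virtual database.

```python
import hashlib
from collections import Counter


class FilterServer:
    """Filtering server backed by a shared signature database."""

    def __init__(self, shared_db: Counter):
        self.db = shared_db   # stands in for the global virtual database
        self.handled = 0      # number of requests served, used as load

    def report(self, signature: str) -> int:
        """Record one sighting and return how often the signature was seen."""
        self.handled += 1
        self.db[signature] += 1
        return self.db[signature]


class Scheduler:
    """Scheduling server: sends each request to the least-loaded server."""

    def __init__(self, servers):
        self.servers = servers

    def pick(self) -> FilterServer:
        return min(self.servers, key=lambda s: s.handled)


def signature(body: str) -> str:
    """Client-side signature of a message body (case and spacing ignored)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def check_message(body: str, scheduler: Scheduler, threshold: int = 3):
    """Return (repeat count, suspected spam) for a newly received message."""
    count = scheduler.pick().report(signature(body))
    return count, count >= threshold
```

In this sketch, the count returned to the client rises each time any client in the grid reports the same signature, regardless of which filtering server handled the request, which is the property the shared database is meant to provide.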

3.3 The structure of anti-spam network system

The anti-spam service grid consists of management services, identity verification services, a service aggregation and grouping module, and a variety of specific anti-spam grid services. The management service coordinates the operation of the other modules. The identity verification service verifies user identity. The service aggregation and grouping module manages the life cycle of all grid services, groups them in a timely manner, and presents a consistent interface to clients. The anti-spam grid services are distributed across hosts; each grid service provider that runs a service notifies the service aggregation and grouping module. The structure of the anti-spam service grid is shown in Figure 2:

Figure 2: Anti-Spam Service Grid Structure (components shown: Client; Management; Identity Verification Service; Service aggregation group; Blacklist Service; Bayesian Services; E-mail Rules Service; Other services)

The system includes the following roles: e-mail recipient, client, grid resource allocation and management services, anti-spam servers, and registration services.
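The registration and lookup relationship between service providers, the service aggregation grouping module, and the management service can be sketched as below. This is a plain in-process illustration under assumed names (`ServiceAggregator`, `ManagementService`, host strings such as "host-a"); it does not use Globus Toolkit APIs, and identity verification is reduced to a membership check.

```python
class ServiceAggregator:
    """Keeps track of which hosts currently offer each anti-spam service."""

    def __init__(self):
        self._hosts = {}  # service name -> list of hosts, in arrival order

    def register(self, service: str, host: str) -> None:
        """Called by a provider when it starts running a service."""
        hosts = self._hosts.setdefault(service, [])
        if host not in hosts:
            hosts.append(host)

    def unregister(self, service: str, host: str) -> None:
        """Called when a provider stops offering a service."""
        if host in self._hosts.get(service, []):
            self._hosts[service].remove(host)

    def lookup(self, service: str) -> list:
        """Return a copy of the hosts offering the service."""
        return list(self._hosts.get(service, []))


class ManagementService:
    """Verifies the user, then resolves a service name to a host."""

    def __init__(self, aggregator: ServiceAggregator, known_users):
        self.aggregator = aggregator
        self.known_users = set(known_users)  # stand-in for identity verification

    def request(self, user: str, service: str) -> str:
        if user not in self.known_users:
            raise PermissionError(f"unknown user: {user}")
        hosts = self.aggregator.lookup(service)
        if not hosts:
            raise LookupError(f"no host offers service: {service}")
        return hosts[0]
```

Keeping the registry separate from the management service mirrors the division of roles in the text: providers only talk to the aggregation module, and clients only see the single interface exposed by the management service.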

4 Conclusion and Outlook

Based on an analysis of a number of current anti-spam products, this paper introduced the principles and current state of E-mail anti-spam technology. Through research on the service grid, and making use of the grid's support for heterogeneity and resource sharing, it proposed an overall design model for an anti-spam system based on the service grid and implemented some of its functions. Through GT containers, the system pools the resources of the various anti-spam servers and uses a distributed statistical algorithm to approximate E-mail filtering across the whole network; a pre-processing mechanism on the anti-spam servers greatly reduces network traffic. As anti-spam technology continues to be updated and grid technology develops rapidly, grid-based anti-spam technology will continue to gain new content. Although we have carried out a certain amount of study in this area, time constraints have left the system with many shortcomings that need improvement, for example in research on distributed statistical algorithms, in sharing Bayesian filtering resources, and in sharing server computing and storage resources. All of these will be new topics of study in the anti-spam field.

References:

[1] Zhan Chuan, Lu Xian-Liang. Anti-spam technology research. Electronic Science and Technology University, 2005, 3: 3-8.

[2] Yuan-Shun Dai, et al. GridEmail: A Case for Economically Regulated Internet-based Interpersonal Communications. In: Advanced Parallel and Distributed Computing, pp. 279-295.

[3] Zhong Wei-ming. Service grid in the field of anti-spam clock applied research. Nanchang University, Information Engineering College, 2006, 5: 45-47.

[4] E-Mail Services on Hybrid P2P Networks. Grid and Cooperative Computing 3rd, Lecture Notes in Computer Science, 3251, pp. 610-617.

[5] Chinese anti-spam alliance. Anti-spam: Introduction of new technology Sender ID. http://tech.ccidnet.com/art/1099/20041021/167440_1.html

[6] Zhang Zhuo, Guo Kim Kyung. Open Grid Services Architecture research and application development. The PLA Information Engineering University, 2004, 4: 55-63.

[7] Ying Wang. Grid system, the composition and architecture analysis. Southwest Normal University (Natural Science Edition), 2004, 8.