
CHAPTER 2

DISTRIBUTED DATA MINING ON DATA GRID

2.1 KNOWLEDGE DISCOVERY

Knowledge Discovery in Databases (KDD) refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from the data in databases. While data mining and knowledge discovery in databases are frequently treated as synonyms, data mining is actually part of the knowledge discovery process. KDD employs data mining to find useful patterns, models and trends in large volumes of data. In many scientific and commercial applications, it is necessary to perform the analysis of large data sets. With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for the analysis, and perhaps interpretation, of such data, and for the extraction of interesting knowledge that could help in decision-making.

Knowledge discovery in databases is the process of automatically searching large volumes of data for patterns, using tools such as classification, association rule mining, clustering, etc. Data mining is a complex topic with links to multiple core fields, such as computer science, and it adds value to rich, seminal computational techniques from statistics, information retrieval, machine learning and pattern recognition. Based on the latest developments in grid computing, a Distributed Data Mining (DDM) system architecture for the grid environment is proposed in this thesis.


2.2 DATA MINING

Data mining on the grid has lately become a hot research topic. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The ultimate goal of data mining is prediction.

2.2.1 Data Mining Steps

Data mining is a powerful analytical tool that enables business executives to advance from describing historical customer behavior to predicting the future. It finds patterns that unlock the mysteries of customer behavior. These findings can be used to increase revenue, reduce expenses and identify business opportunities, offering new competitive advantages. The KDD process comprises a few steps, leading from raw data collections to some form of new knowledge. The steps involved in mining data are shown in Figure 2.1.

• Data warehouse: The term data warehouse was coined by Bill Inmon in 1990; he defined it in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of the management's decision making process". All mobile users' movement and service request logs are stored in the data warehouse, and will be provided as input to the data mining process.

• Data selection: The data relevant for the analysis is decided on and retrieved from the data warehouse.


• Data cleaning: In this step, noise and irrelevant data are removed from the selected data.

• Data transformation: In this step, the selected data is transformed into forms appropriate for the mining process.

• Data mining: Data mining techniques, such as the Apriori algorithm for association rule mining, are applied to the database to discover interesting patterns.

• Interpretation: This is the process of converting the result of the data mining process into knowledge, which is presented to the user.

• Action: The user can make use of the knowledge to take better actions.

Figure 2.1 Data mining steps
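Viewed as a whole, the steps of Figure 2.1 form a pipeline in which the output of each stage feeds the next. The following minimal Python sketch illustrates that flow on a toy log of mobile service requests; the record fields (user_id, location, service) and the trivial counting "miner" are invented purely for the illustration:

```python
# A minimal sketch of the KDD pipeline in Figure 2.1.
# The record fields (user_id, location, service) are hypothetical.

raw_logs = [
    {"user_id": 1, "location": "A", "service": "weather"},
    {"user_id": 1, "location": None, "service": "weather"},  # noisy record
    {"user_id": 2, "location": "B", "service": "traffic"},
]

def select(records):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{"location": r["location"], "service": r["service"]} for r in records]

def clean(records):
    """Data cleaning: drop records with missing values (noise)."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Data transformation: turn each record into a transaction (itemset)."""
    return [frozenset((r["location"], r["service"])) for r in records]

def mine(transactions):
    """Data mining: count how often each item occurs (a trivial 'pattern')."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return counts

patterns = mine(transform(clean(select(raw_logs))))
print(patterns)  # interpretation: present the discovered counts to the user
```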


2.2.2 Data Mining Techniques

2.2.2.1 Association

Association analysis is the discovery of what are commonly called

association rules. It studies the frequency of items occurring together in

transactional databases, and based on a threshold called support, identifies the

frequent item sets. Another threshold, confidence, which is the conditional

probability that an item appears in a transaction when another item appears, is

used to pinpoint association rules. Association rule mining is a data mining

technique used to find interesting relations among a large set of data items.

The discovered association rules may help decision making in different areas.

The problem of association rule mining originated in market analysis of sales-basket data. In a market-basket analysis, the buying habits of customers are analyzed to find the associations between the data items purchased. The discovery of such associations can help retailers develop marketing and planning strategies. An association rule has the form X ⇒ Y, where X and Y are itemsets. Items that are purchased together by customers can thus be identified: whenever a customer buys X, he also buys Y. For example, {Bread, Jam} ⇒ {Butter}.

An association rule is an implication of the form X ⇒ Y, where X = {x1, …, xm} and Y = {y1, …, yn} are sets of items with X ∩ Y = ∅. The rule X ⇒ Y has support s if s% of all the transactions contain X ∪ Y, and it has confidence c if c% of the transactions that contain X also contain Y. The problem of mining association rules is to generate all the association rules that have support and confidence greater than the user-specified minimum support min_supp and minimum confidence min_conf, respectively.


The problem of discovering association rules can be decomposed into two subproblems; a minimal sketch of the procedure is given after the list.

1) Find the set F of all itemsets with support above the minimum support min_supp. These itemsets are called frequent itemsets.

2) Use the frequent itemsets to generate the desired rules. For every X ∈ F, check the confidence of all rules of the form X \ Y ⇒ Y, where Y ⊂ X and Y ≠ ∅, and eliminate those that do not achieve min_conf. It is sufficient to calculate the support values of all the subsets of X to determine the confidence of each rule.
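As an illustration, the following Python sketch implements this two-step decomposition by brute-force enumeration (the level-wise candidate pruning that makes Apriori efficient is omitted); the transactions and the two thresholds are invented for the example:

```python
from itertools import combinations

transactions = [
    {"bread", "jam", "butter"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"bread", "jam", "butter", "milk"},
]
min_supp, min_conf = 0.5, 0.7  # user-specified thresholds

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: find the set F of all frequent itemsets (support >= min_supp).
items = set().union(*transactions)
F = {}
for k in range(1, len(items) + 1):
    for cand in combinations(sorted(items), k):
        s = support(frozenset(cand))
        if s >= min_supp:
            F[frozenset(cand)] = s

# Step 2: for every frequent X, generate rules (X \ Y) => Y with
# confidence = support(X) / support(X \ Y) >= min_conf.
for X, sX in F.items():
    for k in range(1, len(X)):
        for Y in map(frozenset, combinations(sorted(X), k)):
            conf = sX / F[X - Y]   # X \ Y is frequent, so it is in F
            if conf >= min_conf:
                print(set(X - Y), "=>", set(Y),
                      f"supp={sX:.2f} conf={conf:.2f}")
```

Note that step 2 relies on the downward-closure property: every subset of a frequent itemset is itself frequent, so all the support values needed for the confidence computation are already in F.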

2.2.2.2 Classification

Classification is a data mining (machine learning) technique used to

predict the group membership for data instances. For example, you may wish

to use classification to predict whether the weather on a particular day will be

“sunny”, “rainy” or “cloudy”. A classification analysis is the organization of data into given classes. Also known as supervised classification, classification uses given class labels to order the objects in the data collection.

Classification approaches normally use a training set, where all objects are

already associated with known class labels. The classification algorithm

learns from the training set and builds a model. The model is used to classify

new objects.
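For illustration only, the sketch below builds a trivial model, a one-nearest-neighbour classifier, from a labelled training set and uses it to classify a new object; the weather features (humidity, pressure) and all values are invented, and nothing here is prescribed by this chapter:

```python
# A toy supervised classifier: 1-nearest neighbour on made-up weather data.
# Each training object is ((humidity, pressure), class_label).
training_set = [
    ((0.9, 0.2), "rainy"),
    ((0.3, 0.8), "sunny"),
    ((0.6, 0.5), "cloudy"),
]

def classify(obj):
    """Assign the class label of the closest training object (the 'model')."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training_set, key=lambda pair: dist2(pair[0], obj))
    return label

print(classify((0.8, 0.3)))  # -> "rainy"
```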

2.2.2.3 Clustering

Similar to classification, clustering is the organization of data in

classes. However, unlike classification, in clustering, class labels are

unknown, and it is up to the clustering algorithm to discover acceptable

classes. Clustering is also called unsupervised classification, because the


classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
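As a sketch of one widely used clustering approach (k-means, chosen here only as an example), the following fragment alternates between assigning points to their nearest centroid and recomputing the centroids; the one-dimensional data and initial centroids are invented:

```python
# A minimal k-means sketch on invented 1-D data, k = 2 clusters.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.9]
centroids = [0.0, 10.0]  # initial guesses

for _ in range(10):  # a few refinement iterations
    # Assignment step: each point joins the cluster of its nearest centroid
    # (this maximizes intra-class similarity).
    clusters = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[i].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]

print(centroids)  # -> roughly [1.0, 8.03]
```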

2.2.2.4 Prediction

Prediction has attracted considerable attention, given the potential

implications of successful forecasting in a business context. There are two

major types of predictions: one can either try to predict some unavailable data

values or pending trends, or predict a class label for some data. The latter is

tied to classification. Once a classification model is built based on a training

set, the class label of an object can be foreseen, based on the attribute values

of the object and the attribute values of the classes. More often, however, prediction refers to the forecast of missing numerical values, or of increasing/decreasing trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.
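As a small illustration of forecasting a future value from past values, the sketch below fits a least-squares trend line to an invented series and extrapolates it one step ahead; this is only one simple way to realize the idea, not a method prescribed here:

```python
# Forecast the next value of a series from its past values by fitting
# a least-squares line y = a*x + b; the observations are invented.
history = [10.0, 12.0, 13.5, 15.0, 16.5]   # past observations y_0 .. y_4
n = len(history)
xs = range(n)

mean_x = sum(xs) / n
mean_y = sum(history) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(a * n + b)  # predicted next value y_5, about 18.2
```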

2.2.2.5 Deviation Analysis

Deviation analysis, on the other hand, considers differences

between the measured values and expected values, and attempts to find the

cause of the deviations from the anticipated values.

2.2.2.6 Outlier Analysis

Outliers are data elements that cannot be grouped in a given class

or cluster. Also known as exceptions or surprises, they are often very

important to identify. While outliers can be considered noise and discarded in

some applications, they can reveal important knowledge in other domains,

and thus can be very significant and their analysis valuable.
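A common, simple way to flag such exceptions, used here purely as an illustration, is to mark values that lie far from the mean of the data; the values below are invented:

```python
# Flag outliers as values far from the mean (|deviation| > 2 * std).
values = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

outliers = [v for v in values if abs(v - mean) > 2 * std]
print(outliers)  # -> [25.0], the element that fits no cluster of the rest
```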


2.3 MOBILE WEB ENVIRONMENTS

The mobile web refers to the use of Internet-connected

applications, or browser-based access to the Internet from a mobile device -

such as a laptop or tablet PC - connected to a wireless network. Advances in

Internet and network technology and the rapidly growing number of mobile

personal devices have resulted in the fast growth of mobile E-Commerce and

M-Commerce. The goal of the initiative is to make browsing the web from

mobile devices more reliable and accessible.

2.3.1 Mobile Web Services

In the mobile web, Location-Based Services (LBSs) are information services that are accessible with mobile devices through the mobile network and that make use of the current location of the devices. In this research work, the predicted location and service request patterns are effectively used by service providers to provide location-based services. Web services are software components identified by a URI. Mobile web services can be accessed over the Internet using popular web mechanisms and protocols, such as HTTP.

2.3.2 Mobility Prediction

Mobile users move from one location to another in a wireless PCS network. The coverage area of the network is divided into a number of location areas. Each mobile device is linked with a Base Station (BS). Each base station contains a Home Location Register (HLR), which stores permanent details of the mobile users, and a Visiting Location Register (VLR), which stores temporary details of the mobile users. These registers include attributes like the user ID, user location, call time, call duration, etc. The user ID acts as a key for mobile user records. Mobility patterns are mined from the User Access Path (UAP), and mobility rules are generated using


mobility patterns. Finally, the current location of a mobile user is compared against the mobility rules to predict the user's next locations. By using the predicted movement, the system can effectively allocate resources and provide location-based services to the mobile users.
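The matching step can be pictured with the following sketch, in which mobility rules are represented as (recent path, next location, confidence) triples; the rule format, the function predict_next and all the values are hypothetical illustrations, not the system's actual data structures:

```python
# A sketch of next-location prediction from mined mobility rules.
# Each rule maps a recent path (a tuple of location areas) to a likely
# next location with a confidence score; all values here are invented.
mobility_rules = [
    (("A", "B"), "C", 0.80),
    (("A", "B"), "D", 0.55),
    (("B",),     "E", 0.60),
]

def predict_next(current_path, min_conf=0.5):
    """Return candidate next locations whose rule head matches the tail
    of the user's current path, highest confidence first."""
    candidates = [
        (nxt, conf)
        for head, nxt, conf in mobility_rules
        if conf >= min_conf and tuple(current_path[-len(head):]) == head
    ]
    return sorted(candidates, key=lambda c: -c[1])

print(predict_next(["X", "A", "B"]))  # -> [('C', 0.8), ('E', 0.6), ('D', 0.55)]
```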

2.3.3 Location Based Web Services on the Mobile Web

A location-based service consists roughly of two phases:

determining the position of the customer, and providing service or contents

based on the position. LBSs define those mobile commerce services that

utilize information about the current location of the person using a mobile

device. Ideally, the information provided should be both location-specific and

personalized, based on the personal profile of the user.

2.3.4 Prediction of Mobile User Behavior

Predicted mobility and service request patterns can be used by service providers to deliver location-based services to mobile users. Location-based services can be divided into two categories, based on the perspective of the user:

• End user application perspective: traffic and weather information for the user's current location, driving directions, entertainment applications, wireless advertising, store location, etc.

• Developer and vendor perspective: mapping, routing (directions), GPS navigation (real-time turn-by-turn navigation), proximity searches, destination guides, tracking (vehicles, assets, friend or buddy finders), telematics, location-based billing, advertising, etc.


2.4 DISTRIBUTED DATA MINING

Distributed computing plays an important role in this process for

several reasons. First, data mining often requires huge amounts of resources in

storage space and computation time. To make systems scalable, it is important

to develop mechanisms that distribute the work load among several sites in a

flexible way. Second, data is often inherently distributed over several databases,

making a centralized processing of this data very inefficient and prone to security

risks. Finally, many data mining tasks require connecting heterogeneous

resources, such as data sources, processing nodes and end user applications.

2.4.1 Centralized Data Mining

Centralized data mining algorithms are used for the analysis of

large datasets stored in a single site. Data mining and data warehousing go

hand-in-hand: most tools operate on the principle of gathering all data into a

central site, and then running an algorithm against that data. A simple

architecture of centralized data mining is shown in Figure 2.2.

Figure 2.2 Centralized data mining (the local data sources are gathered into a central data warehouse, on which the data mining algorithm runs)

2.4.2 Limitations of Centralized Data Mining

Mining large data sets requires powerful computational resources. A major issue in data mining is scalability with respect to very large databases. Centralized data mining algorithms take a long time to work on large databases, and a number of applications are infeasible under such a methodology, leading to the need for distributed data mining.

Several issues give rise to distributed data mining:

1. Connectivity: Transmitting a large amount of data to a centralized node may be infeasible.

2. Heterogeneity of source data: Integrating heterogeneous data from different places into one database is not easy.

3. Privacy of data: Organizations may be willing to share data mining results, but not data.

2.4.3 Need for Distributed Data Mining

Distributed data mining refers to the mining of distributed data sets; it explores techniques for applying data mining in a non-centralized way. The data sets are stored in local databases hosted by local computers, which are connected through a computer network. Data mining takes place at a local level and at a global level, where the local data mining results are combined to obtain global results. A simple architecture for distributed data mining is shown in Figure 2.3.


Figure 2.3 Distributed data mining (each site mines its local data, and a result combiner merges the local mining results)

The following reasons have made distributed data mining interesting.

• Distributed data: In some applications, data are inherently distributed, but it is necessary to gain global knowledge from the distributed data sites. For example, each site of a multinational company manages its own operational data locally, but the data must be analyzed for global patterns to allow company-wide activities, such as planning, marketing and sales. The straightforward solution is to move all data to a centralized site, where the data mining is done. Even if a centralized site has enough capacity to handle the data storage and data mining, it may take too long or be too expensive to transfer the local data sets because of their sizes.

• Security: Sometimes the local data cannot be transferred because of security concerns and the autonomy of the data sites.


• Scalability: Distributed data mining may also be useful where data is stored at a single site. If the data set is too large, it may be beyond the data mining capability of that site. In such a case, the site may send parts of the data to other sites for mining; the involved sites perform the data mining, and the results are then combined.

2.4.4 Advantages of Distributed Data Mining

Distributed data mining techniques are scalable with increasing data size. They can be much faster than centralized data mining algorithms, since the workload is distributed among the sites. Distributed data mining has become feasible with advances in computer networks, especially the growth of the Internet and intranets. In distributed data mining, the computers and the data are distributed. In parallel data mining, by contrast, a parallel computer is assumed, with processors sharing memory and/or disk; in distributed data mining, the processors share nothing. A sketch of the local-mining and result-combining scheme follows.
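The following minimal Python sketch illustrates the scheme of Figure 2.3 for a simple item-counting task: each site mines its local transactions and ships only its small summary (the counts) to a result combiner, never the raw data. The site names, transactions and the min_supp threshold are invented for the example:

```python
# A sketch of the local-mining / result-combiner scheme of Figure 2.3.
from collections import Counter

site_data = {  # three sites with their local transaction databases (invented)
    "site1": [{"a", "b"}, {"a"}],
    "site2": [{"a", "b"}, {"b", "c"}],
    "site3": [{"a", "c"}],
}

def local_mine(transactions):
    """Local mining step: per-site item counts plus the local database size."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    return counts, len(transactions)

def combine(local_results, min_supp=0.5):
    """Result combiner: add up the local counts to get the globally
    frequent items, exactly as if the data had been centralized."""
    total_counts, total_n = Counter(), 0
    for counts, n in local_results:
        total_counts += counts
        total_n += n
    return {item: c / total_n for item, c in total_counts.items()
            if c / total_n >= min_supp}

results = [local_mine(db) for db in site_data.values()]
print(combine(results))  # -> {'a': 0.8, 'b': 0.6}
```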

2.5 GRID COMPUTING

2.5.1 Grid Definition

Grid computing represents the natural evolution of distributed computing infrastructure that enables coordinated resource sharing within dynamic organizations consisting of individuals, institutions and resources. As the grid is becoming a well-accepted computing infrastructure in science and industry, it is necessary to provide general data mining services, algorithms and applications that help analysts, scientists, organizations and professionals to leverage grid capacity in supporting high-performance distributed computing, to solve their data mining problems in a distributed way (Foster 2001).


A grid is a distributed system that enables the sharing, selection and

aggregation of geographically distributed "autonomous" resources dynamically at

runtime, depending on their availability, capability, performance, cost, and the

user's quality-of-service requirements. The grid uses the resources of many

separate computers connected by a network (usually the Internet) to solve

large-scale computation problems. Grids provide the ability to perform

computations on large data sets, by breaking them down into many smaller

ones, or provide the ability to perform many more computations at once, than

would be possible on a single computer, by modeling a parallel division of

labor between processes (Foster 2000).

“Grid” computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, a high-performance orientation. A Virtual Organization (VO) refers to a flexible, secure, coordinated resource sharing arrangement among dynamic collections of individuals, institutions, and resources.

2.5.2 Need for Grid Computing

The main aim of grid computing is to give organizations and application developers the ability to create distributed computing environments that can utilize computing resources on demand (Fran Berman et al 2003). Grid computing can leverage the computing power of a large number of servers, desktop PCs, clusters and other kinds of hardware. Grid technology allows the creation of a computing system with extreme performance and capacity from geographically distributed computational and memory resources. The system has the features of a global, worldwide computer, where all the components are connected together via the Internet. To the user it appears like a common workstation, but segments of a submitted task are computed in different parts of the system.


2.5.3 Benefits of Grid Computing

Using a grid computing architecture, organizations can quickly and easily create a large-scale computing infrastructure from inexpensive, off-the-shelf components. Other benefits of grid computing include:

• Quick response to volatile business needs

• Real-time responsiveness to dynamic workloads

• Predictable IT service levels

• Reduced costs, as a result of improved efficiency and smarter capacity planning.

The grid can help increase efficiencies and reduce the cost of computing networks by decreasing the data processing time, optimizing resources and distributing workloads, thereby allowing users to achieve much faster results on larger operations at lower costs. The main advantage of the grid is its high efficiency in the use of technological capacity: it is highly effective in harnessing the creative potential of its users, while offering safety, reliability, effectiveness and a high level of portability for computational applications.

2.5.4 Grid Architecture Description

In grid computing, a VO is a group which shares the same

computing resources in a controlled fashion, so that members may collaborate

to achieve a shared goal. To achieve their mutual goal, people within a VO

choose to share their resources, creating a computer grid. This grid can give VO members direct access to each other's computers, programs, files, data, sensors and networks. This sharing must be controlled, secure, flexible, and usually time-limited. The layered architecture of the grid is shown in Figure 2.4.


Figure 2.4 Grid layered architecture

Fabric: The grid fabric layer provides the resources to which shared access is

mediated by Grid protocols: for example, computational resources, storage

systems, catalogs, network resources, and sensors. Fabric components implement the local, resource-specific operations that occur on specific resources (whether physical or logical) as a result of sharing operations at higher levels.

Connectivity: The connectivity layer defines core communication and

authentication protocols required for grid-specific network transactions.

Communication protocols enable the exchange of data between Fabric layer

resources. Authentication protocols build on communication services to

provide cryptographically secure mechanisms for verifying the identity of

users and resources. Communication requirements include transport, routing,

and naming.

Resource: The resource layer builds on connectivity layer communication

and authentication protocols to define protocols (and APIs and SDKs) for the

secure negotiation, initiation, monitoring, control, accounting, and payment of

sharing operations on individual resources. The resource layer implementations of


these protocols call the fabric layer functions to access and control local resources. Resource layer protocols are concerned entirely with individual resources and hence ignore issues of global state and atomic actions across distributed collections; such issues are the concern of the collective layer discussed next.

Two primary classes of resource layer protocols can be

distinguished:

1. Information protocols are used to obtain information about the

structure and state of a resource, for example, its configuration,

current load, and usage policy (e.g., cost).

2. Management protocols are used to negotiate access to a shared

resource, specifying, for example, resource requirements

(including advanced reservation and quality of service) and

the operation(s) to be performed, such as process creation, or

data access. Since management protocols are responsible for

instantiating sharing relationships, they must serve as a

“policy application point,” ensuring that the requested

protocol operations are consistent with the policy under which

the resource is to be shared. Issues that must be considered

include accounting and payment. A protocol may also support

monitoring the status of an operation and controlling (for

example, terminating) the operation.

Collective: This layer is responsible for coordinating multiple resources. While the resource layer is focused on interactions with a single resource, the collective layer contains protocols and services (and APIs and SDKs) that


are not associated with any one specific resource, but are rather global in

nature and capture interactions across collections of resources.

Application: The final layer in our grid architecture comprises the user

applications that operate within a VO environment. Applications are

constructed in terms of, and by calling upon, services defined at any layer. At

each layer, there is a well-defined protocol which provides access to some

useful service: resource management, data access, resource discovery, and so

forth.

2.5.5 Types of Grid

There are different types of grid infrastructure to fit different types

of business problems. Some grids are designed to take advantage of extra

processing resources, whereas some grid architectures are designed to support

collaboration between various organizations. The type of grid selected is

based primarily on the business problem that is being solved.

2.5.5.1 Data Grid

Data services are concerned with providing secure access to

distributed datasets and their management. To provide a scalable storage and

access to the data sets, they may be replicated, catalogued, and different

datasets may even be stored in different locations to create an illusion of mass

storage. The processing of datasets is carried out using computational grid

services, and such a combination is commonly called a data grid.

A data grid provides services that help users discover, transfer, and manipulate large datasets stored in distributed repositories, and also create and manage copies of these datasets. At a minimum, a data grid provides two basic functionalities: a high-performance, reliable data transfer mechanism, and a scalable replica discovery and management mechanism. All the operations in a data grid are mediated by a security layer that handles the authentication of entities and ensures that only authorized operations are performed. Another aspect of a data grid is the maintenance of shared collections of data distributed across administrative domains. These collections are maintained independent of the underlying storage systems, and are able to include new sites without major effort. More importantly, it is required that the data and the information associated with the data, such as metadata, access controls, and version changes, be preserved even in the face of platform changes.

A data grid, therefore, provides a platform through which users can

access aggregated computational, storage and networking resources, to

execute their data-intensive applications on remote data. It promotes a rich

environment for users to analyze data, share the results with their

collaborators, and maintain state information about the data seamlessly across

institutional and geographical boundaries.

Figure 2.5 shows a high-level view of a worldwide data grid

consisting of computational and storage resources in different countries that

are connected by high-speed networks. The thick lines show high-bandwidth networks linking the major centers, and the thinner lines are lower-capacity networks that connect the major centers to their subsidiary centers. The data generated

from an instrument, experiment, or a network of sensors is stored in its

principal storage site, and is transferred to the other storage sites around the

world on request through the data replication mechanism.


Figure 2.5 Data grid

Users query their local replica catalog to locate datasets that they

require. If they have been granted the requisite rights and permissions, the

data is fetched from the repository local to their area if it is present there;

otherwise it is fetched from a remote repository. The data may be transmitted

to a computational site, such as a cluster or a supercomputer facility for

processing. After processing, the results may be sent to a visualization

facility, a shared repository, or to the desktops of the individual users.
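The lookup logic just described can be sketched as follows; the dictionary-based catalog, the function locate and all the names and paths are hypothetical illustrations, not an actual replica catalog API:

```python
# A conceptual sketch of replica-catalog lookup (not a real Globus API):
# each logical dataset name maps to its physical copies; prefer a local one.
replica_catalog = {  # logical name -> list of (site, physical path); invented
    "climate-2009": [("local", "/storage/climate-2009"),
                     ("cern",  "gsiftp://cern.example.org/climate-2009")],
    "physics-run7": [("fermi", "gsiftp://fermi.example.org/run7")],
}

def locate(dataset, user_authorized=True):
    """Return a physical replica of the dataset, preferring the local repository."""
    if not user_authorized:  # rights and permissions are checked first
        raise PermissionError("user lacks rights to this dataset")
    replicas = replica_catalog.get(dataset, [])
    for site, path in replicas:
        if site == "local":
            return path                              # fetch from the local repository
    return replicas[0][1] if replicas else None      # otherwise fetch remotely

print(locate("climate-2009"))   # -> /storage/climate-2009 (local copy)
print(locate("physics-run7"))   # -> remote gsiftp URL
```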

Data grids also harness data, storage, and network resources located in distinct administrative domains, respect local and global policies governing how data can be used, schedule resources efficiently, and provide high-speed and reliable access to data. Businesses interested in data grids typically have IT initiatives to expand data-mining abilities while maximizing the utilization of an existing storage infrastructure investment, and to reduce the complexity of data management.


2.5.5.2 Computational Grid

A grid providing computational services is often called a

computational grid. Computational services are concerned with providing

secure services for executing application jobs on distributed computational

resources individually or collectively. Resource brokers provide the services for the collective use of distributed resources. A computational grid

aggregates the processing power from a distributed collection of systems. A

computational grid is a hardware and software infrastructure that provides

dependable, consistent, pervasive, and inexpensive access to high-end

computational capabilities.

A computational grid can be recognized by these primary

characteristics:

• Made up of clusters of clusters

• Enables CPU scavenging to better utilize resources

• Provides the computational power to process large-scale jobs

• Satisfies the business requirements for instant access to resources on demand.

2.5.5.3 Knowledge Grid

The evolution of the data grid is represented by the knowledge grid, which offers high-level services for the distributed mining and extraction of knowledge from data repositories available on the grid. The Knowledge Grid (KG) is a parallel and distributed architecture that integrates data mining techniques and grid technologies. The knowledge grid is used to perform distributed data mining on very large data sets available over grids, to find hidden, valuable information and process models that support business decisions.


2.5.6 Applications of the Grid

The grid infrastructure can benefit many applications, including

collaborative engineering, data exploration, high-throughput computing, and

distributed supercomputing. A grid can be viewed as a seamless, integrated

computational and collaborative environment. From the end-user point of

view, grids can be used to provide the following types of services.

• Computational services

• Data services

• Application services

• Information services

• Knowledge services

2.6 DISTRIBUTED DATA MINING ON GRID

In data-intensive applications, the focus is on synthesizing new

information from the data that is maintained in geographically distributed

repositories, digital libraries and databases. This synthesizing process is often

computationally intensive.

The popularity of the internet as well as the availability of powerful

computers and high-speed network technologies as low-cost commodity

components is changing the way we use computers today. These technology

opportunities have led to the possibility of using distributed computers as a

single, unified computing resource, leading to what is popularly known as

grid computing. The grid infrastructure can benefit many applications,

including collaborative engineering, data exploration, high-throughput

computing, and distributed supercomputing.


Distribution of data and computation allows for solving larger

problems, and executing applications that are distributed in nature. The grid is

a distributed computing infrastructure that enables coordinated resource

sharing within dynamic organizations consisting of individuals, institutions,

and resources. The grid extends the distributed and parallel computing paradigms, allowing resource negotiation and dynamic allocation, heterogeneity, and open protocols and services.

Grid environments can be used both for compute-intensive tasks and for data-intensive applications, as they offer resources, services, and data access mechanisms. Data mining algorithms and knowledge discovery processes are both compute and data intensive; therefore, the grid can offer a computing and data management infrastructure for supporting decentralized and parallel data analysis. This thesis discusses how grid computing can be used to support distributed data mining. Grid-based data mining uses grids as decentralized high-performance platforms where data mining tasks, knowledge discovery algorithms and applications can be executed.

Grid computing represents the natural evolution of distributed

computing and parallel processing technologies. The grid is a distributed

computing infrastructure that enables coordinated resource sharing within

dynamic organizations consisting of individuals, institutions, and resources.

The main aim of grid computing is to give organizations and application

developers the ability to create distributed computing environments that can

utilize computing resources on demand. Grid computing can leverage the computing power of a large number of server computers, desktop PCs, clusters and other kinds of hardware. The grid can help increase efficiencies

and reduce the cost of computing networks by decreasing the data processing

time and optimizing resources and distributing workloads, thereby allowing

users to achieve much faster results on larger operations and at lower costs.


2.7 GLOBUS TOOLKIT

The term grid computing refers to the emerging computational and

networking infrastructure that is designed to provide pervasive, uniform and

reliable access to data, computational, and human resources distributed over

wide area environments. Grid services allow scientists at locations throughout

the world to share data collection instruments, such as particle colliders,

compute resources such as supercomputers and clusters of workstations, and

community datasets stored on network caches and hierarchical storage

systems.

The Globus Toolkit, developed within the Globus project, provides middleware services for grid computing environments. Major components include the Grid Security Infrastructure (GSI), which provides public-key-based authentication and authorization services; resource management services, which provide a language for specifying application requirements, and mechanisms for immediate and advance reservation of grid resources and for remote job management; and information services, which provide for the distributed publication and retrieval of information about grid resources.

The Globus Toolkit (http://www.globus.org/) is a community-based,

open-architecture, open-source set of services and software libraries that

supports grids and grid applications. The toolkit includes software for

security, information infrastructure, resource management, data management,

communication, fault detection, and portability. It is packaged as a set of

components that can be used either independently or together to develop

applications.

For each component, the toolkit defines both protocols and

application programming interfaces (APIs), and provides open-source

reference implementations in C and (for client-side APIs) Java.


A tremendous variety of higher-level services, tools, and applications have

been implemented in terms of these basic components. Some of these services

and tools are distributed as part of the toolkit, while others are available from

other sources.

2.7.1 Components of the Globus Toolkit

The Globus Toolkit is used to build the grid environment. The toolkit contains the following components.

• Grid security infrastructure (GSI): Enables secure authentication and communication over an open network, providing a number of services, including mutual authentication and single sign-on run-anywhere authentication, with support for local control over access rights, and mapping from global to local user identities. GSI is based on public key encryption, X.509 certificates, and the secure sockets layer (SSL) communication protocol.

• Monitoring and discovery service (MDS): Provides a framework for publishing and accessing information about grid resources, using the lightweight directory access protocol (LDAP) as a uniform interface to such information. The MDS provides two types of directory services: the grid resource information service (GRIS) and the grid index information service (GIIS). A GRIS can answer queries about the resources of a particular grid node; examples of information provided include host identity (e.g., operating systems and versions), as well as more dynamic information such as the current CPU load and memory availability. A GIIS combines the information provided by a set of GRIS services managed by an organization, giving a coherent system image that can be explored or searched by grid applications.

• Globus resource allocation manager (GRAM): Provides facilities for resource allocation and process creation, monitoring, and management. The GRAM simplifies the use of remote systems by providing a single standard interface for requesting and using remote system resources for the execution of jobs. The most common use of the GRAM is remote job submission and control, to support distributed computing applications.

• Dynamically updated resource online co-allocator (DUROC): Manages multi-requests of resources, delivers requests to different GRAMs, and provides time-barrier mechanisms among jobs. In Globus, a GRAM provides an interface to submit jobs on a particular set of physical resources, whereas the DUROC is used to coordinate transactions with independent GRAMs.

• Heartbeat monitor (HBM): Provides a mechanism for monitoring the state of processes. The HBM is designed to detect and report the failure of processes that have identified themselves to the HBM. It allows the simultaneous monitoring of both Globus system processes and application processes associated with user computations. The HBM also provides notification of process status exception events, so that recovery action can be taken.

• GridFTP: Implements a high-performance, secure data transfer mechanism based on an extension of the FTP protocol that allows parallel data transfer, partial file transfer, and third-party (server-to-server) data transfer, using the GSI for authentication. This allows grid applications to have ubiquitous, high-performance access to data, in a way that is compatible with the most popular file transfer protocol in use today.

• Replica catalog and replica management: Provide facilities for managing data replicas, i.e., multiple copies of data stored in different systems to improve access across geographically distributed grids. The replica catalog provides mappings between logical names for files and one or more copies of the files on physical storage systems; it is accessible via an associated library and a command-line tool. Replica management combines the replica catalog (for keeping track of replicated files) and GridFTP (for moving data) to manage data replication.

In the mobile web environment, the need for efficient location-based services has led to the development of the distributed data grid mining system proposed here, which predicts the location and service request patterns of mobile users.