CHAPTER 2
DISTRIBUTED DATA MINING ON DATA GRID
2.1 KNOWLEDGE DISCOVERY
Knowledge Discovery in Databases (KDD) refers to the nontrivial
extraction of implicit, previously unknown and potentially useful information
from the data in databases. While data mining and knowledge discovery in
databases are frequently treated as synonyms, data mining is actually part of
the knowledge discovery process. KDD employs data mining to find
useful patterns, models and trends in large volumes of data. In many
scientific and commercial applications, it is necessary to perform the analysis
of large data sets. With the enormous amount of data stored in files,
databases, and other repositories, it is increasingly important, if not necessary,
to develop a powerful means for the analysis and perhaps interpretation of
such data, and for the extraction of interesting knowledge that could help in
decision-making.
Knowledge discovery in databases is the process of
automatically searching large volumes of data for patterns, using tools such as
classification, association rule mining and clustering. Data mining is a
complex topic with links to multiple core fields: it is rooted in computer
science and builds on seminal computational techniques from
statistics, information retrieval, machine learning and pattern recognition.
Based on recent developments in grid computing, a Distributed Data
Mining (DDM) system architecture for the grid environment is proposed in
this thesis.
2.2 DATA MINING
Data mining on the grid has recently become an active research
topic. Data mining is the process of analyzing data from different perspectives
and summarizing it into useful information - information that can be used to
increase revenue, cut costs, or both. Data mining software is one of a number
of analytical tools for analyzing data. It allows users to analyze data from
many different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process of finding
correlations or patterns among dozens of fields in large relational databases.
The ultimate goal of data mining is prediction.
2.2.1 Data Mining Steps
Data mining is a powerful analytical tool that enables business
executives to advance from describing historical customer behavior to
predicting the future. It finds patterns that unlock the mysteries of customer
behavior. These findings can be used to increase revenue, reduce expenses
and identify business opportunities, offering new competitive advantages. The
KDD process comprises a few steps, leading from raw data collections to
some form of new knowledge. There are various steps that are involved in
mining data as shown in Figure 2.1.
• Data warehouse: The term data warehouse was coined by
Bill Inmon in 1990; he defined it in the following way: "A
warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of the management's
decision making process". All mobile users' movement and
service request logs are stored in the data warehouse and are
provided as input to the data mining process.
• Data selection: The data relevant for the analysis is decided
on and retrieved from the data warehouse.
• Data cleaning: In this step, noise and irrelevant data are
removed from the selected data.
• Data transformation: In this step, the selected data is
transformed into forms appropriate for the mining process.
• Data mining: Data mining techniques, such as the Apriori
algorithm for association rule mining, are applied to the
database to discover interesting patterns.
• Interpretation: This is the process of converting the result of
the data mining process into knowledge, which is presented to
the user.
• Action: The user can make use of the knowledge to take
better actions.
Figure 2.1 Data mining steps
2.2.2 Data Mining Techniques
2.2.2.1 Association
Association analysis is the discovery of what are commonly called
association rules. It studies the frequency of items occurring together in
transactional databases, and based on a threshold called support, identifies the
frequent item sets. Another threshold, confidence, which is the conditional
probability that an item appears in a transaction when another item appears, is
used to pinpoint association rules. Association rule mining is a data mining
technique used to find interesting relations among a large set of data items.
The discovered association rules may help decision making in different areas.
The problem of association rule mining originated in market
analysis on sales basket data. In a market-basket analysis, the buying habits of
customers are analyzed to find associations between the items
purchased. The discovery of such associations can help retailers develop
marketing and planning strategies. An association rule has the form
X → Y, where X and Y are itemsets. Items that are purchased together by
customers can thus be identified: whenever the customer bought X, he bought Y
also, e.g., {Bread, Jam} → {Butter}.
An association rule is an implication of the form X → Y, where
X = {x1, …, xm} and Y = {y1, …, yn} are sets of items with X ∩ Y = ∅. The
rule X → Y has support s if s% of all transactions contain X ∪ Y. The rule
X → Y has confidence c if c% of the transactions that contain X also contain Y.
The problem of mining association rules is to generate all the association
rules that have a support and confidence greater than the user-specified
minimum support min_supp and minimum confidence min_conf,
respectively.
The problem of discovering association rules can be decomposed
into two subproblems:
1) Find the set F of all itemsets with support above the minimum
support min_supp. These itemsets are called frequent itemsets.
2) Use the frequent itemsets to generate the desired rules. For
every X ∈ F, check the confidence of all rules X \ Y → Y, with
Y ⊂ X and Y ≠ ∅, and eliminate those that do not achieve
min_conf. It is sufficient to calculate the support values of
the subsets of X to determine the confidence of each rule.
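The following Python sketch illustrates this two-phase decomposition on a toy transaction set. It is a minimal sketch, not the implementation used in this thesis: the transactions and the min_supp and min_conf thresholds are invented, and the frequent-itemset phase enumerates all candidate itemsets rather than applying Apriori's level-wise candidate pruning.

    from itertools import combinations

    # Toy transaction database (illustrative values only).
    transactions = [
        {"bread", "jam", "butter"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "bread", "butter"},
    ]
    min_supp, min_conf = 0.5, 0.7  # hypothetical user-specified thresholds

    def support(itemset):
        # Fraction of all transactions containing every item of the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Subproblem 1: find the set F of frequent itemsets (support >= min_supp).
    items = sorted(set().union(*transactions))
    F = [set(c) for k in range(1, len(items) + 1)
         for c in combinations(items, k) if support(set(c)) >= min_supp]

    # Subproblem 2: for every X in F and non-empty proper subset Y of X, keep
    # the rule (X \ Y) -> Y only if its confidence reaches min_conf.
    for X in F:
        for k in range(1, len(X)):
            for Y in (set(c) for c in combinations(sorted(X), k)):
                confidence = support(X) / support(X - Y)
                if confidence >= min_conf:
                    print(sorted(X - Y), "->", sorted(Y), round(confidence, 2))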
2.2.2.2 Classification
Classification is a data mining (machine learning) technique used to
predict the group membership for data instances. For example, you may wish
to use classification to predict whether the weather on a particular day will be
“sunny”, “rainy” or “cloudy”. Classification analysis is the organization of
data into given classes. Also known as supervised classification, it
uses given class labels to assign the objects in the data collection to classes.
Classification approaches normally use a training set, where all objects are
already associated with known class labels. The classification algorithm
learns from the training set and builds a model. The model is used to classify
new objects.
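As a concrete illustration of this train-then-classify cycle, the short Python sketch below builds a decision tree model from a labelled training set and uses it to classify a new object. The weather data, the numeric feature encoding and the use of scikit-learn are all assumptions made for the example.

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical training set: objects described by [humidity %, pressure hPa]
    # with known class labels (invented values for illustration).
    X_train = [[85, 1002], [90, 998], [40, 1020], [45, 1018], [70, 1008]]
    y_train = ["rainy", "rainy", "sunny", "sunny", "cloudy"]

    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)          # learn a model from the training set

    print(model.predict([[88, 1000]]))   # classify a new, unlabelled object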
2.2.2.3 Clustering
Similar to classification, clustering is the organization of data in
classes. However, unlike classification, in clustering, class labels are
unknown, and it is up to the clustering algorithm to discover acceptable
classes. Clustering is also called unsupervised classification, because the
classification is not dictated by given class labels. There are many clustering
approaches, all based on the principle of maximizing the similarity between
objects in the same class (intra-class similarity), and minimizing the similarity
between objects of different classes (inter-class similarity).
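The k-means procedure sketched below in Python illustrates this principle: with no class labels given, points are repeatedly assigned to their nearest centroid, and each centroid is moved to the mean of its members, increasing intra-class similarity. The two-dimensional points and the choice of k are illustrative assumptions.

    import random

    def kmeans(points, k, iterations=20):
        centroids = random.sample(points, k)   # arbitrary initial centroids
        clusters = []
        for _ in range(iterations):
            # Assignment: each point joins the cluster of its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: (p[0] - centroids[i][0]) ** 2
                                          + (p[1] - centroids[i][1]) ** 2)
                clusters[nearest].append(p)
            # Update: move each centroid to the mean of its cluster members.
            for i, members in enumerate(clusters):
                if members:
                    centroids[i] = (sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members))
        return clusters

    points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
    print(kmeans(points, k=2))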
2.2.2.4 Prediction
Prediction has attracted considerable attention, given the potential
implications of successful forecasting in a business context. There are two
major types of predictions: one can either try to predict some unavailable data
values or pending trends, or predict a class label for some data. The latter is
tied to classification. Once a classification model is built based on a training
set, the class label of an object can be foreseen, based on the attribute values
of the object and the attribute values of the classes. More often, however,
prediction refers to the forecast of missing numerical values, or increasing/
decreasing trends in time-related data. The major idea is to use a large number
of past values to consider probable future values.
2.2.2.5 Deviation Analysis
Deviation analysis, on the other hand, considers differences
between the measured values and expected values, and attempts to find the
cause of the deviations from the anticipated values.
2.2.2.6 Outlier Analysis
Outliers are data elements that cannot be grouped in a given class
or cluster. Also known as exceptions or surprises, they are often very
important to identify. While outliers can be considered noise and discarded in
some applications, they can reveal important knowledge in other domains,
and thus can be very significant and their analysis valuable.
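A very simple way to make this concrete is a standard-deviation test, sketched below in Python: elements lying more than two standard deviations from the mean are flagged as outliers. The sensor readings and the two-sigma threshold are illustrative assumptions; practical outlier analysis typically relies on more robust statistics.

    from statistics import mean, stdev

    # Hypothetical sensor readings; 25.7 does not fit the rest of the data.
    readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.7, 10.2]
    mu, sigma = mean(readings), stdev(readings)

    # Flag elements deviating from the mean by more than two standard deviations.
    outliers = [x for x in readings if abs(x - mu) > 2 * sigma]
    print(outliers)   # -> [25.7]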
2.3 MOBILE WEB ENVIRONMENTS
The mobile web refers to the use of Internet-connected
applications, or browser-based access to the Internet from a mobile device -
such as a laptop or tablet PC - connected to a wireless network. Advances in
Internet and network technology and the rapidly growing number of mobile
personal devices have resulted in the fast growth of mobile E-Commerce and
M-Commerce. The goal of the mobile web initiative is to make browsing the
web from mobile devices more reliable and accessible.
2.3.1 Mobile Web Services
On the mobile web, Location-Based Services (LBSs) are information
services accessible with mobile devices through the mobile network, which
make use of the current location of the mobile device. In this research work,
the predicted location and service request patterns are effectively used by
service providers to provide location-based services. Web services are
software components identified by a URI. Mobile web services can be
accessed over the Internet using popular Web mechanisms and protocols,
such as HTTP.
2.3.2 Mobility Prediction
Mobile users move from one location to another location in a
wireless PCS network. The coverage area of the network is divided into a
number of location areas. Each mobile device is linked with the Base Station
(BS). Each Base station contains a Home Location Register (HLR), which
stores permanent details of the mobile users, and a Visiting Location Register
(VLR), which stores temporary details of the mobile users. These registers
include attributes like the user ID, user location, call time, call duration, etc.
The user ID acts as a key for mobile user records. Mobility patterns are mined
from the User Access Path (UAP), and mobility rules are generated using
mobility patterns. Finally, the current location of the mobile user is compared
with the mobility rules to predict the user's next locations. By using the
predicted movement, the system can effectively allocate resources and
provide location-based services to the mobile users.
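The sketch below illustrates, in Python, how such mobility rules might be applied: each rule maps a recently observed path to a probable next cell with an associated confidence, and any rule whose antecedent matches the tail of the user's current trajectory contributes a prediction. The cell identifiers, rule format and confidence values are hypothetical; they are not the mining results of this thesis.

    # Hypothetical mobility rules mined from User Access Paths (UAPs):
    # (observed path, predicted next cell, rule confidence).
    mobility_rules = [
        (("C1", "C4"), "C7", 0.82),
        (("C4", "C7"), "C9", 0.64),
        (("C1",), "C4", 0.91),
    ]

    def predict_next(recent_path):
        # Collect cells predicted by every rule whose antecedent matches
        # the tail of the user's recent movement, best confidence first.
        matches = [(cell, conf) for path, cell, conf in mobility_rules
                   if tuple(recent_path[-len(path):]) == path]
        return sorted(matches, key=lambda m: m[1], reverse=True)

    print(predict_next(["C0", "C1", "C4"]))   # -> [('C7', 0.82)]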
2.3.3 Location Based Web Services on the Mobile Web
A location-based service consists roughly of two phases:
determining the position of the customer, and providing service or contents
based on the position. LBSs define those mobile commerce services that
utilize information about the current location of the person using a mobile
device. Ideally, the information provided should be both location-specific and
personalized, based on the personal profile of the user.
2.3.4 Prediction of Mobile User Behavior
The predicted patterns can be used by service providers to offer
location-based services to mobile users. Location-based services can be
divided into two categories, based on the perspective of the user:
• End user application perspective: traffic and weather
information for the user's current location, driving directions,
entertainment applications, wireless advertising, store
location, etc.
• Developer and vendor perspective: mapping, routing
(directions), GPS navigation (real-time turn-by-turn
navigation), proximity searches, destination guides, tracking
(vehicles, assets, friend or buddy finders), telematics, location-
based billing, advertising, etc.
2.4 DISTRIBUTED DATA MINING
Distributed computing plays an important role in this process for
several reasons. First, data mining often requires huge amounts of resources in
storage space and computation time. To make systems scalable, it is important
to develop mechanisms that distribute the work load among several sites in a
flexible way. Second, data is often inherently distributed over several databases,
making a centralized processing of this data very inefficient and prone to security
risks. Finally, many data mining tasks require connecting heterogeneous
resources, such as data sources, processing nodes and end user applications.
2.4.1 Centralized Data Mining
Centralized data mining algorithms are used for the analysis of
large datasets stored in a single site. Data mining and data warehousing go
hand-in-hand: most tools operate on the principle of gathering all data into a
central site, and then running an algorithm against that data. A simple
architecture of centralized data mining is shown in Figure 2.2.
Figure 2.2 Centralized data mining
2.4.2 Limitations of Centralized Data Mining
Mining large data sets requires powerful computational resources.
A major issue in data mining is scalability with respect to very large
databases. Centralized data mining algorithms take a long time to run on
large databases. A number of applications are infeasible under
such a methodology, leading to the need for distributed data mining.
Several issues give rise to distributed data mining:
1. Connectivity: Transmitting a large amount of data to a
centralized node may be infeasible.
2. Heterogeneity of source data: Integrating heterogeneous data
from different places into one database is not easy.
3. Privacy of data: Organizations may be willing to share data
mining results, but not data.
2.4.3 Need for Distributed Data Mining
Distributed data mining refers to the mining of distributed data sets.
Distributed data mining explores techniques of how to apply data mining in a
non-centralized way. The data sets are stored in local databases hosted
by local computers, which are connected through a computer network. Data
mining takes place at a local level and at a global level, where local data
mining results are combined to get global results. A simple architecture for
distributed data mining is shown in Figure 2.3.
Figure 2.3 Distributed data mining
The following reasons have made distributed data mining interesting.
• Distributed data: In some applications, data are inherently
distributed, but it is necessary to gain global knowledge from
the distributed data sites. For example, each site of a
multinational company manages its own operational data
locally, but the data must be analyzed for global patterns to
allow company-wide activities, such as planning, marketing
and sales. The straightforward solution is to move all data to a
centralized site, where data mining is done. Even if a
centralized site has enough capacity to handle the data storage
and data mining, it may take too long or be too expensive to
transfer the local data sets because of their sizes.
• Security: Sometimes the local data cannot be transferred
because of security and the autonomy of the data sites.
• Scalability: Distributed data mining may be useful even where
data is stored in a single site. One scenario is that the data set is
so large that it is beyond the data mining capability of the site. In
such a case, the site may send part of the data to other sites for
mining. The involved sites perform data mining and the results
are then combined.
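The sketch below illustrates in Python the local-mining-plus-combiner structure of Figure 2.3 for one concrete task, frequent itemset counting: each site mines its own transactions and ships only its itemset counts, which a combiner merges into global results. The site data are invented, and merging support counts is just one possible combining strategy (as in count-distribution style algorithms); it is not the specific method of this thesis.

    from collections import Counter
    from itertools import combinations

    # Each site holds its own local transactions (illustrative data).
    site_data = {
        "site_A": [{"bread", "butter"}, {"bread", "jam"}],
        "site_B": [{"bread", "butter"}, {"milk", "bread"}],
    }

    def local_mining(transactions):
        # Local level: count 1- and 2-itemsets in this site's data only.
        counts = Counter()
        for t in transactions:
            for k in (1, 2):
                counts.update(frozenset(c) for c in combinations(sorted(t), k))
        return counts

    def result_combiner(local_results, total_transactions, min_supp=0.5):
        # Global level: merge the local counts; only counts cross the
        # network, the raw data never leaves its site.
        merged = sum(local_results, Counter())
        return [sorted(i) for i, n in merged.items()
                if n / total_transactions >= min_supp]

    local_results = [local_mining(t) for t in site_data.values()]
    total = sum(len(t) for t in site_data.values())
    print(result_combiner(local_results, total))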
2.4.4 Advantages of Distributed Data Mining
Distributed data mining techniques are scalable as data size
increases, and can be much faster than centralized data mining algorithms,
since the workload is distributed among the sites. Distributed data mining
has become feasible with advances in computer networks, especially the
growth of the internet and intranets. In distributed data mining, the
computers and data are distributed; in parallel data mining, a parallel
computer with processors sharing memory and/or disk is assumed, whereas
in distributed data mining the processors share nothing.
2.5 GRID COMPUTING
2.5.1 Grid Definition
Grid computing represents the natural evolution of distributed
computing infrastructure that enables coordinated resource sharing within
dynamic organizations consisting of individuals, institutions and resources.
As the grid is becoming a well accepted computing infrastructure in science
and industry, it is necessary to provide general data mining services,
algorithms and applications that help analysts, scientists, organizations and
professionals to leverage grid capacity in supporting high-performance
distributed computing for solving their data mining problems in a distributed
way (Foster 2001).
A grid is a distributed system that enables the sharing, selection and
aggregation of geographically distributed "autonomous" resources dynamically at
runtime, depending on their availability, capability, performance, cost, and the
user's quality-of-service requirements. The grid uses the resources of many
separate computers connected by a network (usually the Internet) to solve
large-scale computation problems. Grids provide the ability to perform
computations on large data sets, by breaking them down into many smaller
ones, or provide the ability to perform many more computations at once, than
would be possible on a single computer, by modeling a parallel division of
labor between processes (Foster 2000).
“Grid” computing has emerged as an important new field,
distinguished from conventional distributed computing by its focus on large-
scale resource sharing, innovative applications, and, in some cases, high-
performance orientation. The term Virtual Organization (VO) refers to
flexible, secure, coordinated resource sharing among dynamic collections of
individuals, institutions, and resources.
2.5.2 Need for Grid Computing
The main aim of grid computing is to give organizations and
application developers the ability to create distributed computing environments
that can utilize computing resources on demand (Berman et al 2003).
Grid computing can leverage the computing power of large numbers of
servers, desktop PCs, clusters and other kinds of hardware. Grid
technology allows the creation of a computing system with extreme performance
and capacity from geographically distributed computational and memory
resources. The system has the features of a global worldwide computer, where
all the components are connected together via the internet. For the user it
appears like a common workstation, but some segments of a solved task are
computed in different parts of the system.
2.5.3 Benefits of Grid Computing
Using a grid computing architecture, organizations can quickly and
easily create a large-scale computing infrastructure from inexpensive, off-the-
shelf components. Other benefits of grid computing include
• Quick response to volatile business needs
• Real-time responsiveness to dynamic workloads
• Predictable IT service levels
• Reduced costs, as a result of improved efficiency and smarter
capacity planning.
The grid can help increase efficiency and reduce the cost of
computing networks by decreasing data processing time, optimizing
resources and distributing workloads, thereby allowing users to achieve
much faster results on larger operations at lower cost. The main advantage
of the grid is its highly efficient use of technological capacity: it effectively
harnesses the combined capacities of its users while offering safety,
reliability, effectiveness and a high level of portability for computational
applications.
2.5.4 Grid Architecture Description
In grid computing, a VO is a group which shares the same
computing resources in a controlled fashion, so that members may collaborate
to achieve a shared goal. To achieve their mutual goal, people within a VO
choose to share their resources, creating a computer grid. This grid can give
VO members direct access to each other's computers, programs, files, data,
sensors and networks. This sharing must be controlled, secure, flexible, and
usually time-limited. The layered architecture of the grid is shown in Figure 2.4.
Figure 2.4 Grid layered architecture
Fabric: The grid fabric layer provides the resources to which shared access is
mediated by Grid protocols: for example, computational resources, storage
systems, catalogs, network resources, and sensors. Fabric components
implement the local, resource-specific operations that occur on specific
resources (whether physical or logical) as a result of sharing operations at
higher levels.
Connectivity: The connectivity layer defines core communication and
authentication protocols required for grid-specific network transactions.
Communication protocols enable the exchange of data between Fabric layer
resources. Authentication protocols build on communication services to
provide cryptographically secure mechanisms for verifying the identity of
users and resources. Communication requirements include transport, routing,
and naming.
Resource: The resource layer builds on connectivity layer communication
and authentication protocols to define protocols (and APIs and SDKs) for the
secure negotiation, initiation, monitoring, control, accounting, and payment of
sharing operations on individual resources. The resource layer implementations of
these protocols call the fabric layer functions to access and control local
resources. Resource layer protocols are concerned entirely with individual
resources, and hence, ignore issues of global state and atomic actions across
distributed collections; such issues are the concern of the collective layer
discussed next.
Two primary classes of resource layer protocols can be
distinguished:
1. Information protocols are used to obtain information about the
structure and state of a resource, for example, its configuration,
current load, and usage policy (e.g., cost).
2. Management protocols are used to negotiate access to a shared
resource, specifying, for example, resource requirements
(including advanced reservation and quality of service) and
the operation(s) to be performed, such as process creation, or
data access. Since management protocols are responsible for
instantiating sharing relationships, they must serve as a
“policy application point,” ensuring that the requested
protocol operations are consistent with the policy under which
the resource is to be shared. Issues that must be considered
include accounting and payment. A protocol may also support
monitoring the status of an operation and controlling (for
example, terminating) the operation.
Collective: The collective layer is responsible for coordinating multiple resources. While the
resource layer is focused on interactions with a single resource, the next layer
in the architecture contains protocols and services (and APIs and SDKs) that
are not associated with any one specific resource, but are rather global in
nature and capture interactions across collections of resources.
Application: The final layer in our grid architecture comprises the user
applications that operate within a VO environment. Applications are
constructed in terms of, and by calling upon, services defined at any layer. At
each layer, there is a well-defined protocol which provides access to some
useful service: resource management, data access, resource discovery, and so
forth.
2.5.5 Types of Grid
There are different types of grid infrastructure to fit different types
of business problems. Some grids are designed to take advantage of extra
processing resources, whereas some grid architectures are designed to support
collaboration between various organizations. The type of grid selected is
based primarily on the business problem that is being solved.
2.5.5.1 Data Grid
Data services are concerned with providing secure access to
distributed datasets and their management. To provide scalable storage and
access to the data sets, they may be replicated, catalogued, and different
datasets may even be stored in different locations to create an illusion of mass
storage. The processing of datasets is carried out using computational grid
services, and such a combination is commonly called a data grid.
A data grid provides services that help users discover, transfer, and
manipulate large datasets stored in distributed repositories and also, create
and manage copies of these datasets. At the minimum, a data grid provides
two basic functionalities: a high performance, reliable data transfer
mechanism, and a scalable replica discovery and management mechanism. A
security layer that handles the authentication of entities and ensures the
conduct of only authorized operations mediates all the operations in a data
grid. Another aspect of a data grid is the maintenance of shared collections of
data distributed across administrative domains. These collections are
maintained independent of the underlying storage systems, and are able to
include new sites without major effort. More importantly, it is required that
the data, and the information associated with it, such as metadata, access
controls, and version changes, be preserved even in the face of platform
changes.
A data grid, therefore, provides a platform through which users can
access aggregated computational, storage and networking resources, to
execute their data-intensive applications on remote data. It promotes a rich
environment for users to analyze data, share the results with their
collaborators, and maintain state information about the data seamlessly across
institutional and geographical boundaries.
Figure 2.5 shows a high-level view of a worldwide data grid
consisting of computational and storage resources in different countries that
are connected by high-speed networks. The thick lines show high bandwidth
networks linking the major centers, and the thinner lines are lower capacity
networks that connect the latter to their subsidiary centers. The data generated
from an instrument, experiment, or a network of sensors is stored in its
principal storage site, and is transferred to the other storage sites around the
world on request through the data replication mechanism.
Figure 2.5 Data grid
Users query their local replica catalog to locate datasets that they
require. If they have been granted the requisite rights and permissions, the
data is fetched from the repository local to their area if it is present there;
otherwise it is fetched from a remote repository. The data may be transmitted
to a computational site, such as a cluster or a supercomputer facility for
processing. After processing, the results may be sent to a visualization
facility, a shared repository, or to the desktops of the individual users.
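A minimal sketch of this lookup logic in Python follows. The catalog contents, site names and authorization flag are hypothetical; they merely illustrate the mapping from logical dataset names to physical replicas and the preference for a local copy.

    # Hypothetical replica catalog: logical dataset name -> physical copies.
    replica_catalog = {
        "cms/run42.dat": ["storage.site-a.example", "storage.site-b.example"],
        "cms/run43.dat": ["storage.site-b.example"],
    }

    def locate(dataset, local_site, authorized):
        if not authorized:
            raise PermissionError("user lacks the rights to this dataset")
        copies = replica_catalog[dataset]
        # Prefer the replica local to the user's area; otherwise go remote.
        return local_site if local_site in copies else copies[0]

    print(locate("cms/run42.dat", "storage.site-a.example", authorized=True))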
Data grids also harness data, storage, and network resources located
in distinct administrative domains, respect local and global policies governing
how data can be used, schedule resources efficiently, and provide high speed
and reliable access to data. Businesses interested in data grids typically have
IT initiatives to expand data-mining abilities, while maximizing the utilization
of an existing storage infrastructure investment, and to reduce the complexity
of data management.
2.5.5.2 Computational Grid
A grid providing computational services is often called a
computational grid. Computational services are concerned with providing
secure services for executing application jobs on distributed computational
resources individually or collectively. Resource brokers provide the services
for the collective use of distributed resources. A computational grid
aggregates the processing power from a distributed collection of systems. A
computational grid is a hardware and software infrastructure that provides
dependable, consistent, pervasive, and inexpensive access to high-end
computational capabilities.
A computational grid can be recognized by these primary
characteristics:
• Made up of clusters of clusters
• Enables CPU scavenging to better utilize resources
• Provides the computational power to process large-scale jobs
• Satisfies the business requirements for instant access to
resources on demand.
2.5.5.3 Knowledge Grid
The evolution of the data grid is represented by the knowledge
grid, which offers high-level services for the distributed mining and
extraction of knowledge from data repositories available on the grid. The
Knowledge Grid (KG) is a parallel and distributed architecture that
integrates data mining techniques and grid technologies. The knowledge
grid is used to perform distributed data mining on very large data sets
available over grids, to find hidden valuable information and process
models that support business decisions.
2.5.6 Applications of the Grid
The grid infrastructure can benefit many applications, including
collaborative engineering, data exploration, high-throughput computing, and
distributed supercomputing. A grid can be viewed as a seamless, integrated
computational and collaborative environment. From the end-user point of
view, grids can be used to provide the following types of services.
• Computational services
• Data services
• Application services
• Information services
• Knowledge services
2.6 DISTRIBUTED DATA MINING ON GRID
In data-intensive applications, the focus is on synthesizing new
information from the data that is maintained in geographically distributed
repositories, digital libraries and databases. This synthesizing process is often
computationally intensive.
The popularity of the internet as well as the availability of powerful
computers and high-speed network technologies as low-cost commodity
components is changing the way we use computers today. These technology
opportunities have led to the possibility of using distributed computers as a
single, unified computing resource, leading to what is popularly known as
grid computing.
Distribution of data and computation allows for solving larger
problems, and executing applications that are distributed in nature. The grid is
a distributed computing infrastructure that enables coordinated resource
sharing within dynamic organizations consisting of individuals, institutions,
and resources. The grid extends the distributed and parallel computing
paradigms by allowing resource negotiation and dynamic allocation,
heterogeneity, and open protocols and services.
Grid environments can be used for both compute-intensive tasks
and data-intensive applications, as they offer resources, services, and data
access mechanisms. Data mining algorithms and knowledge discovery
processes are both compute and data intensive; therefore, the grid can offer a computing and
data management infrastructure for supporting decentralized and parallel data
analysis. This thesis discusses how grid computing can be used to support
distributed data mining. Grid-based data mining uses grids as decentralized
high-performance platforms where data mining tasks and knowledge
discovery algorithms and applications can be executed.
In summary, grid computing represents the natural evolution of
distributed computing and parallel processing technologies, giving
organizations and application developers the ability to create distributed
computing environments that utilize computing resources on demand. By
decreasing data processing time, optimizing resources and distributing
workloads, the grid allows users to achieve much faster results on larger
operations at lower cost.
2.7 GLOBUS TOOLKIT
The term grid computing refers to the emerging computational and
networking infrastructure that is designed to provide pervasive, uniform and
reliable access to data, computational, and human resources distributed over
wide area environments. Grid services allow scientists at locations throughout
the world to share data collection instruments, such as particle colliders,
compute resources such as supercomputers and clusters of workstations, and
community datasets stored on network caches and hierarchical storage
systems.
The Globus Toolkit, developed within the Globus project, provides
middleware services for grid computing environments. Major components
include the Grid Security Infrastructure (GSI), which provides public-key-
based authentication and authorization services; resource management
services, which provide a language for specifying application requirements,
mechanisms for immediate and advance reservations of grid resources, and
for remote job management; and information services, which provide for the
distributed publication and retrieval of information about grid resources.
The Globus Toolkit (http://www.globus.org/) is a community-based,
open-architecture, open-source set of services and software libraries that
supports grids and grid applications. The toolkit includes software for
security, information infrastructure, resource management, data management,
communication, fault detection, and portability. It is packaged as a set of
components that can be used either independently or together to develop
applications.
For each component, the toolkit defines both protocols and
application programming interfaces (APIs), and provides open-source
reference implementations in C and (for client-side APIs) Java.
A tremendous variety of higher-level services, tools, and applications have
been implemented in terms of these basic components. Some of these services
and tools are distributed as part of the toolkit, while others are available from
other sources.
2.7.1 Components of the Globus Toolkit
The Globus Toolkit (GT) is used to build the grid environment. The GT
contains the following components.
• Grid security infrastructure (GSI): Enables secure
authentication and communication over an open network
providing a number of services, including mutual authentication
and single sign-on run-anywhere authentication, with support
for local control over access rights, and mapping from global
to local user identities. GSI is based on public key encryption,
X.509 certificates, and the secure sockets layer (SSL)
communication protocol.
• Monitoring and discovery service (MDS): Provides a
framework for publishing and accessing information about
grid resources by using the lightweight directory access
protocol (LDAP) as a uniform interface to such information.
The MDS provides two types of directory services: the grid
resource information service (GRIS) and the grid index
information service (GIIS). A GRIS can answer queries about
the resources of a particular grid node; examples of
information provided include host identity (e.g., operating
systems and versions), as well as more dynamic information
such as the current CPU load and memory availability. A GIIS
combines the information provided by a set of GRIS services
managed by an organization, giving a coherent system image
that can be explored or searched by grid applications.
• Globus resource allocation manager (GRAM): Provides
facilities for resource allocation and process creation,
monitoring, and management. The GRAM simplifies the use
of remote systems by providing a single standard interface for
requesting and using remote system resources for the
execution of jobs. The most common use of the GRAM is
remote job submission and control, to support distributed
computing applications.
• Dynamically updated resource online co-allocator
(DUROC): Manages multiple requests for resources, delivers
requests to different GRAMs and provides time-barrier
mechanisms among jobs. In Globus, a GRAM provides an
interface to submit jobs on a particular set of physical
resources, whereas the DUROC is used to coordinate
transactions with independent GRAMs.
• Heartbeat monitor (HBM): Provides a mechanism for
monitoring the state of processes. The HBM is designed to
detect and report the failure of processes that have identified
themselves to the HBM. It allows the simultaneous
monitoring of both Globus system processes and application
processes associated with user computations. The HBM also
provides the notification of process status exception events, so
that recovery action can be taken.
• GridFTP: Implements a high-performance, secure data
transfer mechanism based on an extension of the FTP protocol
that allows parallel data transfer, partial file transfer, and
third-party (server-to-server) data transfer, using the GSI for
authentication. This allows grid applications to have
ubiquitous, high-performance access to data, in a way that is
compatible with the most popular file transfer protocol in use
today.
• Replica catalog and replica management: Provide facilities
for managing data replicas, i.e., multiple copies of data stored
in different systems to improve access across geographically
distributed grids. The replica catalog provides mappings
between logical names for files and one or more copies of the
files on physical storage systems; it is accessible via an
associated library and a command-line tool. The replica
management combines the replica catalog (for keeping track
of replicated files) and the GridFTP (for moving data) to
manage data replication.
In the mobile web environment, the need for efficient location-
based services leads to the development of a distributed data mining system
on the data grid, to predict the location and service request patterns of
mobile users.