Venezia, October 17-18, 2005
1
Weka4WS: Enabling Distributed Data
Mining on Grids
Domenico Talia, Paolo Trunfio, Oreste Verta
DEIS
University of Calabria
Rende, Italy
2
Outline
• Goal of this work
• Weka4WS, WSRF and GT4
• Weka4WS architecture
• Software components
• Graphical user interface
• Web Services operations
• Execution mechanisms
• Performance analysis
• Conclusions
3
Goal of this work
• Computational Grids have emerged in recent years as effective infrastructures for distributed high-performance computing and data processing
• By exploiting a service-oriented approach, knowledge discovery applications can be developed on Grids to deliver high performance and manage data and knowledge distribution
• The goal of this work is to extend the Weka toolkit to support distributed data mining through the use of standard Grid technologies, such as the emerging Web Services Resource Framework (WSRF) and the Globus Toolkit 4
4
The Weka4WS framework
• Weka provides a large collection of machine learning algorithms for data pre-processing, classification, clustering, association rules, and visualization, which can be used through a common GUI
• In Weka, the overall data mining process takes place on a single machine, since the algorithms can be executed only locally
• The goal of Weka4WS is to extend Weka to support remote execution of Weka data mining algorithms
5
The Weka4WS framework
• In Weka4WS the data mining algorithms for classification, clustering and association rules can be executed on remote Grid resources to implement distributed data mining and speed up applications
• To enable remote invocation, each data mining algorithm provided by the Weka library is exposed as a Web Service, which can be easily deployed on the available Grid nodes
• To achieve integration and interoperability with standard Grid environments, Weka4WS has been designed and developed by using the emerging Web Services Resource Framework (WSRF) as enabling technology
6
WSRF and Globus Toolkit 4
• WSRF is a family of technical specifications concerned with the creation, addressing, inspection, and lifetime management of stateful resources
• The framework codifies the relationship between Web Services and stateful resources, which are termed WS-Resources in WSRF terminology
• The Globus Alliance recently released the Globus Toolkit 4 (GT4), which provides an open source implementation of the WSRF library and incorporates Web Services implemented according to the WSRF specifications
• Weka4WS has been developed by using the Java WSRF library provided by GT4
8
Weka4WS architecture
• In Weka4WS all nodes use GT4 services for standard Grid functionalities, such as security and data management
• These nodes fall into two categories:
  • user nodes, which are the local machines of the users and provide the Weka4WS client software
  • computing nodes, which provide the Weka4WS Web Services allowing the execution of remote data mining tasks
• Data can be located on computing nodes, user nodes, or third-party nodes
• If the dataset to be mined is not locally available on a computing node, it can be downloaded or replicated by means of the GT4 data management services
9
Software components
• User nodes include three software components:
  • Graphical User Interface (GUI)
  • Client Module (CM)
  • Weka Library (WL)
[Figure: a user node (Graphical User Interface, Client Module, Weka Library, GT4 Services) and a computing node (Web Service, Weka Library, GT4 Services)]
10
Software components
• Computing nodes include two software components:
  • Web Service (WS)
  • Weka Library (WL)
11
Software components
• The GUI extends the Weka Explorer environment to allow the execution of both local and remote data mining tasks:
  • local tasks are executed by directly invoking the local WL
  • remote tasks are executed through the CM, which operates as an intermediary between the GUI and Web Services on remote computing nodes
[Figure: user node and computing node components; a local task goes from the Graphical User Interface to the local Weka Library, a remote task goes from the Graphical User Interface through the Client Module to the Web Service on the computing node]
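The split between local and remote execution can be sketched as follows. This is only an illustration in Python, not the Weka4WS code (which builds on the Java Weka toolkit): the class names, the `run_task` method, and the service URL are hypothetical, and the remote call is replaced by a stub that does no networking.

```python
from dataclasses import dataclass
from typing import Optional


class WekaLibrary:
    """Stand-in for the local Weka Library (WL)."""

    def run(self, algorithm: str, dataset: str) -> str:
        return f"local:{algorithm}({dataset})"


class ClientModule:
    """Stand-in for the Client Module (CM); a real CM would call a remote Web Service."""

    def run(self, service_url: str, algorithm: str, dataset: str) -> str:
        # Stub: no network access, it only records where the task would be sent.
        return f"remote[{service_url}]:{algorithm}({dataset})"


@dataclass
class Gui:
    wl: WekaLibrary
    cm: ClientModule

    def run_task(self, algorithm: str, dataset: str,
                 service_url: Optional[str] = None) -> str:
        # Local tasks go straight to the WL; remote tasks go through the CM.
        if service_url is None:
            return self.wl.run(algorithm, dataset)
        return self.cm.run(service_url, algorithm, dataset)


gui = Gui(WekaLibrary(), ClientModule())
local_result = gui.run_task("EM", "census.arff")
remote_result = gui.run_task("EM", "census.arff",
                             service_url="https://node1.example.org/weka4ws")
```

The only decision the GUI makes in this sketch is which component receives the task; everything about the remote protocol stays inside the CM stand-in.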
12
Software components
• The WS is a WSRF-compliant Web Service that exposes the data mining algorithms provided by the underlying WL
• Therefore, requests to the WS are executed by invoking the corresponding WL algorithms
[Figure: user node and computing node components; on the computing node, the Web Service performs the algorithm invocation on the Weka Library]
13
Graphical user interface
• A “Remote pane” has been added to the original Weka Explorer environment
14
Graphical user interface
• This pane provides a list of the remote Web Services that can be invoked, and two buttons to start and stop the data mining task on the selected Web Service
15
Graphical user interface
• Through the GUI a user can either:
  • start the execution locally by using the standard Local pane
  • start the execution remotely by using the Remote pane
• Each task in the GUI is managed by an independent thread
• Therefore, a user can start multiple data mining tasks in parallel on different Web Services, taking full advantage of the distributed Grid environment
• As soon as the output of a data mining task has been received from a remote computing node, it is visualized in the standard Output pane
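The one-thread-per-task idea can be illustrated with a short Python sketch. It is not taken from Weka4WS: `run_remote_task` is a hypothetical stand-in that sleeps instead of contacting a Web Service, and the service names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_remote_task(service: str, algorithm: str) -> str:
    """Hypothetical stand-in for a remote invocation; sleeps instead of mining."""
    time.sleep(0.05)
    return f"{algorithm} finished on {service}"


services = ["node-a", "node-b", "node-c"]  # placeholder service names

# One worker thread per task, so the three tasks proceed in parallel.
with ThreadPoolExecutor(max_workers=len(services)) as pool:
    futures = [pool.submit(run_remote_task, s, "EM") for s in services]
    outputs = [f.result() for f in futures]
```

Because each task waits on its own thread, a slow computing node delays only its own result and not the tasks submitted to the other nodes.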
17
Web Services operations
• The first three operations are related to WSRF-specific invocation mechanisms
• The last three operations are used to request the execution of a specific data mining task

Category        Operation          Description
WSRF-specific   createResource     Creates a new WS-Resource.
WSRF-specific   subscribe          Subscribes to notifications about resource property changes.
WSRF-specific   destroy            Explicitly requests the destruction of a WS-Resource.
Data mining     classification     Submits the execution of a classification task.
Data mining     clustering         Submits the execution of a clustering task.
Data mining     associationRules   Submits the execution of an association rules task.
18
Web Services operations
• The classification operation provides access to the complete set of classifiers in the Weka Library (currently, 71 algorithms)
• The clustering and associationRules operations expose all the clustering and association rules algorithms provided by the Weka Library (5 and 2 algorithms, respectively)
• To improve concurrency the data mining operations are invoked asynchronously:
  • the client submits the execution in a non-blocking mode, and results are notified to the client as soon as they have been computed
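The non-blocking submission with a later notification can be sketched in Python as follows. This is a generic illustration of the pattern, not the WSRF notification mechanism itself; `submit` and `on_result` are hypothetical names, and the "task" is a trivial function returning a string.

```python
import threading
from typing import Callable, List


def submit(task: Callable[[], str], on_result: Callable[[str], None]) -> threading.Thread:
    """Start the task in the background and return immediately."""

    def worker() -> None:
        on_result(task())  # notify the client once the result is available

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


received: List[str] = []
done = threading.Event()


def on_result(model: str) -> None:
    received.append(model)
    done.set()


handle = submit(lambda: "clustering model", on_result)
# The caller is free to do other work here; it only waits when it needs the result.
done.wait(timeout=5)
handle.join()
```

The call to `submit` returns before the task completes, which is what lets a client keep several requests in flight at once.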
19
Input parameters of the DM operations
• Three parameters are required in the invocation of all the data mining operations: algorithm, arguments, and dataSet

Operation          Parameter       Description
classification     algorithm       Name of the classification algorithm
classification     arguments       Arguments to be passed to the algorithm
classification     testOptions     Options to be used during the testing phase
classification     classIndex      Index of the attribute to use as the class
classification     dataSet         URL of the dataset to be mined
clustering         algorithm       Name of the clustering algorithm
clustering         arguments       Algorithm arguments
clustering         testOptions     Testing phase options
clustering         selectedAttrs   Indexes of the selected attributes
clustering         classIndex      Index of the class with respect to which clusters are evaluated
clustering         dataSet         URL of the dataset to be mined
associationRules   algorithm       Name of the association rules algorithm
associationRules   arguments       Algorithm arguments
associationRules   dataSet         URL of the dataset to be mined
20
Input parameters of the DM operations
• The algorithm parameter specifies the name of the Java class in the Weka Library to be invoked
  • example: “weka.classifiers.trees.J48”
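Since the algorithm parameter is a class name passed as a string, the receiving side has to map that string to an implementation. The Python sketch below mimics this with a lookup table; it is an assumption about how such a lookup could look, not the Weka4WS code, and `J48Stub`, `REGISTRY`, and `resolve` are hypothetical names.

```python
class J48Stub:
    """Placeholder standing in for the J48 classifier; it does no learning."""

    name = "weka.classifiers.trees.J48"


# Hypothetical table of the algorithms a service is willing to run.
REGISTRY = {J48Stub.name: J48Stub}


def resolve(algorithm: str):
    """Return a new instance of the algorithm registered under this name."""
    try:
        return REGISTRY[algorithm]()
    except KeyError:
        raise ValueError(f"unknown algorithm: {algorithm}") from None


classifier = resolve("weka.classifiers.trees.J48")
```

An explicit table also means that an unknown or misspelled name fails with a clear error instead of loading an arbitrary class.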
21
Input parameters of the DM operations
• The arguments parameter specifies a sequence of arguments to be passed to the algorithm
  • example: “-C 0.25 -M 2”
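The arguments string follows the usual command-line option style, so it can be split into flag and value pairs before use. The helper below is a minimal Python sketch under that assumption; `parse_arguments` is a hypothetical name, and the simple rule used here would misread a negative number as a flag.

```python
import shlex


def parse_arguments(arguments: str) -> dict:
    """Split an option string into {flag: value} pairs (flags without a value map to None)."""
    tokens = shlex.split(arguments)
    options = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        # A token that follows a flag and does not start with "-" is taken as its value.
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        options[flag] = tokens[i + 1] if has_value else None
        i += 2 if has_value else 1
    return options


options = parse_arguments("-C 0.25 -M 2")
```

For the example string from the slide, `options` maps `-C` to `"0.25"` and `-M` to `"2"`; for J48 these flags are the pruning confidence and the minimum number of instances per leaf.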
22
Input parameters of the DM operations
• The dataSet parameter specifies the URL of the dataset to be mined
  • example: “gsiftp://hostname/path/file.arff”
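Putting the three common parameters together, a request could be assembled and sanity-checked as in the Python sketch below. The `build_request` helper and the dictionary layout are assumptions made for illustration; only the parameter names and example values come from the slides.

```python
from urllib.parse import urlparse


def build_request(algorithm: str, arguments: str, data_set: str) -> dict:
    """Assemble the three parameters common to all data mining operations."""
    url = urlparse(data_set)
    if not url.scheme or not url.netloc:
        raise ValueError(f"dataSet must be an absolute URL, got: {data_set!r}")
    return {"algorithm": algorithm, "arguments": arguments, "dataSet": data_set}


request = build_request(
    "weka.classifiers.trees.J48",
    "-C 0.25 -M 2",
    "gsiftp://hostname/path/file.arff",
)
parts = urlparse(request["dataSet"])
```

Here `parts.scheme` is `gsiftp`, which identifies the transfer protocol, while `parts.netloc` and `parts.path` give the host and the file location.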
23
Task execution steps
[Figure: the Client Module on the user node and the Web Service on the computing node, with the operations createResource, subscribe, destroy, clustering and deliver; also shown are the Weka Library, the GT4 RFT Service, the dataset, and a GridFTP server on each node]
• Scenario: the Client Module (CM) is requesting the execution of a clustering analysis on a dataset local to the user node
• To perform this task a sequence of steps needs to be executed
24
Task execution: Resource creation
Step 1: Resource creation. The CM invokes the createResource operation to create a new WS-Resource. A clustering model property is used to store the result of the clustering task. The WS returns the endpoint reference (EPR) of the created resource.
25
Task execution: Notification subscription
Step 2: Notification subscription. The CM invokes the subscribe operation, which subscribes to notifications about changes that will occur to the clustering model resource property.
26
Task execution: Task submission
Step 3: Task submission. The CM invokes the clustering operation to request the execution of the clustering task. The operation is invoked asynchronously.
27
Task execution: Dataset download
Step 4: Dataset download. The WS requests the download of the dataset to be mined from the URL specified in the clustering invocation. The download request is managed by the GT4 Reliable File Transfer (RFT) service.
28
Task execution: Data mining
Step 5: Data mining. The clustering analysis is started by invoking the appropriate Java class in the WL. The execution is handled within the WS-Resource created in Step 1, and the result of the computation is stored in the clustering model property.
29
Task execution: Results notification
Step 6: Results notification. As soon as the clustering model property changes, its new value is notified to the CM by invoking its implicit deliver operation.
30
Task execution: Resource destruction
Step 7: Resource destruction. The CM invokes the destroy operation, which explicitly destroys the WS-Resource created in Step 1.
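The seven steps can be traced end to end with a small in-memory simulation. The Python sketch below is purely illustrative: `WebServiceStub` and its methods are hypothetical, an integer plays the role of the EPR, no file is transferred and no mining is performed, and the clustering call runs synchronously for simplicity even though the real operation is invoked asynchronously.

```python
from typing import Callable, Dict, List, Optional


class WebServiceStub:
    """In-memory stand-in for the Weka4WS Web Service; all names are illustrative."""

    def __init__(self) -> None:
        self.resources: Dict[int, Optional[str]] = {}
        self.subscribers: Dict[int, List[Callable[[str], None]]] = {}
        self.next_id = 1

    def create_resource(self) -> int:
        # Step 1: create the resource and hand back its reference.
        epr = self.next_id
        self.next_id += 1
        self.resources[epr] = None
        self.subscribers[epr] = []
        return epr

    def subscribe(self, epr: int, deliver: Callable[[str], None]) -> None:
        # Step 2: register a callback for changes to the model property.
        self.subscribers[epr].append(deliver)

    def clustering(self, epr: int, algorithm: str, data_set: str) -> None:
        # Step 3: the task is submitted by calling this method.
        data = self._download(data_set)                  # Step 4
        model = f"{algorithm} model built from {data}"   # Step 5 (no real mining)
        self.resources[epr] = model
        for deliver in self.subscribers[epr]:            # Step 6
            deliver(model)

    def _download(self, data_set: str) -> str:
        # A real service would transfer the file; here we only fake its contents.
        return f"<contents of {data_set}>"

    def destroy(self, epr: int) -> None:
        # Step 7: release the resource.
        del self.resources[epr]
        del self.subscribers[epr]


ws = WebServiceStub()
notifications: List[str] = []

epr = ws.create_resource()
ws.subscribe(epr, notifications.append)
ws.clustering(epr, "EM", "gsiftp://hostname/path/file.arff")
ws.destroy(epr)
```

After the run, `notifications` holds the single model delivered in Step 6, and the resource created in Step 1 no longer exists on the stub.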
32
Performance analysis
• Goals:
  • evaluating the execution times of the different steps needed to perform a typical data mining task in different network scenarios
  • evaluating the efficiency of the WSRF mechanisms and Weka4WS as methods to execute distributed data mining services
• We used 10 datasets extracted from the census dataset available at the UCI repository:
  • number of instances: from 1700 to 17000
  • dataset size: from 0.5 to 5 MB
• Weka4WS has been used to perform a clustering analysis on each of these datasets:
  • algorithm used: Expectation Maximization (EM)
  • number of clusters to be identified: 10
33
Performance analysis
• The clustering analysis on each dataset was executed in two network scenarios:
  • LAG: the computing node and the user node are connected by a local area grid (Bw = 94.4 Mbps, RTT = 1.4 ms)
  • WAG: the computing node and the user node are connected by a wide area grid (Bw = 213 kbps, RTT = 19 ms)
• For each dataset size and network scenario we ran 20 independent executions:
  • the values reported in the following graphs are computed as an average of the values measured in the 20 executions
34
Execution times - LAG scenario
• This graph represents the execution times of the different steps of the clustering task in the LAG scenario for a dataset size ranging from 0.5 to 5 MB
[Figure: execution time (ms, logarithmic scale from 1.0E+02 to 1.0E+07) versus dataset size (MB) for resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction, and the total. Labels on the total curve:]

Dataset size (MB)   Total execution time (ms)
0.5                 1.12E+05
1.0                 2.12E+05
1.5                 3.11E+05
2.0                 4.16E+05
2.5                 5.19E+05
3.0                 6.25E+05
3.5                 7.26E+05
4.0                 8.30E+05
4.5                 9.24E+05
5.0                 1.03E+06
35
Execution times - LAG scenario
• The execution times of the WSRF-specific steps are independent of the dataset size:
  • resource creation (1698 ms, on average), notification subscription (275 ms), task submission (342 ms), results notification (1354 ms), and resource destruction (214 ms)
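Adding up the averages quoted above gives the fixed cost that the WSRF-specific steps contribute to every task in the LAG scenario. The short Python computation below uses only the figures from the slide; the variable names are arbitrary.

```python
# Average times of the WSRF-specific steps in the LAG scenario (ms), from the slide.
wsrf_steps_ms = {
    "resource creation": 1698,
    "notification subscription": 275,
    "task submission": 342,
    "results notification": 1354,
    "resource destruction": 214,
}

fixed_overhead_ms = sum(wsrf_steps_ms.values())
print(f"fixed overhead: {fixed_overhead_ms} ms")
```

The sum is 3883 ms, so these steps add a little under four seconds per task whatever the dataset size.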
36
Execution times - LAG scenario
• By contrast, the execution times of the dataset download and data mining steps grow with the dataset size:
  • Dataset download: 218 ms (0.5 MB) ... 665 ms (5 MB)
  • Data mining: 107474 ms (0.5 MB) ... 1026584 ms (5 MB)
37
Execution times - LAG scenario
• The total execution time ranges from 111798 ms for the dataset of 0.5 MB to 1031209 ms for the dataset of 5 MB
• The lines representing the total execution time and the data mining execution time appear coincident, because the data mining step takes from 96% to 99% of the total execution time
38
Execution times - WAG scenario
• The execution times of the WSRF-specific steps are similar to those measured in the LAG scenario
• The only significant difference is the execution time of the results notification step (2790 ms), due to the additional time needed to transfer the clustering model through a low-speed network
[Figure: execution time (ms, logarithmic scale from 1.0E+02 to 1.0E+07) versus dataset size (MB) for resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction, and the total. Labels on the total curve:]

Dataset size (MB)   Total execution time (ms)
0.5                 1.30E+05
1.0                 2.46E+05
1.5                 3.61E+05
2.0                 4.86E+05
2.5                 6.16E+05
3.0                 7.25E+05
3.5                 8.49E+05
4.0                 9.64E+05
4.5                 1.06E+06
5.0                 1.18E+06
39
Execution times - WAG scenario
• For the same reason, the transfer of the dataset to be mined requires an execution time significantly greater than the one measured in the LAG scenario:
  • Dataset download: 14638 ms (0.5 MB) ... 132463 ms (5 MB)
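To put the difference in perspective, the download times reported for the two scenarios can be compared directly. The Python snippet below only divides the figures quoted on the slides (218 ms and 665 ms for LAG, 14638 ms and 132463 ms for WAG); the variable names are arbitrary.

```python
# Dataset download times in ms, as reported for the two scenarios.
lag_download_ms = {0.5: 218, 5.0: 665}
wag_download_ms = {0.5: 14638, 5.0: 132463}

slowdown = {size: wag_download_ms[size] / lag_download_ms[size] for size in lag_download_ms}
for size, factor in slowdown.items():
    print(f"{size} MB: WAG download is about {factor:.0f}x slower than LAG")
```

The download takes roughly 67 times longer for the 0.5 MB dataset and roughly 199 times longer for the 5 MB dataset when moving from the local area grid to the wide area grid.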
40
Execution times percentage - LAG scenario
• This graph shows the percentage of the execution times of the data mining, dataset download, and the other steps (i.e., resource creation, notification subscription, task submission, results notification, resource destruction), with respect to the total execution time in the LAG scenario
[Figure: percentage of the total execution time (0 to 100) versus dataset size (MB), split into data mining, dataset download, and other steps. Labels on the data mining share:]

Dataset size (MB)   Data mining (% of total)
0.5                 96.13%
1.0                 98.02%
1.5                 98.71%
2.0                 99.02%
2.5                 99.21%
3.0                 99.32%
3.5                 99.40%
4.0                 99.48%
4.5                 99.52%
5.0                 99.55%
41
Execution times percentage - LAG scenario
• In the LAG scenario the data mining step represents from 96.13% to 99.55% of the total execution time, the dataset download ranges from 0.19% to 0.06%, and the other steps range from 3.67% to 0.38%
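These percentages can be reproduced from the absolute times reported earlier for the smallest and largest datasets. The Python sketch below uses only those figures; the `shares` helper and the dictionary layout are just one way to organize the arithmetic.

```python
# Times in ms for the smallest and largest datasets in the LAG scenario (from the slides).
measurements = {
    0.5: {"total": 111798, "data_mining": 107474, "download": 218},
    5.0: {"total": 1031209, "data_mining": 1026584, "download": 665},
}


def shares(m: dict) -> dict:
    """Percentage of the total taken by each group of steps, rounded to two decimals."""
    other = m["total"] - m["data_mining"] - m["download"]
    return {
        "data_mining": round(100 * m["data_mining"] / m["total"], 2),
        "download": round(100 * m["download"] / m["total"], 2),
        "other": round(100 * other / m["total"], 2),
    }


breakdown = {size: shares(m) for size, m in measurements.items()}
```

For 0.5 MB this gives 96.13% data mining, 0.19% download and 3.67% other steps; for 5 MB it gives 99.55%, 0.06% and 0.38%, matching the values on the slide.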
42
Execution times percentage - WAG scenario
• In the WAG scenario the data mining step represents from 84.62% to 88.32% of the total execution time, the dataset download ranges from 11.22% to 11.20%, while the other steps range from 4.16% to 0.48%
[Figure: percentage of the total execution time (0 to 100) versus dataset size (MB), split into data mining, dataset download, and other steps. Labels on the data mining share:]

Dataset size (MB)   Data mining (% of total)
0.5                 84.62%
1.0                 86.40%
1.5                 87.29%
2.0                 87.56%
2.5                 88.10%
3.0                 88.19%
3.5                 88.31%
4.0                 88.32%
4.5                 88.25%
5.0                 88.32%
43
Performance considerations
• The performance analysis demonstrates the efficiency of the WSRF mechanisms and Weka4WS as a means to execute data mining tasks on remote machines
• In the LAG scenario neither the dataset download nor the other steps represent a significant overhead with respect to the total execution time
• In the WAG scenario, by contrast, the dataset download is a critical step that can affect the overall execution time:
  • for this reason, the use of data replication mechanisms and high-performance file transfer protocols such as GridFTP can be of great importance
45
Conclusions (1)
• Gathering data in a central site raises several privacy considerations that generally prevent centralized mining of different multi-owner data sources.
• Privacy-preserving mining when data is distributed among many sites can be implemented through the cooperation of classifiers that learn global data mining results without revealing the data sources available at the single sites.
• Flexible middleware and services running in distributed environments can help users to follow this approach.
46
Conclusions (2)
• Weka4WS can be used to implement distributed data mining applications
• By exploiting such mechanisms, Weka4WS can provide an effective way to perform privacy-preserving distributed data analysis on large-scale Grids
• The experimental results demonstrate the efficiency of the WSRF mechanisms as a means to execute data mining tasks on remote resources
• A Weka4WS software prototype running under Globus Toolkit 4.0.1 will soon be made available to the research community
Thanks!