weka4ws : enabling distributed data mining on grids

Venezia, October 17-18, 2005

1

Weka4WS: Enabling Distributed Data

Mining on Grids

Domenico Talia, Paolo Trunfio, Oreste [email protected]

DEIS

University of Calabria

Rende, Italy

2

Outline

� Goal of this work

� Weka4WS, WSRF and GT4

� Weka4WS architecture

� Software components

� Graphical user interface

� Web Services operations

� Execution mechanisms

� Performance analysis

� Conclusions

3

Goal of this work

� Computational Grids emerged in the last years as

effective infrastructures for distributed high-performance

computing and data processing

� By exploiting a service-oriented approach, knowledge

discovery applications can be developed on Grids to

deliver high performance and manage data and

knowledge distribution

� The goal of this work is to extend the Weka toolkit for

supporting distributed data mining through the use of

standard Grid technologies, such as the emerging Web

Services Resource Framework (WSRF) and the Globus

Toolkit 4

4

The Weka4WS framework

� Weka provides a large collection of machine learning

algorithms for data pre-processing, classification,

clustering, association rules, and visualization, which can

be used through a common GUI

� In Weka, the overall data mining process takes place on

a single machine, since the algorithms can be executed

only locally

� The goal of Weka4WS is to extend Weka to support

remote execution of Weka data mining algorithms

5

The Weka4WS framework

� In Weka4WS the data mining algorithms for

classification, clustering and association rules can be

executed on remote Grid resources to implement

distributed data mining and speedup applications.

� To enable remote invocation, each data mining

algorithm provided by the Weka library are

exposed as a Web Service, which can be easily

deployed on the available Grid nodes

� To achieve integration and interoperability with standard

Grid environments, Weka4WS has been designed and

developed by using the emerging Web Services Resource

Framework (WSRF) as enabling technology

6

WSRF and Globus Toolkit 4

� WSRF is a family of technical specifications concerned

with the creation, addressing, inspection, and lifetime

management of stateful resources

� The framework codifies the relationship between Web

Services and stateful resources, which are termed as

WS-Resource in the WSRF language

� The Globus Alliance recently released the Globus Toolkit

4 (GT4), which provides an open source implementation

of the WSRF library and incorporates Web Services

implemented according to the WSRF specifications

� Weka4WS has been developed by using the Java WSRF

library provided by GT4

7

Outline









� Conclusions

8

Weka4WS architecture

� In Weka4WS all nodes use GT4 services for standard Grid

functionalities, such as security and data management

� We distinguish those nodes in two categories:

� user nodes, which are the local machines of the users

providing the Weka4WS client software

� computing nodes, which provide the Weka4WS Web

Services allowing the execution of remote data mining tasks

� Data can be located on computing nodes, user nodes, or

third-party nodes

� If the dataset to be mined is not locally available on a

computing node, it can be downloaded or replicated by

means of the GT4 data management services

9

Software components

� User nodes include three software components:

� Graphical User Interface (GUI)

� Client Module (CM)

� Weka Library (WL)

GraphicalUser Interface

WekaLibrary

User node

ClientModule

GT4 Services

WebService

WekaLibrary

GT4 Services

Computing node

10

Software components

� Computing nodes include two software components:

� Web Service (WS)

� Weka Library (WL)


WekaLibrary

User node

ClientModule

GT4 Services

WebService

WekaLibrary

GT4 Services

Computing node

11

Software components

� The GUI extends the Weka Explorer environment to allow

the execution of both local and remote data mining tasks:

� local tasks are executed by directly invoking the local WL

� remote tasks are executed through the CM, which operates

as an intermediary between the GUI and Web Services on

remote computing nodes


WekaLibrary

User node

ClientModule

GT4 Services

WebService

WekaLibrary

GT4 Services

Computing node

local task

remotetask

12

Software components

� The WS is a WSRF-compliant Web Service that exposes

the data mining algorithms provided by the underlying

WL

� Therefore, requests to the WS are executed by invoking

the corresponding WL algorithms


WekaLibrary

User node

ClientModule

GT4 Services

WebService

WekaLibrary

GT4 Services

Computing node

Algorithminvocation

13

Graphical user interface

� A “Remote pane” has been added to the original Weka

Explorer environment

14


� This pane provides a list of the remote Web Services that

can be invoked, and two buttons to start and stop the

data mining task on the selected Web Service

15


� Through the GUI a user can both:

� start the execution locally by using the standard Local pane

� start the execution remotely by using the Remote pane

� Each task in the GUI is managed by an independent

thread

� Therefore, a user can start multiple data mining tasks in

parallel on different Web Services, this way taking full

advantage of the distributed Grid environment

� Whenever the output of a data mining task has been

received from a remote computing node, it is visualized

in the standard Output pane

16

Outline









� Conclusions

17

Web Services operations

� The first three operations are related to WSRF-specific

invocation mechanisms

� The last three operations are used to require the

execution of a specific data mining task

Subscribes to notifications about resource

properties changes.subscribe

Submits the execution of an association rules

task.associationRules

Submits the execution of a clustering task.clustering

Submits the execution of a classification task.classification

Explicitly requests the destruction of a WS-

Resource.destroy

Creates a new WS-Resource.createResource

WSRFspecific

Datamining

18

Web Services operations

� The classification operation provides access to the

complete set of classifiers in the Weka Library (currently,

71 algorithms)

� The clustering and associationRules operations expose all

the clustering and association rules algorithms provided

by the Weka Library (5 and 2 algorithms, respectively)

� To improve concurrency the data mining operations are

invoked in an asynchronous way:

� the client submits the execution in a non-blocking mode,

and results are notified to the client whenever they have

been computed

19

Input parameters of the DM operations

� Three parameters are required in the invocation of all

the data mining operations: algorithm, arguments, and

dataSet

URL of the dataset to be mineddataSet

Algorithm argumentsarguments

Name of the association rules algorithmalgorithm

associationRules


Index of the class w.r.t. evaluate clustersclassIndex

Indexes of the selected attributesselectedAttrs

Testing phase optionstestOptions


Name of the clustering algorithmalgorithm

clustering


Index of the attribute to use as the classclassIndex

Options to be used during the testing phasetestOptions

Arguments to be passed to the algorithmarguments

Name of the classification algorithmalgorithm

classification

20


� The algorithm parameter specifies the name of the Java

class in the Weka Library to be invoked

� example: “weka.classifiers.trees.J48”




associationRules







clustering






classification

21


� The arguments parameter specifies a sequence of

arguments to be passed to the algorithm

� example: “-C 0.25 -M 2”




associationRules







clustering






classification

22


� The dataSet parameter specifies the URL of the dataset

to be mined

� example: “gsiftp://hostname/path/file.arff”




associationRules







clustering






classification

23

Task execution steps

Web Service

ClientModule

createResource

subscribe

destroy

clustering

deliver

DatasetWeka

Library

GT4 RFT Service

User nodeComputing node

GridFTP ServerGridFTP Server

� Scenario: the Client Module (CM) is requesting the execution

of a clustering analysis on a dataset local to the user node

� To perform this task a sequence of steps need to be executed

24

Task execution: Resource creation

Web Service

ClientModule

createResource

subscribe

destroy

clustering

deliver

Dataset

WS-Resource

Clusteringmodel

1 1

WekaLibrary

GT4 RFT Service



Resource creation. The CM invokes the createResource

operation to create a new WS-Resource. A clustering model

property is used to store the result of the clustering task. The

WS returns the EPR of the created resource

1

25

Task execution: Notification subscription

Web Service

ClientModule

createResource

subscribe

destroy

clustering

deliver

Dataset

WS-Resource

Clusteringmodel

1 1

2 2

WekaLibrary

GT4 RFT Service



Notification subscription. The CM invokes the subscribe

operation, which subscribes to notifications about changes

that will occur to the clustering model resource property

2

26

Task execution: Task submission

Web Service

ClientModule

createResource

subscribe

destroy

clustering

deliver

Dataset

WS-Resource

Clusteringmodel

1 1

2 2

3 3

WekaLibrary

GT4 RFT Service



Task submission. The CM invokes the clustering operation

to require the execution of the clustering task. The operation

is invoked in an asynchronous way

3

27

Task execution: Dataset download

Web Service

ClientModule

createResource

subscribe

destroy

clustering

deliver

Dataset

WS-Resource

Clusteringmodel

1

Dataset

1

2 2

3 3

WekaLibrary

GT4 RFT Service


44

4GridFTP ServerGridFTP Server

Dataset download. The WS requests to download the

dataset to be mined from the URL specified in the clustering

invocation. The download request is managed by the GT4

Reliable File Transfer (RFT) service

4

28

Task execution: Data mining

Web Service

ClientModule

createResource

subscribe

destroy

clustering

deliver

Dataset

WS-Resource

Clusteringmodel

1

Dataset

1

2 2

3 3

WekaLibrary

GT4 RFT Service


44

55


Data mining. The clustering analysis is started by invoking

the appropriate Java class in the WL. The execution is handled

within the WS-Resource created on Step 1, and the result of

the computation is stored in the clustering model property

5

29

Task execution: Results notification

Web Service

ClientModule

createResource

subscribe

destroy

clustering

deliver

Dataset

WS-Resource

Clusteringmodel

1

Dataset

1

2 2

3 3

WekaLibrary

6

GT4 RFT Service


44

55


Results notification. Whenever the clustering model

property has been changed, its new value is notified to the

CM, by invoking its implicit deliver operation

6

30

Task execution: Resource destruction

Web Service

ClientModule

createResource

subscribe

destroy

clustering

deliver

Dataset

WS-Resource

Clusteringmodel

1

Dataset

1

2 2

3 3

WekaLibrary

6

7 7

GT4 RFT Service


44

55


Resource destruction. The CM invokes the destroy

operation, which explicitly destroys the WS-Resource created

on Step 1

7

31

Outline









� Conclusions

32

Performance analysis

� Goals:

� evaluating the execution times of the different steps needed to

perform a typical data mining task in different network scenarios

� evaluating the efficiency of the WSRF mechanisms and Weka4WS

as methods to execute distributed data mining services

� We used 10 datasets extracted from the census dataset

available at the UCI repository:

� number of instances: from 1700 to 17000

� dataset size: from 0.5 to 5 MB

� Weka4WS has been used to perform a clustering analysis

on each of these datasets:

� algorithm used: Expectation Maximization (EM)

� number of clusters to be identified: 10

33

Performance analysis

� The clustering analysis on each dataset was executed in

two network scenarios:

� LAG: the computing node and the user node are connected

by a local area grid (Bw = 94.4 Mbps, RTT = 1.4 ms)

� WAG: the computing node and the user node are

connected by a wide area grid (Bw = 213 kbps, RTT = 19

ms)

� For each dataset size and network scenario we run 20

independent executions:

� the values reported in the following graphs are computed

as an average of the values measured in the 20 executions

34

Execution times - LAG scenario

1.1

2E

+05

2.1

2E

+05

3.1

1E

+05

4.1

6E

+05

5.1

9E

+05

6.2

5E

+05

7.2

6E

+05

8.3

0E

+05

9.2

4E

+05

1.0

3E

+06

1.0E+02

1.0E+03

1.0E+04

1.0E+05

1.0E+06

1.0E+07

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Dataset size (MB)

Execution t

ime (

ms)

Resource creation

Notification subscription

Task submission

Dataset download

Data mining

Results notification

Resource destruction

Total

Dataset size (MB)

Execution time (ms)

� This graph represents the execution times of the different

steps of the clustering task in the LAG scenario for a dataset

size ranging from 0.5 to 5 MB

35


� The execution times of the WSRF-specific steps are

independent from the dataset size:

� resource creation (1698 ms, on the average), notification

subscription (275 ms), task submission (342 ms), results

notification (1354 ms), and resource destruction (214 ms)

1.1

2E

+05

2.1

2E

+05

3.1

1E

+05

4.1

6E

+05

5.1

9E

+05

6.2

5E

+05

7.2

6E

+05

8.3

0E

+05

9.2

4E

+05

1.0

3E

+06

1.0E+02

1.0E+03

1.0E+04

1.0E+05

1.0E+06

1.0E+07

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Dataset size (MB)

Execution t

ime (

ms)

Resource creation


Task submission

Dataset download

Data mining



Total

Dataset size (MB)

Execution time (ms)

36


� On the contrary, the execution times of the dataset download

and data mining steps are proportional to the dataset size:

� Dataset download: 218 ms (0.5 MB) ... 665 ms (5 MB)

� Data mining: 107474 ms (0.5 MB) ... 1026584 ms (5 MB)

1.1

2E

+05

2.1

2E

+05

3.1

1E

+05

4.1

6E

+05

5.1

9E

+05

6.2

5E

+05

7.2

6E

+05

8.3

0E

+05

9.2

4E

+05

1.0

3E

+06

1.0E+02

1.0E+03

1.0E+04

1.0E+05

1.0E+06

1.0E+07

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Dataset size (MB)

Execution t

ime (

ms)

Resource creation


Task submission

Dataset download

Data mining



Total

Dataset size (MB)

Execution time (ms)

dataset download

data mining

37


� The total execution time ranges from 111798 ms for the

dataset of 0.5 MB, to 1031209 ms for the dataset of 5 MB

� The lines representing the total execution time and the data

mining execution time appear coincident, because the data mining

step takes from 96% to 99% of the total execution time

1.1

2E

+05

2.1

2E

+05

3.1

1E

+05

4.1

6E

+05

5.1

9E

+05

6.2

5E

+05

7.2

6E

+05

8.3

0E

+05

9.2

4E

+05

1.0

3E

+06

1.0E+02

1.0E+03

1.0E+04

1.0E+05

1.0E+06

1.0E+07

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Dataset size (MB)

Execution t

ime (

ms)

Resource creation


Task submission

Dataset download

Data mining



Total

Dataset size (MB)

Execution time (ms)

total time

38

Execution times - WAG scenario

� The execution times of the WSRF-specific steps are similar to

those measured in the LAG scenario

� The only significant difference is the execution time of the results

notification step (2790 ms), due to additional time needed to

transfer the clustering model through a low-speed network

1.3

0E

+05

2.4

6E

+05

3.6

1E

+05

4.8

6E

+05

6.1

6E

+05

7.2

5E

+05

8.4

9E

+05

9.6

4E

+05

1.0

6E

+06

1.1

8E

+06

1.0E+02

1.0E+03

1.0E+04

1.0E+05

1.0E+06

1.0E+07

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Dataset size (MB)

Execution t

ime (

ms)

Resource creation


Task submission

Dataset download

Data mining



Total

Dataset size (MB)

Execution time (ms)

39

Execution times - WAG scenario

� For the same reason, the transfer of the dataset to be mined

requires an execution time significantly greater than the one

measured in the LAG scenario:

� Dataset download: 14638 ms (0.5 MB) ... 132463 ms (5 MB)

1.3

0E

+05

2.4

6E

+05

3.6

1E

+05

4.8

6E

+05

6.1

6E

+05

7.2

5E

+05

8.4

9E

+05

9.6

4E

+05

1.0

6E

+06

1.1

8E

+06

1.0E+02

1.0E+03

1.0E+04

1.0E+05

1.0E+06

1.0E+07

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Dataset size (MB)

Execution t

ime (

ms)

Resource creation


Task submission

Dataset download

Data mining



Total

Dataset size (MB)

Execution time (ms)

40

Execution times percentage - LAG scenario

� This graph shows the percentage of the execution times of the

data mining, dataset download, and the other steps (i.e.,

resource creation, notification subscription, task submission,

results notification, resource destruction), w.r.t. the total

execution time in the LAG scenario

96.1

3%

98.0

2%

98.7

1%

99.0

2%

99.2

1%

99.3

2%

99.4

0%

99.4

8%

99.5

2%

99.5

5%

0

10

20

30

40

50

60

70

80

90

100

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Dataset size (MB)

% o

f th

e t

ota

l execution t

ime

Other steps

Dataset download

Data mining

Dataset size (MB)

% of the total execution time

41

Execution times percentage - LAG scenario

� In the LAG scenario the data mining step represents from

96.13% to 99.55% of the total execution time, the dataset

download ranges from 0.19% to 0.06%, and the other steps

range from 3.67% to 0.38%

96.1

3%

98.0

2%

98.7

1%

99.0

2%

99.2

1%

99.3

2%

99.4

0%

99.4

8%

99.5

2%

99.5

5%

0

10

20

30

40

50

60

70

80

90

100

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Dataset size (MB)

% o

f th

e t

ota

l execution t

ime

Other steps

Dataset download

Data mining

Dataset size (MB)


42

Execution times percentage - WAG scenario

� In the WAG scenario the data mining step represents from

84.62% to 88.32% of the total execution time, the dataset

download ranges from 11.22% to 11.20%, while the other

steps range from 4.16% to 0.48%

88.3

2%

88.2

5%

88.3

2%

88.3

1%

88.1

9%

88.1

0%

87.5

6%

87.2

9%

86.4

0%

84.6

2%

0

10

20

30

40

50

60

70

80

90

100

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Dataset size (MB)

% o

f th

e t

ota

l execution t

ime

Other steps

Dataset download

Data mining

Dataset size (MB)


43

Performance considerations

� The performance analysis demonstrates the efficiency of

the WSRF mechanisms and Weka4WS as a means to

execute data mining tasks on remote machines

� In the LAN scenario neither the dataset download nor

the other steps represent a significant overhead with

respect to the total execution time

� In the WAN scenario, on the contrary, the dataset

download is a critical step that can affect the overall

execution time:

� for this reason, the use of data replication mechanisms and

high-performance file transfer protocols such as GridFTP

can be of great importance

44

Outline









� Conclusions

45

Conclusions (1)

� Gathering data in a central site raises several privacy

considerations that generally prevent centralized mining

of different multi-owner data sources.

� Privacy-preserving mining when data is distributed

among many sites can be implemented through the

cooperation of classifiers that learn global data mining

results without revealing the data sources available at

the single sites.

� Flexible middleware and services running in distributed

environments can help users to follow this approach.

46

Conclusions (2)

� Weka4WS can be used to implement distributed data

mining applications

� By exploiting such mechanisms, Weka4WS can provide

an effective way to perform privacy-preserving

distributed data analysis on large-scale Grids

� The experimental results demonstrate the efficiency of

the WSRF mechanisms as a means to execute data

mining tasks on remote resources

� A Weka4WS software prototype running under Globus

Toolkit 4.0.1 will be soon made available to the research

community

Thanks!THANKS

weka4ws : enabling distributed data mining on grids

Documents