distributed clustering from data streams

Distributed Clustering for Smart Grids

Pedro Rodrigues, João Gama

University of Porto, Portugal

Project KDUS (PTDC/EIA-EIA/98355/2008)

4 September 2011

NGDM '11

Smart Grids

Smart Grids: monitoring information on top of the electrical grid

Internet-like communications layer

A shift in the way in which power grids are operated

Intelligent monitoring in real time

Interactive with consumers and markets

Optimized to make the best use of resources and equipment

Predictive rather than reactive

Distributed across geographical and organizational boundaries

Smart Grids and Data Mining

A smart grid forms a network (eventually decomposable) of distributed sources of high-speed data streams.

The dynamics of data are unknown:

the topology of network changes over time,

the number of meters tends to increase and

the context where the meter acts evolves over time.

Several data mining tasks are involved: prediction, cluster (profiling) analysis, event and anomaly detection, correlation analysis, etc.

All these characteristics constitute real challenges and opportunities for applied research in distributed data mining.

The requirements of near real-time analysis for multiple time horizons and multiple space aggregations make this analysis an even harder research challenge.

Outline

Rationale

Clustering distributed data streams

Local-to-Global Clustering of data sources

Sensors are usually small, low-cost devices capable of sensing some attribute and of communicating with other sensors.

Sensor networks can include thousands of sensors, each one being capable of measuring, analysing and transmitting a stream of data.

Resources are scarce, which reduces the possibilities for heavy computation, and sensors operate under limited bandwidth.

Rationale Sensor Networks

Comprehension

Extract information about global interaction between sources by looking at the data they produce.

When no other information is available, usual knowledge discovery approaches are based on unsupervised techniques (e.g. clustering).

However, two different stream clustering problems exist:

clustering streaming data points (e.g. meter readings)

clustering streaming data sources (e.g. meters)

Rationale Comprehension of Ubiquitous Data Streams

Information about dense regions of the sensor data space.

[Figure: data points grouped into Cluster A, Cluster B and Cluster C]

Rationale Comprehension by Clustering Data Points

Information about groups of sensors that behave similarly over time.

Possible scenario

Sensors collecting electricity demand data from different homes, exploring similar consumption patterns.


Rationale Comprehension by Clustering Data Sources

Setting

Sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data)

Objective

To keep a clustering of the observations that are created by aggregating each node's data as a feature in a centralized stream.


DGClust Setting and Objective

Problems

high-speed data streams → excessive storage and processing

widely spread network → heavy communication

centralized clustering → high dimensionality

dynamic data → outdated models

Research Question

Do local discretization and representative clustering improve validity, communication and computation loads when applied to distributed sensor data streams?

DGClust Problems and Research Question

DGClust – Distributed Grid Clustering (Local Step)

Each sensor keeps an online ordinal discretization of its data.

Partition Incremental Discretization

[Figure: a sensor's current state on its discretization grid, e.g. "low" or "D"]

DGClust Methodology : Local Step
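
The local step can be illustrated with a minimal sketch. It assumes a simplified equal-width grid over a known value range, which is a stand-in for Partition Incremental Discretization and not the method itself; the class name `LocalDiscretizer`, the range bounds and the number of bins are hypothetical.

```python
class LocalDiscretizer:
    """Simplified online ordinal discretizer (equal-width bins, assumed range)."""

    def __init__(self, low, high, n_bins):
        self.low, self.high, self.n_bins = low, high, n_bins
        self.state = None  # current ordinal state (bin index), unknown at start

    def update(self, x):
        """Map reading x to a bin; return the new state if it changed, else None."""
        width = (self.high - self.low) / self.n_bins
        idx = int((x - self.low) / width)
        idx = max(0, min(self.n_bins - 1, idx))  # clamp out-of-range readings
        if idx != self.state:
            self.state = idx
            return idx
        return None


sensor = LocalDiscretizer(low=0.0, high=100.0, n_bins=4)
# Only three of the six readings move the sensor to a different state.
changes = [sensor.update(x) for x in [3.0, 10.0, 24.9, 60.0, 61.0, 99.0]]
```

A reading that stays in the same bin produces `None`, which is the case where the sensor would have nothing new to report.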

DGClust – Distributed Grid Clustering (Aggregating Step)

The central server gathers the global state of the network.

Sensors whose state has not changed since the last communication do not transmit to the server.

[Figure: sensors report their current states (low, high, A, B, D) and the server combines them into a global state]

DGClust Methodology : Aggregating Step
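
As a rough illustration of the aggregating step, the sketch below keeps the last known state of every sensor at the server and counts a message only when a state changes. The names `CentralServer` and `run_round`, and the sensor identifiers, are hypothetical and chosen for this example only.

```python
class CentralServer:
    """Keeps the last state reported by each sensor (illustrative only)."""

    def __init__(self, sensor_ids):
        self.last_state = {sid: None for sid in sensor_ids}
        self.messages = 0  # number of transmissions received

    def receive(self, sensor_id, state):
        self.last_state[sensor_id] = state
        self.messages += 1

    def global_state(self):
        # The global state is the tuple of all sensors' last known states.
        return tuple(self.last_state[sid] for sid in sorted(self.last_state))


def run_round(server, previous, current):
    """Transmit only the states that differ from the previous round."""
    for sid, state in current.items():
        if previous.get(sid) != state:
            server.receive(sid, state)
    return dict(current)


server = CentralServer(["s1", "s2", "s3"])
sent = run_round(server, {}, {"s1": "low", "s2": "high", "s3": "low"})
sent = run_round(server, sent, {"s1": "low", "s2": "high", "s3": "high"})
```

In the second round only `s3` changes, so the server receives one message instead of three while still holding a complete global state.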

DGClust – Distributed Grid Clustering (Representative Step)

Server keeps a small list of the most frequent global states.

Space-Saving Frequent Items Monitoring

[Figure: stream of global states feeding a table of the most frequent global states with their counts (#): 523, 334, 89, ...]

DGClust Methodology : Representative Step
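
The following is a minimal sketch of Space-Saving frequent-item monitoring applied to global states. The class name `SpaceSaving`, the capacity and the example states are assumptions for illustration; the slides do not specify the list size used.

```python
class SpaceSaving:
    """Bounded list of counters approximating the most frequent items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # monitored item -> (possibly overestimated) count

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Evict the item with the smallest count; the newcomer inherits
            # that count plus one, so counts never underestimate.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, n):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]


monitor = SpaceSaving(capacity=2)
for state in [("low", "high"), ("low", "high"), ("high", "low"),
              ("low", "low"), ("low", "high")]:
    monitor.add(state)
```

Memory stays bounded by `capacity` regardless of how many distinct global states appear, at the cost of overestimating the counts of recently admitted states.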

DGClust – Distributed Grid Clustering (Clustering Step)

Server applies partitional clustering to the most frequent states.

Furthest Point Clustering + Online Adaptive K-Means

DGClust Methodology : Clustering Step
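
A possible reading of the clustering step is sketched below: furthest-point selection picks initial centers among the frequent states, and a plain sequential k-means update then adapts the nearest center as new states arrive. The sequential update is a simple stand-in for the online adaptive k-means named above, and the integer-coded states and function names are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def furthest_point_centers(points, k):
    """Start from the first point, then repeatedly add the point that is
    furthest from its nearest already-chosen center."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(nxt)
    return [list(c) for c in centers]


def online_update(centers, counts, point):
    """Sequential k-means step: move the nearest center toward the point."""
    j = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    counts[j] += 1
    step = 1.0 / counts[j]
    centers[j] = [c + step * (x - c) for c, x in zip(centers[j], point)]
    return j


# Frequent global states of two sensors, coded as ordinal bin indices.
frequent_states = [(0, 0), (0, 1), (3, 3), (3, 2), (0, 3)]
centers = furthest_point_centers(frequent_states, k=3)
counts = [1] * len(centers)
moved = online_update(centers, counts, (1, 0))
```

Clustering only the short list of frequent states keeps the number of points small, whatever the length of the stream.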

DGClust Example (k=5) Varying Resources

Quality of results does not depend on the number of sensors.

Communication reduction is constant with any number of sensors (as long as a direct link to the server exists).

higher discretization granularity → higher clustering quality, but lower communication reduction

higher number of sensors → more clustering updates

DGClust Main Findings

Setting

Sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data)

Objective

To keep, at each node, a clustering of the entire network of sensors.


L2GClust Setting and Objective

Each sensor keeps a sketch of its most recent data.

The common approach to focusing on recent data is sliding windows.

Even within the sliding window, the most recent data point is usually more important than the oldest one, which is about to be discarded.

In ubiquitous streaming data sources, such as sensor networks, resources like memory and processing power are scarce.

Sometimes there is not even enough memory to store all the data points inside the window.

Memoryless α-fading average

[Figure: example local sketch value 10.2]

L2GClust Methodology : Local Sketch
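
One common way to obtain a memoryless fading average is to keep a fading sum and a fading count, both multiplied by α at every new observation. The sketch below assumes that formulation; the slides do not give the exact update rule, and the class name `FadingAverage` and the value α = 0.5 are illustrative.

```python
class FadingAverage:
    """Memoryless alpha-fading average: two numbers of state, no window."""

    def __init__(self, alpha):
        self.alpha = alpha  # 0 < alpha <= 1; smaller values forget faster
        self.s = 0.0        # fading sum
        self.n = 0.0        # fading count

    def update(self, x):
        self.s = x + self.alpha * self.s
        self.n = 1.0 + self.alpha * self.n
        return self.s / self.n


sketch = FadingAverage(alpha=0.5)
values = [sketch.update(x) for x in [10.0, 10.0, 20.0]]
```

With α = 1 this reduces to the ordinary running mean; with α < 1 the latest reading weighs more than older ones, which matches the preference for recent data stated above.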

[Figure: network of 14 sensors with local sketch values 1, 2, 10, 99, 95, 11, 10, 100, 3, 10, 2, 12, 5, 10]

L2GClust Example : Local Clustering

Centroids {6.9, 98.0}

[Figure: the same 14 sensor values as on the previous slide]


This estimate is computed by clustering the centroids of direct neighbors’ estimates of the global clustering.

Furthest Point Clustering

Basically, each node performs an ensemble of clusterings from its direct neighbors.

Instead of broadcasting the sketch of its own data, each node broadcasts its estimate of the global clustering.

L2GClust Methodology : Local Clustering
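
A minimal sketch of one node's update is given below, assuming univariate sketches. The node pools its own sketch with the centroids received from its direct neighbors, seeds k groups by furthest-point selection and averages each group. The function name `local_estimate`, the averaging step and the node's own sketch value of 10.0 are assumptions, so the output is not meant to reproduce the figures.

```python
def local_estimate(own_sketch, neighbor_estimates, k):
    """Cluster own sketch plus neighbors' centroids into k centroids (1-D)."""
    pool = [own_sketch] + [c for est in neighbor_estimates for c in est]
    # Furthest-point seeding: each new seed is the pooled value that is
    # furthest from its nearest existing seed.
    seeds = [pool[0]]
    while len(seeds) < min(k, len(set(pool))):
        seeds.append(max(pool, key=lambda v: min(abs(v - s) for s in seeds)))
    # Assign every pooled value to its nearest seed and average each group.
    groups = {s: [] for s in seeds}
    for v in pool:
        groups[min(seeds, key=lambda s: abs(v - s))].append(v)
    return sorted(sum(g) / len(g) for g in groups.values())


# Neighbor estimates as they appear on the slides; own sketch is hypothetical.
neighbors = [[7.71, 97.1], [10.59, 97.38], [5.10, 95.00]]
estimate = local_estimate(10.0, neighbors, k=2)
```

Each node only ever clusters a handful of centroids from its direct neighbors, which keeps the local problem small and avoids forwarding raw data.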

Centroids {6.9, 98.0}

[Figure: network with the values 88.07, 88.06, 2.80, 1.21, 3.58, 3.74, 87.37, 4.19, 88.03, 3.50, 88.12, 86.31, 2.41, 88.06 shown at the nodes; estimates shown: {7.71, 97.1}, {10.59, 97.38}, {5.10, 95.00}]

Centroids {6.9, 98.0}

[Figure: the same network and node values; estimate shown: {10.36, 97.1}]

Comparison was performed with the same strategy executed at a central server with access to all data.

The measured outcome was the agreement between a node's clustering estimate and the centralized clustering, averaged over all nodes.

Kappa statistic → cluster sanity

Proportion of agreement → cluster validity

K = (P(A) - P(e)) / (1 - P(e)), where P(A) is the observed proportion of agreement and P(e) is the agreement expected by chance.

State-of-the-art Simulator

Each sensor in the simulation (Visual Sense) generates a Gaussian stream with mean from one of the predefined Gaussian clusters.

Evaluated parameters were number of clusters, network size, and cluster overlap.

L2GClust Evaluation Summary
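
The two measures can be computed as in the sketch below, which assumes that the cluster labels of a node's estimate and of the centralized clustering have already been matched to each other. The function name `agreement_and_kappa` and the label vectors are made up for illustration and are not results from the evaluation.

```python
from collections import Counter


def agreement_and_kappa(labels_a, labels_b):
    """Return (P(A), K) for two label assignments over the same items."""
    n = len(labels_a)
    # P(A): observed proportion of items given the same label by both.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P(e): agreement expected by chance from the two label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Undefined when p_e == 1 (both assignments use one identical label).
    return p_a, (p_a - p_e) / (1 - p_e)


node = [0, 0, 1, 1, 1, 0, 1, 0]     # hypothetical node estimate
central = [0, 0, 1, 1, 0, 0, 1, 1]  # hypothetical centralized clustering
p_agree, kappa = agreement_and_kappa(node, central)
```

Kappa discounts the agreement that two assignments would reach by chance, which is why it serves as a sanity check alongside the raw proportion of agreement.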

L2GClust Results

Average proportion of agreement converges (with small fluctuations).

L2GClust Results

Sanity was confirmed, with the Kappa statistic always above 0.58.

L2GClust Results

Real data from electricity demand sensors showed the ability to improve with examples.

Local sketch yields:

memoryless storage of summaries;

a straightforward adaptation to most recent data;

a reduction of the system's sensitivity to uncertainty;

Local clustering with direct neighbors yields:

no forwarding of information (reduced communication);

low dimensionality of the clustering problem;

sensitive information better preserved.

Future Work

Evaluate L2GClust on smart grid sensor networks.

L2GClust Main Properties

Thank you!
