Distributed clustering from data streams

Distributed Clustering for Smart Grids Pedro Rodrigues, João Gama University of Porto, Portugal Project KDUS (PTDC/EIA-EIA/98355/2008) 4 September 2011 NGDM '11


DESCRIPTION

Simple objects that surround us are gaining sensors, computational power, and actuators, and are changing from static systems into adaptive and reactive ones. In this talk we discuss issues for knowledge discovery from distributed data streams generated by sensors with limited computational resources. We present two clustering algorithms for two different tasks: clustering streaming data, which searches for dense regions of the data space, and clustering streaming data sources, which finds groups of sources that behave similarly over time. In the first setting, a cluster is a set of data points. In the second setting, a cluster is a set of sensors. We conclude the talk by presenting the lessons learned.

TRANSCRIPT

Page 1: Distributed clustering from data streams

Distributed Clustering for Smart Grids

Pedro Rodrigues, João Gama

University of Porto, Portugal

Project KDUS (PTDC/EIA-EIA/98355/2008)

4 September 2011

NGDM '11

Page 2: Distributed clustering from data streams


Smart Grids

Smart grids: monitoring information on top of the electrical grid

Internet-like communications layer

A shift in the way in which power grids are operated

Intelligent monitoring in real time

Interactive with consumers and markets

Optimized to make the best use of resources and equipment

Predictive rather than reactive

Distributed across geographical and organizational boundaries

Page 3: Distributed clustering from data streams


Smart Grids and Data Mining

A smart grid forms a (possibly decomposable) network of distributed sources of high-speed data streams.

The dynamics of data are unknown:

the topology of network changes over time,

the number of meters tends to increase and

the context where the meter acts evolves over time.

Several data mining tasks are involved: prediction, cluster (profiling) analysis, event and anomaly detection, correlation analysis, etc.

All these characteristics constitute real challenges and opportunities for applied research in distributed data mining.

The requirements of near real-time analysis for multiple time horizons and multiple space aggregations make these analyses an even harder research challenge.

Page 4: Distributed clustering from data streams


Outline

Rationale

Clustering distributed data streams

Local-to-Global Clustering of data sources

Page 5: Distributed clustering from data streams


Sensors are usually small, low-cost devices capable of sensing some attribute and of communicating with other sensors.

Sensor networks can include thousands of sensors, each one being capable of measuring, analysing and transmitting a stream of data.

Resources are scarce, which reduces the possibilities for heavy computation, while operating under limited bandwidth.

Rationale Sensor Networks

Page 6: Distributed clustering from data streams


Comprehension

Extract information about global interaction between sources by looking at the data they produce.

When no other information is available, usual knowledge discovery approaches are based on unsupervised techniques (e.g. clustering).

However, two different stream clustering problems exist:

clustering streaming data points (e.g. meter readings)

clustering streaming data sources (e.g. meters)

Rationale Comprehension of Ubiquitous Data Streams

Page 7: Distributed clustering from data streams


Information about dense regions of the sensor data space.

Cluster A Cluster B Cluster C

Rationale Comprehension by Clustering Data Points

Page 8: Distributed clustering from data streams


Information about groups of sensors that behave similarly over time.

Possible scenario

Sensors collecting electricity demand data from different homes, exploring similar consumption patterns.

Cluster A Cluster B Cluster C

Rationale Comprehension by Clustering Data Sources

Page 9: Distributed clustering from data streams


Setting

Sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data)

Objective

To keep a clustering of the observations that are created by aggregating each node's data as a feature in a centralized stream.

Cluster A Cluster B Cluster C

DGClust Setting and Objective

Page 10: Distributed clustering from data streams


Problems

high-speed data streams → excessive storage and processing

widely spread network → heavy communication

centralized clustering → high dimensionality

dynamic data → outdated models

Research Question

Do local discretization and representative clustering improve validity, communication load, and computation load when applied to distributed sensor data streams?

DGClust Problems and Research Question

Page 11: Distributed clustering from data streams


DGClust – Distributed Grid Clustering (Local Step)

Each sensor keeps an online ordinal discretization of its data.

Partition Incremental Discretization

[Figure: current state of a sensor, with the labels "low" and "D"]

DGClust Methodology : Local Step
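
The local step can be pictured with a small sketch. This is not the Partition Incremental Discretization algorithm named on the slide; it is a simplified equal-width discretizer, and the class name `OnlineDiscretizer`, the value range, and the number of bins are all assumptions made for illustration. It only shows the idea that each sensor maps its latest reading to an ordinal state and notices when that state changes.

```python
class OnlineDiscretizer:
    """Equal-width online discretizer for one sensor.

    Simplified stand-in (an assumption), not the PiD algorithm itself.
    """

    def __init__(self, low, high, bins):
        self.low, self.high, self.bins = low, high, bins
        self.state = None  # index of the bin holding the latest reading

    def update(self, value):
        """Map a reading to a bin; return the new state only if it changed."""
        width = (self.high - self.low) / self.bins
        index = int((value - self.low) / width)
        index = max(0, min(self.bins - 1, index))  # clamp out-of-range readings
        if index != self.state:
            self.state = index
            return index  # state changed: worth reporting
        return None  # state unchanged: nothing to report


sensor = OnlineDiscretizer(low=0.0, high=100.0, bins=4)
changes = [sensor.update(v) for v in [3.0, 10.0, 60.0, 61.0, 99.0]]
print(changes)  # [0, None, 2, None, 3]
```

Only three of the five readings produce a state change, which is what later allows a sensor to stay silent most of the time.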

Page 12: Distributed clustering from data streams


DGClust – Distributed Grid Clustering (Aggregating Step)

The central server gathers the global state of the network.

Sensors whose state has not changed since the last communication do not transmit to the server.

[Figure: sensors reporting their discretized states (labels such as low, high, A, B, D) to the server, which assembles them into one global state]

DGClust Methodology : Aggregating Step
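
A minimal sketch of the aggregating step, under stated assumptions: the names `CentralServer` and `step`, the sensor identifiers, and the message counter are hypothetical and not taken from the slides. The sketch shows a server that keeps the last known state of every sensor and a sensor side that transmits only when its state differs from the one it last sent.

```python
class CentralServer:
    """Keeps the last state reported by each sensor (hypothetical helper)."""

    def __init__(self, sensor_ids):
        self.global_state = {sid: None for sid in sensor_ids}
        self.messages = 0  # number of transmissions received so far

    def receive(self, sensor_id, state):
        self.global_state[sensor_id] = state
        self.messages += 1

    def snapshot(self):
        """Global state as a tuple ordered by sensor id."""
        return tuple(self.global_state[sid] for sid in sorted(self.global_state))


def step(server, last_sent, states):
    """Send only the states that changed since each sensor's last message."""
    for sid, state in states.items():
        if last_sent.get(sid) != state:
            server.receive(sid, state)
            last_sent[sid] = state
    return server.snapshot()


server = CentralServer(["s1", "s2", "s3"])
last_sent = {}
g1 = step(server, last_sent, {"s1": "low", "s2": "high", "s3": "low"})
g2 = step(server, last_sent, {"s1": "low", "s2": "high", "s3": "high"})
print(g2, server.messages)  # ('low', 'high', 'high') 4
```

Six states were observed over the two time steps, but only four messages were sent, because two sensors kept their state in the second step.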

Page 13: Distributed clustering from data streams


DGClust – Distributed Grid Clustering (Representative Step)

Server keeps a small list of the most frequent global states.

Space-Saving Frequent Items Monitoring

[Figure: list of global states (combinations of sensor states such as low, high, A, B, C, D) with their frequency counts: 523, 334, 89, ...]

DGClust Methodology : Representative Step
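
The representative step names Space-Saving as the frequent-items monitor. The following is a minimal sketch of the basic Space-Saving update rule with a fixed number of counters. The class name, the capacity, and the toy stream of global states are assumptions, and the per-item error bookkeeping of the full algorithm is left out.

```python
class SpaceSaving:
    """Bounded list of frequent items (basic Space-Saving update rule)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # monitored item -> estimated count

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Evict the item with the smallest count; the newcomer
            # inherits that count plus one (an overestimate).
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[item] = floor + 1

    def top(self, n):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]


monitor = SpaceSaving(capacity=3)
stream = [("low", "low"), ("low", "low"), ("low", "low"),
          ("high", "A"), ("high", "A"), ("B", "low"), ("D", "high")]
for global_state in stream:
    monitor.add(global_state)
print(monitor.top(2))  # [(('low', 'low'), 3), (('high', 'A'), 2)]
```

Memory stays bounded by `capacity` no matter how many distinct global states appear, at the cost of overestimating the counts of recently admitted states.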

Page 14: Distributed clustering from data streams


DGClust – Distributed Grid Clustering (Clustering Step)

Server applies partitional clustering to the most frequent states.

Furthest Point Clustering + Online Adaptive K-Means

DGClust Methodology : Clustering Step
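
The two ingredients of the clustering step can be sketched as follows. Several things here are assumptions: each frequent state is represented by a numeric point (for example the center of its grid cell), furthest-point clustering is written as the usual greedy "pick the point furthest from the current centers" rule, and the online update is a sequential (MacQueen-style) k-means step, which may differ from the adaptive variant used in DGClust.

```python
import math


def furthest_point_centers(points, k):
    """Greedy furthest-point selection of k centers from the given points."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point whose nearest chosen center is furthest away.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def online_kmeans_update(centers, counts, point):
    """Move the nearest center toward the new point (sequential k-means)."""
    j = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    counts[j] += 1
    rate = 1.0 / counts[j]
    centers[j] = tuple(c + rate * (x - c) for c, x in zip(centers[j], point))
    return j


# Hypothetical cell centers of the most frequent states.
states = [(0, 0), (1, 0), (10, 10), (11, 10), (0, 9)]
centers = furthest_point_centers(states, k=3)
counts = [1] * len(centers)
online_kmeans_update(centers, counts, (1, 0))
print(centers)  # [(0.5, 0.0), (11, 10), (0, 9)]
```

Furthest-point selection gives well-spread initial centers in one pass over the short list of frequent states, and the online update then adjusts them as new states arrive.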

Page 15: Distributed clustering from data streams


DGClust Example (k=5) Varying Resources

Page 16: Distributed clustering from data streams


Quality of results does not depend on the number of sensors.

Communication reduction is constant for any number of sensors (as long as a direct link with the server exists).

higher discretization granularity → higher clustering quality, lower communication reduction

higher number of sensors → more clustering updates

DGClust Main Findings

Page 17: Distributed clustering from data streams


Setting

Sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data)

Objective

To keep, at each node, a clustering of the entire network of sensors.

Cluster A Cluster B Cluster C

L2GClust Setting and Objective

Page 18: Distributed clustering from data streams


Each sensor keeps a sketch of its most recent data.

The common approach to focusing on recent data is to use sliding windows.

Even within the sliding window, the most recent data point is usually more important than the oldest one, which is about to be discarded.

In ubiquitous streaming data sources, such as sensor networks, resources like memory and processing power are scarce.

Sometimes there is not even enough memory to store all the data points inside the window.

Memoryless α-fading average


L2GClust Methodology : Local Sketch
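
A memoryless fading average can be sketched in a few lines. The recursion used here (a faded sum and a faded count, both multiplied by α at every step) is one common formulation and is an assumption, since the slide does not give the formula; the class name and the value α = 0.5 are illustrative only.

```python
class FadingAverage:
    """Memoryless alpha-fading average: older readings weigh less."""

    def __init__(self, alpha):
        self.alpha = alpha  # 0 < alpha <= 1; alpha = 1 gives the plain mean
        self.total = 0.0    # faded sum of readings
        self.weight = 0.0   # faded number of readings

    def update(self, x):
        self.total = x + self.alpha * self.total
        self.weight = 1.0 + self.alpha * self.weight
        return self.total / self.weight


sketch = FadingAverage(alpha=0.5)
first = sketch.update(10.0)   # 10 / 1
second = sketch.update(20.0)  # (20 + 0.5 * 10) / (1 + 0.5) = 25 / 1.5
print(first, second)
```

Only two numbers are stored per sensor, regardless of how long the stream is, and the latest reading always carries the largest weight.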

Page 19: Distributed clustering from data streams

[Figure: sensor network with the values 1, 2, 10, 99, 95, 11, 10, 100, 3, 10, 2, 12, 5, 10 at the sensors]

L2GClust Example : Local Clustering

Page 20: Distributed clustering from data streams

[Figure: sensor network with the values 1, 2, 10, 99, 95, 11, 10, 100, 3, 10, 2, 12, 5, 10 at the sensors. Label: Centroids {6.9, 98.0}]

L2GClust Example : Local Clustering
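
The centroids in this example can be checked by hand: the fourteen sensor values split into a low group (eleven values summing to 76) and a high group (99, 95, 100). The sketch below runs a plain one-dimensional 2-means on those values. The function name and the choice of the minimum and maximum as starting centers are assumptions; the slides do not say how the example centroids were obtained.

```python
values = [1, 2, 10, 99, 95, 11, 10, 100, 3, 10, 2, 12, 5, 10]


def two_means_1d(data, iterations=10):
    """Plain 2-means on scalars, started from the smallest and largest value."""
    centers = [min(data), max(data)]
    for _ in range(iterations):
        groups = [[], []]
        for x in data:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        centers = [sum(g) / len(g) for g in groups]
    return centers


centroids = two_means_1d(values)
print([round(c, 1) for c in centroids])  # [6.9, 98.0]
```

The result, 76 / 11 ≈ 6.9 and 294 / 3 = 98.0, matches the centroids {6.9, 98.0} shown on the slide.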

Page 21: Distributed clustering from data streams


Instead of broadcasting the sketch of its own data, each node broadcasts its estimate of the global clustering.

This estimate is computed by clustering the centroids of direct neighbors' estimates of the global clustering.

Furthest Point Clustering

Basically, each node performs an ensemble of clusterings from its direct neighbors.

L2GClust Methodology : Local Clustering
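
One way to picture this local step is the sketch below, in which a node pools its own centroids with the estimates received from its direct neighbors and selects k centers from the pool by greedy furthest-point selection. The function names, the one-dimensional setting, and the choice to start from the node's own first centroid are assumptions. The sketch is not claimed to reproduce the exact numbers shown in the example slides.

```python
def furthest_point_1d(candidates, k):
    """Greedy furthest-point selection of k centers among scalar candidates."""
    centers = [candidates[0]]
    while len(centers) < k:
        # Pick the candidate whose nearest chosen center is furthest away.
        centers.append(max(candidates,
                           key=lambda c: min(abs(c - s) for s in centers)))
    return sorted(centers)


def local_estimate(own_centroids, neighbor_estimates, k):
    """Cluster the pooled centroids of a node and its direct neighbors."""
    pool = list(own_centroids)
    for estimate in neighbor_estimates:
        pool.extend(estimate)
    return furthest_point_1d(pool, k)


own = [6.9, 98.0]
neighbors = [[7.71, 97.1], [10.59, 97.38], [5.10, 95.00]]
estimate = local_estimate(own, neighbors, k=2)
print(estimate)  # [6.9, 98.0]
```

Each node only ever clusters a handful of centroids, so the problem stays small even when the network is large.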

Page 22: Distributed clustering from data streams

[Figure: sensor network. Labels: Centroids {6.9, 98.0}; sensor values 88.07, 88.06, 2.80, 1.21, 3.58, 3.74, 87.37, 4.19, 88.03, 3.50, 88.12, 86.31, 2.41, 88.06; estimates {7.71, 97.1}, {10.59, 97.38}, {5.10, 95.00}]

L2GClust Example : Local Clustering

Page 24: Distributed clustering from data streams

[Figure: sensor network. Labels: Centroids {6.9, 98.0}; sensor values 88.07, 88.06, 2.80, 1.21, 3.58, 3.74, 87.37, 4.19, 88.03, 3.50, 88.12, 86.31, 2.41, 88.06; estimate {10.36, 97.1}]

L2GClust Example : Local Clustering

Page 25: Distributed clustering from data streams


Comparison was performed with the same strategy executed at a central server with access to all data.

Measured outcomes were the agreement between a node's clustering estimate and the centralized clustering, averaged over all nodes.

Kappa statistic → cluster sanity

Proportion of agreement → cluster validity

K = (P(A) - P(e)) / (1 - P(e))
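
The two measures can be computed as in the sketch below. The slides do not say how agreement between two clusterings is counted, so this sketch assumes a pairwise view: for every pair of sensors, both clusterings either put the pair in the same cluster or not, P(A) is the fraction of pairs on which they agree, and P(e) is the agreement expected by chance from the marginal proportions. All function names are hypothetical.

```python
from itertools import combinations


def pair_labels(clustering):
    """1 if a pair of sensors shares a cluster, else 0, for every pair."""
    return [int(clustering[a] == clustering[b])
            for a, b in combinations(range(len(clustering)), 2)]


def agreement_and_kappa(node_clustering, central_clustering):
    x = pair_labels(node_clustering)
    y = pair_labels(central_clustering)
    n = len(x)
    p_a = sum(1 for u, v in zip(x, y) if u == v) / n  # observed agreement
    px, py = sum(x) / n, sum(y) / n
    p_e = px * py + (1 - px) * (1 - py)               # chance agreement
    kappa = (p_a - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_a, kappa


same = agreement_and_kappa([0, 0, 1, 1], [0, 0, 1, 1])
different = agreement_and_kappa([0, 0, 1, 1], [0, 1, 0, 1])
print(same, different)
```

Identical clusterings give P(A) = 1 and K = 1, while the second pair agrees on only 2 of 6 sensor pairs and gets a negative K, that is, agreement below what chance alone would produce.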

State-of-the-art Simulator

Each sensor in the simulation (Visual Sense) generates a Gaussian stream with mean from one of the predefined Gaussian clusters.

Evaluated parameters were number of clusters, network size, and cluster overlap.

L2GClust Evaluation Summary
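
The data-generation side of such an evaluation can be imitated with a short sketch. This is not the simulator used in the talk; it only draws, for each sensor, Gaussian readings around the mean of the cluster the sensor was assigned to. The function name, cluster means, standard deviation, and seed are all assumptions.

```python
import random


def make_streams(cluster_means, sensors_per_cluster, std, seed=0):
    """Assign sensors to Gaussian clusters and return a reading generator."""
    rng = random.Random(seed)  # seeded for repeatable runs
    means = [m for m in cluster_means for _ in range(sensors_per_cluster)]

    def tick():
        # One new reading per sensor, drawn around its cluster mean.
        return [rng.gauss(m, std) for m in means]

    return means, tick


means, tick = make_streams([10.0, 90.0], sensors_per_cluster=3, std=1.0)
readings = tick()
print(len(readings), means)
```

Raising `std` relative to the distance between the means increases cluster overlap, and the other two evaluated parameters correspond to the length of `cluster_means` and the total number of sensors.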

Page 26: Distributed clustering from data streams


L2GClust Results

Average proportion of agreement converges (with small fluctuations).

Page 27: Distributed clustering from data streams


L2GClust Results

Sanity was confirmed, with the Kappa statistic always above 0.58.

Page 28: Distributed clustering from data streams


L2GClust Results

Real data from electricity demand sensors showed the ability to improve with examples.

Page 29: Distributed clustering from data streams


Local sketch yields:

memoryless storage of summaries;

a straightforward adaptation to most recent data;

a reduction of the system's sensitivity to uncertainty.

Local clustering with direct neighbors yields:

no forwarding of information (reduced communication);

low dimensionality of the clustering problem;

sensitive information better preserved.

Future Work

Evaluate L2GClust on smart grid sensor networks.

L2GClust Main Properties

Page 30: Distributed clustering from data streams


Thank you!