Stream Clustering CSE 902

Upload: oscar-copeland

Post on 24-Dec-2015

TRANSCRIPT

  • Slide 1
  • Stream Clustering CSE 902
  • Slide 2
  • Big Data
  • Slide 3
  • Stream analysis
      • Stream: continuous flow of data
      • Challenges
          • Volume: not possible to store all the data
          • One-time access: not possible to process the data using multiple passes
          • Real-time analysis: certain applications need real-time analysis of the data
          • Temporal locality: data evolves over time, so the model should be adaptive
  • Slide 4
  • Stream Clustering (figure labels: topic cluster, article listings)
  • Slide 5
  • Stream Clustering
      • Online phase: summarize the data into memory-efficient data structures
      • Offline phase: use a clustering algorithm to find the data partition
  • Slide 6
  • Stream Clustering Algorithms (data structure: examples)
      • Prototypes: Stream, Stream LSearch
      • CF-Trees: Scalable k-means, Single pass k-means
      • Microcluster Trees: ClusTree, DenStream, HP-Stream
      • Grids: D-Stream, ODAC
      • Coreset Tree: StreamKM++
  • Slide 7
  • Prototypes: Stream, LSearch
  • Slide 8
  • CF-Trees
      • Summarize the data in each CF-vector
          • Linear sum of data points
          • Squared sum of data points
          • Number of points
      • Examples: Scalable k-means, Single pass k-means
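The three CF-vector statistics listed above are enough to recover a cluster's centroid and spread without storing the points. A minimal sketch follows; the class name `CFVector` and its methods are hypothetical, chosen for illustration, and are not taken from any of the cited algorithms.

```python
import math


class CFVector:
    """Clustering feature: point count, per-dimension linear sum and squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # linear sum of data points
        self.ss = [0.0] * dim  # squared sum of data points

    def add(self, x):
        """Absorb one data point in O(dim) time."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def merge(self, other):
        """CF-vectors are additive, so two summaries merge by summing fields."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of the summarized points to the centroid."""
        var = sum(
            self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
            for i in range(len(self.ls))
        )
        return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding errors


cf = CFVector(2)
cf.add([0.0, 0.0])
cf.add([2.0, 0.0])
print(cf.centroid(), cf.radius())
```

Because the fields are plain sums, `merge` gives the same result as adding every point of the second summary one by one, which is what makes single-pass processing possible.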
  • Slide 9
  • Microclusters: CF-Trees with time element
      • CluStream
          • Linear sum and square sum of timestamps
          • Delete old microclusters, or merge microclusters if their timestamps are close to each other
      • Sliding Window Clustering
          • Timestamp of the most recent data point added to the vector
          • Maintain only the most recent T microclusters
      • DenStream
          • Microclusters are associated with weights based on recency
          • Outliers detected by creating a separate microcluster
  • Slide 10
  • Microclusters: CF-Trees with time element
      • DenStream
          • Microclusters are associated with weights based on recency
          • Outliers detected by creating a separate microcluster
      • ClusTree
          • Allows real-time clustering
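Recency-based weighting of a microcluster can be sketched as follows. The slides do not state the fading function, so this assumes exponential fading of the form 2^(-lam * elapsed_time); the class `FadingMicroCluster`, the parameter `lam`, and the `min_weight` threshold are hypothetical names and values used only for illustration.

```python
class FadingMicroCluster:
    """Microcluster whose statistics fade as time passes (illustrative only)."""

    def __init__(self, dim, lam=0.25):
        self.lam = lam            # assumed decay rate, not from the slides
        self.weight = 0.0         # faded number of points
        self.ls = [0.0] * dim     # faded linear sum
        self.ss = [0.0] * dim     # faded squared sum
        self.last_update = 0.0

    def fade(self, now):
        """Scale all statistics by 2^(-lam * time since the last update)."""
        factor = 2.0 ** (-self.lam * (now - self.last_update))
        self.weight *= factor
        self.ls = [v * factor for v in self.ls]
        self.ss = [v * factor for v in self.ss]
        self.last_update = now

    def insert(self, x, now):
        """Fade the old statistics, then absorb the new point with weight 1."""
        self.fade(now)
        self.weight += 1.0
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def center(self):
        return [v / self.weight for v in self.ls]

    def is_outlier(self, now, min_weight):
        """Treat a microcluster as an outlier once its faded weight is small."""
        self.fade(now)
        return self.weight < min_weight


mc = FadingMicroCluster(dim=1, lam=0.25)
mc.insert([0.0], now=0.0)
mc.insert([3.0], now=4.0)   # the first point has faded to weight 0.5 by now
print(mc.weight, mc.center())
```

With `lam=0.25`, a point loses half its weight every four time units, so the center drifts toward recent data.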
  • Slide 11
  • Grids
      • D-Stream
          • Assign the data to grids
          • Grids weighted by recency of the points added to them
          • Each grid associated with a label
      • DGClust
          • Distributed clustering of sensor data
          • Sensors maintain local copies of the grid and communicate grid updates to a central site
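The grid idea (assign each point to a cell, weight cells by recency, attach a label) can be sketched as below. This is a simplified illustration, not the D-Stream algorithm itself: the class `DecayingGrid`, the multiplicative `decay` factor per time unit, and the `dense`/`sparse` thresholds are all assumptions.

```python
class DecayingGrid:
    """Uniform grid whose cell densities decay over time (illustrative only)."""

    def __init__(self, cell_size, decay=0.998):
        self.cell_size = cell_size
        self.decay = decay     # assumed per-time-unit decay factor in (0, 1)
        self.cells = {}        # cell index -> (density, time of last update)

    def cell_of(self, x):
        """Map a point to the integer index of the cell that contains it."""
        return tuple(int(v // self.cell_size) for v in x)

    def insert(self, x, now):
        """Decay the cell's old density to the current time, then add 1."""
        key = self.cell_of(x)
        density, last = self.cells.get(key, (0.0, now))
        self.cells[key] = (density * self.decay ** (now - last) + 1.0, now)
        return key

    def density(self, key, now):
        density, last = self.cells.get(key, (0.0, now))
        return density * self.decay ** (now - last)

    def label(self, key, now, dense=3.0, sparse=1.0):
        """Label a cell from its current density; thresholds are hypothetical."""
        d = self.density(key, now)
        if d >= dense:
            return "dense"
        if d <= sparse:
            return "sparse"
        return "transitional"


grid = DecayingGrid(cell_size=1.0, decay=0.5)
key = grid.insert([0.2, 0.7], now=0)
grid.insert([0.9, 0.1], now=1)
print(key, grid.density(key, now=2), grid.label(key, now=2))
```

Only the touched cell is updated on each insert; decay for untouched cells is applied lazily when their density is read.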
  • Slide 12
  • StreamKM++ (Coresets)
      • Reference: "StreamKM++: A Clustering Algorithm for Data Streams", Ackermann, Journal of Experimental Algorithmics, 2012
  • Slide 13
  • Kernel-based Clustering
  • Slide 14
  • Kernel-based Stream Clustering
      • Use non-linear distance measures to define similarity between data points in the stream
      • Challenges
          • Quadratic running time complexity
          • Computationally expensive to compute centers using linear sums and squared sums (CF-vector approach will not work)
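To see why the CF-vector approach does not carry over, recall that a cluster center in kernel space is implicit: the squared distance from a point to it has to be computed from pairwise kernel values. A minimal sketch, in which the function names and the choice of an RBF kernel are assumptions made for illustration:

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an illustrative bandwidth parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def kernel_distance_sq(x, cluster, kernel):
    """Squared distance from x to the implicit center of `cluster`.

    ||phi(x) - mu||^2 = k(x, x) - (2/m) * sum_j k(x, x_j)
                        + (1/m^2) * sum_{j,l} k(x_j, x_l)
    The last term touches every pair of cluster members, which is the
    source of the quadratic cost.
    """
    m = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, p) for p in cluster) / m
    within_term = sum(kernel(p, q) for p in cluster for q in cluster) / (m * m)
    return self_term - 2.0 * cross_term + within_term


cluster = [(0.0, 0.0), (2.0, 0.0)]
# With the linear kernel this reduces to the ordinary squared Euclidean
# distance to the centroid (1, 0).
print(kernel_distance_sq((1.0, 1.0), cluster, linear_kernel))
print(kernel_distance_sq((1.0, 1.0), cluster, rbf_kernel))
```

No fixed-size summary such as a linear sum replaces the cluster members here, because the kernel is evaluated against each stored point.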
  • Slide 15
  • Stream Kernel k-means (sKKM)
      • Kernel k-means
      • Weighted Kernel k-means
      • History from only the preceding data chunk retained
      • Reference: "Approximation of Kernel k-Means for Streaming Data", Havens, ICPR 2012
  • Slide 16
  • Statistical Leverage Scores
      • Measure the influence of a point in the low-rank approximation
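One standard way to compute rank-k leverage scores for a symmetric kernel matrix is to take the squared row norms of its top-k eigenvectors. The sketch below uses that definition and assumes NumPy is available; the function name `leverage_scores` is hypothetical, and the slides do not show the exact formula the authors use.

```python
import numpy as np


def leverage_scores(K, k):
    """Rank-k statistical leverage scores of a symmetric kernel matrix K.

    Score i is the squared norm of row i of the matrix whose columns are
    the top-k eigenvectors of K. Scores lie in [0, 1] and sum to k.
    """
    eigvals, eigvecs = np.linalg.eigh(K)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    return np.sum(eigvecs[:, top] ** 2, axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
K = X @ X.T                                     # linear kernel matrix, rank 2
scores = leverage_scores(K, k=2)
probabilities = scores / scores.sum()           # sampling distribution over points
print(scores, probabilities)
```

Dividing the scores by their sum gives a probability distribution, so points with a larger influence on the low-rank approximation are more likely to be retained.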
  • Slide 17
  • Statistical Leverage Scores
  • Slide 18
  • Slide 19
  • Approximate Stream kernel k-means
      • Uses statistical leverage scores to determine which data points in the stream are potentially important
      • Retain the important points and discard the rest
      • Use an approximate version of kernel k-means to obtain the clusters
          • Linear time complexity
      • Bounded amount of memory
  • Slide 20
  • Approximate Stream kernel k-means
  • Slide 21
  • Importance Sampling
  • Slide 22
  • Clustering
      • Kernel k-means
      • Approximate Kernel k-means
  • Slide 23
  • Clustering
      • Approximate Kernel k-means
  • Slide 24
  • Updating eigenvectors
      • Only the eigenvectors and eigenvalues of the kernel matrix are required for both sampling and clustering
      • Update the eigenvectors and eigenvalues incrementally
  • Slide 25
  • Approximate Stream Kernel k-means
  • Slide 26
  • Network Traffic Monitoring
      • Clustering used to detect intrusions in the network
      • Network Intrusion data set
          • TCP dump data from seven weeks of LAN traffic
          • 10 classes: 9 types of intrusions, 1 class of legitimate traffic
      • Results (running time in milliseconds per data point; cluster accuracy as NMI)
          • Approximate stream kernel k-means: 6.6 ms, NMI 14.2
          • StreamKM++: 0.8 ms, NMI 7.0
          • sKKM: 42.1 ms, NMI 13.3
      • Around 200 points clustered per second
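Cluster accuracy above is reported as NMI (normalized mutual information). A minimal sketch of one common definition follows, normalizing by the geometric mean of the two entropies; the slides do not say which normalization or scaling was used, so treat both as assumptions. The function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings, in [0, 1]."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information from the joint and marginal label frequencies.
    mi = sum(
        (c / n) * math.log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    if h_true == 0.0 or h_pred == 0.0:
        return 0.0  # convention chosen here when a labeling has one cluster
    return mi / math.sqrt(h_true * h_pred)


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels renamed
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent partitions
```

NMI is invariant to renaming cluster labels, which is why it suits comparing a clustering against class labels.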
  • Slide 27
  • Summary
      • Efficient kernel-based stream clustering algorithm with linear running time complexity
      • Memory required is bounded
      • Real-time clustering is possible
      • Limitation: does not account for data evolution