data warehousing mining & bi data streams mining dwmbi1

42
Data Warehousing Mining Data Warehousing Mining & BI & BI Data Streams Mining Data Streams Mining DWMBI DWMBI 1

Upload: benjamin-goodman

Post on 17-Jan-2016

230 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Data Warehousing Mining & BIData Warehousing Mining & BI

Data Streams MiningData Streams Mining

DWMBIDWMBI 11

Page 2: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Mining Data StreamsMining Data Streams

What is stream data? Why Stream Data Systems?What is stream data? Why Stream Data Systems?

Stream data management systems: Issues and solutionsStream data management systems: Issues and solutions

Stream data cube and multidimensional OLAP analysisStream data cube and multidimensional OLAP analysis

Stream frequent pattern analysisStream frequent pattern analysis

Stream classificationStream classification

Stream cluster analysisStream cluster analysis

Research issuesResearch issues

DWMBIDWMBI 22

Page 3: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Characteristics of Data StreamsCharacteristics of Data Streams Data StreamsData Streams

Data Data streamsstreams——continuous, ordered, changing, fast, huge amount continuous, ordered, changing, fast, huge amount

Traditional Traditional DBMSDBMS——data stored in finite, persistent data sets data stored in finite, persistent data sets

CharacteristicsCharacteristics Huge volumes of continuous data, possibly infiniteHuge volumes of continuous data, possibly infinite Fast changing and requires fast, real-time responseFast changing and requires fast, real-time response Data stream captures nicely our data processing needs of todayData stream captures nicely our data processing needs of today Random access is expensiveRandom access is expensive——single scan algorithm (single scan algorithm (can only have can only have

one lookone look)) Store only the summary of the data seen thus farStore only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in Most stream data are at pretty low-level or multi-dimensional in

nature, needs multi-level and multi-dimensional processingnature, needs multi-level and multi-dimensional processingDWMBIDWMBI 33

Page 4: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Stream Data ApplicationsStream Data Applications

Telecommunication calling recordsTelecommunication calling records Business: credit card transaction flowsBusiness: credit card transaction flows Network monitoring and traffic engineeringNetwork monitoring and traffic engineering Financial market: stock exchangeFinancial market: stock exchange Engineering & industrial processes: power supply & Engineering & industrial processes: power supply &

manufacturingmanufacturing Sensor, monitoring & surveillance: video streams, RFIDsSensor, monitoring & surveillance: video streams, RFIDs Security monitoringSecurity monitoring Web logs and Web page click streamsWeb logs and Web page click streams Massive Business data setsMassive Business data sets

DWMBIDWMBI 44

Page 5: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Challenges of Stream Data ProcessingChallenges of Stream Data Processing

Multiple, continuous, rapid, time-varying, orderedMultiple, continuous, rapid, time-varying, ordered streams streams

Main memoryMain memory computations computations

Queries are often Queries are often continuouscontinuous Evaluated continuously as stream data arrivesEvaluated continuously as stream data arrives

Answer updated over timeAnswer updated over time

Queries are often Queries are often complexcomplex Beyond element-at-a-time processingBeyond element-at-a-time processing

Beyond stream-at-a-time processingBeyond stream-at-a-time processing

Beyond relational queries (scientific, data mining, OLAP)Beyond relational queries (scientific, data mining, OLAP)

Multi-level/multi-dimensional Multi-level/multi-dimensional processing and data miningprocessing and data mining Most stream data are at low-level or multi-dimensional in natureMost stream data are at low-level or multi-dimensional in nature

DWMBIDWMBI 55

Page 6: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Architecture: Stream Query ProcessingArchitecture: Stream Query Processing

DWMBIDWMBI 66

Scratch SpaceScratch Space(Main memory and/or Disk)(Main memory and/or Disk)

User/ApplicationUser/ApplicationUser/ApplicationUser/Application

Stream QueryStream QueryProcessorProcessor

ResultsResultsMultiple streamsMultiple streams

SDMS (Stream Data Management System)

Page 7: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Stream Data Processing Methods Stream Data Processing Methods Random sampling (but without knowing the total length in advance)Random sampling (but without knowing the total length in advance)

Reservoir samplingReservoir sampling: maintain a set of : maintain a set of ss candidates in the reservoir, which candidates in the reservoir, which form a true random sample of the element seen so far in the stream. As the form a true random sample of the element seen so far in the stream. As the data stream flow, every new element has a certain probability (data stream flow, every new element has a certain probability (ss/N) of /N) of replacing an old element in the reservoir.replacing an old element in the reservoir.

Sliding windowsSliding windows Make decisions based only on Make decisions based only on recentrecent data data of sliding window size of sliding window size ww An element arriving at time An element arriving at time tt expires at time expires at time t t + + ww

HistogramsHistograms Approximate the frequency distribution of element values in a streamApproximate the frequency distribution of element values in a stream Partition data into a set of contiguous bucketsPartition data into a set of contiguous buckets Equal-width (equal value range for buckets) vs. V-optimal (minimizing Equal-width (equal value range for buckets) vs. V-optimal (minimizing

frequency variance within each bucket)frequency variance within each bucket) Multi-resolution modelsMulti-resolution models

Popular models: balanced binary trees, micro-clusters, and waveletsPopular models: balanced binary trees, micro-clusters, and waveletsDWMBIDWMBI 77

Page 8: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Processing Stream QueriesProcessing Stream Queries Query typesQuery types

One-time query vs. One-time query vs. continuous querycontinuous query (being evaluated (being evaluated continuously as stream continues to arrive)continuously as stream continues to arrive)

Predefined queryPredefined query vs. ad-hoc query (issued on-line) vs. ad-hoc query (issued on-line) Unbounded memory requirementsUnbounded memory requirements

For real-time response, For real-time response, main memory algorithmmain memory algorithm should be should be usedused

Memory requirement is unbounded if one will join future tuples Memory requirement is unbounded if one will join future tuples Approximate query answeringApproximate query answering

With bounded memory, it is not always possible to produce With bounded memory, it is not always possible to produce exact answersexact answers

High-quality approximate answersHigh-quality approximate answers are desired are desired Data reduction and synopsis construction methodsData reduction and synopsis construction methods

Sketches, random sampling, histograms, wavelets, etc.Sketches, random sampling, histograms, wavelets, etc.

DWMBIDWMBI 88

Page 9: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Approximate Query Answering in StreamsApproximate Query Answering in Streams

Sliding windowsSliding windows Only over sliding windows of Only over sliding windows of recentrecent stream data stream data Approximation but often more desirable in applicationsApproximation but often more desirable in applications

Batched processing, sampling and synopsesBatched processing, sampling and synopses BatchedBatched if update is fast but computing is slow if update is fast but computing is slow

Compute periodically, not very timelyCompute periodically, not very timely SamplingSampling if update is slow but computing is fast if update is slow but computing is fast

Compute using sample data, but not good for joins, etc.Compute using sample data, but not good for joins, etc. SynopsisSynopsis data structures data structures

Maintain a small Maintain a small synopsissynopsis or or sketchsketch of data of data Good for querying historical dataGood for querying historical data

Blocking operators, e.g., sorting, avg, min, etc.Blocking operators, e.g., sorting, avg, min, etc. BlockingBlocking if unable to produce the first output until seeing the entire if unable to produce the first output until seeing the entire

inputinput DWMBIDWMBI 99

Page 10: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Stream Data Mining vs. Stream QueryingStream Data Mining vs. Stream Querying Stream miningStream mining——A more challenging task in many casesA more challenging task in many cases

It shares most of the difficulties with stream queryingIt shares most of the difficulties with stream querying But often requires less “precision”, e.g., no join, grouping, But often requires less “precision”, e.g., no join, grouping,

sortingsorting

Patterns are hidden and more general than queryingPatterns are hidden and more general than querying It may require exploratory analysisIt may require exploratory analysis

Not necessarily continuous queriesNot necessarily continuous queries Stream data mining tasksStream data mining tasks

Multi-dimensional on-line analysis of streamsMulti-dimensional on-line analysis of streams Mining outliers and unusual patterns in stream dataMining outliers and unusual patterns in stream data Clustering data streams Clustering data streams Classification of stream dataClassification of stream data

DWMBIDWMBI 1010

Page 11: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Challenges for Mining Dynamics in Data StreamsChallenges for Mining Dynamics in Data Streams

Most stream data are at pretty low-level or multi-Most stream data are at pretty low-level or multi-

dimensional in nature: needs ML/MD processingdimensional in nature: needs ML/MD processing

Analysis requirementsAnalysis requirements

Multi-dimensional trends and unusual patternsMulti-dimensional trends and unusual patterns

Capturing important changes at multi-dimensions/levels Capturing important changes at multi-dimensions/levels

Fast, real-time detection and responseFast, real-time detection and response

Comparing with data cube: Similarity and differencesComparing with data cube: Similarity and differences

Stream (data) cube or stream OLAP: Is this feasible?Stream (data) cube or stream OLAP: Is this feasible?

Can we implement it efficiently?Can we implement it efficiently?

DWMBIDWMBI 1111

Page 12: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Multi-Dimensional Stream Analysis: ExamplesMulti-Dimensional Stream Analysis: Examples

Analysis of Analysis of Web click streamsWeb click streams Raw data at low levels: seconds, web page addresses, user IP Raw data at low levels: seconds, web page addresses, user IP

addresses, …addresses, … Analysts want: changes, trends, unusual patterns, at reasonable Analysts want: changes, trends, unusual patterns, at reasonable

levels of detailslevels of details E.g., E.g., AAverageverage clicking traffic in clicking traffic in India India on on sportssports in the in the last 15 last 15

minutesminutes is 40% higher than that in the is 40% higher than that in the last 24 hourslast 24 hours.”.”

Analysis of Analysis of power consumption streamspower consumption streams Raw data: power consumption flow for every household, every Raw data: power consumption flow for every household, every

minute minute Patterns one may find: Patterns one may find: averageaverage hourly power consumption hourly power consumption

surges up 30%surges up 30% for for housing housing in in DelhiDelhi in the in the last 2 hours today last 2 hours today than than that of the same that of the same day a week ago day a week ago

DWMBIDWMBI 1212

Page 13: Data Warehousing Mining & BI Data Streams Mining DWMBI1

A Stream Cube ArchitectureA Stream Cube Architecture

AA tilted time frametilted time frame Different time granularitiesDifferent time granularities

second, minute, quarter, hour, day, week, …second, minute, quarter, hour, day, week, …

Critical layersCritical layers Minimum interest layerMinimum interest layer (m-layer) (m-layer) Observation layerObservation layer (o-layer) (o-layer) User: watches at o-layer and occasionally needs to drill-down down User: watches at o-layer and occasionally needs to drill-down down

to m-layerto m-layer

Partial materialization of stream cubesPartial materialization of stream cubes Full materialization: too space and time consumingFull materialization: too space and time consuming No materialization: slow response at query timeNo materialization: slow response at query time Partial materialization: what do we mean “partial”?Partial materialization: what do we mean “partial”?

DWMBIDWMBI 1313

Page 14: Data Warehousing Mining & BI Data Streams Mining DWMBI1

A Titled Time ModelA Titled Time Model

Tim et8 t 4 t 2 t t1 6 t3 2 t6 4 t

DWMBIDWMBI

1414

Natural Natural tilted time frame:tilted time frame: Example: Minimal: quarter, then 4 quarters Example: Minimal: quarter, then 4 quarters 1 hour, 24 hours 1 hour, 24 hours

day, … day, …

LogarithmicLogarithmic tilted time frame: tilted time frame: Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, …Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, …

4 q tr s2 4 h o u r s3 1 d ay s1 2 m o n th stim e

Tim et8 t 4 t 2 t t1 6 t3 2 t6 4 t

Page 15: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Two Critical Layers in the Stream CubeTwo Critical Layers in the Stream Cube

DWMBIDWMBI 1515

(*, theme, quarter)

(user-group, URL-group, minute)

m-layer (minimal interest)(individual-user, URL, second)

(primitive) stream data layer

o-layer (observation)

Page 16: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Frequent Patterns for Stream DataFrequent Patterns for Stream Data Frequent pattern mining is valuable in stream applicationsFrequent pattern mining is valuable in stream applications

e.g., network intrusion mininge.g., network intrusion mining Mining Mining preciseprecise freq. patterns in stream data: unrealistic freq. patterns in stream data: unrealistic

store them in a compressed form, such as FPtreestore them in a compressed form, such as FPtree How to mine frequent patterns with good approximation?How to mine frequent patterns with good approximation?

Approximate frequent patterns Approximate frequent patterns Keep only current frequent patterns? No changes can Keep only current frequent patterns? No changes can

be detectedbe detected Mining evolution freq. patternsMining evolution freq. patterns

Use tilted time window frame Use tilted time window frame Mining evolution and dramatic changes of frequent Mining evolution and dramatic changes of frequent

patternspatterns Space-saving computation of frequent and top-k elementsSpace-saving computation of frequent and top-k elements

DWMBIDWMBI 1616

Page 17: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Mining Approximate Frequent PatternsMining Approximate Frequent Patterns

Mining Mining preciseprecise freq. patterns in stream data: freq. patterns in stream data: unrealisticunrealistic

Even store them in a compressed form, such as FPtreeEven store them in a compressed form, such as FPtree

Approximate answersApproximate answers are often sufficient (e.g., trend/pattern analysis) are often sufficient (e.g., trend/pattern analysis)

Example: a router is interested in all flows:Example: a router is interested in all flows:

whose whose frequencyfrequency is at least is at least 1% (1% (σσ)) of the entire traffic stream of the entire traffic stream

seen so far seen so far

and feels that and feels that 1/10 of 1/10 of σσ ( (εε = 0.1%) error= 0.1%) error is comfortable is comfortable

How to mine frequent patterns with How to mine frequent patterns with good approximationgood approximation??

Lossy Counting Algorithm) Lossy Counting Algorithm)

Major ideas: not tracing items until it becomes frequentMajor ideas: not tracing items until it becomes frequent

Adv: guaranteed error boundAdv: guaranteed error bound

Disadv: keep a large set of tracesDisadv: keep a large set of tracesDWMBIDWMBI 1717

Page 18: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Lossy Counting for Lossy Counting for Frequent ItemsFrequent Items

DWMBIDWMBI 1818

Bucket 1 Bucket 2 Bucket 3

Divide Stream into ‘Buckets’ (bucket size is 1/ ε = 1000)

Page 19: Data Warehousing Mining & BI Data Streams Mining DWMBI1

First Bucket of StreamFirst Bucket of Stream

DWMBIDWMBI 1919

Empty(summary) +

At bucket boundary, decrease all counters by 1

Page 20: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Next Bucket of StreamNext Bucket of Stream

DWMBIDWMBI 2020

+

At bucket boundary, decrease all counters by 1

Page 21: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Approximation GuaranteeApproximation Guarantee Given: (1) support threshold: Given: (1) support threshold: σσ, (2) error threshold: , (2) error threshold: εε, and , and

(3)(3) stream length N stream length N Output: Output: items with frequency counts exceeding (items with frequency counts exceeding (σσ –– εε) N) N How much do we undercount?How much do we undercount?

If stream length seen so far = N and bucket-size = If stream length seen so far = N and bucket-size = 1/1/εε then then frequency count error frequency count error #buckets = #buckets = εεNN

Approximation guaranteeApproximation guarantee No false negativesNo false negatives

False positives have frequency at least (False positives have frequency at least (σσ––εε)N)N

Frequency count underestimated by at most Frequency count underestimated by at most εεNN

DWMBIDWMBI 2121

Page 22: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Lossy Counting For Frequent ItemsetsLossy Counting For Frequent Itemsets

DWMBIDWMBI 2222

Divide Stream into ‘Buckets’ as for frequent itemsBut fill as many buckets as possible in main memory

one time

Bucket 1 Bucket 2 Bucket 3

If we put 3 buckets of data into main memory one time,Then decrease each frequency count by 3

Page 23: Data Warehousing Mining & BI Data Streams Mining DWMBI1

DWMBIDWMBI 2323

Update of Summary Data Structure

2

2

1

2

11

1

summary data 3 bucket datain memory

4

4

10

22

0

+

3

3

9

summary data

Itemset ( ) is deleted.That’s why we choose a large number of buckets – delete more

Page 24: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Pruning Itemsets – Apriori RulePruning Itemsets – Apriori Rule

DWMBIDWMBI 2424

If we find itemset ( ) is not frequent itemset,Then we needn’t consider its superset

3 bucket datain memory

1

+

summary data

2

2

1

1

Page 25: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Summary of Lossy CountingSummary of Lossy Counting

StrengthStrength A A simplesimple idea idea Can be extended to Can be extended to frequent itemsetsfrequent itemsets

Weakness:Weakness: Space BoundSpace Bound is not good is not good For frequent itemsets, they do For frequent itemsets, they do scan each record scan each record

many timesmany times The output is based on The output is based on all previous dataall previous data. But . But

sometimes, we are only interested in sometimes, we are only interested in recent datarecent data A space-saving method for stream frequent item A space-saving method for stream frequent item

miningmining

DWMBIDWMBI 2525

Page 26: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Mining Evolution of Frequent Patterns for Mining Evolution of Frequent Patterns for Stream DataStream Data

Approximate frequent patternsApproximate frequent patterns Keep only current frequent patterns—No changes Keep only current frequent patterns—No changes

can be detectedcan be detected Mining evolution and dramatic changes of frequent Mining evolution and dramatic changes of frequent

patternspatterns Use tilted time window frame Use tilted time window frame Use compressed form to store significant Use compressed form to store significant

(approximate) frequent patterns and their time-(approximate) frequent patterns and their time-dependent tracesdependent traces

Note: To mine precise counts, one has to trace/keep a Note: To mine precise counts, one has to trace/keep a fixed (and small) set of itemsfixed (and small) set of items

DWMBIDWMBI 2626

Page 27: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Classification for Dynamic Data StreamsClassification for Dynamic Data Streams Decision tree induction for stream data classificationDecision tree induction for stream data classification

VFDT (Very Fast Decision Tree)/ CVFDTVFDT (Very Fast Decision Tree)/ CVFDT Is decision-tree good for modeling fast changing data, e.g., Is decision-tree good for modeling fast changing data, e.g.,

stock market analysis?stock market analysis? Other stream classification methodsOther stream classification methods

Instead of decision-trees, consider other models Instead of decision-trees, consider other models Naïve BayesianNaïve BayesianEnsemble Ensemble K-nearest neighborsK-nearest neighbors

Tilted time framework, incremental updating, dynamic Tilted time framework, incremental updating, dynamic maintenance, and model constructionmaintenance, and model construction

Comparing of models to find changesComparing of models to find changesDWMBIDWMBI 2727

Page 28: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Network traffic

Classification model

Attack traffic

Firewall

Block and quarantine

Benign traffic

Server

Model

update

Expert analysis and labeling

Uses past Uses past labeled data labeled data to build classification modelto build classification model

Predicts the labels of future instances using the Predicts the labels of future instances using the modelmodel

Helps decision making Helps decision making

Page 29: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Applications

SecuritySecurity Monitoring Monitoring

Network monitoring and traffic engineering.Network monitoring and traffic engineering.

Business : credit card transaction flows.Business : credit card transaction flows.

Telecommunication calling records.Telecommunication calling records.

Web logs and web page click streams.Web logs and web page click streams.

Financial market: stock exchange.Financial market: stock exchange.

Page 30: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Hoeffding Tree AlgorithmHoeffding Tree Algorithm Hoeffding Tree InputHoeffding Tree Input

S: sequence of examplesS: sequence of examplesX: attributesX: attributesG( ): evaluation functionG( ): evaluation functiond: desired accuracyd: desired accuracy

Hoeffding Tree AlgorithmHoeffding Tree Algorithmfor each example in Sfor each example in S

retrieve G(Xretrieve G(Xaa) and G(X) and G(Xbb) //two highest G(X) //two highest G(Xii))

if ( G(Xif ( G(Xaa) – G(X) – G(Xbb) > ) > εε ) )

split on Xsplit on Xaa

recurse to next noderecurse to next nodebreakbreak

DWMBIDWMBI 3030

Page 31: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Decision-Tree Induction with Data Decision-Tree Induction with Data StreamsStreams

DWMBIDWMBI 3131

yes no

Packets > 10

Protocol = http

Protocol = ftp

yes

yes no

Packets > 10

Bytes > 60K

Protocol = http

Data Stream

Data Stream

Page 32: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Hoeffding Tree: Strengths and WeaknessesHoeffding Tree: Strengths and Weaknesses

Strengths Strengths Scales better than traditional methodsScales better than traditional methods

Sublinear with samplingSublinear with samplingVery small memory utilizationVery small memory utilization

IncrementalIncremental

Make class predictions in parallelMake class predictions in parallelNew examples are added as they comeNew examples are added as they come

WeaknessWeakness Could spend a lot of time with tiesCould spend a lot of time with ties Memory used with tree expansionMemory used with tree expansion Number of candidate attributesNumber of candidate attributes

DWMBIDWMBI 3232

Page 33: Data Warehousing Mining & BI Data Streams Mining DWMBI1

VFDT (Very Fast Decision Tree)VFDT (Very Fast Decision Tree) Modifications to Hoeffding TreeModifications to Hoeffding Tree

Near-ties broken more aggressivelyNear-ties broken more aggressively G computed every nG computed every nminmin

Deactivates certain leaves to save memoryDeactivates certain leaves to save memory Poor attributes droppedPoor attributes dropped Initialize with traditional learnerInitialize with traditional learner

Compare to Hoeffding Tree: Better time and memoryCompare to Hoeffding Tree: Better time and memory Compare to traditional decision treeCompare to traditional decision tree

Similar accuracySimilar accuracy Better runtime with 1.61 million examplesBetter runtime with 1.61 million examples

21 minutes for VFDT ; 24 hours for C4.521 minutes for VFDT ; 24 hours for C4.5 Still does not handle concept driftStill does not handle concept drift

DWMBIDWMBI 3333

Page 34: Data Warehousing Mining & BI Data Streams Mining DWMBI1

CVFDT (Concept-adapting VFDT)CVFDT (Concept-adapting VFDT) Concept Drift Concept Drift

Time-changing data streamsTime-changing data streams Incorporate new and eliminate oldIncorporate new and eliminate old

CVFDTCVFDT Increments count with new exampleIncrements count with new example Decrement old exampleDecrement old example

Sliding windowSliding windowNodes assigned monotonically increasing IDsNodes assigned monotonically increasing IDs

Grows alternate subtreesGrows alternate subtrees When alternate more accurate => replace oldWhen alternate more accurate => replace old O(w) better runtime than VFDT-windowO(w) better runtime than VFDT-window

DWMBIDWMBI 3434

Page 35: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Ensemble of Classifiers AlgorithmEnsemble of Classifiers Algorithm Method (derived from the ensemble idea in Method (derived from the ensemble idea in

classification)classification) train K classifiers from K chunkstrain K classifiers from K chunks

for each subsequent chunkfor each subsequent chunk

train a new classifiertrain a new classifier

test other classifiers against the chunktest other classifiers against the chunk

assign weight to each classifierassign weight to each classifier

select top K classifiersselect top K classifiers

DWMBIDWMBI 3535

Page 36: Data Warehousing Mining & BI Data Streams Mining DWMBI1

DWMBIDWMBI 3636

Clustering Data Streams Base on the k-median method

Data stream points from metric space Find k clusters in the stream s.t. the sum of distances

from data points to their closest center is minimized Constant factor approximation algorithm

In small space, a simple two step algorithm:

1. For each set of M records, Si, find O(k) centers in S1, …, Sl

Local clustering: Assign each point in Si to its closest center

2. Let S’ be centers for S1, …, Sl with each center weighted by number of points assigned to it

Cluster S’ to find k centers

Page 37: Data Warehousing Mining & BI Data Streams Mining DWMBI1

DWMBIDWMBI 3737

Hierarchical Clustering Tree

data points

level-i medians

level-(i+1) medians

Page 38: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Hierarchical Tree and DrawbacksHierarchical Tree and Drawbacks Method:Method:

maintain at most m level-i mediansmaintain at most m level-i medians On seeing m of them, generate O(k) level-(i+1) On seeing m of them, generate O(k) level-(i+1)

medians of weight equal to the sum of the weights of medians of weight equal to the sum of the weights of the intermediate medians assigned to themthe intermediate medians assigned to them

Drawbacks:Drawbacks:

Low quality for evolving data streams (register only k Low quality for evolving data streams (register only k centers)centers)

Limited functionality in discovering and exploring Limited functionality in discovering and exploring clusters over different portions of the stream over timeclusters over different portions of the stream over time

DWMBIDWMBI 3838

Page 39: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Clustering for Mining Stream DynamicsClustering for Mining Stream Dynamics

Network intrusion detection: one example Network intrusion detection: one example

Detect bursts of activities or abrupt changes in real timeDetect bursts of activities or abrupt changes in real time——by on-by on-

line clustering line clustering

Our methodology Our methodology

Tilted time frame work: o.w. dynamic changes cannot be foundTilted time frame work: o.w. dynamic changes cannot be found

Micro-clustering: better quality than k-means/k-median Micro-clustering: better quality than k-means/k-median

incremental, online processing and maintenance)incremental, online processing and maintenance)

Two stages: micro-clustering and macro-clusteringTwo stages: micro-clustering and macro-clustering

With limited “overhead” to achieve high efficiency, scalability, With limited “overhead” to achieve high efficiency, scalability,

quality of results and power of evolution/change detectionquality of results and power of evolution/change detection

DWMBIDWMBI 3939

Page 40: Data Warehousing Mining & BI Data Streams Mining DWMBI1

DWMBIDWMBI 4040

CluStream: A Framework for Clustering Evolving Data Streams

Design goal High quality for clustering evolving data streams

with greater functionality While keep the stream mining requirement in

mind One-pass over the original stream data Limited space usage and high efficiency

CluStream: A framework for clustering evolving data streams Divide the clustering process into online and

offline components Online component: periodically stores

summary statistics about the stream data Offline component: answers various user

questions based on the stored summary statistics

Page 41: Data Warehousing Mining & BI Data Streams Mining DWMBI1

DWMBIDWMBI 4141

CluStream: Clustering On-line Streams

Online micro-cluster maintenance Initial creation of q micro-clusters

q is usually significantly larger than the number of natural clusters

Online incremental update of micro-clusters If new point is within max-boundary, insert into the

micro-cluster O.w., create a new cluster May delete obsolete micro-cluster or merge two

closest ones

Query-based macro-clustering Based on a user-specified time-horizon h and the

number of macro-clusters K, compute macroclusters using the k-means algorithm

Page 42: Data Warehousing Mining & BI Data Streams Mining DWMBI1

Stream Data Mining: Research IssuesStream Data Mining: Research Issues

Mining sequential patterns in data streamsMining sequential patterns in data streams

Mining partial periodicity in data streamsMining partial periodicity in data streams

Mining notable gradients in data streamsMining notable gradients in data streams

Mining outliers and unusual patterns in data streamsMining outliers and unusual patterns in data streams

Stream clusteringStream clustering

Multi-dimensional clustering analysis?Multi-dimensional clustering analysis?

Cluster not confined to 2-D metric space, how to incorporate Cluster not confined to 2-D metric space, how to incorporate

other features, especially non-numerical propertiesother features, especially non-numerical properties

Stream clustering with other clustering approaches?Stream clustering with other clustering approaches?

Constraint-based cluster analysis with data streams?Constraint-based cluster analysis with data streams?

DWMBIDWMBI 4242