a unified approach to spatial outliers detection chang-tien lu spatial database lab department of...

52
A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota [email protected]. edu http://www.cs.umn.edu/research/shashi-group

Post on 21-Dec-2015

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

A Unified Approach to Spatial Outliers Detection

Chang-Tien Lu

Spatial Database Lab Department of Computer Science

University of Minnesota

[email protected]

http://www.cs.umn.edu/research/shashi-group

Page 2: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Outline

Introduction Motivation General Definition of Spatial Outlier Related Work

Proposed Approach and Algorithm Evaluation of Proposed Approach Conclusions

Page 3: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Spatial Data Mining

Spatial Databases are too large to analyze manually NASA Earth Observation System (EOS) National Institute of Justice – Crime mapping Census Bureau, Dept. of Commerce - Census Data

Spatial Data Mining Discover frequent and interesting spatial patterns for

post processing (knowledge discovery) Pattern examples: outliers, crime hot spots, land-use

classification

Historical Example London, 1854

Cholera & water pump

Page 4: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Spatial Outlier

Definition: A data point that is extreme relative to its neighbors

Page 5: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Application Domain: Traffic Data

Page 6: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

An Example of Spatial Outlier

Spatial outlier: S, global outlier: G, L

Page 7: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Spatial Outlier Detection: Z s(x) approach

s

sxs

xSZ

)()(

))((

1)()( )( yf

kxfxS xNy

Function:

If

Declare x as a spatial outlier

Page 8: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Evaluation of Statistical Assumption

Distribution of traffic station attribute f(x) looks normal Distribution of looks normal too!

))((

1)()( )( yf

kxfxS xNy

Page 9: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Outlier Detection Tests

Page 10: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Outlier Detection Tests Related work

1-dimension: ignores geographic location Homogeneous multi-dimension: mixes location

with attributes Spatial outlier

2 classes of dimensions : location, attribute Neighborhood : based on location dimension Difference : compares attribute dimension

Comparison of outlier detection methods

Page 11: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Issues

Numerous tests Each has custom algorithm Add complexity to implement Spatial

Database Management System Desirable

Unified test A general algorithm to perform different

tests High performance

Page 12: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Our Contribution

A general definition of spatial outlier S - outlier

Show that existing definitions are special cases of

s -outlier Design efficient algorithms Analyze the computation structures Develop I/O cost models Evaluate alternate disk page clustering methods

Page 13: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

General Definition of Spatial Outlier

Given A spatial framework SF consisting of locations s1, s2, …,

sn

An attribute function f : si R

(R : set of real numbers) A neighborhood relationship N SF SF A neighborhood aggregation function : RN R A difference function Fdiff : R R R Statistic test function ST : R { True, False }

Test is based on Fdiff (f, (f, N) )

General definition: S -outlier An object O SF is a S -outlier (f, , Fdiff , ST )

if ST = = TRUE

Naggrf

Naggrf

Naggrf

Page 14: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Related Work- Spatial Outlier Tests

Different spatial outlier tests Spatial statistic approach Scatter plot approach (Luc Anselin ’94) Moran Scatter plot approach (Luc Anselin

’95) Variogram cloud approach (Graphic)

All these are special cases of S -outlier Show this for one case: scatter plot

Page 15: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Scatter Plot Approach

Lemma Scatter plot is a special case of S -outlier

Given An attribute function f(x) A neighborhood relationship

N(x) An aggregation function

A difference function Fdiff : є = E(x) – (m f(x) + b)

Detect spatial outlier by Statistic test function ST :

)(1

)(: )( yfk

xEf xNyN

aggr

f(x)

E(x

)

Page 16: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Outline

Introduction Proposed Approach and Algorithm

Problem formulation Our approach Efficient algorithm Cost model

Evaluation of Proposed Approach Conclusions

Page 17: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Problem Formulation

General definition: S -outlier An object O is a S -outlier (f, , Fdiff, ST ) if ST == TRUE

Design An efficient algorithm to detect S -outlier i,e., O= {si si SF , si is a spatial outlier}

Objective Efficiency: to minimize the computation time

Constraints Fdiff and ST are algebraic aggregate functions of

values of f and The size of the data set >> the main memory size Computation time is determined by I/O time

Naggrf

Naggrf

Page 18: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Aggregate Function

Distributive aggregate function F Global F value can computed by applying the G function to the

value of F in each partition of the data set, F = G for most cases

Algebraic aggregate function F Global F value can be computed using a fixed number of sub-

aggregates from each partition of the data set

Page 19: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Our Approach

Separate two phases Model building Testing (a node or a set of nodes)

Computation structure of model building Key insights:

Spatial self join using N(x) relationship Algebraic aggregate functions can be computed

in one disk scan of spatial join Computation structure of testing

Single node: spatial range query Get-All-Neighbors(x) operation

Page 20: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

An Example of Our Approach : Scatter plot

Model building An attribute function f(x) Neighborhood aggregate function Distributive aggregate functions

Algebraic aggregate functions

where ,

Testing Difference function

where

Statistic test function

)(1

)( )( yfk

xE xNy

)(),(),()(),(),( 22 xExfxExfxExf

22 ))(()(

)()()()(

xfxfN

xExfxExfNm

22

2

))(()(

))()(()()()(

xfxfN

xExfxfxExfNb

)2()( 2

nSmS xxyy n

xfxfSxx

2

2 ))(()(

n

xExESyy

2

2 ))(()(

))(()( bxfmxE )(1

)( )( yfk

xE xNy

Page 21: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Model Building Algorithm

Steps For each node x

Retrieve data record of x : (f(x), list of neighbors(x)) Get-All-Neighbors(x): Retrieve data records of neighbor(x)

If neighbor y is not in the memory buffer, request another I/O operation

Compute neighborhood aggregate function Accumulate distributive aggregate function: , ,..,

Compute algebraic aggregate function: , ,.. ,

I/O cost Dominant operation: Get-All-Neighbors(x) I/O cost of Get-All-Neighbors(x) is determined by

the clustering efficiency

Naggrf

1Daggrf

1Aaggrf 2A

aggrf Anaggrf

2Daggrf Dk

aggrf

Page 22: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Attribute Table

Node x

Attribute: volume f(x)

List of Neighbors N(x)

V1 125 V3, V5, V10

V2 130 V4, V6

V3 140 V1, V7

…. …. ….

V10 120 V1, V7, V12

Page 23: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Computation structure

Lemma: In model building: algebraic aggregate function

can be computed in one disk scan of the spatial self-join

Proof: A fixed k number of distributive aggregate

functions , ,.., are used to store the aggregate

values Algebraic aggregate functions are then computed

using these aggregate values In all cases, one needs a very small number of

distributive aggregate functions (k 10)

1Gaggrf 2D

aggrf Dkaggrf

Page 24: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Testing Algorithm

Steps For each node x

Retrieve data record of x : (f(x), list of neighbors(x))

Get-All-Neighbors(x): Retrieve data records of neighbor(x)

If neighbor y is not in the memory buffer Request another I/O operation

Compute difference function Fdiff

If test function ST == True Declare x as a spatial outlier

Page 25: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

I/O Cost Model

Definition CE: Clustering efficiency N: Total number of nodes Bfr: Blocking factor (number of nodes in a disk

page) K: Avg. number of neighbors for each node L: Number of nodes in a route

Cost model of A1

The cost to retrieve all nodes: The cost to retrieve neighbors of all nodes:

Cost model of A2

BfrN /

)1(/ CEKNBfrN

)1( CEKN

)1()1( CEKLCEL

Page 26: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Clustering Efficiency

CE definition: Probability [vi and a neighbor of vi are

stored in the same disk page] (Total number of unsplit edges)/(Total

number of edges)

Computation cost (I/O cost) is determined by Clustering Efficiency (CE)

Page 27: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Clustering Efficiency

An example

CE depends on Disk page size Node record size, edge distribution over nodes Clustering

method

66.096

939

CE

Page 28: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

I/O Cost Model

Definition CE : Clustering efficiency N : Total number of nodes Bfr : Blocking factor (number of nodes in a disk

page) K : Avg. number of neighbors for each node L : Number of nodes in a route

Cost model of Model Building

The cost to retrieve all nodes: The cost to retrieve neighbors of all nodes:

Cost model of Testing

BfrN /

)1(/ CEKNBfrN

)1( CEKN

)1()1( CEKLCEL

Page 29: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Outline

Introduction Proposed Approach and Algorithm Evaluation of Proposed Approach

Candidates (Clustering Methods) Experiment Design Results

conclusions

Page 30: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Experimental Evaluation (Summary)

Hypothesis: I/O cost of the algorithm is determined by the

clustering efficiency Physical Data Page Clustering Method

CCAM Cell Tree Z-order

Benchmark data Minneapolis- St. Paul traffic data (loop-detector)

Benchmark tasks Model building Testing

Metrics: Clustering efficiency(CE), I/O cost

Page 31: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Clustering Method: CCAM

Connectivity Clustered Access Method Cluster the nodes via min-cut graph partitioning Use B+ tree with Z-order as the secondary index

Page 32: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Clustering Method: CCAM

Page 33: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Clustering Methods: Cell Tree

Binary Space Partitioning (BSP) Decompose universe into disjoint convex subspaces Cannot exploit edge information, pure geometric

Page 34: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Clustering Method: Cell Tree

Page 35: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Clustering Method: Z-order

Impose a total order on the nodes Z-order = interleave (bits of X, bits of Y)

Page 36: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Clustering Method: Z-order

Page 37: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Experiment Design

Questions/Hypotheses What is the ranking of candidate

clustering methods? Is CE a predictor of relative performance

of clustering methods? Does cost model predict observed

ranking? What are the effects of parameters:

Disk page size Memory buffer size

Page 38: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Experiment Design

Page 39: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Model Building: Effect of Page Size

Fixed parameters: buffer size Variable parameters: page size, clustering strategy

CCAM has the best performance, the highest CE value High CE => Low I/O cost

Cost model: (N/Bfr)+N*K*(1-CE) Increase page size => reduce number of page accesses

Trends:

Configuration:

Page 40: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Model Building: Effect of Buffer Size

Fixed parameters Page size: 2 Kbytes Clustering Efficiency:

CCAM=0.81, Cell=0.69, Z-ord=0.51

Variable parameters Number of buffers Clustering strategies

Trends: Increase buffer size

=> reduce number of page accesses

CCAM has the best performance

Page 41: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Testing: Effect of Page Size

Fixed parameters: buffer size Variable parameters: page size, clustering strategy

CCAM has the best performance, the highest CE value High CE => Low I/O cost

Cost model: L*(1-CE)+L*K*(1-CE) Increase page size

Reduce number of page access, performance gap reduces

Configuration:

Trends:

Page 42: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Summary of Experimental Results

Model building and testing CCAM achieves higher clustering efficiency CCAM has lower I/O than Cell-Tree and Z-

order Higher CE leads to lower I/O cost

CE is a good predictor of relative I/O performance

Page size improves clustering efficiency of all methods Reduces performance gap between

methods

Page 43: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Outlier Station Detected

Page 44: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Minneapolis–St. Paul Traffic Data

Page 45: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Outlier Station Detected

Page 46: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Conclusion

A general definition of spatial outlier S -outlier models existing spatial outlier

definitions Scatter plot, Moran Scatterplot, Spatial Statistic, ..

Efficient algorithm to detect spatial outlier Model building & Testing

Recognize the computation structure of algorithms Algebraic aggregate functions on self join Get-All-Neighbor(x) dominates I/O cost

Develop algebraic cost models Evaluate alternate disk page clustering methods

CCAM, Cell-Tree, Z-order

Page 47: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Future Directions

Extend Spatial Outlier Detection Test Multi-attributes

Combination of traffic volume, speed, occupancy Location attribute includes time Temporal and Spatial-Temporal Outliers

Explore other spatial patterns beyond spatial outlier Land-use classification Co-locations

Fire ignition source feature Needle vegetation type feature Drought feature

Iterative Spatial Outlier Detection

Page 48: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Application Domain: Traffic Volume Matrix

Page 49: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Spatial Outlier Detection

Page 50: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Iterative Spatial Outlier Detection

Page 51: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Related Publications

Detecting Graph-based Spatial Outliers: Algorithms and Applications, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, September, 2001

Detecting Graph-based Spatial Outliers, the International Journal of Intelligent Data Analysis (IDA), Vol. 6, No 3. 2002

A Unified Approach to Spatial Outliers Detection, IEEE Transactions on Knowledge and Data Engineering. (under review)

Page 52: A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota ctlu@cs.umn.edu

Thank you !!!Thank you !!!

http://www.cs.umn.edu/~ctlu