computational geometry and spatial data mining
Marc van Kreveld (and Giri Narasimhan)
Department of Information and Computing Sciences
Utrecht University
Are the people clustered in this room? How do we define a cluster?
In spatial data mining we have objects/entities with a location given by coordinates
Cluster definitions involve distance between locations. How do we define distance?
• Determine whether clustering occurs
• Determine the degree of clustering
• Determine the clusters
• Determine the largest cluster
• Determine the largest empty region
• Determine the outliers
Are the men clustered? Are the women clustered?
Is there a co-location of men and women?
Determine regions favored exclusively by women. Men? Loners? Couples? Families?
Determine empty regions.
Like before, we may be interested in:
• is there co-location?
• the degree of co-location
• the largest co-location
• the co-locations themselves
• the objects not involved in co-location
• regions with no (or little) co-location
Locations have a time stamp
Interesting patterns involve space and time
Anomalies?
Entities with a trajectory (time-stamped motion path)
Interesting patterns involve subgroups with similar heading, expected arrival, joint motion, ...
n entities = trajectories; n = 10 – 100,000
t time steps; t = 10 – 100,000
input size is nt
m size of subgroup (unknown); m = 10 – 100,000
• Tracked animals (buffalo, birds, ...)
• Tracked people (potential terrorists)
• Tracked GSMs (e.g. for traffic purposes)
• Trajectories of tornadoes
• Sports scene analysis (players on a soccer field)
What is the location visited by most entities?
location = circular region of specified radius
[Figure: example circular locations, one visited by 4 entities and one visited by 3 entities]
• Compute the buffer of each trajectory
• Compute the arrangement of the buffers and the cover count of each cell
[Figure: arrangement of trajectory buffers, each cell labeled with its cover count]
One trajectory has t time stamps; its buffer can be computed in O(t log t) time
All buffers can be computed in O(nt log t) time
The arrangement can be computed in O(nt log (nt) + k) time, where k = O((nt)^2) is the complexity of the arrangement
Cell cover counts are determined in O(k) time
Total: O(nt log (nt) + k) time
If the most visited location is visited by m entities, this is O(nt log (nt) + ntm)
Note: input size is nt; n entities, each with a location at t moments
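The cover count of a point is the number of trajectory buffers that contain it, i.e. the number of trajectories that pass within the given radius of that point. The sketch below is not the arrangement-based algorithm from the slides: as a simplification, it evaluates the cover count only at candidate centers on a grid supplied by the caller. The function names and the grid parameters `xs` and `ys` are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p onto the line, clamped to the segment.
    u = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    u = max(0.0, min(1.0, u))
    return math.hypot(px - (ax + u * dx), py - (ay + u * dy))

def visits(trajectory, center, radius):
    """True if the polyline comes within `radius` of `center`,
    i.e. if `center` lies inside the buffer of the trajectory."""
    if len(trajectory) == 1:
        return math.hypot(center[0] - trajectory[0][0],
                          center[1] - trajectory[0][1]) <= radius
    return any(point_segment_distance(center, a, b) <= radius
               for a, b in zip(trajectory, trajectory[1:]))

def cover_count(trajectories, center, radius):
    """Number of trajectories whose buffer contains `center`."""
    return sum(visits(tr, center, radius) for tr in trajectories)

def most_visited_on_grid(trajectories, radius, xs, ys):
    """Best (cover count, center) over the sampled grid only (assumption:
    the grid is fine enough; the exact method uses the arrangement)."""
    return max((cover_count(trajectories, (x, y), radius), (x, y))
               for x in xs for y in ys)
```

With n trajectories of t vertices each, one cover count costs O(nt) time, so this is only practical for small inputs or coarse grids.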
Spatial data
• n points (locations)
• Distance is important: clustering pattern
• Presence of attributes (e.g. man/woman): co-location patterns
Spatio-temporal data
• n trajectories, each has t time steps
• Distance is time-dependent: flock pattern, meet pattern
• Heading and speed are important and are also time-dependent
Also co-location pattern
• Discovered simply by overlay
• E.g., occurrences of oaks on different soil types
What if it is known that the entities only occur in regions of a certain type?
[Figure: bird nests and radius of cluster, situation without subdivision]
[Figure: bird nests and radius of cluster, situation with land-water subdivision]
[Figure: burglary example; labels: house, car]
Determine clusters in point sets that are sensitive to the geographic context (at least, for the relevant aspects)
Assume that a set of regions is given to which the points are restricted. How should we define clusters?
Joint research with Joachim Gudmundsson (NICTA, Sydney) and Giri Narasimhan (U of F, Miami), 2006
Given a set P of points, a set F of regions, a radius r and a subset size m, a region-restricted cluster is a subset P' ⊆ P inside a circle C where
• P' has size at least m
• C has radius at most 2r
• C contains at most πr^2 area of regions of F
[Figure: circle with radius ≤ 2r and sum of region area ≤ πr^2, next to a circle of radius r]
Given a set P of n points, a set F of polygons with n_f edges in total, and values for r and m, report all region-restricted clusters of exactly m points
Exactly m points?
Every cluster with > m points consists of clusters with m points with smaller circles
“Real” clustering (partition)?
Outliers?
[Figure: example with m = 5]
1. Determine all smallest circles with m points of P inside
2. Test if the radius is ≤ r (report) or > 2r (discard)
3. If the radius is in between, determine the area of regions of F inside
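Steps 2 and 3 amount to a three-way decision per candidate circle. The sketch below makes that decision explicit. The names `classify_circle`, `area_bound` (the maximum allowed area of regions of F inside the circle) and `area_inside` (a callable that computes that area on demand) are hypothetical; they are introduced only for illustration.

```python
def classify_circle(radius, r, area_bound, area_inside):
    """Decide whether a smallest circle around m points is a cluster.

    radius      -- radius of the candidate circle
    r           -- the radius parameter of the cluster definition
    area_bound  -- maximum allowed area of regions of F inside the circle
    area_inside -- zero-argument callable returning the area of regions
                   of F inside the circle (only called when needed)
    """
    if radius <= r:
        # Step 2: small enough, report without looking at the regions.
        return "report"
    if radius > 2 * r:
        # Step 2: too large, no region test can rescue it.
        return "discard"
    # Step 3: radius in (r, 2r], so the region area decides.
    return "report" if area_inside() <= area_bound else "discard"
```

Passing the area computation as a callable keeps the expensive step 3 out of the two cheap cases of step 2.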
1. Determine all minimal circles with m points of P inside
2. Determine all minimal circles with 3 points of P inside
[Figure: ordinary Voronoi diagram = order-1 VD]
1. Determine all smallest circles with m points of P inside
• Use (m-2)-th order Voronoi diagram: cells where the same (m-2) points are closest
• Its vertices are centers of smallest circles around exactly m points
[Figures: ordinary (= order-1) VD, order-2 VD, order-3 VD]
The m-th order (or (m-2)-th order) Voronoi diagram has O(nm) cells, edges, and vertices
It can be constructed in O(nm log n) time
We get O(nm) smallest circles with m points inside; for each we also know the radius
2. Test if the radius is ≤ r (report) or > 2r (discard)
Trivial in O(1) time per circle, so in O(nm) time overall
3. Determine the area of regions of F inside
Brute force: O(n_f) time per circle, so O(nm n_f) time overall
Complication: this need not give all region-restricted clusters!
• Need to compute the area of F inside a circle with moving center
• Requires solving high-degree polynomials
The anti-climax: we cannot give an exact algorithm!
If we take squares instead of circles, we can deal with the problem ...
3. Determine the area of regions of F inside
Brute force: O(n_f) time per square, so O(nm n_f) time overall
The total time for steps 1, 2, and 3 is O(nm log n) + O(nm) + O(nm n_f) = O(nm log n + nm n_f)
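The brute-force step for squares can be sketched as follows: clip every polygon of F to the square and sum the areas of the clipped pieces. This sketch uses Sutherland-Hodgman clipping and the shoelace formula, which is one possible choice and not necessarily what the authors used. It assumes axis-parallel squares, simple polygons without holes, and polygons of F that do not overlap each other. All function names are hypothetical.

```python
def clip_halfplane(poly, inside, intersect):
    """One Sutherland-Hodgman pass: keep the part of poly where inside() holds."""
    out = []
    for i in range(len(poly)):
        cur, prev = poly[i], poly[i - 1]
        if inside(cur):
            if not inside(prev):
                out.append(intersect(prev, cur))
            out.append(cur)
        elif inside(prev):
            out.append(intersect(prev, cur))
    return out

def clip_to_square(poly, cx, cy, half):
    """Clip a polygon (list of (x, y)) to the square with center (cx, cy)
    and half side length `half`."""
    xmin, xmax, ymin, ymax = cx - half, cx + half, cy - half, cy + half

    def cut_x(x0):
        # Intersection of edge pq with the vertical line x = x0.
        return lambda p, q: (x0, p[1] + (x0 - p[0]) / (q[0] - p[0]) * (q[1] - p[1]))

    def cut_y(y0):
        # Intersection of edge pq with the horizontal line y = y0.
        return lambda p, q: (p[0] + (y0 - p[1]) / (q[1] - p[1]) * (q[0] - p[0]), y0)

    passes = [
        (lambda p: p[0] >= xmin, cut_x(xmin)),
        (lambda p: p[0] <= xmax, cut_x(xmax)),
        (lambda p: p[1] >= ymin, cut_y(ymin)),
        (lambda p: p[1] <= ymax, cut_y(ymax)),
    ]
    for inside, intersect in passes:
        if not poly:
            break
        poly = clip_halfplane(poly, inside, intersect)
    return poly

def polygon_area(poly):
    """Unsigned area by the shoelace formula (0 for an empty polygon)."""
    twice = sum(poly[i - 1][0] * poly[i][1] - poly[i][0] * poly[i - 1][1]
                for i in range(len(poly)))
    return abs(twice) / 2.0

def region_area_in_square(polygons, cx, cy, half):
    """Total area of the polygons inside the square; linear in the
    total number of polygon edges, matching the O(n_f) brute-force bound."""
    return sum(polygon_area(clip_to_square(p, cx, cy, half)) for p in polygons)
```

Each polygon edge is handled a constant number of times, so one square costs time linear in the total edge count.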
3. Determine the area of regions of F inside
Using a suitable data structure (only possible for squares): O(log^2 n_f) time per square, so O(nm log^2 n_f) time overall
The total time becomes O(nm log n + n_f log^2 n_f + nm log^2 n_f)
• nm log n: order-(m-2) VD construction
• n_f log^2 n_f: preprocessing of data structure
• nm log^2 n_f: total query time in data structure
The squares solution generalizes to regular polygons (e.g. 20-gons)
An approximation of the radius within (1+ε)r gives an O(n/ε^2 + n_f log^2 n_f + n log n_f / (mε^2)) time algorithm
[Figure: 16-gon]
Open problems:
• Develop a region-restricted version of k-means clustering, single link clustering, ...
• Region-restricted co-location?
• Replace region-restricted by gradual model
[Figure: gradual model with values 0 /unit, 2 /unit, 5 /unit, 8 /unit; labels: typical, clusters]
n trajectories, each with t time steps: n polygonal lines with t vertices
Already looked at most visited location
Patterns in trajectories
• Flock: near positions of (sub)trajectories for some subset of the entities during some time
• Convergence: same destination region for some subset of the entities
• Encounter: same destination region with same arrival time for some subset of the entities
• Similarity of trajectories
• Same direction of movement, leadership, ...
[Figure: flock and convergence]
Patterns in trajectories
• Flocking, convergence, encounter patterns
– Laube, van Kreveld, Imfeld (SDH 2004)
– Gudmundsson, van Kreveld, Speckmann (ACM GIS 2004)
– Benkert, Gudmundsson, Huebner, Wolle (ESA 2006)
– ...
• Similarity of trajectories
– Vlachos, Kollios, Gunopulos (ICDE 2002)
– Shim, Chang (WAIM 2003)
– ...
• Lifelines, motion mining, modeling motion
– Mountain, Raper (GeoComputation 2001)
– Kollios, Sclaroff, Betke (DM&KD 2001)
– Frank (GISDATA 8, 2001)
– ...
Patterns in trajectories
• Flock: near positions of (sub)trajectories for some subset of the entities during some time
– clustering-type pattern
– different definitions are used
• Given: radius r, subset size m, and duration T, a flock is a subset of size m that is inside a (moving) circle of radius r for a duration T
Patterns in trajectories
• Longest flock: given a radius r and subset size m, determine the longest time interval for which m entities were within each other’s proximity (circle radius r)
[Figure: trajectories over time 0 to 8; for m = 3 the longest flock is in [1.8, 6.4]]
Patterns in trajectories
• Meet: near some position of (sub)trajectories for some subset of the entities
– clustering-type pattern
• Given: radius r, subset size m, and duration T, a meet is a subset of size m that is inside a (stationary) circle of radius r for a duration T (this was “moving” for flock)
Patterns in trajectories
• The same subset required for a flock or meet?
Example: meet with m = 4; duration is 3+ time steps or 4+ time steps?
Patterns in trajectories
[Figure: examples for m = 3 of flock and meet, each with fixed subset and variable subset]
Patterns in trajectories
Exact results (input size is n)

         fixed subset                      variable subset
flock    NP-hard                           O(n^3 log n)
meet     O(n^4 [?]^2 log n + n^2 [?]^3)    O(n^4 [?]^2 log n + n^2 [?]^3)
Patterns in trajectories
• A radius-2 approximation of the longest flock can be computed in time O(n^2 log n)
... meaning: if the longest flock of size m for radius r has duration T, then we surely find a flock of size m and duration at least T for radius 2r
[Figure: longest flock for r, and an at least as long flock for 2r]
Patterns in trajectories
Approximate radius results (input size is n)

               fixed subset                           variable subset
flock  apx     O(n^2 log n), factor 2                 O((n^2 log n) / ε^2), factor 2+ε
flock  exact   NP-hard                                O(n^3 log n)
meet   apx     O((n^2 log n) / (mε^2)), factor 1+ε    O((n^2 log n) / (mε^2)), factor 1+ε
meet   exact   O(n^4 [?]^2 log n + n^2 [?]^3)         O(n^4 [?]^2 log n + n^2 [?]^3)
Fixed subset flock
• It is NP-complete to decide if a graph has a subgraph with m nodes that is a clique
For every node of the graph, make an entity with a trajectory
[Figure: graph on nodes v1, ..., v7 and the trajectories built from it; labels: all nodes not adjacent to v1 go here; v1 is not adjacent to v4, v5, and v7; radius r]
Fixed subset flock
[Figure: graph on nodes v1, ..., v7 and its trajectories; labels: v4 not in flock, v4 in flock]
Fixed subset flock
The trajectories have a fixed flock of size m and full duration if and only if the graph has a clique of size m
[Figure: graph on nodes v1, ..., v7 and its trajectories; flock {v4,v5,v7} of (full) duration 23 (3·7+2) and size 3]
Fixed subset flock
• Longest fixed flock is NP-hard
• Max clique is hard to approximate, so we cannot approximate the duration, nor the flock size
• The reduction applies for all radii < 2r
[Figure: trajectories of v1, ..., v7; labels: v4 not in flock, v4 in flock]
Flock and meet algorithms
• Go into 3D (space-time) for algorithms
[Figure: a flock and a meet drawn in space-time, with the time axis running from 0 to 4]
Fixed subset flock, approximation
• An efficient radius-2 approximation algorithm of longest fixed flock exists
• Idea: if some vi is in the longest flock, then all other entities of that flock are within distance 2r from vi
[Figure: column of radius 2r, centered at vi, containing the flock with vi]
Fixed subset flock, approximation
• For each vj, we can determine the O([?]) time intervals where vj is in the column of vi
• Maintain the intersections for all entities in an augmented tree in O(n log n) time
• Do this for all columns (role of vi) and report longest overall pattern
Total: O(n^2 log n) time
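The column idea can be illustrated with a brute-force, discrete-time sketch. It is not the augmented-tree algorithm with the O(n^2 log n) bound: it checks proximity only at the sampled time steps and ignores the motion between them, and it runs much slower. The function name and the input layout `positions[e][k]` (position of entity e at time step k) are assumptions made for this example.

```python
import math

def longest_fixed_flock_2r(positions, r, m):
    """Longest run of consecutive time steps during which a fixed set of at
    least m entities stays within distance 2r of some center entity vi.

    positions[e][k] is the (x, y) position of entity e at time step k.
    Returns (number of steps, first step, members), or (0, None, None).
    """
    n, steps = len(positions), len(positions[0])
    best = (0, None, None)
    for i in range(n):  # entity i plays the role of vi
        # near[k]: entities inside the radius-2r column of vi at step k
        # (always contains i itself, at distance 0).
        near = [{j for j in range(n)
                 if math.dist(positions[i][k], positions[j][k]) <= 2 * r}
                for k in range(steps)]
        for start in range(steps):
            members = set(range(n))
            for end in range(start, steps):
                members &= near[end]  # entities in the column for the whole run
                if len(members) < m:
                    break
                if end - start + 1 > best[0]:
                    best = (end - start + 1, start, frozenset(members))
    return best
```

If a flock of radius r containing vi exists over some run of steps, all its members are within 2r of vi during that run, so the run found here is at least as long, which is the radius-2 guarantee restricted to the sampled steps.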
Variable subset flock, exact
• The subset that forms the flock may change entities, but must stay of size m
• Any flock subset at any instant has a disk D of radius r with at least 2 entities on the boundary: the defining entities
[Figure: disk of radius r with its defining entities]
Variable subset flock, exact
• Two entities define two cylinders through time by tracing the two possible radius r disks
Variable subset flock, exact
• A critical moment is where another entity is on the boundary of the disk; it may go outside or inside
Variable subset flock, exact
• At a critical moment:
– a variable subset flock may start (m entities)
– a variable subset flock may stop (<m entities)
– three pairs of defining entities have disks that coincide
• There are also critical moments when two entities are at distance exactly 2r
• Between two time steps t_i and t_{i+1} there are O(n^3) critical moments; in total there are O(n^3 [?]) critical moments
[Figure: two entities at distance 2r]
Variable subset flock, exact
• Let the O(n^3 [?]) critical moments be the nodes in a directed acyclic graph G
• Edges of G are between two consecutive critical moments of the same two defining entities
– directed from earlier to later
– weight is time between critical moments
– only if at least m entities are inside the disk
A longest variable subset flock is a maximum weight path in G
Variable subset flock, exact
• The graph G can be built in O(n^3 log n) time
• A maximum weight path can be found in O(n^3 log n) time
A longest variable subset flock is a maximum weight path in G
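Because every edge of G is directed from an earlier to a later critical moment, sorting the nodes by time gives a topological order, and a maximum weight path follows from one dynamic programming pass. The sketch below shows only this last step, not the construction of G. It assumes the nodes are already numbered 0, 1, ... in time order and that edges are given as hypothetical triples `(u, v, weight)` with u < v.

```python
def max_weight_path(num_nodes, edges):
    """Maximum weight path in a DAG whose nodes are numbered in time order.

    edges is a list of (u, v, weight) with u < v (earlier to later).
    Returns (total weight, list of nodes on the path).
    """
    best = [0.0] * num_nodes   # best[v]: max weight of a path ending at v
    pred = [None] * num_nodes  # pred[v]: previous node on that path
    # Sorting by source node means best[u] is final before any edge
    # leaving u is relaxed, since all edges into u start at smaller nodes.
    for u, v, w in sorted(edges):
        if best[u] + w > best[v]:
            best[v] = best[u] + w
            pred[v] = u
    end = max(range(num_nodes), key=best.__getitem__)
    path = [end]
    while pred[path[-1]] is not None:
        path.append(pred[path[-1]])
    return best[end], path[::-1]
```

With edge weights equal to the time between critical moments, the returned weight is the duration of the longest variable subset flock represented in the graph.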
Patterns in trajectories, summary
• Flock and meet patterns require algorithms in 3-dimensional space (space-time)
• Exact algorithms are inefficient: only suitable for smaller data sets
• Approximation can reduce running time by one or two orders of magnitude
Patterns in trajectories, summary

               fixed subset                           variable subset
flock  apx     O(n^2 log n), factor 2                 O((n^2 log n) / ε^2), factor 2+ε
flock  exact   NP-hard                                O(n^3 log n)
meet   apx     O((n^2 log n) / (mε^2)), factor 1+ε    O((n^2 log n) / (mε^2)), factor 1+ε
meet   exact   O(n^4 [?]^2 log n + n^2 [?]^3)         O(n^4 [?]^2 log n + n^2 [?]^3)
Future research on longest trajectories
• Faster exact and approximation algorithms
• Better approximation factors
• Remove restriction of fixed shape of flocking region (compact or elongated both possible during same flock)
• Longest duration convergence
[Figure: longest convergence]
Flock and meet patterns require algorithms in 3-dimensional space (space-time)
Exact algorithms are inefficient: only suitable for smaller data sets
Approximation can reduce running time by an order of magnitude
With an exact definition of a spatial or spatio-temporal pattern, geometric algorithms can be used to compute all patterns
Many known structures from computational geometry are useful (Voronoi diagrams, arrangements, ...)
Since the (exact) algorithms may be inefficient, approximation may be a solution
What patterns must be detected in practice (both spatial and spatio-temporal)?
What is the most appropriate definition (formalization) of these?
Spatial association rules, auto-correlation, irregularities, classification, ... and other computable things in spatial/spatio-temporal data mining