geographic data mining marc van kreveld seminar for give block 1, 2003/2004
TRANSCRIPT
About …
• A form of geographical analysis• Current topic of interest in GIS research
(and database research and AI research)
• Finding hidden information in large collections of geographic data
This seminar
• Learning about a topic together• Presenting to each other + interaction• Added value by good examples:
– for important concepts, algorithms– possibly self-thought of, or extended– referring to GIS data and issues (hence the
GIS course prerequisite)
• Written assignment: joint survey
Material
• Book by Harvey Miller and Jiawei Han (editors): selected chapters
• Possibly: papers from conference proceedings
• Mostly provided by me
Weeks
• Week 36-46• Probably:
– Not September 4 (this Thursday)– Not in week 40 (Sept. 29 & Oct. 2)– Not October 23
• The above depending on participation!
Overview of Geographic Data Mining & Knowledge
Discovery
Chapter 1 of the book
• KDD: knowledge discovery in databases• Data warehouses• Data mining• Geographic aspects of the above
Knowledge Discovery in Databases (KDD)
• Large databases contain interesting patterns: non-random properties and relationships that are:– valid (general enough to apply to new data)– novel (non-trivial and unexpected)– useful (leads to effective action: decision
making or investigation)– ultimately understandable (simple, and
interpretable by humans)
Knowledge Discovery in Databases (KDD)
• Because of quantity of data nowadays• Because we want information, not data• Because computing power allows it
KDD opposed to statistics
• Statistics– small and clean numeric database– scientifically sampled– specific questions in mind
• KDD: none of the above
KDD techniques
• Statistics• Machine learning• Pattern recognition• Numeric search (?)• Scientific visualization
Data warehouse
• Large repository of data• For analytical processing
(DB: transactional processing)• Heterogeneous: different sources and
formats (DB: homogeneous)
• Supports OLAP tools (OnLine Analytical Processing)
OLAP Example
• Measure of interest: sales• Dimensions of interest: item, store, week
• (item, store, week) money
[quantity sold times price ]
OLAP Example
• 2-dim. aggregation:(item, store, . ) money
• Another 2-dim. aggregation: sales by store and by week
• 1-dim. aggregation: sales by week (all items and stores)
• Data cube: all 2d possible aggregations, different types of summaries
KDD steps
• Data selection• Data pre-processing• Data enrichment• Data reduction and projection• Data mining• Interpretation and reporting
Presence of steps and order not fixed
KDD steps
• Data selection• Data pre-processing: removing noise,
duplicate records, handling missing data, …
KDD steps
• Data selection• Data pre-processing• Data enrichment: combining the
selected data with external data
KDD steps
• Data selection• Data pre-processing• Data enrichment• Data reduction and projection:
reduction in number, reducing dimension
KDD steps
• Data selection• Data pre-processing• Data enrichment• Data reduction and projection• Data mining: uncovering information,
interesting patterns
KDD steps
• Data selection• Data pre-processing• Data enrichment• Data reduction and projection• Data mining• Interpretation and reporting: evaluating,
understanding, communicating
Data mining
• Segmentation• Dependency analysis• Deviation and outlier analysis• Trend detection• Generalization and characterization
DM - segmentation
Description:• Clustering: finding a
finite set of implicit classes
• Classification: mapping data items into pre-defined classes
Techniques:• Cluster analysis• Bayesian
classification• Decision or
classification trees• Artificial neural
networks
DM – dependency analysis
Description:• Finding rules to
predict the value of some attribute based on other attributes
Techniques:• Bayesian networks• Association rules
(4, 12, 0.24)(3, 14, 0.21)(7, 13, 0.43)(2, 9, 0.78)(11, 11, 0.55)
(5, 11, ???)
(???, 12, 0.51)
DM – dependency analysis
• Confidence and support measures for association rules of the form:[ if X then Y ]:
confidence = #(X and Y in DB) / #(X in DB)support = #(X and Y in DB) / #(all in DB)
DM – deviation & outlier analysis
Description:• Finding data with
unusual deviations (=errors, or data of particular interest)
Techniques:• Clustering, other
mining methods• Outlier analysis
DM – trend detection
Description:• Finding lines, curves,
summarizing the database (often as a function over time)
Techniques:• Regression• Sequential pattern
extraction
DM – generalization and characterization
Description:• Obtaining compact
descriptions of the data
Techniques:• Summary rules• Attribute-oriented
induction
concept hierarchy low level concept
higher level concept
Visualization and knowledge discovery
• KDD is difficult to automate steered by human intelligence
• Visualization helps to understand the data and which data mining techniques to try
KD + geography
• Special case of KDD• Other special cases
– marketing– biology– astronomy
• Main features: location, distance, dimen-sionality (with dependent dimensions)
KD + geography
(attr1, attr2, attr3, attr4); attr’s are numbers and (relatively) independent: statistics
(attr1, attr2, attr3, attr4); attr’s can also be on other measurement scales: KDD
(attr1, attr2, attr3, attr4); attr’s are often dependent and can be shapes: KD + geography
Often: (lat., long., attr1, attr2, …)or: (shape description, attr1, attr2, …)
KD + geography
• Study of scalable versions of DM tasks (in lat. and long.)
• Certain dimensions can be non-metric (travel time need not be symmetric)
• DM in data that is not in the form of tuples: sets of thematic map layers
Geographic data mining
• Spatial segmentation (clustering, classification)
• Spatial dependency (spatial association rules)
• Spatial trend detection• Geographic characterization and
generalization
GDM – spatial association rules
• Example: If a location is within 500 m from water and the average winter temperature is at least –2 degreesthen there are frogs around
distance relationship
GDM – spatial trend detection
• Patterns of change with respect to neighborhood of some object
• Example: (North America) Further from Pacific ocean fewer earthquakes
GDM - applications
• Map interpretation• Remote sensing interpretation• Environmental mapping (soil type, etc.)• Extracting spatio-temporal patterns for
cyclones, crimes• Spatial interaction (movement/flow of
people, capital, goods)
Conclusions
• GDM & GKD is an extension of (tool for) geographical analysis
• GDM is different from DM due to– Geographic spaces, not attribute space– Neighborhood is extremely important– Scale issues– Data is different– Applications (interesting patterns to mine
for) are different
This seminar on GDM
• First: chapters from the book– CH 1: GDM & KD: an overview (today)– CH 2: Paradigms for spatial and spatio-temporal DM(11-9)– CH 3: Fundamentals of spatial DW for GKD (15-9)– CH 7: Algorithms and applications of SDM (Ronny)(18-9)– CH 8: Spatial clustering in DM (22-9)– CH 6: Modeling spatial dependencies (25-9)
(not: 29-9 and 2-10)
– CH 9: Detecting outliers (6-10)– CH 10: Knowledge construction based on GVis and KDD– CH 14: Mining mobile trajectories
This seminar
• All PowerPoint presentations on the Web page of the course
• Survey paper or written exam; possible topics for survey:– Hierarchical clustering– Clustering with obstacles– Proximity relationship mining– …
• Or: joint survey of (geometric) algorithms for GDM