Real time Geodemographics: Requirements and Challenges

Muhammad Adnan, Paul Longley


Post on 12-Nov-2014


DESCRIPTION

This presentation compares different clustering algorithms by their computational time. This is the first step towards creating open source and bespoke Geodemographic classifications in near real time.

TRANSCRIPT

Page 1: Real Time Geodemographics

Real time Geodemographics: Requirements and Challenges

Muhammad Adnan, Paul Longley

Page 2: Real Time Geodemographics

Current Geodemographic classifications

• Census data
  • E.g. the OA (Output Area) dataset has 41 census variables.

• Variables are weighted according to their importance in classification.

• The K-means clustering algorithm is used to cluster data into homogeneous groups.
  • Multiple runs of K-means are needed because of its instability.
  • 10,000 runs (Singleton, 2008)

Page 3: Real Time Geodemographics

Need for real time Geodemographics

• Current classifications are created using static data sources.

• Rate and scale of current population change is making large surveys (census) increasingly redundant.
  • Significant hidden value in transactional data

• Data is increasingly available in near real time, e.g. the ONS NESS API.

• Application-specific (bespoke) classifications have demonstrated utility (Longley & Singleton, 2009).

Page 4: Real Time Geodemographics

What are real time Geodemographics?

Specification Estimation Testing

Page 5: Real Time Geodemographics

Computational challenges

• Integration of large and possibly disparate databases.
  • E.g. NHS data; Census data

• Data normalisation and optimisation for fast transactions.

• Minimising the computational time of clustering algorithms (very important).

• Common protocol
  • XML (SOAP)

• Use of non-traditional data sources (Singleton, 2008).
  • E.g. Flickr; Facebook

Page 6: Real Time Geodemographics

Important Challenge: Selection of clustering algorithm

• K-means
• PAM (Partitioning Around Medoids)
• CLARA (Clustering Large Applications)
• GA (Genetic Algorithm)

Page 7: Real Time Geodemographics

K-means

• Attempts to find cluster centroids by minimising the within-cluster sum of squared distances.

• K-means is unstable because its result depends on the initial seed assignment.
  • Sensitive to outliers.

• Creating a Geodemographic classification requires running the algorithm multiple times.
  • 10,000 times (Singleton, 2008)
  • Computationally expensive in a real time environment.
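
The restart strategy described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, random initial seeds drawn from the data, and the helper names `kmeans_once` and `kmeans_restarts` are hypothetical. Each run may land on a different local optimum, so the run with the lowest within-cluster sum of squares is kept.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from random seeds; returns (wss, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    wss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return wss, labels

def kmeans_restarts(points, k, n_runs=20, seed=0):
    """Repeat k-means n_runs times and keep the run with the lowest WSS."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_runs)]
    best_wss, best_labels = min(runs, key=lambda r: r[0])
    return best_wss, best_labels, [r[0] for r in runs]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
best_wss, best_labels, all_wss = kmeans_restarts(points, k=2)
```

The spread of values in `all_wss` gives a rough picture of how unstable the algorithm is on a given data set. The cost of the whole procedure grows linearly with `n_runs`, which is why 10,000 runs are hard to fit into a real time setting.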

Page 8: Real Time Geodemographics

K-means (100 runs of k-means on OAC data set for k=4)

Page 9: Real Time Geodemographics

An example of bad clustering result (K-means)

Page 10: Real Time Geodemographics

An example of bad clustering result (K-means)

Page 11: Real Time Geodemographics

An example of bad clustering result (K-means)

Page 12: Real Time Geodemographics

Alternate Clustering Algorithms

• PAM (Partitioning Around Medoids) tries to minimise the sum of distances of the objects to their cluster centres (medoids).
  • Less sensitive to outliers than K-means.
  • Cannot handle larger data sets.

• CLARA (Clustering Large Applications) draws multiple samples of the dataset, applies PAM to each sample and returns the best result.

• GA (Genetic Algorithm) is inspired by models of biological evolution. It produces results through a breeding procedure.
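
The relationship between PAM and CLARA can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference algorithms: the `pam` function here is a naive swap search from random starting medoids, distances are Euclidean, and the parameter names `n_samples` and `sample_size` are hypothetical. The point it shows is that CLARA runs PAM only on small samples but judges each candidate set of medoids against the full data set.

```python
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, rng):
    """Naive PAM: start from random medoids, accept any swap that lowers cost."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in points:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """Run PAM on several random samples; keep the medoids that fit all data best."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)  # judged on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```

Because the swap search only ever sees `sample_size` points, its cost no longer depends on the size of the full data set, which is the property that makes CLARA attractive when PAM itself is too slow.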

Page 13: Real Time Geodemographics

This paper compares

• K-means
• CLARA
• GA

using three data normalisation techniques:

• Z-scores
• Range standardisation
• Principal Component Analysis (PCA)

• Algorithm stability of K-means, CLARA, and GA

Page 14: Real Time Geodemographics

Data normalisation techniques used

• Z-scores
  • Widely used variable normalisation technique
  • Can create outliers in the data sets

• Range standardisation
  • Standardises values to the range 0-1
  • Can erase interesting patterns in the data

• Principal Component Analysis
  • Reduces the dimensions of a data set
  • Can erase interesting patterns in the data
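
The first two techniques can be written down in a few lines. This is a minimal sketch, assuming the population standard deviation for z-scores (the slides do not say which variant was used) and hypothetical function names. PCA is left out because it needs an eigen-decomposition routine that the standard library does not provide.

```python
from statistics import mean, pstdev

def z_scores(column):
    """Centre a variable on its mean and scale by its standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]  # assumes sigma > 0

def range_standardise(column):
    """Rescale a variable linearly so its minimum is 0 and its maximum is 1."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]  # assumes hi > lo

def normalise_table(rows, fn):
    """Apply a column-wise normaliser fn to a list of equal-length rows."""
    columns = [fn(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

rows = [[1, 10], [2, 20], [3, 30]]
scaled = normalise_table(rows, range_standardise)
```

Z-scores are unbounded, so a single extreme value stays extreme after scaling, whereas range standardisation squeezes every variable into 0-1 regardless of how its values are distributed.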

Page 15: Real Time Geodemographics

Comparing computational efficiency (Z-scores)

PAM, and GA on the three geographic aggregations of a dataset covering London.

Figure 1: OA (Output Area) level results

Figure 2: LSOA (Lower Super Output Area) level results

Figure 3: Ward level results

Page 16: Real Time Geodemographics

Comparing computational efficiency (Range Standardisation)

PAM, and GA on the three geographic aggregations of a dataset covering London.

Figure 4: OA (Output Area) level results

Figure 5: LSOA (Lower Super Output Area) level results

Figure 6: Ward level results

Page 17: Real Time Geodemographics

Comparing computational efficiency (PCA)

PAM, and GA on the three geographic aggregations of a dataset covering London.

Figure 7: OA (Output Area) level results

Figure 8: LSOA (Lower Super Output Area) level results

Figure 9: Ward level results

Page 18: Real Time Geodemographics

Algorithm Stability (w.r.t. Computational time)

Figure 10: Running k-means on OA (Output Area) 120 times on each iteration

Figure 11: Running CLARA on OA (Output Area) 120 times on each iteration

Figure 12: Running GA on OA (Output Area) 120 times on each iteration
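
A timing experiment of this kind could be set up roughly as below. This is a sketch only: the slides do not describe the measurement harness, so `time_runs` and `summarise` are hypothetical helpers, and the workload passed in is a stand-in for a real clustering call.

```python
import time
from statistics import mean, stdev

def time_runs(fn, n_runs=120):
    """Call fn() n_runs times; return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

def summarise(durations):
    """Summary statistics that indicate how stable the running time is."""
    return {"mean": mean(durations), "stdev": stdev(durations), "max": max(durations)}

# Stand-in workload; replace with a call to the clustering algorithm under test.
durations = time_runs(lambda: sorted(range(10_000), reverse=True), n_runs=120)
stats = summarise(durations)
```

A small standard deviation relative to the mean suggests that the running time is predictable from one run to the next, which matters for a service that has to respond in near real time.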

Page 19: Real Time Geodemographics

K-means and Principal Component Analysis

• PCA can be used to facilitate K-means clustering by reducing dimensions (Ding & He, 2004).

Figure 13: K-means result for 41 “OAC variables”

Figure 14: K-means result for 26 “OAC Principal Components”

K=4 (99% similar)
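
The slides do not say how the 99% similarity was measured. One common way to compare two cluster assignments of the same areas is the Rand index, sketched below as an assumption rather than as the authors' method. It counts the share of pairs of areas on which the two results agree (both place the pair together, or both place it apart), so it does not matter how the cluster numbers are labelled.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by two clusterings."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

# Identical partitions with swapped label numbers still agree on every pair.
score = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

This version compares every pair and so scales quadratically with the number of areas; for a data set the size of all Output Areas, a contingency-table formulation would be the more practical choice.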

Page 20: Real Time Geodemographics

K-means and Principal Component Analysis

• PCA can be used to facilitate K-means clustering by reducing dimensions (Ding & He, 2004).

Figure 13: K-means result for 41 “OAC variables”

Figure 14: K-means result for 26 “OAC Principal Components”

[Chart: Time (s), from 0 to 5, against No. of clusters, from 1 to 49, for two series: Kmeans and PCA_Kmeans]

Page 21: Real Time Geodemographics

Conclusion

• CLARA is a plausible alternative to k-means in a real time Geodemographic classification system.

• K-means might be combined with PCA to reduce computation time.

• In an online environment, k-means is better for small data sets.

• Exploration of non-traditional data sources.