A Geodemographic Classification of
Ireland at Garda Sub-district Level
A Tool for Comparing Sub-districts of An Garda Síochána
Gary Russell
M.Sc. Geocomputation
National Centre for Geocomputation,
Maynooth University
2016
Professor Christopher Brunsdon Head of Department
Martin Charlton Programme Coordinator and Thesis Supervisor
Abstract
This thesis uses demographic data from the 2011 census to classify the catchment
areas of Garda stations in the Republic of Ireland (referred to as Garda Sub-districts). This
was accomplished by subjecting the data at a Garda Sub-district level to principal
component analysis and then using clustering techniques on the resulting principal
components. Similar (geodemographically speaking) Garda station areas were then
visualised using mapping techniques based on Central Statistics Office boundary shape
files. The clusters of Garda Sub-districts were named and described using the distribution
of the characteristic variables compared to the global mean of each variable. As an
example of the use for such data manipulation a mini Atlas of Garda Sub-districts was
created by comparing the crime figures of Garda Sub-districts within the resulting clusters
and visualising the results.
Acknowledgements
I would like to acknowledge and thank Professor Chris Brunsdon, Martin Charlton,
and Dr Ronan Foley, as well as all the staff and researchers in the Maynooth University
Departments of Geography and Computer Science, the National Centre for
Geocomputation and the All Ireland Research Observatory, for their inspiration and
assistance in completing this thesis. I would also like to thank my three classmates for
their support and for putting up with my using them as sounding boards over the past year.
Lastly, I would like to thank my wife, Siobhan, and our two young children, Aoibhínn and
Joshua, for their unwavering support and belief that I could complete my B.A. and M.Sc.
despite the odd grumpy daddy moment.
Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Maps
List of Tables
Introduction
Literature Review
  General
  Theory
  Modifiable Areal Unit Problem
  Variables
    Selection
    Correlation
    Units
  Clustering
    Methods
    Cluster Description
Data
Building the Classification
  Variables
  Principal Components Analysis
  Clustering
  Cluster Description and Naming
Results
  Cluster One: Comfortable home owning young families in semi rural areas
  Cluster Two: Young, mobile, affluent, multicultural singles
  Cluster Three: Struggling labouring communities
  Cluster Four: Young labouring families in outer commuter areas
  Cluster Five: Settled older rural communities
  Cluster Six: Urban city peripheral communities
  Cluster Seven: Struggling rural aging communities
  Cluster Eight: Rural farming communities
  Cluster Nine: Small rural Townlands
  Cluster Ten: Labouring rural communities in older housing stock
  Cluster Eleven: Young educated commuter families
  Cluster Twelve: Comfortable rural farming communities
  Cluster Thirteen: Affluent professional commuters in larger homes
  Cluster Fourteen: Semi rural periphery manufacturing communities
  The Urban Rural Divide
Crime Atlas
  Set Up
  Theft
  Assault
  Burglary
  Damage to Property or the Environment
  Dangerous Acts
  Drugs
  Fraud
  Kidnapping
  Public Order
  Robbery
  Weapons
  Offences against the State, Justice or Organised Crime
  Crime Atlas Comments
Issues with Data and Analysis
Conclusion
Bibliography
Appendix 1: R Code Used Throughout the Thesis
Appendix 2: Table of Census Themes
Appendix 3: Garda Sub-district Look-Up Tables
List of Figures
Figure 1:
Figure 2: Cumulative variance explained by components 1:40
Figure 3: Scree plot of (WCSS) for k = [1:100]
Figure 4: Heatmap of clusters where k=14
Figure 5: Radar plot of Cluster One
Figure 6: Radar plot of Cluster Two
Figure 7: Radar plot of Cluster Three
Figure 8: Radar plot of Cluster Four
Figure 9: Radar plot of Cluster Five
Figure 10: Radar plot of Cluster Six
Figure 11: Radar plot of Cluster Seven
Figure 12: Radar plot of Cluster Eight
Figure 13: Radar plot of Cluster Nine
Figure 14: Radar plot of Cluster Ten
Figure 15: Radar plot of Cluster Eleven
Figure 16: Radar plot of Cluster Twelve
Figure 17: Radar plot of Cluster Thirteen
Figure 18: Radar plot of Cluster Fourteen
Figure 19: Box plots by Cluster of theft related crime
Figure 20: Box plots by Cluster of assault related crime
Figure 21: Box plots by Cluster of burglary related crime
Figure 22: Box plots by Cluster of damage related crime
Figure 23: Box plots by Cluster of crimes relating to dangerous acts
Figure 24: Box plots by Cluster of drug related crime
Figure 25: Box plots by Cluster of fraud related crime
Figure 26: Box plots by Cluster of kidnapping related crime
Figure 27: Box plots by Cluster of public order and social code crime
Figure 28: Box plots by Cluster of robbery related crime
Figure 29: Box plots by Cluster of weapons related crime
Figure 30: Box plots by Cluster of offences against the State, justice or organised crime
Figure 31: Variance in crime rates explained by the clustering classification
List of Maps
Map 1: Garda Sub-districts by division
Map 2: Garda Sub-districts by cluster
Map 3: Cluster One
Map 4: Cluster Two
Map 5: Cluster Three
Map 6: Cluster Four
Map 7: Cluster Five
Map 8: Cluster Six
Map 9: Cluster Seven
Map 10: Cluster Eight
Map 11: Cluster Nine
Map 12: Cluster Ten
Map 13: Cluster Eleven
Map 14: Cluster Twelve
Map 15: Cluster Thirteen
Map 16: Cluster Fourteen
Map 17: The Urban Rural Divide
Map 18: Theft
Map 19: Assault
Map 20: Burglary
Map 21: Damage to property or the environment
Map 22: Dangerous Acts
Map 23: Drugs
Map 24: Fraud
Map 25: Kidnapping
Map 26: Public Order
Map 27: Robbery
Map 28: Weapons
Map 29: State, Justice or Organised Crime
List of Tables
Table 1: Derived variables used in the classification
Table 2: Extreme z scores for variables within clusters
Table 3: Sub-districts in Cluster One
Table 4: Sub-districts in Cluster Two
Table 5: Sub-districts in Cluster Three
Table 6: Sub-districts in Cluster Four
Table 7: Sub-districts in Cluster Five
Table 8: Sub-districts in Cluster Six
Table 9: Sub-districts in Cluster Seven
Table 10: Sub-districts in Cluster Eight
Table 11: Sub-districts in Cluster Nine
Table 12: Sub-districts in Cluster Ten
Table 13: Sub-districts in Cluster Eleven
Table 14: Sub-districts in Cluster Twelve
Table 15: Sub-districts in Cluster Thirteen
Table 16: Sub-districts in Cluster Fourteen
Introduction
This thesis has two main parts; part one will focus on the statistical methods used
in geodemographics to classify Garda Sub-districts. The thesis will use principal
components analysis and clustering techniques to create the classification. Part two will
then consist of a crime atlas of Ireland at Garda Sub-district level; this atlas will also
contain information regarding the performance of the clustering exercise when used with
real crime data.
The census of Ireland provides a vast amount of data at many geographic levels,
disseminated by the Central Statistics Office (CSO). The smallest geography at which
the census is reported is the Small Area (Central Statistics Office, 2014). There are
18,488 Small Areas in the Republic of Ireland. The census reports 764 variables for each
of the 18,488 Small Areas, which gives 14,124,832 individual data points.
This thesis concentrates on the 563 catchment areas of Garda stations as defined
by An Garda Síochána and reported in the census as Garda Sub Divisions or Garda
Sub-districts (Central Statistics Office, 2014). Even though this reduces the number of
areas to 563, that still equates to 430,132 data points, not including crime statistics for
the 13 years available. Making sense of that amount of data requires it to be summarised
so that the end user is not overwhelmed.
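The data-point totals quoted above follow from multiplying the number of areas by the number of census variables. A minimal sketch of that arithmetic is given here in Python; the function and variable names are illustrative only and are not taken from the thesis code.

```python
# Counts quoted in the text above.
SMALL_AREAS = 18_488
GARDA_SUB_DISTRICTS = 563
CENSUS_VARIABLES = 764


def data_points(n_areas: int, n_variables: int) -> int:
    """Number of individual values in an areas-by-variables table."""
    return n_areas * n_variables


small_area_points = data_points(SMALL_AREAS, CENSUS_VARIABLES)
sub_district_points = data_points(GARDA_SUB_DISTRICTS, CENSUS_VARIABLES)

print(f"Small Areas:   {small_area_points:,} data points")
print(f"Sub-districts: {sub_district_points:,} data points")
```

Moving from Small Areas to Garda Sub-districts reduces the table by a factor of roughly 33, yet the result is still far too large to read directly, which is the motivation for summarising it.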
Traditionally crime statistics are reported by county or region of the country as
can be seen in newspapers and websites when reporting on crime; examples include The
Independent (MacCarthaigh & Phelan, 2014) and the Irish Mirror (Jordan, 2015). While it
makes intuitive sense to amalgamate Garda stations by county for the purposes of crime
statistics reporting, nuances and spatial variation within counties are lost. Another
approach could be to report crimes by station; however this may arguably be problematic.
While reporting by station would show differences within counties and nationally,
without knowing something of the characteristics of the individual stations, comparing
them becomes moot. The first law of geography (Tobler, 1970) is generally
accepted to hold. However, just because two Garda station areas are next to each other
does not mean that they are alike; there is little point comparing Ballina in County Mayo
with Killala, for example. The catchments
are neighbours, but Ballina is a generally urban environment with 14,329 people at the
last census in 2011, whereas Killala is a rural, coastal area that is physically larger than
the Ballina catchment but with only 3,766 people (Central Statistics Office, 2014).
This thesis advocates comparing stations not only on their location in the An Garda
Síochána hierarchy, but also on their similarities, specifically the underlying
geodemographics. Perhaps Killala has more in common with Kilrush in Clare or
Duncannon in Wexford and comparing it to its peers is a more sensible approach than
comparing it to its neighbours. That being said, it is expected that the classification
carried out in this work will show clusters of clusters throughout Ireland in support of
Tobler's first law.
There is much literature on the area of geodemographic classification, particularly
for marketing in the United Kingdom and America. Brunsdon et al. (2014) have created a
geodemographic classification of Ireland at the Small Area scale. Gale et al. (2015) have
used geodemographics in London relating to crime; however there does not appear to be a
classification at the national level for Ireland relating to Garda station catchments. It is
felt that this thesis may address a gap in the literature and provide a new tool for
comparing policing in Ireland.
The first aim is to create the classification. The steps will be described in detail
throughout this thesis; however, in short, census data will be used to create the
classification. Relevant variables will be chosen and transformed for use. Methods of
reducing the data to manageable proportions will be used and then the data will be
subjected to a clustering algorithm. The resulting clusters of similar Garda Sub-districts
will then be named and described. The second part of the thesis will then use these
clusters to map various crime statistics at a national and cluster by cluster basis to allow
comparison based on similar Garda stations to be made.
It is hoped that in creating this classification, policy makers may begin to ask
questions such as: if this area A has similar characteristics to area B, why are the crime
rates so different? Are more resources needed to reflect the similar catchment
demographics? Are the opening hours of a station appropriate given the social
demographic makeup of the population? It is further hoped that the answers to these
questions may be found by those with access to more information, such as Garda numbers
and skill breakdowns, policing and social services infrastructure in areas etc. Lastly it is
anticipated that this thesis can be built upon by carrying out the same classification with
up to date data when the census is completed and made available after census 2016.
Literature Review
General
Classification is a natural human process that helps us understand and make sense
of the world around us. Parker et al. (2007) assert that the natural process of classification
undertaken by lay people [...] sociospatial construction
of reality [...]. Perec (1989), quoted in Boyne (2006), speaks of a dream of universal
classification, or law, to describe the whole world that did not, and could not, work.
So it was imagined that the entire world could be distributed according to a unique
code, that one universal law would reign over the totality of phenomena: two
hemispheres, five continents, masculine and feminine, animal and vegetable,
singular plural, right left, four seasons, five senses, six vowels, seven days, twelve
months, twenty-six letters. (Perec, 1989:155)
Vickers and Rees (2011) state that complex systems can be classified to help the
understanding of those systems. However, there is no one right system of classification;
Dupré (2006) argues that classifications will be driven by the purpose for which they were
created and that differe[...]
This is a view that is shared by Charlton et al., who produced one of the first open
[...] the [...] classification is an arbitrary thing (Charlton, et al., 1985).
Area classification is the act of grouping areas based on selected features within
those areas; the similarity of the characteristics of the selected features drives the
classification (Vickers & Rees, 2007). Geodemographic classifications are a type of area
[...] one of the most commonly used
areas classifications [...]. What can make geodemographic classifications
useful is the descriptions that generally accompany them to give a textual summary of the
attributes of each class (Abbas, et al., 2009). Geodemographics are widely used in
marketing w[...] popular segmentation technique (Doyle, 2011).
Many (Abbas, et al., 2009; Gale, et al., 2015; Singleton & Longley, 2008; Vickers &
Rees, 2007) describe geodemographic classifications as tools to summarise large sets of
spatially dependent data such as census data. Gale et al. (2015) state that geodemographic
classifications allow the highlighting of similarities between population structures in
different parts of a country. Gale et al. also point out that geodemographic classifications
give summaries based not only on the population but also on the built environment.
It would be remiss to discuss geodemographic classification without mentioning
Charles Booth. Between 1886 and 1903 Booth and several assistants accompanied police
officers on the beat around London to investigate places of work, working conditions,
homes and the urban environments. Through interviews and observations Booth created a
[...] (London School of Economics and Political Science, 2012). [...] (Booth, 1903)
was one of the first attempts to map [and classify] social-spatial structures (Alexiou &
Singleton, 2015). A portion of the digitised maps along with the classification Booth used
is shown below in Figure 1.
Figure 1:
Source: London School of Economics and Political Science (2012)
Theory
While geodemographics have a long history of being used in one form or another,
it must be acknowledged that the theory driving geodemographics is less robust than the
[...] "everything is related to everything else, but near things are more related
than distant things" (Tobler, 1970). Singleton and Longley (2009) note that the theoretical
[...] Alexiou and Singleton (2015),
who state that classifications based on geodemographics lack solid theory. Another issue
with many geodemographic classifications is that they are not generally geographically
weighted, and due to the methods of their construction are aspatial in design despite
showing spatial correlations in the results (Alexiou & Singleton, 2015). While the issues
with theoretical grounding are acknowledged, Singleton and Longley express their hopes
for best practice geodemographics in that they are focused, recognise the provenance of
the data used, are scientifically reproducible and use the best methods available
(Singleton & Longley, 2009).
Modifiable Areal Unit Problem
Geodemographic analysis is generally agreed to be best carried out at the smallest
areal unit available in order not to lose spatial variation that larger units may obscure
(Alexiou & Singleton (2015), Charlton, et al. (1985) and Gale, et al. (2015) are
examples). However the scale also depends on the purpose of the classification (Alexiou
& Singleton, 2015), therefore a balance must be struck. Another factor to consider is the
Modifiable Areal Unit Problem (MAUP). Gehlke and Biehl (1934) noted that choices in
data aggregation over space and the size of areal unit used in analysis have influence over
the correlation coefficient. This was expanded upon by Openshaw and Taylor (1979) who
coined the phrase Modifiable Areal Unit Problem. They conducted experiments on a
spatial data set and found that they could obtain correlations of between -0.99 and 0.99 from
different levels of aggregation. Charlton and Brunsdon (2016) presented a paper at the
GIS Research UK Conference which revisited the work by Gehlke and Biehl using census
data from Ireland at several different official aggregation levels ranging from Small Areas
to Counties. They were able to show that the larger areas lost variance between areas and [...]
While the MAUP is a factor, there is not much that can be done about it if, for
example, data is only available at one areal unit scale. If using official areal units one
must also be aware of the MAUP when comparing results over time, as these boundaries
may change. An example of this is the Irish Electoral Constituencies which were changed
by an act of the Irish Parliament (Houses of the Oireachtas, 2013) as required by Article
16.4 of the Irish Constitution every twelve years (Government Publications, 2016).
Likewise, and more relevant for this work, are changes to Garda Sub-districts: the
boundaries were changed in 2013 following the closure of some 100 Garda stations (An
Garda Siochana, 2013; Central Statistics Office, 2014). This is an issue noted in the
United Kingdom in relation to British police Basic Command Units (BCU), where Ashby
and Longley (2005) [...] "Maintaining the BCU families is an arduous task due to the
[...]". Therefore any follow up to this thesis
should be aware of the possibility of the MAUP affecting results should the boundaries be
changed by An Garda Síochána.
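The scale effect described in this section can be illustrated with a small simulation. The sketch below is not the analysis of Gehlke and Biehl, Openshaw and Taylor, or Charlton and Brunsdon; it assumes synthetic data in which two variables share a weak spatial trend plus independent noise, and it aggregates neighbouring observations into contiguous zones of increasing size. All names and parameter values are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def zone_means(values, zone_size):
    """Average consecutive observations into contiguous zones of zone_size."""
    return [
        sum(values[i:i + zone_size]) / zone_size
        for i in range(0, len(values), zone_size)
    ]


rng = random.Random(2016)  # fixed seed so the sketch is repeatable
n = 1000
trend = [i / n for i in range(n)]            # shared, weak spatial trend
x = [t + rng.gauss(0, 1) for t in trend]     # variable 1: trend + noise
y = [t + rng.gauss(0, 1) for t in trend]     # variable 2: trend + noise

# Correlation between the same two variables at four aggregation levels.
correlations = {}
for zone_size in (1, 10, 50, 100):
    correlations[zone_size] = pearson(
        zone_means(x, zone_size), zone_means(y, zone_size)
    )
    print(f"zone size {zone_size:>3}: r = {correlations[zone_size]:.2f}")
```

Under these assumptions, averaging within zones cancels much of the independent noise while preserving the shared trend, so the correlation between the same two variables changes markedly with the size of the areal unit. Real boundary changes would also alter zone shapes, which this sketch does not attempt to model.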
Variables
Selection
Harris et al. (2005) explain that a geodemographic classification is created by
grouping areas that are alike into a number of classes, often based on census data. As
geodemographic classifications are a way of summarising social, demographic and built
characteristics of zoned geography (Gale, et al., 2015), it makes sense to use census data.
Vickers and Rees (2007) also argue that a national census (in their case British, but the
principle holds) stands above other sources due to its amount of data and
[...]. The choices of variables to use in the classification drive the
[...].
However, they do state that the choices are very difficult to make. Vickers and Rees
(2007) suggest that variables should be chosen only if there is a good reason; this implies
that including a variable just because one has the data is not the best policy.
Correlation
Variable correlation can be an issue when using census data to inform analysis.
Collinearity in the data can affect the performance of any significance tests that may be
carried out on, for example, linear regressions (Anderson, et al., 2010). In classification
exercises correlation between variables creates redundancy in the input data (Alexiou &
Singleton, 2015). Two main methods exist for dealing with correlation within the variables,
and it is acknowledged that there is no general rule (Vickers & Rees, 2007). One
approach is to remove one variable from each pair of highly correlated variables
(Alexiou & Singleton, 2015; Vickers & Rees, 2007). Another approach is to use Principal
Components Analysis (PCA) to transform a set of N correlated variables into a set of n
uncorrelated principal components. This approach was used by Charlton et al. (1985) and
Brunsdon et al. (2014). With principal components, each component is a linear
combination of the parent variables so the variance is retained but the components are
uncorrelated (Alexiou & Singleton, 2015). Additionally, the first component accounts for
the most variance and each subsequent component adds less to the overall variance explained
(Jolliffe, 2002). The user can therefore decide how much variance they are willing to
sacrifice in order to reduce dimensionality in the data by using fewer principal
components than the number of input variables (Charlton, et al., 1985; Harris, et al., 2005;
Jolliffe, 2002).
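The first approach, dropping one variable from each highly correlated pair, can be sketched in R as follows. This is an illustration only: the toy data, the variable names a, b and c, and the 0.9 threshold are all assumptions and are not taken from the literature cited above.

```r
# Toy data: three hypothetical variables, two of them strongly correlated.
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(
  a = x1,
  b = x1 + rnorm(200, sd = 0.1),  # nearly a copy of a
  c = rnorm(200)                  # unrelated to a and b
)

r <- cor(toy)
r[upper.tri(r, diag = TRUE)] <- NA           # keep each pair only once
high <- which(abs(r) > 0.9, arr.ind = TRUE)  # 0.9 is an arbitrary threshold

# Drop the second variable of each flagged pair (the one later in column order).
to_drop <- unique(rownames(r)[high[, "row"]])
reduced <- toy[, !(names(toy) %in% to_drop), drop = FALSE]
names(reduced)
```

Which member of a pair to drop is a judgement call; here the later column is removed purely for simplicity.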
Units
Variables used in geodemographics can be reported in different units such as
percentages of population, count data, indices etc. (Alexiou & Singleton, 2015). This can
make comparison of variables difficult. The census of Ireland (Central Statistics Office,
2014) reports most available data as a count of people; therefore it is relatively simple to
convert any required variable to percentage of population within the spatial unit. This not
only makes the variables easier to compare, it also stops areas with high population
figures affecting the analysis due to higher absolute numbers. The variables may also be
standardised using z scores to allow for true comparison of the individual variables'
influence on a cluster. The z score is a measure of the relative location of an
observation in a data set; therefore data points in two different data sets with the same z score have the
same relative location, i.e. they are the same number of standard deviations from the
mean (Anderson, et al., 2010).
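As a brief illustration of both conversions, the sketch below turns counts into percentages and then into z scores. The figures and the column names total_pop and age_65_plus are hypothetical and are not census values.

```r
# Hypothetical counts for four areas of very different size.
areas <- data.frame(
  total_pop   = c(500, 2000, 10000, 90000),
  age_65_plus = c(100, 300, 1200, 9000)
)

# Percentage of population: removes the effect of absolute population size.
pct65 <- 100 * areas$age_65_plus / areas$total_pop

# z score: (value - mean) / standard deviation, here via scale().
z65 <- as.numeric(scale(pct65))

data.frame(areas, pct65, z65)
```

In this toy example the area with the largest count of older residents has the lowest percentage, which is the distortion that the conversion to percentages is meant to avoid.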
Clustering
Methods
Clustering involves finding subsets of interest within a larger set; the subsets are
called clusters and are usually homogeneous within each cluster and separated between
clusters (Hansen & Jaumard, 1997). Gordon (1987) notes that it is
, Vickers and Rees (2007) maintain that
there is no right or wrong way to classify. Commercial classifications tend to build from
the ground up, clustering at the smallest available level then aggregating into larger
groups (Singleton & Longley, 2008). The open Output Area Classification in the UK,
however, was clustered from the top down by creating several large clusters that were
then subjected to clustering techniques separately (Vickers & Rees, 2007). It is widely
acknowledged in the available literature that k-means clustering is the technique of
choice for geodemographic clustering. This is shown in either the acknowledgement of
k-means in theoretical papers or the use of k-means in applied papers (Abbas, et al., 2009;
Alexiou & Singleton, 2015; Brunsdon, et al., 2014; Charlton, et al., 1985; Vickers &
Rees, 2007).
K-means clustering is seen to have something of an advantage over other methods
such as agglomerative, divisive, constructive or direct optimisation (described well in
Gordon (1987)). This is because they are all hierarchical in nature and will force a
hierarchy on the output even if one does not exist (Gordon, 1987). K-means will not force
a hierarchy on the output, but it does require the user to choose the
number of clusters beforehand (Singleton & Longley, 2008). Clustering techniques
generally require a measure of dissimilarity between observations (Jolliffe, 2002). K-means
uses the squared Euclidean distance (Alexiou & Singleton, 2015). In essence, k-means
uses k clusters to sort n observations while minimising the sum of squared errors
(Alexiou & Singleton, 2015; Ding & He, 2004). K-means assigns each observation to the
cluster with the nearest mean, a new set of cluster means is then calculated and the
process begins again. The process stops when the within cluster sum of squares
(WCSS) can no longer be reduced. This occurs when cluster assignments no longer change, as any
changes would not make the sum of squares smaller (Alexiou & Singleton, 2015).
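The relationship between cluster assignments and the WCSS can be checked with a small sketch using the kmeans() function from base R. The two-group toy data and the object names pts, fit and wcss_manual are assumptions made for the example.

```r
# Two well separated hypothetical groups of 50 points in two dimensions.
set.seed(42)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# k must be chosen in advance; nstart repeats the run from several random starts.
fit <- kmeans(pts, centers = 2, nstart = 10)

# WCSS recomputed by hand: squared Euclidean distance of each point
# to the mean of the cluster it was assigned to.
wcss_manual <- sum((pts - fit$centers[fit$cluster, ])^2)

all.equal(wcss_manual, fit$tot.withinss)
table(fit$cluster)
```

Using several random starts (nstart) is a common precaution, since a single run of k-means may stop at a solution that is only locally best.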
Cluster Description
Once the WCSS is minimised and the clusters are assigned, the results need to be
described. The aim of cluster descriptions is to provide a short profile of each cluster for
the end user. Vickers and Rees (2007) explain that profiles use text and visuals to help the
sentences. The cluster labelling and
description process is acknowledged by Vickers and Rees (2007) to be difficult and
subject to much thought, in order not to mislead the user or offend the people living in the
areas classified.
Cluster descriptions draw on the main identifiable (Debenham, 2002), dominant
(Abbas, et al., 2009) characteristics of a cluster. Often, the process involves using z scores
to identify extreme variables within the cluster compared to the global mean (Debenham,
2002; Vickers & Rees, 2007). The descriptions that are attached to geodemographic
classifications are viewed as useful to other researchers (Abbas, et al., 2009). Parker et al.
(2007) regard them as the most sociologically interesting element of the geodemographic classification process.
Therefore the naming and description element of geodemographic classification should
not be overlooked or given less attention than the more statistical elements of the process.
Data
Three data sets are used throughout this work to create the classification:
1. Census 2011, all 764 reported census variables in columns, at Garda Sub-district
level in 563 rows (Central Statistics Office, 2014).
2. Garda Sub-district boundary files for use in mapping the outcomes (Central
Statistics Office, 2014a).
3. Crime data for Ireland at Garda Sub-district level (Central Statistics Office, 2016).
The crime data is aggregated at the Garda Sub-district level into 12 crime types:
• Attempts/threats to murder, assaults, harassments and related offences.
• Dangerous or negligent acts.
• Kidnapping and related offences.
• Robbery, extortion and hijacking offences.
• Burglary and related offences.
• Theft and related offences.
• Fraud, deception and related offences.
• Controlled drug offences.
• Weapons related offences.
• Damage to property and to the environment.
• Public order and other social code offences.
• Offences against government, justice procedures and organisation of crime.
The data for this study fall into two main areas: socio-demographic data from the
Census of Ireland (Central Statistics Office, 2014), and crime data (Central Statistics
Office, 2016). Both sets of data are reported at the Garda Sub-district level.
There are 563 Garda Sub-districts in Ireland (Central Statistics Office, 2016). The
Sub-districts
Louisburgh in Mayo. They have populations ranging from 384 in Sraith Salach in Galway
to 98,078 in Blanchardstown in Dublin (Central Statistics Office, 2014).
These Sub-districts are based loosely on the official geography of Irish
Townlands, but were designed by the Examiner of Maps (GIS) at An Garda Síochána to
suit the needs (Creaner, 2016). The Sub-districts are a unique data set in
that they are designed for operational rather than statistical reasons. It is acknowledged by
Creaner that using Small Areas would be better statistically. However the use of Small
a catchment area in the middle of a motorway (for example), as may happen with Small
Areas (Creaner, 2016). The Garda Sub-districts are shown, grouped into administrative
divisions, in Map 1. All maps produced in this thesis use boundary files from the CSO
website and contain Ordnance Survey Ireland Data (Ordnance Survey Ireland, 2012).
It is not known exactly how the 2011 census data were attached to the 2013 Garda
Sub-district geography. The assumption in this thesis is that the CSO would be able to
populate the new boundaries at the household level. At the time of writing the CSO have
not replied to my queries; however, the designer of the Sub-districts (Creaner, 2016) has
agreed that this assumption is a reasonable one. Another option would be to populate the
Sub-districts by centroid based on a smaller unit such as Small Areas. If this is the case, it
is not felt that there would be too much loss of overall validity due to the number of Small
Areas (18,488) being assigned to one of 563 Sub-districts. Aside from this ambiguity,
there is no known issue with Irish Census Data.
Map 1: Garda Sub-districts by division
The census data used in this thesis is made up of 764 variables that are derived
from household level data and amalgamated to the various levels reported from 18,488
Small Areas to four Provinces (Central Statistics Office, 2014). The MAUP discussed in
the literature review section is relevant, as carrying out the classification at different scales
will produce different results. It may be possible to carry out the classification at Small
Area level and then combine these results into larger unit scales. However, this is not
appropriate for this study. Firstly, there are no smaller units that fit into Sub-districts due
to the proprietary nature of the Sub-districts. Secondly, all data required are available at
the Sub-district level. Lastly, Sub-districts are the smallest geographical unit at which crime
data are released in Ireland (Central Statistics Office, 2016). Therefore it is acknowledged
that variation within the Garda Sub-districts is lost during this study. However because
this classification is for the purpose of being able to compare Garda Stations on a like for
like basis, it is felt that the loss is acceptable at a national level.
Building the Classification
The classification was built in R, a free, open source statistical computing
environment that can handle large amounts of data (The R Foundation, 2015). Some code
blocks will be included where needed for clarification. However in the interest of
reproducibility the full R code that created the classification is included in Appendix 1.
Variables
This classification aims to give reproducible results that are comparable to similar
studies at different scales. Charlton et al. (1985) chose their variables to give a
comparable classification between their open one and the commercial ACORN
classification in the UK. Brunsdon et al. (2014) chose Irish Census variables at the Small
Area Scale to reflect the OAC classification variables chosen by Vickers and Rees (2007).
Therefore this study will use the same variables as Brunsdon et al. The full list of variable
codes reported by the Census is available for download from the CSO website (Central
Statistics Office, 2014). An adapted list is included in Appendix 2 for reference should the
reader require clarification on any variable codes used.
The variables chosen for the classification exercise are actually derived variables.
Each is made up of two or more individual census variables, combined to give a
percentage of the population of the area in question. For example, one variable
used is that of lone parents. This variable is derived by adding the Lone Mothers with
Children (number of families) to Lone Fathers with Children (number of families),
dividing by the total number of families and multiplying the result by 100. The actual
code is shown below.
loneParent <- 100 * (T4_3FTLF + T4_3FTLM) / T4_5TF
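The line above refers to the census codes as bare names, which presumes the columns are visible in the workspace. A minimal sketch of the same calculation on a data frame is given here; the data frame name census, the GEOGID values and the counts are hypothetical, and only the three census codes come from the formula above.

```r
# Hypothetical family counts for three Sub-districts (not census figures).
census <- data.frame(
  GEOGID   = c("A", "B", "C"),
  T4_3FTLF = c(10, 40, 5),      # first lone parent family count in the formula
  T4_3FTLM = c(90, 160, 45),    # second lone parent family count in the formula
  T4_5TF   = c(500, 1000, 400)  # total number of families
)

# with() evaluates the formula inside the data frame, so no attach() is needed.
census$loneParent <- with(census, 100 * (T4_3FTLF + T4_3FTLM) / T4_5TF)
census[, c("GEOGID", "loneParent")]
```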
In all there are 40 derived variables used in the classification, grouped into six
areas: demographic, household composition, housing, socioeconomic, employment and
connectivity. The derived variables are shown in Table 1; the actual make up of each
derived variable can be seen in the R code in Appendix 1. As stated, the variables from
the CSO were reported at Garda Sub-district level. This aided in mapping as both the
census file and the boundary file contained unique geography ID numbers (GEOGID) for
each Sub-district. The numbers were slightly different in that one set was prefixed with an
find and
replace command in Excel before the census file was loaded in R, however it could have
just as easily been carried out in R.
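A sketch of how such a clean-up might be done in R is shown here. The prefix "G_" and the ID values are purely hypothetical stand-ins, since the actual prefix is not reproduced in this text; sub() is the base R function used for the replacement.

```r
# Hypothetical ID vectors; the real prefix and ID values will differ.
census_ids   <- c("G_1001", "G_1002", "G_1003")
boundary_ids <- c("1001", "1002", "1003")

# Remove the prefix only where it occurs at the start of the ID.
clean_ids <- sub("^G_", "", census_ids)

# Check that every census ID now has a match in the boundary file.
all(clean_ids %in% boundary_ids)
```

Doing this step in code rather than in a spreadsheet keeps the whole workflow reproducible from the raw files.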
Theme                  Derived Variable    Description
Demographics           Age0-4              Percentage of population aged 0-4
                       Age5-14             Percentage of population aged 5-14
                       Age25-44            Percentage of population aged 25-44
                       Age45-64            Percentage of population aged 45-64
                       Age65+              Percentage of population aged 65 and over
                       EUNat               Percentage of population that is European by nationality (excluding Irish)
                       RestofWorld         Percentage of population where nationality was given as Rest of the World
                       BornOutsideIRE      Percentage of population not born in Ireland
Household Composition  Separated           Percentage of persons separated or divorced
                       SinglePerson        Percentage of persons (non pensioners) living in one person households
                       Pensioner           Percentage of persons who are pensioners
                       LoneParent          Percentage of families that are lone parent families
                       NoChildren          Percentage of families that are 'pre family' (no children born)
                       NonDependChildren   Percentage of families with children where the youngest child is 20+
Housing                RentPublic          Percentage of total households rented from local authority
                       RentPrivate         Percentage of total households privately rented
                       Flats               Percentage of total households defined as flats
                       NoCentralHeat       Percentage of total households with no central heating
                       RoomsHH             Average number of rooms per household
                       PeoplePerRoom       Total persons / total rooms
                       SepticTank          Percentage of total households with an individual septic tank
Socioeconomic          HEQual              Percentage of persons with an Ordinary Bachelors Degree or higher
                       Employed            Percentage of persons at work
                       TwoCars             Percentage of households with two or more cars
                       JTWPublic           Percentage of persons over age 5 who travel to school, college or work by means of bus or rail
                       HomeWork            Percentage of persons self employed (Own account workers)
                       LLTI                Percentage of persons reporting bad or very bad health
                       UnpaidCare          Percentage of persons providing unpaid care
Employment             Students            Percentage of persons who are students
                       Unemployed          Percentage of persons who are unemployed having lost or given up jobs
                       EconinactFam        Percentage of persons looking after home/family - homemakers
                       Agric               Percentage of workers who work in agriculture, forestry or fishing
                       Construction        Percentage of workers who work in construction
                       Manufacturing       Percentage of workers who work in manufacturing
                       Commerce            Percentage of workers who work in commerce and trade
                       Transport           Percentage of workers who work in transport and communication
                       Public              Percentage of workers who work in public administration
                       Professional        Percentage of workers who work in professional services
Connectivity           Broadband           Percentage of internet connected households with broadband
                       Internet            Percentage of total households with some kind of internet access
Table 1: Derived variables used in the classification
It can be seen in Table 1 that there will be issues in the data using the derived
variables as they are. For example Separated can be expected to be correlated with
LoneParent. As mentioned previously, Principal Components Analysis (PCA) is a set of
methods that can take in variables that may be correlated and produce a set of
uncorrelated principal components. PCA is also used to reduce the size of a clustering
computational problem (Jackson, 1991).
Principal Components Analysis
Once the 40 variables were chosen and derived from the census variables they
were subjected to Principal Components Analysis. The reasons were twofold; firstly to
reduce the dimensionality of the data from 40 to a more manageable number. Secondly,
PCA was used to remove any correlation in the data. The cluster algorithm used for the
classification was k means, which assumes no correlation between variables. Therefore PCA is essential to
provide k means with a set of uncorrelated variables to carry out the clustering. The
components are linear combinations of the original variables. Each one contains a
proportion of the variance in the original data and they are ordered by the amount of
variance they explain. Therefore it is possible to view the cumulative variance explained
by the components and decide how many to use in the k means clustering.
As mentioned previously, there is a lack of theory in this regard; however, the
majority of the variance should be kept, otherwise the analysis loses too much
information to make the PCA worth doing. Jolliffe (2002) describes the choice of a cut off
of variance as an ad hoc rule-of-thumb that works in practice. Jolliffe suggests a range
from 70% to 90% to retain m components where m is the smallest integer for which the
cumulative variance explained is greater than the cut off.
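The rule can be written in a few lines of R. The proportions of variance below are hypothetical values for six components, and the names prop_var, cum_var and cutoff are chosen for the example only.

```r
# Hypothetical proportions of variance for six components (they sum to 1).
prop_var <- c(0.40, 0.25, 0.12, 0.10, 0.08, 0.05)
cum_var  <- cumsum(prop_var)

cutoff <- 0.80

# m is the smallest number of components whose cumulative variance exceeds the cut off.
m <- which(cum_var > cutoff)[1]
m
cum_var[m]
```

With these toy values the first three components explain 77% of the variance, so a fourth is needed to pass an 80% cut off.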
There is a function in R that calculates the Principal Components for a user, called
princomp(). This function takes in the relevant variables and performs a PCA. It is
then possible to view the cumulative variance explained by each component to choose
how many to use in the clustering process. For detailed explanations of Principal
Components, works by Jackson (1991), Jolliffe (2002), or Rencher and Christensen
(2012) are recommended. However, the main steps involved are:
1. Get data: in this case 40 derived variables × 563 Garda Sub-districts
2. Subtract mean of each variable from each instance of the variable
3. Calculate correlation matrix
4. Calculate eigenvalues and eigenvectors of the correlation matrix
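The steps above can be followed by hand on a small example and compared with princomp(). This is a sketch on hypothetical data; the matrix X and its columns v1 to v3 are assumptions, and the centred data are also scaled here because the analysis is based on the correlation matrix.

```r
# Step 1: toy data, 150 observations of three variables, two of them correlated.
set.seed(7)
base <- rnorm(150)
X <- cbind(v1 = base + rnorm(150, sd = 0.3),
           v2 = base + rnorm(150, sd = 0.3),
           v3 = rnorm(150))

Z   <- scale(X)   # step 2: subtract the mean (and divide by the standard deviation)
R   <- cor(X)     # step 3: correlation matrix
eig <- eigen(R)   # step 4: eigenvalues and eigenvectors

# Component scores are the standardised data projected onto the eigenvectors.
scores <- Z %*% eig$vectors

# princomp() on the correlation matrix reports the square roots of the eigenvalues.
pca <- princomp(X, cor = TRUE)
all.equal(unname(pca$sdev), sqrt(eig$values))

# The scores are uncorrelated: off-diagonal correlations are zero.
round(cor(scores), 10)
```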
The Principal Components were calculated and their cumulative explanation of the
variance of the original derived variables displayed by entering the following two lines of
code in to R.
pca <- princomp(gardaVars[, -1], cor = TRUE, scores = TRUE)  # all columns except the first
cumsum(pca$sdev^2 / sum(pca$sdev^2))  # cumulative proportion of variance explained
The cumulative variance explained by the components is shown in Figure 2.
Figure 2: Cumulative variance explained by components 1:40
As per general recommendations mentioned earlier, this study will use the first m
components that total to at least 80% of the variance. This means that the first nine
components will be used in the study, as they account for 80.65% of the variance. This
was seen as a good cut off as the other 31 components only accounted for 19.35% of the
variance in the original data set between them. Another reason for not including the tenth
component is that it is the first component that fails a test that suggests each component
should contribute more than of the cumulative variance (Jolliffe, 2002). As p is 40 and
, the fact that component ten only explains an extra 1.64% of the variance
excludes its use in the classification process.
Clustering
The work up to this point has concentrated on getting the data ready for the
clustering process. The nine principal components represent a much smaller data set than
the 40 derived variables, and they are also uncorrelated. This means that they are ready for
use in a clustering algorithm. The method of clustering used in this thesis is the k means
technique as described in the literature review. The k means method was chosen as it is
the system of choice in most geodemographic classifications (Abbas, et al., 2009; Alexiou
& Singleton, 2015; Brunsdon, et al., 2014; Charlton, et al., 1985; Vickers & Rees, 2007).
K means requires the number of clusters to be known before it sets out to minimise the
within cluster sum of squares. However, this is not an issue in R. It is possible to run k
means through a loop in R; in this way the clustering exercise can be run many
times with different possible numbers of clusters, and the user can pick the best number
for k based on the results. In order to pick the best number for k, the k means process was
run 100 times with k starting at one and being increased by one each time. The results
were then plotted on a scree plot shown in Figure 3. The code to run through this loop is
shown below.
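A minimal sketch of a loop of this kind is given here for illustration. It is not the thesis code, which is in Appendix 1. The toy scores matrix stands in for the retained principal component scores, and maxK, nstart and iter.max are assumptions chosen for the example.

```r
set.seed(123)

# Stand-in for the retained component scores: three hypothetical groups of
# 30 observations in two dimensions.
scores <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 6),  ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

maxK <- 10
wcss <- numeric(maxK)

# Run k means once for each candidate k and record the total WCSS.
for (k in 1:maxK) {
  fit <- kmeans(scores, centers = k, nstart = 20, iter.max = 50)
  wcss[k] <- fit$tot.withinss
}

round(wcss, 1)
# A scree plot can then be drawn with:
# plot(1:maxK, wcss, type = "b", xlab = "Number of clusters (k)", ylab = "WCSS")
```

In a scree plot of these values the WCSS falls steeply up to the true number of groups (three in this toy case) and only slowly afterwards, which is the pattern used to pick k.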
Increasing the number of clusters beyond 14 (k > 14) would not add to
the classification. In addition, splitting the clusters into smaller units may have created
clusters that were too nuanced for the purpose of comparing Garda Sub-districts. The final
step in the clustering process was to join the cluster numbers to the GEOGID numbers of
individual Garda stations so that the results could be mapped.
It is possible to cluster a set of clusters in order to create higher order super groups
for description purposes. However with only 14 clusters and 563 areas it was felt that a
second level of clustering was unnecessary for this classification.
Figure 3: Scree plot of (WCSS) for k = [1:100]
Figure 4: Heatmap of clusters where k=14
Cluster Description and Naming
With the clusters set, they needed to be named and described. For some (Abbas, et
al., 2009; Parker, et al., 2007) the descriptions attached to clustering exercises such as this
one are the most interesting element of the final geodemographic classification process.
Vickers and Rees (2011) state that cluster names may be the primary source of
information used when judging a cluster in a classification by the end user.
Naming and describing the clusters was a multi stage process. All of the clusters
were mapped in order to check that the classification seemed spatially sensible. Then the z
scores of the original derived variables were calculated for each cluster. The mean z score
across all clusters was also calculated for each variable ($\mu$), as was the standard deviation
($\sigma$). Clusters where $\mu - \sigma > z$ were deemed to have an extremely low z
score for that variable. Clusters where $z > \mu + \sigma$ were deemed to have an extremely high z
score for that variable. These extreme highs or lows accounted for 25% of the variables.
The extreme values informed the name attached to the clusters as they were
deemed to be dominant and identifiable characteristics of the cluster in question (Abbas,
et al., 2009; Debenham, 2002; Vickers & Rees, 2007). A table of the extremes identified
is reproduced in Table 2. In addition to the extreme z scores, the scores that were above or
below average for that variable informed more detail in the descriptions where necessary.
For the assistance of the end user a radar plot for each cluster is included in the
cluster descriptions. This plot shows the global average z score for each variable in red
and the z score for the cluster in blue. It also shows the extreme value cut offs in green
and purple.
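The flagging of extreme values can be sketched as follows. The matrix of cluster mean z scores is hypothetical, and the cut offs of one standard deviation either side of the mean across clusters are one reading of the thresholds described above, stated here as an assumption.

```r
# Hypothetical mean z scores: five clusters (rows) by two variables (columns).
m <- matrix(c(-1.5, -0.2,  0.0, 0.3,  1.4,
               0.1,  0.2, -0.1, 2.0, -2.2),
            ncol = 2,
            dimnames = list(paste0("cluster", 1:5), c("LoneParent", "Agric")))

mu    <- colMeans(m)      # mean across clusters, per variable
sigma <- apply(m, 2, sd)  # standard deviation across clusters, per variable

# TRUE where a cluster falls below (low) or above (high) the assumed cut offs.
low  <- sweep(m, 2, mu - sigma, "<")
high <- sweep(m, 2, mu + sigma, ">")

rownames(m)[low[, "LoneParent"]]
rownames(m)[high[, "Agric"]]
```

The logical matrices low and high have the same layout as a table of extremes, with one row per cluster and one column per variable.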
Table 2: Extreme z scores for variables within clusters