
A Geodemographic Classification of

Ireland at Garda Sub-district Level

A Tool for Comparing Sub-districts of An Garda Síochána

Gary Russell

M.Sc. Geocomputation

National Centre for Geocomputation,

Maynooth University

2016

Professor Christopher Brunsdon, Head of Department

Martin Charlton, Programme Coordinator and Thesis Supervisor


Abstract

This thesis uses demographic data from the 2011 census to classify the catchment

areas of Garda stations in the Republic of Ireland (referred to as Garda Sub-districts). This

was accomplished by subjecting the data at a Garda Sub-district level to principal

component analysis and then using clustering techniques on the resulting principal

components. Similar (geodemographically speaking) Garda station areas were then

visualised using mapping techniques based on Central Statistics Office boundary shape

files. The clusters of Garda Sub-districts were named and described using the distribution

of the characteristic variables compared to the global mean of each variable. As an

example of how such a classification can be used, a mini atlas of Garda Sub-districts was

created by comparing the crime figures of Garda Sub-districts within the resulting clusters

and visualising the results.

Acknowledgements

I would like to acknowledge and thank Professor Chris Brunsdon, Martin Charlton,

and Dr Ronan Foley, as well as all the staff and researchers in the Maynooth University

Departments of Geography and Computer Science, the National Centre for

Geocomputation and the All-Island Research Observatory, for their inspiration and

assistance in completing this thesis. I would also like to thank my three classmates for

their support and for putting up with my using them as sounding boards over the past year.

Lastly, I would like to thank my wife, Siobhan, and our two young children, Aoibhínn and

Joshua, for their unwavering support and belief that I could complete my B.A. and M.Sc.

despite the odd grumpy daddy moment.


Table of Contents

Abstract

Acknowledgements

Table of Contents

List of Figures

List of Maps

List of Tables

Introduction

Literature Review

General

Theory

Modifiable Areal Unit Problem

Variables

Selection

Correlation

Units

Clustering

Methods

Cluster Description

Data

Building the Classification

Variables

Principal Components Analysis

Clustering

Cluster Description and Naming

Results

Cluster One: Comfortable home owning young families in semi rural areas.

Cluster Two: Young, mobile, affluent, multicultural singles

Cluster Three: Struggling labouring communities

Cluster Four: Young labouring families in outer commuter areas

Cluster Five: Settled older rural communities

Cluster Six: Urban city peripheral communities

Cluster Seven: Struggling rural aging communities

Cluster Eight: Rural farming communities

Cluster Nine: Small rural Townlands

Cluster Ten: Labouring rural communities in older housing stock

Cluster Eleven: Young educated commuter families

Cluster Twelve: Comfortable rural farming communities

Cluster Thirteen: Affluent professional commuters in larger homes

Cluster Fourteen: Semi rural periphery manufacturing communities

The Urban Rural Divide

Crime Atlas

Set Up

Theft

Assault

Burglary

Damage to Property or the Environment

Dangerous Acts

Drugs

Fraud

Kidnapping

Public Order

Robbery

Weapons

Offences against the State, Justice or Organised Crime

Crime Atlas Comments

Issues with Data and Analysis

Conclusion

Bibliography

Appendix 1: R Code Used Throughout the Thesis

Appendix 2: Table of Census Themes

Appendix 3: Garda Sub-district Look-Up Tables


List of Figures

Figure 1:

Figure 2: Cumulative variance explained by components 1:40

Figure 3: Scree plot of (WCSS) for k = [1:100]

Figure 4: Heatmap of clusters where k=14

Figure 5: Radar plot of Cluster One

Figure 6: Radar plot of Cluster Two

Figure 7: Radar plot of Cluster Three

Figure 8: Radar plot of Cluster Four

Figure 9: Radar plot of Cluster Five

Figure 10: Radar plot of Cluster Six

Figure 11: Radar plot of Cluster Seven

Figure 12: Radar plot of Cluster Eight

Figure 13: Radar plot of Cluster Nine

Figure 14: Radar plot of Cluster Ten

Figure 15: Radar plot of Cluster Eleven

Figure 16: Radar plot of Cluster Twelve

Figure 17: Radar plot of Cluster Thirteen

Figure 18: Radar plot of Cluster Fourteen

Figure 19: Box plots by Cluster of theft related crime

Figure 20: Box plots by Cluster of assault related crime

Figure 21: Box plots by Cluster of burglary related crime

Figure 22: Box plots by Cluster of damage related crime

Figure 23: Box plots by Cluster of crimes relating to dangerous acts

Figure 24: Box plots by Cluster of drug related crime

Figure 25: Box plots by Cluster of fraud related crime

Figure 26: Box plots by Cluster of kidnapping related crime

Figure 27: Box plots by Cluster of public order and social code crime

Figure 28: Box plots by Cluster of robbery related crime

Figure 29: Box plots by Cluster of weapons related crime

Figure 30: Box plots by Cluster of offences against the State, justice or organised crime

Figure 31: Variance in crime rates explained by the clustering classification


List of Maps

Map 1: Garda Sub-districts by division

Map 2: Garda Sub-districts by cluster

Map 3: Cluster One

Map 4: Cluster Two

Map 5: Cluster Three

Map 6: Cluster Four

Map 7: Cluster Five

Map 8: Cluster Six

Map 9: Cluster Seven

Map 10: Cluster Eight

Map 11: Cluster Nine

Map 12: Cluster Ten

Map 13: Cluster Eleven

Map 14: Cluster Twelve

Map 15: Cluster Thirteen

Map 16: Cluster Fourteen

Map 17: The Urban Rural Divide

Map 18: Theft

Map 19: Assault

Map 20: Burglary

Map 21: Damage to property or the environment

Map 22: Dangerous Acts

Map 23: Drugs

Map 24: Fraud

Map 25: Kidnapping

Map 26: Public Order

Map 27: Robbery

Map 28: Weapons

Map 29: State, Justice or Organised Crime


List of Tables

Table 1: Derived variables used in the classification

Table 2: Extreme z scores for variables within clusters

Table 3: Sub-districts in Cluster One

Table 4: Sub-districts in Cluster Two

Table 5: Sub-districts in Cluster Three

Table 6: Sub-districts in Cluster Four

Table 7: Sub-districts in Cluster Five

Table 8: Sub-districts in Cluster Six

Table 9: Sub-districts in Cluster Seven

Table 10: Sub-districts in Cluster Eight

Table 11: Sub-districts in Cluster Nine

Table 12: Sub-districts in Cluster Ten

Table 13: Sub-districts in Cluster Eleven

Table 14: Sub-districts in Cluster Twelve

Table 15: Sub-districts in Cluster Thirteen

Table 16: Sub-districts in Cluster Fourteen


Introduction

This thesis has two main parts; part one will focus on the statistical methods used

in geodemographics to classify Garda Sub-districts. The thesis will use principal

components analysis and clustering techniques to create the classification. Part two will

then consist of a crime atlas of Ireland at Garda Sub-district level; this atlas will also

contain information regarding the performance of the clustering exercise when used with

real crime data.

The census of Ireland provides a vast amount of data at many geographic levels

that are disseminated by the Central Statistics Office (CSO). For example, the smallest

geography that the census is reported at is the Small Area (Central Statistics Office,

2014). There are 18,488 Small Areas in the Republic of Ireland. The census reports 764

variables for each of the 18,488 Small Areas, which gives fourteen million, one hundred and

twenty-four thousand, eight hundred and thirty-two (14,124,832) individual data points.

This thesis concentrates on the 563 catchment areas of Garda stations as defined

by An Garda Síochána and reported in the census as Garda Sub Divisions or Garda

Sub-districts (Central Statistics Office, 2014). Even though this gives only 563 areas to work with,

that still equates to 430,132 data points, not including crime statistics for the 13 years

available. Making sense of that amount of data requires it to be summarised so that the

end user is not overwhelmed.
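The data-point totals quoted above follow directly from multiplying the number of areas by the number of census variables. As a minimal check, sketched in Python rather than the R listed in Appendix 1, the snippet below reproduces both totals. The constant and function names are illustrative and are not taken from the thesis.

```python
# Counts quoted in the text for the 2011 census; nothing else is assumed.
VARIABLES_PER_AREA = 764
SMALL_AREAS = 18_488
GARDA_SUB_DISTRICTS = 563


def data_points(n_areas: int, n_variables: int = VARIABLES_PER_AREA) -> int:
    """Number of individual values in an areas-by-variables table."""
    return n_areas * n_variables


small_area_points = data_points(SMALL_AREAS)
sub_district_points = data_points(GARDA_SUB_DISTRICTS)

print(f"Small Areas:         {small_area_points:,}")
print(f"Garda Sub-districts: {sub_district_points:,}")
print(f"Reduction factor:    {small_area_points / sub_district_points:.1f}x")
```

Working at Sub-district level therefore shrinks the table by a factor of roughly 33, yet still leaves far too many values to read unaided.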

Traditionally, crime statistics are reported by county or region of the country, as

can be seen in newspaper and website coverage of crime; examples include The

Independent (MacCarthaigh & Phelan, 2014) and the Irish Mirror (Jordan, 2015). While it

makes intuitive sense to amalgamate Garda stations by county for the purposes of crime


statistics reporting, nuances and spatial variation within counties are lost. Another

approach could be to report crimes by station; however, this is arguably problematic.

While reporting by station would show differences within counties and nationally,

without knowing something of the characteristics of the individual stations, comparing

them becomes moot. The first law of geography (Tobler, 1970) is generally

accepted to hold. However, just because two Garda station areas are next to each other

does not mean that they are alike; there is

little point comparing Ballina in County Mayo with Killala, for example. The catchments

are neighbours, but Ballina is a generally urban environment with 14,329 people at the

last census in 2011, whereas Killala is a rural, coastal area that is physically larger than

the Ballina catchment but with only 3,766 people (Central Statistics Office, 2014).

This thesis advocates comparing stations not only by their location in the An Garda

Síochána hierarchy, but also by their similarities, specifically the underlying

geodemographics. Perhaps Killala has more in common with Kilrush in Clare or

Duncannon in Wexford and comparing it to its peers is a more sensible approach than

comparing it to its neighbours. That being said, it is expected that the classification

carried out in this work will show clusters of clusters throughout Ireland in support of Tobler's first law.

There is a large literature on geodemographic classification, particularly

for marketing in the United Kingdom and America. Brunsdon et al. (2014) have created a

geodemographic classification of Ireland at the Small Area scale. Gale et al. (2015) have

used geodemographics in London relating to crime; however there does not appear to be a

classification at the national level for Ireland relating to Garda station catchments. It is


felt that this thesis may address a gap in the literature and provide a new tool for

comparing policing in Ireland.

The first aim is to create the classification. The steps will be described in detail

throughout this thesis; however, in short, census data will be used to create the

classification. Relevant variables will be chosen and transformed for use. Methods of

reducing the data to manageable proportions will be used and then the data will be

subjected to a clustering algorithm. The resulting clusters of similar Garda Sub-districts

will then be named and described. The second part of the thesis will then use these

clusters to map various crime statistics on a national and cluster-by-cluster basis, allowing

comparisons to be made between similar Garda stations.
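The steps outlined above (transform the variables, reduce the data, cluster the result) can be sketched in a few lines. This is a minimal illustration in Python rather than the R listed in Appendix 1, and it is not the thesis's implementation. The six areas and three percentage variables are invented. k-means is assumed as the clustering algorithm. A hand-written power iteration stands in for a library PCA routine, and the evenly spaced starting centroids are a simplification of the random restarts normally used.

```python
from math import sqrt


def standardise(rows):
    """Convert each column to z-scores (mean 0, population sd 1)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def covariance(rows):
    """Covariance matrix of rows whose columns are already centred."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(p)] for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    p = len(matrix)
    vec = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p))
    return value, vec


def pca(rows, n_components):
    """Return (scores, share of variance) for the first n_components."""
    cov = covariance(rows)
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))
    components, shares = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        shares.append(value / total)
        # Deflate: remove the component just found before seeking the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)] for i in range(p)]
    scores = [[sum(a * b for a, b in zip(row, vec)) for vec in components] for row in rows]
    return scores, shares


def k_means(points, k, iterations=50):
    """Lloyd's algorithm with a deterministic start (evenly spaced rows)."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centroids[c])))
            for pt in points
        ]
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical areas and derived percentages, invented for illustration only.
names = ["A", "B", "C", "D", "E", "F"]
table = [
    [10, 45, 1], [12, 40, 2], [11, 50, 1],
    [22, 10, 20], [25, 8, 25], [20, 12, 18],
]

z = standardise(table)
scores, shares = pca(z, n_components=2)
labels = k_means(scores, k=2)

print("variance share per component:", [round(s, 3) for s in shares])
print(dict(zip(names, labels)))
```

In practice the number of components and the number of clusters would be chosen from diagnostics such as the cumulative variance and within-cluster sum of squares plots named in the List of Figures.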

It is hoped that in creating this classification, policy makers may begin to ask

questions such as: if area A has similar characteristics to area B, why are the crime

rates so different? Are more resources needed to reflect the similar catchment

demographics? Are the opening hours of a station appropriate given the social

demographic makeup of the population? It is further hoped that the answers to these

questions may be found by those with access to more information, such as Garda numbers

and skill breakdowns, and policing and social services infrastructure in each area. Lastly, it is

anticipated that this thesis can be built upon by carrying out the same classification with

up-to-date data once Census 2016 is completed and its results are made available.


Literature Review

General

Classification is a natural human process that helps us understand and make sense

of the world around us. Parker et al. (2007) assert that the natural process of classification

undertaken by lay people [...] sociospatial construction

of reality. Perec (1989), quoted in Boyne (2006), speaks of a dream of universal

classification, or law, to describe the whole world that did not, and could not, work.

So it was imagined that the entire world could be distributed according to a unique

code, that one universal law would reign over the totality of phenomena: two

hemispheres, five continents, masculine and feminine, animal and vegetable,

singular plural, right left, four seasons, five senses, six vowels, seven days, twelve

months, twenty-six letters. (Perec, 1989:155)

Vickers and Rees (2011) state that complex systems can be classified to help the

understanding of those systems. However, there is no one right system of classification;

Dupré (2006) argues that classifications will be driven by the purpose for which they were

created and that differe

This is a view that is shared by Charlton et al., who produced one of the first open [...]

the [...] classification is an arbitrary thing (Charlton, et al., 1985).

Area classification is the act of grouping areas based on selected features within

those areas; the similarity of the characteristics of the selected features drives the

classification (Vickers & Rees, 2007). Geodemographic classifications are a type of area

[...] one of the most commonly used

area classifications [...]. What can make geodemographic classifications

useful is the descriptions that generally accompany them to give a textual summary of the

attributes of each class (Abbas, et al., 2009). Geodemographics are widely used in


marketing w[...] popular segmentation technique (Doyle, 2011).

Many authors (Abbas, et al., 2009; Gale, et al., 2015; Singleton & Longley, 2008; Vickers &

Rees, 2007) describe geodemographic classifications as tools to summarise large sets of

spatially dependent data such as census data. Gale et al. (2015) state that geodemographic

classifications allow the highlighting of similarities between population structures in

different parts of a country. Gale et al. also point out that geodemographic classifications

give summaries based not only on the population but also on the built environment.

It would be remiss to discuss geodemographic classification without mentioning

Charles Booth. Between 1886 and 1903 Booth and several assistants accompanied police

officers on the beat around London to investigate places of work, working conditions,

homes and the urban environments. Through interviews and observations Booth created a

[...] (London School of Economics and Political Science, 2012). [...] (Booth, 1903)

was one of the first attempts to map [and classify] social-spatial structures (Alexiou &

Singleton, 2015). A portion of the digitised maps along with the classification Booth used

is shown below in Figure 1.

Figure 1:

Source: London School of Economics and Political Science (2012)


Theory

While geodemographics have a long history of being used in one form or another,

it must be acknowledged that the theory driving geodemographics is less robust than the

[...] "everything is related to everything else, but near things are more related

than distant things" (Tobler, 1970). Singleton and Longley (2009) note that the theoretical

[...] Alexiou and Singleton (2015),

who state that classifications based on geodemographics lack solid theory. Another issue

with many geodemographic classifications is that they are not generally geographically

weighted, and due to the methods of their construction are aspatial in design despite

showing spatial correlations in the results (Alexiou & Singleton, 2015). While the issues

with theoretical grounding are acknowledged, Singleton and Longley express their hopes

for best-practice geodemographics in that they are: focused, recognise the provenance of

the data used, are scientifically reproducible and use the best methods available

(Singleton & Longley, 2009).

Modifiable Areal Unit Problem

Geodemographic analysis is generally agreed to be best carried out at the smallest

areal unit available in order not to lose spatial variation that larger units may obscure

(see, for example, Alexiou & Singleton, 2015; Charlton, et al., 1985; Gale, et al., 2015).

However, the scale also depends on the purpose of the classification (Alexiou

& Singleton, 2015); therefore a balance must be struck. Another factor to consider is the

Modifiable Areal Unit Problem (MAUP). Gehlke and Biehl (1934) noted that choices in

data aggregation over space and the size of areal unit used in analysis have influence over

the correlation coefficient. This was expanded upon by Openshaw and Taylor (1979) who


coined the phrase "Modifiable Areal Unit Problem". They conducted experiments on a

spatial data set and found that they could obtain correlations of between -0.99 and 0.99 from

different levels of aggregation. Charlton and Brunsdon (2016) presented a paper at the

GIS Research UK Conference which revisited the work by Gehlke and Biehl using census

data from Ireland at several different official aggregation levels ranging from Small Areas

to Counties. They were able to show that the larger areas lost variance between areas and [...].

While the MAUP is a factor, there is not much that can be done about it if, for

example, data is only available at one areal unit scale. If using official areal units one

must also be aware of the MAUP when comparing results over time, as these boundaries

may change. An example of this is the Irish electoral constituencies, which were changed

by an act of the Irish Parliament (Houses of the Oireachtas, 2013), as required every

twelve years by Article 16.4 of the Irish Constitution (Government Publications, 2016).

Likewise, and more relevant for this work, are changes to Garda Sub-districts: the

boundaries were changed in 2013 following the closure of some 100 Garda stations (An

Garda Síochána, 2013; Central Statistics Office, 2014). This is an issue noted in the

United Kingdom in relation to British police Basic Command Units (BCU), where Ashby

and Longley (2005) note that "Maintaining the BCU families is an arduous task due to the

[...]". Therefore any follow-up to this thesis

should be aware of the possibility of the MAUP affecting results should the boundaries be

changed by An Garda Síochána.
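The scale and zoning effects described in this section can be illustrated with a small synthetic example. The sketch below is written in Python rather than the R of Appendix 1, and it does not reproduce the Openshaw and Taylor experiment or the Charlton and Brunsdon analysis. The 120 units, their values and the zone sizes are all invented. The local variation is deliberately constructed so that it cancels whenever an even number of neighbouring units is averaged.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def aggregate(values, zone_size):
    """Average consecutive units into zones of `zone_size` units each."""
    return [
        sum(values[i:i + zone_size]) / zone_size
        for i in range(0, len(values), zone_size)
    ]


# Synthetic data: 120 units along a line share a common trend, plus local
# variation that pushes x and y in opposite directions in alternating units.
N_UNITS = 120
trend = [0.1 * i for i in range(N_UNITS)]
local = [10.0 if i % 2 == 0 else -10.0 for i in range(N_UNITS)]
x = [t + l for t, l in zip(trend, local)]
y = [t - l for t, l in zip(trend, local)]

# Correlation between x and y at four aggregation levels of the same data.
correlations = {
    size: pearson(aggregate(x, size), aggregate(y, size))
    for size in (1, 2, 3, 10)
}

for size, r in correlations.items():
    print(f"zone size {size:>2}: r = {r:+.2f}")
```

With these invented values the coefficient is strongly negative for individual units, close to +1 for zones of two or ten units, and near zero for zones of three, even though the underlying units never change.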


Variables

Selection

Harris et al. (2005) explain that a geodemographic classification is created by

grouping areas that are alike into a number of classes, often based on census data. As

geodemographic classifications are a way of summarising social, demographic and built

characteristics of zoned geography (Gale, et al., 2015), it makes sense to use census data.

Vickers and Rees (2007) also argue that a national census (in their case British, but the

principle holds) stands above other sources due to its amount of data. The choices of

variables to use in the classification drive its outcome.

However, they do state that the choices are very difficult to make. Vickers and Rees

(2007) suggest that variables should be chosen only if there is a good reason; this implies

that including a variable just because one has the data is not the best policy.

Correlation

Variable correlation can be an issue when using census data to inform analysis.

Collinearity in the data can affect the performance of any significance tests that may be

carried out on, for example, linear regressions (Anderson, et al., 2010). In classification

exercises correlation between variables creates redundancy in the input data (Alexiou &

Singleton, 2015). Two main methods exist for dealing with correlation within the variables,

and it is acknowledged that there is no general rule (Vickers & Rees, 2007). One

approach adopted is to remove one of each pair of highly correlated variables from use

(Alexiou & Singleton, 2015; Vickers & Rees, 2007). Another approach is to use Principal

Components Analysis (PCA) to transform a set of N correlated variables into a set of n

uncorrelated principal components. This approach was used by Charlton et al. (1985) and

Brunsdon et al. (2014). With principal components, each component is a linear

combination of the parent variables so the variance is retained but the components are

uncorrelated (Alexiou & Singleton, 2015). Additionally, the first component accounts for

the most variance and each component adds less to the overall variance explained

(Jolliffe, 2002). The user can therefore decide how much variance they are willing to

sacrifice in order to reduce dimensionality in the data by using fewer principal

components than the number of input variables (Charlton, et al., 1985; Harris, et al., 2005;

Jolliffe, 2002).

Units

Variables used in geodemographics can be reported in different units such as

percentages of population, count data, indices etc. (Alexiou & Singleton, 2015). This can

make comparison of variables difficult. The census of Ireland (Central Statistics Office,

2014) reports most available data as a count of people; therefore it is relatively simple to

convert any required variable to percentage of population within the spatial unit. This not

only makes the variables easier to compare, it also stops areas with high population

figures affecting the analysis due to higher absolute numbers. The variables may also be

standardised using z scores to allow for true comparison of each individual variable's

influence on a cluster. The z score is a measure of the relative location in a data set of the

observation, therefore data points in two different data sets with the same z score have the

same relative location, i.e. they are the same number of standard deviations from the

mean (Anderson, et al., 2010).
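The two steps described above, converting counts to percentages and then standardising, can be illustrated with a minimal sketch in R. The data frame, its column names (pop, loneParents, students) and its values are invented for the example and are not census variable codes.

```r
# Hypothetical counts for three areas; names and values are invented.
areas <- data.frame(
  pop         = c(500, 2000, 98000),
  loneParents = c(25, 180, 9800),
  students    = c(40, 300, 4900)
)

# Convert counts to percentages of the population in each area.
pct <- 100 * areas[, c("loneParents", "students")] / areas$pop

# Standardise each column: subtract the column mean, divide by the column sd.
z <- scale(pct)

# Each standardised column now has mean 0 and standard deviation 1.
round(colMeans(z), 10)
apply(z, 2, sd)
```

Note that the largest area no longer dominates: its percentages sit on the same scale as those of the smallest area.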

Clustering

Methods

Clustering involves finding subsets of interest within a larger set; the subsets are

called clusters and are usually homogeneous within each cluster and separated between

clusters (Hansen & Jaumard, 1997). Gordon (1987) notes that it is

, Vickers and Rees (2007) maintain that

there is no right or wrong way to classify. Commercial classifications tend to build from

the ground up, clustering at the smallest available level then aggregating into larger

groups (Singleton & Longley, 2008). The open Output Area Classification in the UK,

however, was clustered from the top down by creating several large clusters that were

then subjected to clustering techniques separately (Vickers & Rees, 2007). It is widely

acknowledged among the available literature that k-means clustering is the technique of

choice for geodemographic clustering. This is shown in either the acknowledgement of k-

means in theoretical papers or the use of k-means in applied papers (Abbas, et al., 2009;

Alexiou & Singleton, 2015; Brunsdon, et al., 2014; Charlton, et al., 1985; Vickers &

Rees, 2007).

K-means clustering is seen to have something of an advantage over other methods

such as agglomerative, divisive, constructive or direct optimisation (described well in

Gordon (1987)). This is because they are all hierarchical in nature and will force a

hierarchy on the output even if one does not exist (Gordon, 1987). K-means will not force

a hierarchy, although the user must specify the number of clusters beforehand (Singleton & Longley, 2008). Clustering techniques

generally require a measure of dissimilarity between observations (Jolliffe, 2002). K-

means uses the squared Euclidean distance (Alexiou & Singleton, 2015). In essence k-

means uses k clusters to sort n observations while minimising the sum of squared errors

(Alexiou & Singleton, 2015; Ding & He, 2004). K-means assigns each observation to a

cluster while minimising sum of squares, a new set of means is then calculated and the

process begins again. The process only stops when the within cluster sum of squares

(WCSS) reaches a (local) minimum. This occurs when cluster assignments no longer change as any

changes would not make the sum of squares smaller (Alexiou & Singleton, 2015).
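The assign-then-recalculate cycle described above can be sketched in a few lines of R. This is an illustration on invented two-dimensional data using a hand-written loop, not the code used in this thesis; the starting centres are chosen arbitrarily as the first and last observations.

```r
set.seed(1)
# Toy data: two well-separated groups in two dimensions (invented values).
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))

k <- 2
centres <- x[c(1, nrow(x)), , drop = FALSE]  # arbitrary starting centres
assignment <- rep(0L, nrow(x))

repeat {
  # Squared Euclidean distance from every observation to every centre.
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centres[j, ])^2))
  new_assignment <- max.col(-d2, ties.method = "first")  # nearest centre
  if (all(new_assignment == assignment)) break           # no change: stop
  assignment <- new_assignment
  # Recalculate each centre as the mean of its assigned observations.
  for (j in seq_len(k)) {
    centres[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
  }
}

# Within cluster sum of squares (WCSS) for the final assignment.
wcss <- sum((x - centres[assignment, ])^2)
```

The loop stops as soon as a pass leaves every assignment unchanged, which is the stopping condition described in the text. In practice the built-in kmeans() function would be used instead of a hand-written loop.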

Cluster Description

Once the WCSS is minimised and the clusters are assigned, the results need to be

described. The aim of cluster descriptions is to provide a short profile of each cluster for

the end user. Vickers and Rees (2007) explain that profiles use text and visuals to help the

sentences. The cluster labelling and

description process is acknowledged by Vickers and Rees (2007) to be difficult and

subject to much thought, in order not to mislead the user or offend the people living in the

areas classified.

Cluster descriptions draw on the main identifiable (Debenham, 2002), dominant

(Abbas, et al., 2009) characteristics of a cluster. Often, the process involves using z scores

to identify extreme variables within the cluster compared to the global mean (Debenham,

2002; Vickers & Rees, 2007). The descriptions that are attached to geodemographic

classifications are viewed as useful to other researchers (Abbas, et al., 2009). Parker et al.

(2007) regard the descriptions as the most sociologically interesting element of the

geodemographic classification process.

Therefore the naming and description element of geodemographic classification should

not be overlooked or given less attention than the more statistical elements of the process.

Data

Three data sets are used throughout this work to create the classification:

1. Census 2011, all 764 reported census variables in columns, at Garda Sub-district

level in 563 rows (Central Statistics Office, 2014).

2. Garda Sub-district boundary files for use in mapping the outcomes (Central

Statistics Office, 2014a).

3. Crime data for Ireland at Garda Sub-district level (Central Statistics Office, 2016).

The crime data are aggregated at the Garda Sub-district level into 12 crime types:

- Attempts/threats to murder, assaults, harassments and related offences.

- Dangerous or negligent acts.

- Kidnapping and related offences.

- Robbery, extortion and hijacking offences.

- Burglary and related offences.

- Theft and related offences.

- Fraud, deception and related offences.

- Controlled drug offences.

- Weapons related offences.

- Damage to property and to the environment.

- Public order and other social code offences.

- Offences against government, justice procedures and organisation of crime.

The data for this study fall in to two main areas: socio-demographic data from the

Census of Ireland (Central Statistics Office, 2014), and crime data (Central Statistics

Office, 2016). Both sets of data are reported at the Garda Sub-district level.

There are 563 Garda Sub-districts in Ireland (Central Statistics Office, 2016). The

Sub-districts

Louisburgh in Mayo. They have populations ranging from 384 in Sraith Salach in Galway

to 98,078 in Blanchardstown in Dublin (Central Statistics Office, 2014).

These Sub-districts are based loosely on the official geography of Irish

Townlands, but were designed by the Examiner of Maps (GIS) at An Garda Síochána to

suit the organisation's needs (Creaner, 2016). The Sub-districts are a unique data set in

that they are designed for operational rather than statistical reasons. It is acknowledged by

Creaner that using Small Areas would be better statistically. However the use of Small

a catchment area in the middle of a motorway (for example), as may happen with Small

Areas (Creaner, 2016). The Garda Sub-districts are shown, grouped into administrative

divisions, in Map 1¹.

It is not known exactly how the 2011 census data were attached to the 2013 Garda

Sub-district geography. The assumption in this thesis is that the CSO would be able to

populate the new boundaries at the household level. At the time of writing the CSO have

not replied to my queries; however, the designer of the Sub-districts (Creaner, 2016) has

agreed that this assumption is a reasonable one. Another option would be to populate the

Sub-districts by centroid based on a smaller unit such as Small Areas. If this is the case, it

is not felt that there would be too much loss of overall validity due to the number of Small

1 All maps produced in this thesis use boundary files from the CSO website and contain Ordnance Survey Ireland Data (Ordnance Survey Ireland, 2012).

Areas (18,488) being assigned to one of 563 Sub-districts. Aside from this ambiguity,

there is no known issue with Irish Census Data.

Map 1: Garda Sub-districts by division

The census data used in this thesis is made up of 764 variables that are derived

from household level data and amalgamated to the various levels reported from 18,488

Small Areas to four Provinces (Central Statistics Office, 2014). The MAUP discussed in

the literature review section is relevant, as carrying out the classification at different scales

will produce different results. It may be possible to carry out the classification at Small

Area level and then combine these results into larger unit scales. However, this is not

appropriate for this study. Firstly, there are no smaller units that fit in to Sub-districts due

to the proprietary nature of the Sub-districts. Secondly all data required are available at

the Sub-district level. Lastly, Sub-districts are the smallest geographical unit at which crime

data are released in Ireland (Central Statistics Office, 2016). Therefore it is acknowledged

that variation within the Garda Sub-districts is lost during this study. However because

this classification is for the purpose of being able to compare Garda Stations on a like for

like basis, it is felt that the loss is acceptable at a national level.

Building the Classification

The classification was built in R, a free, open source statistical computing

environment that can handle large amounts of data (The R Foundation, 2015). Some code

blocks will be included where needed for clarification. However in the interest of

reproducibility the full R code that created the classification is included in Appendix 1.

Variables

This classification aims to give reproducible results that are comparable to similar

studies at different scales. Charlton et al. (1985) chose their variables to give a

comparable classification between their open one and the commercial ACORN

classification in the UK. Brunsdon et al. (2014) chose Irish Census variables at the Small

Area Scale to reflect the OAC classification variables chosen by Vickers and Rees (2007).

Therefore this study will use the same variables as Brunsdon et al. The full list of variable

codes reported by the Census is available for download from the CSO website (Central

Statistics Office, 2014). An adapted list is included in Appendix 2 for reference should the

reader require clarification on any variable codes used.

The variables chosen for the classification exercise are actually derived variables.

Each variable is made up of two or more individual census variables to derive variables

that are percentages of the population of the area in question. For example one variable

used is that of lone parents. This variable is derived by adding the Lone Mothers with

Children (number of families) to Lone Fathers with Children (number of families),

dividing by the total number of families and multiplying the result by 100. The actual

code is shown below.

loneParent <- 100*(T4_3FTLF + T4_3FTLM) / T4_5TF
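For readers who wish to try the derivation on a data frame, a minimal sketch follows. Only the three column names are taken from the line above; the data frame name (census), the GEOGID values and the counts are invented for illustration.

```r
# Invented counts for three Sub-districts; only the column names follow
# the census codes quoted in the text.
census <- data.frame(
  GEOGID   = c("A", "B", "C"),
  T4_3FTLF = c(10, 40, 5),      # lone parent family count (first code)
  T4_3FTLM = c(40, 160, 20),    # lone parent family count (second code)
  T4_5TF   = c(500, 1000, 100)  # total number of families
)

# with() evaluates the expression using the data frame's own columns.
census$loneParent <- with(census, 100 * (T4_3FTLF + T4_3FTLM) / T4_5TF)
census[, c("GEOGID", "loneParent")]
```

Using with() avoids attaching the data frame while keeping the expression identical to the one quoted above.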

In all there are 40 derived variables used in the classification grouped into six

areas: demographic, household composition, housing, socioeconomic, employment and

connectivity. The derived variables are shown in Table 1; the actual make up of each

derived variable can be seen in the R code in Appendix 1. As stated, the variables from

the CSO were reported at Garda Sub-district level. This aided in mapping as both the

census file and the boundary file contained unique geography ID numbers (GEOGID) for

each Sub-district. The numbers were slightly different in that one set carried a prefix.

This was removed with a find and replace command in Excel before the census file was

loaded in R; however, it could just as easily have been carried out in R.
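A sketch of how the same clean-up might be done in R is given below. The "G" prefix, the ID values and both data frames are hypothetical stand-ins; the actual prefix and file layouts are not reproduced here.

```r
# Hypothetical IDs: the "G" prefix is invented for illustration only.
census   <- data.frame(GEOGID = c("G101", "G102", "G103"),
                       loneParent = c(10, 20, 25))
boundary <- data.frame(GEOGID = c("101", "102", "103"),
                       name = c("X", "Y", "Z"))

# Remove the leading prefix so both ID columns share one format.
census$GEOGID <- sub("^G", "", census$GEOGID)

# Join on the common key; rows without a match are dropped by default.
joined <- merge(boundary, census, by = "GEOGID")
```

Doing the replacement in code rather than in a spreadsheet keeps the step inside the reproducible script.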

Theme | Derived Variable | Description
Demographics | Age0-4 | Percentage of population aged 0-4
Demographics | Age5-14 | Percentage of population aged 5-14
Demographics | Age25-44 | Percentage of population aged 25-44
Demographics | Age45-64 | Percentage of population aged 45-64
Demographics | Age65+ | Percentage of population aged 65 and over
Demographics | EUNat | Percentage of population that is European by nationality (excluding Irish)
Demographics | RestofWorld | Percentage of population where nationality was given as Rest of the World
Demographics | BornOutsideIRE | Percentage of population not born in Ireland
Household Composition | Separated | Percentage of persons separated or divorced
Household Composition | SinglePerson | Percentage of persons (non pensioners) living in one person households
Household Composition | Pensioner | Percentage of persons who are pensioners
Household Composition | LoneParent | Percentage of families that are lone parent families
Household Composition | NoChildren | Percentage of families that are 'pre family' (no children born)
Household Composition | NonDependChildren | Percentage of families with children where the youngest child is 20+
Housing | RentPublic | Percentage of total households rented from local authority
Housing | RentPrivate | Percentage of total households privately rented
Housing | Flats | Percentage of total households defined as flats
Housing | NoCentralHeat | Percentage of total households with no central heating
Housing | RoomsHH | Average number of rooms per household
Housing | PeoplePerRoom | Total persons divided by total rooms
Housing | SepticTank | Percentage of total households with an individual septic tank
Socioeconomic | HEQual | Percentage of persons with an Ordinary Bachelors Degree or higher
Socioeconomic | Employed | Percentage of persons at work
Socioeconomic | TwoCars | Percentage of households with two or more cars
Socioeconomic | JTWPublic | Percentage of persons over age 5 who travel to school, college or work by means of bus or rail
Socioeconomic | HomeWork | Percentage of persons self employed (Own account workers)
Socioeconomic | LLTI | Percentage of persons reporting bad or very bad health
Socioeconomic | UnpaidCare | Percentage of persons providing unpaid care
Employment | Students | Percentage of persons who are students
Employment | Unemployed | Percentage of persons who are unemployed having lost or given up jobs
Employment | EconinactFam | Percentage of persons looking after home/family - homemakers
Employment | Agric | Percentage of workers who work in agriculture, forestry or fishing
Employment | Construction | Percentage of workers who work in construction
Employment | Manufacturing | Percentage of workers who work in manufacturing
Employment | Commerce | Percentage of workers who work in commerce and trade
Employment | Transport | Percentage of workers who work in transport and communication
Employment | Public | Percentage of workers who work in public administration
Employment | Professional | Percentage of workers who work in professional services
Connectivity | Broadband | Percentage of internet connected households with broadband
Connectivity | Internet | Percentage of total households with some kind of internet access

Table 1: Derived variables used in the classification

It can be seen in Table 1 that there will be issues in the data using the derived

variables as they are. For example Separated can be expected to be correlated with

LoneParent. As mentioned previously, Principal Components Analysis (PCA) is a set of

methods that can take in variables that may be correlated and produce a set of

uncorrelated principal components. PCA is also used to reduce the size of a clustering

computational problem (Jackson, 1991).

Principal Components Analysis

Once the 40 variables were chosen and derived from the census variables they

were subjected to Principal Components Analysis. The reasons were twofold; firstly to

reduce the dimensionality of the data from 40 to a more manageable number. Secondly,

PCA was used to remove any correlation in the data. The clustering algorithm used for the

classification was k means, which assumes no correlation. Therefore PCA is essential to

provide k means with a set of uncorrelated variables to carry out the clustering. The

components are linear combinations of the original variables. Each one contains a

proportion of the variance in the original data and they are ordered by the amount of

variance they explain. Therefore it is possible to view the cumulative variance explained

by the components and decide how many to use in the k means clustering.

As mentioned previously, there is a lack of theory in this regard; however, the

majority of the variance should be kept, otherwise the analysis loses too much

information to make the PCA worth doing. Jolliffe (2002) describes the choice of a cut off

of variance as an ad hoc rule-of-thumb that works in practice. Jolliffe suggests a range

from 70% to 90% to retain m components where m is the smallest integer for which the

cumulative variance explained is greater than the cut off.

There is a function in R that calculates the Principal Components for a user, called

princomp(). This function takes in the relevant variables and performs a PCA. It is

then possible to view the cumulative variance explained by each component to choose

how many to use in the clustering process. For detailed explanations of Principal

Components, works by Jackson (1991), Jolliffe (2002), or Rencher and Christensen

(2012) are recommended. However, the main steps involved are:

1. Get data: in this case 40 derived variables × 563 Garda Sub-districts

2. Subtract mean of each variable from each instance of the variable

3. Calculate correlation matrix

4. Calculate eigenvalues and eigenvectors of the correlation matrix
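The steps listed above can be followed by hand in R and checked against princomp(). The sketch below uses a small invented data set (50 observations of 4 variables) rather than the census data, and the object names are chosen for the example only.

```r
set.seed(42)
# Invented data: 50 observations of 4 variables, two of them correlated.
x <- matrix(rnorm(200), ncol = 4)
x[, 2] <- x[, 1] + 0.3 * x[, 2]

# Steps 2 and 3: centre the data and form the correlation matrix.
centred <- scale(x, center = TRUE, scale = FALSE)
R <- cor(centred)

# Step 4: eigenvalues give the component variances,
# eigenvectors give the loadings.
e <- eigen(R)
prop_var <- e$values / sum(e$values)

# The same proportions from princomp() on the correlation matrix.
pca <- princomp(x, cor = TRUE)
prop_var_princomp <- unname(pca$sdev^2 / sum(pca$sdev^2))
```

The two sets of proportions should agree, and the component scores in pca$scores are uncorrelated with one another, which is the property relied upon for the clustering.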

The Principal Components were calculated and their cumulative explanation of the

variance of the original derived variables displayed by entering the following two lines of

code in to R.

pca <- princomp(gardaVars[, -1], cor=T, scores=T)

cumsum(pca$sdev^2/sum(pca$sdev^2))

The cumulative variance explained by the components is shown in Figure 2.

Figure 2: Cumulative variance explained by components 1:40

As per general recommendations mentioned earlier, this study will use the first m

components that total to at least 80% of the variance. This means that the first nine

components will be used in the study, as they account for 80.65% of the variance. This

was seen as a good cut off as the other 31 components only accounted for 19.35% of the

variance in the original data set between them. Another reason for not including the tenth

component is that it is the first component that fails a test that suggests each component

should contribute more than a minimum share, set by the number of variables p, of the

variance (Jolliffe, 2002). With p equal to 40, the fact that component ten only explains an extra 1.64% of the variance

excludes its use in the classification process.
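The cut off rule can be expressed in a couple of lines of R. The proportions below are invented for a six component example and do not reproduce the values obtained in this study; the object names are likewise illustrative.

```r
# Invented proportions of variance for six components (they sum to 1).
prop_var <- c(0.45, 0.20, 0.12, 0.09, 0.08, 0.06)
cum_var  <- cumsum(prop_var)

cut_off <- 0.80
# Smallest m whose cumulative variance reaches the cut off.
m <- which(cum_var >= cut_off)[1]
m
```

In this toy case the first three components explain 77% of the variance, so a fourth is needed to pass the 80% cut off.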

Clustering

The work up to this point has concentrated on getting the data ready for the

clustering process. The nine principal components represent a much smaller data set than

the 40 derived variables; they are also not correlated. This means that they are ready for

use in a clustering algorithm. The method of clustering used in this thesis is the k means

technique as described in the literature review. The k means method was chosen as it is

the system of choice in most geodemographic classifications (Abbas, et al., 2009; Alexiou

& Singleton, 2015; Brunsdon, et al., 2014; Charlton, et al., 1985; Vickers & Rees, 2007).

K means requires the number of clusters to be known before it sets out to minimise the

within cluster sum of squares. However this is not an issue in R. It is possible to run k

means through a loop in R; in this way it is possible to run the clustering exercise many

times with different possible numbers of clusters and for the user to pick the best number

for k based on the results. In order to pick the best number for k, the k means process was

run 100 times with k starting at one and being increased by one each time. The results

were then plotted on a scree plot shown in Figure 3. The code to run through this loop is

shown below.

nPC 14) would not add to

the classification. In addition splitting the clusters into smaller units may have created

clusters that were too nuanced for the purpose of comparing Garda Sub-districts. The final

step in the clustering process was to join the cluster numbers to the GEOGID numbers of

individual Garda stations so that the results could be mapped.

It is possible to cluster a set of clusters in order to create higher order super groups

for description purposes. However with only 14 clusters and 563 areas it was felt that a

second level of clustering was unnecessary for this classification.
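The loop over k described in this section could be sketched as follows. This is an illustration under stated assumptions and not the code used for the classification (which is in Appendix 1): the score matrix is random stand-in data with nine columns, the object names are invented, and the upper limit for k is reduced to 20 to keep the example quick.

```r
set.seed(123)
# Stand-in for the retained component scores: invented data,
# 200 rows by 9 columns.
scores <- matrix(rnorm(200 * 9), ncol = 9)

max_k <- 20  # the text describes 100 runs; kept small here
wcss <- numeric(max_k)
for (k in seq_len(max_k)) {
  # nstart repeats each run from several random starts and keeps the best.
  fit <- kmeans(scores, centers = k, nstart = 5, iter.max = 50)
  wcss[k] <- fit$tot.withinss
}

# Scree plot: look for the point where adding clusters
# stops reducing the WCSS by much.
plot(seq_len(max_k), wcss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Within cluster sum of squares")
```

Because k means starts from random centres, using several starts per value of k (the nstart argument) makes the curve less sensitive to an unlucky initialisation.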

Figure 3: Scree plot of (WCSS) for k = [1:100]

Figure 4: Heatmap of clusters where k=14

Cluster Description and Naming

With the clusters set, they needed to be named and described. For some (Abbas, et

al., 2009; Parker, et al., 2007) the descriptions attached to clustering exercises such as this

one are the most interesting element of the final geodemographic classification process.

Vickers and Rees (2011) state that cluster names may be the primary source of

information used when judging a cluster in a classification by the end user.

Naming and describing the clusters was a multi stage process. All of the clusters

were mapped in order to check that the classification seemed spatially sensible. Then the z

scores of the original derived variables were calculated for each cluster. The mean z score

across all clusters was also calculated for each variable ($\mu$), as was the standard deviation

($\sigma$). Clusters where $\mu - \sigma > z$ were deemed to have an extremely low z

score for that variable. Clusters where $z > \mu + \sigma$ were deemed to have an extremely high z

score for that variable. These extreme highs or lows accounted for 25% of the variables.

The extreme values informed the name attached to the clusters as they were

deemed to be dominant and identifiable characteristics of the cluster in question (Abbas,

et al., 2009; Debenham, 2002; Vickers & Rees, 2007). A table of the extremes identified

is reproduced in Table 2. In addition to the extreme z scores, the scores that were above or

below average for that variable informed more detail in the descriptions where necessary.
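The flagging of extreme values can be illustrated with a short sketch. The z scores below are invented for a single variable across six hypothetical clusters, and the cut off of one standard deviation either side of the mean is an assumption of this example rather than a confirmed setting of the study.

```r
# Invented cluster-level z scores for one variable across six clusters.
z <- c(c1 = -1.4, c2 = -0.2, c3 = 0.1, c4 = 0.3, c5 = 0.0, c6 = 1.8)

mu    <- mean(z)  # mean z score across clusters
sigma <- sd(z)    # standard deviation across clusters

# Assumed cut offs: one standard deviation either side of the mean.
extreme_low  <- names(z)[z < mu - sigma]
extreme_high <- names(z)[z > mu + sigma]
```

Repeating this for every variable gives, for each cluster, the short list of characteristics that stand out and can inform its name.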

For the assistance of the end user a radar plot for each cluster is included in the

cluster descriptions. This plot shows the global average z score for each variable in red

and the z score for the cluster in blue. It also shows the extreme value cut offs in green

and purple.

Table 2: Extreme z scores for variables within clusters
