
  • A Geodemographic Classification of

    Ireland at Garda Sub-district Level

    A Tool for Comparing Sub-districts of An Garda Síochána

    Gary Russell

    M.Sc. Geocomputation

    National Centre for Geocomputation,

    Maynooth University

    2016

    Professor Christopher Brunsdon, Head of Department

    Martin Charlton, Programme Coordinator and Thesis Supervisor

  • i

    Abstract

    This thesis uses demographic data from the 2011 census to classify the catchment

    areas of Garda stations in the Republic of Ireland (referred to as Garda Sub-districts). This

    was accomplished by subjecting the data at a Garda Sub-district level to principal

    component analysis and then using clustering techniques on the resulting principal

    components. Similar (geodemographically speaking) Garda station areas were then

    visualised using mapping techniques based on Central Statistics Office boundary shape

    files. The clusters of Garda Sub-districts were named and described using the distribution

    of the characteristic variables compared to the global mean of each variable. As an

    example of the use of such data manipulation, a mini atlas of Garda Sub-districts was

    created by comparing the crime figures of Garda Sub-districts within the resulting clusters

    and visualising the results.

    Acknowledgements

    I would like to acknowledge and thank Professor Chris Brunsdon, Martin Charlton,

    and Dr Ronan Foley, as well as all the staff and researchers in the Maynooth University

    Departments of Geography and Computer Science, the National Centre for

    Geocomputation and the All Ireland Research Observatory for the inspiration and

    assistance in completing this thesis. I would also like to thank my three classmates for

    their support and for putting up with my using them as sounding boards over the past year.

    Lastly, I would like to thank my wife, Siobhan, and our two young children, Aoibhínn and

    Joshua, for their unwavering support and belief that I could complete my B.A. and M.Sc.

    despite the odd grumpy daddy moment.

  • ii

    Table of Contents

    Abstract ................................................................................................................................... i

    Acknowledgements ................................................................................................................ i

    Table of Contents .................................................................................................................. ii

    List of Figures ....................................................................................................................... iv

    List of Maps ........................................................................................................................... v

    List of Tables ........................................................................................................................ vi

    Introduction ......................................................................................................................... 1

    Literature Review ................................................................................................................ 4

    General ............................................................................................................................... 4

    Theory ................................................................................................................................ 6

    Modifiable Areal Unit Problem ......................................................................................... 6

    Variables ............................................................................................................................ 8

    Selection ......................................................................................................................... 8

    Correlation ..................................................................................................................... 8

    Units ............................................................................................................................... 9

    Clustering ......................................................................................................................... 10

    Methods ........................................................................................................................ 10

    Cluster Description ...................................................................................................... 11

    Data ..................................................................................................................................... 12

    Building the Classification ................................................................................................ 16

    Variables .......................................................................................................................... 16

    Principal Components Analysis ....................................................................................... 19

    Clustering ......................................................................................................................... 21

    Cluster Description and Naming ...................................................................................... 24

    Results ................................................................................................................................. 26

    Cluster One: Comfortable home owning young families in semi rural areas. ................. 28

    Cluster Two: Young, mobile, affluent, multicultural singles .......................................... 30

    Cluster Three: Struggling labouring communities ........................................................... 32

    Cluster Four: Young labouring families in outer commuter areas .................................. 34

    Cluster Five: Settled older rural communities ................................................................. 36

    Cluster Six: Urban city peripheral communities ............................................................. 38

  • iii

    Cluster Seven: Struggling rural aging communities ........................................................ 40

    Cluster Eight: Rural farming communities ...................................................................... 42

    Cluster Nine: Small rural Townlands .............................................................................. 44

    Cluster Ten: Labouring rural communities in older housing stock ................................. 46

    Cluster Eleven: Young educated commuter families....................................................... 48

    Cluster Twelve: Comfortable rural farming communities ............................................... 50

    Cluster Thirteen: Affluent professional commuters in larger homes............................... 52

    Cluster Fourteen: Semi rural periphery manufacturing communities ............................. 54

    The Urban Rural Divide .................................................................................................. 56

    Crime Atlas ........................................................................................................................ 58

    Set Up .............................................................................................................................. 58

    Theft ................................................................................................................................. 60

    Assault ............................................................................................................................. 62

    Burglary ........................................................................................................................... 64

    Damage to Property or the Environment ......................................................................... 66

    Dangerous Acts ................................................................................................................ 68

    Drugs ................................................................................................................................ 70

    Fraud ................................................................................................................................ 72

    Kidnapping ....................................................................................................................... 74

    Public Order ..................................................................................................................... 76

    Robbery ............................................................................................................................ 78

    Weapons ........................................................................................................................... 80

    Offences against the State, Justice or Organised Crime .................................................. 82

    Crime Atlas Comments .................................................................................................... 84

    Issues with Data and Analysis .......................................................................................... 85

    Conclusion .......................................................................................................................... 87

    Bibliography ....................................................................................................................... 89

    Appendix 1: R Code Used Throughout the Thesis ............................................................... 92

    Appendix 2: Table of Census Themes .................................................................................. 99

    Appendix 3: Garda Sub-district Look-Up Tables .............................................................. 105

  • iv

    List of Figures

    Figure 1: .................................................. 5

    Figure 2: Cumulative variance explained by components 1:40 ....................................................... 20

    Figure 3: Scree plot of (WCSS) for k = [1:100] ............................................................................... 23

    Figure 4: Heatmap of clusters where k=14 ....................................................................................... 23

    Figure 5: Radar plot of Cluster One ................................................................................................. 29

    Figure 6: Radar plot of Cluster Two ................................................................................................ 31

    Figure 7: Radar plot of Cluster Three .............................................................................................. 33

    Figure 8: Radar plot of Cluster Four ................................................................................................ 35

    Figure 9: Radar plot of Cluster Five ................................................................................................. 37

    Figure 10: Radar plot of Cluster Six ................................................................................................ 39

    Figure 11: Radar plot of Cluster Seven ............................................................................................ 41

    Figure 12: Radar plot of Cluster Eight ............................................................................................. 43

    Figure 13: Radar plot of Cluster Nine .............................................................................................. 45

    Figure 14: Radar plot of Cluster Ten ............................................................................................... 47

    Figure 15: Radar plot of Cluster Eleven ........................................................................................... 49

    Figure 16: Radar plot of Cluster Twelve .......................................................................................... 51

    Figure 17: Radar plot of Cluster Thirteen ........................................................................................ 53

    Figure 18: Radar plot of Cluster Fourteen ....................................................................................... 55

    Figure 19: Box plots by Cluster of theft related crime ..................................................................... 61

    Figure 20: Box plots by Cluster of assault related crime ................................................................. 63

    Figure 21: Box plots by Cluster of burglary related crime ............................................................... 65

    Figure 22: Box plots by Cluster of damage related crime ................................................................ 67

    Figure 23: Box plots by Cluster of crimes relating to dangerous acts .............................................. 69

    Figure 24: Box plots by Cluster of drug related crime ..................................................................... 71

    Figure 25: Box plots by Cluster of fraud related crime .................................................................... 73

    Figure 26: Box plots by Cluster of kidnapping related crime .......................................................... 75

    Figure 27: Box plots by Cluster of public order and social code crime ........................................... 77

    Figure 28: Box plots by Cluster of robbery related crime ................................................................ 79

    Figure 29: Box plots by Cluster of weapons related crime .............................................................. 81

    Figure 30: Box plots by Cluster of offences against the State, justice or organised crime .............. 83

    Figure 31: Variance in crime rates explained by the clustering classification ................................. 84

  • v

    List of Maps

    Map 1: Garda Sub-districts by division ............................................................................................ 14

    Map 2: Garda Sub-districts by cluster .............................................................................................. 26

    Map 3: Cluster One .......................................................................................................................... 28

    Map 4: Cluster Two ......................................................................................................................... 30

    Map 5: Cluster Three ....................................................................................................................... 32

    Map 6: Cluster Four ......................................................................................................................... 34

    Map 7: Cluster Five .......................................................................................................................... 36

    Map 8: Cluster Six ........................................................................................................................... 38

    Map 9: Cluster Seven ....................................................................................................................... 40

    Map 10: Cluster Eight ...................................................................................................................... 42

    Map 11: Cluster Nine ....................................................................................................................... 44

    Map 12: Cluster Ten ......................................................................................................................... 46

    Map 13: Cluster Eleven .................................................................................................................... 48

    Map 14: Cluster Twelve ................................................................................................................... 50

    Map 15: Cluster Thirteen ................................................................................................................. 52

    Map 16: Cluster Fourteen ................................................................................................................. 54

    Map 17: The Urban Rural Divide .................................................................................................... 56

    Map 18: Theft ................................................................................................................................... 60

    Map 19: Assault ............................................................................................................................... 62

    Map 20: Burglary ............................................................................................................................. 64

    Map 21: Damage to property or the environment ............................................................................ 66

    Map 22: Dangerous Acts .................................................................................................................. 68

    Map 23: Drugs .................................................................................................................................. 70

    Map 24: Fraud .................................................................................................................................. 72

    Map 25: Kidnapping ........................................................................................................................ 74

    Map 26: Public Order ....................................................................................................................... 76

    Map 27: Robbery .............................................................................................................................. 78

    Map 28: Weapons ............................................................................................................................ 80

    Map 29: State, Justice or Organised Crime ...................................................................................... 82

  • vi

    List of Tables

    Table 1: Derived variables used in the classification ....................................................................... 18

    Table 2: Extreme z scores for variables within clusters ................................................................... 25

    Table 3: Sub-districts in Cluster One ............................................................................................... 29

    Table 4: Sub-districts in Cluster Two .............................................................................................. 31

    Table 5: Sub-districts in Cluster Three ............................................................................................ 33

    Table 6: Sub-districts in Cluster Four .............................................................................................. 35

    Table 7: Sub-districts in Cluster Five ............................................................................................... 37

    Table 8: Sub-districts in Cluster Six ................................................................................................ 39

    Table 9: Sub-districts in Cluster Seven ............................................................................................ 41

    Table 10: Sub-districts in Cluster Eight ........................................................................................... 43

    Table 11: Sub-districts in Cluster Nine ............................................................................................ 45

    Table 12: Sub-districts in Cluster Ten .............................................................................................. 47

    Table 13: Sub-districts in Cluster Eleven ......................................................................................... 49

    Table 14: Sub-districts in Cluster Twelve ........................................................................................ 51

    Table 15: Sub-districts in Cluster Thirteen ...................................................................................... 53

    Table 16: Sub-districts in Cluster Fourteen ...................................................................................... 55

  • 1

    Introduction

    This thesis has two main parts; part one will focus on the statistical methods used

    in geodemographics to classify Garda Sub-districts. The thesis will use principal

    components analysis and clustering techniques to create the classification. Part two will

    then consist of a crime atlas of Ireland at Garda Sub-district level; this atlas will also

    contain information regarding the performance of the clustering exercise when used with

    real crime data.

    The census of Ireland provides a vast amount of data at many geographic levels

    that are disseminated by the Central Statistics Office (CSO). For example, the smallest

    geography that the census is reported at is the Small Area (Central Statistics Office,

    2014). There are 18,488 Small Areas in the Republic of Ireland. The census reports 764

    variables for each of the 18,488 Small Areas, which gives 14,124,832 (764 × 18,488)

    individual data points.

    This thesis concentrates on the 563 catchment areas of Garda stations as defined

    by An Garda Síochána and reported in the census as Garda Sub Divisions or Garda

    Sub-districts (Central Statistics Office, 2014). Even though this gives 563 areas to work with,

    that still equates to 430,132 data points, not including crime statistics for the 13 years

    available. Making sense of that amount of data requires them to be summarised so that the

    end user is not overwhelmed.
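
    The data volumes quoted above are simply areas multiplied by variables. A minimal

    check, written in Python purely for illustration (the function and constant names are

    assumptions of this sketch and do not come from the thesis), might look like this:

```python
# Figures quoted in the text (2011 census, Central Statistics Office, 2014).
VARIABLES_PER_AREA = 764
SMALL_AREAS = 18_488
GARDA_SUB_DISTRICTS = 563


def data_points(areas: int, variables: int = VARIABLES_PER_AREA) -> int:
    """Number of individual values in an areas-by-variables table."""
    return areas * variables


small_area_points = data_points(SMALL_AREAS)
sub_district_points = data_points(GARDA_SUB_DISTRICTS)

print(f"Small Areas:         {small_area_points:,}")
print(f"Garda Sub-districts: {sub_district_points:,}")
```

    Working at Sub-district level therefore cuts the table from roughly fourteen million

    values to under half a million before any statistical reduction is applied.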

    Traditionally crime statistics are reported by county or region of the country as

    can be seen in newspapers and websites when reporting on crime; examples include The

    Independent (MacCarthaigh & Phelan, 2014) and the Irish Mirror (Jordan, 2015). While it

    makes intuitive sense to amalgamate Garda stations by county for the purposes of crime

  • 2

    statistics reporting, nuances and spatial variation within counties are lost. Another

    approach could be to report crimes by station; however this may arguably be problematic.

    While reporting by station would show differences within counties and nationally,

    without knowing something of the characteristics of the individual stations, comparing

    them becomes moot. The first law of geography (Tobler, 1970) is generally

    accepted to hold. However, just because two Garda station areas are next to each other

    does not make them comparable; there is little point comparing Ballina in County Mayo

    with Killala, for example. The catchments

    are neighbours, but Ballina is a generally urban environment with 14,329 people at the

    last census in 2011, whereas Killala is a rural, coastal area that is physically larger than

    the Ballina catchment but with only 3,766 people (Central Statistics Office, 2014).

    This thesis advocates comparing stations not only on their location in the An Garda

    Síochána hierarchy, but also on their similarities, specifically the underlying

    geodemographics. Perhaps Killala has more in common with Kilrush in Clare or

    Duncannon in Wexford and comparing it to its peers is a more sensible approach than

    comparing it to its neighbours. That being said, it is expected that the classification

    carried out in this work will show clusters of clusters throughout Ireland in support of

    the first law of geography (Tobler, 1970).

    There is much literature on the area of geodemographic classification, particularly

    for marketing in the United Kingdom and America. Brunsdon et al. (2014) have created a

    geodemographic classification of Ireland at the Small Area scale. Gale et al. (2015) have

    used geodemographics in London relating to crime; however there does not appear to be a

    classification at the national level for Ireland relating to Garda station catchments. It is

  • 3

    felt that this thesis may address a gap in the literature and provide a new tool for

    comparing policing in Ireland.

    The first aim is to create the classification. The steps will be described in detail

    throughout this thesis; however, in short, census data will be used to create the

    classification. Relevant variables will be chosen and transformed for use. Methods of

    reducing the data to manageable proportions will be used and then the data will be

    subjected to a clustering algorithm. The resulting clusters of similar Garda Sub-districts

    will then be named and described. The second part of the thesis will then use these

    clusters to map various crime statistics on a national and cluster-by-cluster basis to allow

    comparison based on similar Garda stations to be made.
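
    The steps outlined above (transform the variables, reduce them with principal

    components analysis, then cluster) can be sketched in a few lines. The following is an

    illustrative Python sketch only, not the code used in the thesis. The six areas and four

    variables are invented, the principal axes are found by simple power iteration, and

    k-means with two clusters is an assumed stand-in, since the paragraph above does not

    name the clustering algorithm.

```python
import math
import random

# Hypothetical toy table: six made-up areas (rows) by four made-up variables
# (% aged under 15, % owner occupied, % employed in farming, persons per km2).
AREAS = [
    [18.0, 45.0, 1.0, 3200.0],
    [17.0, 50.0, 2.0, 2800.0],
    [19.0, 48.0, 1.5, 3000.0],
    [22.0, 80.0, 20.0, 30.0],
    [21.0, 85.0, 25.0, 25.0],
    [23.0, 82.0, 22.0, 40.0],
]


def standardise(rows):
    """Rescale every column to z-scores so that no variable dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def principal_axes(z_rows, n_components, iterations=200):
    """Leading eigenvectors of the correlation matrix, found by power iteration."""
    n, p = len(z_rows), len(z_rows[0])
    corr = [[sum(r[j] * r[k] for r in z_rows) / n for k in range(p)] for j in range(p)]
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    axes = []
    for _ in range(n_components):
        v = [rng.random() for _ in range(p)]
        for _ in range(iterations):
            w = [sum(corr[j][k] * v[k] for k in range(p)) for j in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[j] * corr[j][k] * v[k] for j in range(p) for k in range(p))
        axes.append(v)
        # Deflation: subtract the axis just found so the next pass finds the next one.
        corr = [[corr[j][k] - eigenvalue * v[j] * v[k] for k in range(p)] for j in range(p)]
    return axes


def project(z_rows, axes):
    """Component scores: each area expressed on the retained axes."""
    return [[sum(a * b for a, b in zip(row, axis)) for axis in axes] for row in z_rows]


def k_means(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    step = max(1, len(points) // k)
    centres = [list(points[i * step]) for i in range(k)]  # evenly spaced starting centres
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(pt, centres[c])) for pt in points]
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


z = standardise(AREAS)
scores = project(z, principal_axes(z, n_components=2))
labels = k_means(scores, k=2)
print(labels)
```

    On this toy table the first three (dense, low-farming) areas fall into one cluster and the

    last three into the other. A real classification would need many more variables, a

    reasoned choice of how many components to keep, and a reasoned choice of the

    number of clusters.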

    It is hoped that in creating this classification, policy makers may begin to ask

    questions such as: if this area A has similar characteristics to area B, why are the crime

    rates so different? Are more resources needed to reflect the similar catchment

    demographics? Are the opening hours of a station appropriate given the social

    demographic makeup of the population? It is further hoped that the answers to these

    questions may be found by those with access to more information, such as Garda numbers

    and skill breakdowns, policing and social services infrastructure in areas etc. Lastly it is

    anticipated that this thesis can be built upon by carrying out the same classification with

    up-to-date data once the results of the 2016 census are made available.

  • 4

    Literature Review

    General

    Classification is a natural human process that helps us understand and make sense

    of the world around us. Parker et al. (2007) assert that the natural process of classification

    undertaken by lay people [...] "sociospatial construction

    of reality". Perec (1989), quoted in Boyne (2006), speaks of a dream of universal

    classification, or law, to describe the whole world that did not, and could not, work.

    So it was imagined that the entire world could be distributed according to a unique

    code, that one universal law would reign over the totality of phenomena: two

    hemispheres, five continents, masculine and feminine, animal and vegetable,

    singular plural, right left, four seasons, five senses, six vowels, seven days, twelve

    months, twenty-six letters. (Perec, 1989:155)

    Vickers and Rees (2011) state that complex systems can be classified to help the

    understanding of those systems. However, there is no one right system of classification:

    Dupré (2006) argues that classifications will be driven by the purpose for which they were

    created and that different [...]

    This is a view that is shared by Charlton et al., who produced one of the first open [...]

    "classification is an arbitrary thing" (Charlton, et al., 1985).

    Area classification is the act of grouping areas based on selected features within

    those areas; the similarity of the characteristics of the selected features drives the

    classification (Vickers & Rees, 2007). Geodemographic classifications are a type of area

    classification [...] "one of the most commonly used

    areas classifications" [...]. What can make geodemographic classifications

    useful is the descriptions that generally accompany them to give a textual summary of the

    attributes of each class (Abbas, et al., 2009). Geodemographics are widely used in

  • 5

    marketing [...] "popular segmentation technique" (Doyle, 2011).

    Many authors (Abbas, et al., 2009; Gale, et al., 2015; Singleton & Longley, 2008; Vickers &

    Rees, 2007) describe geodemographic classifications as tools to summarise large sets of

    spatially dependent data such as census data. Gale et al. (2015) state that geodemographic

    classifications allow the highlighting of similarities between population structures in

    different parts of a country. Gale et al. also point out that geodemographic classifications

    give summaries based not only on the population but also on the built environment.

    It would be remiss to discuss geodemographic classification without mentioning

    Charles Booth. Between 1886 and 1903 Booth and several assistants accompanied police

    officers on the beat around London to investigate places of work, working conditions,

    homes and the urban environments. Through interviews and observations Booth created a

    [...] (London School of Economics and Political Science, 2012). [...] (Booth, 1903)

    was "one of the first attempts to map [and classify] social-spatial structures" (Alexiou &

    Singleton, 2015). A portion of the digitised maps along with the classification Booth used

    is shown below in Figure 1.

    Figure 1:

    Source: London School of Economics and Political Science (2012)

  • 6

    Theory

    While geodemographics have a long history of being used in one form or another,

    it must be acknowledged that the theory driving geodemographics is less robust than the

    [...] "everything is related to everything else, but near things are more related

    than distant things" (Tobler, 1970). Singleton and Longley (2009) note that the theoretical

    [...] Alexiou and Singleton (2015),

    who state that classifications based on geodemographics lack solid theory. Another issue

    with many geodemographic classifications is that they are not generally geographically

    weighted, and due to the methods of their construction are aspatial in design despite

    showing spatial correlations in the results (Alexiou & Singleton, 2015). While the issues

    with theoretical grounding are acknowledged, Singleton and Longley express their hopes

    for best practice geodemographics in that they are: focused, recognise the provenance of

    the data used, are scientifically reproducible and use the best methods available

    (Singleton & Longley, 2009).

    Modifiable Areal Unit Problem

    Geodemographic analysis is generally agreed to be best carried out at the smallest

    areal unit available in order not to lose spatial variation that larger units may obscure

    (Alexiou & Singleton (2015), Charlton, et al. (1985) and Gale, et al. (2015) are

    examples). However, the scale also depends on the purpose of the classification (Alexiou

    & Singleton, 2015); therefore a balance must be struck. Another factor to consider is the

    Modifiable Areal Unit Problem (MAUP). Gehlke and Biehl (1934) noted that choices in

    data aggregation over space and the size of areal unit used in analysis have influence over

    the correlation coefficient. This was expanded upon by Openshaw and Taylor (1979) who

  • 7

    coined the phrase Modifiable Areal Unit Problem. They conducted experiments on a

    spatial data set and found that they could obtain correlations of between -0.99 and 0.99 from

    different levels of aggregation. Charlton and Brunsdon (2016) presented a paper at the

    GIS Research UK Conference which revisited the work by Gehlke and Biehl using census

    data from Ireland at several different official aggregation levels ranging from Small Areas

    to Counties. They were able to show that the larger areas lost variance between areas and

    While the MAUP is a factor, there is not much that can be done about it if, for

    example, data is only available at one areal unit scale. If using official areal units one

    must also be aware of the MAUP when comparing results over time, as these boundaries

    may change. An example of this is the Irish Electoral Constituencies which were changed

    by an act of the Irish Parliament (Houses of the Oireachtas, 2013) as required by Article

    16.4 of the Irish Constitution every twelve years (Government Publications, 2016).

    Likewise, and more relevant for this work, are changes to Garda Sub-districts, whose

    boundaries were changed in 2013 following the closure of some 100 Garda stations (An

    Garda Siochana, 2013; Central Statistics Office, 2014). This is an issue noted in the

    United Kingdom in relation to British police Basic Command Units (BCU), where Ashby

    and Longley (2005) Maintaining the BCU families is an arduous task due to the

    Therefore any follow up to this thesis

    should be aware of the possibility of the MAUP affecting results should the boundaries be

    changed by An Garda Síochána.

    Variables

    Selection

    Harris et al. (2005) explain that a geodemographic classification is created by

    grouping areas that are alike into a number of classes, often based on census data. As

    geodemographic classifications are a way of summarising social, demographic and built

    characteristics of zoned geography (Gale, et al., 2015), it makes sense to use census data.

    Vickers and Rees (2007) also argue that a national census (in their case British, but the

    principle holds) stands above other sources due to its amount of data and

    . The choices of variables to use in the classification drive the

    .

    However, they do state that the choices are very difficult to make. Vickers and Rees

    (2007) suggest that variables should be chosen only if there is a good reason; this implies

    that including a variable just because one has the data is not the best policy.

    Correlation

    Variable correlation can be an issue when using census data to inform analysis.

    Collinearity in the data can affect the performance of any significance tests that may be

    carried out on, for example, linear regressions (Anderson, et al., 2010). In classification

    exercises correlation between variables creates redundancy in the input data (Alexiou &

    Singleton, 2015). Two main methods exist for dealing with correlation between the variables,

    although it is acknowledged that there is no general rule (Vickers & Rees, 2007). One

    approach adopted is to remove one of each pair of highly correlated variables from use

    (Alexiou & Singleton, 2015; Vickers & Rees, 2007). Another approach is to use Principal

    Components Analysis (PCA) to transform a set of N correlated variables into a set of n

    uncorrelated principal components. This approach was used by Charlton et al. (1985) and

    Brunsdon et al. (2014). With principal components, each component is a linear

    combination of the parent variables so the variance is retained but the components are

    uncorrelated (Alexiou & Singleton, 2015). Additionally, the first component accounts for

    the most variance and each component adds less to the overall variance explained

    (Jolliffe, 2002). The user can therefore decide how much variance they are willing to

    sacrifice in order to reduce dimensionality in the data by using fewer principal

    components than the number of input variables (Charlton, et al., 1985; Harris, et al., 2005;

    Jolliffe, 2002).
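
    The behaviour described above can be illustrated with a minimal R sketch. It uses

    synthetic data rather than census variables, and every object name in it (vars,

    score_cor, prop_var, cum_var) is chosen purely for illustration. It shows that the

    component scores are uncorrelated and that each successive component explains less

    of the variance than the one before.

```r
# Synthetic stand-in for correlated census percentages (not thesis data).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.6 * rnorm(n)   # built to correlate with x1
x3 <- rnorm(n)
vars <- data.frame(x1, x2, x3)

# PCA on the correlation matrix, so each variable carries equal weight.
pca <- princomp(vars, cor = TRUE, scores = TRUE)

# Component scores are uncorrelated: off-diagonal correlations are ~0.
score_cor    <- cor(pca$scores)
max_off_diag <- max(abs(score_cor[upper.tri(score_cor)]))

# Proportion of variance per component, largest first, and the running total.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)
print(round(cum_var, 3))
```

    Dropping the later components from cum_var is where the trade-off between

    dimensionality and retained variance is made.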

    Units

    Variables used in geodemographics can be reported in different units such as

    percentages of population, count data, indices etc. (Alexiou & Singleton, 2015). This can

    make comparison of variables difficult. The census of Ireland (Central Statistics Office,

    2014) reports most available data as a count of people; therefore it is relatively simple to

    convert any required variable to percentage of population within the spatial unit. This not

    only makes the variables easier to compare, it also stops areas with high population

    figures affecting the analysis due to higher absolute numbers. The variables may also be

    standardised using z scores to allow for true comparison of each individual variable's

    influence on a cluster. The z score is a measure of the relative location in a data set of the

    observation, therefore data points in two different data sets with the same z score have the

    same relative location, i.e. they are the same number of standard deviations from the

    mean (Anderson, et al., 2010).
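
    As a small worked example of both conversions, the sketch below turns a count into a

    percentage of population and then into a z score. The four areas and their counts are

    invented for illustration, as are the column names.

```r
# Hypothetical counts for four areas (illustrative values only).
areas <- data.frame(
  area       = c("A", "B", "C", "D"),
  population = c(400, 2000, 10000, 90000),
  aged_0_4   = c(20, 160, 700, 7200)
)

# Convert the count to a percentage of each area's population.
areas$pct_0_4 <- 100 * areas$aged_0_4 / areas$population

# z score: number of standard deviations each area lies from the mean.
areas$z_0_4 <- (areas$pct_0_4 - mean(areas$pct_0_4)) / sd(areas$pct_0_4)

# scale() gives the same result and works column-wise on many variables.
z_check <- as.numeric(scale(areas$pct_0_4))
print(areas)
```

    Note that area D has by far the largest count but the same percentage as area B, which

    is the effect of population size that the conversion is intended to remove.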

    Clustering

    Methods

    Clustering involves finding subsets of interest within a larger set. The subsets are

    called clusters and are usually homogeneous within each cluster and separated between

    clusters (Hansen & Jaumard, 1997). Gordon (1987) notes that it is

    , Vickers and Rees (2007) maintain that

    there is no right or wrong way to classify. Commercial classifications tend to build from

    the ground up, clustering at the smallest available level then aggregating into larger

    groups (Singleton & Longley, 2008). The open Output Area Classification in the UK,

    however, was clustered from the top down by creating several large clusters that were

    then subjected to clustering techniques separately (Vickers & Rees, 2007). It is widely

    acknowledged among the available literature that k-means clustering is the technique of

    choice for geodemographic clustering. This is shown in either the acknowledgement of

    k-means in theoretical papers or the use of k-means in applied papers (Abbas, et al., 2009;

    Alexiou & Singleton, 2015; Brunsdon, et al., 2014; Charlton, et al., 1985; Vickers &

    Rees, 2007).

    K-means clustering is seen to have something of an advantage over other methods

    such as agglomerative, divisive, constructive or direct optimisation (described well in

    Gordon (1987)). This is because they are all hierarchical in nature and will force a

    hierarchy on the output even if one does not exist (Gordon, 1987). K-means will not force

    number of clusters beforehand (Singleton & Longley, 2008). Clustering techniques

    generally require a measure of dissimilarity between observations (Jolliffe, 2002). K-

    means uses the squared Euclidean distance (Alexiou & Singleton, 2015). In essence k-

    means uses k clusters to sort n observations while minimising the sum of squared errors

    (Alexiou & Singleton, 2015; Ding & He, 2004). K-means assigns each observation to the

    cluster with the nearest mean; a new set of cluster means is then calculated and the

    process begins again. The process stops only when the within cluster sum of squares

    (WCSS) reaches a (local) minimum. This occurs when cluster assignments no longer change, as any

    changes would not make the sum of squares smaller (Alexiou & Singleton, 2015).
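
    The quantity being minimised can be checked directly in R. The sketch below is an

    illustration on synthetic points, not part of the classification itself; the object names

    (pts, fit, wcss_manual) are assumptions made for the example. It fits k-means with the

    base kmeans() function and recomputes the WCSS by hand from the fitted cluster means.

```r
set.seed(42)
# Two well-separated synthetic groups in two dimensions (not thesis data).
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# nstart restarts the algorithm from several random centres and keeps the
# solution with the lowest WCSS, which guards against poor local minima.
fit <- kmeans(pts, centers = 2, nstart = 10)

# Recompute the WCSS by hand: squared Euclidean distance from each point
# to the mean of the cluster it was assigned to, summed over all points.
wcss_manual <- sum((pts - fit$centers[fit$cluster, ])^2)

print(fit$size)
print(c(reported = fit$tot.withinss, manual = wcss_manual))
```

    The two printed WCSS values agree, which confirms that tot.withinss is the sum of

    squared Euclidean distances described above.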

    Cluster Description

    Once the WCSS is minimised and the clusters are assigned, the results need to be

    described. The aim of cluster descriptions is to provide a short profile of each cluster for

    the end user. Vickers and Rees (2007) explain that profiles use text and visuals to help the

    sentences. The cluster labelling and

    description process is acknowledged by Vickers and Rees (2007) to be difficult and

    subject to much thought, in order not to mislead the user or offend the people living in the

    areas classified.

    Cluster descriptions draw on the main identifiable (Debenham, 2002), dominant

    (Abbas, et al., 2009) characteristics of a cluster. Often, the process involves using z scores

    to identify extreme variables within the cluster compared to the global mean (Debenham,

    2002; Vickers & Rees, 2007). The descriptions that are attached to geodemographic

    classifications are viewed as useful to other researchers (Abbas, et al., 2009). Parker et al.

    (2007) most

    sociologically interesting element of the geodemographic classification process.

    Therefore the naming and description element of geodemographic classification should

    not be overlooked or given less attention than the more statistical elements of the process.

    Data

    Three data sets are used throughout this work to create the classification:

    1. Census 2011, all 764 reported census variables in columns, at Garda Sub-district

    level in 563 rows (Central Statistics Office, 2014).

    2. Garda Sub-district boundary files for use in mapping the outcomes (Central

    Statistics Office, 2014a).

    3. Crime data for Ireland at Garda Sub-district level (Central Statistics Office, 2016).

    The crime data are aggregated at the Garda Sub-district level into 12 crime types:

    Attempts/threats to murder, assaults, harassments and related offences.

    Dangerous or negligent acts.

    Kidnapping and related offences.

    Robbery, extortion and hijacking offences.

    Burglary and related offences.

    Theft and related offences.

    Fraud, deception and related offences.

    Controlled drug offences.

    Weapons related offences.

    Damage to property and to the environment.

    Public order and other social code offences.

    Offences against government, justice procedures and organisation of crime.

    The data for this study fall into two main areas: socio-demographic data from the

    Census of Ireland (Central Statistics Office, 2014), and crime data (Central Statistics

    Office, 2016). Both sets of data are reported at the Garda Sub-district level.

    There are 563 Garda Sub-districts in Ireland (Central Statistics Office, 2016). The

    Sub-districts

    Louisburgh in Mayo. They have populations ranging from 384 in Sraith Salach in Galway

    to 98,078 in Blanchardstown in Dublin (Central Statistics Office, 2014).

    These Sub-districts are based loosely on the official geography of Irish

    Townlands, but were designed by the Examiner of Maps (GIS) at An Garda Síochána to

    suit the needs (Creaner, 2016). The Sub-districts are a unique data set in

    that they are designed for operational rather than statistical reasons. It is acknowledged by

    Creaner that using Small Areas would be better statistically. However the use of Small

    a catchment area in the middle of a motorway (for example), as may happen with Small

    Areas (Creaner, 2016). The Garda Sub-districts are shown, grouped into administrative

    divisions, in Map 1 (see footnote 1).

    It is not known exactly how the 2011 census data were attached to the 2013 Garda

    Sub-district geography. The assumption in this thesis is that the CSO would be able to

    populate the new boundaries at the household level. At the time of writing the CSO have

    not replied to my queries; however, the designer of the Sub-districts (Creaner, 2016) has

    agreed that this assumption is a reasonable one. Another option would be to populate the

    Sub-districts by centroid based on a smaller unit such as Small Areas. If this is the case, it

    is not felt that there would be too much loss of overall validity due to the number of Small

    1 All maps produced in this thesis use boundary files from the CSO website and contain Ordnance Survey Ireland Data (Ordnance Survey Ireland, 2012).

    Areas (18,488) being assigned to one of 563 Sub-districts. Aside from this ambiguity,

    there is no known issue with Irish Census Data.

    Map 1: Garda Sub-districts by division

    The census data used in this thesis is made up of 764 variables that are derived

    from household level data and amalgamated to the various levels reported from 18,488

    Small Areas to four Provinces (Central Statistics Office, 2014). The MAUP discussed in

    the literature review section is relevant, as carrying out the classification at a different scale

    will produce different results. It may be possible to carry out the classification at Small

    Area level and then combine these results into larger unit scales. However, this is not

    appropriate for this study. Firstly, there are no smaller units that fit into Sub-districts due

    to the proprietary nature of the Sub-districts. Secondly, all data required are available at

    the Sub-district level. Lastly, Sub-districts are the smallest geographical unit at which crime

    data are released in Ireland (Central Statistics Office, 2016). Therefore it is acknowledged

    that variation within the Garda Sub-districts is lost during this study. However, because

    this classification is for the purpose of being able to compare Garda Stations on a like for

    like basis, it is felt that the loss is acceptable at a national level.

    Building the Classification

    The classification was built in R, a free, open source statistical computing

    environment that can handle large amounts of data (The R Foundation, 2015). Some code

    blocks will be included where needed for clarification. However in the interest of

    reproducibility the full R code that created the classification is included in Appendix 1.

    Variables

    This classification aims to give reproducible results that are comparable to similar

    studies at different scales. Charlton et al. (1985) chose their variables to give a

    comparable classification between their open one and the commercial ACORN

    classification in the UK. Brunsdon et al. (2014) chose Irish Census variables at the Small

    Area Scale to reflect the OAC classification variables chosen by Vickers and Rees (2007).

    Therefore this study will use the same variables as Brunsdon et al. The full list of variable

    codes reported by the Census is available for download from the CSO website (Central

    Statistics Office, 2014). An adapted list is included in Appendix 2 for reference should the

    reader require clarification on any variable codes used.

    The variables chosen for the classification exercise are actually derived variables.

    Each variable is made up of two or more individual census variables to derive variables

    that are percentages of the population of the area in question. For example one variable

    used is that of lone parents. This variable is derived by adding the Lone Mothers with

    Children (number of families) to Lone Fathers with Children (number of families),

    dividing by the total number of families and multiplying the result by 100. The actual

    code is shown below.

    loneParent <- 100 * (T4_3FTLF + T4_3FTLM) / T4_5TF
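
    To show the expression in context, the following sketch evaluates it against a small data

    frame. The three census column codes are those used in the line above; the data frame

    name (census), the GEOGID values and the counts are invented for the example, and the

    full derivation used in the thesis is the one given in Appendix 1.

```r
# Toy rows standing in for the census table; GEOGID values and counts are
# invented. The two T4_3 columns hold the lone-parent family counts and
# T4_5TF holds the total number of families.
census <- data.frame(
  GEOGID   = c("1001", "1002", "1003"),
  T4_3FTLF = c(5, 40, 12),
  T4_3FTLM = c(45, 260, 88),
  T4_5TF   = c(500, 1500, 800)
)

# Evaluate the expression inside the data frame so the columns are found.
census$loneParent <- with(census, 100 * (T4_3FTLF + T4_3FTLM) / T4_5TF)

print(census[, c("GEOGID", "loneParent")])
```

    Each of the other derived variables follows the same pattern of summing the relevant

    counts and dividing by the appropriate base population.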

    In all there are 40 derived variables used in the classification, grouped into six

    areas: demographic, household composition, housing, socioeconomic, employment and

    connectivity. The derived variables are shown in Table 1; the actual make up of each

    derived variable can be seen in the R code in Appendix 1. As stated, the variables from

    the CSO were reported at Garda Sub-district level. This aided in mapping as both the

    census file and the boundary file contained unique geography ID numbers (GEOGID) for

    each Sub-district. The numbers were slightly different in that one set was prefixed with an

    find and

    replace command in Excel before the census file was loaded in R, however it could have

    just as easily been carried out in R.
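
    A sketch of how the same clean-up might be done in R is given below. It is not the

    procedure used in the thesis. The prefix "X", the ID values and the data frame names are

    all placeholders, and the pattern assumes only that the prefix consists of non-digit

    characters at the start of the ID.

```r
# Hypothetical IDs: the boundary file uses bare numbers while the census
# file carries a letter prefix ("X" is a placeholder, not the real prefix).
census   <- data.frame(GEOGID = c("X101", "X102", "X103"),
                       loneParent = c(10, 20, 12.5))
boundary <- data.frame(GEOGID = c("103", "101", "102"),
                       name = c("Area C", "Area A", "Area B"))

# Strip any leading non-digit characters so both files share one key format.
census$GEOGID <- sub("^[^0-9]+", "", census$GEOGID)

# Join on the cleaned key; merge() matches rows regardless of their order.
joined <- merge(boundary, census, by = "GEOGID")
print(joined)
```

    Doing this step in code rather than in a spreadsheet keeps it within the reproducible

    workflow.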

    Theme | Derived Variable | Description
    Demographics | Age0-4 | Percentage of population aged 0-4
    Demographics | Age5-14 | Percentage of population aged 5-14
    Demographics | Age25-44 | Percentage of population aged 25-44
    Demographics | Age45-64 | Percentage of population aged 45-64
    Demographics | Age65+ | Percentage of population aged 65 and over
    Demographics | EUNat | Percentage of population that is European by nationality (excluding Irish)
    Demographics | RestofWorld | Percentage of population where nationality was given as Rest of the World
    Demographics | BornOutsideIRE | Percentage of population not born in Ireland
    Household Composition | Separated | Percentage of persons separated or divorced
    Household Composition | SinglePerson | Percentage of persons (non pensioners) living in one person households
    Household Composition | Pensioner | Percentage of persons who are pensioners
    Household Composition | LoneParent | Percentage of families that are lone parent families
    Household Composition | NoChildren | Percentage of families that are 'pre family' (no children born)
    Household Composition | NonDependChildren | Percentage of families with children where the youngest child is 20+
    Housing | RentPublic | Percentage of total households rented from local authority
    Housing | RentPrivate | Percentage of total households privately rented
    Housing | Flats | Percentage of total households defined as flats
    Housing | NoCentralHeat | Percentage of total households with no central heating
    Housing | RoomsHH | Average number of rooms per household
    Housing | PeoplePerRoom | Total persons / total rooms
    Housing | SepticTank | Percentage of total households with an individual septic tank
    Socioeconomic | HEQual | Percentage of persons with an Ordinary Bachelors Degree or higher
    Socioeconomic | Employed | Percentage of persons at work
    Socioeconomic | TwoCars | Percentage of households with two or more cars
    Socioeconomic | JTWPublic | Percentage of persons over age 5 who travel to school, college or work by means of bus or rail
    Socioeconomic | HomeWork | Percentage of persons self employed (Own account workers)
    Socioeconomic | LLTI | Percentage of persons reporting bad or very bad health
    Socioeconomic | UnpaidCare | Percentage of persons providing unpaid care
    Employment | Students | Percentage of persons who are students
    Employment | Unemployed | Percentage of persons who are unemployed having lost or given up jobs
    Employment | EconinactFam | Percentage of persons looking after home/family - homemakers
    Employment | Agric | Percentage of workers who work in agriculture, forestry or fishing
    Employment | Construction | Percentage of workers who work in construction
    Employment | Manufacturing | Percentage of workers who work in manufacturing
    Employment | Commerce | Percentage of workers who work in commerce and trade
    Employment | Transport | Percentage of workers who work in transport and communication
    Employment | Public | Percentage of workers who work in public administration
    Employment | Professional | Percentage of workers who work in professional services
    Connectivity | Broadband | Percentage of internet connected households with broadband
    Connectivity | Internet | Percentage of total households with some kind of internet access

    Table 1: Derived variables used in the classification

    It can be seen in Table 1 that there will be issues in the data using the derived

    variables as they are. For example Separated can be expected to be correlated with

    LoneParent. As mentioned previously, Principal Components Analysis (PCA) is a set of

    methods that can take in variables that may be correlated and produce a set of

    uncorrelated principal components. PCA is also used to reduce the size of a clustering

    computational problem (Jackson, 1991).

    Principal Components Analysis

    Once the 40 variables were chosen and derived from the census variables they

    were subjected to Principal Components Analysis. The reasons were twofold; firstly to

    reduce the dimensionality of the data from 40 to a more manageable number. Secondly,

    PCA was used to remove any correlation in the data. The cluster algorithm used for the

    classification was k means, which assumes no correlation. Therefore PCA is essential to

    provide k means with a set of uncorrelated variables to carry out the clustering. The

    components are linear combinations of the original variables. Each one contains a

    proportion of the variance in the original data and they are ordered by the amount of

    variance they explain. Therefore it is possible to view the cumulative variance explained

    by the components and decide how many to use in the k means clustering.

    As mentioned previously, there is a lack of theory in this regard; however, the

    majority of the variance should be kept, otherwise the analysis loses too much

    information to make the PCA worth doing. Jolliffe (2002) describes the choice of a cut off

    of variance as an ad hoc rule-of-thumb that works in practice. Jolliffe suggests a range

    from 70% to 90% to retain m components where m is the smallest integer for which the

    cumulative variance explained is greater than the cut off.
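
    The rule can be written as a short R function. The sketch below is illustrative only:

    the proportions of variance are invented values for a six component example, and the

    function name n_components is an assumption made here.

```r
# Illustrative proportions of variance for six components (invented values
# that sum to 1 and decrease, as PCA output would).
prop_var <- c(0.42, 0.21, 0.14, 0.11, 0.07, 0.05)
cum_var  <- cumsum(prop_var)

# Smallest m for which the cumulative variance is greater than the cut-off.
n_components <- function(cum_var, cut_off) {
  which(cum_var > cut_off)[1]
}

m70 <- n_components(cum_var, 0.70)
m80 <- n_components(cum_var, 0.80)
m90 <- n_components(cum_var, 0.90)
print(c(m70 = m70, m80 = m80, m90 = m90))
```

    With these invented values the cut-offs of 70%, 80% and 90% retain three, four and five

    components respectively, showing how sensitive m is to the chosen cut-off.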

    There is a function in R that calculates the Principal Components for a user, called

    princomp(). This function takes in the relevant variables and performs a PCA. It is

    then possible to view the cumulative variance explained by each component to choose

    how many to use in the clustering process. For detailed explanations of Principal

    Components, works by Jackson (1991), Jolliffe (2002), or Rencher and Christensen

    (2012) are recommended. However, the main steps involved are:

    1. Get data: in this case 40 derived variables × 563 Garda Sub-districts

    2. Subtract mean of each variable from each instance of the variable

    3. Calculate correlation matrix

    4. Calculate eigenvalues and eigenvectors of the correlation matrix
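
    The steps listed above can be sketched by hand in R and compared against princomp().

    The example below uses a small synthetic matrix in place of the derived variables, and

    the object names (X, Z, R, eig, scores) are chosen for illustration. It is intended as a

    check on the logic of the steps rather than as the code used for the classification.

```r
set.seed(7)
# Synthetic stand-in for the derived variables: 50 areas by 4 variables.
X <- matrix(rnorm(200), ncol = 4)
X[, 2] <- X[, 1] + 0.5 * X[, 2]   # introduce some correlation

# Step 2: subtract each variable's mean. Dividing by the standard deviation
# as well is what makes the next step a correlation (not covariance) matrix.
Z <- scale(X, center = TRUE, scale = TRUE)

# Step 3: the correlation matrix of the variables.
R <- cor(X)

# Step 4: eigenvalues (variance of each component) and eigenvectors
# (the loadings that define each linear combination).
eig <- eigen(R, symmetric = TRUE)

# Component scores are the standardised data projected onto the eigenvectors.
scores <- Z %*% eig$vectors

# The eigenvalues should agree with princomp() run on the same data.
pca <- princomp(X, cor = TRUE)
print(rbind(manual = eig$values, princomp = unname(pca$sdev^2)))
```

    The eigenvalues sum to the number of variables, so dividing each by that total gives

    the proportion of variance explained by its component.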

    The Principal Components were calculated and their cumulative explanation of the

    variance of the original derived variables displayed by entering the following two lines of

    code in to R.

    pca <- princomp(gardaVars[, -1], cor = TRUE, scores = TRUE)

    cumsum(pca$sdev^2 / sum(pca$sdev^2))

    The cumulative variance explained by the components is shown in Figure 2.

    Figure 2: Cumulative variance explained by components 1:40

    As per general recommendations mentioned earlier, this study will use the first m

    components that total to at least 80% of the variance. This means that the first nine

    components will be used in the study, as they account for 80.65% of the variance. This

    was seen as a good cut off as the other 31 components only accounted for 19.35% of the

    variance in the original data set between them. Another reason for not including the tenth

    component is that it is the first component that fails a test that suggests each component

    should contribute more than of the cumulative variance (Jolliffe, 2002). As p is 40 and

    , the fact that component ten only explains an extra 1.64% of the variance

    excludes its use in the classification process.

    Clustering

    The work up to this point has concentrated on getting the data ready for the

    clustering process. The nine principal components represent a much smaller data set than

    the 40 derived variables and they are also uncorrelated. This means that they are ready for

    use in a clustering algorithm. The method of clustering used in this thesis is the k means

    technique as described in the literature review. The k means method was chosen as it is

    the system of choice in most geodemographic classifications (Abbas, et al., 2009; Alexiou

    & Singleton, 2015; Brunsdon, et al., 2014; Charlton, et al., 1985; Vickers & Rees, 2007).

    K means requires the number of clusters to be known before it sets out to minimise the

    within cluster sum of squares. However, this is not an issue in R. It is possible to run k

    means through a loop in R. In this way the clustering exercise can be run many

    times with different possible numbers of clusters, and the user can pick the best number

    for k based on the results. In order to pick the best number for k, the k means process was

    run 100 times with k starting at one and being increased by one each time. The results

    were then plotted on a scree plot shown in Figure 3. The code to run through this loop is

    shown below.

    nPC 14) would not add to

    the classification. In addition, splitting the clusters into smaller units may have created

    clusters that were too nuanced for the purpose of comparing Garda Sub-districts. The final

    step in the clustering process was to join the cluster numbers to the GEOGID numbers of

    individual Garda stations so that the results could be mapped.
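
    A loop of the kind described in this section might look like the sketch below. It is not

    the thesis code, which is given in Appendix 1. Random numbers stand in for the nine

    retained component scores, the GEOGID values are invented, k is limited to 20 to keep

    the run short, and the names pc_scores, wcss and lookup are assumptions made for the

    example.

```r
set.seed(2016)
# Stand-in for the retained component scores (e.g. pca$scores[, 1:9]);
# random values are used here so the sketch runs on its own.
pc_scores <- matrix(rnorm(563 * 9), ncol = 9)

# Run k-means for each candidate k and record the total WCSS.
# The thesis ran k from 1 to 100; a shorter range keeps this example quick.
k_max <- 20
wcss  <- numeric(k_max)
for (k in 1:k_max) {
  wcss[k] <- kmeans(pc_scores, centers = k, nstart = 5,
                    iter.max = 50)$tot.withinss
}

# Fit the chosen solution and attach cluster numbers to area identifiers
# (hypothetical GEOGID values) ready to join to a boundary file.
final  <- kmeans(pc_scores, centers = 14, nstart = 5, iter.max = 50)
lookup <- data.frame(GEOGID = sprintf("%04d", 1:563),
                     cluster = final$cluster)
head(lookup)
```

    Calling plot(1:k_max, wcss, type = "b") on the result draws the scree plot from which

    a value of k is chosen.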

    It is possible to cluster a set of clusters in order to create higher order super groups

    for description purposes. However with only 14 clusters and 563 areas it was felt that a

    second level of clustering was unnecessary for this classification.

    Figure 3: Scree plot of (WCSS) for k = [1:100]

    Figure 4: Heatmap of clusters where k=14

    Cluster Description and Naming

    With the clusters set, they needed to be named and described. For some (Abbas, et

    al., 2009; Parker, et al., 2007) the descriptions attached to clustering exercises such as this

    one are the most interesting element of the final geodemographic classification process.

    Vickers and Rees (2011) state that cluster names may be the primary source of

    information used when judging a cluster in a classification by the end user.

    Naming and describing the clusters was a multi-stage process. All of the clusters

    were mapped in order to check that the classification seemed spatially sensible. Then the z

    scores of the original derived variables were calculated for each cluster. The mean z score

    across all clusters was also calculated for each variable (μ), as was the standard deviation

    (σ). Clusters where μ - σ > z were deemed to have an extremely low z

    score for that variable. Clusters where z > μ + σ were deemed to have an extremely high z

    score for that variable. These extreme highs or lows accounted for 25% of the variables.

    The extreme values informed the name attached to the clusters as they were

    deemed to be dominant and identifiable characteristics of the cluster in question (Abbas,

    et al., 2009; Debenham, 2002; Vickers & Rees, 2007). The extremes identified

    are reproduced in Table 2. In addition to the extreme z scores, the scores that were above or

    below average for that variable informed more detail in the descriptions where necessary.
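
    The flagging of extreme values can be sketched in R as follows. The example uses

    synthetic z scores and invented cluster labels, and it assumes a cut-off of one standard

    deviation either side of the mean of the cluster scores; that multiplier, like the object

    names, is an assumption of this sketch rather than a detail confirmed above.

```r
set.seed(3)
# Synthetic z scores for 60 areas on 3 variables, with invented cluster labels.
z <- as.data.frame(scale(matrix(rnorm(180), ncol = 3,
                                dimnames = list(NULL, c("v1", "v2", "v3")))))
cluster <- rep(1:6, each = 10)

# Mean z score of each variable within each cluster (clusters in rows).
cl_means <- as.matrix(
  aggregate(z, by = list(cluster = cluster), FUN = mean)[, -1]
)

# For each variable: mean and standard deviation of the cluster scores.
mu    <- colMeans(cl_means)
sigma <- apply(cl_means, 2, sd)

# Flag cluster scores more than one standard deviation from the mean.
# The one-sigma multiplier is an assumption of this sketch.
low   <- sweep(cl_means, 2, mu - sigma, FUN = "<")
high  <- sweep(cl_means, 2, mu + sigma, FUN = ">")
flags <- ifelse(high, "high", ifelse(low, "low", ""))
print(flags)
```

    The resulting matrix has one row per cluster and one column per variable, which is the

    layout a table of extremes would take.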

    For the assistance of the end user a radar plot for each cluster is included in the

    cluster descriptions. This plot shows the global average z score for each variable in red

    and the z score for the cluster in blue. It also shows the extreme value cut offs in green

    and purple.

    Table 2: Extreme z scores for variables within clusters