historic postcode directories

Post on 13-Apr-2017

  • Historic Postcode Directories - Progress and Plans

    Postcode GeoReferencing User Group, 5th April

    James Crone, EDINA.

  • Overview

    About EDINA
    Project Background and Context
    Progress To Date
    Plans for coming months
    Outstanding Issues

  • EDINA

    A JISC funded national data centre based at Edinburgh University Data Library.

    Provides the UK tertiary education and research community online access to a library of data, information and research resources.

    The largest section (Geo Data Services), comprising GIS specialists and software engineers, provides access to two key online services: Digimap & UKBORDERS.

    We and our user community have an interest in both contemporary and historical postcode products.

  • Background & Context

    What are the historical postcode directories? - datasets which list all unit postcodes within the UK and assign to each a national grid reference, geographic lookups and counts of assigned addresses.

    ESRC has purchased Gridlinked versions of AFPD (2001-2006) for use by the academic community. This community also has an interest in historic versions of the AFPD and thus ONS supplied to ESRC historic postcode directories (1980-2000) for free on the basis that ESRC would QA the historic versions.

    All versions of the postcode directories received by ESRC have been available to users through the EDINA UKBORDERS service since October 2004.

    Steady stream of user downloads. Data for census years most popular but interestingly significant interest in non-census years.

  • Deliverables

    Objectives/Deliverables of the QA set out formally in the August 2004 MOU between ESRC & ONS:

    Key Deliverable is a Quality Controlled postcode instance database spanning 1980 to present day. From this ESRC will derive snapshot historical versions of the postcode directories replacing the versions of unknown quality that are currently in existence.

    Postcode Instance - defined as the existence of a postcode for a certain period of time which is unique on both postcode label and date of introduction.

    Postcode Instance = Postcode Label + Date of Introduction

    The instance database will have a number of fields: Date of Introduction (DOI), Date of Termination (DOT), most recent easting & northing, and higher geography lookups (1991 ED/OA; 1998 Ward; 2001 OA).
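    A minimal sketch of how one row of such an instance table might be modelled, assuming Python and illustrative field names (the class, its attributes and the sample postcode are hypothetical, not the project's schema; the higher geography lookups are omitted for brevity):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class PostcodeInstance:
    """One row of a hypothetical instance table (names are illustrative)."""
    postcode: str                   # postcode label
    doi: date                       # Date of Introduction
    dot: Optional[date] = None      # Date of Termination; None while live
    easting: Optional[int] = None   # most recent grid reference
    northing: Optional[int] = None

    @property
    def key(self) -> Tuple[str, date]:
        # An instance is unique on postcode label + date of introduction.
        return (self.postcode, self.doi)

    @property
    def is_live(self) -> bool:
        return self.dot is None


first = PostcodeInstance("AB1 2CD", date(1980, 1, 1), dot=date(1990, 1, 1))
reused = PostcodeInstance("AB1 2CD", date(1995, 6, 1))
```

    Here the same label yields two distinct instances because their dates of introduction differ.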

    The ONS Ward History Database will be used to check the veracity of ward codes within the historic versions of the postcode directories.

  • Progress to Date

    4 sequential work phases to complete these objectives:

    I. Data Loading (complete)
    II. Quality Assurance I - Audit (complete)
    III. Quality Assurance II - Verification (in progress)
    IV. Production of Historic Snapshots

    At this point first 2 of these are complete and we are currently engaged in the verification phase.

    ... Taking each phase in turn

  • Phase I Data Loading

    Postcode directories were supplied by ONS from 1980 to present day.

    Origin of data varies:

    Central Postcode Directories: 1980 - 1990 (except 1989)
    AFPDs: 1991 - 1998 (except 1996 & 1997)
    NHSPD: 1996 & 1997
    AFPD (NHS Variant): 1999
    AFPD (Gridlink version): 2000
    + Gridlink versions of AFPD from 2001 to current release.

    With the exception of 1989, this is a complete set, which is quite remarkable given that digital curation & preservation are a fairly recent concern.

  • Phase I Data Loading

    We took each historic version, loaded it into its own database table (the database used is PostgreSQL) & then merged each year's table into a super table giving all postcodes from all versions of the AFPD.

    Given the differing origins of the year tables and the tendency for number of attributes to increase over time, the harmonisation of these snapshots itself was an "interesting" data management challenge. For practical purposes fields were distilled down to a core set.

    The super table was reduced to a table with distinct postcode labels (giving the labels of all postcodes since 1980) and then to the more valuable postcode instance table.

    Composite merged table - 50,986,078 rows
    Distinct postcode unit table - 2,330,886 rows
    Postcode Instance table - 2,763,839 rows
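    The two reductions described above can be sketched as follows. This is an illustration in Python rather than the PostgreSQL actually used, and the row layout, function names and sample data are assumptions:

```python
from datetime import date

# Hypothetical merged rows: (directory year, postcode label, date of introduction)
merged = [
    (1980, "AB1 2CD", date(1973, 1, 1)),
    (1981, "AB1 2CD", date(1973, 1, 1)),
    (1995, "AB1 2CD", date(1994, 6, 1)),
    (1995, "ZZ9 9ZZ", date(1990, 3, 1)),
]


def distinct_labels(rows):
    """All postcode labels that appear in any directory version."""
    return {label for _, label, _ in rows}


def distinct_instances(rows):
    """All (label, date of introduction) pairs, i.e. postcode instances."""
    return {(label, doi) for _, label, doi in rows}


labels = distinct_labels(merged)
instances = distinct_instances(merged)
```

    A label that recurs with a new date of introduction adds an instance but not a label, which is why the instance table has more rows than the distinct postcode table.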

  • Phase I Data Loading

    By itself Date of Introduction only tells us when a postcode was instantised. In order to be able to examine the lifecycle of each instance we also need to know if this instance has been terminated or is still live.

    To each instance we attempted to add a Date Of Termination (DOT) by searching through each of the historic AFPD version tables and determining if the instance was terminated. Not a trivial task given volumes of data and number of searches required.

    At the same time each instance also had its latest grid reference associated with it.

    Instance database is therefore quite rich as it holds both the temporal and spatial history for the instances associated with a postcode.
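    One way to picture the search for termination dates is sketched below. It assumes, purely for illustration, that each yearly table can be read as a mapping from instance to the termination date recorded that year (or nothing if none is recorded); the real tables and search logic may differ:

```python
from datetime import date
from typing import Dict, Optional, Set, Tuple

InstanceKey = Tuple[str, date]                 # (postcode label, DOI)
Snapshot = Dict[InstanceKey, Optional[date]]   # instance -> DOT seen that year


def collect_dots(snapshots: Dict[int, Snapshot]) -> Dict[InstanceKey, Set[date]]:
    """Gather every distinct termination date reported for each instance."""
    found: Dict[InstanceKey, Set[date]] = {}
    for year in sorted(snapshots):
        for key, dot in snapshots[year].items():
            dots = found.setdefault(key, set())
            if dot is not None:
                dots.add(dot)
    return found


key = ("AB1 2CD", date(1980, 1, 1))
snapshots = {
    1985: {key: None},               # still live in the 1985 table
    1991: {key: date(1990, 6, 1)},   # reported as terminated in 1991
}
dots = collect_dots(snapshots)
```

    Under these assumptions an empty set means the instance is live in every version, a single date gives its DOT, and more than one date signals conflicting terminations.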

  • Phase II Quality Assurance (Audit)

    Rationale for Quality Assurance: the quality of the instance database will be propagated to derived products. It is therefore essential that we understand which instances are genuine, which can be regarded as spurious, and which may need to be fixed or weeded out.

    First step: analysis of the frequency of instances associated with distinct postcodes.

    Frequency of instances associated with distinct postcodes:

    Num of postcode instances : Frequency
    1 : 2,379,140
    2 : 343,995
    3 : 34,986
    4 : 4,839
    5 : 571
    6 : 85
    7 : 27
    8 : 26
    9 : 138
    10 : 18
    11 : 8
    12 : 2
    13 : 4

    Straight away we can see that in some cases distinct postcodes have multiple instances associated with them.

  • Phase II Quality Assurance (Audit)

    The majority of postcodes are represented by only a single instance, but a significant number have multiple instances associated with them. Why?

    Genuine Postcode Recycling

    Spurious Instances due to imputation problems or systematic table-wide update procedures in past versions (e.g. an update for all Scottish 1973 instances in the 1980 table).

    Expected vs. Divergent Cases.

  • Phase II Quality Assurance (Audit)

    Programmatic tests were designed to flag cases in the Instance database which diverged from what we expected.

    We do this by taking each postcode in turn and examining the timelines associated with its instances. Errors are grouped into 3 types:

    Type I - in which the DOI = DOT (the instance is instantised & terminated at the same point in time)

    Type II - (A) in which all instances of the postcode are live, or (B) in which there are other inconsistencies within the timeline, such as blank dates of termination within a sequence of instances.

    Type III - multiple dates of termination - postcode instantised once but has multiple dates of termination

    The names of these errors are a convenience, not to be confused with Type I/II errors in statistics!
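    A rough sketch of what such programmatic tests could look like follows. The rules are one reading of the three definitions above, not the project's actual tests; the record layout and function name are assumptions, and Type II is only considered when a postcode has more than one instance:

```python
from datetime import date
from typing import Dict, List, Optional, Set, Tuple

Record = Tuple[date, Optional[date]]  # (DOI, DOT); DOT is None when blank


def flag_timeline(records: List[Record]) -> Set[str]:
    """Return the error types raised by one postcode's instance records."""
    flags: Set[str] = set()

    # Type I: introduced and terminated at the same point in time.
    if any(dot is not None and dot == doi for doi, dot in records):
        flags.add("I")

    # Type III: a single DOI carrying more than one distinct DOT.
    dots_by_doi: Dict[date, Set[date]] = {}
    for doi, dot in records:
        if dot is not None:
            dots_by_doi.setdefault(doi, set()).add(dot)
    if any(len(dots) > 1 for dots in dots_by_doi.values()):
        flags.add("III")

    # Type II: only meaningful when the postcode has several instances.
    distinct_dois = {doi for doi, _ in records}
    if len(distinct_dois) > 1:
        latest = max(distinct_dois)
        if all(dot is None for _, dot in records):
            flags.add("II.A")  # every instance is still live
        elif any(dot is None and doi < latest for doi, dot in records):
            flags.add("II.B")  # blank DOT before the end of the sequence
    return flags


d80, d85, d90 = date(1980, 1, 1), date(1985, 1, 1), date(1990, 1, 1)
clean = flag_timeline([(d80, d85), (d90, None)])
```

    A timeline in which each instance is terminated before the next is introduced, with only the latest still live, raises no flags.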

  • Phase II Quality Assurance (Audit)

    [Bar chart: Count by Spurious Instance Type. I: 3,558; II.A: 347,828; II.B: 206,001; III: 44,480]

  • Phase II Quality Assurance (Audit)

    As we can see the Type II error cases represent the bulk of the errors so effort has been directed at identifying different varieties of this type of error. We will spend a few minutes examining two such examples now.

  • Phase II Quality Assurance (Audit)

    Case A

    6 instances, none with a date of termination - the conflict arises immediately after the first instance.

    Is it valid for there to be so many postcodes which have multiple live instances?

    Are all of these cases a result of postcode recycling or are they in fact due to inconsistencies within the dataset itself?

  • Phase II Quality Assurance (Audit)

    Case B

    Again we have 6 instances - this time there is a blank date of termination within the timeline (which conflicts with the latter 2 instances)

  • Phase II Quality Assurance (Audit)

    Why are these a problem? When we create the historic cuts we don't want any ambiguity. We need to be sure:

    that all live postcodes are truly live (and should not have been terminated).

    that where a postcode has multiple instances associated with it, these are genuine and not a result of problems with how the data was created or updated.

    that all data is as consistent as possible.

    How to reconcile these Spurious cases?

  • Phase III QA - Verification

    Type I errors - unclear. We can't see any logic behind these, which raises the question: is it valid for an instance to be introduced and terminated in the same month?

    Type II errors - problem less clear cut as we have already seen - different species of the same problem causing instances to diverge from the expected norm.

    Type III errors - multiple dates of termination - As a rule, pick either the earliest OR latest and apply to all cases
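    The Type III rule can be sketched as a small helper. The function name and the `prefer` parameter are hypothetical, and the choice between earliest and latest is left open here, as it is on the slide:

```python
from datetime import date
from typing import Iterable


def resolve_dot(dots: Iterable[date], prefer: str = "earliest") -> date:
    """Pick one termination date from several conflicting ones."""
    candidates = list(dots)
    if not candidates:
        raise ValueError("no termination dates to choose from")
    if prefer == "earliest":
        return min(candidates)
    if prefer == "latest":
        return max(candidates)
    raise ValueError("prefer must be 'earliest' or 'latest'")


conflicting = [date(1990, 6, 1), date(1988, 2, 1), date(1991, 1, 1)]
chosen = resolve_dot(conflicting)
```

    Whichever option is chosen, the point of the rule is that it is applied uniformly to all Type III cases.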

    The rest of the presentation is mainly concerned with dealing with the Type II errors.

    Key assumption: the instance database holds information about the location of each instance in space and time. Instances which are similar in both these respects can be merged.

  • Phase III QA - Verification

    Time - According to Royal Mail:

    A postcode is only supposed to be reused after a minimum period of 3 years has elapsed & residential postcodes are never reused.
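    As an illustration of how the three-year rule might be turned into a test, the sketch below flags consecutive instances of one postcode whose dates of introduction fall less than 36 months apart. Working at month precision, comparing introduction dates, and all names used are assumptions:

```python
from datetime import date
from typing import List, Tuple

MIN_REUSE_MONTHS = 36  # the three-year minimum quoted from Royal Mail


def months_between(a: date, b: date) -> int:
    """Whole calendar months from a to b, ignoring the day of the month."""
    return (b.year - a.year) * 12 + (b.month - a.month)


def suspicious_pairs(dois: List[date]) -> List[Tuple[date, date]]:
    """Consecutive introduction dates closer together than the minimum."""
    ordered = sorted(dois)
    return [
        (a, b)
        for a, b in zip(ordered, ordered[1:])
        if months_between(a, b) < MIN_REUSE_MONTHS
    ]


dois = [date(1990, 1, 1), date(1980, 1, 1), date(1981, 6, 1)]
pairs = suspicious_pairs(dois)
```

    Pairs flagged this way are candidates for closer inspection rather than proof of an error.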

    On this basis where we have 2 instances which are instantised within less than 3 years of one anoth