historic postcode directories

Historic Postcode Directories - Progress and Plans

Postcode GeoReferencing User Group, 5th April

James Crone, EDINA.

Overview

• About EDINA

• Project Background and Context

• Progress To Date

• Plans for coming months

• Outstanding Issues

EDINA

• A JISC funded national data centre based at Edinburgh University Data Library.

• Provides the UK tertiary education and research community online access to a library of data, information and research resources.

• The largest section, Geo Data Services, is made up of GIS specialists and software engineers and provides access to 2 key online services: Digimap & UKBORDERS.

• We and our user community have an interest in both contemporary and historical postcode products.

Background & Context

• What are the historical postcode directories? - datasets which list all unit postcodes within the UK and assign to each a national grid reference, geographic lookups and a count of assigned addresses.

• ESRC has purchased Gridlinked versions of AFPD (2001-2006) for use by the academic community. This community also has an interest in historic versions of the AFPD and thus ONS supplied to ESRC historic postcode directories (1980-2000) for free on the basis that ESRC would QA the historic versions.

• All versions of the postcode directories received by ESRC have been available to users through the EDINA UKBORDERS service since October 2004.

• Steady stream of user downloads. Data for census years most popular but interestingly significant interest in non-census years.

Deliverables

• Objectives/Deliverables of the QA set out formally in August 2004 MOU between ESRC & ONS:

• Key Deliverable is a Quality Controlled postcode instance database spanning 1980 to present day. From this ESRC will derive snapshot historical versions of the postcode directories replacing the versions of unknown quality that are currently in existence.

• Postcode Instance - defined as the existence of a postcode for a certain period of time which is unique on both postcode label and date of introduction.

• Postcode Instance = Postcode Label + Date of Introduction

• Instance db will have number of fields – DOI, DOT, most recent easting & northing and higher geography lookups (1991 ED/OA; 1998 Ward; 2001 OA).

• The ONS Ward History Database will be used to check the veracity of ward codes within the historic versions of the postcode directories.
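
The instance definition above can be sketched as a small record type. This is only an illustration, not the project's actual schema: the class name, field names and example labels are assumptions, and the higher geography lookups are left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class PostcodeInstance:
    """Hypothetical record for one postcode instance (names are illustrative)."""
    label: str                      # postcode label
    doi: date                       # Date of Introduction (DOI)
    dot: Optional[date] = None      # Date of Termination (DOT); None = still live
    easting: Optional[int] = None   # most recent easting
    northing: Optional[int] = None  # most recent northing

    @property
    def key(self) -> Tuple[str, date]:
        # An instance is unique on postcode label plus date of introduction.
        return (self.label, self.doi)

    @property
    def is_live(self) -> bool:
        return self.dot is None


first = PostcodeInstance("AB1 2CD", date(1980, 1, 1), dot=date(1990, 6, 1))
reuse = PostcodeInstance("AB1 2CD", date(1995, 3, 1))
```

Here the same label appears twice with different dates of introduction, so the two records are distinct instances of one postcode.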

Progress to Date

• 4 sequential work phases to complete these objectives:

• I. Data Loading (complete)

• II. Quality Assurance I - Audit (complete)

• III. Quality Assurance II - Verification (in progress)

• IV. Production of Historic Snapshots

• At this point the first 2 of these are complete and we are currently engaged in the verification phase.

• ... Taking each phase in turn

Phase I – Data Loading

• Postcode directories were supplied by ONS from 1980 to present day.

• Origin of data varies:

• Central Postcode Directories: 1980 - 1990 (except 1989)

• AFPDs: 1991 - 1998 (except 1996 & 1997)

• NHSPD: 1996 & 1997

• AFPD (NHS Variant): 1999

• AFPD (Gridlink version): 2000

• + Gridlink versions of AFPD from 2001 to current release.

• With the exception of 1989, this is a complete set, which is quite remarkable given that digital curation & preservation are fairly recent concerns.


• We took each historic version, loaded it into its own database table (the database used is PostgreSQL) & then merged each year's table into a super table giving all postcodes from all versions of the AFPD.

• Given the differing origins of the year tables and the tendency for number of attributes to increase over time, the harmonisation of these snapshots itself was an "interesting" data management challenge. For practical purposes fields were distilled down to a core set.

• The super table was reduced to a table of distinct postcode labels (giving the labels of all postcodes since 1980) and then to the more valuable postcode instance table.

• Composite merged table - 50,986,078 rows

• Distinct postcode unit table - 2,330,886 rows

• Postcode Instance table - 2,763,839 rows
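
The reduction from the merged table to the two derived tables can be sketched in a few lines. The work itself was done in PostgreSQL; the Python below only illustrates the logic, and the row layout and the key names "year", "postcode" and "doi" are assumptions.

```python
from datetime import date


def distinct_labels(rows):
    """All postcode labels that appear in any yearly version."""
    return {row["postcode"] for row in rows}


def distinct_instances(rows):
    """All (label, date of introduction) pairs that appear in any version."""
    return {(row["postcode"], row["doi"]) for row in rows}


# Tiny stand-in for the composite merged table: one row per postcode per year.
merged = [
    {"year": 1980, "postcode": "AB1 2CD", "doi": date(1980, 1, 1)},
    {"year": 1981, "postcode": "AB1 2CD", "doi": date(1980, 1, 1)},
    {"year": 1995, "postcode": "AB1 2CD", "doi": date(1995, 3, 1)},
    {"year": 1995, "postcode": "EF3 4GH", "doi": date(1992, 7, 1)},
]

labels = distinct_labels(merged)
instances = distinct_instances(merged)
```

In this toy input four merged rows reduce to two distinct labels and three instances, mirroring the way the instance table is larger than the distinct postcode table.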


• By itself Date of Introduction only tells us when a postcode was instantised. In order to be able to examine the lifecycle of each instance we also need to know if this instance has been terminated or is still live.

• To each instance we attempted to add a Date Of Termination (DOT) by searching through each of the historic AFPD version tables and determining if the instance was terminated. Not a trivial task given volumes of data and number of searches required.

• At the same time each instance also had its latest grid reference associated with it.

• Instance database is therefore quite rich as it holds both the temporal and spatial history for the instances associated with a postcode.

Phase II – Quality Assurance (Audit)

• Rationale for Quality Assurance – The quality of the instance database will be propagated to derived products. It is therefore essential that we understand which instances are genuine and which are spurious and may need to be fixed or weeded out.

• First Step – Analysis of the frequency of instances associated with distinct postcodes.

• Frequency of instances associated with distinct postcodes:

Num of postcode instances : Frequency

1 : 2,379,140

2 : 343,995

3 : 34,986

4 : 4,839

5 : 571

6 : 85

7 : 27

8 : 26

9 : 138

10 : 18

11 : 8

12 : 2

13 : 4

• Straight away we can see that in some cases distinct postcodes have multiple instances associated with them.
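
A frequency analysis of this kind can be sketched as below. The slide does not say exactly how its table was tabulated, so this is one plausible reading (how many postcodes have 1, 2, 3, ... instances), and the input layout is an assumption.

```python
from collections import Counter


def instance_frequency(instances):
    """instances: iterable of (label, doi) pairs, one per distinct instance.

    Returns a Counter mapping "number of instances" to "number of postcodes
    that have that many instances".
    """
    per_postcode = Counter(label for label, _ in instances)
    return Counter(per_postcode.values())


# Illustrative input; the second element stands in for a date of introduction.
sample = [("A", 1980), ("A", 1995), ("B", 1980),
          ("C", 1980), ("C", 1985), ("C", 1999)]
freq = instance_frequency(sample)
```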


• Majority of postcodes represented by only a single instance. But significant number of postcodes have multiple instances associated with them – why?

• Genuine Postcode Recycling

• Spurious Instances due to imputation problems or systematic table-wide update procedures in past versions (e.g. an update for all Scottish 1973 instances in the 1980 table).

• Expected vs. Divergent Cases.


• Programmatic tests were designed to flag cases in the Instance database which diverged from what we expected.

• Do this by taking each postcode in turn and examining the timelines associated with its instances. Errors grouped into 3 types:

• Type I - in which the DOI = DOT (the instance is instantised & terminated at the same point in time)

• Type II – (A) in which all instances of the postcode are live or (B) there are other inconsistencies within the timeline such as blank dates of termination within a sequence of instances.

• Type III - multiple dates of termination - postcode instantised once but has multiple dates of termination

The names of these error types are a convenience – not to be confused with Type I/II errors in Statistics!
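
A programmatic test along these lines might look like the sketch below. It is a reading of the three definitions above rather than the actual test code: the function name, the input layout, and the choice to treat a single-instance postcode as never Type II are all assumptions.

```python
from datetime import date


def classify_postcode(records):
    """records: list of (doi, dot) pairs for one postcode; dot None = blank.

    Returns the set of error types flagged for this postcode.
    """
    flags = set()

    # Type I: introduced and terminated at the same point in time.
    if any(dot is not None and doi == dot for doi, dot in records):
        flags.add("I")

    # Type III: one date of introduction with several dates of termination.
    dots_by_doi = {}
    for doi, dot in records:
        if dot is not None:
            dots_by_doi.setdefault(doi, set()).add(dot)
    if any(len(dots) > 1 for dots in dots_by_doi.values()):
        flags.add("III")

    # Type II: look at the timeline of distinct dates of introduction.
    timeline = sorted({doi for doi, _ in records})
    if len(timeline) > 1:
        live = [doi not in dots_by_doi for doi in timeline]
        if all(live):
            flags.add("II.A")   # every instance is live
        elif any(live[:-1]):
            flags.add("II.B")   # blank DOT before a later instance
    return flags


d = lambda year: date(year, 1, 1)  # shorthand for the examples
```

For example, `classify_postcode([(d(1980), None), (d(1990), None)])` flags II.A, while a terminated instance followed by a single live one is the expected case and flags nothing.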

[Figure: bar chart of Count (axis 50,000 to 400,000) by Spurious Instance Type (I, II.A, II.B, III); values shown: 3558, 347828, 206001, 44480]



• As we can see the Type II error cases represent the bulk of the errors so effort has been directed at identifying different varieties of this type of error. We will spend a few minutes examining two such examples now.


• Case A

• 6 instances, none with a date of termination - conflict immediately after the first case.

• Is it valid for there to be so many postcodes which have multiple live instances?

• Are all of these cases a result of postcode recycling or are they in fact due to inconsistencies within the dataset itself?


• Case B

• Again we have 6 instances - this time there is a blank date of termination within the timeline (which conflicts with the latter 2 instances)


• Why are these a problem? - when we create the historic cuts we don't want any ambiguity. We:

• need to be sure that all live postcodes are truly live (and should not have been terminated).

• that where a postcode has multiple instances associated with it, these are genuine and not a result of problems with how the data was created or updated.

• that all data is as consistent as possible.

• How to reconcile these Spurious cases?

Phase III – QA - Verification

• Type I errors - unclear - we can't see any logic behind these, so we ask: is it valid for an instance to be introduced and terminated in the same month?

• Type II errors - problem less clear cut as we have already seen - different species of the same problem causing instances to diverge from the expected norm.

• Type III errors - multiple dates of termination - As a rule, pick either the earliest OR latest and apply to all cases

• The rest of the presentation is mainly concerned with dealing with the Type II errors.

• Key Assumption – Instance database holds information about the location of each instance in space and time. Instances which are similar in both these respects can be merged.


• Time - According to Royal Mail:

• A postcode is only supposed to be reused after a minimum period of 3 years has elapsed & residential postcodes are never reused.

• On this basis where we have 2 instances which are instantised within less than 3 years of one another we can assume that they are referring to the same thing.
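
The temporal rule can be sketched as a grouping of dates of introduction. This is a minimal illustration under stated assumptions: dates are compared at month resolution, each instance is compared with the one immediately before it (so merges can chain), and the function names are hypothetical.

```python
from datetime import date

REUSE_GAP_MONTHS = 36  # the 3 year minimum reuse period quoted above


def months_between(earlier, later):
    """Whole-month difference between two dates, ignoring the day."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def group_by_temporal_rule(dois, gap_months=REUSE_GAP_MONTHS):
    """Group dates of introduction that fall within less than `gap_months`
    of the previous one; each group is a candidate for merging."""
    groups = []
    for doi in sorted(dois):
        if groups and months_between(groups[-1][-1], doi) < gap_months:
            groups[-1].append(doi)
        else:
            groups.append([doi])
    return groups


# Illustrative dates only, not taken from the instance database.
dois = [date(1980, 1, 1), date(1981, 6, 1), date(1990, 1, 1),
        date(1991, 1, 1), date(2000, 5, 1), date(2000, 6, 1)]
groups = group_by_temporal_rule(dois)
```

With these six illustrative dates the rule yields three groups. Comparing against the first date in each group instead of the previous one would be a stricter alternative.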


Space (Geography)

• Nearby things tend to be more similar than things that are further apart.

• Instances located close to one another likely reference the same set of addresses. Instances located further apart may represent recycling events.

• For a postcode we can see how its instances change in position over time - are they spatially stationary or more dynamic?

• How do we quantify this within the instance table? - for each set of instances associated with a postcode unit, compute the change in easting & northing between instances.
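
The change in position between successive instances can be sketched as follows, assuming eastings and northings are in metres on the same grid and that the positions are already ordered by date of introduction. The straight-line distance is an addition for convenience; the function name is hypothetical.

```python
import math


def displacements(positions):
    """positions: list of (easting, northing) pairs ordered by DOI.

    Returns one (d_easting, d_northing, distance) tuple per consecutive
    pair of instances.
    """
    result = []
    for (e1, n1), (e2, n2) in zip(positions, positions[1:]):
        de, dn = e2 - e1, n2 - n1
        result.append((de, dn, math.hypot(de, dn)))
    return result


# Illustrative grid references only.
moves = displacements([(325000, 673000), (325003, 673004), (325003, 673004)])
```

A displacement of zero corresponds to a spatially stationary postcode, while a large one is a candidate recycling event.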


• BUT we need to be aware of the spatial accuracy issue. Accuracy with which grid references have been assigned to postcodes has increased over time as methodologies have changed with technology advances.

• An overall increase in accuracy of georeferencing over time.

• Instance location change may therefore operate at multiple scales – a local change due to changes in georeferencing plus a larger change brought about by recycling.


• Summary statistics for all instances:

• 75% of postcodes with multiple instances record no change in location whatsoever.

• Of those that do exhibit location change, in 90% of cases this was between 1m and 3km with the remaining cases exhibiting a change of up to 500km.

• Clearly it would be useful if we had a spatial threshold (like the 3 year temporal threshold) that we could use to decide whether 2 instances should be merged or kept separate as genuine reuses.

• We argue that using a combination of temporal & spatial measures of similarity it is possible to discriminate between genuine and spurious instances.


• Research has only recently begun to engage with this problem. Progress has been hindered by the size of the datasets involved and the pain involved in isolating indicative cases.

• Significant time has been invested in exploring the problem but we are by no means experts - we need feedback - does this methodology seem appropriate - are our core assumptions logical?

• Plans are to explore the effects of applying different threshold values - using known cases of reuse to inform selection of threshold value.

• Pick a threshold value and determine the effects of applying it to the dataset as a whole, e.g. the number of merges that it yields, taking samples to check the validity of the results - are instances inappropriately merged?
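
Exploring candidate thresholds could start with a simple count of how many instance pairs each value would merge. The sketch below considers the spatial threshold only and uses made-up distances; the function name and the "less than or equal" comparison are assumptions.

```python
def merges_per_threshold(distances_m, thresholds_m):
    """For each candidate spatial threshold (metres), count the consecutive
    instance pairs whose separation is within it and so would be merged."""
    return {
        threshold: sum(1 for d in distances_m if d <= threshold)
        for threshold in thresholds_m
    }


# Illustrative separations between consecutive instances, in metres.
distances = [0, 0, 5, 120, 2500, 400000]
counts = merges_per_threshold(distances, [100, 1000, 3000])
```

Samples drawn from the pairs merged at each threshold could then be checked against known cases of reuse.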


• Demonstrate application of these rules by going back to the Spurious cases we looked at earlier.

• Case A - using our temporal rule of 3 years - these 6 could be compressed to 3 instances. Using our spatial rule (assuming that our upper spatial threshold exceeds 100m) these could be compressed to a single instance.


• Case B - the inconsistent instance must either be terminated or merged with another instance. Applying the temporal rule it could be merged with the following instance. However its location is quite different and so we might decide that this falls outside our threshold and so instead we might terminate it with the start date of the following instance.

Phase IV – Create QA Instance DB

At some point in order to move forward we are going to have to proceed, implement the rules from phase 3 and carry out the updates to the instance database.

• In doing this we run the risk of erring in one of two directions - we can either be too inclusive, leading to too many instances being merged together, or not inclusive enough, with too few instances merged together.

• We intend to be pragmatic about this - we simply cannot have so many possibly false instances associated with each postcode. Unlikely that we are going to be able to resolve all cases.

• Once the rules are in place, implementation of them should be fairly straightforward.

Creation of Historic Snapshots

• With the Quality Controlled Instance database in place, yearly historic versions of the postcode directories can then be derived by pulling out all instances that exist within a particular time slice.
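
Deriving a snapshot from the instance database can be sketched as a simple date filter. The tuple layout is an assumption, as is the convention that an instance no longer exists on its date of termination.

```python
from datetime import date


def snapshot(instances, as_of):
    """instances: list of (label, doi, dot) tuples; dot None = still live.

    Returns the instances in existence on the date `as_of`.
    """
    return [
        (label, doi, dot)
        for label, doi, dot in instances
        if doi <= as_of and (dot is None or dot > as_of)
    ]


# Illustrative instances only.
instances = [
    ("AA1 1AA", date(1980, 1, 1), date(1990, 6, 1)),
    ("AA1 1AA", date(1995, 1, 1), None),
    ("BB2 2BB", date(1992, 3, 1), None),
]
cut_1985 = snapshot(instances, date(1985, 1, 1))
cut_1993 = snapshot(instances, date(1993, 1, 1))
cut_2000 = snapshot(instances, date(2000, 1, 1))
```

Each cut then lists one row per postcode that was live at that point, provided the instance timelines no longer overlap.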

Outstanding Issues

• Reconciling the spurious instances still an ongoing task.

• We would welcome comments/feedback about the assumptions/methodologies we have chosen to adopt, both from ONS and from other expert users of the AFPD.

• Is there any documentation which might shed light on procedures used to update the datasets in the past & might explain some of the systematic inconsistencies we have discovered?

Conclusions

• 1. Historical & Contemporary postcode directory datasets are being accessed by academic users through UKBORDERS.

• 2. QA process data has been received and loaded - raw instance database has been created.

• 3. Quality Assurance Audit has been carried out - quality of dataset has been assessed.

• 4. Significant Progress has been made in reconciling inconsistencies, but work remains before derived data can be created and exposed to user community.

• 5. Feedback on work to date and input from other users is requested in order to bring work to a close.

Contact Details

• http://edina.ac.uk/

• [email protected]

• Questions?
