get your hands dirty cleaning data. 2008 european emu users meeting, 3rd june. - elizabeth bruton,...


Get your hands dirty cleaning data.

2008 European EMu Users Meeting, 3rd June.

- Elizabeth Bruton, Museum of the History of Science, Oxford

[email protected] (mhs.ox.ac.uk)

Outline

►Data Migration
►Problem -> Solution approach
►Tools
►Manual Data Cleaning
►Examples
►Current and Future Practices (Documentation, Policing, Review)

Data Migration

►First step towards better, cleaner data
►Steps:
  - Prepare and analyse legacy system
  - Data mapping
  - KE EMu system design
  - Data migration

Legacy System Analysis

►Prepare and analyse previous (legacy) system
  - Data: structure and relationships - tables and fields
    ►Primary
    ►Secondary
    ►Cross-reference
  - Documentation and usage
  - Redundant data

Legacy Data analysis

Data Mapping

KE EMu system design

►Default and additional fields across different modules
►Field titles
►Screen Designer
  - e.g. Summary tab for ecatalogue module
►Finally, data migration

Data cleaning overview

►Problem -> solution approach
  - Input data
  - Operations
  - Output data
►Manual or automated operations or both?
►Which tools to use for automated operations?
  - KE EMu tools – many powerful built-in tools within EMu
  - Non-KE EMu tools – scripts to use on data exported from EMu; reimport back into EMu
  - Both

KE EMu Tools: Texql

►KE Texpress Texql queries
  - Similar syntax to MySQL or SQL
►Uses:
  - Analysing data and data structure
  - Analysing search queries
  - Advanced search queries

KE EMu Tools: Global Replace

►Very useful, powerful but also potentially ‘dangerous’ tool
►Can use in combination with search query or list options within EMu
►Can use regular expressions and/or wildcard searches
►Powerful tool for single field or Field A->Field B operations

KE EMu Tools: Record Merge

►Does what it says on the tin
►Merge one or more duplicate record(s) into single record
►Only ‘attachments’ to different modules are merged into record, not data
►Ditto tool can be used for easily copying data from one record to another
►Attachments to original duplicate record(s) are removed so records can be deleted

KE EMu Tools: Reports

►Tool to present information in assorted ways
►Can be used to produce reports but can also be used as data export tool
►Microsoft Excel or CSV format appropriate for more advanced data operations

Non-KE EMu Tools: Scripting

►Personally use PHP and MySQL
►Perl is also a useful scripting tool; used by KE
►Have written CSV to MySQL file checker and converter in PHP
►Then run more advanced operations on data using PHP scripts
►phpMyAdmin can export data in many formats including CSV
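
The checker-and-converter idea above can be sketched as follows. This is not the author's PHP script: it is a minimal illustration in Python, with SQLite standing in for MySQL, and the table name, column names and sample rows are all hypothetical. It assumes a "check" means rejecting rows whose column count differs from the header's.

```python
import csv
import io
import sqlite3


def check_csv(text):
    """Return (header, rows, problems) for CSV text.

    A row is a problem when its column count differs from the header's;
    problem rows are reported by source line number and left out of `rows`.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows, problems = [], []
    for row in reader:
        if len(row) != len(header):
            problems.append((reader.line_num, len(row)))
        else:
            rows.append(row)
    return header, rows, problems


def load_rows(conn, table, header, rows):
    """Create a text-only table named `table` and insert the checked rows."""
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)


# Hypothetical export; these are illustrative columns, not EMu field names.
SAMPLE = (
    "irn,maker,place\n"
    "1,Smith & Jones,London\n"
    "2,Unknown\n"
    '3,"Smith, John",London\n'
)

header, rows, problems = check_csv(SAMPLE)
conn = sqlite3.connect(":memory:")
load_rows(conn, "legacy_export", header, rows)
loaded = conn.execute('SELECT COUNT(*) FROM "legacy_export"').fetchone()[0]
```

Using a CSV parser rather than splitting on commas matters here: the third sample row contains a quoted comma and would otherwise be flagged as malformed.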

Non-KE EMu Tools: Scripting

►Systematic Approach
  - Keep copy of original data
  - Produce data mapping or data cleaning document
  - Perform operations using PHP file on MySQL table
  - Check data produced (manual or automatic) and output logs
  - Validate data in EMu and then import
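
One way to picture the first, third and fourth steps is the sketch below. It is an assumption-laden illustration in Python working on a CSV file rather than a MySQL table; the function name `clean_with_log`, the file names and the sample data are hypothetical. Validation and import into EMu are not shown.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def clean_with_log(source, output, log_path, column, operation):
    """Apply `operation` to one column, writing cleaned rows and a change log."""
    source = Path(source)
    backup = source.with_name(source.name + ".orig")
    shutil.copy2(source, backup)  # keep an untouched copy of the original data

    changes = 0
    with open(source, newline="", encoding="utf-8") as src, \
         open(output, "w", newline="", encoding="utf-8") as out, \
         open(log_path, "w", newline="", encoding="utf-8") as log:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        log_writer = csv.writer(log)
        writer.writeheader()
        log_writer.writerow(["row", "column", "before", "after"])
        for row_no, row in enumerate(reader, start=1):
            before = row[column]
            after = operation(before)
            if after != before:
                # Record every change so the output can be checked later.
                log_writer.writerow([row_no, column, before, after])
                changes += 1
            row[column] = after
            writer.writerow(row)
    return backup, changes


# Hypothetical export with stray whitespace in the "place" column.
workdir = Path(tempfile.mkdtemp())
export = workdir / "export.csv"
export.write_text("irn,place\n1, London \n2,Paris\n", encoding="utf-8")

backup, changes = clean_with_log(
    export, workdir / "cleaned.csv", workdir / "changes.csv", "place", str.strip
)
cleaned_text = (workdir / "cleaned.csv").read_text(encoding="utf-8")
cleaned_rows = list(csv.reader(cleaned_text.splitlines()))
```

The change log lists each altered value with its row number, which gives a reviewer something concrete to check before anything is reimported.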

Manual Data Cleaning

►Some problems cannot be solved automatically, either partially or entirely
►Need to be ‘eyeballed’ by a person, preferably someone familiar with the museum’s collections

Example: Parties Records

►Legacy system used two systems of noting object ‘makers’
  - Freetext ‘Maker’ field with no centralised system (1:1 ratio); used for applicable records
  - Assigned makers with centralised system; only used for first 3,000 or so records
►Freetext data imported into EMu resulted in approximately 5,500 Parties records

Example: Parties Records

►Good example of mapping freetext field to more structured data field with 1:Many ratio
►KE ran script which ‘detected’ maker type and formatted accordingly, i.e. Maker Type etc
►But still much cleaning up to be done
►Two approaches: automatic then manual
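
The slides do not describe how KE's script detected maker type. The Python sketch below is only a guess at the kind of heuristic such a script might use: the keyword list, the function names and the example names are all hypothetical, and a heuristic like this will misclassify enough records that manual checking is still needed.

```python
import re

# Hypothetical hints that a free-text maker is an organisation.
ORG_HINTS = re.compile(
    r"(&|\b(and sons?|co|company|ltd|limited|bros|brothers)\b)",
    re.IGNORECASE,
)


def guess_party_type(maker):
    """Guess 'Organisation' or 'Person' from a free-text maker name."""
    return "Organisation" if ORG_HINTS.search(maker) else "Person"


def split_person(maker):
    """Split 'Surname, Forename' into parts; other forms keep the whole
    string as the surname so a person can correct it by hand."""
    if "," in maker:
        surname, forename = (part.strip() for part in maker.split(",", 1))
        return {"Surname": surname, "Forename": forename}
    return {"Surname": maker.strip(), "Forename": ""}


# Illustrative inputs only.
examples = ["Smith & Jones", "Smith, John", "Jones and Sons Ltd"]
guessed = {name: guess_party_type(name) for name in examples}
```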

Example: Parties Records

►Problem: Creation-related data within legacy system were all free-text fields
►The museum wanted to keep this data in some format as it contained valuable information, such as ambiguities or uncertainties
►e.g. Italy or France, Attributed to Smith & Jones, possibly last quarter of 19th century etc

Example: Parties Records

►This data did not fit neatly into defined, structured fields such as Parties, Places or Creation Date
►Also wanted to clean Parties records
►Solution: Automatic batch process then manual cleaning

Example: Parties Records – Automatic Approach

  - Exported Creation data (Parties, Place, Creation Date) from EMu
  - Ran script which checked for and removed duplicates in Parties and Place
  - Note: The above operation deleted rather than manipulated data but is still an integral part of the data cleaning operation
  - Copied cleaned Parties, Place, Creation Date into single free-text field: Creation Notes
  - Re-imported data into EMu using Import Tool
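
The de-duplicate-then-combine step above might look something like this in Python. The original script is not shown in the slides, so the matching rule (case-insensitive, whitespace-normalised), the separators and the function names are assumptions; the sample values reuse the free-text examples quoted earlier.

```python
def dedupe(values):
    """Drop blank and repeated values, keeping first-seen order.

    Values are compared case-insensitively after collapsing whitespace.
    """
    seen = set()
    kept = []
    for value in values:
        cleaned = " ".join(value.split())
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            kept.append(cleaned)
    return kept


def creation_notes(parties, places, dates):
    """Join de-duplicated parties and places, plus dates, into one note."""
    parts = [
        "; ".join(dedupe(parties)),
        "; ".join(dedupe(places)),
        "; ".join(d.strip() for d in dates if d.strip()),
    ]
    return ". ".join(part for part in parts if part)


note = creation_notes(
    ["Attributed to Smith & Jones", "attributed to  Smith & Jones"],
    ["Italy or France", "Italy or France", ""],
    ["possibly last quarter of 19th century"],
)
```

Keeping the qualifiers ("Attributed to", "possibly") in the note is the point of the exercise: the structured fields can then be cleaned without losing the recorded uncertainty.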

Example: Parties Records – Automatic Approach

  - Began data cleaning by running Global Replace operation within EMu eparties module, removing 'Signed by', 'Attributed to', or 'Made by' from the relevant parties records
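
The effect of that replacement can be illustrated outside EMu with an ordinary regular expression. The sketch below is Python, not EMu's Global Replace, whose pattern syntax may differ. The `preview` helper is a hypothetical addition that lists what would change before anything is applied, which is one way to guard against the 'dangerous' side of bulk replacement.

```python
import re

# The three phrases named on the slide. The pattern is anchored to the
# start of the value and case-insensitive, so the same words appearing
# later in a name are left alone.
PREFIX = re.compile(r"^\s*(signed by|attributed to|made by)\s+", re.IGNORECASE)


def strip_attribution(name):
    """Remove a leading attribution phrase from a party name."""
    return PREFIX.sub("", name).strip()


def preview(names):
    """List (before, after) pairs for the values the replacement would change."""
    return [(n, strip_attribution(n)) for n in names if strip_attribution(n) != n]


# Illustrative values only.
pending = preview(["Signed by J. Smith", "J. Smith", "Attributed to Smith & Jones"])
```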

  - Next: Manual Approach

Example: Parties Records – Manual Approach

  - Cleaned records: checked Parties Type (Person or Organisation) and edited records (Surname, Forename, Organisation etc)
  - Merged and deleted duplicate records
  - Checked and deleted unattached parties records
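
The manual pass can be supported by scripts that produce worklists rather than making changes. The Python sketch below assumes a hypothetical export of party ids with names, plus a mapping from catalogue records to the party ids they reference; it only flags candidates, and the merging and deleting remain decisions for a person.

```python
from collections import defaultdict


def unattached(party_ids, attachments):
    """Return party ids that no catalogue record refers to, in sorted order."""
    used = {party_id for linked in attachments.values() for party_id in linked}
    return sorted(set(party_ids) - used)


def duplicate_candidates(parties):
    """Group party ids whose names match once case, spacing and
    full stops are normalised."""
    groups = defaultdict(list)
    for party_id, name in parties.items():
        key = " ".join(name.replace(".", " ").casefold().split())
        groups[key].append(party_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Illustrative data only; ids and names are made up.
parties = {
    1: "Smith & Jones",
    2: "smith  &  jones",
    3: "J. Smith",
    4: "J Smith",
    5: "Jones",
}
attachments = {"cat-1": [1], "cat-2": [3, 4]}

to_review = duplicate_candidates(parties)
orphans = unattached(parties, attachments)
```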

Example: Parties Records – End Result

►Currently have 3,300 cleaner Parties records

Current and Future Practices

►Current
  - Systematic approach to data cleaning; incorporated into monthly museum EMu Users' Meeting
  - Review
►In Progress
  - Documentation
►Future
  - Policing

Conclusion

►Data cleaning and policing is an ongoing process for an institution of any size
►Data standards must be set and adhered to
►Needs to be approached and done in a systematic way
►Any questions?