Rubbish in, Rubbish out: applying good data governance techniques to gain maximum benefit from publisher data. UKSG Conference, April 2013. Phil Nicolson

Upload: uksg-connecting-the-knowledge-community

Post on 22-Jan-2015


TRANSCRIPT

  • 1. UKSG Conference, April 2013. Phil Nicolson

  • 2. Data Governance
      - What is Data Governance
      - What is Data Quality
      - The challenges
      - Data governance programme
      - A publisher approach
      - The outcome: Book author example
      - ICEDIS
      - Summary

  • 3. Data governance
      "I think that the key issue here is that the information is probably incorrect, inaccurate and in a form that almost certainly shouldn't have been used."
      Dr John Thomson, cardiologist at Leeds General Infirmary, Sky News, 30/3/2013

  • 4. Data Governance: a definition
      "Data governance is defined as the processes, policies, standards, organisation, and technologies required to manage and ensure the availability, accessibility, quality, consistency, auditability, and security of data"

  • 5. Data Quality: definitions
      - Data are of high quality "if they are fit for their intended uses in operations, decision making and planning"
      - Data are deemed of high quality if they correctly represent the real-world construct to which they refer

  • 6. Data Quality
      Data quality attributes:
      - Accurate
      - Reliable
      - Complete
      - Appropriate
      - Timely
      - Credible
      - Up-to-date

  • 7. The challenge: Data Sources
      - Multiple data sources: system data silos
      - Multiple locations: geographic data silos
      - Data entered through multiple channels
      - Data entered by different people

  • 8. The challenge: Data Sources
      Typical publisher systems:
      - Financial system
      - CRM/Sales database
      - Authentication system
      - Fulfilment
      - Usage statistics
      - Submissions system
      - Author database
      Data can be entered by:
      - Organisation staff
      - Authors
      - Society members
      - Agents in the supply chain
      - 3rd party organisations
      - ..

  • 9. The challenge: Institutions
      UCL:
      - University College London (UK)
      - Université Catholique de Louvain (Belgium)
      - Universidad Cristiana Latinoamericana (Ecuador)
      - University College Lillebælt (Denmark)
      - Centro Universitario Celso Lisboa (Brazil)
      - Union County Library (USA)
      NPL:
      - National Physical Laboratory (UK)
      - National Physical Laboratory (India)
      York Uni.:
      - University of York (UK)
      - York University (Canada)
      Northeastern University:
      - Northeastern University (Boston, USA)
      - Northeastern University (Shenyang, China)

  • 10. The challenge: Individuals
      How can we uniquely identify individuals? Of the 700,000 individuals known to the RSC in 2012 there were:
      - Smith: ~1,500
      - Jones: ~1,000
      - Li: >10,000

  • 11. Consequences of poor data

  • 12. Biggest obstacle(s) to data quality improvement in your organization?
      - Lack of accountability and responsibility for data quality: 55.4%
      - Too many information silos: 51.8%
      - Lack of awareness or communication of the magnitude of data quality problems: 51.4%
      - Lack of common understanding of what data quality means: 50.2%
      - Lack of awareness or communication of the opportunities associated with high quality data: 45.0%
      - Lack of senior leadership in tackling data quality issues: 44.2%
      - Lack of data quality policies, plans, and procedures: 42.2%
      - Perception that data quality is an IT issue only rather than an organisation wide issue: 41.8%
      Source: The State of Information and Data Quality 2012 Industry Survey & Report (IAIDQ): Understanding how Organizations Manage the Quality of their Information and Data Assets. Pierce, Yonke, Malik, Nagaraj

  • 13. Data Governance: why it is vital
      Processes, policies and standards ensure quality and consistency:
      - Increase consistency and confidence in our decision making
      - Maximise the income generation potential of our data
      - Provide excellent customer service
      - Designating accountability for information quality
      - Minimising or eliminating re-work
      - Optimise staff effectiveness
      - Decreasing the risk of regulatory fines
      - Improving data security
      Data is one of the most valuable assets within an organisation

  • 14. Data governance: a new culture

  • 15. Data governance programme

  • 16. Plan & prioritise
      - Sponsorship: director level sponsor?
      - Programme management: business or IT driven?
      - Organisational structure: local, national, international?
      - Scope: focus on the most important data?
      - Ownership: who are the business owners of critical data?
      - New system implementation: protect investment

  • 17. Plan & prioritise
      - Resources: dedicated staff?
      - Funding: which area of the business will fund the programme?
      - Business drivers: what are the major business drivers?
      - Barriers: what are the main barriers (cultural, funding, resources, priorities etc.) and can they be mitigated?

  • 18. Audit & Analyse
      - Audit existing data quality
      - Review all relevant systems
      - How poor is it?
          - Incomplete data
          - Invalid
          - Out of date
          - ...

  • 19. Clean existing data
      - Prioritise
      - Quick wins
      - Highlight progress
      - What can be automated?
      - Introduce unique identifiers

  • 20. Identifiers available
      People:
      - International Standard Name Identifier (ISNI)
      - Open Researcher and Contributor ID (ORCID)
      - Scopus Author Identifier
      - ResearcherID
      Organisations:
      - International Standard Name Identifier (ISNI)
      - Ringgold ID
      - DUNS Number (D&B) and other business and finance IDs
      - MDR PID Numbers and other marketing IDs
      - Library of Congress MARC Code List for Organizations

  • 21. ISNI
      ISNI is designed to be a bridge identifier.
      [Diagram: ISNI Number; Party ID 1 and Party ID 2, each with proprietary information and/or metadata]

  • 22. Author IDs
      - ORCID is designed to persistently identify and disambiguate scholarly researchers and attach them to research output
      - ORCID identifiers utilize a format compliant with the ISNI ISO standard
      - ISNI has reserved a block of identifiers for use by ORCID, so there will be no overlaps in assignments
      - Recorded as http://orcid.org/0000-0001-2345-6789
      http://about.orcid.org/
      http://www.isni.org/

  • 23. Use cases
      - Disambiguation of researchers and connection to all their research
      - Links to contributors, editors, compilers and others involved in the research process
      - Embed IDs into research workflows and the supply chain
      - Integrate systems

  • 24. Institutional IDs
      - Ringgold is an ISNI Registration Agency
      - Unique institutional ID number maps data across systems
      - ISNI numbers should be used across the scholarly supply chain to:
          - Disambiguate institutional records
          - Eradicate duplication of data
          - Map institutions into their hierarchy
          - Link systems using the institutional ID as the lynchpin

  • 25. Minimising the impact of data silos
      Standard identifiers (both individual and institution) can be used to break down silos by enabling better system linking.

  • 26. Improve data capture
      - Data quality policy
      - Web forms
      - Closer collaboration with 3rd parties to encourage use of industry standard identifiers such as ISNI or ORCID

  • 27. Data capture: data quality policy
      Design to ensure accuracy, quality and consistency.
      Individual responsibilities:
      - All staff are responsible for the accuracy and consistency of data
      - Capture data in such a way that it is uniquely identifiable and easily shared within the organisation and with 3rd parties
          - Records relating to individuals
          - Records relating to institutions
      - Reporting of inaccuracies to Data Owners
      Data owners' responsibilities:
      - All source data systems must have a designated Data Owner
      - Data owner retains overall responsibility for all records within their source data system

  • 28. Improve data capture: web forms
      - Required fields
      - Validation
          - Address validation: postcode lookup
          - Institution validation: institution lookup
      - Internal and external web form consistency
      - Language barriers
      - Help and hints
      - Free-text fields

  • 29. On-going monitoring
      - Dashboards
      - Regular audits
      - Metrics
          - Institutional Linking Rate
      - Staff awareness
      - Reporting of errors

  • 30. A publisher example
      Develop a Data Governance Programme:
      - Data champion
      - Engagement at all levels
      - Ownership at all levels
      - Allocate necessary resources
      - Guidelines/Policy: Data quality policy
      - Processes put in place
      - Education: raise awareness
      - New staff training on Data Governance and its wider impact
      - Change of culture

  • 31. A publisher example
      - Ringgold and DataSalon client
      - All institutional records contain Ringgold Identifiers
      - System linking via Individual and Institutional identifiers
      - Data (both good and bad) visible to all via MasterVision
      - Use of data governance dashboards
      - Tidying of existing data
      - Simple reporting of incorrect data across organisation
      - New data captured correctly

  • 32. Author database
      1. Create a data governance dashboard to monitor problem areas:
          - Book authors with no related institution
          - Unknown book authors
          - Author records without an affiliation entry
          - Author records with commas in the affiliation entry
          - Book authors without an email address
          - Book authors with an invalid email address
      2. Correct problem records in existing data
          - Dashboard clearly highlighted all records of concern and these records were corrected

  • 33. Author database
      3. Ensure new records are created correctly
          - Raise staff understanding of the importance of capturing data correctly and the impact it has across the organisation as a whole (data silos)
          - Training covering data governance
      4. Ensure appropriate Ringgold coverage
          - Where institutions were discovered in the Author database that didn't exist within Identify, these were reported to Ringgold. This not only means that individual authors can be linked to the new institution but that any individuals in other data sources at the same institution can be linked. This benefits all users of our data and potentially highlights new sales opportunities.
      5. Monitor data quality on an on-going basis
          - Books data governance dashboard updated on a weekly basis

  • 34. Author database results
      [Chart: y-axis 70.00% to 100.00%; series labels: All data sources, ANKO]
      10% will never link:
      - Missing data (old records)
      - Institution no longer exists
      - Retired author
      - Genuinely no related institution
      End of process:
      - 15% increase in authors linked to institutions: information valuable in supporting all areas of the business
      - Ready for data migration

  • 35. ICEDIS
      - The international standards organization EDItEUR is working to encourage improvements in the ways that "party" information is communicated
      - Some parts of the supply chain continue to send unstructured name & address records, making matching, disambiguation and automatic ingest near impossible
      - ICEDIS has collaborated with EDItEUR to develop a highly structured data model for exchanging names, addresses and standard identifiers
      - The group has recently been validating the model by means of a "paper pilot", using a small library of about 100 name & address types
      - An XML schema and HTML documentation are freely available
      www.editeur.org
      www.editeur.org/138/[email protected]

  • 36. Summary
      - Your data is a very valuable asset when managed correctly
      - Establishing a data governance programme will enable you to gain maximum benefit from that data
      - Data governance is as much about changing the culture of an organisation as it is about processes and procedures
      - It will take time but the benefits can be enormous

  • 37. Phil Nicolson
      Data Manager
      Ringgold [email protected]
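
Editorial illustration (not part of the original slides): the dashboard checks listed under "Author database" and the "Institutional Linking Rate" metric under "On-going monitoring" could be sketched roughly as below. This is only a minimal sketch under stated assumptions: the record fields, the sample identifiers, the loose email pattern, and the definition of linking rate as the share of records carrying an institution identifier are all hypothetical and are not taken from the presentation or from any Ringgold or DataSalon product.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a deliberately loose pattern that only checks for
# "something@domain.tld"; real email validation is more involved.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class AuthorRecord:
    """Hypothetical author record; field names are illustrative only."""
    name: str
    affiliation: Optional[str] = None     # free-text affiliation entry
    institution_id: Optional[str] = None  # e.g. an institutional identifier
    email: Optional[str] = None


def find_problems(record: AuthorRecord) -> List[str]:
    """Return the dashboard problem categories that apply to one record."""
    problems = []
    if not record.name.strip():
        problems.append("unknown author")
    if not record.institution_id:
        problems.append("no related institution")
    if not record.affiliation:
        problems.append("no affiliation entry")
    elif "," in record.affiliation:
        problems.append("comma in affiliation entry")
    if not record.email:
        problems.append("no email address")
    elif not EMAIL_RE.match(record.email):
        problems.append("invalid email address")
    return problems


def linking_rate(records: List[AuthorRecord]) -> float:
    """Assumed definition: fraction of records with an institution identifier."""
    if not records:
        return 0.0
    linked = sum(1 for r in records if r.institution_id)
    return linked / len(records)


# Made-up sample data; the INST-... identifiers are placeholders.
records = [
    AuthorRecord("A. Smith", "University of York", "INST-001", "a.smith@example.org"),
    AuthorRecord("B. Jones", "Dept of Chemistry, York University", None, "b.jones@example"),
    AuthorRecord("", None, None, None),
    AuthorRecord("C. Li", "Northeastern University", "INST-002", None),
]

for r in records:
    print(repr(r.name), find_problems(r))
print("linking rate:", linking_rate(records))
```

Run weekly against the author database, a report like this would give the list of records to correct and a single figure to track over time, in the spirit of steps 1, 2 and 5 of the Author database example.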