Data Quality in Data Warehouse and Business Intelligence Environments – Discussion Paper


DESCRIPTION

Time and again, we hear about the failure of data warehouses – while things may be improving, they are improving only slowly. One explanation for data quality being overlooked is that the IT department is often responsible for delivering and operating the DW/BI environment. What ensues is an agenda based on “how do we build it”, not “why are we doing this”. This needs to change. In this discussion paper, I explore the issues of data quality in data warehouse, business intelligence and analytics environments, and propose an approach based on “Data Quality by Design”.


References:

1. “Gartner Hype Cycle”, Gartner, July 2013.
2. Md. Ruhul Amin & Md. Taslim Arefin, “The Empirical Study on the Factors Affecting Data Warehousing Success”, International Journal of Latest Trends in Computing, Vol. 1, Issue 2, p. 139, December 2010.
3. “The State of Data Quality”, Experian, 2013.
4. “2012 BI and Information Management Trends”, Information Week, November 2011.

Data Quality By Design

Managing Data Quality for Data Warehousing and Business Intelligence Environments

Fast Facts

• More than 50% of DW/BI projects fail to meet expectations, with more than 50% of data projects having limited acceptance or being outright failures as a result of a lack of attention to data quality issues.²

• Common data errors plague 91% of organisations.³

• 46% of businesses cite data quality as a barrier to BI adoption.⁴

“Data is the lifeblood of the business” is a phrase that gets bandied about. Yet all too often, the quality of data is not thought about – at least not up front, and not until it’s too late. If data really were the “lifeblood of the business”, you’d expect data quality to be fundamental!

One explanation for data quality being overlooked is that the IT department is often responsible for delivering and operating the DW/BI environment. What ensues is an agenda based on “how do we build it”, not “why are we doing this”. This needs to change.


Data Quality By Design

Data Quality By Design is an approach that aims to make the quality of data a foundational part of business systems design and implementation, both for business processes and business applications. Data warehouse (DW), Business Intelligence (BI) and Analytics initiatives are an excellent entry point for introducing the concept of data quality by design, simply because the implementation of these solutions is entirely focussed on delivering information to business users for decision-making purposes. (For the purposes of this paper, “DW/BI” will hereafter be used to refer to all such solution initiatives.)

There are many different approaches and methodologies for DW/BI implementation. However, all DW/BI methods will typically feature four major stages of delivery:

• Data Discovery (Source data analysis)

• Data Modelling (Business functional models, logical models, physical data structures)

• Data Movement (ETL/ELT/ESB)

• Develop Key Outputs (Business Intelligence, standard reports & ad hoc analytics)

Additionally, most modern DW/BI methodologies will follow an iterative, incremental (Agile) delivery approach; sequential (“waterfall”) methods are typically not recommended.

Data Quality techniques can help support each stage in the DW/BI delivery process. The remainder of this paper will explore the relationship between Data Quality techniques and these key stages of DW/BI solution delivery.

Introduction

It’s probably reasonable to claim that we have finally reached a stage of pervasiveness with Data Warehouses & Business Intelligence (BI). Data warehouses are now generally accepted as part of mainstream business systems, to the point where they are almost ubiquitous. Certainly, most major businesses have some form of BI solution these days, and business users now expect their Business Intelligence analyses to be available anytime, anywhere, on any device.

Additionally, “Big Data” is getting information and analytics onto the business agenda like never before (although, according to Gartner’s recent Hype Cycle analysis¹, key components of “Big Data” solutions such as cloud-based grid computing, in-memory databases and MapReduce are only now moving through the “trough of disillusionment” phase). All in all, we really are now living the dream of the digital economy in the Information Age.

And yet, time and again, we hear about the failure of data warehouses – while things may be improving, they are improving only slowly.


Data Quality during Data Discovery

Data Discovery process steps:

• Agreeing scope of suitable source data sets

• Source data analysis

• Logical source-to-target mapping

How Data Quality techniques can help:

• Data Profiling: Identify previously unknown issues with data as part of discovery.

• Data Inspection: Increase Data Stewards’ understanding of the data:
  - Get more intimate with the data
  - Discover additional context and narrative
  - Articulate new business rules (“why is it that…?”)

• Corrective Action Planning: Feedback to remediate data issues before solution development / testing.

Data Quality Profiling:

Data quality profiling is an excellent diagnostic method for gaining additional understanding of the data.

Profiling the source data helps inform both business requirements definition and detailed solution designs for data-related projects, as well as enabling data issues to be managed ahead of project implementation.

Profiling may be required at several levels:

• Simple profiling within a single table (e.g. Primary Key constraint violations)

• Medium complexity profiling across two or more interdependent tables (e.g. Foreign Key violations)

• Complex profiling across two or more data sets, with applied business logic (e.g. reconciliation checks)

In all cases, field-by-field analysis is required to truly understand the data gaps; a simple illustration follows below.
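By way of illustration, the following is a minimal profiling sketch covering the first two levels above (single-table checks and a cross-table foreign key check), written in Python with pandas. The file names and the customer_id key column are assumptions made purely for the example, not a prescription for any particular toolset.

```python
# Illustrative only: a minimal data profiling sketch using pandas.
# File and column names (customers.csv, orders.csv, customer_id) are
# assumptions for the example, not references to any specific system.
import pandas as pd

customers = pd.read_csv("customers.csv")   # assumed source extract
orders = pd.read_csv("orders.csv")         # assumed source extract

# Simple profiling within a single table: null counts and primary key duplicates
null_counts = customers.isnull().sum()
pk_duplicates = customers[customers.duplicated(subset=["customer_id"], keep=False)]

# Medium complexity profiling across two tables: foreign key "orphans"
orphan_orders = orders[~orders["customer_id"].isin(customers["customer_id"])]

# Field-by-field frequency profile to surface unexpected values
value_profile = {col: customers[col].value_counts(dropna=False).head(10)
                 for col in customers.columns}

print("Null counts per field:\n", null_counts)
print(f"Duplicate primary keys: {len(pk_duplicates)} rows")
print(f"Orders with no matching customer: {len(orphan_orders)} rows")
```

Dedicated profiling tools provide the same classes of check at scale; the point of the sketch is simply that each profiling level can be expressed as an explicit, repeatable test.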

Any data profiling analysis must not only identify the issues and underlying root causes, but must also identify the business impact of the data quality problem (effectiveness, efficiency, risk inhibitors).

This will help quantify the value of remediating the data. Root cause analysis also helps identify any process outliers and drives out requirements for remedial action on managing any identified exceptions.

Be sure to profile your data and take baseline measures before applying any remedial actions – this will enable you to measure the impact of any changes.

“The data is always right” – a data quality error indicates a failure in the process, the system or the people. Use the data to inform and drive process change.

RECOMMENDED ACTION POINTS:

• Data Quality Profiling and root-cause analysis to be undertaken as an initiation activity as part of all data warehouse, master data and application migration project phases.

DQ as part of data modelling

Data modelling steps:

• Define Business Functions & requirements

• Identify Logical Data Model subject areas

• Derive target physical data models

How DQ techniques can help:

• Data gap analysis: identify requirements for new data that is useful to a business function, but not currently captured.

• Identify unnecessary data: screen for data sets that currently get captured but that actually have no utility to the business.

• Develop Data Quality Rules: articulate explicit rules for how the data should be represented, and build screening capability (the “data quality firewall”) to ensure that business data is fit for purpose (a sketch of such rule definitions follows after this list).

• Manage fragmentation and duplication of data: identify more explicitly where data needs to be integrated and replicated, and apply more rigorous controls to ensure that the “golden record” for a given data entity is maintained in the designated system of record only.
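As an illustration of what “explicit rules” might look like in practice, here is a minimal sketch of data quality rules expressed as declarative, testable definitions. The field names (customer_id, sales_qty, postcode) and the individual checks are assumptions made for the example; in a real programme the rules would be derived from the Business Glossary and agreed with the Data Stewards.

```python
# Illustrative only: one way to articulate explicit data quality rules as
# declarative, testable definitions. Field names and checks are assumptions.
from dataclasses import dataclass
from typing import Callable
import re

@dataclass
class DataQualityRule:
    name: str                         # business-readable rule name
    field: str                        # field the rule applies to
    check: Callable[[object], bool]   # returns True when the value passes

# Example rule library (assumed content, for illustration only)
RULES = [
    DataQualityRule("Customer reference is mandatory", "customer_id",
                    lambda v: v is not None and str(v).strip() != ""),
    DataQualityRule("Sales quantity must be positive", "sales_qty",
                    lambda v: isinstance(v, (int, float)) and v > 0),
    DataQualityRule("Postcode matches expected pattern", "postcode",
                    lambda v: bool(re.fullmatch(r"\d{4}", str(v)))),
]

def evaluate(record: dict) -> list[str]:
    """Return the names of any rules the record fails."""
    return [r.name for r in RULES if not r.check(record.get(r.field))]

print(evaluate({"customer_id": "C001", "sales_qty": 0, "postcode": "2000"}))
# -> ['Sales quantity must be positive']
```

Expressing the rules in this declarative form means the same definitions can later be reused by the data quality firewall during data movement (see below).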

The Enterprise Information Model is a crucial tool for driving consistency, integrity and utility of information across the business. A number of discrete steps are identified to derive the overall Enterprise Information Model:

• Establish a core top-down Business Functional Model of the business (a.k.a. Conceptual Business Model). This describes the idealised view of what should be happening in the enterprise from a functional perspective. It establishes the business context(s) that operate upon the data and within which data is then managed. Structured interviews with the business leadership team are a good entry point for capturing the core structure and expectations of the idealised business functional model. (N.B. this is not the process model, nor a model of the organisational structure. Departments are not functions!)

• Derive the core Business Glossary of common informational terms (a high-level, consistent business lexicon that supports shared interpretation and clear semantic meaning of the data). The Business Glossary will be underpinned by a much more detailed and expansive set of Technical Metadata which captures the explicit and context-specific metadata definitions, derivation, business rules, calculation logic and lineage for each atomic information term in use within the enterprise.

• Derive the logical data model for the enterprise, which represents the core entities, attributes and relationships for informational items required to support the identified business functions.

• Model the Reporting Catalogue of key business questions (operational queries, historic reports, analytical and predictive models, data mining models) that the business should be asking in order to monitor and drive its performance. This will almost certainly include questions that are currently not being asked by the business community, and may include questions that currently cannot be answered using existing data. Note too that many of these questions will be cross-functional in nature. Creative thinking is required, including learning from what other industries are doing.

• Derive the CRUD Matrix (Create, Read, Update, Delete) for business data by mapping the Functional Model to the entities in the Logical Data Model. Every business function must act upon at least one entity, and every entity must be acted upon by at least one function (a simple consistency-check sketch follows after this list).

• Map to the Information Asset Register, which documents the catalogue of current data holdings: what data currently exists within the enterprise, where it exists, for what purpose(s) it is currently used, and by whom.
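To make the CRUD completeness rule concrete, the following is a minimal sketch of a consistency check over a CRUD matrix. The business functions and entities shown are invented for the example; the check simply flags functions that act on no entity and entities that no function acts upon.

```python
# Illustrative only: a minimal consistency check over a CRUD matrix,
# enforcing the rule that every business function acts on at least one
# entity and every entity is acted on by at least one function.
# Function and entity names are invented for the example.

crud_matrix = {
    # business function -> {entity: CRUD letters}
    "Take Customer Order": {"Customer": "R", "Order": "CRU"},
    "Dispatch Goods":      {"Order": "RU", "Dispatch": "CR"},
    "Maintain Products":   {},  # deliberately incomplete, to trigger a warning
}
entities = {"Customer", "Order", "Dispatch", "Product"}

functions_without_entities = [f for f, m in crud_matrix.items() if not m]
touched = {e for m in crud_matrix.values() for e in m}
entities_without_functions = sorted(entities - touched)

print("Functions acting on no entity:", functions_without_entities)
print("Entities acted on by no function:", entities_without_functions)
```

In practice this kind of check would run against the modelling repository rather than a hand-written dictionary, but the rule being tested is the same.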

RECOMMENDED ACTION POINTS:

• Continue to develop the Enterprise Information Model elements: Business Functional Model, Business Glossary, Technical Metadata, Logical Data Model, Reporting Catalogue, CRUD Matrix, Information Asset Register.

• Adopt, apply and enhance these models and tools within data-related projects as an explicit part of the Systems Development Lifecycle (SDLC).

DQ as part of Data Movement (ETL/ELT, data migration, integration, etc.)

Data Validation and Dimensions of Data Quality

In any data quality validation process, target tolerances & business rules need to be established.

A pragmatic approach to data quality measurement takes into account fitness-for-purpose(s). How this is measured will be up to the individual business; however, a suggested schema for profiling data is based on the following “ACE” dimensions:

• Availability
  - Currency: the reference period of the data; is the data still up-to-date and relevant, or is it “stale”? (e.g. the only available Customer List was last updated in 2007.)
  - Timeliness: is the data made available when it is needed? (e.g. I currently don’t receive the Sales Order figures until three days after Month-End.)

• Completeness
  - Individual Record Completeness: are all required fields provided within each data record, or are there inappropriate NULL values? (e.g. in the sales transaction Ref# 340254 for Mr. Smith on 16/03/14, there is no Sales Quantity recorded.)
  - Data Set Completeness: are all expected records present, or are there gaps in the history? (e.g. in my list of Sales transactions for 2014, there are no rows of data for July.)

• Error Free
  - Integrity/Coherence: do different data sets join up as intended? (e.g. do the “Orders” and “Dispatches” data sets both include the Customer Reference Number, and are the CRNs the same?)
  - Uniqueness: are discrete values expected and preserved? (e.g. do all customers have a unique Customer Reference Number, or do we have two customers with the same CRN?)
  - Validity: is the data within an acceptable range of known parameters? (e.g. I have a dispatch note for 31st November 2013.)
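By way of illustration, here is a minimal sketch of how a data set might be measured against the ACE dimensions. The sample records, date format and the seven-day currency tolerance are assumptions made purely for the example.

```python
# Illustrative only: a minimal sketch of measuring a data set against the
# "ACE" dimensions. The sample data, date format and tolerances are assumptions.
from datetime import date, datetime

sales = [
    {"ref": "340254", "customer": "Mr. Smith", "qty": None, "dispatch_date": "2014-03-16"},
    {"ref": "340255", "customer": "Ms. Jones", "qty": 3,    "dispatch_date": "2013-11-31"},
]
last_refreshed = date(2014, 3, 17)

def is_valid_date(text: str) -> bool:
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False

results = {
    # Availability: is the data stale relative to an agreed currency target?
    "availability.currency_ok": (date(2014, 3, 20) - last_refreshed).days <= 7,
    # Completeness: are mandatory fields populated in every record?
    "completeness.records_ok": all(r["qty"] is not None for r in sales),
    # Error Free: uniqueness of the reference, and validity of dates
    "error_free.unique_refs": len({r["ref"] for r in sales}) == len(sales),
    "error_free.valid_dates": all(is_valid_date(r["dispatch_date"]) for r in sales),
}
print(results)  # completeness and date validity fail for this sample
```

The point is that each dimension resolves to explicit, automatable checks once the relevant tolerances and business rules have been agreed.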

Data Movement Process Steps:

• Detailed source-to-target mapping

• Audit and integrity checks

• Integration code & test

• Reconciliation checks

How DQ techniques can help:

• Data validation: pre-load checks as a precursor to incorporating data into the data warehouse.

• Data Quality Firewall profile and alert: feedback loop to source (alert, trouble ticket generation).

Data Quality Firewall

Perceptions of data quality are affected by our past experiences. If we can proactively influence the quality of data before a new solution is delivered, then there is a greater chance of project success.

A “Data Quality Firewall” establishes a visible window on data quality for both business and technical operations. With respect to project delivery, new solutions should apply the above model to identify and incorporate a number of data quality firewall elements into their design and delivery:

• Up-front visibility of known data quality issues, based on proactive profiling of the data (see above).

• Strong metadata management at the Business Glossary level to ensure consistent business understanding of data, and good technical metadata management to align with the underlying detailed business rules.

• Using profiling to inform and develop the library of Data Quality Rules that apply to the data.

• Ongoing profiling of data prior to Data Warehouse loading, with rejection of any data records that fail profiling checks.

• A feedback loop from the DQ firewall to inform and influence changes to both operational systems and data warehouse designs (“improving the data improves the design, which improves the data”).

• Parameter-based targets and tolerances for data quality, with automated alerts and reporting summaries on identified issues that fail DQ profiling checks.

• Automated generation of trouble tickets on any identified issues.

• A preferred approach of “load trusted”: validate data before accepting it into the DW/BI environment, in contrast with “load everything” environments that flag erroneous records for after-the-fact correction (not ideal). A minimal sketch of this approach follows below.

Note too that just because we have the ability to profile a feature doesn’t mean we should! 100% data quality is almost never necessary (at least for analytic decision-making). Pragmatism should be applied to prioritise profiling and remedial efforts.
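As an illustration of the “load trusted” approach, here is a minimal pre-load firewall sketch: records failing the agreed data quality rules are held back for feedback to the source, and an alert is raised when the reject rate breaches an agreed tolerance. The rule logic, field names and the 2% tolerance are assumptions made for the example; a production implementation would typically sit within the ETL/ELT tooling and raise trouble tickets rather than log messages.

```python
# Illustrative only: a minimal "load trusted" pre-load gate. Records that
# fail the agreed data quality rules are rejected and fed back to the source;
# an alert is raised if the reject rate breaches a tolerance.
# Rule logic, field names and the 2% tolerance are assumptions.
import logging

logging.basicConfig(level=logging.INFO)
REJECT_TOLERANCE = 0.02  # assumed: no more than 2% of records may fail

def failed_rules(record: dict) -> list[str]:
    """Return a list of failed rule names for one record."""
    failures = []
    if not record.get("customer_id"):
        failures.append("customer_id is mandatory")
    if record.get("sales_qty") is None or record["sales_qty"] <= 0:
        failures.append("sales_qty must be a positive number")
    return failures

def firewall(batch: list[dict]) -> tuple[list[dict], list[dict]]:
    accepted, rejected = [], []
    for record in batch:
        failures = failed_rules(record)
        (rejected if failures else accepted).append({**record, "dq_failures": failures})
    reject_rate = len(rejected) / len(batch) if batch else 0.0
    if reject_rate > REJECT_TOLERANCE:
        # In a real environment this might generate a trouble ticket instead.
        logging.warning("DQ firewall: %.1f%% of records rejected", reject_rate * 100)
    return accepted, rejected

accepted, rejected = firewall([
    {"customer_id": "C001", "sales_qty": 5},
    {"customer_id": "",     "sales_qty": 2},
])
print(len(accepted), "accepted;", len(rejected), "rejected")
```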

RECOMMENDED ACTION POINTS:

• Incorporate “Data Quality Firewall” considerations as a necessary requirement for all new projects.

• Coach Data Quality by Design into Business Analyst and Solution Architecture teams.

For measuring a particular data set against each dimension, appropriate tolerances need to be established to define what is (or is not) deemed to be acceptable for general usage. Issuing a Data Quality Declaration (DQD) statement with each data set provides contextual and narrative guidance to data consumers as to the relative suitability of the data set within a given context. The consumer can then adjudge whether or not the data set is suitable for their purpose.

A DQD can be as simple as a GREEN/RED indicator for each DQ dimension, together with a summary statement of the source context of the data, including a description of the provenance, relevance and authority of the data source. It should also record:

• Interpretability: any explanatory narrative.

• Accessibility: how the data set can be accessed, and its format.
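The following is a minimal sketch of how such a declaration might be generated alongside a published data set. The dimension results and the narrative text are placeholders invented for the example.

```python
# Illustrative only: generating a simple Data Quality Declaration (DQD)
# with a GREEN/RED indicator per dimension plus narrative context.
# Dimension results and descriptive text are assumptions for the example.

def data_quality_declaration(dimension_results: dict, provenance: str,
                             interpretability: str, accessibility: str) -> str:
    lines = ["DATA QUALITY DECLARATION"]
    for dimension, passed in dimension_results.items():
        lines.append(f"  {dimension:<14} {'GREEN' if passed else 'RED'}")
    lines.append(f"  Source context: {provenance}")
    lines.append(f"  Interpretability: {interpretability}")
    lines.append(f"  Accessibility: {accessibility}")
    return "\n".join(lines)

print(data_quality_declaration(
    {"Availability": True, "Completeness": False, "Error Free": True},
    provenance="Monthly extract from the (assumed) Sales Order system",
    interpretability="Quantities exclude cancelled orders",
    accessibility="CSV on the shared analytics area",
))
```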

RECOMMENDED ACTION POINTS:

• Consider defining a Data Quality Declaration for each data set.

Data Quality as part of Business Intelligence

BI Process Steps:

• Define semantic layer (business glossary terms)

• Prepare key outputs (reports, analysis, ad hoc capability)

How DQ techniques can help:

• Publish a Data Quality Declaration to indicate the level of trust in the data consumed by a report.

• Metadata management, lineage & communication of shared understanding.

Conclusions

Data Quality techniques enhance the delivery and operation of the DW/BI environment at each stage of delivery. This drives better solution design, increased adoption and enhanced business value. Ideally, data quality capabilities will be incorporated during the initial stages of DW/BI solution delivery; however, capability can also be retro-fitted to drive better trust in the DW/BI output.

Ongoing profiling, validation and correction of data ensures the contents of the DW/BI solution remain trusted.

About the author

Alan D. Duncan is an evangelist for information and analytics as enablers of better business outcomes, and a member of the Advisory Board for QFire Software.

An executive-level leader in the field of Information and Data Management Strategy, Governance and Business Analytics, he has over 20 years of international business experience, working with blue-chip companies in a range of industry sectors.

Alan was named by Information-Management.com in their 2012 list of “Top 12 Data Governance gurus you should be following on Twitter”.

Twitter: @Alan_D_Duncan
Blog: http://informationaction.blogspot.com.au