Towards a Process Oriented View on Statistical Data Quality
Michaela Denk, Wilfried Grossmann
Contents
• Approaches Towards Data Quality
• Example: Data Integration
• A Generic Statistical Workflow Model
• Quality Assessment
• Conclusions
Approaches Towards Data Quality
The usual approach towards data quality is the Reporting View: define a number of so-called quality dimensions and evaluate the final product according to criteria for these dimensions.
• Some frequently used dimensions: Accuracy, Relevance, Accessibility, Timeliness, Coherence, Comparability, ...
Approaches Towards Data Quality
These dimensions are often broken down into sub-dimensions.
• Example, Accuracy: Sampling Effects, Representativity, Over-Coverage, Under-Coverage, Missing Values, Imputation Error, ...
Such an approach is fine as long as the production of data follows a predefined scheme with limited degrees of freedom.
Approaches Towards Data Quality
If we have a number of different options for data production, such an approach is probably not the best one.
Compare the ideas of Total Quality Management (TQM) in industrial production: systematic treatment of the influence of the different production steps on the quality of the final product.
We need a Processing View on data quality: how is data quality influenced by production?
Approaches Towards Data Quality
How can we arrive at a Processing View on data quality?
• We need a statistical workflow model.
• We have to organize the processing information necessary for quality assessment in an appropriate way.
• Compare the (old) ideas of B. Sundgren about the capture of metadata.
Approaches Towards Data Quality
We have to know functions for assessing quality:
Output_Quality = F(Input_Quality, Processing_Quality)
Such functions have to be applied according to
• the object we are interested in, e.g. a variable, a population, or a classification
• the quality aspect we are interested in
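As a toy illustration of such a quality function, one could assume a multiplicative propagation model: the output quality is the input quality degraded by the processing step. The multiplicative form below is an illustrative assumption, not a model proposed in the talk.

```python
# Hypothetical sketch of Output_Quality = F(Input_Quality, Processing_Quality).
# The multiplicative combination is an illustrative assumption.

def output_quality(input_quality: float, processing_quality: float) -> float:
    """Combine input and processing quality scores (each in [0, 1])."""
    return input_quality * processing_quality

# Example: a source with 95% accuracy processed by a step that
# preserves 90% of that accuracy.
q = output_quality(0.95, 0.90)
print(round(q, 4))  # 0.855
```

In practice F would differ per object (variable, population, classification) and per quality aspect; the point is only that quality is computed from the inputs and the processing step rather than reported afterwards.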
Example: Data Integration
Data integration occurs frequently in statistical data production, in particular when data are produced from administrative sources.
It uses a number of operations usually understood as data pre-processing.
Basic goal: combine information from two or more already existing data sets.
Example: Data Integration
Example of a data integration dataflow: Input → Integration → Post-alignment
[Figure: data integration dataflow]
• Source Data D1: Matching Key, V1 (Gender)
• Source Data D2: Matching Key, V1 (Gender), V2 (Status)
• Data after Integration: Matching Key, V1 (Gender) from both sources, V2 (Status)
• Data after Post-alignment: Matching Key, V1 (Gender), V2 (Status)
Example: Data Integration
Top-level task description:
• Match the datasets according to the matching key
• Align V1 (Gender)
• Align V2 (Status)
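The three steps above can be sketched with plain dictionaries. The dataset layouts, the toy records, and the conflict-flagging rule for Gender are all illustrative assumptions, not prescribed methods.

```python
# Minimal sketch of the top-level integration task, using plain dicts.
d1 = {  # Source Data D1: matching key -> gender
    "k1": {"gender": "m"},
    "k2": {"gender": "f"},
}
d2 = {  # Source Data D2: matching key -> gender and status
    "k1": {"gender": "m", "status": "employed"},
    "k3": {"gender": "f", "status": "retired"},
}

def integrate(d1, d2):
    """Match records on the key; keep both gender values for later alignment."""
    merged = {}
    for key in d1.keys() & d2.keys():
        merged[key] = {
            "gender_d1": d1[key]["gender"],
            "gender_d2": d2[key]["gender"],
            "status": d2[key]["status"],
        }
    return merged

def post_align(merged):
    """Align V1 (Gender): keep agreeing values, flag conflicts.
    A real alignment rule would encode beliefs about the sources."""
    aligned = {}
    for key, rec in merged.items():
        g1, g2 = rec["gender_d1"], rec["gender_d2"]
        aligned[key] = {
            "gender": g1 if g1 == g2 else "conflict",
            "status": rec["status"],
        }
    return aligned

result = post_align(integrate(d1, d2))
print(result)  # {'k1': {'gender': 'm', 'status': 'employed'}}
```

Note that only matched keys survive integration; the unmatched records (k2, k3) are exactly the kind of loss a processing view on quality has to account for.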
Example: Data Integration
Details, decisions to be made:
Are the datasets appropriate?
• Quality of the matching keys
• Quality of the data sources
Which method for identification of matches? Which method for handling ambiguities in V1 (Gender)? Which method for imputation of V2 (Status)?
How is quality measured?
• At the level of a summary measure?
• At the level of a specific variable?
• At the level of individual records?
Example: Data Integration
There are no generally accepted standard tools and methods for answering such questions. Probably we have to compare a number of alternative approaches:
• Apply the generic format to different datasets
• Try different statistical methods and models
• Use different methods for quality assessment: traditional formulas, simulation-based evaluation, assessment using strategic surveys
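A simulation-based evaluation of two alternative approaches might look like the following sketch, which estimates how many true record pairs an exact-match rule recovers under a toy key-corruption model. The error model and the two strategies compared are assumptions made for illustration.

```python
# Sketch of simulation-based quality assessment: estimate the share of
# true record pairs recovered by exact matching on a noisy key.
import random

def simulate_linkage(n: int = 1000, key_error_prob: float = 0.05,
                     seed: int = 0) -> float:
    """Corrupt each key in the second dataset with probability
    key_error_prob and return the fraction of pairs still matched."""
    rng = random.Random(seed)
    keys = list(range(n))
    # A corrupted key is replaced by a value that matches nothing.
    noisy = [k if rng.random() >= key_error_prob else n + i
             for i, k in enumerate(keys)]
    recovered = sum(1 for k, nk in zip(keys, noisy) if k == nk)
    return recovered / n

# Compare two hypothetical key-preparation strategies.
for label, p in [("raw keys", 0.05), ("standardized keys", 0.01)]:
    print(label, simulate_linkage(key_error_prob=p))
```

The same harness could be rerun with different datasets, matching rules, or error models, which is exactly the comparison of alternatives the slide calls for.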
Example: Data Integration (Conclusion)
Different statistical methods may be an essential part of data production and quality assessment.
There is no longer a clear distinction between "objective" data collection and statistical analysis.
Statistics generates added value beyond (administrative) accounting and IT.
A Generic Statistical Workflow Model
A statistical workflow is a mixture of
• business workflow (process-oriented)
• scientific workflow (data-oriented)
Quality evaluation is the main control element of the process.
We have to consider the workflow at two levels:
• Meta-level (control of the process)
• Data-level (production of data)
A Generic Statistical Workflow Model
Building blocks of the workflow model:
• Transformations (basic data operations)
• Process components (tasks), defined by: task definition, pre-alignment, feasibility check, main transformation, post-alignment, quality evaluation
• Workflow (sequence of process components)
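The six-part process component could be represented as a small data structure whose stages run in order, with the quality evaluation as the final control step. The field names follow the slide; the callable signatures and the toy usage are assumptions, not a prescribed API.

```python
# A minimal sketch of the process-component building block.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ProcessComponent:
    task_definition: str
    pre_alignment: Callable[[Any], Any]
    feasibility_check: Callable[[Any], bool]
    main_transformation: Callable[[Any], Any]
    post_alignment: Callable[[Any], Any]
    quality_evaluation: Callable[[Any], float]

    def run(self, data: Any) -> tuple:
        """Execute the stages in order; return (data, quality score)."""
        data = self.pre_alignment(data)
        if not self.feasibility_check(data):
            raise ValueError(f"infeasible input for task: {self.task_definition}")
        data = self.main_transformation(data)
        data = self.post_alignment(data)
        return data, self.quality_evaluation(data)

# Toy usage: a component that drops missing values, doubles the rest,
# and reports a (trivial) quality score.
comp = ProcessComponent(
    task_definition="toy transformation",
    pre_alignment=lambda d: [x for x in d if x is not None],
    feasibility_check=lambda d: len(d) > 0,
    main_transformation=lambda d: [2 * x for x in d],
    post_alignment=lambda d: sorted(d),
    quality_evaluation=lambda d: 1.0,
)
out, q = comp.run([3, None, 1])
print(out, q)  # [2, 6] 1.0
```

A workflow is then simply a sequence of such components, with each component's quality score available to the meta-level for control decisions.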
A Generic Statistical Workflow Model
Example of a data integration component workflow
A Generic Statistical Workflow Model
To understand how statistics influences the boxes and data quality, let us zoom into the box for post-alignment.
Quality Assessment
For quality assessment we need a detailed description of the changes in meta-information during the dataflow.
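One simple way to capture those changes is to record a meta-information snapshot after every step of the dataflow. The snapshot fields and all record counts below are made-up illustrative values.

```python
# Sketch: track how meta-information changes along the dataflow by
# keeping one snapshot per processing step.
meta_flow = []

def log_step(step: str, variables: list, n_records: int) -> None:
    """Append a meta-information snapshot for one step of the dataflow."""
    meta_flow.append({"step": step, "variables": variables, "n": n_records})

log_step("source D1", ["key", "gender"], 1200)
log_step("source D2", ["key", "gender", "status"], 1100)
log_step("after integration", ["key", "gender_d1", "gender_d2", "status"], 950)
log_step("after post-alignment", ["key", "gender", "status"], 950)

for snap in meta_flow:
    print(snap["step"], snap["variables"], snap["n"])
```

Comparing consecutive snapshots makes the quality-relevant changes explicit: records lost at integration, variables merged at post-alignment, and so on.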
Quality Assessment
Example of a meta-information flow in data integration.
Details for the register-based census are given in the presentation of Fiedler/Lenk in Session 26 (Thursday).
Quality Assessment
Example: assessment of the accuracy of variables V1 (Gender) and V2 (Status) in the example.
Quality Assessment: V1 (Gender)
Input:
• Coincidence of matching keys in both datasets
• Agreement of the variable Gender in both datasets
• Beliefs about the quality of the variable in both sources
Accuracy assessment:
• Models developed in decision analysis (the calculus of belief networks) seem appropriate.
• Alternatively, we can use a strategic sample to check whether our prior beliefs are correct and our decision rule is confirmed by statistical arguments.
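A toy calculation in the spirit of the belief-calculus idea: if both sources report the same (binary) Gender value, how strongly should we believe it? The independence assumption and the reliability numbers are illustrative, not from the talk.

```python
# Posterior belief that an agreeing binary value is correct, for two
# independent sources with known reliabilities p1 and p2.
# Independence and the example reliabilities are illustrative assumptions.

def belief_given_agreement(p1: float, p2: float) -> float:
    """P(value correct | both sources agree) for a binary variable."""
    agree_correct = p1 * p2              # both sources correct
    agree_wrong = (1 - p1) * (1 - p2)    # both wrong (same wrong value)
    return agree_correct / (agree_correct + agree_wrong)

print(round(belief_given_agreement(0.9, 0.8), 3))  # 0.973
```

Even with modest source reliabilities, agreement pushes the belief well above either prior, which is why the key input on this slide is agreement of Gender across the two datasets.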
Quality Assessment: V2 (Status)
Input:
• Coincidence of matching keys in both datasets
• Reliability of the model used for imputation
• Measurement technique for the quality of the imputation
Accuracy assessment:
• In this case we can apply traditional statistical techniques such as the false classification rate, ROC curves, and simulation.
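The false classification rate can be computed by comparing imputed Status values against a strategic sample of verified true values. The data values below are made up for illustration.

```python
# Sketch of the traditional assessment for V2: false classification rate
# of imputed values against a verified strategic sample.

def false_classification_rate(imputed: list, verified: list) -> float:
    """Share of records where the imputed value disagrees with the truth."""
    errors = sum(1 for a, b in zip(imputed, verified) if a != b)
    return errors / len(verified)

imputed  = ["employed", "retired", "employed", "employed", "retired"]
verified = ["employed", "employed", "employed", "retired", "retired"]
print(false_classification_rate(imputed, verified))  # 0.4
```

For a probabilistic imputation model, the same verified sample would also support an ROC curve by thresholding the model's predicted probabilities.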
Conclusions
We have presented a model that allows tighter coupling of quality assessment to the data production process.
Such a model seems useful if data production has more degrees of freedom:
• What data should be used?
• What techniques should be used?
The approach allows identification of the different factors influencing quality.
Conclusions
The model allows the formulation of precise questions about possible alternatives and defines new issues for research in statistical data quality.
Hopefully it helps to better understand the added value generated by statistics.
Thank you for your attention