Aspects of Data Quality


Data quality is multidimensional, and involves data management, modelling and analysis, quality control and assurance, storage and presentation. As independently stated by Chrisman [2] and Strong et al. [3], data quality is related to use and cannot be assessed independently of the user. In a database, the data have no actual quality or value [4]; they only have potential value that is realized only when someone uses the data to do something useful. Information quality relates to its ability to satisfy its customers and to meet customers' needs [5].

Chapman goes on to enumerate a set of factors that contribute to fitness-for-use, citing Redman [6]:

Accessibility
Accuracy
Timeliness
Completeness
Consistency with other sources
Relevance
Comprehensiveness
Providing a proper level of detail
Easy to read
Easy to interpret

Data quality

High-quality data needs to pass a set of quality criteria. Those include:

Accuracy: An aggregated value over the criteria of integrity, consistency and density
Integrity: An aggregated value over the criteria of completeness and validity
Completeness: Achieved by correcting data containing anomalies
Validity: Approximated by the amount of data satisfying integrity constraints
Consistency: Concerns contradictions and syntactical anomalies
Uniformity: Directly related to irregularities
Density: The quotient of the number of missing values in the data and the number of total values that ought to be known
Uniqueness: Related to the number of duplicates in the data

The process of data cleansing

Data Auditing: The data is audited with the use of statistical methods to detect anomalies and contradictions. This eventually gives an indication of the characteristics of the anomalies and their locations.

Workflow specification: The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high-quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered. If, for instance, we find that an anomaly is a result of typing errors in data input stages, the layout of the keyboard can help in identifying possible solutions.

Workflow execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient even on large sets of data, which inevitably poses a trade-off because the execution of a data cleansing operation can be computationally expensive. A minimal sketch of such a workflow is given after these steps.

Post-Processing and Controlling: After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow are manually corrected if possible. The result is a new cycle in the data cleansing process where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing.
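A minimal sketch, assuming a pandas environment and an invented customer table, of how these stages can fit together: an audit step reports missing-value share and duplicate counts (the density and uniqueness criteria listed above), a workflow is specified as an ordered list of operations, and execution applies them in sequence. All column names and cleansing choices are illustrative, not taken from the text.

```python
import pandas as pd

# Toy data with deliberate problems: a duplicate row and missing values.
records = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com", "b@example.com", None],
    "age": [34.0, 28.0, 28.0, None],
})

def audit(df):
    """Report simple statistics used to locate anomalies."""
    return {
        # share of missing values per column (the density criterion)
        "missing_share": df.isna().mean().to_dict(),
        # number of fully duplicated rows (the uniqueness criterion)
        "duplicate_rows": int(df.duplicated().sum()),
    }

# The workflow: an ordered sequence of cleansing operations, chosen after
# inspecting the audit report.
workflow = [
    lambda df: df.drop_duplicates(),
    lambda df: df.dropna(subset=["email"]),                        # drop rows missing a key field
    lambda df: df.assign(age=df["age"].fillna(df["age"].median())),
]

# Workflow execution.
print("before:", audit(records))
cleaned = records
for step in workflow:
    cleaned = step(cleaned)
print("after:", audit(cleaned))
```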

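The post-processing step describes a cycle in which the cleansed data is audited again. Continuing the sketch above, that loop might look as follows; the bound of three cycles is an arbitrary choice.

```python
def is_clean(report):
    """No missing values and no duplicate rows remain."""
    return report["duplicate_rows"] == 0 and all(
        share == 0 for share in report["missing_share"].values()
    )

data = records
for _ in range(3):                 # bound the number of cleansing cycles
    if is_clean(audit(data)):
        break
    for step in workflow:
        data = step(data)
# Anything still flagged at this point would be handed over for manual correction.
```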
Popular methods used

Parsing: Parsing in data cleansing is performed for the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification. This is similar to the way a parser works with grammars and languages. A small sketch of such a syntax check is given below.

Data Transformation: Data transformation allows the mapping of the data from their given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values.
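A minimal sketch of parsing as syntax checking, assuming an invented record format of the form ID;YYYY-MM-DD;amount; in practice the pattern would come from the target schema's specification.

```python
import re

# Hypothetical record specification: "ID;YYYY-MM-DD;amount".
RECORD = re.compile(r"\d+;\d{4}-\d{2}-\d{2};\d+(\.\d+)?")

def accepts(line: str) -> bool:
    """Accept or reject a raw line, much like a parser checking a grammar."""
    return RECORD.fullmatch(line) is not None

for line in ["42;2019-08-06;19.99", "xx;06/08/2019;19,99"]:
    print(line, "->", "accepted" if accepts(line) else "syntax error")
```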

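And a sketch of data transformation, showing a value-translation table and min-max normalization; the country codes and the 0-100 score range are invented for illustration.

```python
# Translation table for value conversion (invented example values).
COUNTRY_CODE = {"Deutschland": "DE", "Germany": "DE", "France": "FR"}

def min_max(value, lo, hi):
    """Normalize a numeric value into the range [0, 1]."""
    return (value - lo) / (hi - lo)

raw = [{"country": "Deutschland", "score": 70},
       {"country": "France", "score": 40}]

transformed = [
    {"country": COUNTRY_CODE.get(r["country"], r["country"]),
     "score": min_max(r["score"], lo=0, hi=100)}
    for r in raw
]
print(transformed)   # scores now lie between 0 and 1, countries use short codes
```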
Duplicate Elimination: Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that would bring duplicate entries closer together for faster identification.

Statistical Methods: By analyzing the data using the values of mean, standard deviation, range, or clustering algorithms, it is possible for an expert to find values that are unexpected and thus erroneous. Although the correction of such data is difficult since the true value is not known, it can be resolved by setting the values to an average or other statistical value. Statistical methods can also be used to handle missing values, which can be replaced by one or more plausible values that are usually obtained by extensive data augmentation algorithms.
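A minimal sketch of sorted-neighbourhood duplicate detection along the lines described above: sort by a key that brings likely duplicates next to each other, then compare adjacent records. The choice of key (lower-cased name plus postal code) is an assumption for illustration.

```python
records = [
    {"name": "Ann Smith", "zip": "10115", "phone": "111"},
    {"name": "ann smith", "zip": "10115", "phone": None},
    {"name": "Bob Jones", "zip": "80331", "phone": "222"},
]

def sort_key(r):
    # Key chosen so that likely duplicates end up adjacent after sorting.
    return (r["name"].lower(), r["zip"])

records.sort(key=sort_key)
deduped = [records[0]]
for current in records[1:]:
    if sort_key(current) == sort_key(deduped[-1]):
        continue                      # keep only the first representative
    deduped.append(current)
print(deduped)
```

And a sketch of the statistical approach: values far from the column mean are flagged as suspicious, and missing entries are filled with the median. The two-standard-deviation threshold and the toy values are assumptions.

```python
from statistics import mean, median, stdev

ages = [34, 29, 31, 33, 30, 28, 412, None]   # 412 looks erroneous, None is missing
known = [a for a in ages if a is not None]
mu, sigma = mean(known), stdev(known)

suspicious = [a for a in known if abs(a - mu) > 2 * sigma]
filled = [median(known) if a is None else a for a in ages]
print("suspicious values:", suspicious)      # flags 412
print("after imputation:", filled)           # None replaced by the median
```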