data quality
DESCRIPTION
Data Quality. Class 5. Goals. Data Quality Rules (Continued) Example Use of Data Quality Rules. Data Quality Rules Classes. 1) Null value rules 2) Value rules 3) Domain membership rules 4) Domain Mappings 5) Relation rules - PowerPoint PPT PresentationTRANSCRIPT
Data Quality Rules Classes
1) Null value rules 2) Value rules 3) Domain membership rules 4) Domain Mappings 5) Relation rules 6) Table, Cross-table, and Cross-message assertions 7) In-Process directives 8) Operational Directives 9) Other rules
Representing Data Quality Rules
Data is divided into 2 sets:– conformers– violators
Sets can be represented using SQL Create SQL statements representing violating
set
Using SQL
Direct queries Embedded queries
– Using ODBC/JDBC, can create validation scripts in C C++ Java Visual Basic Etc.
Null Value Representations
Maintain a table of null representation types and names:
create table nullreps (name varchar(30),
nulltype char(1),
description varchar(1024),
source varchar(512),
nullval varchar(100),
nullrepid integer);
Null Value Rules
Allows nulls– If the rule is “allows nulls” without any additional
characterization Nothing necessary
– If the rule is “allows nulls,” but only of a specific type Must check for real nulls (and possibly blanks and spaces): SELECT * from <table> WHERE <table>.<attribute> is
NULL;
Null Value Rules
Does not allow nulls– Must check for nulls(and possibly blanks and
spaces): SELECT * from <table> WHERE <table>.<attribute> is
NULL;
Value Rules
Value rule is specified as some set of constraints
Makes use of operators and functions:– +, -, *, /, <, <=, >, >=, !=, ==, AND, OR– User defined functions
Example:– value >= 0 AND value <= 100
Value Rules 2
Validation test is opposite of constraint Use DeMorgan’s laws
– If constraint was “value >= 0 AND value <= 100), use:
SELECT * from <table> where <table>.<attribute> < 0 OR
<table>.<attribute> > 100;
Domain Membership
Domains are stored in a database table Test for domain membership of an attribute is a
test to make sure that all values are represented in domain table
Domain Reference Tables
create table domainref (
name varchar(30),
dtype char(1),
description varchar(1024),
source varchar(512),
domainid integer
);
Domain Membership
Test for membership of attribute foo in the domain named bar:
SELECT * from <table> where foo not in
(SELECT value from domainvals where domainid =
(SELECT domainid from domainref where
domainref.name = “bar”));
Domain Assignment
The values in the attribute define the domain:– Find all the values not in the domain already– Update domain tables with those values
Domain Assignment 2
SELECT * from <table> where foo not in (SELECT value from domainvals where domainid =
(SELECT domainid from domainref where domainref.name = “bar”));
For all values in this set, create a record with (the value, the domain id for “bar”), and insert into domainvals.
Mapping Membership
Similar to domain membership, except:– Must include domain membership tests for both
values– Also must be looked up in the mapping tables
Completeness
Defines when a record is complete– Ex: IF (Orders.Total > 0.0), Complete With
{Orders.Billing_Street, Orders.Billing_City, Orders.Billing_State, Orders.Billing_ZIP}
Format:– Condition– List of fields that must be complete
Completeness 2
Equivalent to a set of null tests using condition Select * from <table> where <condition is true>
and <list of not null tests>;
Exemption
Defines which fields must be missingIF (Orders.Item_Class != “CLOTHING”) Exempt
{Orders.Color,
Orders.Size
}
Format:– Condition– List of fields that must be null
Exemption 2
If condition is true, the fields may not be null Equivalent for test of condition and test for not
nulls
Consistency
Define a relationship between attributes based on field content– IF (Employees.title == “Staff Member”) Then
(Employees.Salary >= 20000 AND Employees.Salary < 30000)
– Format: Condition Assertion
Consistency 2
If condition is true, the assertion must be true Equivalent to test for cases where the condition
is true and the assertion is false:
Select * from <table> where <condition> and not <assertion>;
Derivation
Prescriptive form of consistency rule Details how one attribute’s value is determined based
on other attributesIF (Orders.NumberOrdered > 0) Then {
Orders.Total = (Orders.NumberOrdered * Orders.Price) * 1.05
}
Format:– Condition– assignment
Derivation 2
The assigned fields must be updated if condition is true
Find all records where the condition is true Generate update SQL calls with updated
values Execute updates
Functional Dependence
Functional Dependence between columns X and Y:– For any two records R1 and R2 in a table,
if field X of record R1 contains value x and field X of record R2 contains the same value x, then if field Y of record R1 contains the value y, then field Y of record R2 must contain the value y.
In other words, attribute Y is said to be determined by attribute X.
Functional Dependence 2
Rule Format:– Attribute X determines Attribute Y
Validation test makes sure that the functional dependence criterion is met
This means that if we extract the X value from the set of all distinct value pairs, that set should have no duplicates
Functional Dependence 3
Create view FD as select distinct X, Y from <table>;
Select count (*) from FD; Select count (distinct X) from <table>;
These should be the same numbers.
Primary Key/Uniqueness
A set of attributes defined as a primary key must uniquely identify a record
Can also be viewed as a uniqueness constraint Format:
– {attribute list} is PRIMARY– {attribute list} is UNIQUE
Primary
Test to make sure that the number of distinct records with the expected key is the same as the number of records
Select count(*) from <table>; Select count (distinct <attribute list>) from
<table>;
These numbers should be the same
Uniqueness
Test for multiple record occurrences with the same set of values that should have been unique, if there is a separate known primary key
SELECT <table>.<attribute>, <table>.<attribute>
FROM <table> AS t1, <table> AS t2
WHERE t1.<attribute> = t2.<attribute> and t1.<primary> <> t2.<primary>;
Foreign Key
When the values in field f in table T is chosen from the key values in field g in table S, field S.g is said to be a foreign key for field T.f
If f is a foreign key, the key must exist in table S, column g (=referential integrity)
Foreign Key 2
Similar to primary key Test is to make sure that all values in foreign
key field exist in target table
Select * from <source table> where <attribute> not in (Select distinct <attribute> from <target table>);
Use of Data Quality Rules
Data Validation Root Cause Analysis Message Transformation Data-driven GUIs Metadata Collection
Data Validation
Translate rule set into select statements Create a program that:
– Loads select statements into an array, indexed by a unique integer
– Connects to database via ODBC– Iterates through the array of select statements those
results
Data Validation 2
– Each type of rule has an expected result; check against the expected result
– Outputs the result of each statement to output file, tagged by rule identifier
– Results can be tallied to yield an overall percentage of valid records to total records
Root Cause Analysis
Root cause analysis can be started by looking at the counts of violated rules
Use the most frequently violated rule as a starting place
Message Transformation
Electronic Data Interchange Use DQ rules to validate incoming messages Use DQ rules (derivations, mappings) to
transform incoming messages into an internal format
Data-driven GUIs
Data dependence is specified in a collection of rules
Generate equivalence classes of data values based on dependence specification
Data-driven GUIS
First, look for all independent attributes – this is class 0 For class i, collect all attributes that depend on class (i
– 1) The GUI will be constructed to iteratively request data
from class 0..n Based on the results from collecting data at step j, the
rules associated with the actual values are applied, determining which values are requested at step j + 1