![Page 1: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/1.jpg)
Data Quality: the “other” Face of Big Data
Barna Saha, Divesh Srivastava
AT&T Labs-Research
![Page 2: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/2.jpg)
Outline
¨ Introduction
¨ Discovering data quality semantics
¨ Repairing inconsistencies
¨ Open problems + Q/A
![Page 3: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/3.jpg)
Big Data + Data Quality
¨ Big data: all about the V’s
– Size: huge volume of data from multiple sources
– Speed: dynamic data, collected and analyzed at high velocity
– Complexity: huge variety of data and sources
¨ Goal: to extract significant value from big data
¨ Key issue: data quality
– Raw data is often of questionable veracity
– How do we obtain high quality information?
![Page 4: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/4.jpg)
Case Study: Big Data Quality [LDL+13]
¨ Study on two domains
– Belief of clean data
– Poor quality data can have big impact

| Domain | #Sources | Period | #Objects | #Local-attrs | #Global-attrs | Considered items |
|---|---|---|---|---|---|---|
| Stock | 55 | 7/2011 | 1000*20 | 333 | 153 | 16000*20 |
| Flight | 38 | 12/2011 | 1200*31 | 43 | 15 | 7200*31 |
![Page 5: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/5.jpg)
Case Study: Big Data Quality
¨ Is the data consistent?
– Tolerance to 1% value difference
![Page 6: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/6.jpg)
Case Study: Big Data Quality
¨ Why such inconsistency?
– Semantic ambiguity
[Figure: Yahoo! Finance and Nasdaq screenshots. Values shown: 52wk Range: 25.38-95.71; 52 Wk: 25.38-93.72; Day’s Range: 93.80-95.71]
![Page 7: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/7.jpg)
Case Study: Big Data Quality
¨ Why such inconsistency?
– Unit errors
[Figure: values shown by two sources: 76,821,000 and 76.82B]
![Page 8: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/8.jpg)
Case Study: Big Data Quality
![Page 9: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/9.jpg)
Case Study: Big Data Quality
¨ Why such inconsistency?
– Pure errors
[Figure: FlightView, FlightAware, and Orbitz screenshots. Times shown: 6:15 PM, 6:15 PM, 6:22 PM; 9:40 PM, 8:33 PM, 9:54 PM]
![Page 10: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/10.jpg)
Case Study: Big Data Quality
¨ Why such inconsistency?
– Random sample of 20 data items + 5 items with largest # of values
![Page 11: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/11.jpg)
Case Study: Big Data Quality
¨ Copying between sources?
![Page 12: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/12.jpg)
Case Study: Big Data Quality
¨ Copying on erroneous data?
![Page 13: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/13.jpg)
Case Study: Lessons Learned
¨ Big data has considerable inconsistency
– Even in domains where poor quality data can have big impact
– Semantic ambiguity, out-of-date data, unexplainable errors
¨ Data sources often copy from each other
– Copying can happen on erroneous data, spreading poor quality data
![Page 14: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/14.jpg)
Data Quality: By the Numbers
¨ Impact of poor data quality
– Erroneous data costs US businesses $600 billion/year [E02]
– In DW projects, data cleaning takes 30-80% of time and budget
– Data quality tools market is growing at 16% annually, way over the 7% average for other IT segments [G07]
¨ How much data is erroneous
– Enterprise data error rates: average of 1-5%, some > 30% [R98]
– Only 1/3rd of XML Web documents with XSD/DTD are valid, 14% even lack well-formedness [GM11]
![Page 15: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/15.jpg)
Small Data Quality: How Was It Achieved?
¨ Specify all domain knowledge as integrity constraints on data
– Reject updates that do not preserve integrity constraints
– Works well when the domain is well understood and static
![Page 16: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/16.jpg)
Big Data Quality: A Different Approach?
¨ Big data: integrity constraints cannot be specified a priori
– Data diversity → complete domain knowledge is infeasible
– Data evolution → domain knowledge quickly becomes obsolete
– Too much rejected data → “small” data
![Page 17: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/17.jpg)
Big Data Quality: A Different Approach?
¨ Big data: integrity constraints cannot be specified a priori
– Data diversity → complete domain knowledge is infeasible
– Data evolution → domain knowledge quickly becomes obsolete
¨ Solution: let the data speak for itself
– Learn models (semantics) from the data
– Identify data glitches as violations of the learned models
– Repair data glitches and models in a timely manner
![Page 18: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/18.jpg)
In This Tutorial
¨ A focus on well-structured data and logic-based data quality
– Models: logical constraints, e.g., (C)FDs, IDs, MDs, EGDs, DCs
– Repairs: cost-based modifications to the data and models
¨ What we do not discuss in this tutorial
– Logic-based: consistent query answering, without data repairs
– Statistics-based: statistical models, anomaly detection
– Unstructured data: quality of audio, video
![Page 19: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/19.jpg)
Outline
¨ Introduction
¨ Discovering data quality semantics
¨ Repairing inconsistencies
¨ Open problems + Q/A
![Page 21: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/21.jpg)
A Systematic Way to Data Quality
¨ Impose integrity constraints
¨ Errors and inconsistencies in data emerge as violations of the constraints
![Page 22: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/22.jpg)
Discovering/ Learning Data Quality Semantics
¨ “Small data”: manually specify rules that govern the data semantics
¨ “Big data”
– Let the data speak for itself
– Learn rules and patterns from the data
![Page 23: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/23.jpg)
Discovering/ Learning Data Quality Semantics
¨ Variety of data
– Looking at condition and context
– Statistically robust measure
¨ Volume of data
– Scalable algorithms
– Efficiency vs Accuracy
¨ Velocity of data
– Streaming and incremental algorithms
![Page 24: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/24.jpg)
Instance of Sales Relation
[name, type, country] → [price, tax]
![Page 25: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/25.jpg)
An Instance of Sales Relation
[name, type, country] → [price, tax]
The functional dependency does not hold
![Page 26: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/26.jpg)
An Instance of Sales Relation
[name, type, country] → [price, tax]
![Page 27: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/27.jpg)
An Instance of Sales Relation
[name, type, country] → [price, tax]
Conditional Functional Dependency
![Page 28: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/28.jpg)
Full VS Condition
¨ Functional dependency specifies integrity constraints over the whole database
¨ High variety of data: one size does NOT fit all
– Conditional functional dependency
– Similarly, conditional inclusion dependency, conditional sequential dependency, conditional conservation dependency
![Page 29: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/29.jpg)
An Instance of Sales Relation
[name, type, country] → [price, tax]
Consider pattern [-, -, UK || -, -]
![Page 30: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/30.jpg)
An Instance of Sales Relation
[name, type, country] → [price, tax]
Consider pattern [-, -, UK || -, -]
![Page 31: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/31.jpg)
An Instance of Sales Relation
[name, type, country] → [price, tax]
Consider pattern [-, -, UK || -, -]
The pattern must have enough support, but a small number of violations is acceptable; these are possibly data errors
![Page 32: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/32.jpg)
An Instance of Sales Relation
[name, type, country] → [price, tax]
Consider pattern [-, -, UK || -, -]
Local Support = 7/20 = 0.35
Local Confidence = 6/7 = 0.857
![Page 33: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/33.jpg)
An Instance of Sales Relation
[name, type, country] → [price, tax]
Global Support = 15/20 = 0.75
Global Confidence = 13/15 = 0.87
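The support and confidence numbers on these slides can be reproduced in spirit with a short sketch. The Sales table itself is not in this transcript, so the `SALES` rows below are hypothetical, and the measure definitions are assumptions: support is the fraction of tuples covered by the tableau, and confidence is the largest fraction of covered tuples that can be kept so that the FD holds. A one-pattern tableau gives the local measures; a multi-pattern tableau gives the global ones.

```python
from collections import Counter, defaultdict

WILDCARD = "-"
N_LHS = 3  # the FD is [name, type, country] -> [price, tax]

# Hypothetical toy instance: (name, type, country, price, tax).
SALES = [
    ("Harry Potter", "book", "UK", 10, 0),
    ("Harry Potter", "book", "UK", 10, 0),
    ("Harry Potter", "book", "UK", 12, 0),      # disagrees on price
    ("Spiderman", "dvd", "UK", 20, 3),
    ("Harry Potter", "book", "France", 10, 0),
    ("Harry Potter", "book", "France", 10, 1),  # disagrees on tax
    ("Terminator", "dvd", "USA", 25, 2),
    ("Terminator", "dvd", "USA", 25, 2),
]

def matches(lhs, pattern):
    """True if every non-wildcard pattern cell equals the tuple's value."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, lhs))

def kept(rows, n_lhs):
    """Largest number of rows that can be kept so that LHS -> RHS holds."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[:n_lhs]][row[n_lhs:]] += 1
    return sum(max(counts.values()) for counts in groups.values())

def measures(relation, tableau, n_lhs):
    """Return (support, confidence) of a pattern tableau over the relation."""
    covered = [r for r in relation
               if any(matches(r[:n_lhs], p) for p in tableau)]
    support = len(covered) / len(relation)
    confidence = kept(covered, n_lhs) / len(covered) if covered else 1.0
    return support, confidence

print(measures(SALES, [("-", "-", "UK")], N_LHS))                      # local
print(measures(SALES, [("-", "-", "UK"), ("-", "-", "USA")], N_LHS))   # global
```

On the toy data the UK pattern covers 4 of 8 tuples and 3 of those 4 can be kept, which mirrors how 7/20 and 6/7 arise on the slide.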
![Page 34: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/34.jpg)
Exact VS Soft/ Approximate
¨ Exact approaches might lead to overfitting and a large number of patterns
– Open world assumption
¨ Notion of support and confidence for statistically robust measures
![Page 35: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/35.jpg)
Learning Conditional Functional Dependencies (CFD)
¨ Given an embedded FD, learn the pattern tableaux
¨ Learn CFD from scratch
– Learn FD and also the pattern
¨ Learnt CFD should have enough support and confidence
![Page 36: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/36.jpg)
Learning Pattern Tableaux [GKK+08]
¨ Generate the smallest size tableaux with given global support and global confidence
– NP-Complete
– Hard to approximate
¨ Generate the smallest size tableaux with given global support and local confidence
– NP-Complete
– APX-hard
– […] in tableaux size
![Page 37: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/37.jpg)
Efficiency VS Accuracy [GKK+08]
¨ Trade-off running time with accuracy of solution
– Learning pattern tableaux given embedded FD X → Y: […] in tableaux size
• Consider all instantiations of X
• Prune based on local confidence
• Now apply PARTIAL GREEDY COVERAGE until the desired support is reached
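The three steps above can be sketched as follows. This is a minimal illustration, not the algorithm of [GKK+08] as published: the `SALES` rows are hypothetical, confidence is assumed to be the largest fraction of covered tuples that can be kept so the FD holds, and ties in the greedy step are broken arbitrarily but deterministically.

```python
from collections import Counter, defaultdict
from itertools import product

WILDCARD = "-"

# Hypothetical toy instance: (name, type, country, price, tax).
SALES = [
    ("Harry Potter", "book", "UK", 10, 0),
    ("Harry Potter", "book", "UK", 10, 0),
    ("Harry Potter", "book", "UK", 12, 0),
    ("Spiderman", "dvd", "UK", 20, 3),
    ("Harry Potter", "book", "France", 10, 0),
    ("Harry Potter", "book", "France", 10, 1),
    ("Terminator", "dvd", "USA", 25, 2),
    ("Terminator", "dvd", "USA", 25, 2),
]

def matches(lhs, pattern):
    return all(p == WILDCARD or p == v for p, v in zip(pattern, lhs))

def confidence(rows, n_lhs):
    """Fraction of rows that can be kept so that LHS -> RHS holds."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[:n_lhs]][row[n_lhs:]] += 1
    return sum(max(c.values()) for c in groups.values()) / len(rows)

def candidate_patterns(relation, n_lhs):
    """Step 1: every LHS cell is either the tuple's constant or a wildcard."""
    return {
        tuple(v if keep else WILDCARD for v, keep in zip(row[:n_lhs], mask))
        for row in relation
        for mask in product((True, False), repeat=n_lhs)
    }

def greedy_tableau(relation, n_lhs, min_support, min_confidence):
    # Step 2: keep only patterns whose local confidence is high enough.
    cover = {}
    for p in candidate_patterns(relation, n_lhs):
        idx = {i for i, r in enumerate(relation) if matches(r[:n_lhs], p)}
        if confidence([relation[i] for i in idx], n_lhs) >= min_confidence:
            cover[p] = idx
    # Step 3: greedy partial cover until the desired support is reached.
    tableau, covered = [], set()
    while len(covered) < min_support * len(relation):
        best = max(cover, key=lambda p: (len(cover[p] - covered), p),
                   default=None)
        if best is None or not cover[best] - covered:
            break  # the requested support cannot be reached
        tableau.append(best)
        covered |= cover[best]
    return tableau, len(covered) / len(relation)

print(greedy_tableau(SALES, 3, min_support=0.3, min_confidence=0.9))
```

Enumerating every candidate up front is exactly the cost that later slides avoid by generating the search space incrementally.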
![Page 38: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/38.jpg)
An Instance of Sales Relation
[name, type, country] → [price, tax]
Consider pattern [-, -, UK || -, -]
[Figure labels: SET (the pattern), ELEMENTS (the tuples it covers)]
![Page 39: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/39.jpg)
Efficiency VS Accuracy [GKK+08]
¨ Trade-off running time with accuracy of solution
– Learning pattern tableaux given embedded FD X → Y: […] in tableaux size
• Consider all instantiations of X
• Prune based on local confidence
• Now apply PARTIAL GREEDY COVERAGE until the desired support is reached
All Instantiations of X
¨ X=(A, B, C), A={a}, B={b}, C={c}
¨ All instantiations of X: {-, -, -}, {a, -, -}, {-, b, -}, {-, -, c}, {a, b, -}, {a, -, c}, {-, b, c}, {a, b, c}
¨ If |X|=K then the number of patterns is 2^K
![Page 40: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/40.jpg)
Efficiency VS Accuracy [GKK+08]
All Instantiations of X
¨ If |X|=K then the number of patterns is 2^K
¨ Too many sets to consider in each iteration
![Page 41: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/41.jpg)
Efficiency VS Accuracy [GKK+08]
¨ Incremental generation of search space
{-,-,-}
{a,-,-} {-,b,-} {-,-,c}
{a,b,-} {a,-,c} {-,b,c}
{a,b,c}
Do not Instantiate the entire search space of X
![Page 42: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/42.jpg)
Efficiency VS Accuracy [GKK+08]
¨ Incremental generation of search space
{-,-,-}
{a,-,-} {-,b,-} {-,-,c}
{a,b,-} {a,-,c} {-,b,c}
{a,b,c}
Start from here, if local confidence is not met then explore its children which are not already pruned
Do not Instantiate the entire search space of X
![Page 43: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/43.jpg)
Efficiency VS Accuracy
¨ Incremental generation of search space
{-,-,-}
{a,-,-} {-,b,-} {-,-,c}
{a,b,-} {a,-,c} {-,b,c}
{a,b,c}
Start from here, if local confidence is not met then explore its children which are not already pruned
Do not Instantiate the entire search space of X
If local confidence is met then remove the entire sub-lattice incident on it
![Page 44: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/44.jpg)
Efficiency VS Accuracy
¨ Incremental generation of search space
{-,-,-}
{a,-,-} {-,b,-} {-,-,c}
{a,b,-} {a,-,c} {-,b,c}
{a,b,c}
Start from here, if local confidence is not met then explore its children which are not already pruned
Do not Instantiate the entire search space of X
If local confidence is met then remove the entire sub-lattice incident on it
¨ Same search space exploration as PARTIAL GREEDY SET COVER
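The top-down exploration described above might look like the sketch below. It is an illustration under assumptions, not the published procedure: the `SALES` rows are hypothetical, children are formed by replacing one wildcard with a constant seen in the matching tuples, and confidence is the largest keepable fraction of matching tuples.

```python
from collections import Counter, defaultdict, deque

WILDCARD = "-"

# Hypothetical toy instance: (name, type, country, price, tax).
SALES = [
    ("Harry Potter", "book", "UK", 10, 0),
    ("Harry Potter", "book", "UK", 10, 0),
    ("Harry Potter", "book", "UK", 12, 0),
    ("Spiderman", "dvd", "UK", 20, 3),
    ("Harry Potter", "book", "France", 10, 0),
    ("Harry Potter", "book", "France", 10, 1),
    ("Terminator", "dvd", "USA", 25, 2),
    ("Terminator", "dvd", "USA", 25, 2),
]

def matches(lhs, pattern):
    """Also answers 'does pattern generalize lhs' when lhs is a pattern."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, lhs))

def confidence(rows, n_lhs):
    if not rows:
        return 1.0
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[:n_lhs]][row[n_lhs:]] += 1
    return sum(max(c.values()) for c in groups.values()) / len(rows)

def explore(relation, n_lhs, min_confidence):
    """Breadth-first walk from the all-wildcard pattern with pruning."""
    root = (WILDCARD,) * n_lhs
    queue, seen, accepted = deque([root]), {root}, []
    while queue:
        pattern = queue.popleft()
        # Skip anything inside the sub-lattice of an accepted pattern.
        if any(matches(pattern, a) for a in accepted):
            continue
        rows = [r for r in relation if matches(r[:n_lhs], pattern)]
        if confidence(rows, n_lhs) >= min_confidence:
            accepted.append(pattern)   # confidence met: prune below it
            continue
        for i, cell in enumerate(pattern):  # confidence not met: go down
            if cell != WILDCARD:
                continue
            for value in sorted({r[i] for r in rows}):
                child = pattern[:i] + (value,) + pattern[i + 1:]
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return accepted

print(explore(SALES, 3, min_confidence=0.9))
```

Only the patterns actually visited are ever materialized, so the full 2^K space per tuple is never instantiated.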
![Page 45: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/45.jpg)
Streaming Validation of CFD [CGK+09]
¨ Massive amount of data arrives online
¨ Learn CFD from sampled data, validate against voluminous data
– Data does not fit in memory
– Create concise summary of data (fast)
![Page 46: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/46.jpg)
Streaming Validation of CFD [CGK+09]
¨ Simple summaries do not work
– Uniform sampling
– Uniform group sampling
[Figure: example CFD pattern tableau; Confidence=0.75 and Confidence=1]
![Page 47: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/47.jpg)
Streaming Validation of CFD [CGK+09]
¨ Simple summaries do not work
– Uniform sampling
– Uniform group sampling
[Figure: example CFD pattern tableau; Confidence=0.625 and Confidence=1]
![Page 48: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/48.jpg)
Streaming Validation of CFD [CGK+09]
¨ Given a relation R and an embedded FD X → Y, create a synopsis of the data so that, given any arbitrary CFD, we can return an estimate of its confidence such that […]
Approximation for Efficiency
![Page 49: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/49.jpg)
Streaming Validation of CFD [CGK+09]
¨ Two Pass Algorithm
– Sample (reservoir sampling) O() rows uniformly
– For each sampled row that satisfies CFD on X
• Sample (reservoir sampling) from its support O() rows and estimate confidence
• Alternate: maintain heavy hitters with space O()
– Return average confidence
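Both passes rely on reservoir sampling, which draws a uniform sample of fixed size from a stream of unknown length in one scan. The sketch below is the standard textbook version (Algorithm R), shown only to illustrate the building block; the function name and the `rng` parameter are choices made here, and the sample sizes used in [CGK+09] are not recoverable from this transcript.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly at random from an iterable."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k/(i+1)
    return sample

rows = (("row", n) for n in range(100_000))   # stands in for a data stream
print(len(reservoir_sample(rows, 5, random.Random(0))))
```

Memory use is proportional to `k` regardless of how many rows arrive, which is the property the streaming setting needs.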
![Page 50: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/50.jpg)
Streaming Validation of CFD [CGK+09]
¨ Converting to a Single Pass
– Main idea
• Classify groups based on exponentially decreasing support
• Keep summary for groups sampled at each level
![Page 51: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/51.jpg)
Streaming Validation of CFD [CGK+09]
¨ Converting to a Single Pass
Estimate support of the group:
Estimate confidence of the group
Overall Estimate=
![Page 52: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/52.jpg)
Learning CFD from Scratch [FGLX+09]
¨ Classification of CFD
– Constant CFD: patterns only contain constants
– Variable CFD: patterns may contain wildcard “-”
¨ Learning constant CFD is more efficient than variable CFD
¨ Variable CFD gives more concise pattern
![Page 53: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/53.jpg)
Learning CFD from the scratch [FGLX+09]
¨ What kind of CFD do we want to learn ?– Minimal CFD:
Constant minimal CFD :
Variable minimal CFD : or,
Frequent CFD: must have support over a threshold
![Page 54: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/54.jpg)
Learning CFD from the Scratch [FGLX+09]
¨ A useful definition– Free Item set:
– Closed Item set
![Page 55: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/55.jpg)
Learning CFD from the Scratch [FGLX+09]
¨ A useful definition– Free Item set:
– Closed Item set
1. Clearly if is a minimal CFD then is free and has the same support, so contained in close
2. Also there should not exist any free with the property that and
![Page 56: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/56.jpg)
Learning CFD from the Scratch [FGLX+09]
¨ CFD Miner
• Suppose we have all k-frequent closed sets and their corresponding k-frequent free sets at our disposal (GCGROWTH)
• [Property 1: only possible consequent]
• If there exists a = [Property 2]
• Return for each the CFD
![Page 57: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/57.jpg)
Learning CFD from Scratch [FGLX+09]
Variable CFD
• CTANE
– Extension of TANE for FD
– Level-wise algorithm explores the attribute-set/pattern lattice
• FASTCFD
– Extension of FASTFD for FD
– Depth first search approach
![Page 58: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/58.jpg)
Some Other Dependencies
¨ Inclusion
¨ Matching
¨ Sequential
¨ Conservation
¨ Denial
![Page 59: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/59.jpg)
Inclusion Dependency
¨ Example: every manager is an employee
¨ Extension by condition and approximation
– Example: most persons in English DBpedia born in the 19th century and dying in USA are also in German DBpedia
¨ Learning CIND given IND
![Page 60: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/60.jpg)
Matching Dependency
• Generalization of entity resolution
• If two tuples show similarities in values in certain attributes, then a given attribute value of these tuples must be matched (made same)
• Example: if name and phone numbers are sufficiently similar, make their addresses identical
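The name/phone/address example can be sketched as below. Everything concrete here is an assumption made for illustration: the records are invented, similarity is measured with `difflib.SequenceMatcher` at a threshold of 0.8, and the repair simply copies the address of the earlier record, whereas a real system would choose the target value by some cost or trust model.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """Case-insensitive string similarity test."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def apply_matching_dependency(records, threshold=0.8):
    """If name and phone are similar, make the addresses identical."""
    repaired = [dict(r) for r in records]       # do not mutate the input
    for i in range(len(repaired)):
        for j in range(i + 1, len(repaired)):
            a, b = repaired[i], repaired[j]
            if (similar(a["name"], b["name"], threshold)
                    and similar(a["phone"], b["phone"], threshold)):
                b["address"] = a["address"]     # hypothetical repair policy
    return repaired

# Hypothetical records.
people = [
    {"name": "John Smith", "phone": "555-1234", "address": "10 Oak St"},
    {"name": "Jon Smith", "phone": "555-1234", "address": "10 Oak Street"},
    {"name": "Mary Jones", "phone": "555-9999", "address": "5 Elm St"},
]
for row in apply_matching_dependency(people):
    print(row)
```

The pairwise loop is quadratic; it is kept simple here because the point is the rule, not the blocking strategy.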
![Page 61: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/61.jpg)
Sequential Dependency
• Useful to express relationships between ordered attributes
• […]: difference between Y-attribute values of any two consecutive records when sorted on X must be in […]
• Can identify missing data (gaps too large), extraneous data (gaps too small), out-of-order data
• Extension: approximate, conditional
• Creating pattern tableaux efficiently
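A sequential dependency check reduces to sorting on X and testing each consecutive gap in Y against an allowed interval. The sketch below assumes a closed interval `[low, high]` and uses hypothetical polling records; the field names and bounds are illustrative only.

```python
def sd_violations(records, x, y, low, high):
    """Return (previous, current, gap) for consecutive records, sorted on x,
    whose difference in y falls outside [low, high]."""
    ordered = sorted(records, key=lambda r: r[x])
    violations = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[y] - prev[y]
        if not (low <= gap <= high):
            violations.append((prev, cur, gap))
    return violations

# Hypothetical polling data: one reading expected roughly every 10 seconds.
readings = [
    {"seq": 1, "time": 0},
    {"seq": 2, "time": 10},
    {"seq": 3, "time": 45},   # gap too large: readings are probably missing
    {"seq": 4, "time": 50},   # gap too small: possibly an extraneous reading
]
for prev, cur, gap in sd_violations(readings, "seq", "time", 9, 11):
    print(prev["seq"], "->", cur["seq"], "gap", gap)
```

A negative gap would flag out-of-order data in the same pass.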
![Page 62: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/62.jpg)
Conservation Dependency
• Useful to express relationships between two or multiple time series
• Example: total inflow over time must match total outflow over time
• Extension: approximate, conditional
• Creating pattern tableaux efficiently
![Page 63: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/63.jpg)
Learning Pattern Tableaux Efficiently [GKK+12]
• Conservation Dependency: Quick Flavor
• Extension with condition and approximation
Total inflow over time must match total outflow over time
![Page 64: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/64.jpg)
Conservation Dependency: Defining the measure [GKK+12]
¨ Confidence of an interval =
¨ Ignores duration of violation
[Figure: incoming (IN) and outgoing (OUT) traffic at a router over time points a1-a5 and b1-b5, values 10, 8, 6, 4, 6 (total 34)]
![Page 66: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/66.jpg)
Conservation Dependency: Defining the measure [GKK+12]
¨ Rightward matching between IN and OUT: travel minimally to the right to get matched
¨ A special case of EARTH MOVER DISTANCE
[Figure: incoming (IN) and outgoing (OUT) traffic at a router over time points a1-a5 and b1-b5, values 10, 8, 6, 4, 6 (total 34); one example has Confidence=1, the other Confidence=0]
![Page 67: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/67.jpg)
Conservation Dependency: Defining the measure [GKK+12]
¨ Confidence = 1 − EMD / (maximum possible EMD)
[Figure: incoming (IN) and outgoing (OUT) traffic at a router, values 10, 8, 6, 4, 6. Example with Confidence=1: EMD=0, maximum EMD possible=114. Example with Confidence=0: EMD=114, maximum EMD possible=114]
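The numbers on this slide (EMD = 0 with confidence 1, EMD = 114 with confidence 0, maximum EMD 114) are consistent with confidence = 1 − EMD / maximum EMD, so the sketch below assumes that form. It also assumes rightward matching in which each unit of inflow pays one unit of cost per time step it waits, with unmatched inflow charged as if it left just after the last time point; under that assumption the EMD is the area between the cumulative IN and OUT curves.

```python
from itertools import accumulate

def rightward_emd(inflow, outflow):
    """Area between the cumulative inflow and outflow curves."""
    cum_in = list(accumulate(inflow))
    cum_out = list(accumulate(outflow))
    if any(o > i for i, o in zip(cum_in, cum_out)):
        raise ValueError("outflow leads inflow; rightward matching impossible")
    return sum(i - o for i, o in zip(cum_in, cum_out))

def conservation_confidence(inflow, outflow):
    """1 - EMD / max EMD, where max EMD is the area under cumulative inflow."""
    max_emd = sum(accumulate(inflow))
    if max_emd == 0:
        return 1.0
    return 1 - rightward_emd(inflow, outflow) / max_emd

IN = [10, 8, 6, 4, 6]                             # values shown on the slide
print(conservation_confidence(IN, IN))             # outflow equals inflow
print(conservation_confidence(IN, [0] * 5))        # nothing ever flows out
print(conservation_confidence(IN, [0, 10, 8, 6, 4]))  # one-step delay
```

With no outflow at all the cumulative inflow 10, 18, 24, 28, 34 sums to 114, matching the maximum EMD on the slide.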
![Page 68: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/68.jpg)
Conservation Dependency: Defining the measure [GKK+12]
¨ How do we find all maximal intervals with high confidence efficiently?
![Page 69: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/69.jpg)
Conservation Dependency [GKK+12]
• Key Idea
– Look at the cumulative curves (comes from EMD)
– Consider only a subset of intervals (for efficiency)
– Generate these subsets going backward from the n-th data point (to ensure guaranteed approximation factor in near-linear time)
![Page 70: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/70.jpg)
Conservation Dependency
• Key Idea
– Look at the cumulative curves (comes from EMD)
– Consider only a subset of intervals (for efficiency)
– Generate these subsets going backward from the n-th data point (to ensure guaranteed approximation factor in near-linear time)
Efficiency VS Accuracy
![Page 71: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/71.jpg)
Denial Constraints
• Universally quantified first order logic
• Much more expressive than FD and CFD
• Examples:
– A) if two persons live in the same state, then the one earning a lower salary has a lower tax rate
– B) it is not possible to have a single tax exemption greater than salary
• Useful for data repairing, discovery of denial constraints (with two attributes)
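Both examples can be checked directly by stating what must never happen. In the sketch below, example A is read as "no pair in the same state where one person has a strictly lower salary and a strictly higher tax rate", and example B as "no person whose exemption exceeds their salary"; the field names and rows are hypothetical.

```python
from itertools import permutations

def pair_violations(people):
    """Example A: same state, lower salary, but higher tax rate."""
    return [
        (p["name"], q["name"])
        for p, q in permutations(people, 2)
        if p["state"] == q["state"]
        and p["salary"] < q["salary"]
        and p["tax_rate"] > q["tax_rate"]
    ]

def single_violations(people):
    """Example B: a tax exemption greater than the salary."""
    return [p["name"] for p in people if p["exemption"] > p["salary"]]

# Hypothetical tax records.
people = [
    {"name": "Ann", "state": "NJ", "salary": 50000, "tax_rate": 0.20, "exemption": 1000},
    {"name": "Bob", "state": "NJ", "salary": 80000, "tax_rate": 0.15, "exemption": 2000},
    {"name": "Cid", "state": "NY", "salary": 30000, "tax_rate": 0.10, "exemption": 40000},
]
print(pair_violations(people))
print(single_violations(people))
```

Example A needs two tuples and order comparisons, which is what puts it beyond plain FDs and CFDs.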
![Page 72: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/72.jpg)
Semi-structured Data
• Flexible representation
• Easy customization
• Error-prone
• Vast majority of XML documents on the Web do not have an accompanying DTD or XSD schema description
![Page 73: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/73.jpg)
Learning DTD/XSD from XML corpus
• A good inference algorithm should satisfy
1. Specialization: must minimally cover the given XML documents
2. Generalization: cover all documents valid according to the “unknown” target schema but may not be present in the sample
![Page 74: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/74.jpg)
Learning Document Type Definitions (DTDs)
DTD: Context free grammar with regular expression (RE) on the RHS.
For every element name, infer the RE describing all the strings that appear below that element name in the XML corpus
A seminal result by Gold: the class of all REs cannot be learned from positive examples only.
Which subset of REs can be learnt efficiently?
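The per-element strings mentioned above (the sequences of child names that appear below each element name) can be collected with the standard library before any RE inference starts. This is a minimal sketch; the function name `child_strings` and the tiny corpus with `order`, `item`, `id` and `price` elements are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def child_strings(documents):
    """Map each element name to the set of child-name sequences seen below it."""
    samples = defaultdict(set)
    for text in documents:
        root = ET.fromstring(text)
        for elem in root.iter():  # visits the root and all descendants
            samples[elem.tag].add(tuple(child.tag for child in elem))
    return samples

# Hypothetical two-document corpus.
corpus = [
    "<order><item><id/><price/></item><item><id/><price/></item></order>",
    "<order><item><id/><price/></item></order>",
]
samples = child_strings(corpus)
```

Each set in `samples` is the positive sample from which an RE for that element name would be inferred.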
![Page 75: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/75.jpg)
Learning Document Type Definitions (DTDs) [BNST+06]
Which subset of REs can be learnt efficiently?
Class of SINGLE OCCURRENCE REs (SORE): every element name can appear only once. Example: is a SORE but is not
Class of CHAIN REGULAR EXPRESSIONS (CHARES): subset of SORE, a chain of factors. Example:
Experimentally performs better for generalization
![Page 76: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/76.jpg)
Learning SORE [BNST+06]
¨ SORE is 2-testable
A language L is 2-testable when there is a set of start element names I, a set of final element names F, and a set of 2-grams T such that w ∈ L iff the first symbol of w belongs to I, the last symbol of w belongs to F, and every 2-gram of w is in T
¨ Example: [figure: automaton over the element names a, b and c]
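The 2-testable definition suggests a direct way to build a model from positive examples: record which symbols start a string, which end one, and which adjacent pairs occur. A minimal sketch follows, assuming each string is a Python `str` whose characters are element names. The function names and the sample strings are hypothetical, and the empty string is ignored for simplicity.

```python
def learn_2testable(strings):
    """Collect start symbols, final symbols, and 2-grams from positive examples."""
    starts, finals, bigrams = set(), set(), set()
    for w in strings:
        if not w:
            continue  # this sketch ignores the empty string
        starts.add(w[0])
        finals.add(w[-1])
        bigrams.update(zip(w, w[1:]))
    return starts, finals, bigrams

def accepts(model, w):
    """Membership test that follows the definition of a 2-testable language."""
    starts, finals, bigrams = model
    if not w:
        return False
    return (w[0] in starts and w[-1] in finals
            and all(pair in bigrams for pair in zip(w, w[1:])))

model = learn_2testable(["abc", "bac", "aabc"])
```

The learned model accepts `"abac"` even though it was never in the sample, which is the generalization behaviour the definition implies.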
![Page 77: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/77.jpg)
Learning SORE [BNST+06]
¨ Given a set of strings, extract the set I of all initial symbols, the set F of all final symbols and the set T of all 2-grams. Create the automaton.
¨ Convert the automaton to an RE by rewriting
Rewrite Rules
DISJUNCTION: a set of nodes that all have the same predecessor and successor sets, and that either
(i.) have no edges among themselves: combine the nodes into a single node, or
(ii.) have all the edges among themselves: combine the nodes into a single node and add a self-loop
Example: nodes a and b are combined into the single node (a+b)
![Page 78: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/78.jpg)
Learning SORE [BNST+06]
Rewrite Rules
SELF-LOOP: delete the self-loop on r and relabel r as r+
Example: the node (a+b) with a self-loop becomes (a+b)+
![Page 79: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/79.jpg)
Learning SORE [BNST+06]
Rewrite Rules
CONCATENATION: concatenate into a single node
Example: the node (a+b)+ followed by the node c becomes (a+b)+c
![Page 80: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/80.jpg)
Learning SORE [BNST+06]
Rewrite Rules
OPTIONAL: applies when all successors of r are also successors of the predecessors of r
Relabel r as r? and remove all edges from r’s predecessors to r’s successors
![Page 81: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/81.jpg)
Learning SORE [BNST+06]
If the underlying DTD is indeed SORE, the algorithm learns it
![Page 82: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/82.jpg)
Learning XSD from XML corpus
¨ Content model of an element depends on context
– Items in an order contain id and price
– Items in a stock contain id, quantity in stock and, depending on whether the item is atomic or composed, a list of sub-items
– A DTD does not distinguish between order items and stock items
¨ Single occurrence XSD only contains single occurrence regular expressions
![Page 83: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/83.jpg)
Outline
¨ Introduction
¨ Discovering data quality semantics
¨ Repairing inconsistencies
¨ Open problems + Q/A
![Page 84: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/84.jpg)
Repair Techniques
¨ Glitch repairs by value modification, for FDs + InDs [BFF+05]
– Introduced the idea of cell equivalence classes
¨ Glitch + model repairs, for FDs [CM11]
– Introduced the idea of model repairs
¨ Glitch repairs, for EGDs [GMP+13]
– Introduced chase-based technique to repair many constraints
![Page 86: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/86.jpg)
Repairs Using Value Modification [BFF+05]
¨ Problem: Given a database D, FD and InD constraints C, such that (D, C) is inconsistent, find repair D’ of D with minimum cost(D’)
¨ Result: The problem is NP-hard even for only FDs or only InDs
¨ Key ideas:
– Focus on value modifications of FD RHS attributes
– Cost model for repairs is based on value accuracy, repair similarity
– Equivalence classes of cells with identical values in the repair permit a delayed assignment of a value to an equivalence class
![Page 87: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/87.jpg)
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
![Page 89: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/89.jpg)
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
![Page 90: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/90.jpg)
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
X
![Page 91: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/91.jpg)
Repairs Using Value Modification [BFF+05]
¨ Repair alternatives when records ti and tj violate FD: X → Y
¨ Value modification of LHS attributes X
– Modify tj[X] to a value different from ti[X]
– Unclear what (different) value should be assigned to tj[X]
¨ Value modification of RHS attributes Y
– Modify tj[Y] to equal ti[Y] or vice versa
– Use cost of repair to choose between alternatives
– FD violations can always be repaired by modifying RHS attributes Y
– Naïve approach can lead to non-termination
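The FD violation and its RHS repair can be traced on the CUSTOMER rows from the running example. The sketch below is an illustration only: the helper `fd_violations` and the dictionary layout are assumptions, the Wt column is omitted, and the repair simply copies t1's RHS values into t4 rather than choosing by cost.

```python
from collections import defaultdict

COLUMNS = ("tid", "Tel", "Name", "Street", "City", "State", "Zip")
CUSTOMER = [dict(zip(COLUMNS, r)) for r in [
    ("t1", "949-1212", "Alice Smith", "17 Bridge", "Midville", "AZ", "05211"),
    ("t2", "555-8145", "Bob Jones", "5 Valley", "Centre", "NY", "10012"),
    ("t3", "555-8195", "Bob Jones", "5 Valley", "Centre", "NJ", "10012"),
    ("t4", "949-1212", "Ali Smith", "27 Bridge", "Midville", "AZ", "05211"),
]]
RHS = ["Name", "Street", "City", "State", "Zip"]

def fd_violations(rows, lhs, rhs):
    """Return pairs of tuple ids that agree on lhs but differ on rhs."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    bad = []
    for members in groups.values():
        for i, r1 in enumerate(members):
            for r2 in members[i + 1:]:
                if any(r1[a] != r2[a] for a in rhs):
                    bad.append((r1["tid"], r2["tid"]))
    return bad

before = fd_violations(CUSTOMER, ["Tel"], RHS)

# RHS repair: make t4 agree with t1 on every RHS attribute (work on copies).
by_id = {r["tid"]: dict(r) for r in CUSTOMER}
for attr in RHS:
    by_id["t4"][attr] = by_id["t1"][attr]
after = fd_violations(list(by_id.values()), ["Tel"], RHS)
```

Copying in the other direction (t4 into t1) would also satisfy the FD, which is why a cost model is needed to choose between the alternatives.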
![Page 92: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/92.jpg)
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
![Page 93: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/93.jpg)
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
![Page 94: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/94.jpg)
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
X
![Page 95: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/95.jpg)
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
?
![Page 96: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/96.jpg)
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
X
![Page 97: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/97.jpg)
Repairs Using Value Modification [BFF+05]
¨ Repair alternatives when record ti violates InD: Ri[X] → Rj[Y]
¨ Value modification of ti[X]
– Modify ti[X] to a value tj[Y] for some tj in Rj
¨ Value modification of tj[Y]
– Modify tj[Y] for some tj in Rj to equal ti[X]
¨ Use cost of repair to choose between alternatives
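The InD violation in the running example arises once t3's Tel has been changed to 555-8145, because t6 in EQUIP still carries 555-8195. A minimal sketch of the check and of the first repair alternative follows; the helper name `ind_violations` is an assumption, and only the Tel column is modelled.

```python
def ind_violations(child_rows, child_attr, parent_rows, parent_attr):
    """Return ids of child tuples whose value has no match in the parent relation."""
    parent_values = {row[parent_attr] for row in parent_rows}
    return [row["tid"] for row in child_rows if row[child_attr] not in parent_values]

# CUSTOMER after the FD repairs (t3's Tel already changed), Tel column only.
customer = [
    {"tid": "t1", "Tel": "949-1212"},
    {"tid": "t2", "Tel": "555-8145"},
    {"tid": "t3", "Tel": "555-8145"},
    {"tid": "t4", "Tel": "949-1212"},
]
equip = [
    {"tid": "t5", "Tel": "555-8145"},
    {"tid": "t6", "Tel": "555-8195"},
]

dangling = ind_violations(equip, "Tel", customer, "Tel")

# Alternative 1: modify ti[X], here t6's Tel, to a value that exists in CUSTOMER.
repaired_equip = [dict(r) for r in equip]
repaired_equip[1]["Tel"] = "555-8145"
remaining = ind_violations(repaired_equip, "Tel", customer, "Tel")
```

The second alternative would instead change some CUSTOMER tuple's Tel to 555-8195, which in this example would reintroduce FD violations.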
![Page 98: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/98.jpg)
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8145 L55011 LU ze400 Mar-03 1
![Page 99: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/99.jpg)
Repairs Using Value Modification [BFF+05]
¨ Greedily build equivalence classes of cells
– {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)}
– {(t1, Name), (t4, Name)}
– …
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
![Page 100: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/100.jpg)
Repairs Using Value Modification [BFF+05]
¨ Greedily build equivalence classes of cells, assign unique value
– {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)} → 555-8145
– {(t1, Name), (t4, Name)} → Alice Smith
– …
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
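Cell equivalence classes map naturally onto a union-find structure: cells are merged first, and a value is assigned to each class only at the end. The sketch below is a simplification under stated assumptions. The merges are listed by hand to mirror the two classes shown above, and the value is picked by a weighted majority vote over the Wt column, which stands in for the cost model of [BFF+05] and is not that model.

```python
from collections import defaultdict

class CellClasses:
    """Union-find over cells, where a cell is a (tuple id, attribute) pair."""
    def __init__(self):
        self.parent = {}

    def find(self, cell):
        self.parent.setdefault(cell, cell)
        while self.parent[cell] != cell:
            self.parent[cell] = self.parent[self.parent[cell]]  # path halving
            cell = self.parent[cell]
        return cell

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def assign_values(classes, cells):
    """cells maps (tid, attr) -> (value, weight); pick the heaviest value per class."""
    votes = defaultdict(lambda: defaultdict(int))
    for cell, (value, weight) in cells.items():
        votes[classes.find(cell)][value] += weight
    chosen = {}
    for cell in cells:
        tally = votes[classes.find(cell)]
        chosen[cell] = max(tally, key=tally.get)
    return chosen

# Current values and tuple weights (Wt) taken from the tables above.
cells = {
    ("t2", "Tel"): ("555-8145", 2), ("t3", "Tel"): ("555-8195", 1),
    ("t5", "Tel"): ("555-8145", 2), ("t6", "Tel"): ("555-8195", 1),
    ("t1", "Name"): ("Alice Smith", 2), ("t4", "Name"): ("Ali Smith", 1),
}
classes = CellClasses()
for a, b in [(("t2", "Tel"), ("t3", "Tel")), (("t2", "Tel"), ("t5", "Tel")),
             (("t3", "Tel"), ("t6", "Tel")), (("t1", "Name"), ("t4", "Name"))]:
    classes.union(a, b)
chosen = assign_values(classes, cells)
```

Deferring the assignment until all merges are done avoids repeatedly overwriting a cell as new constraints pull it into larger classes.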
![Page 101: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/101.jpg)
Repair Techniques
¨ Glitch repairs by value modification, for FDs + InDs [BFF+05]
– Introduced the idea of cell equivalence classes
¨ Glitch + model repairs, for FDs [CM11]
– Introduced the idea of model repairs
¨ Glitch repairs, for EGDs [GMP+13]
– Introduced chase-based technique to repair many constraints
![Page 102: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/102.jpg)
Repairing Data and Constraints [CM11]
¨ Motivation: evolution of data semantics
¨ Problem: Given a database D, FD constraints C, such that (D, C) is inconsistent, find repair (D’, C’) with minimum cost
¨ Key ideas:
– Allow value modifications of FD RHS or LHS attributes
– Allow modifications of FDs in C by augmenting the LHS
– Cost model for repairs is based on minimum description length
![Page 103: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/103.jpg)
Repairing Data and Constraints [CM11]
¨ FD: [District, Region] → [AC, City, State]
Tid District Region Municipal AC Tel Street Zip City State
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
![Page 104: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/104.jpg)
Repairing Data and Constraints [CM11]
¨ FD: [District, Region] → [AC, City, State]
– Expensive repair using only value modifications
Tid District Region Municipal AC Tel Street Zip City State
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
![Page 105: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/105.jpg)
Repairing Data and Constraints [CM11]
¨ Repair alternatives when records ti and tj violate FD: X → Y
¨ Value modification of RHS attributes Y
¨ Value modification of LHS attributes X
– Modify tj[X] to a value different from ti[X], supported by the data
¨ Repair constraints by augmenting LHS (X) with a new attribute
– New attribute provides additional context
¨ Choose from alternatives using MDL-based cost model
![Page 106: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/106.jpg)
MDL-Based Cost Model [CM11]
¨ Quantifies the trade-off of a data repair versus a constraint repair
¨ Cost model based on three properties
– Accuracy: value modifications must minimize distance
– Redundancy: value modifications must be well supported in the data, constraint repairs must result in a higher degree of consistency
– Conciseness: repaired constraints should explain, but not overfit
¨ Minimum description length (MDL) principle
– Length of model + length to encode data given the model
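The last bullet is the usual two-part MDL objective. Written out, with notation that is an assumption here (M for the model, that is the repaired constraints, D for the data, and L(·) for encoding length in bits), the quantity to minimize is:

```latex
\mathrm{DL}(M, D) = L(M) + L(D \mid M)
```

A more specific constraint tends to increase L(M) while reducing L(D | M), which is how the trade-off between constraint repairs and data repairs enters the cost.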
![Page 107: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/107.jpg)
Repairing Data and Constraints [CM11]
¨ Cheap repair of constraints and data
– FD: [District, Region, Municipal] → [AC, City, State]
– t3.State = NY
Tid District Region Municipal AC Tel Street Zip City State
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
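The effect of augmenting the LHS can be checked directly on the nine rows above. The following is a sketch under assumptions: the helpers `row` and `inconsistent_groups` are hypothetical, and only the attributes that the FD mentions are kept. It groups tuples by the LHS and reports the groups whose RHS values disagree.

```python
from collections import defaultdict

def row(tid, municipal, ac, city, state):
    # District and Region are identical for every tuple in the example.
    return {"Tid": tid, "District": "Brookside", "Region": "Granville",
            "Municipal": municipal, "AC": ac, "City": city, "State": state}

ROWS = [
    row("t1", "Glendale", "613", "NY", "NY"), row("t2", "Glendale", "613", "NY", "NY"),
    row("t3", "Glendale", "613", "NY", "MA"), row("t4", "Guild", "515", "Boston", "MA"),
    row("t5", "Guild", "515", "Boston", "MA"), row("t6", "Queen", "517", "Chicago", "IL"),
    row("t7", "Queen", "517", "Chicago", "IL"), row("t8", "Queen", "517", "Chicago", "IL"),
    row("t9", "Queen", "517", "Chicago", "IL"),
]
RHS = ["AC", "City", "State"]

def inconsistent_groups(rows, lhs, rhs):
    """Group rows by lhs; keep the groups that contain more than one distinct rhs value."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[a] for a in lhs)].append(r)
    return {key: [r["Tid"] for r in members]
            for key, members in groups.items()
            if len({tuple(r[a] for a in rhs) for r in members}) > 1}

original = inconsistent_groups(ROWS, ["District", "Region"], RHS)
augmented = inconsistent_groups(ROWS, ["District", "Region", "Municipal"], RHS)

# Data repair on top of the constraint repair: t3.State = NY (applied to copies).
fixed = [dict(r, State="NY") if r["Tid"] == "t3" else r for r in ROWS]
final = inconsistent_groups(fixed, ["District", "Region", "Municipal"], RHS)
```

Under the original FD all nine tuples fall into one inconsistent group, whereas after adding Municipal only the Glendale group remains and a single cell change resolves it.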
![Page 108: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/108.jpg)
EGD Based Cleaning Framework [GMP+13]
¨ Many possible repairing strategies to obtain preferred values
– Using “master” data, e.g., table Src
– Using confidence and distance
– Using freshness and currency
¨ Issue: interaction between dependencies
– Sensitivity to the order in which repairs are applied
![Page 109: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/109.jpg)
Validating XML
¨ Validate well-formedness first: strong validation
¨ Validate assuming well-formedness: validation
![Page 110: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/110.jpg)
Validating XML
¨ How to validate well-formedness in small space?
¨ What class of DTDs can be validated in small memory when the XML document streams in?
![Page 111: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/111.jpg)
Validating Well-formedness in streaming setting
¨ Streaming XML document
¨ Can we check if the document is well-formed in small space?
![Page 112: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/112.jpg)
Well-formedness of XML Documents
¨ Open and close tags of XML documents must be well-formed
<article> <title>
A Relational Model for Large Shared Data Banks <authors> </title> <author>
<name>E. F. Codd
</name></author> </article>
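For comparison with the small-space question, the textbook way to check that open and close tags nest properly uses a stack whose size grows with the nesting depth. The sketch below is that baseline, not a streaming small-space algorithm. The regular expression and the function name `first_tag_error` are assumptions, and comments, CDATA sections and processing instructions are ignored.

```python
import re

# Matches <name ...>, </name> and <name ... />; attributes are skipped, not parsed.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def first_tag_error(text):
    """Return None if open/close tags nest properly, else a short message."""
    stack = []
    for match in TAG.finditer(text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <name/> opens and closes at once
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return f"unexpected </{name}>"
    return f"unclosed <{stack[-1]}>" if stack else None

malformed = ("<article> <title> A Relational Model for Large Shared Data Banks "
             "<authors> </title> <author> <name>E. F. Codd </name></author> </article>")
well_formed = ("<article> <title> A Relational Model for Large Shared Data Banks "
               "</title> <author> <name>E. F. Codd </name></author> </article>")
```

On the example above the stray `<authors>` tag is detected when `</title>` arrives, because the innermost open tag at that point does not match.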
![Page 113: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/113.jpg)
Validating Well-formedness in streaming setting [MMN10]
¨ Streaming XML document
¨ Can we check if it is well-formed in small space?
¨ Grammar of well-formed parentheses of s types: DYCK(s)
¨ If we can validate for DYCK(2), we can also validate for DYCK(s) with a blow-up in space
![Page 114: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/114.jpg)
Validating Well-formedness in streaming setting [MMN10]
¨ Validating for
– Example: – – Matching pair: ,
![Page 116: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/116.jpg)
Validating Well-formedness in streaming setting [MMN10]
¨ Define two hash functions g, h for any subword v, where $p$ is a prime with $n^{1+c} \le p \le 2n^{1+c}$ and $\alpha, \beta$ are drawn uniformly at random from $[0, p-1]$
¨ If v is well-formed then g(v) = h(v) = 0; otherwise the probability that both are 0 is very low
![Page 117: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/117.jpg)
Validating Well-formedness in streaming setting [MMN10]
Algorithm (key idea)
¨ Read parentheses and reduce them to the form wW, where w consists of only down steps and W consists of only up steps
¨ If w is empty
– Construct hashes for W and compute its length: push (g(W), h(W), |W|) onto the stack
¨ Else
– Construct hashes for w, pop (g, h, l) from the stack, update g = g + g(w), h = h + h(w), l = l − 1 and push back onto the stack
– If l = 0 and g and h are not both 0: ERROR
– Construct hashes for W along with its length and push them onto the stack
![Page 118: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/118.jpg)
Repairing Malformedness Efficiently [KSSY13]
¨ Repairing based on edit distance
![Page 119: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/119.jpg)
Repairing Malformedness [KSSY13]
¨ In the streaming setting only very restricted errors can be repaired
¨ When there is sufficient memory to hold the entire XML document, near-linear time algorithms can be devised with guaranteed performance
¨ Extension to consider position of text
¨ Extension to return multiple edits using branch and bound
![Page 120: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/120.jpg)
Open Problems
¨ Many learning problems are based on lattice structure
– Exploit this structure better
– Example: CFD pattern tableaux learning uses partial greedy set cover. Can we design a careful algorithm that beats it in the approximation bound?
![Page 121: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/121.jpg)
Open Problems
¨ Streaming and distributed settings, both for learning and detection, are extremely important
– Very basic results so far
– Data placement and replication become very useful for distributed processing
![Page 122: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/122.jpg)
Open Problems
¨ Semistructured data
– What is the most general model that is tractable (validation + repair) in different computation models for XML?
– Learning distributions of types of errors
– Language edit distance problem
![Page 123: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649d835503460f94a69284/html5/thumbnails/123.jpg)
Open Problems
¨ Crowdsourcing
– Use the crowd to distinguish between data and error
– Extend crowd-based entity resolution techniques to handle matching dependencies
– Model errors made by the crowd themselves