barna saha, at&t research laboratory joint work with: lukasz golab, howard karloff, flip korn,...
![Page 1: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/1.jpg)
Discovering Conservation Rules
Barna Saha, AT&T Research Laboratory
Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava
![Page 2: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/2.jpg)
Discovering Conservation Rules
Data Quality
Data Cleaning
![Page 3: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/3.jpg)
IRS vs. Federal Mathematician
“You owe us $10,000 plus accrued interest in taxes for last year. You earned $36,000 last year but only had $1 withheld from your paycheck for Federal taxes.”
“How could I work the entire year and only have $1 withheld? I do not have time to waste on this foolishness. Goodbye!”
The Federal Government agency had only allocated enough storage on the computer to handle withholding amounts of $9999.99 or less. The amount withheld was $10001.00. The last $1 made the crucial difference.
![Page 4: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/4.jpg)
The Risk of Massive ID Fraud
In May 2004, Ryan Pirozzi of Edina, Minnesota opened his mail box and found more than a dozen bank statements inside.
None of the accounts were his!
Because of a data entry error made by a clerk at the processing center of Wachovia Corp, a large bank headquartered in the Southeastern USA, Pirozzi received the financial statements of 73 strangers over the course of 9 months. Their names, SSNs, and bank account numbers constitute an identity thief's dream!
![Page 5: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/5.jpg)
The Risk of Massive ID Fraud
Pirozzi began receiving completed 1099 tax forms belonging to many of these people.
Finally, one day in January 2005, a strange thing happened. Mr. Pirozzi went to his mail box and discovered an envelope from Wachovia that contained his completed 1099 tax form. That was the first piece of correspondence he received from the bank that actually belonged to him.
![Page 6: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/6.jpg)
Source of these stories
MARYLAND RESIDENTS BEWARE!
800 houses in Montgomery County, Maryland, were put on the auction block in 2005 due to mistakes in the tax payment data of Washington Mutual Mortgage.
FOR SALE!
![Page 7: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/7.jpg)
Data Quality Tools
Real-world data is often dirty:
• Inconsistent
• Inaccurate
• Incomplete
• Stale
Enterprises typically find data error rates of approximately 1%-5%; for some companies, the rate is above 30%.
Dirty data costs US businesses 600 billion dollars annually.
Data cleaning accounts for 30%-80% of development time and budget in most data warehouse projects.
Data Quality Tools:
• Detect and repair errors
• Differentiate between dirty and clean data
![Page 8: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/8.jpg)
A Systematic Approach to Improve Data Quality
Impose integrity constraints
• Semantic rules for data
Errors and inconsistencies in data emerge as violations of the constraints
![Page 9: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/9.jpg)
Integrity Constraints: Functional Dependency [Codd, 1972]
Functional Dependency: (Name, Type, State) → (Price, Vat)

| Name | Type | State | Price | Vat |
|---|---|---|---|---|
| CH1 | Clothing | Maryland | $50 | $3 |
| BK45 | Book | New Jersey | $120 | $15 |
| FN30 | Furniture | Washington | $100 | $0 |
| CH1 | Clothing | Maryland | $100 | $6 |
| BK66 | Book | Washington | $80 | $10 |
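
A functional dependency can be checked by grouping rows on the left-hand attributes and looking for groups with more than one right-hand value. The following is a minimal sketch on the table from this slide; the function name `fd_violations` and the dictionary row layout are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return {lhs_key: rows} for groups that agree on lhs but disagree on rhs."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    violations = {}
    for key, members in groups.items():
        # More than one distinct rhs value means the dependency is violated.
        if len({tuple(r[a] for a in rhs) for r in members}) > 1:
            violations[key] = members
    return violations

# Rows of the slide's table (dollar signs dropped).
rows = [
    {"Name": "CH1", "Type": "Clothing", "State": "Maryland", "Price": 50, "Vat": 3},
    {"Name": "BK45", "Type": "Book", "State": "New Jersey", "Price": 120, "Vat": 15},
    {"Name": "FN30", "Type": "Furniture", "State": "Washington", "Price": 100, "Vat": 0},
    {"Name": "CH1", "Type": "Clothing", "State": "Maryland", "Price": 100, "Vat": 6},
    {"Name": "BK66", "Type": "Book", "State": "Washington", "Price": 80, "Vat": 10},
]

bad = fd_violations(rows, ["Name", "Type", "State"], ["Price", "Vat"])
for key, members in bad.items():
    print(key, [(r["Price"], r["Vat"]) for r in members])
```

On this table the only violating group is the pair of CH1 rows, which share (Name, Type, State) but differ in (Price, Vat).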
![Page 10: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/10.jpg)
New Integrity Constraints for Data Quality
Functional Dependency: (Name, Type, State) → (Price, Vat)

| Name | Type | State | Price | Vat |
|---|---|---|---|---|
| CH1 | Clothing | Maryland | $50 | $3 |
| BK45 | Book | New Jersey | $120 | $15 |
| FN30 | Furniture | Washington | $100 | $0 |
| CH1 | Clothing | Maryland | $100 | $6 |
| BK66 | Book | Washington | $80 | $10 |

Conditional Functional Dependency
1. If (Type=Book) then the above FD holds.
2. If (State=Washington) then (Vat=0)
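
The two conditional rules on this slide can be tested with a small filter. This is a hedged sketch: `constant_rule_violations` and `conditional_fd_violations` are hypothetical helper names, and the rows repeat the slide's table with dollar signs dropped.

```python
def matches(row, pattern):
    """True if the row carries every attribute value named in the pattern."""
    return all(row[k] == v for k, v in pattern.items())

def constant_rule_violations(rows, condition, consequence):
    """Rows that satisfy the condition but not the required constant values."""
    return [r for r in rows if matches(r, condition) and not matches(r, consequence)]

def conditional_fd_violations(rows, condition, lhs, rhs):
    """Rows, among those matching the condition, that contradict an earlier
    row with the same lhs values."""
    seen, bad = {}, []
    for r in rows:
        if not matches(r, condition):
            continue
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            bad.append(r)
    return bad

rows = [
    {"Name": "CH1", "Type": "Clothing", "State": "Maryland", "Price": 50, "Vat": 3},
    {"Name": "BK45", "Type": "Book", "State": "New Jersey", "Price": 120, "Vat": 15},
    {"Name": "FN30", "Type": "Furniture", "State": "Washington", "Price": 100, "Vat": 0},
    {"Name": "CH1", "Type": "Clothing", "State": "Maryland", "Price": 100, "Vat": 6},
    {"Name": "BK66", "Type": "Book", "State": "Washington", "Price": 80, "Vat": 10},
]

# Rule 1: the FD restricted to books.
rule1 = conditional_fd_violations(
    rows, {"Type": "Book"}, ["Name", "Type", "State"], ["Price", "Vat"])
# Rule 2: Washington rows must have Vat = 0.
rule2 = constant_rule_violations(rows, {"State": "Washington"}, {"Vat": 0})
print(rule1, [r["Name"] for r in rule2])
```

Restricted to books, the FD has no violations here, because the conflicting CH1 rows are clothing. The second rule flags BK66, a Washington row with a nonzero Vat.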
![Page 11: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/11.jpg)
New Integrity Constraints for Data Quality
Conditional Functional Dependency
Sequential Dependency
Aggregation Dependency: Discovering Conservation Rules
![Page 12: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/12.jpg)
Motivations
Infrastructure networks are continuously monitored over time.
Examples:
• Highway monitoring systems use road sensors to identify traffic buildup and suggest alternate routes.
• Routers in an IP telecommunications network maintain counters to keep track of the traffic flowing through them.
• Power meters measure electricity flowing through different systems.
Monitored to troubleshoot customer problems, check network performance and understand provisioning requirements.
![Page 13: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/13.jpg)
Data Quality Problems in Infrastructure Networks
Missing or delayed data, especially over long intervals of time, can be detrimental to any attempt to ensure a reliable and well-functioning network.
IP network monitoring typically uses the UDP protocol, so measurements can be delayed (or even lost) when there is high network congestion.
Sometimes a new router interface is activated and traffic is flowing through it, but this interface is not known to the monitoring system; in this case, there is missing data that is hard to detect.
![Page 14: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/14.jpg)
Data Quality Problems in Infrastructure Networks
Missing or delayed data, especially over long intervals of time, can be detrimental to any attempt to ensure a reliable and well-functioning network.
Monitoring road networks in the presence of sensor failures or unmonitored road segments
Monitoring electricity networks in the presence of hacked power meters or if someone is diverting (stealing) electricity, etc.
Detecting data quality issues is difficult when monitoring large and complex networks
![Page 15: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/15.jpg)
Approach
Impose integrity constraints to capture the semantics of data
Efficiently provide a concise summary of the data where the rules hold/fail.
![Page 16: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/16.jpg)
Integrity Constraint: Conservation Rules
In many infrastructure networks, there exists a conservation law between related quantities in monitored data.
• Kirchhoff’s Node Law of Conservation of Electricity: The current flowing into a node in an electric circuit equals the current flowing out of the node.
• Road Network Monitoring: Every car that enters an intersection must exit.
• Telecommunication Networks: Every packet entering a router must exit.
And many more…
![Page 17: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/17.jpg)
Conservation Rules
One to One Matching
Match each incoming event to an outgoing event, and report the average delay/loss as a measure of violation of the conservation law. Collecting individual packets or monitoring individual events is infeasible with respect to storage and processing costs.
Monitoring systems provide aggregate counts at regular intervals.
![Page 18: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/18.jpg)
Conservation Rules
[Figure: incoming traffic and outgoing traffic at a router, plotted over time]
We expect the two time series to be identical
Matching incoming and outgoing aggregated traffic at every time point may not reveal true data quality issues.
• Clock synchronization error
• Queuing delay
Compare aggregated totals over time windows.
![Page 19: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/19.jpg)
Conservation Rules: Confidence of an Interval
Confidence of an interval = a ratio of the aggregated totals ∑ ai and ∑ bi over the interval
Ignores duration of violation
[Figure: incoming (IN) and outgoing (OUT) traffic at a router, with counts a1 to a5 and b1 to b5; both series read 10 8 6 4 6, and each totals 34]
![Page 20: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/20.jpg)
Conservation Rules: Confidence of an interval
Rightward Matching between IN and OUT
[Figure: incoming (IN, a1 to a5) and outgoing (OUT, b1 to b5) traffic at a router with counts 10 8 6 4 6, total 34; one case is labeled Confidence=1 and the other Confidence=0]
![Page 21: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/21.jpg)
Earth Mover Distance
A measure of distance between two distributions over some region D.
Interpret the distributions as two different ways of piling up a certain amount of dirt over the region D.
EMD is the minimum cost of turning one pile into the other.
Cost is assumed to be the amount of dirt moved times the distance by which it is moved.
Also known as Wasserstein distance.
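
For two piles on a line with equal total mass and unit spacing between positions, the minimum moving cost is known to equal the sum of absolute differences between the two cumulative sums. The sketch below relies on that one-dimensional fact; the function name `emd_1d` and the sample piles are illustrative assumptions.

```python
from itertools import accumulate

def emd_1d(p, q):
    """Earth Mover Distance between two equal-mass piles on positions 0..n-1.

    Assumes unit distance between neighbouring positions. The dirt that must
    cross the boundary after position k is |P_k - Q_k|, where P and Q are the
    cumulative sums, so the total cost is the sum of those gaps.
    """
    if len(p) != len(q) or sum(p) != sum(q):
        raise ValueError("piles must have the same length and total mass")
    return sum(abs(x - y) for x, y in zip(accumulate(p), accumulate(q)))

pile_a = [10, 8, 6, 4, 6]
pile_b = [0, 10, 8, 6, 10]   # the first four heaps each moved one step right
print(emd_1d(pile_a, pile_a), emd_1d(pile_a, pile_b))
```

In the second call, 10 + 8 + 6 + 4 = 28 units of dirt each travel a distance of 1, so the cost is 28.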
![Page 22: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/22.jpg)
Rightward Matching (RM): A special case of Earth Mover Distance (EMD)
• Only right shifting
• Simple greedy algorithm works
[Figure: incoming (IN) and outgoing (OUT) traffic at a router with counts 10 8 6 4 6. Confidence=1 case: EMD=0, maximum EMD possible=114. Confidence=0 case: EMD=114, maximum EMD possible=114]
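
One way to read the greedy rightward matching is as a first-in, first-out queue: each outgoing unit is matched to the oldest incoming unit still waiting, and the cost is the number of time steps it waited. The sketch below follows that reading. Two points are assumptions rather than statements from the talk: incoming units still unmatched at the end are charged as if they left one step after the window, which reproduces the maximum EMD of 114 on the slide's numbers, and any outgoing surplus with nothing to match is ignored.

```python
from collections import deque

def rightward_matching_cost(incoming, outgoing):
    """Greedy FIFO matching of incoming units to later (or same-time) outgoing units."""
    n = len(incoming)
    pending = deque()          # entries are [arrival_time, units_waiting]
    cost = 0
    for t, (a, b) in enumerate(zip(incoming, outgoing)):
        if a > 0:
            pending.append([t, a])
        remaining = b
        while remaining > 0 and pending:
            arrival, units = pending[0]
            moved = min(units, remaining)
            cost += moved * (t - arrival)   # each unit waited t - arrival steps
            remaining -= moved
            if moved == units:
                pending.popleft()
            else:
                pending[0][1] = units - moved
        # Assumption: outgoing units with nothing to match are ignored.
    for arrival, units in pending:
        cost += units * (n - arrival)       # assumption: leaves just after the window
    return cost

IN = [10, 8, 6, 4, 6]
max_cost = rightward_matching_cost(IN, [0] * len(IN))
for out in (IN, [0, 10, 8, 6, 10], [0] * len(IN)):
    cost = rightward_matching_cost(IN, out)
    print(cost, 1 - cost / max_cost)
```

Under these assumptions, identical series give cost 0 and confidence 1, and a fully missing outgoing series gives cost 114 and confidence 0, matching the two cases on the slide.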
![Page 23: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/23.jpg)
RM: Interpretation by area over cumulative counts
Confidence of an interval I = area(CUM-OUT(I)) / area(CUM-IN(I))
[Figure: cumulative count vs. time, showing the CUM-IN and CUM-OUT curves]
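
The area ratio can be computed directly from the two cumulative series. This is a minimal sketch that treats each area as the sum of cumulative counts over the interval, with cumulative counts taken from the start of the series. Both choices are assumptions, as is the name `area_confidence`.

```python
from itertools import accumulate

def area_confidence(incoming, outgoing, start, end):
    """area(CUM-OUT) / area(CUM-IN) over the 0-based inclusive interval [start, end]."""
    cum_in = list(accumulate(incoming))
    cum_out = list(accumulate(outgoing))
    area_in = sum(cum_in[start:end + 1])
    area_out = sum(cum_out[start:end + 1])
    if area_in == 0:
        return 1.0   # assumption: nothing came in, so nothing can be missing
    return area_out / area_in

IN = [10, 8, 6, 4, 6]
DELAYED = [0, 10, 8, 6, 10]          # same total as IN, but most units leave late

on_time = area_confidence(IN, IN, 0, 4)
lost = area_confidence(IN, [0] * 5, 0, 4)
delayed = area_confidence(IN, DELAYED, 0, 4)
print(on_time, lost, delayed, sum(IN) == sum(DELAYED))
```

The delayed series has the same total as the incoming one, so a comparison of totals alone would not notice the delay. The area ratio drops to 86/114 because it accounts for how long the units were missing.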
![Page 24: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/24.jpg)
RM: Interpretation by area over cumulative counts
Find all intervals with confidence >= 0.9 (say)
[Figure: cumulative count vs. time, showing the CUM-IN and CUM-OUT curves]
![Page 25: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/25.jpg)
RM: Interpretation by area over cumulative counts
Return a minimum collection of intervals with confidence >= 0.9 (say) covering at least 95% (say) of data
[Figure: cumulative count vs. time, showing the CUM-IN and CUM-OUT curves]
![Page 26: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/26.jpg)
Finding intervals with high confidence
• Trivial using O(n³) time
• Try all possible n² intervals
• For each interval, find the confidence using O(n) time
[Figure: cumulative count vs. time, showing the CUM-IN and CUM-OUT curves]
![Page 27: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/27.jpg)
Finding intervals with high confidence
• Easy to do in O(n²) time
• Compute in linear time the confidence of all the intervals that start from a specific point
[Figure: cumulative count vs. time, showing the CUM-IN and CUM-OUT curves]
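
The quadratic approach can be sketched by fixing a starting point and extending the interval one step at a time while keeping running areas, so every interval with that start is scored in a single linear pass. The function name, the sample series, and the convention that area means the sum of cumulative counts are assumptions made for illustration.

```python
from itertools import accumulate

def high_confidence_intervals(incoming, outgoing, threshold):
    """All 0-based inclusive intervals (i, j) whose area ratio reaches the threshold."""
    cum_in = list(accumulate(incoming))
    cum_out = list(accumulate(outgoing))
    n = len(incoming)
    result = []
    for i in range(n):                 # n starting points ...
        area_in = area_out = 0
        for j in range(i, n):          # ... each extended in one linear pass
            area_in += cum_in[j]
            area_out += cum_out[j]
            if area_in > 0 and area_out >= threshold * area_in:
                result.append((i, j))
    return result

IN = [10, 8, 6, 4, 6]
OUT = [10, 8, 0, 0, 16]                # traffic held back at steps 2 and 3
intervals = high_confidence_intervals(IN, OUT, 0.9)
print(intervals)
```

With these sample series, the qualifying intervals avoid the two time steps where outgoing traffic falls behind.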
![Page 28: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/28.jpg)
Finding intervals with high confidence
• How do you solve it in sub-quadratic time?
• Only maximal intervals
[Figure: cumulative count vs. time, showing the CUM-IN and CUM-OUT curves]
![Page 29: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/29.jpg)
Finding intervals with high confidence
Relax confidence:
• If the algorithm outputs I, then conf(I) ≥ c/(1+ε) (no false positives)
• If conf(I*) ≥ c, output an interval I ⊇ I* with conf(I) ≥ c/(1+ε) (no false negatives)
![Page 30: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/30.jpg)
Algorithm 1
Finding intervals with high confidence
![Page 31: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/31.jpg)
Finding intervals with high confidence
Generating Sparse Set of Intervals
Compute the confidence of intervals with [symbol lost in extraction] growing geometrically by a factor of [symbol lost in extraction]
![Page 32: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/32.jpg)
Finding intervals with high confidence
![Page 33: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/33.jpg)
Finding intervals with high confidence
Running time depends on areaB
![Page 34: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/34.jpg)
Finding intervals with high confidence: Avoiding dependency on area
• Main Idea:
• Consider each possible ending point of intervals instead of starting points
• Compute confidence of intervals with interval lengths growing exponentially in 1+ε
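
The main idea can be illustrated by trying, for every ending point, only interval lengths that grow geometrically by a factor of 1+ε, and scoring each candidate in constant time from prefix sums of the cumulative series. This sketch covers the candidate generation only. It does not reproduce the approximation guarantee of the algorithm in the talk, whose details are not in this transcript, and all names below are illustrative assumptions.

```python
from itertools import accumulate

def geometric_lengths(max_len, eps):
    """Lengths 1, ~(1+eps), ~(1+eps)^2, ... up to and including max_len."""
    lengths, length = [], 1
    while length < max_len:
        lengths.append(length)
        length = max(length + 1, int(length * (1 + eps)))   # always make progress
    lengths.append(max_len)
    return lengths

def candidate_intervals(incoming, outgoing, threshold, eps):
    """Qualifying (start, end) pairs among the geometrically spaced candidates."""
    cum_in = list(accumulate(incoming))
    cum_out = list(accumulate(outgoing))
    # Prefix sums of the cumulative series give any interval's area in O(1).
    pre_in = [0] + list(accumulate(cum_in))
    pre_out = [0] + list(accumulate(cum_out))
    found = []
    for end in range(len(incoming)):
        for length in geometric_lengths(end + 1, eps):
            start = end - length + 1
            area_in = pre_in[end + 1] - pre_in[start]
            area_out = pre_out[end + 1] - pre_out[start]
            if area_in > 0 and area_out >= threshold * area_in:
                found.append((start, end))
    return found

IN = [10, 8, 6, 4, 6]
OUT = [10, 8, 0, 0, 16]
found = candidate_intervals(IN, OUT, 0.9, 1.0)
print(geometric_lengths(10, 1.0), sorted(found))
```

Each ending point contributes roughly log(n)/log(1+ε) candidates, so the number of confidence evaluations grows much more slowly than the n² of the exhaustive search.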
![Page 35: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/35.jpg)
Finding intervals with high confidence: Avoiding dependency on area
![Page 36: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/36.jpg)
Finding intervals with high confidence: Avoiding dependency on area
![Page 37: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/37.jpg)
Discount Models
![Page 38: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/38.jpg)
Finding minimum collection of maximal intervals with support threshold
• Partial set cover on a line
• Can be solved exactly in quadratic time using dynamic programming
• Can be solved in linear time if we allow a constant-factor approximation using a greedy algorithm
• Greedy gives a 7-approximation
![Page 39: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/39.jpg)
Finding minimum collection of maximal intervals with support threshold
Partial set cover using greedy algorithm
If OPT chooses t intervals then
• We can choose at most t intervals that do not intersect any of the OPT intervals.
• We can choose at most 6 intervals that intersect a particular OPT interval.
Together, greedy chooses at most t + 6t = 7t intervals, which gives the 7-approximation.
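
A greedy rule for partial cover on a line can be sketched as repeatedly picking the interval that covers the most points not yet covered, until the required number of points is reached. This naive version rescans all intervals in every round, so it does not run in the linear time mentioned in the talk. The function name, the integer `min_points` parameter, and the sample intervals are assumptions.

```python
def greedy_partial_cover(intervals, n, min_points):
    """Pick intervals (0-based, inclusive) until at least min_points of the
    points 0..n-1 are covered; each round takes the largest new gain."""
    covered = [False] * n
    remaining = list(intervals)
    chosen, count = [], 0
    while count < min_points:
        best, best_gain = None, 0
        for start, end in remaining:
            gain = sum(1 for t in range(start, end + 1) if not covered[t])
            if gain > best_gain:
                best, best_gain = (start, end), gain
        if best is None:
            raise ValueError("the intervals cannot cover the requested number of points")
        chosen.append(best)
        remaining.remove(best)
        for t in range(best[0], best[1] + 1):
            covered[t] = True
        count += best_gain
    return chosen

# Cover at least 8 of 10 time points with as few intervals as the greedy rule finds.
cover = greedy_partial_cover([(0, 3), (2, 5), (6, 9), (4, 4)], 10, 8)
print(cover)
```

Ties are broken in favor of the interval listed first, which is an arbitrary choice in this sketch.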
![Page 40: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/40.jpg)
Credit Card Data
[Figure: plot with time axis labels Dec and Jan]
![Page 41: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/41.jpg)
Entrance-Exit Data
![Page 42: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/42.jpg)
Network Monitoring Data
![Page 43: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/43.jpg)
Running Time on Job-log Data
[Figure panels: Area Based, Non-area Based]
![Page 44: Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava](https://reader030.vdocuments.site/reader030/viewer/2022032802/56649e155503460f94affcb2/html5/thumbnails/44.jpg)
Summary
• We study data quality problems that arise frequently in many infrastructure networks.
• We propose rules that express conservation laws between related quantities, such as those between the inbound and outbound counts reported by network monitoring systems.
• We present several confidence metrics for conservation rules.
• We give efficient approximation algorithms for finding a concise set of intervals that satisfy (or fail) a supplied conservation rule given a confidence threshold.