bigdansing - university of helsinki · 4/10/2017 bigdansing biyun huang 30 physical operators are...
TRANSCRIPT
![Page 1: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/1.jpg)
BigDansing
A System for Big Data Cleansing
4/10/2017 1BigDansing Biyun Huang
![Page 2: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/2.jpg)
Table of Content
Description
Description
4/10/2017 2BigDansing Biyun Huang
Introduction
Data Cleansing Process
Architecture of BigDansing
Experiment
Discussion
![Page 3: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/3.jpg)
Background
Description
Description
4/10/2017 3BigDansing Biyun Huang
“Inaccurate data has a direct impact ... the average company
losing 12% of its revenue” -- Ben Davis (Econsultancy)
“More than 25 Percent of Critical Data Used in Large
Corporations is Flawed” – Gartner Inc.
“This is the digital universe. It is growing 40% a year into the
next decade” -- EMC2
![Page 4: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/4.jpg)
Concepts
Description
Description
4/10/2017 4BigDansing Biyun Huang
Data CleansingData cleansing, data cleaning, or data scrubbing is the process of
detecting and correcting (or removing) corrupt or inaccurate records
from a record set, table, or database.
Data QualityData are of high quality “if they are fit for their intended uses in
operations, decision making and planning ".
![Page 5: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/5.jpg)
An Intuitive Example
Description
Description
4/10/2017 5BigDansing Biyun Huang
Name Zipcode City
Winnie 91340 San Francisco
Robbert 91340 New York
Emma 91340 San Francisco
Quality Rule
Two customers having the same zipcode cannot be in different cities
D(zipcode -> city)
Dirty Dataset
![Page 6: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/6.jpg)
An Intuitive Example
Description
Description
4/10/2017 6BigDansing Biyun Huang
Quality Rule
Two customers having the same zipcode cannot be in different cities
D(zipcode -> city)
Name Zipcode City
Winnie 91340 San Francisco
Robbert 91340 New York
Emma 91340 San Francisco
Dirty Dataset
![Page 7: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/7.jpg)
ChallengesIn the context of big data
Description
Description
4/10/2017 7BigDansing Biyun Huang
ScalabilityCan not scale to large datasets
AbstractionNeed to understand both the quality rules
and the distributed platform
![Page 8: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/8.jpg)
Description
Description
4/10/2017 8BigDansing Biyun Huang
ScalabilityCan not scale to large datasets
AbstractionNeed to understand both the quality rules
and the distributed platform
ChallengesIn the context of big data
![Page 9: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/9.jpg)
BigDansing
Description
Description
4/10/2017 9BigDansing Biyun Huang
A Big Data Cleansing system to tackle
efficiency, scalability, and ease-of-use issues
in data cleansing
![Page 10: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/10.jpg)
BigDansingTo address the challenges
Description
Description
4/10/2017 10BigDansing Biyun Huang
Generic abstraction
define customized rules together with traditional rules in a simple way
the underlying distributed platform is transparent
Fast and scalable detection, repair and updates
1.9B rows → 13B violations < 3 hours on 16 small machines
![Page 11: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/11.jpg)
Data Cleansing Process
Description
4/10/2017 11BigDansing Biyun Huang
Detect
Update
Repair
Source: https://www.slideshare.net/Zuhairkhayyat/bigdansing-presentation-slides-for-kaust
Data Cleansing Flowchart
Quality Rules
Dirty Data
Clean Data
![Page 12: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/12.jpg)
Architecture
Description
Description
4/10/2017 12BigDansing Biyun Huang
Source: Zuhair Khayyat et al: BigDansing: A System for Big Data Cleansing. SIGMOD Conference 2015: 1215-1230
![Page 13: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/13.jpg)
Description
Description
4/10/2017 13BigDansing Biyun Huang
Input:quality rulesa dirty dataset
Architecture
![Page 14: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/14.jpg)
Description
Description
4/10/2017 14BigDansing Biyun Huang
RuleEngine:
logical layer
physical layer
execution layer
Architecture
![Page 15: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/15.jpg)
Description
Description
4/10/2017 15BigDansing Biyun Huang
general purpose distributed platform like Hadoop or Spark
Architecture
![Page 16: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/16.jpg)
Description
Description
4/10/2017 16BigDansing Biyun Huang
Repair Algorithm :run existing centralized algorithm as a black box in a distributed way
Architecture
![Page 17: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/17.jpg)
Description
Description
4/10/2017 17BigDansing Biyun Huang
Output: a clean dataset
Architecture
![Page 18: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/18.jpg)
Description
Description
4/10/2017 18BigDansing Biyun Huang
Let’s dig into the components one by one…
Architecture
![Page 19: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/19.jpg)
Description
Description
4/10/2017 19BigDansing Biyun Huang
Input:quality rulesa dirty dataset
Architecture
![Page 20: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/20.jpg)
Quality Rule
Description
Description
4/10/2017 20BigDansing Biyun Huang
Dirty data is detected by
Declarative RuleFunctional dependencies (FD)
e.g. Zipcode → CityConditional functional dependency (CFD)
e.g. Country = 'Saudi Arabia', Zipcode → CityDenial constraints (DC)
e.g. ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
User Defined Function (UDF)
Duplicates, statistical errors
![Page 21: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/21.jpg)
Description
Description
4/10/2017 21BigDansing Biyun Huang
AbstractionUDF-based Approach
![Page 22: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/22.jpg)
AbstractionUDF-based Approach
Description
Description
4/10/2017 22BigDansing Biyun Huang
Quality rules are represented by 5 functions
Scope, Block, Iterate, Detect, and GenFix
declarative rules
-automatically be translated into logical operators
UDFs
-can be implemented using logical operators
![Page 23: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/23.jpg)
Logic OperatorsScope
Description
Description
4/10/2017 23BigDansing Biyun Huang
Scope
Input: data units
Output: data units
Example: Zipcode → City
Input: t1 – t6
Output:
t1 – t6 (Zipcode,City)
Name Zipcode City
t1 Annie 60601 NY
t2 Laure 90210 LA
t3 John 60601 CH
t4 Mark 90210 SF
t5 Robert 60601 CH
t6 Mary 90210 LA
Zipcode City
t1 60601 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60601 CH
t6 90210 LA
![Page 24: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/24.jpg)
Logic OperatorsBlock
Description
Description
4/10/2017 24BigDansing Biyun Huang
Block
Input: data units
Output: grouping key
Example: Zipcode → City
Input: t1 – t6
Output:
<60601, (t1,t3,t5)>,
<90210, (t2,t4,t6)>
Zipcode City
t1 60601 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60601 CH
t6 90210 LA
![Page 25: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/25.jpg)
Logic OperatorsIterate
Description
Description
4/10/2017 25BigDansing Biyun Huang
Iterate
Input: a group of data units
Output: single tuple, tuple pair
Example: Zipcode → City
Input: <60601, (t1,t3,t5)>
Output:
<t1,t3>, <t1,t5>, <t3,t5>
Zipcode City
t1 60601 NY
t3 60601 CH
t5 60601 CH
![Page 26: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/26.jpg)
Logic OperatorsDetect
Description
Description
4/10/2017 26BigDansing Biyun Huang
Detect
Input: data units
Output: Violation(s)
Example: Zipcode → City
Input: <t1,t3>, <t1,t5>, <t3,t5>
Output:
<t1.City ≠ t3.City>,
<t1.City ≠ t5.City>
Zipcode City
t1 60601 NY
t3 60601 CH
t5 60601 CH
![Page 27: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/27.jpg)
Logic OperatorsGenFix
Description
Description
4/10/2017 27BigDansing Biyun Huang
GenFix
Input: Violation
Output: possible fix(es)
Example: Zipcode → City
Input: <t1.City ≠ t3.City>,
<t1.City ≠ t5.City>
Output:
<t1.City = t3.City>,
<t1.City = t5.City>
Zipcode City
t1 60601 NY
t3 60601 CH
t5 60601 CH
Zipcode City
t1 60601 CH
t3 60601 CH
t5 60601 CH
![Page 28: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/28.jpg)
Rule Engine
Description
Description
4/10/2017 28BigDansing Biyun Huang
RuleEngine
three-layer
on top of general purpose data processing framework such as MapReduce-like with execution abstraction
![Page 29: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/29.jpg)
Rule EngineLogical Plan
Description
4/10/2017 29BigDansing Biyun Huang
Define the data unit flow
Validating the plan:At least one input datasetFor UDF: at least one detectFor Rules: at least one rule
Support simple and bushy plans
RuleLogical
PlanDetectGenFix Iterate
BlockScope
Found Found Found
Not found Not found Not found
Not found
![Page 30: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/30.jpg)
Description
4/10/2017 30BigDansing Biyun Huang
Physical operators are system specificMPI, Hadoop, Spark
Each physical operator is an independent execution unit
Each logical operator → one physical operator
BigDansing consolidate logical plans to improve I/O
More physical operators can be added with different optimizations to improve logical plans
Rule EnginePhysical Plan
![Page 31: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/31.jpg)
Description
4/10/2017 31BigDansing Biyun Huang
Run on the parallel general purpose data processing framework such as MapReduce-like
User don't need to care about how the distributed computation performs
Rule EngineExecution Plan
![Page 32: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/32.jpg)
Repair Algorithm
Description
Description
4/10/2017 32BigDansing Biyun Huang
Repair Algorithm
equivalence class
hypergraph-based
![Page 33: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/33.jpg)
Repair Algorithm
Description
Description
4/10/2017 33BigDansing Biyun Huang
Implement two existing serial repair algorithms to run in distributed mode
Equivalence class algorithm
Hypergraph algorithm
Design a distributed version of the seminal equivalence class algorithm
![Page 34: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/34.jpg)
Equivalence Class Algorithm
Description
Description
4/10/2017 34BigDansing Biyun Huang
Fix errors based on (=,≠)
Based on heuristics:Partition the possible fixes into different groupsAssign the highest frequency value to group
Example:Group 1: Zipcode = 60601
Highest frequency = CHGroup 2: Zipcode = 90210
Highest frequency = LA
Name Zipcode City
t1 Annie 60601 NY
t2 Laure 90210 LA
t3 John 60601 CH
t4 Mark 90210 SF
t5 Robert 60601 CH
t6 Mary 90210 LA
t7 Jon 60601 CH
![Page 35: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/35.jpg)
Hypergraph Algorithm
Description
Description
4/10/2017 35BigDansing Biyun Huang
Fix errors based on (<,>,≤, and ≥)
Based on linear optimization and greedy MVC:Select hyper-graph node with highest edgesChange its value depending on edge conditions
Name Salary Rate
t1 Annie 24000 15
t2 Laure 25000 10
t3 John 40000 25
t4 Mark 88000 24
t5 Robert 15000 15
t6 Mary 81000 28
t7 Jon 40000 25
t1.Salaryt1.Rate
t2.Salaryt2. Rate
t5.Salaryt5.Rate
>,<
>,<
t3.Salaryt3.Rate
t6.Salaryt6. Rate
t7.Salaryt7.Rate
t4.Salaryt4. Rate
>,<
>,<
>,<
![Page 36: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/36.jpg)
Experiment
Description
Description
4/10/2017 36BigDansing Biyun Huang
Dataset
More intuitively, TPCH datasets with 959M, 1271M, 1583M, and 1907M rows are of sizes 150GB, 200GB, 250GB, and 300GBrespectively.
![Page 37: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/37.jpg)
Experiment
Description
Description
4/10/2017 37BigDansing Biyun Huang
Quality Rules
![Page 38: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/38.jpg)
Results
Description
Description
4/10/2017 38BigDansing Biyun Huang
Single-node experiments with Φ1 (Functional Dependencies)
TaxA dataset:FD: Zipcode → CityFD: Zipcode → State
provides a finer
granular abstraction
allowing users to
specify rules more
efficiently
![Page 39: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/39.jpg)
Results
Description
Description
4/10/2017 39BigDansing Biyun Huang
TPCH dataset:FD: custkey →custAddress16 Workers
Multi-nodes experiments with Φ1(Functional Dependencies)
![Page 40: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/40.jpg)
Results
Description
Description
4/10/2017 40BigDansing Biyun Huang
TPCH Dataset:FD: custkey → custAddressDataset: 500M rows
Scalability
![Page 41: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/41.jpg)
Results
Description
4/10/2017 41BigDansing Biyun Huang
TaxA/HAI Dataset:Dataset: 100K--40M/166k rows
Repair quality
![Page 42: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/42.jpg)
Results
Description
Description
4/10/2017 42BigDansing Biyun Huang
Outperform existing baseline systems up to more than two
orders of magnitude without sacrificing the repair quality!
![Page 43: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/43.jpg)
DiscussionContributions
Description
4/10/2017 43BigDansing Biyun Huang
A data cleansing framework rather than a data cleansing algorithm
very useful as the emerging of big data
There have been extensive studies on data cleansing algorithm for decades.
But now the demand is to run these algorithms on distributed system.
![Page 44: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/44.jpg)
Description
4/10/2017 44BigDansing Biyun Huang
Flexibility and scalability
Logic Abstraction define customized rules together with traditional rules in a simple way with an UDF-based approach
Flexibility of Repair Algorithmadopt existing algorithms or user-defined ones by treating the repair algorithm as a black box
Scalability and Fast Computation
integrate on general purpose distributed framework
Portable and Easy-to-userun on various platforms ranging from DBMS to MapReduce-like platforms
DiscussionContributions
![Page 45: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/45.jpg)
DiscussionLimitations
Description
4/10/2017 45BigDansing Biyun Huang
When treating UDFs as black boxes, it is hard to do static analysis, such as consistency and implication, for the given rules.
Dataset must be static and quality rules must be predefined.stream data / real-time cleansing / modified quality rules
Require domain expertise.hinder its adoption in industrial targeting non-expert users
![Page 46: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/46.jpg)
DiscussionLimitations
Description
4/10/2017 46BigDansing Biyun Huang
In the experiment, the authors added random text/numerical/duplicate errors and the system considers only key-based blockers, which is not accurate for many real-world data sets, due to dirty/missing data.
It is usually hard, if not impossible, to guarantee the accuracy of the data cleaning process without verifying it via experts or external sources.
![Page 47: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/47.jpg)
Description
Description
4/10/2017 47BigDansing Biyun Huang
DiscussionLimitations
![Page 48: BigDansing - University of Helsinki · 4/10/2017 BigDansing Biyun Huang 30 Physical operators are system specific MPI, Hadoop, Spark Each physical operator is an independent execution](https://reader036.vdocuments.site/reader036/viewer/2022071218/604f0ebe45b18f62574eb167/html5/thumbnails/48.jpg)
Thanks!
4/10/2017 48BigDansing Biyun Huang