SampleClean: Bringing Data Cleaning into the BDAS Stack
DESCRIPTION
"SampleClean: Bringing Data Cleaning into the BDAS Stack" presentation at AMPCamp 5 by Sanjay Krishnan and Daniel Haas
TRANSCRIPT
![Page 1: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/1.jpg)
SampleClean: Bringing Data Cleaning into the BDAS Stack
Sanjay Krishnan and Daniel Haas
In collaboration with: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg
![Page 2: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/2.jpg)
Who publishes more?
![Page 3: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/3.jpg)
Microsoft Academic Search

| Paper Id | Affiliation |
|---|---|
| 16 | Computer Science Division--University of California Berkeley CA |
| 101 | University of California at Berkeley |
| 102 | Department of Physics Stanford University California |
| 116 | Lawrence Berkeley National Labs <ref>California</ref> |
![Page 4: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/4.jpg)
![Page 5: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/5.jpg)
Microsoft Academic Search
University of California at Berkeley
Computer Science Division
University of California at Berkeley
Computer Science Division
University of California at Berkeley
Department of Physics Stanford University California
![Page 6: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/6.jpg)
• Data cleaning in BDAS.
  – Problem 1. Scale
  – Problem 2. Latency
• Sampling to cope with scale.
• Asynchrony to cope with latency.
Enter SampleClean
![Page 7: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/7.jpg)
Now it’s your turn!
Be the crowd and help us decide
![Page 8: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/8.jpg)
Dirty Data is Ubiquitous
Example: Missing, incomplete, inconsistent data
![Page 9: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/9.jpg)
Data Cleaning is Hard
Time consuming
![Page 10: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/10.jpg)
Data Cleaning is Hard
Costly
Time consuming
![Page 11: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/11.jpg)
Data Cleaning is Hard
Costly
Domain-specific
Time consuming
![Page 12: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/12.jpg)
![Page 13: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/13.jpg)
A New Data Cleaning Architecture
[Diagram labels: Data, Data Cleaning, Analytics]
![Page 14: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/14.jpg)
A New Data Cleaning Architecture
[Diagram labels: Data, Data Cleaning, Analytics]
![Page 15: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/15.jpg)
Can it Scale?
[Figure labels: Crowd, Machine Learning, Regex; axis: Time]
People are slow and expensive
![Page 16: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/16.jpg)
Insight 1: Asynchrony Hides Latency
![Page 17: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/17.jpg)
Insight 2: Sampling Hides Scale
[Plot: axes Query Error and Time; labeled: BlinkDB]
![Page 18: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/18.jpg)
Insight 2: Sampling Hides Scale
[Plot: axes Query Error and Time; labeled: Data Error, BlinkDB]
![Page 19: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/19.jpg)
Insight 2: Sampling Hides Scale
[Plot: axes Query Error and Time; labeled: Data Error, BlinkDB, SampleClean]
![Page 20: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/20.jpg)
SampleClean Data Flow
[Diagram labels: Dirty Data, Dirty Sample, Clean Sample, Query, Data Cleaning; annotations: Sampling, Asynchrony]
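The flow named on this slide (dirty data, dirty sample, clean sample, query) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows are loosely based on the affiliation slide, and `clean` is a hypothetical toy rule, not a SampleClean operator.

```python
import random

# Illustrative dirty records (paper_id, affiliation), repeated to form 1,000 rows.
dirty_data = [
    (16, "Computer Science Division--University of California Berkeley CA"),
    (101, "University of California at Berkeley"),
    (102, "Department of Physics Stanford University California"),
    (116, "Lawrence Berkeley National Labs"),
] * 250

def clean(affiliation):
    """Toy cleaning rule (an assumption): map UC Berkeley variants to one name."""
    text = affiliation.lower()
    if "university of california" in text and "berkeley" in text:
        return "University of California at Berkeley"
    return affiliation

rng = random.Random(0)

# Sampling: only 10% of the rows are ever cleaned.
dirty_sample = rng.sample(dirty_data, 100)

# Data cleaning: applied to the sample, not to the full data set.
clean_sample = [(pid, clean(aff)) for pid, aff in dirty_sample]

# Query: count UC Berkeley rows in the clean sample and scale up to the full data.
hits = sum(aff == "University of California at Berkeley" for _, aff in clean_sample)
estimate = hits * len(dirty_data) / len(dirty_sample)
print(f"estimated UC Berkeley rows: {estimate:.0f}")
```

The point of the sketch is that cleaning cost depends on the sample size, not on the size of the dirty data.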
![Page 21: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/21.jpg)
SampleClean Data Flow
[Diagram labels: Clean Sample, Query, Data Cleaning; annotation: Asynchrony]
![Page 22: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/22.jpg)
The SampleClean Architecture
[Diagram]
Components: Data Cleaning Library, Approximate Query Processing, Asynchronous Pipelines
Data: Dirty Sample, Clean Sample
User actions: Declare Cleaning Operations; Issue Queries, Get Results
![Page 23: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/23.jpg)
![Page 24: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/24.jpg)
Approximate Query Processing
• Estimate early results and bound with error bars
[Plot: axes Query Error and Time]
SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013
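As a rough illustration of "estimate early results and bound with error bars", the sketch below estimates a mean from a random sample and attaches an approximate 95% confidence interval. It assumes a normal approximation (z = 1.96) and synthetic data; it is not the estimator from the SampleClean or BlinkDB papers, and `estimate_mean` is a hypothetical helper name.

```python
import math
import random
import statistics

def estimate_mean(sample, z=1.96):
    """Return (estimate, half_width) of an approximate 95% confidence interval."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, z * std_err

rng = random.Random(42)
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]  # synthetic data

# Larger samples take longer to process but give tighter error bars.
for k in (100, 1_000, 10_000):
    est, half = estimate_mean(rng.sample(population, k))
    print(f"sample size {k:>6}: {est:.2f} +/- {half:.2f}")
```

The half-width shrinks roughly with the square root of the sample size, which is the query-error versus time trade-off the plot on this slide refers to.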
![Page 25: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/25.jpg)
The SampleClean Architecture
[Diagram]
Components: Approximate Query Processing, Asynchronous Pipelines, Data Cleaning Library
Data: Dirty Sample, Clean Sample
User actions: Declare Cleaning Operations; Issue Queries, Get Results
![Page 26: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/26.jpg)
Crowds and Machines Work Together
• Extensible library of data cleaning tools
• Tools are:
  – Automated
  – Human-powered
  – Hybrid
[Figure labels: Crowd, Machine Learning, Regex; axis: Time]
![Page 27: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/27.jpg)
Active Learning and Crowds
• Choose informative training points
Not Informative
Are these the same?
  Stanford Department of IEOR
  UC Berkeley Stats
  ○ Yes  ◉ No
Informative
Are these the same?
  Department of Mathematics Stanford University
  University of California Berkeley Department of Mathematics
  ○ Yes  ◉ No
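One common way to "choose informative training points" is uncertainty sampling: send the crowd the pair the machine is least sure about. The sketch below is an assumption-laden stand-in, using token-level Jaccard similarity and a 0.5 threshold in place of a trained classifier's confidence; the talk does not state which selection rule SampleClean uses.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def most_informative(pairs, threshold=0.5):
    """Pick the pair whose similarity is closest to the decision threshold."""
    return min(pairs, key=lambda p: abs(jaccard(*p) - threshold))

# The two candidate questions shown on the slide.
candidate_pairs = [
    ("Stanford Department of IEOR", "UC Berkeley Stats"),
    ("Department of Mathematics Stanford University",
     "University of California Berkeley Department of Mathematics"),
]

question = most_informative(candidate_pairs)
print("Ask the crowd about:", question)
```

With this toy measure the first pair shares no tokens, so it is an easy call for the machine, while the second pair sits near the threshold and is the one worth a crowd worker's time.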
![Page 28: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/28.jpg)
![Page 29: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/29.jpg)
The SampleClean Architecture
[Diagram]
Components: Data Cleaning Library, Approximate Query Processing, Asynchronous Pipelines
Data: Dirty Sample, Clean Sample
User actions: Declare Cleaning Operations; Issue Queries, Get Results
![Page 30: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/30.jpg)
Putting it all together: Asynchronous Pipelines
• Users group data cleaning operations into pipelines
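A minimal sketch of grouping cleaning operations into an asynchronous pipeline, written with Python's `asyncio` rather than the SampleClean API (which this transcript does not show). The operation names `strip_markup`, `normalize_spaces`, and `ask_crowd` are hypothetical, and the crowd step is simulated with a short sleep.

```python
import asyncio
import re

async def strip_markup(record):
    """Automated step: replace tag-like debris such as <ref> markers with spaces."""
    return re.sub(r"</?\w+>", " ", record)

async def normalize_spaces(record):
    """Automated step: collapse runs of whitespace."""
    return " ".join(record.split())

async def ask_crowd(record):
    """Stand-in for a slow human-powered step; the sleep simulates crowd latency."""
    await asyncio.sleep(0.01)
    return record

async def run_pipeline(record, operations):
    """Apply each cleaning operation to one record, in order."""
    for op in operations:
        record = await op(record)
    return record

async def clean_all(records, operations):
    """Run the pipeline over all records concurrently so slow steps overlap."""
    return await asyncio.gather(*(run_pipeline(r, operations) for r in records))

pipeline = [strip_markup, normalize_spaces, ask_crowd]
records = [
    "Lawrence Berkeley National Labs <ref>California</ref>",
    "University  of California at Berkeley",
]
cleaned = asyncio.run(clean_all(records, pipeline))
print(cleaned)
```

Because each record moves through the pipeline independently, a slow crowd answer for one record does not hold up the automated steps for the others.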
![Page 31: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/31.jpg)
![Page 32: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/32.jpg)
Great, Now What?
• Prototype implementation complete!
• Significant research challenges remain:
  – Crowd worker performance and quality
  – Pipeline semantics and optimization
  – Programming model and interface
• Open source release targeted for next year
![Page 33: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/33.jpg)
Summary
• Data Cleaning is slow, costly, and domain-specific
• SampleClean brings data cleaning into the BDAS stack
• SampleClean uses asynchrony to hide latency, and sampling to hide scale
• SampleClean combines Algorithms, Machines, and People, all in one system
![Page 34: SampleClean: Bringing Data Cleaning into the BDAS Stack](https://reader033.vdocuments.site/reader033/viewer/2022060121/559444841a28abfa2f8b4756/html5/thumbnails/34.jpg)
Asynchrony in Spark
• The Spark abstraction: blocking BSP
• So how do we achieve asynchrony?
  – Multithreaded master
  – Intermediate results materialized in Hive
  – Standalone Finagle HTTP server for crowd work
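The idea behind a multithreaded master with materialized intermediate results can be sketched with plain Python threads. Nothing here uses Spark, Hive, or Finagle: the `materialized` list is a hypothetical stand-in for the stored intermediate results, and the sleep simulates waiting on crowd answers.

```python
import threading
import time

materialized = []          # stand-in for intermediate results stored in Hive
lock = threading.Lock()

def crowd_cleaner(rows):
    """Background worker: simulates slow crowd answers arriving one at a time."""
    for row in rows:
        time.sleep(0.01)   # simulated crowd latency
        with lock:
            materialized.append(row.strip())

def query_count():
    """Query that does not wait for cleaning: counts what is cleaned so far."""
    with lock:
        return len(materialized)

rows = [" uc berkeley ", "stanford university", " mit "]
worker = threading.Thread(target=crowd_cleaner, args=(rows,))
worker.start()

early = query_count()      # may see anywhere from 0 to 3 rows, depending on timing
worker.join()
final = query_count()      # all rows are materialized once the worker finishes

print(f"early result: {early} rows, final result: {final} rows")
```

The query thread never blocks on the cleaning thread; it reads whatever has been materialized, which is how early, approximate answers become available while slow cleaning continues.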