2015 - Extract SF - Data Quality
![Page 1: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/1.jpg)
It's Time to Start Caring About Data Quality
Data Quality at Scale
![Page 2: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/2.jpg)
Ignacio Elola
![Page 3: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/3.jpg)
Everyone is talking about how useful data is
![Page 4: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/4.jpg)
data can save your business
![Page 5: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/5.jpg)
data can save your life
![Page 6: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/6.jpg)
![Page 7: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/7.jpg)
but...
![Page 8: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/8.jpg)
all that is only true if you have the right data
![Page 9: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/9.jpg)
data tend to be dirty and unstructured
![Page 10: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/10.jpg)
especially web data!
![Page 11: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/11.jpg)
![Page 12: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/12.jpg)
Let’s start simple
![Page 13: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/13.jpg)
I’ve created an extractor
![Page 14: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/14.jpg)
![Page 15: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/15.jpg)
I’ve passed a bunch of queries (bulk)
![Page 16: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/16.jpg)
![Page 17: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/17.jpg)
![Page 18: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/18.jpg)
and got a dataset
![Page 19: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/19.jpg)
![Page 20: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/20.jpg)
How can you QA this data?
![Page 21: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/21.jpg)
eyeballing
![Page 22: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/22.jpg)
![Page 23: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/23.jpg)
![Page 24: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/24.jpg)
by eyeballing we can find anomalies without having domain expertise
![Page 25: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/25.jpg)
Quick summary:
- create extractors
- combine extractors
- schedule data extraction
![Page 26: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/26.jpg)
What if we need to scale up?
![Page 27: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/27.jpg)
if you have:
- more than ~3 datasources
- more than ~2 extractors per ds
- big volume of queries
- pre or post processing
![Page 28: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/28.jpg)
you will need:
- people to create and maintain extractors
- process to clean and validate data
![Page 29: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/29.jpg)
Data Quality
think about it pre and post data extraction!
![Page 30: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/30.jpg)
tips and tricks to increase data quality
![Page 31: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/31.jpg)
XPaths
![Page 32: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/32.jpg)
![Page 33: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/33.jpg)
![Page 34: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/34.jpg)
//div[@id="priceBlock"]/table/tbody/tr/td[b/@class="priceLarge"]/b
better than
//*[@id="priceBlock"]/table/tbody/tr[2]/td[2]/b[1]
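To see why the attribute-based XPath holds up better, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The two page snippets, the `extract` helper and the constant names are made up for illustration. ElementTree only supports a subset of XPath and cannot evaluate the `td[b/@class="priceLarge"]` predicate from the slide, so the sketch uses a close variant that selects the `b` element by its class directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page, simplified to well-formed markup.
ORIGINAL = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>List price</td><td><b class="priceSmall">$25.00</b></td></tr>
<tr><td>Price</td><td><b class="priceLarge">$19.99</b></td></tr>
</tbody></table></div></body></html>
"""

# The same hypothetical page after the site adds a promo row on top.
CHANGED = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>Promo</td><td><b class="promo">Free shipping</b></td></tr>
<tr><td>List price</td><td><b class="priceSmall">$25.00</b></td></tr>
<tr><td>Price</td><td><b class="priceLarge">$19.99</b></td></tr>
</tbody></table></div></body></html>
"""

# Positional path: depends on the price being in row 2, cell 2.
POSITIONAL = ".//*[@id='priceBlock']/table/tbody/tr[2]/td[2]/b[1]"
# Attribute-based path (ElementTree-compatible variant of the slide's XPath).
BY_ATTRIBUTE = ".//div[@id='priceBlock']/table/tbody/tr/td/b[@class='priceLarge']"


def extract(markup, xpath):
    """Return the text of the first node matching xpath, or None."""
    root = ET.fromstring(markup.strip())
    node = root.find(xpath)
    return node.text if node is not None else None


if __name__ == "__main__":
    for label, page in (("original", ORIGINAL), ("changed", CHANGED)):
        print(label, extract(page, POSITIONAL), extract(page, BY_ATTRIBUTE))
```

On the original page both paths return the same price. Once a row is inserted, the positional path silently returns the list price instead, while the attribute-based path still finds the right cell.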
![Page 35: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/35.jpg)
Regex
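As a rough illustration of how a regex can clean an extracted value, the sketch below pulls a currency symbol and a numeric amount out of raw price text. The pattern, the `parse_price` name and the sample strings are assumptions made for this example, not something shown in the talk, and the pattern only handles comma thousands separators with a dot decimal point.

```python
import re

# Hypothetical pattern: currency symbol, optional space, digits with
# optional comma thousands separators and an optional decimal part.
PRICE_RE = re.compile(
    r"(?P<currency>[$€£])\s*(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)"
)


def parse_price(raw):
    """Return (currency, amount) from raw text, or None if no price is found."""
    match = PRICE_RE.search(raw)
    if match is None:
        return None
    amount = float(match.group("amount").replace(",", ""))
    return match.group("currency"), amount


if __name__ == "__main__":
    for raw in ("Price: $1,299.00 (incl. tax)", "£ 19.99", "Out of stock"):
        print(repr(raw), "->", parse_price(raw))
```

Returning `None` for text without a price keeps missing values visible downstream instead of turning them into zeros.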
![Page 36: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/36.jpg)
![Page 37: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/37.jpg)
![Page 38: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/38.jpg)
More at:
http://support.import.io/knowledgebase/articles/341182-xpaths-regex
http://www.w3schools.com/xsl/xpath_intro.asp
![Page 39: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/39.jpg)
Required column
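One way to read "required column" is: a row only counts as valid if certain columns are filled in. The sketch below applies that idea after extraction. It is an assumption about the behaviour, not a description of any specific tool, and `split_by_required` and the sample rows are hypothetical.

```python
def is_filled(value):
    """A cell counts as filled if it is not None and not blank text."""
    return value is not None and str(value).strip() != ""


def split_by_required(rows, required):
    """Split rows into (kept, rejected) based on the required columns."""
    kept, rejected = [], []
    for row in rows:
        missing = [col for col in required if not is_filled(row.get(col))]
        if missing:
            rejected.append(row)
        else:
            kept.append(row)
    return kept, rejected


if __name__ == "__main__":
    rows = [
        {"name": "Widget", "price": "$19.99"},
        {"name": "Gadget", "price": ""},   # blank price
        {"name": "Doohickey"},             # price column missing
    ]
    kept, rejected = split_by_required(rows, required=["name", "price"])
    print(len(kept), "kept,", len(rejected), "rejected")
```

Keeping the rejected rows around, rather than dropping them, makes it possible to check later whether the extractor or the page was at fault.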
![Page 40: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/40.jpg)
![Page 41: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/41.jpg)
![Page 42: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/42.jpg)
![Page 43: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/43.jpg)
![Page 44: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/44.jpg)
measuring data quality
![Page 45: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/45.jpg)
![Page 46: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/46.jpg)
completeness
![Page 47: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/47.jpg)
coverage
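The slides name these two metrics without defining them, so the sketch below uses working definitions that are assumptions: completeness as the share of non-empty cells per column, and coverage as the share of input queries that produced at least one row. The function names and the `query` key are hypothetical.

```python
def is_filled(value):
    """A cell counts as filled if it is not None and not blank text."""
    return value is not None and str(value).strip() != ""


def completeness(rows, columns):
    """Per column, the fraction of rows with a non-empty value."""
    total = len(rows)
    return {
        col: (sum(1 for r in rows if is_filled(r.get(col))) / total if total else 0.0)
        for col in columns
    }


def coverage(queries, rows, query_key="query"):
    """Fraction of distinct input queries that returned at least one row."""
    wanted = set(queries)
    if not wanted:
        return 0.0
    answered = {r.get(query_key) for r in rows}
    return len(wanted & answered) / len(wanted)


if __name__ == "__main__":
    rows = [
        {"query": "a", "name": "X", "price": "1"},
        {"query": "a", "name": "Y", "price": ""},
        {"query": "b", "name": "Z", "price": None},
        {"query": "c", "name": "W", "price": "4"},
    ]
    print(completeness(rows, ["name", "price"]))
    print(coverage(["a", "b", "c", "d"], rows))
```

Tracking both numbers per extractor run gives a simple baseline: a sudden drop in either one usually means the source page changed.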
![Page 48: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/48.jpg)
![Page 49: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/49.jpg)
post extraction data quality improvements?
![Page 50: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/50.jpg)
![Page 51: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/51.jpg)
how we do it
![Page 52: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/52.jpg)
![Page 53: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/53.jpg)
![Page 54: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/54.jpg)
Smart automation
![Page 55: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/55.jpg)
anomaly detection
![Page 56: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/56.jpg)
variance, variability, noise
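The talk does not say which anomaly detection method is used. As one possible illustration, the sketch below flags numeric outliers with a modified z-score based on the median absolute deviation (MAD), which is less sensitive to extreme values than mean and variance. The `mad_outliers` name, the 3.5 threshold and the sample prices are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median of absolute deviations from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical: no usable spread.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


if __name__ == "__main__":
    # Hypothetical extracted prices; the last one looks like a parsing slip.
    prices = [19.99, 20.49, 18.75, 21.0, 19.5, 1999.0]
    print(mad_outliers(prices))
```

A flagged value is only a candidate for review. Genuine price jumps exist, so this fits best as a filter in front of human input rather than as an automatic delete.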
![Page 57: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/57.jpg)
normalization
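"Normalization" could mean several things here. Assuming it refers to bringing extracted text into one consistent form, a minimal sketch with Python's standard `unicodedata` module might look like this. The `normalize_text` helper is hypothetical.

```python
import unicodedata


def normalize_text(raw):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # str.split() with no arguments splits on any whitespace run.
    return " ".join(text.split())


if __name__ == "__main__":
    samples = ["  Caf\u00e9\u00a0 Latte \n", "Cafe\u0301 Latte"]
    print([normalize_text(s) for s in samples])
```

After this step, values that look identical on screen also compare as equal, which matters for deduplication and joins across data sources.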
![Page 58: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/58.jpg)
confidence score
![Page 59: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/59.jpg)
Human input
![Page 60: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/60.jpg)
Transparency
![Page 61: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/61.jpg)
summary
![Page 62: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/62.jpg)
Data Quality is essential
![Page 63: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/63.jpg)
think about it from the very beginning
![Page 64: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/64.jpg)
develop a process to measure data quality before scaling up
![Page 65: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/65.jpg)
if you don’t want to reinvent the wheel - contact us!
![Page 66: 2015 - Extract SF - Data Quality](https://reader033.vdocuments.site/reader033/viewer/2022051404/58ecabc71a28abb34a8b45c5/html5/thumbnails/66.jpg)
Thank you!
[email protected]