Post on 23-Jan-2021

Page 1: Automated Testing For Undocumented Assumptions · 2020. 7. 15. · Automated Testing For Protecting Data Pipelines from Undocumented Assumptions Eugene Mandel Head of Product, Superconductive
Automated Testing For Protecting Data Pipelines from Undocumented Assumptions

Eugene Mandel

Head of Product, Superconductive

Agenda

I. What is Pipeline Debt?

II. How does Great Expectations beat pipeline debt?

III. How can I get started?

What is pipeline debt?

Technical debt in data pipelines, mainly as a result of missing tests and documentation.

Your data pipeline

wants to be a hairball

What is pipeline debt?

Undocumented. Untested. Unstable.

Solution: automated testing, BUT code testing ≠ data testing

How does Great Expectations beat pipeline debt?

Always know what to expect from your data

▪ Public launch in 2018

▪ Full-time, active development started June 2019

▪ Most popular OSS library for data pipeline testing

▪ Growing community on Slack and GitHub

An expectation is a declarative statement that describes a property of a dataset

“Values in this column should be between 55 and 90, at least 95% of the time.”

Describe expected behavior
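As a rough illustration of what a statement like this checks, here is a minimal sketch in plain Python. The function name `mostly_between` and the sample readings are hypothetical, and treating an empty column as passing is an assumption; this is not the Great Expectations implementation.

```python
def mostly_between(values, min_value, max_value, mostly):
    """Return True if at least `mostly` of the non-null values lie in [min_value, max_value]."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True  # assumption: an empty column passes
    in_range = sum(min_value <= v <= max_value for v in non_null)
    return in_range / len(non_null) >= mostly


# Hypothetical indoor temperature readings; the last one is out of range.
readings = [68, 70, 71, 69, 72, 70, 67, 73, 70, 71,
            69, 68, 72, 70, 71, 69, 70, 68, 71, 120]

# 19 of 20 readings are in range, so a 95% threshold is met.
print(mostly_between(readings, 55, 90, mostly=0.95))
```

Raising the threshold makes the same data fail, which is the point of stating the tolerance explicitly rather than leaving it as an unwritten assumption.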

{
  "expectation_type": "expect_column_values_to_be_between",
  "kwargs": {
    "column": "temp_f",
    "max_value": 90,
    "min_value": 55,
    "mostly": 0.97
  },
  "meta": {
    "notes": {
      "format": "markdown",
      "content": [
        "this column contains indoor temp readings - CA, spring and summer"
      ]
    }
  }
}

Declarative language
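Because the expectation is plain data, a program can load it and decide at run time which check to apply. The sketch below shows that idea with a hypothetical registry (`CHECKERS`) and a hypothetical `validate` function working on a list of dicts; the checker is a simplified stand-in, not the library's own code, and the sample batch is invented.

```python
import json

# The expectation from the slide, written as valid JSON.
EXPECTATION_JSON = """
{
  "expectation_type": "expect_column_values_to_be_between",
  "kwargs": {"column": "temp_f", "min_value": 55, "max_value": 90, "mostly": 0.97},
  "meta": {"notes": {"format": "markdown",
                     "content": ["this column contains indoor temp readings - CA, spring and summer"]}}
}
"""


def expect_column_values_to_be_between(rows, column, min_value, max_value, mostly=1.0):
    """Simplified stand-in: fraction of non-null values within [min_value, max_value]."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return {"success": True, "observed_fraction": None}
    fraction = sum(min_value <= v <= max_value for v in values) / len(values)
    return {"success": fraction >= mostly, "observed_fraction": fraction}


# Hypothetical registry mapping expectation_type strings to checker functions.
CHECKERS = {"expect_column_values_to_be_between": expect_column_values_to_be_between}


def validate(rows, expectation):
    """Look up the checker named in the expectation and call it with its kwargs."""
    checker = CHECKERS[expectation["expectation_type"]]
    return checker(rows, **expectation["kwargs"])


expectation = json.loads(EXPECTATION_JSON)
rows = [{"temp_f": t} for t in (68, 70, 72, 71, 95)]  # hypothetical batch
result = validate(rows, expectation)
print(result["success"], result["observed_fraction"])
```

The `meta` block is ignored by the check itself; it travels with the expectation as documentation.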

Validate: take the compute to the data

expectation: the same expect_column_values_to_be_between definition as above

class PandasDataset ...
    def expect_column_values_to_be_between( ...

class SparkDFDataset ...
    def expect_column_values_to_be_between( ...

class SqlAlchemyDataset ...
    def expect_column_values_to_be_between( ...
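One way to picture "take the compute to the data": the same expectation parameters are evaluated by whichever backend already holds the data. The sketch below uses a Python list and the standard-library `sqlite3` module as stand-ins for the backends named on the slide. The function names, the `readings` table, and the sample values are all hypothetical, and this is not how the library's dataset classes are written.

```python
import sqlite3

EXPECTATION = {"column": "temp_f", "min_value": 55, "max_value": 90, "mostly": 0.97}


def check_in_memory(rows, column, min_value, max_value, mostly):
    """Backend 1: the data is a Python list, so the counting happens in Python."""
    values = [r[column] for r in rows if r[column] is not None]
    if not values:
        return True  # assumption: an empty column passes
    in_range = sum(min_value <= v <= max_value for v in values)
    return in_range / len(values) >= mostly


def check_in_sql(conn, table, column, min_value, max_value, mostly):
    """Backend 2: the data lives in a database, so the counting is pushed into SQL."""
    # Identifiers cannot be bound as parameters; in this sketch they are
    # trusted, hard-coded names.
    query = (
        f"SELECT COUNT({column}), "
        f"SUM(CASE WHEN {column} BETWEEN ? AND ? THEN 1 ELSE 0 END) "
        f"FROM {table} WHERE {column} IS NOT NULL"
    )
    total, in_range = conn.execute(query, (min_value, max_value)).fetchone()
    if not total:
        return True  # assumption: an empty column passes
    return in_range / total >= mostly


# Hypothetical batch, loaded into both backends.
rows = [{"temp_f": t} for t in (68, 70, 72, 71, 95, None)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (temp_f REAL)")
conn.executemany("INSERT INTO readings VALUES (?)", [(r["temp_f"],) for r in rows])

memory_ok = check_in_memory(rows, **EXPECTATION)
sql_ok = check_in_sql(conn, "readings", **EXPECTATION)
print(memory_ok, sql_ok)
```

In the SQL variant only two numbers leave the database, regardless of how many rows the table holds.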

expect_column_to_exist

expect_table_row_count_to_be_between

expect_column_values_to_be_unique

expect_column_values_to_not_be_null

expect_column_values_to_be_between

expect_column_values_to_match_regex

expect_column_values_to_match_strftime_format

expect_column_mean_to_be_between

expect_column_kl_divergence_to_be_less_than

etc. etc. etc.

great_expectations

Expressive and extensible
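To give a feel for how a vocabulary like this can be extended, here is a minimal sketch in plain Python. It borrows four names from the list above but implements them as simplified stand-ins over Python lists. The registry `EXPECTATIONS`, the `expectation` decorator, and the custom `expect_column_values_to_be_valid_zip5` check are hypothetical and are not part of the library.

```python
import re
from statistics import mean

EXPECTATIONS = {}  # hypothetical registry: name -> function


def expectation(func):
    """Register a check under its function name."""
    EXPECTATIONS[func.__name__] = func
    return func


@expectation
def expect_column_values_to_not_be_null(values):
    return all(v is not None for v in values)


@expectation
def expect_column_values_to_be_unique(values):
    return len(set(values)) == len(values)


@expectation
def expect_column_values_to_match_regex(values, regex):
    return all(re.search(regex, v) is not None for v in values)


@expectation
def expect_column_mean_to_be_between(values, min_value, max_value):
    return min_value <= mean(values) <= max_value


# Extending the vocabulary with a domain-specific check (hypothetical).
@expectation
def expect_column_values_to_be_valid_zip5(values):
    return expect_column_values_to_match_regex(values, r"^\d{5}$")


zips = ["94107", "10001", "60614"]  # hypothetical sample columns
temps = [68, 70, 72, 71]

print(EXPECTATIONS["expect_column_values_to_be_valid_zip5"](zips))
print(EXPECTATIONS["expect_column_mean_to_be_between"](temps, 55, 90))
print(sorted(EXPECTATIONS))
```

A new check built on an existing one reads like the built-in names, so a suite stays legible as it grows.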

Your tests are your docs

Your docs are your tests
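The same expectation that drives a test can also be rendered as a sentence for human readers. The following is a minimal sketch of that idea; `render_expectation` and its wording are hypothetical and do not reproduce the documentation that the library generates.

```python
def render_expectation(expectation):
    """Turn an expectation dict into a one-line, human-readable description."""
    kwargs = expectation["kwargs"]
    if expectation["expectation_type"] == "expect_column_values_to_be_between":
        sentence = (
            f"Values in column '{kwargs['column']}' must be between "
            f"{kwargs['min_value']} and {kwargs['max_value']}"
        )
        mostly = kwargs.get("mostly")
        if mostly is not None:
            sentence += f", at least {mostly:.0%} of the time"
        return sentence + "."
    # Fallback for expectation types this sketch has no template for.
    return f"{expectation['expectation_type']}({kwargs})"


expectation = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "temp_f", "min_value": 55, "max_value": 90, "mostly": 0.97},
}

print(render_expectation(expectation))
```

Because the sentence is generated from the definition that is actually validated, the description cannot drift away from the test.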

Setup and Configuration

Drift

Outliers

Outage

How can I get started?

▪ Check out GitHub: https://github.com/great-expectations/great_expectations

▪ Read the docs: https://docs.greatexpectations.io/en/latest/

▪ Say hi and ask questions on Slack: https://greatexpectations.io/slack

▪ pip install great_expectations

Thank you!