Common Sense Validation Using SAS
Lisa Eckler, Lisa Eckler Consulting Inc.
TASS Interfaces, December 2015
• Holistic approach
• Allocate most effort to what’s most important
• Avoid or automate repetitive tasks
• Ask ourselves the right questions
Defining terms: QA
Data quality assurance is the process of profiling the data to discover inconsistencies, and other anomalies in the data and performing data cleansing activities to improve the data quality.
– Wikipedia
Defining terms: Verification
Verification is the act of reviewing, inspecting, testing, etc. to establish and document that a product, service, or system meets the regulatory, standard, or specification requirements.
Does it meet the structural requirements? Is it complete?
Defining terms: Validation
Validation refers to meeting the needs of the intended end-user or customer.
Does it answer the user’s question? Does it meet all of the needs?
Structure and completeness, data integrity, appropriateness
“Computers are useless. They can only give you answers.”
– Pablo Picasso
How do I know if I got it right?
Is Validation a programming task?
Yes – mostly
The routine parts can and should be automated and repeatable
That leaves more resources for the parts which require human attention
• PROC COMPARE
• PROC CONTENTS
• PROC CONTENTS with compare (using PROC COMPARE or TRANSPOSE, MERGE and flag)
• PROC FREQ +/- PROC FORMAT
• PROC SUMMARY
• PROC SUMMARY + compare (using PROC COMPARE or TRANSPOSE, MERGE and flag)
Wrap a macro around this and you have a flexible, re-usable tool!
Does this mean writing more SAS code after I thought I was finished writing SAS code?
• Yes… and no
• We can save time and improve the quality of results by using code that isn’t part of the final program.
• Don’t think of it as disposable, though: this code can be set up once and saved to use for all future validation efforts.
Additional benefits
• Automated validation provides a log
• Easily repeatable
What are the questions?
• Should this be a replication of something I have seen before? If not, is it similar to something I’ve done before?
• Is it – or some part of it – supposed to be different from anything I’ve seen before?
• Is the result packaged properly?
Mantra for Validation
• Check your assumptions
• Confirm similarities
• Focus on differences
How is this result expected to compare with what we’ve seen before?
• Entirely different
• Some overlap
• Complete overlap
• Subset
Some possibilities – not an exhaustive list!
** This is the simplest form of **;
** comparison between two sets of data **;
proc compare compare = SHOES base = OLD_SHOES;
run;
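To try this comparison, two tables are needed. The slides do not say where SHOES and OLD_SHOES come from, so the following is only a sketch that assumes both are copies of SASHELP.SHOES, with one value changed on purpose so that PROC COMPARE has something to report:

```sas
** Assumption: the demo tables are copies of SASHELP.SHOES **;
data OLD_SHOES;
  set sashelp.shoes;
run;

data SHOES;
  set sashelp.shoes;
  ** Introduce one deliberate difference in the first row **;
  if _n_ = 1 then Sales = Sales + 1;
run;

proc compare compare = SHOES base = OLD_SHOES;
run;
```

With no ID statement, PROC COMPARE matches observations by position, so both tables need to be in the same row order.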
** PROC CONTENTS gives us metadata **;
proc contents data = OLD_SHOES;
run;
** CONTENTS with select facts saved to **;
** a data set --> a table of metadata **;
proc contents data = OLD_SHOES
              out = CONTENTS_OLD_SHOES(keep=name type length);
run;
** Same as previous slide except for the **;
** new data set **;
proc contents data = NEW_SHOES
              out = CONTENTS_NEW_SHOES(keep=name type length);
run;
** Comparing metadata tables rather than **;
** data tables **;
proc compare compare = CONTENTS_OLD_SHOES base = CONTENTS_NEW_SHOES;
run;
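Because PROC COMPARE matches rows by position unless an ID statement is given, a single added or dropped column can shift every later row of the metadata tables out of alignment. A possible refinement, sketched here with the table names from the slides, is to match the metadata rows by variable name:

```sas
** Sort both metadata tables by variable name **;
proc sort data = CONTENTS_OLD_SHOES;
  by name;
run;

proc sort data = CONTENTS_NEW_SHOES;
  by name;
run;

** Match rows on NAME; LISTALL reports names found in only one table **;
proc compare compare = CONTENTS_OLD_SHOES base = CONTENTS_NEW_SHOES listall;
  id name;
run;
```

The ID match is case-sensitive, so a column called Region in one table and REGION in the other would be reported as two different names.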
proc contents data = OLD_SHOES out = CONTENTS1(keep=name type length);
run;
proc contents data = NEW_SHOES out = CONTENTS2(keep=name type length);
run;
proc compare compare = CONTENTS1 base = CONTENTS2;
run;
%macro COMPARE_STRUCTURE1;
  proc contents data = OLD_SHOES out = CONTENTS1(keep=name type length);
  run;
  proc contents data = NEW_SHOES out = CONTENTS2(keep=name type length);
  run;
  proc compare compare = CONTENTS1 base = CONTENTS2;
  run;
%mend COMPARE_STRUCTURE1;

%COMPARE_STRUCTURE1;
%macro COMPARE_STRUCTURE(DS1,DS2);
  proc contents data = &DS1 out = CONTENTS1(keep=name type length);
  run;
  proc contents data = &DS2 out = CONTENTS2(keep=name type length);
  run;
  proc compare compare = CONTENTS1 base = CONTENTS2;
  run;
%mend COMPARE_STRUCTURE;

%COMPARE_STRUCTURE(OLD_SHOES, NEW_SHOES);
** We have just built a generic tool for comparing **;
** the STRUCTURE of any two SAS data sets **;
%COMPARE_STRUCTURE( <any SAS data set name>, <any other SAS data set name> );
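The earlier list of techniques also mentions "MERGE and flag" as an alternative to PROC COMPARE. A minimal sketch of that variant follows. The macro name FLAG_STRUCTURE, the output table STRUCTURE_DIFFS and the flag texts are hypothetical and do not come from the slides:

```sas
%macro FLAG_STRUCTURE(DS1, DS2);
  /* Metadata for each table, with TYPE and LENGTH renamed so both survive the merge */
  proc contents data = &DS1 noprint
                out = CONTENTS1(keep=name type length
                                rename=(type=type1 length=length1));
  run;
  proc contents data = &DS2 noprint
                out = CONTENTS2(keep=name type length
                                rename=(type=type2 length=length2));
  run;

  proc sort data = CONTENTS1; by name; run;
  proc sort data = CONTENTS2; by name; run;

  /* Merge by variable name and keep only the rows that do not match */
  data STRUCTURE_DIFFS;
    length flag $ 20;
    merge CONTENTS1(in=in1) CONTENTS2(in=in2);
    by name;
    if      not in2            then flag = 'ONLY IN DS1';
    else if not in1            then flag = 'ONLY IN DS2';
    else if type1 ne type2     then flag = 'TYPE DIFFERS';
    else if length1 ne length2 then flag = 'LENGTH DIFFERS';
    if flag ne ' ';
  run;

  proc print data = STRUCTURE_DIFFS;
  run;
%mend FLAG_STRUCTURE;

%FLAG_STRUCTURE(OLD_SHOES, NEW_SHOES);
```

An empty STRUCTURE_DIFFS table means the two structures agree on name, type and length. In the PROC CONTENTS output table, TYPE is 1 for numeric and 2 for character columns.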
Reasonableness: complete overlap
“_character_” gives the list of ALL vars in the table with data type character, which may include some vars with too many values
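The code for this slide is not in the text. A minimal sketch of a frequency step that uses the `_character_` keyword, assuming the NEW_SHOES table, might look like this:

```sas
** Frequency counts for every character column in the table **;
proc freq data = NEW_SHOES;
  tables _character_ / missing;
run;
```

The MISSING option counts missing values as a category of their own, which is often what a reasonableness check needs to see.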
This code also gives a list of ALL vars in the table with data type character
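One way to build such a list explicitly is to query the DICTIONARY.COLUMNS metadata view. This is a sketch only: it assumes NEW_SHOES is in the WORK library, and the macro variable name CHAR_VARS is hypothetical.

```sas
** Collect the names of all character columns into a macro variable **;
proc sql noprint;
  select name
    into :CHAR_VARS separated by ' '
    from dictionary.columns
   where libname = 'WORK'
     and memname = 'NEW_SHOES'
     and type    = 'char';
quit;

%put Character columns: &CHAR_VARS;

proc freq data = NEW_SHOES;
  tables &CHAR_VARS / missing;
run;
```

LIBNAME and MEMNAME are stored in upper case in the dictionary view. If the table has no character columns, CHAR_VARS is never created and the later references fail.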
The above code lets us customize our list to exclude non-categorical character columns and include the others
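A customized list of this kind could be built by adding conditions to a DICTIONARY.COLUMNS query. The sketch below assumes NEW_SHOES is in WORK and has the columns of SASHELP.SHOES. Excluding Subsidiary (many distinct values) and including the numeric column Stores are illustrative choices, and CAT_VARS is a hypothetical name.

```sas
** Build a custom list of the columns to treat as categorical **;
proc sql noprint;
  select name
    into :CAT_VARS separated by ' '
    from dictionary.columns
   where libname = 'WORK'
     and memname = 'NEW_SHOES'
     and (   (type = 'char' and upcase(name) not in ('SUBSIDIARY'))
          or upcase(name) in ('STORES') );
quit;

proc freq data = NEW_SHOES;
  tables &CAT_VARS / missing;
run;
```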
Similar to the way we compared the structure of two tables, we can compare the frequency counts of values in two tables
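The frequency comparison can be sketched in the same style as COMPARE_STRUCTURE. The macro name COMPARE_FREQ and the work table names FREQ1 and FREQ2 are hypothetical. The Region column is assumed to exist in both tables.

```sas
%macro COMPARE_FREQ(DS1, DS2, VAR);
  /* Save the frequency counts of one column from each table */
  proc freq data = &DS1 noprint;
    tables &VAR / missing out = FREQ1(keep = &VAR count);
  run;
  proc freq data = &DS2 noprint;
    tables &VAR / missing out = FREQ2(keep = &VAR count);
  run;

  /* Match on the column value and compare the counts */
  proc compare compare = FREQ1 base = FREQ2 listall;
    id &VAR;
  run;
%mend COMPARE_FREQ;

%COMPARE_FREQ(OLD_SHOES, NEW_SHOES, Region);
```

LISTALL reports values that appear in only one of the two tables, which is usually the first thing worth investigating.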
Data correctness: complete overlap
proc compare compare = OLD_SHOES base = NEW_SHOES;
run;
Judicious use of unrestricted PROC COMPARE -- after confirming reasonableness
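For an automated run, the outcome of the unrestricted comparison can be captured in the log and in a table rather than read from the listing. A sketch, in which COMPARE_DIFFS and COMPARE_RC are hypothetical names:

```sas
** Write only the differing observations to a table, suppress the report **;
proc compare compare = OLD_SHOES base = NEW_SHOES
             out = COMPARE_DIFFS outnoequal noprint;
run;
%let COMPARE_RC = &SYSINFO;

%put NOTE: PROC COMPARE return code is &COMPARE_RC (0 means no differences were found).;
```

PROC COMPARE stores its return code in the automatic macro variable SYSINFO, which is reset at the next step boundary, so it is copied immediately after the RUN statement.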
If we are expecting a result that is a complete replication of something that already exists
• Confirm that the structure is identical
• Confirm that the data is the same at a high level
• Confirm that the data is the same at a detailed level
Fully automated
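The three confirmations can be chained in a single macro. This is a sketch rather than the author's tool: the macro name VALIDATE_REPLICA and the work table names are hypothetical, and the high-level check assumes that both tables contain at least one numeric column.

```sas
%macro VALIDATE_REPLICA(DS1, DS2);
  /* 1. Structure: compare the metadata tables */
  proc contents data = &DS1 noprint out = CONTENTS1(keep=name type length);
  run;
  proc contents data = &DS2 noprint out = CONTENTS2(keep=name type length);
  run;
  proc compare compare = CONTENTS1 base = CONTENTS2;
  run;

  /* 2. High level: compare default summary statistics of the numeric columns */
  proc summary data = &DS1;
    var _numeric_;
    output out = SUMMARY1;
  run;
  proc summary data = &DS2;
    var _numeric_;
    output out = SUMMARY2;
  run;
  proc compare compare = SUMMARY1 base = SUMMARY2;
  run;

  /* 3. Detailed level: compare the data itself */
  proc compare compare = &DS1 base = &DS2;
  run;
%mend VALIDATE_REPLICA;

%VALIDATE_REPLICA(OLD_SHOES, NEW_SHOES);
```

With no statistic keywords on the OUTPUT statement, PROC SUMMARY writes N, MIN, MAX, MEAN and STD for each variable, one statistic per row.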
Data correctness: completely new
What if we don’t have an existing results table to compare to?
• Similar SAS data in an existing table or produced by someone else?
• Similar data in some other format that can be imported into SAS for comparison?
• Do we have a data requirements document?
• The truly original data will require much greater attention to validation and the involvement of a subject matter expert
Packaging: completely new
Assuming we have a Requirements “document”…
• Import REQUIREMENTS into a SAS data set
• Run PROC CONTENTS on the new data set to get CONTENTS_NEW_SHOES
• Run PROC COMPARE, comparing CONTENTS_NEW_SHOES to REQUIREMENTS
OR
• Join REQUIREMENTS with CONTENTS_NEW_SHOES and flag non-matching rows
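The join-and-flag option can be sketched as follows. It assumes that REQUIREMENTS has already been imported and has the columns name, type and length coded the same way as PROC CONTENTS output (type 1 = numeric, 2 = character, name as a 32-character column). The layout of a real requirements document will differ, and the table and flag names here are hypothetical.

```sas
proc sort data = REQUIREMENTS       out = REQ_SORTED; by name; run;
proc sort data = CONTENTS_NEW_SHOES out = NEW_SORTED; by name; run;

** Join requirements to actual metadata and keep the mismatches **;
data PACKAGING_ISSUES;
  length issue $ 30;
  merge REQ_SORTED(in = in_req rename = (type = req_type length = req_length))
        NEW_SORTED(in = in_new);
  by name;
  if      not in_new           then issue = 'REQUIRED BUT MISSING';
  else if not in_req           then issue = 'NOT IN REQUIREMENTS';
  else if type   ne req_type   then issue = 'TYPE DIFFERS';
  else if length ne req_length then issue = 'LENGTH DIFFERS';
  if issue ne ' ';
run;

proc print data = PACKAGING_ISSUES;
run;
```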
Reasonableness: completely new
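With nothing to compare against, a reasonableness check relies on profiling the new table on its own. The sketch below uses the procedures named earlier in the deck (PROC SUMMARY, PROC FREQ and PROC FORMAT). It assumes NEW_SHOES has a numeric Sales column, as SASHELP.SHOES does, and the format name SALESGRP is hypothetical.

```sas
** Counts, missing counts and ranges for every numeric column **;
proc summary data = NEW_SHOES print n nmiss min max mean;
  var _numeric_;
run;

** Group a numeric column into a few bands so that PROC FREQ is readable **;
proc format;
  value SALESGRP
    low -< 0  = 'Negative'
    0         = 'Zero'
    0 <- high = 'Positive';
run;

proc freq data = NEW_SHOES;
  tables Sales / missing;
  format Sales SALESGRP.;
run;
```

Unexpected negative values, zero counts or missing values are the kind of result to take to a subject matter expert.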
What if part of our result should be the same as an existing result but there should be some differences?
• Treat it as a hybrid and split the validation exercise into two parts
  • Expected same (by rows, columns, data or metadata)
  • Expected different (by rows, columns, data or metadata)
• For each of the two parts
  • Confirm (expected) similarities
  • Focus efforts on (expected) differences
• Run the validation procedures we’ve already looked at as appropriate for the “same” and “different” aspects
Recall the scenario where our data sets should be identical
record_id  a  b  c
1          *  *  *
2          *  *  *
3          *  *  *

record_id  a  b  c
1          *  *  *
2          *  *  *
3          *  *  *
When some columns should be the same

record_id  a  b  c
1          *  *
2          *  *
3          *  *

record_id  a  b  d
1          *  *
2          *  *
3          *  *
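When only some columns are expected to match, PROC COMPARE can be restricted to them with a VAR statement. A sketch with made-up data: the table names TABLE1 and TABLE2 and all values are hypothetical, and only the column names come from the slide.

```sas
data TABLE1;
  input record_id a b c;
  datalines;
1 10 20 30
2 11 21 31
3 12 22 32
;
run;

data TABLE2;
  input record_id a b d;
  datalines;
1 10 20 99
2 11 21 98
3 12 22 97
;
run;

** Compare only the columns expected to be the same, matching rows on record_id **;
proc compare compare = TABLE1 base = TABLE2;
  id record_id;
  var a b;
run;
```

Columns c and d are left out of the comparison and can be validated separately as "expected different".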
When some “cells” (parts of rows and columns) should be the same

record_id  a  b  c
1          *  *
2          *  *
3          *  *

record_id  a  b  d
1          *  *
3          *  *
4
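When only some rows and some columns are expected to match, the comparison can be narrowed on both dimensions: WHERE= data set options select the rows and a VAR statement selects the columns. A sketch with made-up data: TABLE1, TABLE2 and all values are hypothetical, and the choice of records 1 and 3 follows the slide.

```sas
data TABLE1;
  input record_id a b c;
  datalines;
1 10 20 30
2 11 21 31
3 12 22 32
;
run;

data TABLE2;
  input record_id a b d;
  datalines;
1 10 20 99
3 12 22 97
4 13 23 96
;
run;

** Compare columns a and b for the records expected in both tables **;
proc compare compare = TABLE1(where = (record_id in (1, 3)))
             base    = TABLE2(where = (record_id in (1, 3)));
  id record_id;
  var a b;
run;
```

Dropping the WHERE= options and adding LISTALL would instead have PROC COMPARE report records 2 and 4 as present in only one table.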
Review:
Summary:
• Ask the right questions
• Confirm similarities with known things – quickly and programmatically – then focus time and effort on validating “unknown” or new things
• Use basic Base SAS procedures for validation
• Vary the technique based on how much is similar to or different from what you’ve validated previously and what types of data are involved
You can find my related conference papers at www.lexjansen.com
• Don’t Forget About Small Data (SESUG 2015)
• When Good Looks Aren’t Enough (NESUG 2009)
If you have comments or questions…