Posted on 25-Feb-2016
Auto-Validating Your Data Store: A do-it-yourself approach to
data integrity and anomaly detection.
Evan Davies
Office of Strategic Planning and Analysis
The College of William and Mary in Virginia
Overview
• Institutional researchers increasingly rely on data marts, stores, and warehouses for management information.
• Given the perpetually ‘developing’ status of these environments, both commercial and institutional, you can spend significant time dealing with data that does not meet even shifting standards for table logic, variable conventions, and values.
• Such anomalies can stop production programs or lead to inaccurate information until detected.
Does this ever happen to you?
• NOTE: Table WORK.ENROLL created, with 8241 rows and 19 columns.
• NOTE: Table WORK.PERSON created, with 8241 rows and 13 columns.
• NOTE: Table WORK.MAJOR created, with 8253 rows and 30 columns.
• NOTE: Table WORK.MAJOR_PERSON created, with 8264 rows and 30 columns.
• Who are these extra people? Why are they in here?
Agenda
• This presentation demonstrates a simple yet sophisticated way of using SAS® Enterprise Guide to check tables, variables, and values automatically to find out if that data meets IR standards and premises before you start analytical work (and to let others know that things may need fixing).
Things You Should Know…
• ‘Simple’ is a relative term, as is ‘sophisticated’
• This involves more coding than mouse clicking
• You should know at least some concepts of SAS® coding, SQL, and relational databases
• To make this work back at your campus, you need to know (or find out) how to access your data
• If you have significant structured programming and SQL experience, please refrain from laughing out loud. Snickering is acceptable.
The College of William & Mary
• The only royally chartered colonial institution, 1693, by King William III and Queen Mary II…
• Making it the second oldest college in the United States
• Phi Beta Kappa, the first Greek honor society, was founded here in 1776
• Became state-supported in 1906 and coeducational in 1918
• The Alma Mater of George Washington and Thomas Jefferson, as well as Jon Stewart and Secretary of Defense Robert Gates
• Named one of Intel's 50 Most Unwired College Campuses for our campus-wide wireless network
• The Colonial Campus section of the 1,200-acre campus is restored to its 18th-century appearance
The Wren Building (1700)
The Oldest Academic Building in Continuous Use in the U.S.
The College Today…
• 5,800 undergraduates and 1,950 graduate students from all 50 states and 30 foreign countries
• 22 percent are students of color
• 79 percent of freshmen graduated in the top ten percent of their class
• Highest SAT middle 50th range of all public institutions in Virginia
• 11:1 student-faculty ratio
• W&M has more recipients of the Commonwealth's Outstanding Faculty Award than any other institution
• 5 undergraduate and graduate schools: Arts & Sciences, Business, Education, Law, and Marine Science
• 36 undergraduate programs, 12 masters, doctoral, and professional degrees
• W&M is a Highly Selective Public Liberal Arts University
[Diagram: College Systems that Interface with Administrative Computing Systems. Three core administrative systems (HRS, the Human Resource System: applications, personnel management, position control, benefits, work study, payroll, A21 certification, CARS interface; FRS, the Financial Record System: accounts payable, purchasing, general ledger, budget, grants accounting, fixed assets; and the old and new SIS, the Student Information System: prospects, admissions, student records, registration, course schedule, billing, transcripts) exchange data with external entities (IRS/SSA, VA Dept of Taxation, VA DOA CARS, VA DPT/PMIS, VRS, VEC, DPB, APA, SCHEV, Federal Dept of Labor, Federal Reserve, Social Security Administration, TIAA-CREF, FSA administrator, benefits and office-supply vendors, NCAA, SAT/ACT/GRE/MAT/MCAT score feeds, Pell Grants and ISIR financial aid input, National Clearing House loan verification, Peterson's prospective student data) and with campus systems (old and new financial aid/Power FAIDS, cash receipts, FAACS fixed assets, WORCS student web, old and new campus police, DARS, Checks 1-2-3, identification/food services, parking, door access, student health, SWEM patron info, Mysoft call accounting, alumni development, athlete tracking, Schedule 25/Resource 25, work orders, warehouse, office supplies, leave accounting, 1500-hour tracking, faculty salary tracking, Wiz Kid, web applications).]
Data History
We had many homegrown systems, some based on Information Associates® architecture, but extensively modified.
Recent Data History
• After false starts, in 2003 we bought a large Banner enterprise system, and then added a data mart, and then a data store. We are now five years into our two-year installation period.
• Or put another way, we are ‘current’ on version(s) of 8.2 and 3.1, with new version(s) around the corner.
• When the IR staff start to understand the relationships and foibles of a particular version, it is time to upgrade to a newer version in which some things are fixed… and some other things are broken, er, ‘differently enabled’.
[Diagram: Production (v.6, v.7, v.8…), DataMart (v.1, v.2…), DataStore (v.1, v.2, v.3…), each on its own version cycle]
Things To Realize…
• SungardHE Banner® products are not a bad system. Nor are any other commercial vendor’s products.
• Any enterprise-level system with a data store is a permanently evolving, almost organic entity,
• with multiple input opportunities for breaking constraints and premises
• and for finding new ways to induce unexpected results through changing business rules, institutional decentralization, flexibility, and collegiality.
Data Integrity in Pre- and Post-Enterprise Systems
• Old History: I.T. used to do a system General Edit, with “general” meaning edited for operational purposes, not analytical purposes. If the data got the payroll to run today or allowed you to admit a student, it was ‘valid’.
• New History: Data validity is still measured against operational standards.
Data Integrity Post-Enterprise System
• And I.T. is now busier than ever just making the production system ‘run’, without markedly larger numbers of staff to commit to the data warehousing activity. They actually do less general editing because more transactions are interactive rather than process-oriented.
• At the end of the day, data is off-loaded into the data store, in different forms and with different premises from how the data is held in the production system.
Data Integrity Pre- and Post-Enterprise System
• This means that there is now an even bigger split between operational and analytical data integrity, especially in terms of the differing forms of the data.
• The data store is not as rigorously evaluated for data integrity or meaning precisely because it is not production. And since the operational offices of the institution are satisfied with their pieces of the data pie, everything is working fine.
The Institutional Research Role
• IR has always had premises for data reporting that go beyond ‘general edit’.
• We analyze data relationships that make up the whole picture.
• We work in the aggregate, rather than by individual transaction.
• We are ideally poised to discover the anomalies that occur between multiple institutional sources -- between one office’s interpretation of a transaction and another office’s idea of the same information.
A Data Store’s Added Task to IR
• IR now has to do the same comprehensive validation tasks that we used to do, plus identify and deal with the newer ‘introduced’ problem of tables that violate their own premises or have unexpected values due to complex dynamics among:
[Diagram: RULES, STORE, PEOPLE, CHANGES, PROD UPGRADE interacting]
The Complex Dynamics
• imperfect translation of data between the production transaction system and the data store/warehouse,
• vendor maintenance or institutional business rules changes that intentionally induce changes in tables,
• generally caused by the inability to predict all of the systemic results of making unspecified or unimagined changes in a system, sometimes known as the Butterfly Effect,
• the continuous upgrade cycles applied to the system,
• and data that results from imperfectly recorded transactions in uncertain environments with less than adequate collaboration and training.
So What Can We Do?
• Write a program to
– check the logic premises of frequently used tables
– check for missing or out-of-range values in data that affects IR
– run the program frequently to uncover problems in time for the current census and prevent future term anomalies. Keep the results for documentation.
– communicate findings promptly and efficiently in order to effect change
– build in flexibility to test different things at different times in different ways
Limitations in Place
• Do it on your own, since it is for IR purposes. Remember, the data already meets everybody else’s needs.
• Use existing resources. Keep it simple.
• Someone else is going to have to be notified to deal with the data anomaly once it is identified.
• Don’t weaponize the process. This is why we choose to have anomalies, not mistakes.
How To Accomplish?
• The use of SAS® EG on a PC platform allows for the remote submission of a premise and data-checking program during off-hours, when demand on the system is lessened.
• It also allows a convenient environment in which to store the program and results, access validation and history tables, and send automatic e-mails to interested parties such as admissions, registrar, IR and IT staff.
• “SAS® Enterprise Guide®, a powerful Microsoft Windows client application that provides a guided mechanism to exploit the power of SAS and publish dynamic results throughout your institution. It’s the preferred interface to SAS for analysts, statisticians and programmers. SAS Enterprise Guide saves time by automatically generating computer code with an easy point-and-click interface.”
• Think of it as a graphical office environment for the SAS language. As Microsoft Outlook ties MS Office products together, Enterprise Guide has a similar role for SAS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Still not clear on SAS/EG®?
• It is an environment in which SAS 9.1 runs;
• It can bring together data views and any type of files from any network data servers, including Oracle.
• It contains SAS program(s), the larger project, notes, and a graphical description of how the project, processes, and programs relate to each other.
• It documents how any processes or programs have been run, and the results.
• It can generate code for procedures and datasteps.
• It inherits libraries, autoexecs, etc. from 9.1
We Use Enterprise Guide® To…
• Schedule and launch the anomaly tracking job
• Provide a comprehensive project environment for IR staff to be able to visit independently and add new anomaly checks
• Be able to see and modify the associated tables and data in one place
• Provide our novice IR staff a more centralized and friendlier view of the process
What Does It Look Like?
Can You Just Use SAS® Itself?
• YES!
• Provided you schedule the job through MS Task Scheduler and have no desire for the previously mentioned features or facilities.
• Programmatically, all process features are part of base SAS v9.1 on an XP or Vista platform.
Beginning Steps
• Survey yourself and other staff (IR and other offices) to make a preliminary list of tables and values that have issues
• Establish the names, emails, and hierarchy of those who will receive automated communications.
• Make a calendar of when you want to test certain items due to different functional cycles (admissions, registration, etc)
Let’s Go Coding…
• I can’t show you the entire code for my anomaly program today.
• It would need to be modified to fit your data structures and needs anyway.
• Instead I will concentrate on imparting some:
• Key Program Ideas
• SQL techniques
• to enable you to develop your own program
The Overall Design Concept
Set up a SAS program which launches automatically using SAS/EG
Construct some macros to help pass anomaly parameters
Acquire data for testing from your datastore tables
Use Proc SQL and other steps and procedures to test data
The Overall Design Concept (2)
Test tables and variables based on the academic year and institutional work cycle
Anomalies ‘fixed’ are deleted from the master; new ones are added to it
The master table is subset and sent as email at various intervals and detail levels
A history set of anomaly transactions is kept for study
Set up a SAS program which launches automatically using SAS/EG
%let term_start_limit=200325;
%let term_end_limit =201030;
%let highestssn='772';
%let es_validset='EL','MW','WD','WM','WW'; /*all valid statuses encountered by enrolled students AFTER enrollment/dropadd*/
%let es_bad='QW','WB','AW'; /*all bad or ineligible statuses not to be counted by census date*/
%let tooold='1900'; *out-of-range birthday year cutoff;
Start by setting up metadata necessary for testing
Macro variable assignment increases flexibility as table names change.
%let table_enroll=enrollment;
%let table_academic_study=academic_study;
%let table_addcurrent=address_current;
%let table_prevedslot=previous_education_slot;
%let table_schedule=schedule_offering;
%let table_course_catalog=course_catalog;
%let table_course=student_course;
Start by setting up metadata necessary for testing
*format table for anomalies;
proc format;
value anom
1 = 'withdrawn with classes'
2 = 'duplicate person recs '
3-4 = 'missing value '
5 = 'ssn out-of-range '
6 = 'Two ids, one ssn '
7 = 'dup recs course_catalo '
……;
run;
Construct some macros to help pass anomaly parameters
create table todayterm as
select stvterm_code as studyterm,
       datepart(stvterm_start_date) as begd,
       datepart(stvterm_end_date) as endd
from (your data source)
where today() > datepart(stvterm_start_date)
  and today() < datepart(stvterm_end_date);
*set a global study term variable based on value in todayterm;
data _null_;
set todayterm;
call symput('study',studyterm);
run;
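The same current-term lookup can be sketched outside SAS. Here is a minimal Python version; the sample term rows and the `current_term` function name are invented for illustration:

```python
# Pick the academic term whose date window contains today, mirroring the
# todayterm query above (term codes follow the Banner stvterm convention).
from datetime import date

terms = [  # (term code, start date, end date): made-up sample rows
    ("200910", date(2008, 8, 27), date(2008, 12, 19)),
    ("200920", date(2009, 1, 14), date(2009, 5, 8)),
]

def current_term(terms, today):
    # Strict inequalities, matching the slide's today() > start and today() < end.
    for code, start, end in terms:
        if start < today < end:
            return code
    return None

print(current_term(terms, date(2008, 10, 1)))  # 200910
```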
Construct some macros to help pass anomaly parameters
• Use up to 3 variables to show people what is anomalous about any particular situation. Name them A, B, and C. The values you pass to these variables will differ with each problem.
• You will need to put both the value and the name of the variable into your anomaly report, plus some identifier(s) for the student or entity, plus some anomaly details.
Construct some macros to help pass anomaly parameters
*Macro to help transfer of anomalies;
%macro keep;
(keep= id person_uid aval aval_desc bval bval_desc cval cval_desc
       anom studyterm first_anom_date suspend_date data_own)
%mend keep;
Construct some macros to help pass anomaly parameters
%macro pass (aval=,bval=,cval=,dsn=,anom=,studyterm=,suspend_date=,data_own=);
aval_desc = "&aval";
bval_desc = "&bval";
cval_desc = "&cval";
aval = put(&aval,$25.);
bval = put(&bval,$25.);
cval = put(&cval,$25.);
anom = &anom;
studyterm = put(&studyterm,$6.);
first_anom_date = today();
suspend_date = &suspend_date;
data_own = &data_own;
%mend pass;
Construct some macros to help pass anomaly parameters
PASSES THE NAME OF THE VARIABLE
PASSES char VALUE OF THE VARIABLE
PASSES OTHER VALUES
proc sql;
connect to odbc as mydb (datasrc="&datasrc" user=&user password=&password);
create table addressc as
select * from connection to mydb (
  select m.PERSON_UID, m.id, n.address_type, n.postal_code,
         n.city, n.county, n.state_province, n.nation
  from &table_enroll m
  inner join &table_addcurrent n
    on m.person_uid = n.entity_uid
  where m.ACADEMIC_PERIOD in (&study)
    and ((m.ENROLLED_IND='Y' and m.REGISTERED_IND='Y')
      or (m.ENROLLED_IND='Y' and m.ENROLLMENT_STATUS in (&es_set)))
    and n.address_type in ('IN', 'P1', 'MA')
);
Acquire data for testing from your datastore
Acquire data for testing from your datastore
…from &table_enroll m
inner join &table_addcurrent n
  on m.person_uid = n.entity_uid
where m.ACADEMIC_PERIOD in (&study)
  and n.address_type in ('IN', 'P1', 'MA')
[Diagram: STUDENTS THIS TERM joined to ALL ADDRESSES yields ADDRESSES TO CHECK]
Acquire data for testing from your datastore
Let your server do the heavy data work
[Diagram: SQL call sent to Oracle®; result set returned]
• The most useful technique for detecting a table that violates record premises is joining the table back to a copy of itself that has been summarized to count rows per key, keeping only those records that don’t meet the expected count.
• A SQL statement that uses a count() function, a ‘group by’ clause, and a ‘having’ clause does this job effectively.
Use Proc SQL and other procedures to test logic/values
select l.*
from (select PERSON_UID, ID, ACADEMIC_PERIOD, PROGRAM,
             PRIMARY_PROGRAM_IND, ADMISSIONS_POPULATION
      from &table_academic_study
      where academic_period in (&study)) l
inner join
     (select person_uid
      from &table_academic_study
      where academic_period in (&study)
      group by person_uid, program
      having count(person_uid) > 1) r
on l.person_uid = r.person_uid
Use Proc SQL and other procedures to test logic/values
select person_uid
from &table_academic_study
where academic_period in (&study)
group by person_uid, program
having count(person_uid) > 1
Use Proc SQL and other procedures to test logic/values
Raw rows:
PERSON_UID  PROGRAM  ACADEMIC_PERIOD  COUNT
1234        BA-GOVT  200910           1
1234        BA-GOVT  200910           1

After group by / having:
PERSON_UID  PROGRAM  ACADEMIC_PERIOD  COUNT
1234        BA-GOVT  200910           2
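The count()/group by/having pattern is portable beyond SAS and Oracle. A self-contained sketch using Python's sqlite3; the table mirrors the slide's academic_study example and the sample rows are invented:

```python
import sqlite3

# In-memory stand-in for the academic_study table (hypothetical data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE academic_study "
            "(person_uid INTEGER, program TEXT, academic_period TEXT)")
con.executemany(
    "INSERT INTO academic_study VALUES (?, ?, ?)",
    [(1234, "BA-GOVT", "200910"),
     (1234, "BA-GOVT", "200910"),   # duplicate: violates one-row-per-program premise
     (5678, "BS-BIOL", "200910")],
)

# Join the table back to a summarized copy of itself, keeping only rows
# whose (person_uid, program) group appears more than once.
dupes = con.execute("""
    SELECT l.person_uid, l.program, l.academic_period
    FROM academic_study l
    INNER JOIN (SELECT person_uid
                FROM academic_study
                GROUP BY person_uid, program
                HAVING COUNT(person_uid) > 1) r
    ON l.person_uid = r.person_uid
""").fetchall()
print(dupes)  # both copies of the duplicated row come back
```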
select l.person_uid, r.person_uid as other_uid,
       l.tax_id, r.tax_id as other_tax_id, l.full_name_lfmi
from person l inner join person r
  on l.TAX_ID = r.TAX_ID
 and l.person_uid <> r.person_uid
where l.tax_id is not null
  and r.tax_id is not null
This SQL will find two different university ids that share the same SSN. This generally occurs when the institution has issued two ids to the same person without an adequate search of records. The faster this is spotted, the better.
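The same self-join, translated into a runnable sqlite3 sketch; the ids and SSNs below are fabricated sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (person_uid INTEGER, tax_id TEXT)")
con.executemany("INSERT INTO person VALUES (?, ?)",
                [(1111, "123456789"),
                 (2222, "123456789"),   # same SSN issued to a second university id
                 (3333, "987654321"),
                 (4444, None)])         # NULL SSNs must be excluded from the join

# Self-join on tax_id, excluding a row matching itself and rows with no SSN.
pairs = con.execute("""
    SELECT l.person_uid, r.person_uid AS other_uid, l.tax_id
    FROM person l INNER JOIN person r
      ON l.tax_id = r.tax_id AND l.person_uid <> r.person_uid
    WHERE l.tax_id IS NOT NULL AND r.tax_id IS NOT NULL
""").fetchall()
print(pairs)  # each shared SSN appears once per direction
```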
Use Proc SQL and other procedures to test logic/values
In addition to testing table logic, once you have the datasets established, any variety or combination of values can be tested. Here are four conditions to get you started thinking about what can be tested:
- missing values;
if citizenship_type = '' then do;

- withdrawn with classes;
if enrolled_ind = 'Y' and registered_ind = 'Y' and enrollment_status = 'WB' then do;
Use Proc SQL and other procedures to test logic/values
if state_province in ('AA','AE','AP','PR','VI','AL','AK','AZ','AR','CA', … 'WI','WY') and nation > ''
or state_province in ('AB','BC','MB','NB','NL','NT','NS','NU','ON','PE','QC','SK','YT') and nation ^= 'CA'
or nation = 'CA' and state_province not in ('AB','BC','MB','NB','NL','NT','NS','NU','ON','PE','QC','SK','YT')
or state_province in ('FC','HK','RQ','XX')
then do;
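The state/nation consistency rules above condense into a small predicate. A hedged Python sketch; the code lists here are abbreviated samples (the full US state list is elided just as on the slide), and the function name is invented:

```python
# US/military state codes (sample), Canadian provinces, and obsolete
# 'foreign' placeholder codes, mirroring the slide's lists.
US_STATES = {"AA", "AE", "AP", "PR", "VI", "AL", "AK", "AZ", "VA", "WI", "WY"}
CA_PROVINCES = {"AB", "BC", "MB", "NB", "NL", "NT", "NS", "NU",
                "ON", "PE", "QC", "SK", "YT"}
PLACEHOLDERS = {"FC", "HK", "RQ", "XX"}

def address_anomaly(state_province: str, nation: str) -> bool:
    return ((state_province in US_STATES and nation > "")           # US state but nation filled in
            or (state_province in CA_PROVINCES and nation != "CA")  # Canadian province, wrong nation
            or (nation == "CA" and state_province not in CA_PROVINCES)
            or state_province in PLACEHOLDERS)                      # obsolete placeholder codes

print(address_anomaly("VA", ""))   # False: normal US address
print(address_anomaly("ON", ""))   # True: Ontario without nation CA
```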
Use Proc SQL and other procedures to test logic/values
- ssn out of range;
if tax_id ^= '' then do;
ssnverf = indexc(substr(tax_id,1,9), ' ', '-abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
if substr(tax_id,6,4) = '0000'
   or substr(tax_id,1,3) < '001'
   or substr(tax_id,4,2) = '00'
   or substr(tax_id,1,3) > &highestssn
   or ssnverf > 0 then do;
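The same range-and-character checks read naturally as a small validation function. A Python sketch: the cutoffs mirror the slide ('772' as the highest SSN area number), and the function name is invented:

```python
def ssn_out_of_range(tax_id: str, highest_area: str = "772") -> bool:
    """Return True if tax_id fails the slide's SSN sanity checks."""
    if not tax_id:
        return False                  # blank ids are a separate 'missing value' test
    first9 = tax_id[:9]
    if len(first9) < 9 or not first9.isdigit():
        return True                   # spaces, dashes, or letters in the first 9 chars
    area, group, serial = first9[:3], first9[3:5], first9[5:9]
    return (serial == "0000"          # substr(tax_id,6,4) = '0000'
            or area < "001"           # substr(tax_id,1,3) < '001'
            or group == "00"          # substr(tax_id,4,2) = '00'
            or area > highest_area)   # substr(tax_id,1,3) > &highestssn

print(ssn_out_of_range("123456789"))  # False
print(ssn_out_of_range("123450000"))  # True: serial 0000
```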
Use Proc SQL and other procedures to test logic/values
data T_person_v %keep;
set person;

*test3 - missing value;
anom = 0;
if citizenship_type = '' then do;
  %pass(aval=citizenship_type, bval=full_name_lfmi, cval=, dsn=person,
        anom=3, studyterm=&study, suspend_date=., data_own='reg')
end;
if anom > 0;
run;
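The test-and-pass pattern, abstracted from the SAS macros: scan each record, and for every failed test emit one anomaly row carrying the anomaly code plus the names and values of up to three evidence variables. A hypothetical Python sketch; the field names follow the slides, and the sample data is invented:

```python
ANOM_DESC = {3: "missing value", 6: "Two ids, one ssn"}

def pass_anomaly(row, anom, aval, bval=None, cval=None, data_own="reg"):
    def pair(var):  # (variable name, its value): both go into the report
        return (var, str(row.get(var, ""))) if var else ("", "")
    rec = {"id": row["id"], "anom": anom,
           "anom_desc": ANOM_DESC[anom], "data_own": data_own}
    rec["aval_desc"], rec["aval"] = pair(aval)
    rec["bval_desc"], rec["bval"] = pair(bval)
    rec["cval_desc"], rec["cval"] = pair(cval)
    return rec

people = [{"id": "A1", "citizenship_type": "", "full_name_lfmi": "Doe, J"},
          {"id": "A2", "citizenship_type": "US", "full_name_lfmi": "Roe, R"}]

# test3 - missing value: emit one anomaly record per blank citizenship_type.
today = [pass_anomaly(p, 3, "citizenship_type", "full_name_lfmi")
         for p in people if p["citizenship_type"] == ""]
print(len(today))  # 1
```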
Pass anomalies found into a transaction table
The KEEP Macro in place
The PASS Macro in place
The Anomaly Testing
*test6 - two ids, one ssn;
data T_person_v3 %keep;
set anom_person2;
%pass(aval=tax_id, bval=other_tax_id, cval=other_id, dsn=person,
      anom=6, studyterm=&study, suspend_date=., data_own='reg')
run;
Pass anomalies found into a transaction table
The KEEP Macro
PASS
Nomenclature of Test Datasets: T_area_XN, where
T = Test
area = broad anomaly category
X = v(ariable based) or a(ssumption of table logic violated)
N = incremental test set number

Dataset T_person_v3 is the third dataset testing the demographic (person) table for variable-based anomalies such as missing values or incompatible statuses.
Pass anomalies found into a transaction table
*create temp work dataset for today's transactions;
data today;
length aval_desc bval_desc cval_desc $25;
set
  T_addressc_v1 T_addressc_v2 T_addressc_a
  T_acadstudy_a T_ccat_a
  T_person_a T_person_v1 T_person_v2 T_person_v3
  {all the transaction sets};
run;
Pass anomalies found into a transaction table
Remember to unduplicate all the duplicated records found! You only need one example record per anomaly.
proc sort NODUPKEY data=today;
by person_uid anom aval bval cval studyterm;
run;
Failure to do so may result in multiple joins in the next step, when you update the master dataset.
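What NODUPKEY does here, restated in Python terms: keep only the first record per by-variable key. A small illustrative sketch with invented rows and an invented function name:

```python
# Deduplicate today's anomaly transactions on the same key the PROC SORT
# NODUPKEY step uses: one example record per
# (person_uid, anom, aval, bval, cval, studyterm).
def nodupkey(rows, key=("person_uid", "anom", "aval", "bval", "cval", "studyterm")):
    seen = {}
    for row in rows:
        k = tuple(row[f] for f in key)
        seen.setdefault(k, row)       # first record for each key wins
    return list(seen.values())

today = [
    {"person_uid": 1234, "anom": 2, "aval": "x", "bval": "", "cval": "", "studyterm": "200910"},
    {"person_uid": 1234, "anom": 2, "aval": "x", "bval": "", "cval": "", "studyterm": "200910"},  # dup
    {"person_uid": 5678, "anom": 3, "aval": "y", "bval": "", "cval": "", "studyterm": "200910"},
]
deduped = nodupkey(today)
print(len(deduped))  # 2
```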
Pass anomalies found into a transaction table
Anomalies are added or deleted from the master
Today’s Set
Master Set
data master;
update master today;
------------ or -------------
proc sql;
select * from
oldmaster l right join today r
on {criteria}
Add New Records
Keep Matching Records
Delete Non-Matches
proc sql;
create table newmaster as
select r.person_uid, r.id, r.anom, r.aval, r.aval_desc, {other variables},
       coalesce(l.first_anom_date, r.first_anom_date) as first_anom_date
from oldmaster l right join today r
  on l.person_uid = r.person_uid and l.anom = r.anom
 and l.aval = r.aval and l.bval = r.bval and l.cval = r.cval
 and l.studyterm = r.studyterm;
quit;
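The right-join-with-coalesce update can be tried end-to-end with sqlite3. Since older sqlite builds lack RIGHT JOIN, this sketch swaps the sides and uses a LEFT JOIN, which is logically equivalent; the sample rows are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE oldmaster (person_uid INTEGER, anom INTEGER, first_anom_date TEXT);
    CREATE TABLE today     (person_uid INTEGER, anom INTEGER, first_anom_date TEXT);
    -- anomaly (1234, 2) was first seen earlier and is still present today;
    -- anomaly (5678, 3) is brand new; old anomaly (9999, 4) is gone (fixed).
    INSERT INTO oldmaster VALUES (1234, 2, '2008-10-01'), (9999, 4, '2008-09-15');
    INSERT INTO today     VALUES (1234, 2, '2008-11-01'), (5678, 3, '2008-11-01');
""")

# today LEFT JOIN oldmaster = oldmaster RIGHT JOIN today: keep every current
# anomaly, drop the fixed ones, and preserve the original first_anom_date
# for anomalies that persist.
newmaster = con.execute("""
    SELECT r.person_uid, r.anom,
           COALESCE(l.first_anom_date, r.first_anom_date) AS first_anom_date
    FROM today r LEFT JOIN oldmaster l
      ON l.person_uid = r.person_uid AND l.anom = r.anom
    ORDER BY r.person_uid
""").fetchall()
print(newmaster)  # [(1234, 2, '2008-10-01'), (5678, 3, '2008-11-01')]
```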
Anomalies are added or deleted from the master
The master table is subset and sent as email
data reg ban ir adm grr soe bur law;
set master;
output ir; *ir gets all records;
if data_own='reg' then output reg;
else if data_own='ban' then output ban;
else if data_own='adm' then output adm;
else if data_own='grr' then output grr;
else if data_own='soe' then output soe;
else if data_own='bur' then output bur;
else if data_own='law' then output law;
run;
The master table is subset and sent as email
proc export data=reg dbms=excel2002 outfile= 'g:\temp\reg.xls' replace;
proc export data=ban dbms=excel2002 outfile= 'g:\temp\ban.xls' replace;
proc export data=ir dbms=excel2002 outfile= 'g:\temp\ir.xls' replace;
The master table is subset and sent as email
filename reports email "esdav2@wm.edu";
data _null_;
file reports;
set departments;
put '!EM_TO! ' name;
put '!EM_SUBJECT! Report for ' dept;
put 'Hi ' fname ' -';
put 'Here is the latest report of anomalies for the ' dept '.';
if dept='ban' then put '!EM_ATTACH! g:\temp\ban.xls';
else if dept='reg' then put '!EM_ATTACH! g:\temp\reg.xls';
else if dept='ir' then put '!EM_ATTACH! g:\temp\ir.xls';
put '!EM_SEND!';
put '!EM_NEWMSG!';
put '!EM_ABORT!';
run;
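Outside SAS, the same per-office mailing could be assembled with Python's standard email module. Everything here (host, addresses, paths, function name) is a placeholder; actually sending is left as a comment:

```python
# Build one per-office message the way the FILENAME EMAIL step does:
# subject from the department, greeting from the contact name, and the
# right spreadsheet attached.
from email.message import EmailMessage

def build_report(dept: str, name: str, email_addr: str, attachment: bytes) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = "ir-office@example.edu"
    msg["To"] = email_addr
    msg["Subject"] = f"Report for {dept}"
    msg.set_content(f"Hi {name} -\n"
                    f"Here is the latest report of anomalies for the {dept}.")
    msg.add_attachment(attachment,
                       maintype="application", subtype="vnd.ms-excel",
                       filename=f"{dept}.xls")
    return msg

msg = build_report("reg", "Evan", "registrar@example.edu", b"...xls bytes...")
# Sending would be: smtplib.SMTP("smtp.example.edu").send_message(msg)
print(msg["Subject"])  # Report for reg
```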
Success!
Subject line
Attached xls
Name
Department
Technical Hurdles Along The Way
• Don’t use your Outlook mailer. Specify your SMTP mail service as the mailer. You may have to pass your authentication through the SAS sasv9.cfg file to allow SMTP mailing.
• Depending on your network, you may have to use “Cscript” as the keyword in the scheduler to launch the program, rather than the implicit “Wscript”. ‘C’ stands for ‘console’.
Lessons Learned
• Send all output to yourself for several days to review, before allowing it to be sent out automatically
• Send lower priority anomalies infrequently, and high priority ones weekly or daily
• Send notice of table violations infrequently to IT, once they have identified a problem and resolution path. (Don’t bug them if they, too, are waiting for a vendor patch or fix)
Lessons Learned
• Be aware of the length of time the job takes to execute. You may need to adjust what and when you test if it starts taking too much time
• Bring your ‘users’ into the process by asking them what you can do differently to help them. Do they need another variable to help isolate problems? Alternate ID?
Auto-Validating Your Data Store Evan Davies
Presentation available online at: http://web.wm.edu/ir/conferencepres.html