auto-validating your data store: a do-it-yourself approach to data integrity and anomaly detection....

Download Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection. Evan Davies Office of Strategic Planning and Analysis

If you can't read please download the document

Post on 19-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

  • Slide 1
  • Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection. Evan Davies Office of Strategic Planning and Analysis The College of William and Mary in Virginia
  • Slide 2
  • Overview Institutional researchers increasingly rely on data marts, stores, and warehouses for management information. Given the perpetually developing status of these environments, both commercial and institutional, you can spend significant time dealing with data that does not meet even shifting standards for table logic, variable conventions, and values. Such anomalies can stop production programs or lead to inaccurate information until detected.
  • Slide 3
  • Does this ever happen to you? NOTE: Table WORK.ENROLL created, with 8241 rows and 19 columns. NOTE: Table WORK.PERSON created, with 8241 rows and 13 columns. NOTE: Table WORK.MAJOR created, with 8253 rows and 30 columns. NOTE: Table WORK.MAJOR_PERSON created, with 8264 rows and 30 columns. Who are these extra people? Why are they in here?
  • Slide 4
  • Agenda This presentation demonstrates a simple yet sophisticated way of using SAS Enterprise Guide to check tables, variables, and values automatically to find out if that data meets IR standards and premises before you start analytical work (and to let others know that things may need fixing).
  • Slide 5
  • Things You Should Know Simple is a relative term, as is sophisticated This involves more coding than mouse clicking You should have at some concepts of SAS coding, SQL, and relational databases To make this work back at your campus, you need to know (or find out) how to access your data If you have significant structured programming and SQL experience, please refrain from laughing out loud. Snickering is acceptable.
  • Slide 6
  • The College of William & Mary The only royally chartered colonial institution, 1693, by King William III and Queen Mary II Making it the second oldest college in the United States Phi Beta Kappa, the first Greek honor society, was founded here in 1776 Became state-supported in 1906 and coeducational in 1918 The Alma Mater of George Washington and Thomas Jefferson, as well as Jon Stewart and Secretary of Defense Robert Gates Named one of Intel's 50 Most Unwired College Campuses for our campus-wide wireless network The Colonial Campus section of the 1,200 acre campus is restored to its 18th-century appearance
  • Slide 7
  • The Wren Building (1700) The Oldest Academic Building in Continuous Use in the U.S.
  • Slide 8
  • The College Today 5,800 undergraduates and 1,950 graduate students from all 50 states and 30 foreign countries 22 percent are students of color 79 percent of freshmen graduated in the top ten percent of their class Highest SAT middle 50th range of all public institutions in Virginia 11:1 student-faculty ratio W&M has more recipients of the Commonwealth's Outstanding Faculty Award than any other institution 5 undergraduate and graduate schools: Arts & Sciences, Business, Education, Law, and Marine Science 36 undergraduate programs, 12 masters, doctoral, and professional degrees is a Highly Selective Public Liberal Arts University
  • Slide 9
  • HRS Human Resource System Applications Personnel Mgmt Position Control Benefits Mgmt Work Study Payroll/Account A21 Certification CARS Interface The College of William and Mary Federal Govt IRS/SSA W2s/1099s VA Dept of Taxation W2s/1099s CDS Vendors Office Supply Vendor VA DOA CARS System Payroll Acct EDI to W&M CDS Pymts EDI Vendor Pymts Expense/Cash Acct Agency/CPRS Tape Federal Dept Labor Employment Compliance VA DPT BES System Benefits Eligibility VA VRS Retire, Life Ins, Opt Life VA VEC Unemployment Benefits FSA Administrator TIAA - CREF Annuity Eligibility Benefits Vendors VA DPT PMIS Personnel Mgmt VA DPB Budget Admin Social Security Administration Fed Reserve US Savings Bonds External System Interfaces/Entities Office Supplies Systems Warehouse Systems Work Order System Faculty Salary Tracking Leave Account 1500 Hour Tracking FRS Financial Record System Accounts Payable Purchasing General Ledger Budget CARS Interface/Recon EDI & CDS Vendors Financial Statements Grants Account Fixed Assets (6/99) VA APA Bank ACH Direct Deposits Check Verification Cancelled Checks Lock-Box Payments Cash Receipts System FAACS (Old Fixed Assets) WORCS Student Web Old Campus Police Old SIS Student Info System Residence Life Student Billing Transcripts New SIS Student Info System Prospects Admissions Student Records Registration Course Schedule PO Box Management DARS SSA/FICA Deposits Payroll Office US Savings Bonds Admin Payroll Office Checks 1-2-3 Pay Checks Direct Deposit Stubs W2s AP checks 1099s General Accounting Office New Campus Police Identification System Food Services Parking System SWEM Patron Info System Mysoft Call Account Sys Telecom Power FAIDS New Financial Aid VA SCHEV SAT Scores ACT Scores GRE Scores MAT Scores MCAT Scores NCAA Prospective Student Search Petersons Pros. Student Data National Clearing House Loan Verification PELL Grants ISIR F.A. Input Wiz Kid Student Health System Door Access System Alumni Development System Old Financial Aid Schedule 25 Resource 25 Athlete Tracking Web Applications (IP) College Systems that Interface with Administrative Computing Systems Data History We had many homegrown systems, some based on Information Associates architecture, but extensively modified.
  • Slide 10
  • Recent Data History After false starts, in 2003 we bought a large Banner enterprise system, and then added a datamart, and then a data store. We are now five years into our two year installation period. Or put another way, we are current on version(s) of 8.2 and 3.1, with new version(s) around the corner. When the IR staff start to understand the relationships and foibles of a particular version, it is time to upgrade to a newer version in which some things are fixed and some other things are broken differently enabled. v.6, v.7, v.8 v.1, v.2 v.1, v.2, v.3.
  • Slide 11
  • Things To Realize SungardHE Banner products are not a bad system. Nor are any other commercial vendors products. Any enterprise-level system with a data store is a permanently evolving, almost organic entity, with multiple input opportunities for breaking constraints and premises and for finding new ways to induce unexpected results through changing business rules, institutional decentralization, flexibility, and collegiality.
  • Slide 12
  • Data Integrity in Pre- and Post-Enterprise Systems Old History: I.T. used to do a system General Edit, with general meaning edited for operational purposes, not analytical purposes. If the data got the payroll to run today or allowed you to admit a student, it was valid. New History: Data validity is still measured against operational standards.
  • Slide 13
  • Data Integrity Post-Enterprise System And I.T. is now even busier than ever just making the production system run, without markedly larger numbers of staff to commit to the data warehousing activity. They actually do less general editing because more transactions are interactive rather than process-oriented. At the end of a day, data is off-loaded into the data store, in different forms and with different premises from how the data is held from the production system.
  • Slide 14
  • Data Integrity Pre- and Post-Enterprise System This means that there is now an even bigger split between operational and analytical data integrity, especially in terms of the differing forms of the data. The data store is not as rigorously evaluated for data integrity or meaning precisely because it is not production. And since the operational offices of the institution are satisfied with their pieces of the data pie, everything is working fine.
  • Slide 15
  • The Institutional Research Role IR has always had premises for data reporting that go beyond general edit. We analyze data relationships that make up the whole picture. We work in the aggregate, rather than by individual transaction. We are ideally poised to discover the anomalies that occur between multiple institutional sources -- between one offices interpretation of a transaction and another offices idea of the same information.
  • Slide 16
  • A Data Stores Added Task to IR IR now has to do the same comprehensive validation tasks that we used to do, plus identify and deal with the newer introduced problem of tables that violate their own premises or have unexpected values due to complex dynamics among: UPGRADE
  • Slide 17
  • The Complex Dynamics imperfect translation of data between the production transaction system and the data store/warehouse, vendor maintenance or institutional business rules changes that intentionally induce changes in tables. generally caused by the inability to predict all of the systemic results of making unspecified or unimagined changes in a system, sometimes known as the Butterfly Effect. the continuous upgrade cycles applied to the system, and data that results from imperfectly recorded transactions in uncertain environments with less than adequate collaboration and training.
  • Slide 18
  • So What Can We Do? Write a program to check the logic premises of frequently used tables check for missing or out-of-range values in data that affects IR run the program frequently to uncover problems in time for the current census and prevent future term anomalies. Keep the results for documentation. communicate findings promptly and efficiently in order to effect change build in flexibility to test different things at different times in different ways
  • Slide 19
  • Limitations in Place Do it on your own, since it is for IR purposes. Remember, the data already meets everybody elses needs. Use existing resources. Keep it simple. Someone else is going to have to be notified to deal with the data anomaly once it is identified. Dont weaponize the process. This is why we choose to have anomalies, not mistakes.
  • Slide 20
  • How To Accomplish? The use of SAS EG on a PC platform allows for the remote submission of a premise and data-checking program during off-hours, when demand on the system is lessened. It also allows a convenient environment in which to store the program and results, access validation and history tables, and send automatic e-mails to interested parties such as admissions, registrar, IR and IT staff.
  • Slide 21
  • SAS Enterprise Guide, a powerful Microsoft Windows client application that provides a guided mechanism to exploit the power of SAS and publish dynamic results throughout your institution. Its the preferred interface to SAS for analysts, statisticians and programmers SAS Enterprise Guide saves time by automatically generating computer code with an easy point- and-click interface. Think of it as a graphical office environment for the SAS language. As Microsoft Outlook ties MS Office products together, Enterprise Guide has a similar role for SAS SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration.
  • Slide 22
  • Still not clear on SAS/EG ? It is an environment in which SAS 9.1 runs; It can bring together data views and any type of files from any network data servers, including Oracle. It contains SAS program(s), the larger project, notes, and a graphical description of how the project, processes, and programs relate to each other. It documents how any processes or programs have been run, and the results. It can generate code for procedures and datasteps. It inherits libraries, autoexecs, etc. from 9.1
  • Slide 23
  • We Use Enterprise Guide To Schedule and launch the anomaly tracking job Provide a comprehensive project environment for IR staff to be able to visit independently and add new anomaly checks Be able to see and modify the associated tables and data in one place Provide our novice IR staff a more centralized and friendlier view of the process
  • Slide 24
  • What Does It Look Like?
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Can You Just Use SAS Itself? YES! Provided you schedule the job through MS Task Scheduler and have no desire for the previously mentioned features or facilities. Programmatically, all process features are part of base SAS v9.1 on an XP or Vista platform.
  • Slide 32
  • Beginning Steps Survey yourself and other staff (IR and other offices) to make a preliminary list of tables and values that have issues Establish the names, emails, and hierarchy of those who will receive automated communications. Make a calendar of when you want to test certain items due to different functional cycles (admissions, registration, etc)
  • Slide 33
  • Lets Go Coding I cant show you the entire code for my anomaly program today. It would need to be modified to fit your data structures and needs anyway. Instead I will concentrate on imparting some: Key Program Ideas SQL techniques to enable you to develop your own program
  • Slide 34
  • The Overall Design Concept
  • Slide 35
  • The Overall Design Concept (2)
  • Slide 36
  • Set up a SAS program which launches automatically using SAS/EG
  • Slide 37
  • Slide 38
  • %let term_start_limit=200325; %let term_end_limit =201030; %let highestssn='772'; %let es_validset='EL','MW','WD','WM','WW'; /*all valid statuses encountered by enrolled students AFTER enrollment/dropadd*/ %let es_bad='QW','WB','AW'; /*all bad or ineligible statuses not to be counted by census date*/ %let tooold='1900'; *out-of-range birthday year cutoff; Start by setting up metadata necessary for testing
  • Slide 39
  • Macro variable assignment increases flexibility as table names change. %let table_enroll=enrollment; %let table_academic_study=academic_study; %let table_addcurrent=address_current; %let table_prevedslot=previous_education_slot; %let table_schedule=schedule_offering; %let table_course_catalog=course_catalog; %let table_course=student_course;
  • Slide 40
  • Start by setting up metadata necessary for testing *format table for anomalies; proc format; value anom 1= 'withdrawn with classes' 2= 'duplicate person recs ' 3-4='missing value ' 5= 'ssn out-of-range ' 6= 'Two ids, one ssn ' 7= 'dup recs course_catalo ' ; run;
  • Slide 41
  • Construct some macros to help pass anomaly parameters
  • Slide 42
  • Create table todayterm as select stvterm_code as studyterm, datepart(stvterm_start_date) as begd, datepart(stvterm_end_date) as endd from ( your data source ) where today() > datepart(stvterm_start_date) and today() < datepart(stvterm_end_date) ; *set a global study term variable based on value in todayterm; data _null_; set todayterm; call symput('study',studyterm); Construct some macros to help pass anomaly parameters
  • Slide 43
  • Use up to 3 variables to show people what is anomalous about any particular situation. Name them A, B, and C. The variables you will pass to these variables will differ with each problem. You will need to put both the value and the name of the variable into your anomaly report, plus some identifier(s) for the student or entity, plus some anomaly details. Construct some macros to help pass anomaly parameters
  • Slide 44
  • *Macro to help transfer of anomalies; %macro keep ; (keep= id person_uid aval aval_desc bval bval_desc cval cval_desc anom studyterm first_anom_date suspend_date data_own) %mend keep; Construct some macros to help pass anomaly parameters
  • Slide 45
  • %macro pass (aval=,bval=,cval=,dsn=,anom=,studyterm=,suspend_date=,data_own=) ; aval_desc = "&aval"; bval_desc = "&bval"; cval_desc = "&cval"; aval = put(&aval,$25.); bval = put(&bval,$25.); cval = put(&cval,$25.); anom=&anom; studyterm=put(&studyterm,$6.); first_anom_date=today(); suspend_date=&suspend_date; data_own=&data_own; %mend pass; Construct some macros to help pass anomaly parameters PASSES THE NAME OF THE VARIABLE PASSES char VALUE OF THE VARIABLE PASSES OTHER VALUES
  • Slide 46
  • proc sql; connect to odbc as mydb (datasrc="&datasrc" user=&user password=&password); create table addressc as select * from connection to mydb ( select m.PERSON_UID, m.id, n.address_type, n.postal_code, n.city, n.county, n.state_province, n.nation from &table_enroll m inner join &table_addcurrent n on m.person_uid = n.entity_uid where m.ACADEMIC_PERIOD in (&study) and ((m.ENROLLED_IND='Y' and m.REGISTERED_IND='Y' ) or (m.ENROLLED_IND='Y' and m.ENROLLMENT_STATUS in (&es_set))) and n.address_type in ('IN', 'P1', 'MA') ) ; Acquire data for testing from your datastore
  • Slide 47
  • from &table_enroll m inner join &table_addcurrent n on m.person_uid = n.entity_uid where m.ACADEMIC_PERIOD in (&study) and and n.address_type in ('IN', 'P1', 'MA') STUDENTS THIS TERM ALL ADDRESSES ADDRESSES TO CHECK
  • Slide 48
  • Acquire data for testing from your datastore Let your server do the heavy data work SQL call Result Set Oracle
  • Slide 49
  • The most useful technique for detecting a table that violates record premises is by joining the table back to a copy of itself that has been summarized by the number of expected rows, keeping only those records that dont meet expectations. A SQL statement that uses a count() function, a group by clause, and a having clause does this job effectively. Use Proc SQL and other procedures to test logic/values
  • Slide 50
  • select l.* from (select PERSON_UID, ID, ACADEMIC_PERIOD, PROGRAM, PRIMARY_PROGRAM_IND, ADMISSIONS_POPULATION from &table_academic_study where academic_period in (&study) ) l inner join (select person_uid from &table_academic_study where academic_period in (&study) group by person_uid, program having count(person_uid) > 1 ) r on l.person_uid = r.person_uid Use Proc SQL and other procedures to test logic/values
  • Slide 51
  • select person_uid from &table_academic_study where academic_period in (&study) group by person_uid, program having count(person_uid) > 1 Use Proc SQL and other procedures to test logic/values PERSON_UIDPROGRAMACADEMIC_PERIODCOUNT 1234BA-GOVT2009101 1234BA-GOVT2009101 PERSON_UIDPROGRAMACADEMIC_PERIODCOUNT 1234BA-GOVT2009102
  • Slide 52
  • select l.person_uid, r.person_uid as other_uid, l.tax_id, r.tax_id as other_tax_id, l.full_name_lfmi from person l inner join person r on l.TAX_ID = r. TAX_ID AND l.person_uid r.person_uid where l.tax_id is not null and r.tax_id is not null This SQL will find two different university ids that share the same SSN. This generally occurs when the institution has issued two ids to the same person without an adequate search of records. The faster this is spotted, the better. Use Proc SQL and other procedures to test logic/values
  • Slide 53
  • In addition to testing table logic, once you have the datasets established, any variety or combination of values can be tested. Here are four conditions to get you started thinking about what can be tested: - missing values; if citizenship_type = '' then do; - withdrawn with classes; if enrolled_ind = 'Y' and registered_ind = 'Y' and enrollment_status = 'WB' then do; Use Proc SQL and other procedures to test logic/values
  • Slide 54
  • if state_province in ('AA','AE','AP','PR','VI','AL','AK','AZ','AR','CA', 'WI','WY' ) and nation > '' or state_province in ('AB','BC','MB','NB','NL','NT','NS','NU','ON','PE','QC','SK', 'YT') and nation ^= 'CA' or nation = 'CA' and state_province not in ('AB','BC','MB','NB','NL','NT','NS','NU','ON','PE','QC','SK', 'YT') or state_province in ('FC','HK', 'RQ', 'XX') then do; Use Proc SQL and other procedures to test logic/values
  • Slide 55
  • - ssn out of range; if tax_id ^= '' then do; ssnverf = indexc(substr(tax_id,1,9),' ','- abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLM NOPQRSTUVWXYZ') ; if substr(tax_id,6,4) = '0000' or substr(tax_id,1,3) < '001' or substr(tax_id,4,2) = '00' or substr(tax_id,1,3) > &highestssn or ssnverf > 0 then do; Use Proc SQL and other procedures to test logic/values
  • Slide 56
  • data T_person_v %keep ; set person; *test3 - missing value; anom=0; if citizenship_type = '' then do; %pass(aval=citizenship_type, bval=full_name_lfmi, cval=, dsn=person, anom=3, studyterm=&study, suspend_date=., data_own='reg') end; if anom > 0; run; Pass anomalies found into a transaction table The KEEP Macro in place The PASS Macro in place The Anomaly Testing
  • Slide 57
  • *test6 - two ids, one ssn; data T_person_v3 %keep ; set anom_person2; %pass(aval=tax_id, bval=other_tax_id, cval=other_id, dsn=person, anom=6, studyterm=&study, suspend_date=.,data_own='reg') run; Pass anomalies found into a transaction table The KEEP Macro PASS
  • Slide 58
  • Nomenclature of Test Datasets - T_area_XN where T = Test area = broad anomaly category X = v(ariable based) a(ssumption of table logic violated) N = incremental test set number dataset T_person_v3 is the third datset testing the demographic (person) table for variable-based anomalies such as missing values or incompatible statuses Pass anomalies found into a transaction table
  • Slide 59
  • *create temp work dataset for todays transactions; data today; length aval_desc bval_desc cval_desc $25 ; set T_addressc_v1 T_addressc_v2 T_addressc_a T_acadstudy_a T_ccat_a T_person_a T_person_v1 T_person_v2 T_person_v3 {all the transaction sets}; run; Pass anomalies found into a transaction table
  • Slide 60
  • Remember to unduplicate all the duplicated records found! You only need one example record per anomaly. proc sort NODUPKEY data=today; by person_uid anom aval bval cval studyterm; run; Failure to do so may result in multiple joins in the next step, when you update the master dataset. Pass anomalies found into a transaction table
  • Slide 61
  • Anomalies are added or deleted from the master Todays Set Master Set Data master ; Update master today; ------------or------------- Proc SQL; Select * from oldmaster l right join today r on {criteria}
  • Slide 62
  • proc sql; create table newmaster as select r.person_uid, r.id, r.anom, r.aval, r.aval_desc, {other variables}, coalesce(l.first_anom_date, r.first_anom_date) as first_anom_date from oldmaster l right join today r on l.person_uid = r.person_uid and l.anom = r.anom and l.aval= r.aval and l.bval= r.bval and l.cval= r.cval and l.studyterm = r.studyterm ; quit; Anomalies are added or deleted from the master
  • Slide 63
  • The master table is subset and sent as email data reg ban ir adm grr soe bur law; set master; output ir; *for all records; if data_own='reg' then output reg; else if data_own = 'ban' then output ban; else if data_own = 'adm' then output adm; else if data_own = 'grr' then output grr; else if data_own = 'soe' then output soe; else if data_own = 'bur' then output bur; else if data_own = 'law' then output law;
  • Slide 64
  • The master table is subset and sent as email proc export data=reg dbms=excel2002 outfile= 'g:\temp\reg.xls' replace; proc export data=ban dbms=excel2002 outfile= 'g:\temp\ban.xls' replace; proc export data=ir dbms=excel2002 outfile= 'g:\temp\ir.xls' replace;
  • Slide 65
  • The master table is subset and sent as email filename reports email "[email protected]"; data _null_; file reports; set departments; put '!EM_TO! ' name; put '!EM_SUBJECT! Report for ' dept; put Hi fname -'; put 'Here is the latest report of anomalies for the' dept'.' ; if dept='ban' then put '!EM_ATTACH! g:\temp\ban.xls'; else if dept ='reg' then put '!EM_ATTACH! g:\temp\reg.xls'; else if dept ='ir' then put '!EM_ATTACH! g:\temp\adm.xls'; put '!EM_SEND!'; put '!EM_NEWMSG!'; put '!EM_ABORT!'; run;
  • Slide 66
  • Success! Subject line Attached xls Name Department
  • Slide 67
  • Technical Hurdles Along The Way Dont use your Outlook mailer. Specify your SMTP mail service as the mailer. You may have to pass your authentication through the SAS sasv9.cfg file to allow STMP mailing. Depending on your network, you may have to use Cscript as the keyword in the scheduler to launch the program, rather than the implicit Wscript. C stands for console.
  • Slide 68
  • Lessons Learned Send all output to yourself for several days to review, before allowing it to be sent out automatically Send lower priority anomalies infrequently, and high priority ones weekly or daily Send notice of table violations infrequently to IT, once they have identified a problem and resolution path. (Dont bug them if they, too, are waiting for a vendor patch or fix)
  • Slide 69
  • Lessons Learned Be aware of the length of time the job takes to execute. You may need to adjust what and when you test if it starts taking too much time Bring your users into the process by asking them what you can do differently to help them. Do they need another variable to help isolate problems? Alternate ID?
  • Slide 70
  • Auto-Validating Your Data Store Evan Davies Presentation available online at: http://web.wm.edu/ir/conferencepres.html
  • Slide 71