
Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection.

Evan Davies
Office of Strategic Planning and Analysis
The College of William and Mary in Virginia

Overview

• Institutional researchers increasingly rely on data marts, stores, and warehouses for management information.

• Given the perpetually ‘developing’ status of these environments, both commercial and institutional, you can spend significant time dealing with data that does not meet even shifting standards for table logic, variable conventions, and values.

• Such anomalies can stop production programs or lead to inaccurate information until detected.

Does this ever happen to you?

• NOTE: Table WORK.ENROLL created, with 8241 rows and 19 columns.
• NOTE: Table WORK.PERSON created, with 8241 rows and 13 columns.
• NOTE: Table WORK.MAJOR created, with 8253 rows and 30 columns.
• NOTE: Table WORK.MAJOR_PERSON created, with 8264 rows and 30 columns.
• Who are these extra people? Why are they in here?

Agenda

• This presentation demonstrates a simple yet sophisticated way of using SAS® Enterprise Guide to check tables, variables, and values automatically to find out if that data meets IR standards and premises before you start analytical work (and to let others know that things may need fixing).

Things You Should Know…

• ‘Simple’ is a relative term, as is ‘sophisticated’
• This involves more coding than mouse clicking
• You should have at least some concepts of SAS® coding, SQL, and relational databases
• To make this work back at your campus, you need to know (or find out) how to access your data
• If you have significant structured programming and SQL experience, please refrain from laughing out loud. Snickering is acceptable.

The College of William & Mary

• The only royally chartered colonial institution, 1693, by King William III and Queen Mary II…
• Making it the second oldest college in the United States
• Phi Beta Kappa, the first Greek honor society, was founded here in 1776
• Became state-supported in 1906 and coeducational in 1918
• The Alma Mater of George Washington and Thomas Jefferson, as well as Jon Stewart and Secretary of Defense Robert Gates
• Named one of Intel's 50 Most Unwired College Campuses for our campus-wide wireless network
• The Colonial Campus section of the 1,200-acre campus is restored to its 18th-century appearance

The Wren Building (1700)

The Oldest Academic Building in Continuous Use in the U.S.

The College Today…

• 5,800 undergraduates and 1,950 graduate students from all 50 states and 30 foreign countries
• 22 percent are students of color
• 79 percent of freshmen graduated in the top ten percent of their class
• Highest SAT middle 50th range of all public institutions in Virginia
• 11:1 student-faculty ratio
• W&M has more recipients of the Commonwealth's Outstanding Faculty Award than any other institution
• 5 undergraduate and graduate schools: Arts & Sciences, Business, Education, Law, and Marine Science
• 36 undergraduate programs; 12 masters, doctoral, and professional degrees
• W&M is a Highly Selective Public Liberal Arts University

[Diagram: College Systems that Interface with Administrative Computing Systems — the HRS (Human Resource System), FRS (Financial Record System), and Old/New SIS (Student Info System) at the center, surrounded by dozens of external interfaces and entities: IRS/SSA, VA Dept of Taxation, VA DOA CARS, VA DPT/PMIS, VA DPB, VA VRS, VA VEC, VA SCHEV, VA APA, Federal Dept of Labor, Social Security Administration, Federal Reserve (US Savings Bonds), TIAA-CREF, FSA Administrator, benefits and office-supply vendors, test-score feeds (SAT, ACT, GRE, MAT, MCAT), NCAA, National Clearing House loan verification, PELL Grants/ISIR financial aid input, Power FAIDS, DARS, old and new campus police, parking, door access, student health, food services identification, SWEM patron info, Mysoft call accounting, alumni development, Schedule 25/Resource 25, athlete tracking, and web applications.]

Data History

We had many homegrown systems, some based on Information Associates® architecture, but extensively modified.

Recent Data History

• After false starts, in 2003 we bought a large Banner enterprise system, and then added a datamart, and then a data store. We are now five years into our two year installation period.

• Or put another way, we are ‘current’ on version(s) of 8.2 and 3.1, with new version(s) around the corner.

• When the IR staff start to understand the relationships and foibles of a particular version, it is time to upgrade to a newer version in which some things are fixed… and some other things are broken (or rather, ‘differently enabled’).

Production: v.6, v.7, v.8…
DataMart: v.1, v.2…
DataStore: v.1, v.2, v.3…

Things To Realize…

• SungardHE Banner® products are not a bad system. Nor are any other commercial vendor’s products.
• Any enterprise-level system with a data store is a permanently evolving, almost organic entity,
• with multiple input opportunities for breaking constraints and premises,
• and for finding new ways to induce unexpected results through changing business rules, institutional decentralization, flexibility, and collegiality.

Data Integrity in Pre- and Post-Enterprise Systems

• Old History: I.T. used to do a system General Edit, with “general” meaning edited for operational purposes, not analytical purposes. If the data got the payroll to run today or allowed you to admit a student, it was ‘valid’.

• New History: Data validity is still measured against operational standards.

Data Integrity Post-Enterprise System

• And I.T. is now even busier than ever just making the production system ‘run’, without markedly larger numbers of staff to commit to the data warehousing activity. They actually do less general editing because more transactions are interactive rather than process-oriented.

• At the end of the day, data is off-loaded into the data store, in different forms and with different premises from how the data is held in the production system.

Data Integrity Pre- and Post-Enterprise System

• This means that there is now an even bigger split between operational and analytical data integrity, especially in terms of the differing forms of the data.

• The data store is not as rigorously evaluated for data integrity or meaning precisely because it is not production. And since the operational offices of the institution are satisfied with their pieces of the data pie, everything is working fine.

The Institutional Research Role

• IR has always had premises for data reporting that go beyond ‘general edit’.

• We analyze data relationships that make up the whole picture.

• We work in the aggregate, rather than by individual transaction.

• We are ideally poised to discover the anomalies that occur between multiple institutional sources -- between one office’s interpretation of a transaction and another office’s idea of the same information.

A Data Store’s Added Task to IR

• IR now has to do the same comprehensive validation tasks that we used to do, plus identify and deal with the newer ‘introduced’ problem of tables that violate their own premises or have unexpected values due to complex dynamics among:

[Diagram: complex dynamics among RULES, CHANGES, PEOPLE, PRODuction UPGRADEs, and the data STORE]

The Complex Dynamics

• imperfect translation of data between the production transaction system and the data store/warehouse,
• vendor maintenance or institutional business rules changes that intentionally induce changes in tables,
• generally caused by the inability to predict all of the systemic results of making unspecified or unimagined changes in a system, sometimes known as the Butterfly Effect,
• the continuous upgrade cycles applied to the system,
• and data that results from imperfectly recorded transactions in uncertain environments with less than adequate collaboration and training.

So What Can We Do?

• Write a program to
– check the logic premises of frequently used tables
– check for missing or out-of-range values in data that affects IR
– run the program frequently to uncover problems in time for the current census and prevent future term anomalies. Keep the results for documentation.
– communicate findings promptly and efficiently in order to effect change
– build in flexibility to test different things at different times in different ways

Limitations in Place

• Do it on your own, since it is for IR purposes. Remember, the data already meets everybody else’s needs.

• Use existing resources. Keep it simple.

• Someone else is going to have to be notified to deal with the data anomaly once it is identified.

• Don’t weaponize the process. This is why we choose to have anomalies, not mistakes.

How To Accomplish?

• The use of SAS® EG on a PC platform allows for the remote submission of a premise and data-checking program during off-hours, when demand on the system is lessened.

• It also allows a convenient environment in which to store the program and results, access validation and history tables, and send automatic e-mails to interested parties such as admissions, registrar, IR and IT staff.

• “SAS® Enterprise Guide®, a powerful Microsoft Windows client application that provides a guided mechanism to exploit the power of SAS and publish dynamic results throughout your institution. It’s the preferred interface to SAS for analysts, statisticians and programmers– SAS Enterprise Guide saves time by automatically generating computer code with an easy point-and-click interface.”

• Think of it as a graphical office environment for the SAS language. As Microsoft Outlook ties MS Office products together, Enterprise Guide has a similar role for SAS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Still not clear on SAS/EG®?

• It is an environment in which SAS 9.1 runs;
• It can bring together data views and any type of files from any network data servers, including Oracle.
• It contains SAS program(s), the larger project, notes, and a graphical description of how the project, processes, and programs relate to each other.
• It documents how any processes or programs have been run, and the results.
• It can generate code for procedures and datasteps.
• It inherits libraries, autoexecs, etc. from 9.1

We Use Enterprise Guide® To…

• Schedule and launch the anomaly tracking job
• Provide a comprehensive project environment for IR staff to be able to visit independently and add new anomaly checks
• Be able to see and modify the associated tables and data in one place
• Provide our novice IR staff a more centralized and friendlier view of the process

What Does It Look Like?

Can You Just Use SAS® Itself?

• YES!
• Provided you schedule the job through MS Task Scheduler and have no desire for the previously mentioned features or facilities.
• Programmatically, all process features are part of base SAS v9.1 on an XP or Vista platform.

Beginning Steps

• Survey yourself and other staff (IR and other offices) to make a preliminary list of tables and values that have issues

• Establish the names, emails, and hierarchy of those who will receive automated communications.

• Make a calendar of when you want to test certain items due to different functional cycles (admissions, registration, etc)

Let’s Go Coding…

• I can’t show you the entire code for my anomaly program today.

• It would need to be modified to fit your data structures and needs anyway.

• Instead I will concentrate on imparting some:
– Key Program Ideas
– SQL techniques
• to enable you to develop your own program

The Overall Design Concept

Set up a SAS program which launches automatically using SAS/EG

Construct some macros to help pass anomaly parameters

Acquire data for testing from your datastore tables

Use Proc SQL and other steps and procedures to test data

The Overall Design Concept (2)

Test tables and variables based on the academic year and institutional work cycle

Anomalies ‘fixed’ are deleted from the master; new ones are added to it

The master table is subset and sent as email at various intervals and detail levels

A history set of anomaly transactions is kept for study

Set up a SAS program which launches automatically using SAS/EG

%let term_start_limit=200325;

%let term_end_limit =201030;

%let highestssn='772';

%let es_validset='EL','MW','WD','WM','WW'; /*all valid statuses encountered by enrolled students AFTER enrollment/dropadd*/

%let es_bad='QW','WB','AW'; /*all bad or ineligible statuses not to be counted by census date*/

%let tooold='1900'; *out-of-range birthday year cutoff;

Start by setting up metadata necessary for testing

Macro variable assignment increases flexibility as table names change.

%let table_enroll=enrollment;
%let table_academic_study=academic_study;
%let table_addcurrent=address_current;
%let table_prevedslot=previous_education_slot;
%let table_schedule=schedule_offering;
%let table_course_catalog=course_catalog;
%let table_course=student_course;

Start by setting up metadata necessary for testing

*format table for anomalies;
proc format;
value anom
1 = 'withdrawn with classes'
2 = 'duplicate person recs'
3-4 = 'missing value'
5 = 'ssn out-of-range'
6 = 'Two ids, one ssn'
7 = 'dup recs course_catalo'
……;
run;

Construct some macros to help pass anomaly parameters

proc sql;
create table todayterm as
select stvterm_code as studyterm,
       datepart(stvterm_start_date) as begd,
       datepart(stvterm_end_date) as endd
from (your data source)
where today() > datepart(stvterm_start_date)
  and today() < datepart(stvterm_end_date);
quit;

*set a global study term variable based on value in todayterm;
data _null_;
set todayterm;
call symput('study',studyterm);
run;
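The same "which term contains today?" lookup can be sketched outside SAS. This is an illustrative Python restatement with invented term rows; the real values come from your term-definition table:

```python
from datetime import date

# Stand-in for the todayterm query: term rows with start/end dates.
# These rows are made up for demonstration.
terms = [
    ("200910", date(2008, 8, 27), date(2008, 12, 19)),
    ("200920", date(2009, 1, 14), date(2009, 5, 8)),
]

def current_term(terms, today):
    """Return the term code whose window contains today, like the
    SQL filter: today() > start_date and today() < end_date."""
    for code, begd, endd in terms:
        if begd < today < endd:
            return code
    return None

# The equivalent of call symput('study', studyterm):
study = current_term(terms, date(2008, 10, 1))
print(study)
```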

Construct some macros to help pass anomaly parameters

• Use up to 3 variables to show people what is anomalous about any particular situation. Name them A, B, and C. The values you pass to these variables will differ with each problem.
• You will need to put both the value and the name of the variable into your anomaly report, plus some identifier(s) for the student or entity, plus some anomaly details.

Construct some macros to help pass anomaly parameters

*Macro to help transfer of anomalies;
%macro keep;
(keep= id person_uid aval aval_desc bval bval_desc cval cval_desc
       anom studyterm first_anom_date suspend_date data_own)
%mend keep;

Construct some macros to help pass anomaly parameters

%macro pass (aval=,bval=,cval=,dsn=,anom=,studyterm=,suspend_date=,data_own=);
aval_desc = "&aval";
bval_desc = "&bval";
cval_desc = "&cval";
aval = put(&aval,$25.);
bval = put(&bval,$25.);
cval = put(&cval,$25.);
anom = &anom;
studyterm = put(&studyterm,$6.);
first_anom_date = today();
suspend_date = &suspend_date;
data_own = &data_own;
%mend pass;

Construct some macros to help pass anomaly parameters

PASSES THE NAME OF THE VARIABLE

PASSES char VALUE OF THE VARIABLE

PASSES OTHER VALUES

proc sql;
connect to odbc as mydb (datasrc="&datasrc" user=&user password=&password);
create table addressc as
select * from connection to mydb (
  select m.PERSON_UID, m.id, n.address_type, n.postal_code,
         n.city, n.county, n.state_province, n.nation
  from &table_enroll m
  inner join &table_addcurrent n
    on m.person_uid = n.entity_uid
  where m.ACADEMIC_PERIOD in (&study)
    and ((m.ENROLLED_IND='Y' and m.REGISTERED_IND='Y')
      or (m.ENROLLED_IND='Y' and m.ENROLLMENT_STATUS in (&es_validset)))
    and n.address_type in ('IN', 'P1', 'MA')
);

Acquire data for testing from your datastore

…from &table_enroll m
inner join &table_addcurrent n
  on m.person_uid = n.entity_uid
where m.ACADEMIC_PERIOD in (&study) and …
  and n.address_type in ('IN', 'P1', 'MA')

STUDENTS THIS TERM

ALL ADDRESSES

ADDRESSES TO CHECK

Acquire data for testing from your datastore

Let your server do the heavy data work

[Diagram: the SQL call goes to the Oracle® server; only the result set comes back]

• The most useful technique for detecting a table that violates record premises is by joining the table back to a copy of itself that has been summarized by the number of expected rows, keeping only those records that don’t meet expectations.

• A SQL statement that uses a count() function, a ‘group by’ clause, and a ‘having’ clause does this job effectively.

Use Proc SQL and other procedures to test logic/values

select l.*
from (select PERSON_UID, ID, ACADEMIC_PERIOD, PROGRAM,
             PRIMARY_PROGRAM_IND, ADMISSIONS_POPULATION
      from &table_academic_study
      where academic_period in (&study)) l
inner join
     (select person_uid
      from &table_academic_study
      where academic_period in (&study)
      group by person_uid, program
      having count(person_uid) > 1) r
on l.person_uid = r.person_uid

Use Proc SQL and other procedures to test logic/values

select person_uid
from &table_academic_study
where academic_period in (&study)
group by person_uid, program
having count(person_uid) > 1

Use Proc SQL and other procedures to test logic/values

PERSON_UID  PROGRAM  ACADEMIC_PERIOD  COUNT
1234        BA-GOVT  200910           1
1234        BA-GOVT  200910           1

PERSON_UID  PROGRAM  ACADEMIC_PERIOD  COUNT
1234        BA-GOVT  200910           2
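The summarize-then-join-back pattern is plain SQL, so it can be tried anywhere. Here is a minimal sketch in Python with sqlite3; the table and column names mirror the slides, but the rows are invented:

```python
import sqlite3

# In-memory table standing in for the academic_study dataset.
con = sqlite3.connect(":memory:")
con.execute("""create table academic_study
               (person_uid int, program text, academic_period text)""")
con.executemany("insert into academic_study values (?,?,?)",
                [(1234, "BA-GOVT", "200910"),
                 (1234, "BA-GOVT", "200910"),   # the duplicate record
                 (5678, "BS-BIOL", "200910")])

# Summarize by the expected key, keep groups that violate the
# one-row premise, then join back to pull the full offending rows.
rows = con.execute("""
    select l.person_uid, l.program
    from academic_study l
    inner join (select person_uid
                from academic_study
                where academic_period = '200910'
                group by person_uid, program
                having count(person_uid) > 1) r
    on l.person_uid = r.person_uid
""").fetchall()
print(rows)  # both copies of the duplicated record come back
```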

select l.person_uid, r.person_uid as other_uid,
       l.tax_id, r.tax_id as other_tax_id, l.full_name_lfmi
from person l
inner join person r
  on l.TAX_ID = r.TAX_ID
  and l.person_uid <> r.person_uid
where l.tax_id is not null
  and r.tax_id is not null

This SQL will find two different university ids that share the same SSN. This generally occurs when the institution has issued two ids to the same person without an adequate search of records. The faster this is spotted, the better.
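The self-join on tax_id can also be exercised in miniature. A sqlite3 sketch with invented sample data (note that each offending pair surfaces twice, once from each side of the join):

```python
import sqlite3

# Tiny stand-in for the person table; data is invented.
con = sqlite3.connect(":memory:")
con.execute("create table person (person_uid int, tax_id text)")
con.executemany("insert into person values (?,?)",
                [(1, "123456789"), (2, "123456789"), (3, "987654321")])

# Two different uids sharing one tax_id (SSN).
pairs = con.execute("""
    select l.person_uid, r.person_uid as other_uid, l.tax_id
    from person l
    inner join person r
      on l.tax_id = r.tax_id and l.person_uid <> r.person_uid
    where l.tax_id is not null and r.tax_id is not null
""").fetchall()
print(pairs)
```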

Use Proc SQL and other procedures to test logic/values

In addition to testing table logic, once you have the datasets established, any variety or combination of values can be tested. Here are four conditions to get you started thinking about what can be tested:

- missing values;
if citizenship_type = '' then do;

- withdrawn with classes;
if enrolled_ind = 'Y' and registered_ind = 'Y' and enrollment_status = 'WB' then do;

Use Proc SQL and other procedures to test logic/values

if state_province in ('AA','AE','AP','PR','VI','AL','AK','AZ','AR','CA', … 'WI','WY')
   and nation > ''
or state_province in ('AB','BC','MB','NB','NL','NT','NS','NU','ON','PE','QC','SK','YT')
   and nation ^= 'CA'
or nation = 'CA' and state_province not in
   ('AB','BC','MB','NB','NL','NT','NS','NU','ON','PE','QC','SK','YT')
or state_province in ('FC','HK','RQ','XX') then do;
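The same state/nation consistency test restated in Python, as an illustration only. The code lists here are abbreviated; a real check would carry the full US state and Canadian province lists from the slide:

```python
# Abbreviated stand-ins for the full code lists on the slide.
US_STATES = {"AA", "AE", "AP", "PR", "VI", "AL", "AK", "VA", "WI", "WY"}
CA_PROVINCES = {"AB", "BC", "MB", "NB", "NL", "NT", "NS",
                "NU", "ON", "PE", "QC", "SK", "YT"}

def address_anomaly(state_province, nation):
    """True when the state and nation codes contradict each other."""
    return (
        (state_province in US_STATES and nation > "")           # US state plus a nation code
        or (state_province in CA_PROVINCES and nation != "CA")  # Canadian province, wrong nation
        or (nation == "CA" and state_province not in CA_PROVINCES)
        or state_province in {"FC", "HK", "RQ", "XX"}           # placeholder codes
    )

print(address_anomaly("VA", ""))  # consistent US address -> False
print(address_anomaly("ON", ""))  # Ontario without nation 'CA' -> True
```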

Use Proc SQL and other procedures to test logic/values

- ssn out of range;
if tax_id ^= '' then do;
ssnverf = indexc(substr(tax_id,1,9),' ','-abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ');

if substr(tax_id,6,4) = '0000'
   or substr(tax_id,1,3) < '001'
   or substr(tax_id,4,2) = '00'
   or substr(tax_id,1,3) > &highestssn
   or ssnverf > 0 then do;
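A hedged Python equivalent of the SSN range test above. The &highestssn value was '772' in the setup slide; the substring positions follow the AAA-GG-SSSS layout (area = chars 1-3, group = chars 4-5, serial = chars 6-9):

```python
HIGHEST_SSN_AREA = "772"  # the &highestssn macro value from the setup

def ssn_out_of_range(tax_id):
    """True when tax_id fails the same checks as the SAS snippet."""
    if tax_id == "":
        return False
    first9 = tax_id[:9]
    # like indexc(): any blank, dash, or letter in the first nine chars
    if any(c == " " or c == "-" or c.isalpha() for c in first9):
        return True
    return (tax_id[5:9] == "0000"       # serial all zeros
            or tax_id[0:3] < "001"      # area below range
            or tax_id[3:5] == "00"      # group all zeros
            or tax_id[0:3] > HIGHEST_SSN_AREA)

print(ssn_out_of_range("123456789"))  # in range -> False
print(ssn_out_of_range("123450000"))  # zero serial -> True
```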

Use Proc SQL and other procedures to test logic/values

data T_person_v %keep;
set person;

*test3 - missing value;
anom = 0;
if citizenship_type = '' then do;
  %pass(aval=citizenship_type, bval=full_name_lfmi, cval=, dsn=person,
        anom=3, studyterm=&study, suspend_date=., data_own='reg')
end;
if anom > 0;
run;

Pass anomalies found into a transaction table

The KEEP Macro in place

The PASS Macro in place

The Anomaly Testing

*test6 - two ids, one ssn;
data T_person_v3 %keep;
set anom_person2;
%pass(aval=tax_id, bval=other_tax_id, cval=other_id, dsn=person,
      anom=6, studyterm=&study, suspend_date=., data_own='reg')
run;

Pass anomalies found into a transaction table

The KEEP Macro

PASS

Nomenclature of Test Datasets - T_area_XN
where T = Test
      area = broad anomaly category
      X = v(ariable based) or a(ssumption of table logic violated)
      N = incremental test set number

Dataset T_person_v3 is the third dataset testing the demographic (person) table for variable-based anomalies such as missing values or incompatible statuses.

Pass anomalies found into a transaction table

*create temp work dataset for today's transactions;
data today;
length aval_desc bval_desc cval_desc $25;
set
  T_addressc_v1
  T_addressc_v2
  T_addressc_a
  T_acadstudy_a
  T_ccat_a
  T_person_a
  T_person_v1
  T_person_v2
  T_person_v3
  {all the transaction sets};
run;

Pass anomalies found into a transaction table

Remember to unduplicate all the duplicated records found! You only need one example record per anomaly.

proc sort NODUPKEY data=today;by person_uid anom aval bval cval studyterm;run;

Failure to do so may result in multiple joins in the next step, when you update the master dataset.
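The NODUPKEY behavior itself is simple to state: keep one example record per anomaly key, first occurrence wins. A pure-Python sketch with invented rows:

```python
def nodupkey(rows, key):
    """Deduplicate rows on key, keeping the first occurrence,
    like proc sort NODUPKEY."""
    seen, out = set(), []
    for row in rows:
        k = tuple(row[f] for f in key)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

# Invented 'today' transactions: one duplicated anomaly record.
today_rows = [
    {"person_uid": 1234, "anom": 2, "aval": "BA-GOVT", "studyterm": "200910"},
    {"person_uid": 1234, "anom": 2, "aval": "BA-GOVT", "studyterm": "200910"},
    {"person_uid": 5678, "anom": 3, "aval": "", "studyterm": "200910"},
]
deduped = nodupkey(today_rows, ("person_uid", "anom", "aval", "studyterm"))
print(len(deduped))  # one example record per anomaly survives
```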

Pass anomalies found into a transaction table

Anomalies are added or deleted from the master

Today’s Set → Master Set: add new records, keep matching records, delete non-matches.

data master;
update master today;
------------or-------------
proc sql;
select * from
oldmaster l
right join
today r
on {criteria}

proc sql;
create table newmaster as
select r.person_uid, r.id, r.anom, r.aval, r.aval_desc, {other variables},
       coalesce(l.first_anom_date, r.first_anom_date) as first_anom_date
from oldmaster l
right join today r
  on l.person_uid = r.person_uid
  and l.anom = r.anom
  and l.aval = r.aval
  and l.bval = r.bval
  and l.cval = r.cval
  and l.studyterm = r.studyterm;
quit;
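The right-join-with-coalesce logic can be sketched as a small Python merge: today's anomalies define the new master (so fixed anomalies drop out), but a first_anom_date already on file wins over today's date. Keys and rows here are illustrative:

```python
from datetime import date

def update_master(oldmaster, today_rows, today_date):
    """Right join on the anomaly key; coalesce first_anom_date."""
    old = {r["key"]: r for r in oldmaster}
    newmaster = []
    for r in today_rows:
        prior = old.get(r["key"])
        # coalesce(l.first_anom_date, r.first_anom_date)
        first = prior["first_anom_date"] if prior else today_date
        newmaster.append({"key": r["key"], "first_anom_date": first})
    return newmaster

# One anomaly already on file, one brand new today.
oldmaster = [{"key": (1234, 6), "first_anom_date": date(2009, 1, 5)}]
today_rows = [{"key": (1234, 6)}, {"key": (5678, 3)}]
new = update_master(oldmaster, today_rows, date(2009, 2, 1))
print(new)
```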

Anomalies are added or deleted from the master

The master table is subset and sent as email

data reg ban ir adm grr soe bur law;
set master;
output ir; *for all records;
if data_own='reg' then output reg;
else if data_own='ban' then output ban;
else if data_own='adm' then output adm;
else if data_own='grr' then output grr;
else if data_own='soe' then output soe;
else if data_own='bur' then output bur;
else if data_own='law' then output law;
run;

The master table is subset and sent as email

proc export data=reg dbms=excel2002 outfile='g:\temp\reg.xls' replace;
run;

proc export data=ban dbms=excel2002 outfile='g:\temp\ban.xls' replace;
run;

proc export data=ir dbms=excel2002 outfile='g:\temp\ir.xls' replace;
run;

The master table is subset and sent as email

filename reports email "esdav2@wm.edu";
data _null_;
file reports;
set departments;
put '!EM_TO! ' name;
put '!EM_SUBJECT! Report for ' dept;
put 'Hi ' fname ' -';
put 'Here is the latest report of anomalies for the ' dept '.';
if dept='ban' then put '!EM_ATTACH! g:\temp\ban.xls';
else if dept='reg' then put '!EM_ATTACH! g:\temp\reg.xls';
else if dept='ir' then put '!EM_ATTACH! g:\temp\ir.xls';
put '!EM_SEND!';
put '!EM_NEWMSG!';
put '!EM_ABORT!';
run;
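For comparison, the per-department message assembly can be sketched in Python with the standard email library. Addresses, names, and the attachment-path header are placeholders; actual delivery (e.g. via smtplib) is omitted here:

```python
from email.message import EmailMessage

def build_report_email(dept, name, fname):
    """Build one anomaly-report message per department row."""
    msg = EmailMessage()
    msg["To"] = name
    msg["Subject"] = f"Report for {dept}"
    msg.set_content(
        f"Hi {fname} -\n"
        f"Here is the latest report of anomalies for the {dept}."
    )
    # Placeholder header mirroring the g:\temp\<dept>.xls convention;
    # a real mailer would attach the file instead.
    msg["X-Attachment-Path"] = rf"g:\temp\{dept}.xls"
    return msg

msg = build_report_email("reg", "registrar@example.edu", "Sam")
print(msg["Subject"])
```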

Success!

(Screenshot callouts: Subject line, Attached xls, Name, Department)

Technical Hurdles Along The Way

• Don’t use your Outlook mailer. Specify your SMTP mail service as the mailer. You may have to pass your authentication through the SAS sasv9.cfg file to allow SMTP mailing.

• Depending on your network, you may have to use “Cscript” as the keyword in the scheduler to launch the program, rather than the implicit “Wscript”. ‘C’ stands for ‘console’.

Lessons Learned

• Send all output to yourself for several days to review, before allowing it to be sent out automatically

• Send lower priority anomalies infrequently, and high priority ones weekly or daily

• Send notice of table violations infrequently to IT, once they have identified a problem and resolution path. (Don’t bug them if they, too, are waiting for a vendor patch or fix)

Lessons Learned

• Be aware of the length of time the job takes to execute. You may need to adjust what and when you test if it starts taking too much time

• Bring your ‘users’ into the process by asking them what you can do differently to help them. Do they need another variable to help isolate problems? Alternate ID?

Auto-Validating Your Data Store
Evan Davies

Presentation available online at: http://web.wm.edu/ir/conferencepres.html
