
USING THE SAS/IML® SOFTWARE FOR ADVANCED STATISTICAL NEEDS, WITH AN EXAMPLE ON CREDIT SCORECARD BUILDING

PAUL C K CHIK

1 INTRODUCTION

1.1 Background

My first experience of using the SAS® System to build scorecards was at Freemans Plc, one of the biggest mail order companies in the United Kingdom. A scorecard is a tool, developed using statistical techniques, that aids credit granting decisions. Its objective is to enable a company processing an application for credit to accept potentially good customers and to reject potentially bad ones. The scorecard building program was originally written in APL, a mathematical programming language. However, because APL was unpopular within the department, the program has since been translated into SAS and enhanced using the SAS/IML® software. The result is a great improvement in both the sample size that can be used and the time taken to run the program. Under APL, the maximum sample size that could be processed was 25,000 observations, which took about 4 hours. After the changeover to IML, the program can process a sample of 40,000 observations in just over 2 hours. The scorecard program was developed using the mainframe SAS System, version 6.06. The testing, implementation and all subsequent running of the program are carried out under the MVS operating system.

1.2 What is IML

The SAS/IML software consists of the SAS IML procedure, which implements a programming language called the Interactive Matrix Language. It is invoked as a PROC step like any other; however, unlike other PROC steps, the IML procedure has a very different nature and properties, which we shall look at later.

IML is a multi-level interactive programming language. Its syntax and notation are rather different from those of the SAS/BASE® software. Users can write IML programs to build highly sophisticated models, for example scorecards, or to carry out operations as simple as basic data processing. It is as flexible as many other programming languages, for example APL, which will be discussed in section 2. However, most programming languages are low-level languages, which means users must program every detail, e.g. variable declarations. IML is a high-level programming language, but it is designed to cater for low-level programming requirements as well.


Most programming languages deal with single data elements. IML deals with matrices, which are arrays of elements stored under a single variable name. A single IML variable can therefore hold anything from one item to thousands of different values. Any IML command applies to every element of a matrix, and the dimensions of a matrix variable can be changed without prior declaration or allocation. Hence IML programs are more compact, easier to write and usually need fewer programming statements. For example, to add two matrices A and B, instead of coding a loop you simply write A+B. Because of this compactness, IML programs tend to be easier to manage, particularly when debugging. It also cuts down the housekeeping specifications that can easily distract from the original problem, so users can stay close to the heart of that problem, which gives opportunities for greater insight.
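To make the contrast between element-by-element code and whole-matrix expressions concrete, here is a minimal sketch. It is written in Python rather than IML purely for illustration, and the `Matrix` class and `add_with_loops` function are hypothetical names, not part of the scorecard program.

```python
class Matrix:
    """Hypothetical minimal matrix type, used only to illustrate the idea."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __add__(self, other):
        # Whole-matrix addition: one expression covers every element.
        if len(self.rows) != len(other.rows) or any(
            len(a) != len(b) for a, b in zip(self.rows, other.rows)
        ):
            raise ValueError("matrices do not conform")
        return Matrix(
            [[x + y for x, y in zip(a, b)] for a, b in zip(self.rows, other.rows)]
        )


def add_with_loops(a, b):
    """Element-by-element addition, as a scalar-oriented language needs."""
    result = []
    for i in range(len(a)):
        row = []
        for j in range(len(a[i])):
            row.append(a[i][j] + b[i][j])
        result.append(row)
    return result


A = Matrix([[1, 2], [3, 4]])
B = Matrix([[10, 20], [30, 40]])
C = A + B  # the matrix-language style: no explicit loop at the call site
```

In a matrix language the looping shown in `add_with_loops` is done by the language itself, so the user writes only the last line.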

2 IML & APL

APL stands for A Programming Language. It is a low-level programming language that deals with matrices. However, it uses a special set of symbols rather than the English-like style of IML; two simple examples are given later for illustration. Apart from its unique symbols, APL is only available on special mainframe terminals. If APL is used on a personal computer, the keyboard has to be reorganised before any APL program can be run. Nevertheless, APL is still the most powerful programming language I have used, and it is claimed to be the ideal tool for mathematicians solving a wide range of problems.

IML can be considered a substitute for APL. To a certain extent, they are very similar. Both deal with matrices and both can be used to solve a wide range of technical problems. However, because the SAS System already provides for all our basic programming needs, for example the PROC PRINT step for customised output, it is easier to use the SAS System than APL, with no loss of functionality. Although users can program in APL to achieve the same results as the PROC PRINT step, a great deal of time will be spent on the programming itself, which is not the prime objective. The main difference between IML and APL is their syntax. IML reads and processes each line of code from left to right, whilst APL reads and processes each line of code from right to left.

The following two simple examples show their similarities and differences. In Example 1, A is assigned the total number of occurrences of the character 'G' in matrix STAT. In Example 2, variable D is read from an external file into B in working memory only where C has a value of 1. It can easily be seen that the IML statements can be understood just by reading the code, whilst the APL statements are almost unreadable. Nevertheless, their structures are very similar: both languages use only one line to do the required task. In SAS/BASE, the same tasks would require one or more DATA and/or PROC steps.

Example 1.

In APL: A ← +/STAT='G'

In IML: A = sum(STAT='G');

Example 2.

In APL: B ← DAV 1 C/ ~ 'D'

In IML: read all var{D} where (C=1) into B;

3 Features of IML

In the previous section, I have compared IML with APL. We will now look at the features of IML in more detail. In general, they can be divided into Internal and External. Internal features are the built-in functionalities which include the Language Reference, Data Processing commands, Library and User written function facilities, and the Call routines which are similar to the Macro language of the SAS/BASE software. External features are functionalities specially created by users, in my case for scorecard building purposes. They include the statistics of Scorecard building, the Sort & Restoring facilities and Memory Management which is vitally important when running IML programs with a large data sample.

3.1 The Language Reference

IML has a very powerful vocabulary of operators, and all other features that a programming language must possess. It can also gain access to the functions and the Macro facility provided by the SAS/BASE software. To strengthen the power of IML further, the software has the ability to allow users to create their own functions and function libraries. It is this ability which makes the scorecard building program written in IML a reality.

3.2 Data Processing

IML has a complete set of data processing commands with which users can manipulate data in SAS data sets without leaving the IML session. The commands can also create flat files in the same way as SAS/BASE. USE, EDIT and SETIN are the three commands used to read data from SAS data sets, and CREATE and SETOUT are used to write to SAS data sets. IML data processing commands allow users not only to manipulate data, but also to change the physical structure of the original data sets. This is particularly useful when further calculations are to be performed on previous results. For example, suppose a SAS data set contains the information needed to calculate a regression line three times, each calculation using new information derived from the previous one. This could be done using the SAS/BASE and SAS/STAT® software by combining a series of DATA, PROC REG, PROC SCORE and PROC PRINT steps. In IML, it is simply done within a loop, with a straightforward calculation of the three regression lines required. All intermediate results can be created and stored in one or more SAS data sets, which can then be printed by a PROC PRINT step after leaving IML. The example shows that by combining one DATA step and one PROC PRINT step with IML, results are easily stored away and retrieved for later use. At this point, users are reminded that IML is not in competition with other PROC steps within the SAS System. It is the power and efficiency of mixing the appropriate steps together that users should pursue.

3.3 Library and User written functions

There are more than 150 IML functions, which can be applied to either a scalar quantity or an entire matrix. The functions are stored in different libraries serving different purposes. These include the Graphics Calls and Time Series function libraries as well as the standard function library of IML. Any of these functions can be grouped together in a user written Function Module and given a name. The function module can then be invoked in an assignment statement just like a standard function. It can be used to perform a specific task repeatedly, if necessary, by calling its name and supplying the appropriate arguments. In Appendix A, the function module INNER is a user written function module which can perform over a hundred operations. It requires two matrix arguments and two operators. The operators must come from either the arithmetic, comparison or logical categories. This function module simulates the Product operator of APL, which is also called the INNER product function. The APL function behaves in exactly the same way as the IML INNER function module, but it can perform up to 430 different operations. The most common uses of this function are:

-locating incidences of given character strings within textual data

-evaluation of polynomials

-matrix multiplication

-product of powers
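The idea of an inner product driven by two operators can be sketched as follows. This is a Python illustration under stated assumptions, not a translation of the INNER module in Appendix A: the function name `inner`, its argument names and the sample data are hypothetical, and the reduction runs from left to right, as the loop over K in Appendix A does.

```python
from functools import reduce
import operator


def inner(a, b, oper1, oper2):
    """Generalised inner product of two matrices given as lists of rows.

    oper2 combines matching elements of a row of `a` and a column of `b`;
    oper1 then reduces those results to a single value.
    """
    if not a or not b or any(len(row) != len(b) for row in a):
        raise ValueError("Matrices do not conform to INNER operation.")
    n_cols_b = len(b[0])
    result = []
    for row in a:
        out_row = []
        for j in range(n_cols_b):
            column = [b_row[j] for b_row in b]
            paired = [oper2(x, y) for x, y in zip(row, column)]
            out_row.append(reduce(oper1, paired))
        result.append(out_row)
    return result


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# oper1 = +, oper2 = *  gives ordinary matrix multiplication.
product = inner(A, B, operator.add, operator.mul)

# oper1 = and, oper2 = ==  tests whether a row matches a column exactly,
# which is the basis of locating a string within textual data.
words = [list("CAT"), list("DOG")]
target = [[c] for c in "DOG"]  # a 3 x 1 column
matches = inner(words, target, operator.and_, operator.eq)
```

Swapping the two operators is all that is needed to move between the uses listed above, which is why a single module can cover so many operations.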

Apart from the function module mentioned above, users can create any number of function modules, either for general use or for specific purposes. These function modules can also be stored permanently in user created function libraries for future recall. Each time the SAS/IML software is loaded, every function library must be given a file reference name so that it can be accessed during that session. The advantage of this facility is that users can write any IML program with less code, which also makes it easier to read and debug.

3.4 Call Routines & Macro Language

In this section, I compare the CALL PUSH routine of IML with the Macro Language of SAS/BASE. Like the Macro Language, the CALL PUSH routine allows users to generate programming statements. The difference between them is that CALL PUSH is processed at execution time, whereas pure macro processing is done only at parsing time. The CALL PUSH routine can also be used in any user written function module, as shown in the example in Appendix A. Variable names can be generated as well, which is particularly useful when an argument of a function is not known in advance. I ran a very simple test to compare their speed, and the result showed that the CALL PUSH routine is marginally faster than the Macro Language.

3.5 Scorecard Statistics

Sections 3.1 to 3.4 cover the internal features of IML. The next three sections take a closer look at some external features specially created and used during the development of the scorecard building program. The basic statistical techniques used in the program are Maximum Likelihood Estimation and Logistic Regression. Because scorecard building is iterative, these techniques are used repeatedly, so a specially written module was devised. This module takes care of all parameter estimation, whatever stage the scorecard building process has reached. It is invoked by supplying two matrix arguments and a flag value. The flag value tells the module what stage the building process is at, and the calculation of parameters changes accordingly. The function module is similar to the example set out in the IML manual in the chapter 'Application: Statistical Examples'. In fact, this chapter covers a wide range of basic statistical and Operational Research models. I personally find the chapter extremely useful for two reasons: I can use it as a reference guide, and it has given me better insight into the actual calculation mechanisms of the techniques used in the scorecard building process. This is in contrast with using the PROC LOGISTIC step. The procedure gives you only the results, whereas the function module written in IML has helped me to understand how and why these figures are derived. Note also that most of the examples in that chapter work with grouped data. The code printed in the book therefore has to be rearranged before it will work with ungrouped data.
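As an illustration of the calculation mechanism referred to here, the following is a minimal sketch of maximum likelihood estimation for a logistic regression on ungrouped data, using Newton-Raphson iteration. It is written in Python for illustration and is not the scorecard module: it assumes a single predictor plus an intercept, and the function name `fit_logistic` and the sample data are made up.

```python
import math


def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson.

    x and y are equal-length lists of ungrouped observations; y holds 0/1.
    Returns the estimates (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Gradient (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise ValueError("information matrix is singular")
        # Newton step: solve the 2 x 2 system H * step = g.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1


# Made-up ungrouped sample: one row per applicant, y = 1 for a "good" outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logistic(x, y)
```

Each pass recomputes the fitted probabilities, the gradient of the log-likelihood and the information matrix, then updates the estimates until the step becomes negligible. With more predictors the 2 x 2 solve would be replaced by a general matrix solve, which is where a matrix language is convenient.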

3.6 Sorting & Restoring

Our scorecard building model uses an iterative process. At the end of each iteration, the newly calculated scores have to be sorted in ascending order. However, because of the limited memory space, which is discussed in the next section, not all the data required to build the scorecard can be loaded into working memory at any one time. This is particularly frustrating when a large sample is needed to build a good scorecard. One way to solve the problem is to load only the data currently in use into memory, one matrix at a time. That means that once the scores matrix has been sorted, every other matrix must be sorted in the same way when it is loaded into memory, before any calculation can take place. IML has the perfect built-in function for this tricky situation: the RANK function. Instead of sorting a matrix directly, the RANK function creates a new index matrix whose elements are the ranks of the corresponding elements of the original matrix. In the above example, once the ranks of the scores matrix are obtained, the scores can be sorted in ascending order according to the ranks given by the index matrix. All other matrices are then sorted using the same ranks. This technique can be used to sort any matrix in ascending or descending order. It can also be used to restore matrices to their original order, by ranking the ranks and then sorting the matrix in ascending order according to the new ranks. The function module GRADE, shown in Appendix B, can sort any matrix in either ascending or descending order. The order is signalled by giving a 'Y' or 'N' response to argument DN. If DN is given a 'Y', argument MAT is sorted in descending order; otherwise it is sorted in ascending order according to argument R, which holds the ranks of argument MAT.
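The rank-based technique can be sketched as below. This is a Python illustration, not the GRADE module itself: `rank` and `place_by_rank` are hypothetical helpers, the data are made up, and the restoring step shown (carrying a vector of original positions through the same sort) is one possible reading of the description, not necessarily the author's exact procedure.

```python
def rank(values):
    """Return 1-based ranks: the smallest value gets rank 1.

    Ties are broken by original position.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def place_by_rank(values, ranks):
    """Put values[i] into slot ranks[i], in the spirit of MAT[R,] = Temp."""
    out = [None] * len(values)
    for value, r in zip(values, ranks):
        out[r - 1] = value
    return out


scores = [620, 540, 700, 580]
accounts = ["a", "b", "c", "d"]  # a companion matrix loaded separately

r = rank(scores)
sorted_scores = place_by_rank(scores, r)
sorted_accounts = place_by_rank(accounts, r)  # same ranks, same ordering

# Descending order: reverse the ranks, as GRADE does when DN is 'Y'.
r_desc = [len(scores) + 1 - k for k in r]
descending_scores = place_by_rank(scores, r_desc)

# Restoring: sort the original positions 1..n with the same ranks, then
# use the result as the ranks for the sorted data.
back = place_by_rank(list(range(1, len(scores) + 1)), r)
restored_scores = place_by_rank(sorted_scores, back)
```

Because only the rank vector has to be kept, each companion matrix can be brought into line whenever it is loaded, without holding all the matrices in memory at once.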

3.7 Memory Management

All users must have experienced insufficient memory problems when running programs on large data samples. This is particularly true of scorecard building, as the process involves decision making based on large sample statistical inference. To build a safe and sound scorecard, a very large sample must be collected to ensure that there are enough observations for the least used attributes as well as the most commonly used ones, an attribute being the principal decision making unit to be scored in a scorecard. As IML is a very memory intensive procedure, running a scorecard building program will certainly cause memory problems. To tackle them, a compromise between hardware and software has to be reached. I successfully built scorecards with 40,000 observations on an IBM mainframe under the MVS operating system by increasing the RAM size to 8 megabytes. However, these scorecards could not have been built without the 'Read In Rub Out' technique. As mentioned in the previous section, the data required to build a scorecard does not have to remain in working memory at all times. Data that has been processed can therefore be rubbed out of memory temporarily and read back in at a later stage. The user must make sure that the data is retrieved in the same order as the data already in memory. This can be safeguarded by using the RANK function described earlier. After successfully building scorecards on the mainframe, I went on to try on a personal computer, using the PC DOS version of SAS. Unfortunately, the PC DOS version only allows matrices to have a maximum of 4095 elements, regardless of the RAM size. Building scorecards on a PC is therefore very restricted and requires a major breakthrough. It is recommended that all potential users carry out a test to ascertain the size of RAM available to them on a mainframe. As a rule of thumb, the total number of numeric matrices that can be processed simultaneously in working memory is estimated as below:

Total number of numeric matrices

= (Size of RAM) ÷ (Total number of matrix elements × 8)

4 Advantages & Disadvantages

4.1 Advantages

IML is a very flexible programming language to use. By combining the power of the SAS/IML and SAS/BASE software, it provides the ability to solve very complicated technical problems. It lets users control the flow of data through the use of data processing commands, so that interactions between IML matrices and SAS data sets or flat files, are easily carried out. Also the structure and contents of SAS data sets can be manipulated without leaving the IML session. Moreover, the freedom in programming is a strength that is not matched by any other procedure steps, as simple and complex calculations can be performed within the IML procedure step.

4.2 Disadvantages

However, IML is very memory intensive. Great care must be taken to avoid the program being interrupted because of insufficient memory. Another weakness is that IML requires more programming than other procedures. For example, if a logistic regression line is to be calculated, the PROC LOGISTIC step is easier to use. This is because IML requires all the finest technical details to be programmed, as well as printing the output. However, IML enables greater versatility in model building than the logistic procedure.

5 Conclusion

I have compared IML with APL, and explained the connections between the SAS/IML and SAS/BASE software. I have also discussed how the features, both internal and external, can help us to develop sophisticated models such as scorecards. All these topics lead to one conclusion: IML has opened a new chapter in the use of the SAS System. Enthusiastic mathematicians and statisticians can use IML to develop their modelling tools, while managers and other users can forget about questions like 'can this problem be solved by this particular software?' and concentrate instead on how to solve the problem.


Address for all communications:

26 Hamble Road, Brickhill, Bedford MK41 7XB, United Kingdom

Acknowledgement

I would like to express my deepest thanks to Dr Martin Blackwell of Freemans Plc for giving me technical advice during the development of the scorecard building program. My special thanks also go to Miss Lynne Pritchard and Mr Chris Sykes for their assistance during the preparation of this paper.

References

SAS/IML® User's Guide, Release 6.03 Edition

SAS® Language Guide for Personal Computers, Release 6.03 Edition

SAS® Technical Report P.200, Release 6.04

APL-68000 Language Reference, Micro APL

SAS, SAS/IML, SAS/STAT are registered trademarks of SAS Institute Inc., Cary, NC, USA


Appendix A

proc IML;   /* load IML */

/* Function module INNER */
start INNER(ARGU1, ARGU2, OPER1, OPER2);
   /* the Column dimension of ARGU1 must equal
      the Row dimension of ARGU2 */
   C1 = ncol(ARGU1); R1 = nrow(ARGU1);
   C2 = ncol(ARGU2); R2 = nrow(ARGU2);

   if C1 ^= R2 then do;
      /* Column dimension of ARGU1 is not equal to
         Row dimension of ARGU2 */
      print "Matrices do not conform to INNER operation.";
      abort;
   end;
   else do;
      do I = 1 to R1;
         do J = 1 to C2;
            /* apply OPER2 element by element to row I and column J */
            call push("Res2 = ARGU1[I,] ", OPER2, " (ARGU2[,J]`); Resume;");
            pause *;
            Res = Res2[,1];
            if C1 ^= 1 then do;
               do K = 2 to C1;
                  /* reduce the element results with OPER1 */
                  call push("Res = Res ", OPER1, " Res2[,K]; Resume;");
                  pause *;
               end;
            end;
            free Res2;
            Vec = Vec || Res;
            free Res;
         end;   /* End of J loop */
         Mat = Mat // Vec;
         free Vec;
      end;   /* End of I loop */
      return(Mat);
      free Mat ARGU1 ARGU2 C1 C2 R1 R2 OPER1 OPER2;
   end;
finish;   /* End of function module INNER */

quit;   /* quit IML */


Appendix B

proc IML;   /* load IML */

/* Function module GRADE */
start GRADE(MAT, R, DN);
   Temp = MAT;
   if DN = 'Y' | DN = 'y' then R = nrow(MAT) + 1 - R;
   MAT[R,] = Temp;
   return(MAT);
   free MAT Temp;
finish;   /* End of function module GRADE */

quit;   /* quit IML */
