sas basics 3-7-08

Upload: amit-sahoo

Post on 07-Apr-2018


  • 8/4/2019 SAS Basics 3-7-08


    Revised 3/7/2008

    SAS Basics
    By Ted Christensen

    Brigham Young University

    I. Introduction

    The purpose of this handout is to give you a brief overview of how you can use SAS in your research. It is not intended to replace the many SAS user's guides and manuals that are available. There are several Windows-based statistical packages (like SPSS) available on the market that allow you to do basic (and even some advanced) statistical analysis. They are nice because they allow you to simply point and click. There is also a PC version of SAS, which contains essentially all of the same features as the mainframe version we'll be learning about today. The biggest limitation of PC SAS is the speed of your processor. These packages work fine with smaller datasets, but when you get into huge capital markets studies with thousands of observations, you are better off using a mainframe. In addition, the mainframe version of SAS offers the most powerful and comprehensive set of tools I'm aware of anywhere. If you learn SAS, it will be easy for you to pick up other packages later. (Besides, you can all sign up for a WRDS account for free and have unlimited access to SAS!)

    II. Getting Started

    A. Programming

    While we often refer to the work we do in SAS as "programming," in reality the programming has already been done in the background so that we can simply execute the canned routines that real programmers have prepared for us. We'll be learning about some of the most commonly used SAS routines today.

    B. Mainframe Platforms

    I've used two different platforms over the years (IBM and Unix). The SAS program runs exactly the same on both platforms. In other words, the output and the programming syntax are the same. The only differences have to do with the setup and the output format. I've seen two different operating systems on IBM machines (TSO and CMS). The IBM TSO system is very old (probably 1960s or 1970s vintage) and requires you to add some additional code (called JCL) at the beginning of your program to tell the computer where all of your input and output files are, etc. JCL is somewhat difficult to learn. From what I've observed, CMS is newer and is very similar to Unix systems. While CMS and Unix require some statements up front to point the machine to your input/output files, they are much easier to master than JCL! In terms of output, TSO merges your SAS log and output files into one output file. Unix and IBM CMS systems give you two separate output files. The SAS log file (progname.log) gives you feedback about what SAS is doing at each step of the program. This is helpful in debugging your program and checking to make sure it is executing as you intend it to. The list file or output file (progname.lst) contains the output you've asked SAS to provide, including printed lists of observations and statistical output. WRDS has a Unix system. The only bad thing about Unix (in my opinion) is that the Unix text editors for writing and editing your programs are terrible. Therefore, I'm going to show you my own shortcuts that allow you to completely avoid the Unix text editors!!

    C. Avoiding Unix Text Editors

    My solution to Unix text editors (which are a pain in the neck as far as I'm concerned) is to avoid using them altogether by editing my programs on my own PC using the basic text editors that are already part of the Windows operating system (WordPad or Notepad). Your SAS program can have any name (8 characters or less) and should end with .sas (progname.sas). After I finish editing my program in WordPad or Notepad, I simply save it to my hard drive and upload it to the WRDS Unix machine using an FTP program. I then run my one and only Unix command on the Unix machine (via a Telnet program). In order to execute a SAS program on a Unix machine, you simply type "sas" followed by the program name and hit Enter on your keyboard. I then use my FTP program to look at my log and list files when SAS has finished executing (by double clicking on them). If I find an error in my program, I make the changes in Notepad, save the program again, upload it using the FTP program, and re-run it. This may sound difficult when I try to put it down on paper, but it is really easy (and I believe faster than the Unix text editors). I do this by keeping three windows open at all times on my computer: (1) my Telnet program, which allows me to execute SAS programs on the WRDS Unix machine, (2) my FTP program for transferring files back and forth and opening up output files after SAS executes (see footnote 1), and (3) my SAS program in my Windows text editor or PC SAS.

    Note that WRDS now requires secure FTP and Telnet programs. They recommend the SSH program (which includes both FTP and Telnet programs). I have a separate document that describes how to download and set up the SSH program. Here's a screen capture from my computer showing how my screen appears while I'm using SAS on WRDS. Note that the window in the upper left corner is the Telnet program where I execute SAS programs. (Hint: once you've run a program once, you can repeat the same command by simply hitting the up arrow on your keyboard.) The window in the upper right corner of the screen is my FTP program for transferring files between my hard drive and the WRDS Unix machine. Finally, the bottom window is my SAS program. In this particular example, I'm using Notepad to edit my SAS program. If you have access to PC SAS, it provides additional powerful tools as explained below. (BYU PhD-prep students can get a copy of PC SAS free of charge from Bob Kellett in the MSM computer support group.)

    1 Note that WRDS now requires secure programs for transferring files and interfacing with their Unix machines. As a result, the SSH program (described in a separate file) now provides secure Telnet and FTP applications.


    D. Using PC SAS

    PC SAS is a great alternative. The Marriott School at BYU has a site license for PC SAS. As long as you are not using Windows Vista, I recommend that you use PC SAS. It has a lot of great features that make using SAS easier. For example, it automatically color codes your SAS code so that you can easily see the different steps in your program. If you use PC SAS, you can replace the bottom Notepad window in the screen shot above with PC SAS. Alternatively, you can submit programs directly to WRDS from PC SAS without using the SSH software. In order to do this, you use the following three lines of code at the beginning of your program:

    %let wrds = wrds.wharton.upenn.edu 4016;
    options comamid=TCP remote=WRDS;
    signon username=_prompt_;

    These three lines allow you to log on to the WRDS mainframe. You only have to run this code once and you're connected to WRDS as long as SAS is open. HOWEVER, to execute SAS code on the mainframe, you have to use the "rsubmit" and "endrsubmit" commands listed below:


    rsubmit;

    This command tells PC SAS to connect to the WRDS mainframe and execute all subsequent commands (until you reach the "endrsubmit" command) on the WRDS mainframe. Again, EVERYTHING done between the "rsubmit" command and the "endrsubmit" command is done on the WRDS mainframe. So if you want to view a dataset, you have to download it (see below). You can do all the heavy computing with big datasets on the WRDS server and then download the final dataset to your local computer for simpler analyses.

    proc download data=compdata; run;

    This command downloads data from the mainframe so that you can use it on your local computer.

    endrsubmit;

    This command tells PC SAS to stop sending commands to WRDS (the connection itself stays open) and to execute all subsequent commands locally in PC SAS on your computer. In summary, all commands executed after the "rsubmit" command but before the "endrsubmit" command are executed on the WRDS mainframe. All other commands after the "endrsubmit" will run on your computer.
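    Putting the pieces of this section together, a complete remote session might look like the sketch below. The three connection lines are copied from above (the host and port may have changed since this handout was written). The small compdata dataset built inside the rsubmit block is purely hypothetical; it stands in for whatever large dataset you would really build on the server.

```sas
/* Log on to WRDS (connection details as given in this handout). */
%let wrds = wrds.wharton.upenn.edu 4016;
options comamid=TCP remote=WRDS;
signon username=_prompt_;

rsubmit;
  /* Everything in this block runs on the WRDS server. */
  /* Hypothetical stand-in for a large dataset built on the server. */
  data compdata;
    do firm = 1 to 5;
      value = firm * 10;
      output;
    end;
  run;

  /* Copy the finished dataset from the server to the local computer. */
  proc download data=compdata out=compdata;
  run;
endrsubmit;

/* Back on the local PC: look at the downloaded data. */
proc print data=compdata;
run;
```

    The idea is to keep the heavy lifting between rsubmit and endrsubmit and to download only the final, smaller dataset.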

    E. WRDS Directory Structure

    When you set up your account on the WRDS web page (as described in the WRDS handouts), you automatically get storage space on the WRDS server. Your WRDS root directory will be as follows:

    /home/byu/username/

    You are free to set up as many sub-directories (or folders) as you want under this main directory. You can store data in sub-directories, but the actual SAS programs you are executing need to be in the root directory when you tell SAS to run them. Note that at some point you may run out of space in your regular WRDS account (/home/byu/username). If it fills up, you have almost unlimited space in the temporary workspace found in several directories labeled /sastemp0 through /sastemp14. You'll need to set up a temporary directory within one of the sastemp directories. You can run programs and save datasets for up to 48 hours in this space.


    III. Basic SAS Statements

    A. Data Manipulation

    When I say "data manipulation," I don't mean altering or falsifying the data. I'm talking about getting your data into the format you need to run your analysis. By now, you've already learned how to pull basic financial and stock market data using the WRDS web interface. The next task is to merge it together so that you can do something meaningful with the data. The following SAS statements are some of the most commonly used statements in getting your data ready for analysis. Note that the examples of SAS code are all listed in lower case letters. IBM TSO machines automatically convert all letters to capitals. However, Unix machines are case sensitive. I generally do my programming (on Unix machines) using lower-case letters. Also, note that all SAS statements must end with a semicolon (;). [The most common error new SAS users make is to forget a semicolon at the end of a SAS statement. SAS will keep on reading until it hits a semicolon and it will count everything in between as one statement.]

    1. The Data Statement

    The data statement is the MOST important and commonly used statement in SAS. It always begins with the word "data" and then the name of the new dataset you're creating. It is generally followed by the "set" command, which assigns a previously used dataset to the newly created one. It is within the data statement that you define new variables. The following example shows several ways new variables can be created within a data statement.

    data newname;
    set oldname;
    newvar1 = oldvar1 + oldvar2;
    newvar2 = log (oldvar3);
    newvar3 = sqrt (oldvar4);
    newvar4 = oldvar3*oldvar5;
    newvar5 = oldvar4/oldvar2;

    In this example, the name of the dataset is "newname". This new dataset is being created from a previously referenced dataset called "oldname". This data statement creates five new variables from previously defined variables. Newvar1 is defined as the sum of oldvar1 and oldvar2. Newvar2 is the natural log of oldvar3. Newvar3 is the square root of oldvar4. Newvar4 is defined as oldvar3 multiplied by oldvar5. Finally, newvar5 is defined as oldvar4 divided by oldvar2. Obviously, you'll want to use variable names that make sense in the context of your own research.
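    To try the data statement without any files on disk, the sketch below first builds a tiny "oldname" dataset from values typed directly into the program (the datalines block and its numbers are made up for illustration) and then runs the same five assignments.

```sas
/* Hypothetical input data: two observations, five variables. */
data oldname;
  input oldvar1 oldvar2 oldvar3 oldvar4 oldvar5;
  datalines;
1 2 10 16 3
4 5 100 25 2
;
run;

/* Create the five new variables described above. */
data newname;
  set oldname;
  newvar1 = oldvar1 + oldvar2;   /* sum */
  newvar2 = log(oldvar3);        /* natural log */
  newvar3 = sqrt(oldvar4);       /* square root */
  newvar4 = oldvar3 * oldvar5;   /* product */
  newvar5 = oldvar4 / oldvar2;   /* quotient */
run;

proc print data=newname;
run;
```

    For the first observation, newvar1 is 3, newvar3 is 4, newvar4 is 30, and newvar5 is 8.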


    2. Reading Data into SAS from Text or Excel .prn Files

    You can read data into SAS using the "infile" statement within a data statement. The infile statement needs to describe exactly where the file is located in your WRDS directory (enclosed by single quotes). The "missover" option (shown below in both examples) deals with missing values: if a line of the file runs out of data before every variable on the input statement has been read, SAS sets the remaining variables to missing instead of moving on to the next line to find them. (I've always used this in my infile statements.) Then, the "input" statement lists all of the variables you'll be reading in. SAS will assume that variables are numerical by default. If you are reading in a character variable, you need to place a dollar sign ($) after the variable name so that SAS will know it is a text variable as opposed to a number. Also, you don't need to specify which column number your variable appears in as long as each variable is separated by a space or tab. However, be careful with text variables made up of more than one word. For example, company names like "General Electric" will be read as two separate variables because the words "General" and "Electric" are separated by a space. You'll note in the examples below that I specified the exact column numbers (character numbers) that contain the company name so that there won't be any confusion.

    data compstat;
    infile '/home/byu/tc259/Dispam/Comp6331.prn' missover;
    input cusip $ year q ticker $ coname $ 26-60 cic;
    if (q eq 1);
    drop q;

    This example highlights that you can screen which observations are read into the dataset with "if" statements. In this case, I only wanted to read in first quarter observations. I also told it to drop the quarter variable (q) when I was done with this step.

    data newdata;
    infile '/homes/txc7/dispam/newdata.prn' missover;
    input
    permno 3-7
    q 9-10
    pmktval 15-26
    mktval 31-42 .2
    betacar 46-54 .5;
    keep permno mktval betacar;

    This example illustrates how to specify how many decimal places to hold for each variable. For example, mktval will have two decimal places and betacar will have five decimal places. (If a value in the file already contains a decimal point, SAS uses that decimal point instead.) Finally, you can tell SAS which variables to keep and it will drop all others, as the keep statement above shows.
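    The effect of missover is easiest to see on a tiny example. The sketch below reads made-up values typed directly into the program (via datalines) rather than a file on WRDS; the dataset name "demo" and the data values are hypothetical. The second line is deliberately short.

```sas
data demo;
  infile datalines missover;      /* same option as in the examples above */
  input cusip $ year q ticker $;
  datalines;
123456 1998 1 ABC
654321 1999
789012 2000 1 XYZ
;
run;

proc print data=demo;
run;
```

    With missover, demo should contain three observations, and q and ticker are missing on the second one. Without missover, SAS would go looking for the second observation's q and ticker on the third line, so that line would be swallowed and the values would end up in the wrong variables.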

    3. Reading in Data from SAS Transport Files

    When working with large datasets, the best way to transfer data from one machine to another is by using a SAS transport file. For example, the WRDS web interface allows you to save data in a SAS transport format as opposed to a text or other format. This makes it much easier to import the data directly into SAS than was described above for text files. Here's an actual example of some Compustat data that I pulled from the WRDS database and saved as a SAS transport file. The WRDS web interface named the transport file "261612610.trp". The two required steps listed below show that you first have to tell SAS where the file is located. (By the way, after I ran my Compustat query on the WRDS web page, I first saved it on my hard drive after clicking on the hotlink on the WRDS web page to retrieve the output. Then, I FTPed the file to my WRDS account.) Then, you have to use the "cimport" procedure to read the SAS transport file into a new dataset, which I named "ccbcomp".

    filename ccbcomp '/home/byu/tc259/261612610.trp';
    proc cimport data = ccbcomp infile=ccbcomp;

    The next SAS procedure (proc contents) reports exactly what variables are in the dataset you have just imported.

    proc contents data = ccbcomp;

    Next, I renamed the Compustat data items with names that make more sense to me for programming purposes.

    data compdat;
    set ccbcomp;
    earlyprc=data12;
    price=data14;
    shrsout=data18;
    eps=data19;
    sic=dnum;
    eadat=rdqe;
    exch=zlist;
    ticker=smbl;
    drop data12 data14 data18 data19 data59 rdqe zlist smbl;

    Finally, I thought the next set of code from this program would be of use to you at some point. I wanted to convert the earnings announcement dates from the Compustat format to a yymmdd format. Here's the code that uses SAS date functions to retrieve the day, month, and year from the Compustat date and then convert it to the correct format.

    data eadates;
    set compdat;
    sasdate=eadat;
    calday=day(sasdate);
    calmon=month(sasdate);
    calyear=year(sasdate);
    eadate=((calyear-1900)*10000)+(calmon*100)+calday;
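    To see what the date arithmetic above produces, here is a small self-contained sketch. It uses the mdy function to build one made-up SAS date (April 15, 1998) instead of reading a Compustat file; the dataset name "eademo" is hypothetical.

```sas
data eademo;
  sasdate = mdy(4, 15, 1998);   /* hypothetical announcement date */
  calday  = day(sasdate);       /* 15 */
  calmon  = month(sasdate);     /* 4 */
  calyear = year(sasdate);      /* 1998 */
  /* Same formula as above: (98 * 10000) + (4 * 100) + 15 = 980415 */
  eadate  = ((calyear - 1900) * 10000) + (calmon * 100) + calday;
  format sasdate date9.;        /* show the SAS date in readable form */
run;

proc print data=eademo;
run;
```

    One thing to keep in mind: for dates in 2000 or later, calyear-1900 is 100 or more, so this formula yields a seven-digit number (for example, 1080307 for March 7, 2008) rather than a six-digit yymmdd value.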


    4. Merging Datasets Using the Set Statement

    The simplest way to merge two datasets is to combine two smaller datasets into a larger dataset using the data statement described above. This would be appropriate when you have the same variables in both datasets and you simply want to make one large dataset. For example, assume you have the following datasets:

    freshmen

    name          age   major
    John White    19    Accounting
    Jamie Blue    18    Finance
    Mark Brown    20    Marketing

    sophomores

    name          age   major
    Joe Little    22    Economics
    Mary Short    20    Accounting
    Nicki Smith   21    Marketing

    You could merge these two datasets into one large dataset by simply "setting" them into one new dataset as follows:

    data byu;
    set freshmen sophomores;
    if major = 'Accounting' then majcode = 1;
    else if major = 'Finance' then majcode = 2;
    else if major = 'Marketing' then majcode = 3;
    else if major = 'Economics' then majcode = 4;

    Here's what you would end up with in the newly created dataset called "byu":

    byu

    name          age   major        majcode
    John White    19    Accounting   1
    Jamie Blue    18    Finance      2
    Mark Brown    20    Marketing    3
    Joe Little    22    Economics    4
    Mary Short    20    Accounting   1
    Nicki Smith   21    Marketing    3

    Note that I created a new variable in the process of creating the new dataset. The advantage of this new variable is that it is numerical and easier to use later in sorting or dividing up the data.
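    The freshmen and sophomores example can be run end to end with the sketch below. The way the two small datasets are typed in (comma-separated datalines and a length statement) is my own choice for illustration. The length statement is there because SAS otherwise gives character variables read this way a default length of 8, which would cut "Accounting" short and make the comparisons fail.

```sas
data freshmen;
  length name $ 12 major $ 10;    /* long enough for the full values */
  infile datalines dlm=',';       /* commas separate values, so names may contain spaces */
  input name $ age major $;
  datalines;
John White,19,Accounting
Jamie Blue,18,Finance
Mark Brown,20,Marketing
;
run;

data sophomores;
  length name $ 12 major $ 10;
  infile datalines dlm=',';
  input name $ age major $;
  datalines;
Joe Little,22,Economics
Mary Short,20,Accounting
Nicki Smith,21,Marketing
;
run;

/* Stack the two datasets and add a numeric code for each major. */
data byu;
  set freshmen sophomores;
  if major = 'Accounting' then majcode = 1;
  else if major = 'Finance' then majcode = 2;
  else if major = 'Marketing' then majcode = 3;
  else if major = 'Economics' then majcode = 4;
run;

proc print data=byu;
run;
```

    Note that character values must be quoted in the comparisons; without the quotes, SAS would treat Accounting as the name of a variable.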


    5. Accessing Existing SAS Datasets & Saving New Permanent SAS Datasets

    What if you want to start a new program by accessing data that you created and stored previously in a SAS dataset? This is a very useful tool in SAS. If you've already saved data in SAS format, you can call that dataset up at any time with a simple "set" command within a data statement. For example, assume I have a permanent SAS dataset stored in the "proforma" directory that includes abnormal stock returns on earnings announcement dates for my sample firms and that the permanent SAS dataset name is "eareturn". First I have to identify a location and assign a name. Assume I decide to name the following folder (or directory) "crspdata". I do this using a "libname" statement (usually at the beginning of the program):

    libname crspdata '/home/byu/tc259/proforma/';

    In this example, this line of code defines a file location called "crspdata" as the "proforma" folder in my WRDS permanent storage directory.

    Once I define the location called "crspdata", I can save or retrieve any file to or from this location by simply referencing the "crspdata" library location. For example, I can now access a permanent SAS dataset that I previously saved in this directory in a data statement as follows:

    data crsp;
    set crspdata.eareturn;

    I now have a temporary dataset called "crsp" that contains everything in the permanent "eareturn" dataset. Thus, the permanent SAS dataset "eareturn" (which I saved sometime in the past in a different program) is now ready to be used in this program. It is now the "crsp" temporary dataset in this program.

    Similarly, at some point in the program, I may want to save a SAS dataset I've been working with so that I can access it in the future without going back through all of the merges, etc. I've completed to this point. For example, assume I've been working with a dataset that I've named "allvars". It contains all of the variables I've merged together, and I want to use it for running my analysis in the future. I can save the "allvars" SAS dataset in the "proforma" folder I mentioned above by simply using the following data statement. (This assumes I've already defined "crspdata" with the libname statement above.)

    data crspdata.allvars;
    set allvars;


    6. Merging Datasets Based on a Common Identifier

    More commonly when you merge datasets together, you want to match a particular observation in one dataset with a particular observation in another dataset. For example, you want to match stock price in one dataset to earnings per share in another dataset by company so that you can compute each firm's P-E (price-earnings) ratio.

    price

    permno   price
    25389    15.00
    36894    18.50
    10658    7.75
    29487    22.25

    earnings

    permno   eps
    25389    2.35
    36894    1.47
    10658    5.02
    16427    0.28

    In order to merge two datasets, you first need to sort them by a common identifier such as ticker symbol, cusip, or permno.

    proc sort data=price; by permno;
    proc sort data=earnings; by permno;

    The sort procedure requires you to specify which dataset you want to sort and then specify which variable you want to sort it by. The default is to sort in ascending order. Once the datasets are sorted in the same order, you can merge them together using the "merge" command in a data statement as follows:

    data pricearn;
    merge price earnings;
    by permno;

    This is an example of a data statement that doesn't use a set statement. Instead of using an old dataset, I'm creating a new one by merging two existing datasets. This basic merge is okay, but it has problems. For one thing, if one dataset has multiple observations for the same permno (firm), each of them will be merged with the matching observation in the other dataset. Here is a better way to do the same task.

    proc sort data=price; by permno;
    proc sort data=earnings; by permno;
    data pricearn;
    merge price (in=a) earnings (in=b);
    by permno;
    if (a eq 1) and (b eq 1) then output pricearn;

    This sample merge statement assigns a variable "a" to note whether a particular permno (firm) exists in price and defines a to be "1" if the firm exists in the dataset and "0" if it doesn't exist in the dataset. The variable "b" does the same for earnings. The data statement will only include observations in the new dataset called "pricearn" if the permno exists in both datasets. The only problem with this one is that you could lose lots of data (i.e. observations with no match in one dataset or the other) and not know what is causing the loss of data.

    The following example performs what I feel is the safest merge you can do. In addition to creating the new merged dataset, it creates two additional datasets for misses (i.e. firms that didn't match up).

    proc sort data=price; by permno;
    proc sort data=earnings; by permno;
    data pricearn miss1 miss2;
    merge price (in=a) earnings (in=b);
    by permno;
    if (a eq 1) and (b eq 1) then output pricearn;
    if (a eq 0) and (b eq 1) then output miss1;
    if (a eq 1) and (b eq 0) then output miss2;

    This time I'm creating three new datasets so that in addition to finding my matches, I'll know exactly which firms were missing from each of my old datasets. Pricearn still contains all of the good matches. Miss1 tells me which firms are missing from the price dataset and miss2 contains a list of which firms are missing from the earnings dataset. Finally, what if I know I have 3 years of data for each firm?

    price

    permno   price   year
    25389    15.00   1998
    36894    18.50   1998
    10658    7.75    1998
    29487    22.25   1998
    25389    15.20   1999
    36894    18.70   1999
    10658    8.75    1999
    29487    25.25   1999
    25389    15.25   2000
    36894    14.50   2000
    10658    3.75    2000
    29487    29.25   2000

    earnings

    permno   eps    year
    25389    2.35   1998
    36894    1.47   1998
    10658    5.02   1998
    16427    0.28   1998
    25389    2.00   1999
    36894    1.52   1999
    10658    5.90   1999
    16427    3.28   1999
    25389    2.50   2000
    36894    3.20   2000
    10658    4.72   2000
    16427    1.48   2000

    I can merge based on permno and year as follows:

    proc sort data=price; by permno year;
    proc sort data=earnings; by permno year;
    data pricearn miss1 miss2;
    merge price (in=a) earnings (in=b);
    by permno year;
    if (a eq 1) and (b eq 1) then output pricearn;
    if (a eq 0) and (b eq 1) then output miss1;
    if (a eq 1) and (b eq 0) then output miss2;

    This time SAS will only merge observations if both the firm number (permno) AND year match up!
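    The "safest merge" pattern from this section can be tried on the small single-year price and earnings tables shown earlier. The sketch below types those values in with datalines (my own shortcut for illustration) and then adds one extra step, with a hypothetical variable name "pe", to compute the P-E ratio that motivated the merge.

```sas
data price;
  input permno price;
  datalines;
25389 15.00
36894 18.50
10658 7.75
29487 22.25
;
run;

data earnings;
  input permno eps;
  datalines;
25389 2.35
36894 1.47
10658 5.02
16427 0.28
;
run;

proc sort data=price; by permno; run;
proc sort data=earnings; by permno; run;

/* Matches go to pricearn; non-matches go to miss1 or miss2. */
data pricearn miss1 miss2;
  merge price (in=a) earnings (in=b);
  by permno;
  if (a eq 1) and (b eq 1) then output pricearn;
  if (a eq 0) and (b eq 1) then output miss1;   /* in earnings only */
  if (a eq 1) and (b eq 0) then output miss2;   /* in price only */
run;

/* Compute the P-E ratio for the matched firms. */
data pricearn;
  set pricearn;
  pe = price / eps;
run;

proc print data=pricearn; title 'Matched firms'; run;
proc print data=miss1;    title 'Missing from price'; run;
proc print data=miss2;    title 'Missing from earnings'; run;
```

    With this data, pricearn should hold the three firms found in both tables (10658, 25389, and 36894), miss1 should hold 16427, and miss2 should hold 29487.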

    B. Helpful Tools

    1. Testing for the Number of Unique Firms in a Dataset

    What if you want to know how many unique firms exist in the price dataset from the last example? Often, your datasets will contain multiple observations for each firm. If you want to get a count of how many unique firms exist in your dataset, you can use the data statement to tell you this.

    proc sort data=price; by permno;
    data numfirms;
    set price;
    by permno;
    if first.permno;


    This is one of the handiest things I've learned! First, I sort the price dataset by permno. Then, I tell SAS to create a new dataset called "numfirms" from price. However, I tell it to only place an observation in numfirms if it is the first occurrence of a particular permno. Any duplications of the same permno will be ignored. You can then check the number of firms (the number of observations in numfirms) in the SAS log file.
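    Here is the unique-firm count run on the three-year price table from the previous section, typed in with datalines for illustration. The last step is an alternative I'm adding, not part of the approach described above: proc sort's nodupkey option also keeps only one observation per permno. The output name "numfirms2" is hypothetical.

```sas
data price;
  input permno price year;
  datalines;
25389 15.00 1998
36894 18.50 1998
10658 7.75 1998
29487 22.25 1998
25389 15.20 1999
36894 18.70 1999
10658 8.75 1999
29487 25.25 1999
25389 15.25 2000
36894 14.50 2000
10658 3.75 2000
29487 29.25 2000
;
run;

proc sort data=price; by permno; run;

/* Keep only the first observation for each permno. */
data numfirms;
  set price;
  by permno;
  if first.permno;
run;

/* Alternative sketch: let proc sort drop the duplicate permnos. */
proc sort data=price out=numfirms2 nodupkey;
  by permno;
run;
```

    With four distinct permnos in the data, the log should report that numfirms (and numfirms2) has 4 observations.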

    2. Creating Lagged Variables

    Often you'll have to compare a variable from this year to the same variable last year. For example, you may want to compare this year's sales to last year's sales. This is hard to do if your dataset is organized as follows:

    sales

    obs   cusip    year   sales
    1     100352   1998   1000
    2     100352   1999   1095
    3     100352   2000   2135
    4     299571   1998   7339
    5     299571   1999   8241
    6     299571   2000   2550

    The lag function returns the value for the previous observation. You can also use "lag2" or "lag3" to go back several observations. There is, however, one problem with this. The first observation for a given cusip is listed after the final observation for the previous firm. Therefore, the lag function returns the value for another firm. The following code and output illustrate this problem.

    proc sort data=sales; by cusip year;
    data allsales;
    set sales;
    sales1 = sales;
    sales2 = lag (sales);
    saleschg = sales1 - sales2;

    obs   cusip    year   sales   sales1   sales2   saleschg
    1     100352   1998   1000    1000     .        .
    2     100352   1999   1095    1095     1000     95
    3     100352   2000   2135    2135     1095     1040
    4     299571   1998   7339    7339     2135     5204
    5     299571   1999   8241    8241     7339     902
    6     299571   2000   2550    2550     8241     -5691


    The numbers for observation 4 in the output above (sales2 = 2135 and saleschg = 5204) are bogus numbers because we are mixing data from two different firms. In order to solve the problem we need to perform a check for bogus lags and then correct them by assigning a missing value (always denoted by a period) to the first observation for a given firm. (Also note that the very first observation in the dataset doesn't have a prior observation. Therefore, SAS will automatically return a missing value for sales2.) To perform the check we first need to create a few additional lag variables. First, we create lagged cusip and year variables to be sure that the values used come from the same firm. Also as an additional check (you can never be too careful when dealing with lags), we create a variable called yearless1, which is equal to (year - 1). Then we can use an "if-then-else" statement to be sure that only the correct lags are used. The following code shows how to solve this problem for the dataset given above.

    proc sort data=sales; by cusip year;
    data allsales;
    set sales;
    sales1 = sales;
    sales2 = lag (sales);
    saleschg = sales1 - sales2;
    lagcusip = lag(cusip);
    lagyear = lag(year);
    yearless1 = year-1;
    if ((cusip eq lagcusip) and (lagyear eq yearless1))
    then sales2=sales2;
    else sales2=.;
    if ((cusip eq lagcusip) and (lagyear eq yearless1))
    then saleschg=saleschg;
    else saleschg=.;

    This would return the following dataset:

    obs   cusip    year   sales   sales1   sales2   saleschg
    1     100352   1998   1000    1000     .        .
    2     100352   1999   1095    1095     1000     95
    3     100352   2000   2135    2135     1095     1040
    4     299571   1998   7339    7339     .        .
    5     299571   1999   8241    8241     7339     902
    6     299571   2000   2550    2550     8241     -5691

    Therefore, this statement assigns a missing value (always denoted by a period) for the first observation for a given firm.


    The code listed above works fine. However, a former student recently sent me an example of a more efficient way to do this:

    data alldata;
    set alldata;
    if lag1(year)=(year-1) and lag1(gvkey) = gvkey and lag1(fyr)=fyr
    then m1=1; else m1 = .;
    rdm1 = lag1(rd )*m1;
    rdm2 = lag1(rdm1)*m1;
    rdm3 = lag1(rdm2)*m1;
    rdm4 = lag1(rdm3)*m1;
    rdm5 = lag1(rdm4)*m1;
    adm1 = lag1(ad )*m1;
    adm2 = lag1(adm1)*m1;
    drop m1;

    His "if" statement simply creates a variable called m1. This variable has a value of 1 if the observation above is for the same company from the previous year. Otherwise, m1 is set to a missing value. Then, he creates lags of all variables and simply multiplies them by m1. If the lag is a real number, then it is equal to that number. If it is a bogus number, it is set equal to missing. This is a great idea!
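    The lag check described in this section can be tried on the six-observation sales table shown earlier. The sketch below types that table in with datalines and uses a slightly condensed version of the first approach; the variable name "lagsales" and the do-group are my own shorthand. All of the lag calls are made on every observation, outside of any if statement, just as in both versions above.

```sas
data sales;
  input cusip $ year sales;
  datalines;
100352 1998 1000
100352 1999 1095
100352 2000 2135
299571 1998 7339
299571 1999 8241
299571 2000 2550
;
run;

proc sort data=sales; by cusip year; run;

data allsales;
  set sales;
  /* Take the lags for every observation first. */
  lagsales = lag(sales);
  lagcusip = lag(cusip);
  lagyear  = lag(year);
  /* Keep the lag only if it comes from the same firm, one year earlier. */
  if (cusip eq lagcusip) and (lagyear eq year - 1) then
    saleschg = sales - lagsales;
  else do;
    lagsales = .;
    saleschg = .;
  end;
  drop lagcusip lagyear;
run;

proc print data=allsales;
run;
```

    The first observation for each cusip (1998 in both cases) should show missing values for lagsales and saleschg, matching the corrected output above.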

    3. Printing Datasets (Or Parts of Datasets)

    I use print statements frequently, especially after I've merged datasets, to make sure everything looks right. For example, I would always print the "allsales" dataset in the previous example so that I can manually check a few companies to make sure the lag function is working the way I think it is and that I've properly inserted the missing value markers to delete bogus sales comparisons. If I'm using huge datasets, I don't need to look at every observation. (Imagine how long it would take SAS to print out 20,000 observations!) I'll often tell it to print the first 50, 100, or 200 observations. The print procedure always begins with "proc print" to tell SAS which procedure you're using. On the same line, you need to tell it which dataset you're using and if necessary how many observations to print. The "var" statement tells it which variables to print. (The default is to print all of your variables.) A title statement is good, especially if you're printing multiple datasets, so you know what you're looking at when you examine the output.

    proc print data=nasdaq (obs=100);
    var permno cusip ticker sales cogs;
    title 'This is Data for the Nasdaq Sample';

    4. Using Comments

    There are two common uses for comments. First, it is helpful to document what you're doing so that you (or someone else) will be able to go through your program and quickly understand what each step is doing. Comment markers can also be used to cause SAS to skip sections of code that you don't need at the moment, but that you think you might need later. There are two ways to "comment out" words or commands: (1) begin the statement with a star (*) and end with a semicolon or (2) begin with a slash star (/*) and end with star slash (*/). For example, I'll often include headers to explain what the next block of code does:

    ************* This section reads in new data ***************;

    The next set of comments causes SAS to completely skip these eight lines of code and proceed on with the next uncommented SAS command.

    /*
    proc sort data=price; by permno;
    proc sort data=earnings; by permno;
    data pricearn miss1 miss2;
    merge price (in=a) earnings (in=b);
    by permno;
    if (a eq 1) and (b eq 1) then output pricearn;
    if (a eq 0) and (b eq 1) then output miss1;
    if (a eq 1) and (b eq 0) then output miss2;
    */

    Note that the /* ... */ comments don't require a semicolon, while the regular "star" comments require you to place a semicolon at the end of the comment.

    5. Converting Compustat 6-digit Cusips to/from CRSP 8-digit Cusips

    If you have the option, permno is always the way to go with CRSP because it is a permanent identifier that never changes as long as a company stays in business. Cusip is the best with Compustat. Then, the trick is to match everything up. I learned something new about this recently. One problem in matching them up is that Compustat uses only the first 6 digits of the cusip while CRSP uses all 8. (The last two digits tell which stock issuance the company is on. So, the cusip for a company could change over time. For example, it may be 87594G10 and then change to 87594G20.) In the past, I used to simply cut off the last two digits of the CRSP cusip so that I could merge data based on the 6-digit cusip. However, recently I needed to take Compustat firms and go back to CRSP to pull new data (which required the 8-digit cusip). I think you can get away with the 6-digit cusip with CRSP on WRDS, but when you go directly to the CRSP database, you have to use the 8-digit cusip identifier. First, let me explain how you could go from the 8-digit cusip identifier to the 6-digit identifier. Then, I'll describe how you go the other way.


    TRUNCATING DIGITS FROM THE 8-DIGIT CUSIP:

    data compdat;
        set compdat;
        cusip6=substr(cusip8,1,6);

    Cusip6 is the six-digit cusip and cusip8 is the eight-digit cusip. The "substr" command is short for "sub string". After listing the variable name "cusip8", the first number tells it what character to start with and the second number tells it how many characters to keep. (Starting at character 1, a length of 6 also happens to end at character 6.) This command would take the 8-digit cusip 87594G10 and assign the new cusip6 variable the following value: 87594G.

    ADDING THE ADDITIONAL DIGITS TO THE 6-DIGIT CUSIP:

    Compustat provides the information for the Cusip identifier in two separate variables. CNUM is the 6-digit cusip. The CIC variable contains three digits. The first two digits are the ending to the 8-digit cusip. You'll note in the following example that I first have to truncate the last digit from the CIC variable and then concatenate (merge together) this variable with the 6-digit CNUM to create the real 8-digit cusip used in CRSP.

    data compiden;
        set compiden;
        tag=substr(cic,1,2);
        cusip=cnum || tag;
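    If you want to convince yourself that both conversions behave as described, a small test like the following sketch can help. The dataset names (crspids, compids) and the data values are made up for illustration, including the third digit of cic. Only the variable names cusip8, cusip6, cnum, cic, tag, and cusip come from the examples above.

```sas
/* Made-up one-row CRSP-style dataset with an 8-digit cusip. */
data crspids;
    length cusip8 $ 8;
    input cusip8 $;
    datalines;
87594G10
;
run;

data crspids;
    set crspids;
    length cusip6 $ 6;
    cusip6 = substr(cusip8, 1, 6);   /* start at character 1, keep 6 characters */
run;

/* Made-up one-row Compustat-style dataset with cnum and a 3-digit cic. */
data compids;
    length cnum $ 6 cic $ 3;
    input cnum $ cic $;
    datalines;
87594G 106
;
run;

data compids;
    set compids;
    length tag $ 2 cusip $ 8;
    tag = substr(cic, 1, 2);         /* keep the first two digits of cic */
    cusip = cnum || tag;             /* 6 characters plus 2 characters */
run;

proc print data=crspids; run;
proc print data=compids; run;
```

    Declaring the lengths of the new character variables up front is a design choice: it keeps cusip6 and cusip from inheriting a longer length than you intend, which matters when you later merge on them.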

    6. Outputting Data to a Text File

    Sometimes when you complete your program, you want to save your final dataset as a text file so that you can upload it into Excel or some other program. Writing data out to a text file is very similar to reading data in from a text file. You simply substitute the "file" command for "infile" and the "put" command for "input" as illustrated by this example:

    data allvars;
        set allvars;
        file '/home/byu/tc259/Proforma/allvars.txt';
        put cusip $ 2-7
            returns 9-14 .5
            totasset 16-24 .2
            mnforcst 26-30;
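    If the file is headed for Excel, a comma-delimited file is often easier to open than a fixed-column one. The following is only a sketch under a few assumptions: the tiny allvars dataset and its values are made up, and the output file name 'allvars.csv' is a placeholder that will be written to whatever directory SAS is currently running in. It also uses "data _null_", which writes the file without creating a new copy of the dataset.

```sas
/* Made-up two-row dataset so the example runs on its own. */
data allvars;
    length cusip $ 6;
    input cusip $ returns totasset mnforcst;
    datalines;
87594G 0.01234 1520.75 1.25
123456 -0.00410 88.10 0.40
;
run;

/* Write one comma-separated line per observation.
   The file name is a placeholder; substitute your own path. */
data _null_;
    set allvars;
    file 'allvars.csv' dlm=',';
    put cusip returns totasset mnforcst;
run;
```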

    7. Other Helpful Hints

    You can use "if" statements to screen out observations from a dataset. For example:


    data halfyear;
        set prices;
        if (eadate le 990630);

    In this example, "le" means "less than or equal to". The new dataset "halfyear" will be created from the existing dataset "prices". It will include all observations for which the earnings announcement date took place on or before June 30, 1999. You can also use the following operators:

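    SAS Code    Definition
    lt          less than (<)
    le          less than or equal to (<=)
    gt          greater than (>)
    ge          greater than or equal to (>=)
    eq          equal to (=)
    ne          not equal to (^=)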
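    Comparisons can be combined with "and" and "or" in the same "if" statement. The sketch below uses a made-up three-row version of the "prices" dataset (the permno values, dates, and the price variable are all invented for illustration) and keeps only observations announced on or before June 30, 1999 with a price above 10.

```sas
/* Made-up data: eadate is stored as a yymmdd number, as in the example above. */
data prices;
    input permno eadate price;
    datalines;
10001 990315 12.50
10002 990630 8.25
10003 990815 30.00
;
run;

data halfyear;
    set prices;
    /* Both conditions must hold for an observation to be kept. */
    if (eadate le 990630) and (price gt 10);
run;

proc print data=halfyear; run;
```

    With this made-up data, only permno 10001 passes both screens: 10002 fails the price screen and 10003 fails the date screen.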

    You can debug portions of your program by commenting out parts that you don't need to run (using /* */ comments) and by using the "endsas" command to stop the program at any point. Note that you always end your SAS program with "endsas;".

    To ensure that your output pages are wide enough for you to see your output without wrapping long lines, place the following commands at the very beginning of your program:

    options ls=256 ps=10000 nocenter;

    It will ensure that the page will always be wide enough to fit any output without "wrapping" long lines down to the next line.

    C. Extracting Data Using Proc SQL

    This procedure provides a fast way to extract data from Compustat. If you want data from all firms on the entire database, this isn't the way to go. However, if you want to extract data for a particular set of firms and years, it is the most efficient way to extract data. You first need to read in a list of firms from an external file or SAS dataset. You'll need GVKEY, cnum, or ticker in your input file so that you can match with Compustat. You may also want to match with particular years and/or quarters. (Make sure you have defined your year variable in the same way Compustat does.)

    In the example below, Proc SQL will create a dataset called "compdata". It will include the CRSP permno from the input dataset (note that it comes from a dataset called "ourfirms"). It will also include several Compustat data items: GVKEY, cnum, yeara, data6, data12, and data18. All of these additional variables come from the Compustat annual file (compann). The "from" statement tells you which datasets are being compared and merged. The "where" statement specifies that Proc SQL should only keep observations from Compustat if the gvkey is the same as a gvkey identified in the input dataset (ourfirms). Note that this first example extracts data from the Compustat annual file.

    proc sql;
        create table compdata as
        select ourfirms.permno, compann.gvkey, compann.cnum, compann.yeara,
               compann.data6, compann.data12, compann.data18
        from ourfirms, comp.compann
        where ourfirms.gvkey = compann.gvkey;
    quit;

    The second example is similar except that it extracts data from the Compustat quarterly file. In this example, the new dataset being created is called "temp1". The only items being saved in this new dataset are cnum, year, qtr, and data177, all from the Compustat file (i.e., it isn't saving any variables from the input dataset called "allsamp"). Note that this example matches on cnum instead of gvkey.

    proc sql;
        create table temp1 as
        select compqtr.cnum, compqtr.year, compqtr.qtr, compqtr.data177
        from allsamp, comp.compqtr
        where allsamp.cnum = compqtr.cnum;
    quit;
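    As mentioned at the start of this section, you may want to match on particular years as well as on the firm identifier. The following sketch shows one way to do that by adding a second condition to the "where" statement. Everything in it is made up so that it runs on its own: the small "compann" dataset is only a stand-in for comp.compann, gvkey is numeric here purely for simplicity, and the "year" variable in "ourfirms" is assumed to be defined the same way as Compustat's yeara. The short names "a" and "b" are table aliases.

```sas
/* Made-up input list of firm-years. */
data ourfirms;
    input permno gvkey year;
    datalines;
10001 1001 1997
10002 1002 1998
;
run;

/* Made-up stand-in for the Compustat annual file. */
data compann;
    input gvkey yeara data6;
    datalines;
1001 1996 500
1001 1997 550
1002 1998 120
1003 1998 900
;
run;

proc sql;
    create table compdata as
    select a.permno, b.gvkey, b.yeara, b.data6
    from ourfirms as a, compann as b
    where a.gvkey = b.gvkey
      and a.year = b.yeara;      /* keep only the requested firm-years */
quit;

proc print data=compdata; run;
```

    With this made-up data, compdata ends up with two rows: gvkey 1001 for 1997 and gvkey 1002 for 1998.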

    D. Basic Statistical Analysis

    1. Descriptive Statistics (Proc Univariate, Proc Means, and Proc Freq)

    Proc Univariate is one of the most commonly used procedures. It returns information like mean, median, 25th percentile, 75th percentile, largest value, smallest value, sum of all values, along with a host of other statistical tests of the data. It's very useful. Note that you can add titles so that your output is more descriptive:

    proc univariate data=oldset;
        var sales cogs advexp;
        title 'Desc. Stats for Sales, COGS, and Advexp Variables';

    Proc Means is a simpler procedure that will only give you the basic statistics:

    proc means data=oldset median;
        var sales cogs advexp;
        title 'Basic stats for Sales, COGS, and Advexp Variables';

    I prefer Proc Means when I want a brief printout to give me a general idea about the distribution of a set of variables. Proc Univariate is better when you are interested in details. Note that I've included the "median" option. You can request other details that don't appear in the basic printout.
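    One thing to watch for: once you list any statistic on the "proc means" statement, SAS prints only the statistics you list, not the defaults plus your additions. If you want the usual n, mean, standard deviation, minimum, and maximum along with the median, list them all. The sketch below uses a made-up four-row version of "oldset" so that it runs on its own.

```sas
/* Made-up data; the period is a missing value for advexp. */
data oldset;
    input sales cogs advexp;
    datalines;
100 60 5
250 140 12
80 55 .
400 260 30
;
run;

proc means data=oldset n mean median std min max;
    var sales cogs advexp;
    title 'Basic stats plus median for Sales, COGS, and Advexp';
run;
```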

    Proc Freq (pronounced "freak" as in "frequency") is used to count the occurrences of particular values of a variable. It outputs a table listing the variable value, frequency, percent, cumulative frequency, and cumulative percent. The following example prepares tables for two different groups of firms. The observations are sorted based on an indicator variable coded 1 for sample firms and 0 for a group of control firms. Therefore, proc freq provides frequency information for both sets of firms.

    proc sort data=alldata;
        by sample;
    run;

    proc freq data=alldata;
        tables sic4;
        by sample;
        title 'Table 1, Panel A: Sample and Control firms by SIC code';
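    An alternative to sorting and using a "by" statement is to ask Proc Freq for a cross-tabulation, which puts both groups of firms in a single table. This is only a sketch with a made-up six-row version of "alldata"; the permno and SIC values are invented. The "nocol" and "nopercent" options simply trim the column percentages and overall percentages from each cell to keep the table readable.

```sas
/* Made-up data: sample is 1 for sample firms and 0 for control firms. */
data alldata;
    input permno sample sic4;
    datalines;
10001 1 2834
10002 1 2834
10003 1 3674
10004 0 2834
10005 0 3674
10006 0 3674
;
run;

proc freq data=alldata;
    tables sample*sic4 / nocol nopercent;
    title 'Sample and control firms by SIC code (cross-tabulation)';
run;
```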

    2. Basic Correlations (Proc Corr)

    Proc Corr will return basic correlations. You can specify both Pearson (parametric) and Spearman (non-parametric) correlations as illustrated in the following example:

    proc corr data=fulldata pearson spearman;
        var sales totasset returns;
        title 'These correlations are for the full dataset';

    The "var" statement specifies which variables to include in the correlation tables. The default is to include all numeric variables in the dataset. Titles can be helpful in ensuring that you remember what is contained in your outputs.

    3. Basic Regressions (Proc Reg)

    Proc Reg is one of the most important SAS procedures. It allows you to run ordinary least squares regressions. Other types of regressions, like simultaneous equations models, are handled by related procedures (for example, Proc Syslin). Here is an example of a basic regression:

    proc reg data=oldset;
        model price = bookval income mktbook / acov spec vif collin;
        title 'This Regression Contains Data from the Large Firm Sample';

    You always begin by specifying what dataset is to be used in performing the regression. The "model" statement defines the dependent and independent variables to be used in the regression. The options after the "/" tell SAS if you want to print out any special tests along with the regular regression results. This example includes four of the standard tests that I usually include. The "acov" option prints out White's heteroskedasticity consistent covariance matrix, "spec" prints out a model specification test, and both "vif" and "collin" print out different tests for multicollinearity. Again, it helps when you read your outputs if you know exactly which data you are looking at. Therefore, the "title" statement helps keep things straight.

    4. Analysis of Variance (Proc Anova)

    Proc Anova runs the basic analysis of variance test to compare means of different samples. It is very similar to Proc Reg.

    proc anova data=allyears;
        class year;
        model mda plr = year;
        title 'Analysis of Variance';

    In this example, the "mda" and "plr" variables are compared to see if there is a difference based on the classification variable "year". The class variable "year" has two values. The ANOVA procedure basically tests whether "mda" and "plr" are different for the two years.

    5. T-Tests (Proc Ttest)

    This procedure runs a basic T-test comparing the means for two different samples. In the following example, the variable "sample" is coded 1 for sample firms and 0 for firms in a matched sample. Thus, the output compares the means for each variable across the two samples.

    proc ttest data=alldata;

    class sample;

    var ros roa curratio debtratio cftoni rdratio peratio bookmkt;

    title 'Standard T-test';

    6. Wilcoxon Rank Sum Test (Proc Npar1way)

    This procedure can run both parametric and non-parametric comparisons across classification samples. The following example runs an anova (similar to the T-test example listed above) and also computes the non-parametric Wilcoxon Rank Sum Test for the same variables.

    proc npar1way data=alldata anova wilcoxon;

    class sample;

    var ros roa curratio debtratio cftoni rdratio peratio bookmkt;

    title 'Standard Wilcoxon-Rank-Sum Test';


    7. Sample SAS Program

    *******************************************************************

    * This program checks student's data and analysis *

    *******************************************************************;

    options ls=256 ps=10000 nocenter;

    libname herdata '/homes/txc7/Lorraine/';

    **************************************************************

    *********************** read in data *************************

    **************************************************************;

    ********* read in dependent variables ***************;

    data alldat;

    infile '/homes/txc7/Lorraine/Totals.prn' missover ;

    input

    permno mdadiff plrdiff totdiff mda87 mda97 plr87 plr97;

    data alldat;

    set alldat;

    if mda87 ne .;

    data diffs;

    set alldat;

    ********* read in control variables ****************;

    data controls;

    infile '/homes/txc7/Lorraine/contvars.prn' missover ;

    input

    permno size87 size97 pftmgn87 pftmgn97 numan87 numan97;

    data controls;

    set controls;

    avesize = (size87 + size97) / 2;

    aveprmgn = (pftmgn87 + pftmgn97) / 2;

    aveanal = (numan87 + numan97) / 2;

    avesize = log(avesize);

    aveprmgn = log(aveprmgn);

    aveanal = log(aveanal);

    chgsize = (size97 - size87) / size87;

    chgprmgn = (pftmgn97 - pftmgn87) / pftmgn87;

    chganal = (numan97 - numan87) / numan87;

    ********** merge dependent and independent vars **********;

    proc sort data=diffs; by permno;

    proc sort data=controls; by permno;

    data alldata;

    merge diffs controls;

    by permno;

    data alldata;

    set alldata;

    tot87 = mda87 + plr87;

    tot97 = mda97 + plr97;

    ***************** create anova dataset ******************;

    data data87;

    set alldat;

    year=87;


    mda=mda87;

    plr=plr87;

    keep permno mda plr year;

    data data97;

    set alldat;

    year=97;

    mda=mda97;

    plr=plr97;

    keep permno mda plr year;

    data allyears;

    set data87 data97;

    ******* run descriptive statistics for dependent vars ********;

    proc univariate data=data87;

    var mda plr;

    title 'Descriptive Stats for 1987';

    proc univariate data=data97;

    var mda plr;

    title 'Descriptive Stats for 1997';

    *********** T-test, Anova, Wilcoxon Tests ************;

    proc univariate data=diffs;

    var mdadiff plrdiff totdiff;

    title 'T-test of Differences';

    proc anova data=allyears;

    class year;

    model mda plr = year;

    title 'Analysis of Variance';

    proc npar1way data=allyears anova wilcoxon;

    class year;

    var mda plr;

    title 'Wilcoxon Rank-sum Tests';

    ********** Regression Analysis *************;

    proc reg data=alldata;

    model mdadiff = avesize aveprmgn aveanal / vif collin acov spec ;

    proc reg data=alldata;

    model plrdiff = avesize aveprmgn aveanal / vif collin acov spec ;

    endsas;

    IV. Using Macros

    A. People who use SAS regularly often have to do the same things over and over.

    Macros are an efficient way to execute routines that are done on a regular basis. I'm not an expert at writing macros, but I have used them in the past and they allow you to have very parsimonious SAS code.

    B. Several graduates from our program are very good at writing macros. I'm including several macro descriptions below. The actual code is posted on Blackboard along with SAS Basics. Basically, the macros allow you to complete in one line what normally takes many. For example, the WT macro does either winsorization or truncation of any variable you designate. There are also macros for creating industries based on Fama/French, for creating a simpler output from proc reg, for creating a pearson/spearman matrix, etc.

    C. Using Macros

    In order to use macros, you have to tell SAS where the macro code resides. For example, the macros explained below are saved in a SAS program called "Macros.sas". To use these macros, you need to use the "%include" statement in the opening lines of your program (usually right after your "options" statement):

    %include "C:\Research\Macros\Macros.sas";

    If you are using PC SAS to connect directly to the mainframe, you need to also use a "%include" after the "rsubmit" command so that WRDS will know where the macros reside.

    %include "Macros.sas";

    Then, you can simply reference a macro anytime you want to use it during your program. For example, if you want to winsorize, you can execute the winsorization macro as follows:

    %WT(data=cstat, vars=cfo wcd1 earn cfom1 cfop1, type=W, pctl=0.5 99.5);

    See the details about this and other macros below.
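    If you have never looked inside a macro, it may help to see how small one can be. A macro is defined between "%macro" and "%mend", takes named arguments, and is then called by its name with a percent sign in front. The macro below is a made-up illustration only (it is not one of the macros in Macros.sas), and the "cstat" data values are invented so that the example runs on its own.

```sas
/* Hypothetical macro: prints basic statistics for any dataset and variable list. */
%macro quickstats(data=, vars=);
    proc means data=&data n mean median std min max;
        var &vars;
        /* Double quotes are needed so that &data is filled in within the title. */
        title "Basic stats for &data";
    run;
%mend quickstats;

/* Made-up data for the demonstration. */
data cstat;
    input cfo earn;
    datalines;
1.2 0.8
-0.4 0.1
2.5 1.9
;
run;

%quickstats(data=cstat, vars=cfo earn);
```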

    MACRO NAME: WT
    ARGUMENTS:
        1. DATA: input dataset
        2. OUT: output dataset (leave blank to overwrite input dataset)
        3. BYVAR: variable(s) used to form groups
        4. VARS: variable(s) to be modified
        5. TYPE: =W for winsorize, =T for truncation
        6. PCTL: percentile points (small to large)
        7. DROP: =Y drops truncated variables from new dataset. =N leaves truncated variables as missing code .T.
    REQUIRED: DATA, VARS, PCTL
    DESCRIPTION: This macro winsorizes or truncates the input variables at the specified percentiles.
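    To give a sense of what winsorizing involves when done by hand, here is a rough sketch for a single variable at the 1st and 99th percentiles. This is not the WT macro's actual code, and it does not handle groups or multiple variables. The dataset "cstat" is filled with made-up random numbers, and the names "cutoffs", "cstat_w", "p1", and "p99" are my own choices for the illustration.

```sas
/* Made-up data: 200 random draws for one variable. */
data cstat;
    do i = 1 to 200;
        earn = rannor(12345);
        output;
    end;
    drop i;
run;

/* Step 1: save the 1st and 99th percentiles as variables p1 and p99. */
proc univariate data=cstat noprint;
    var earn;
    output out=cutoffs pctlpts=1 99 pctlpre=p;
run;

/* Step 2: pull values outside the cutoffs back to the cutoffs. */
data cstat_w;
    if _n_ = 1 then set cutoffs;   /* p1 and p99 are carried onto every row */
    set cstat;
    if earn ne . then do;          /* leave missing values alone */
        if earn lt p1 then earn = p1;
        if earn gt p99 then earn = p99;
    end;
run;
```

    Truncation would follow the same two steps, except that the out-of-range values would be set to missing (or the observations deleted) instead of being reset to the cutoffs.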

    MACRO NAME: LAGS
    ARGUMENTS:
        1. DATA: input dataset
        2. OUT: output dataset (leave blank to overwrite input dataset)
        3. VARS: variable(s) to be modified
        4. FIRMID: firm-specific identification variable
        5. TIMEID: date variable (e.g. fyenddt fqenddt)
        6. LAG_TYPE: =1 monthly, =3 quarterly, =12 yearly (default)
        7. NUM_LAGS: number of lags (negative values specify leads)
    REQUIRED: DATA, VARS, FIRMID, TIMEID, NUM_LAGS
    DESCRIPTION: This macro lags (or leads) variables based on a date variable
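    For comparison, here is a rough sketch of a one-year lag done by hand with the built-in "lag" function. It is not the LAGS macro's code; it uses a simple year variable where the macro uses a date variable, and the dataset "firmyears" and its values are made up. The important detail is the check that the previous row belongs to the same firm and to the prior year; without it, the first year of each firm would pick up the last value from the firm above it.

```sas
/* Made-up firm-year data. */
data firmyears;
    input gvkey year earn;
    datalines;
1001 1996 1.0
1001 1997 1.5
1001 1998 2.0
1002 1997 0.3
1002 1998 0.4
;
run;

proc sort data=firmyears; by gvkey year; run;

data firmyears;
    set firmyears;
    by gvkey;
    lagearn = lag(earn);    /* value from the previous row */
    lagyear = lag(year);
    /* Keep the lag only when it is the same firm's prior year. */
    if first.gvkey or lagyear ne year - 1 then lagearn = .;
    drop lagyear;
run;

proc print data=firmyears; run;
```

    With this made-up data, lagearn is missing for the first year of each firm, 1.0 and 1.5 for gvkey 1001 in 1997 and 1998, and 0.3 for gvkey 1002 in 1998.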


    MACRO NAME: CORRPS
    ARGUMENTS:
        1. DATA: input dataset
        2. VARS: variable(s) to be correlated
    REQUIRED: DATA, VARS
    DESCRIPTION: This macro creates a correlation matrix of the input variables with Pearson (above) and Spearman (below) the diagonal.

    MACRO NAME: FF17, FF30, IND22
    ARGUMENTS:
        1. DATA: input dataset
        2. NEWVARNAME: name of the new industry variable (default is industry)
        3. SIC: name of the input sic code variable (default is sic)
        4. OUT: output dataset (leave blank to overwrite input dataset)
    REQUIRED: DATA
    DESCRIPTION: This macro creates industry definitions based on Fama French 17 industries (FF17), Fama French 30 industries (FF30), or the Barth et al. 22 industries (IND22) and creates a corresponding format code (ff. for the Fama French 17 or Fama French 30, and indfmt. for the Barth et al.)