sas session 1

Upload: ashu-verma

Post on 07-Jul-2015


Introduction To SAS Session I, August 2009

By Rupesh Kumar

About SAS

Stands for Statistical Analysis System
Developed in the early 1970s at North Carolina State University
SAS Institute Inc. was formed in 1976
Used for performing tasks like
  Data Entry, Retrieval and Management
  Report Writing
  Statistical and Mathematical Analysis
  Business Planning, Forecasting and Decision Support
  Operations Research

Base SAS

Base SAS is the core of the SAS system
  SAS language: a programming language used to manage your data
  SAS procedures: software tools for data analysis and reporting
  Macro Facility: a tool for extending SAS and reducing text in your programs (out of scope)
  ODS: a system that delivers output in a variety of easy-to-access formats (out of scope)
  SAS windowing environment: an interactive GUI to write and run your SAS programs

Objectives

After completing this lecture, you should
  understand what SAS datasets are
  know how to
    import data into a SAS dataset from external sources
    manipulate SAS datasets using the DATA step
  be familiar with basic operators and functions used in the DATA step
  understand how the DATA step works internally and its implications on
    correctness of SAS programs
    efficiency of SAS programs
  be familiar with the windowing environment: editor, log, output and explorer windows

SAS Datasets

A SAS dataset (physically a file in the file system) both describes and physically stores your data
It consists of the following
  descriptor information
  data values
The descriptor information describes the contents of the SAS dataset to SAS
The data values can be thought of as a table having at least one column and zero or more rows
Rows are usually referred to as observations
  Each observation is a collection of data values about a particular object
Columns are usually referred to as variables
  Each column or variable describes a particular characteristic of an object

Emp Id   FirstName   LastName   DateOfJoining
001      Anindya     Mozumdar   20th Dec, 2005
002      Indrajit    Sengupta   10th July, 2005
003      Sayandeb    Banerjee   4th April, 2005

In this table, each column is a variable and each row is an observation

Rules for most SAS names

The first character must be a letter or an underscore (_); subsequent characters must be letters, numeric digits or underscores
No blanks or special characters in names: no %$!*@
There are a few reserved names
  _NULL_, _DATA_ and _LAST_ for SAS datasets
  _N_, _ERROR_, _CHARACTER_, _NUMERIC_ and _ALL_ for columns
Both datasets and variables can have up to 32 characters in their names
SAS variable names are case INSENSITIVE, but the values of character variables are case SENSITIVE

Using a DATA step to create a SAS dataset

Here is a simple program that creates a dataset called demo with two columns x and y

/* My first SAS program. */
DATA demo;
    x = 1;
    y = "Hello";
RUN;

x   y
1   Hello

The demo dataset with 2 columns and 1 observation

Basic Structure of a SAS program

SAS programs consist of a series of SAS statements
Each statement must end in a semicolon (;)
Add comments using either * comment text ; or /* comment text */
The RUN; statement signals SAS to execute the preceding statements
The layout of SAS statements (indentation, line breaks, several statements on one line) does not affect how a program runs
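A minimal sketch of the points above. The dataset name layout_demo and its variables are made up for illustration; the program only shows the two comment styles and that a statement may share a line with others or span several lines.

```sas
* A comment that starts with an asterisk ends at the next semicolon;
/* A comment in this style can span
   several lines. */
DATA layout_demo;      /* hypothetical dataset name */
    a = 1; b = 2;      /* two statements on one line */
    total = a
          + b;         /* one statement across two lines */
RUN;
```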

Where are the datasets created ?

All SAS datasets have a two-level name: libref.datasetname
A library reference (libref) corresponds to a directory in the file system
If the library reference is omitted, the dataset is stored in a temporary library called work
A new library can be created using the LIBNAME statement
The name of a new library is restricted to 8 characters
Temporary datasets are lost when you close your current SAS session

/* The previous SAS program modified to store the demo dataset in a permanent directory. */
LIBNAME myds "C:\PermanentDS";
DATA myds.demo;
    x = 1;
    y = "Hello";
RUN;

Creating a dataset using the INFILE statement (generally used to read data from external files)

The following program creates a dataset called demo2 with 3 variables and 4 observations

DATA demo2;
    INFILE DATALINES;
    INPUT id name $ age;
    DATALINES;
1 A 18
2 B 22
3 C 21
4 D 21
;
RUN;

INFILE: read data from the input source; this can be DATALINES or the path of the external file
INPUT: the input variables
DATALINES: start reading data from the next line
The ; on a line by itself marks the end of data

Points to note about the previous program

The INFILE statement indicates that the data is to be read from raw input data
  The raw data may be part of an external physical file on the hard disk (use a file name or file reference)
  It may be part of the SAS program (use DATALINES)
The INPUT statement lists the variables to be read from the input file
  The variables must be listed in the order in which they are read, separated by spaces
  Variables can be either numeric or character
  Use a $ after the variable name to indicate that it is a character variable
The DATALINES statement signals that the raw data starts from the next line
  It MUST be the last statement in the DATA step before the RUN statement
  Use a ; on a line by itself to indicate the end of the raw data
By default, SAS moves to the next variable as soon as it encounters a blank or the end of the line

Input Statement

The INPUT statement tells SAS how to read data by describing the arrangement of the raw data

Different types of INPUT statement:
1. LIST INPUT
   a) It simply lists the variables, separated by blanks.
   b) This style is also called free format.
   c) A character variable should be followed by $. Missing values should be marked with a period (.); a blank does not mean a missing value in the list input style.
   d) The default length of a character variable is 8 characters.
   e) Values longer than 8 characters can be read using a LENGTH statement.
2. COLUMN INPUT
   The column input style reads input values from specified columns. A variable name is followed by the starting and ending columns. This input style works well for well-structured data.
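As a small illustration of column input, the sketch below reads three fields from fixed column ranges. The dataset name demo_col and the sample records are made up for illustration. Because the columns are stated explicitly, the name may contain an embedded blank.

```sas
DATA demo_col;   /* hypothetical dataset name */
    /* id in columns 1-3, name in columns 5-14, age in columns 16-17 */
    INPUT id $ 1-3 name $ 5-14 age 16-17;
    DATALINES;
001 Alan Kay   18
002 Fred Brook 21
;
RUN;
```

The data lines must start in column 1 for the column ranges to line up.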

Standard and Non Standard Data

Standard data are character or numeric data values that can be read with list, column, formatted, or named input.
A number with a decimal point or a preceding minus sign is considered a standard numeric data value, as is a value represented in scientific notation.
Nonstandard data include numeric data values that contain nonnumeric characters. Examples include
  a) numbers with dollar signs or commas or both
  b) dates and times
These values can be read only with informats or written only with formats.
Informats translate the nonstandard data into a form that can be processed within SAS.
Formats write out data values in a specific form that may be different from the SAS internal representation of the data value.

Input Statement (continued)

3. FORMATTED INPUT
   It can read standard as well as nonstandard data.
   a) Character data values can contain embedded delimiters.
   b) Placeholders for missing values are not required.
   c) Data values can be read in the order you specify; it is not necessary to specify the variables in the order they appear in the data lines.
   d) Pointer controls can direct where SAS should read data values.
   e) Data values or parts of data values can be reread.
   Pointer controls:
   @n, a column control, moves the input pointer to the nth column
   +n, a column control, moves the pointer to the right by n columns
   @@, a line holder, keeps the pointer on the current line and waits for further input
   #n, a row control, goes to the nth line
   /, a row control, goes to the first column of the next line
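The sketch below illustrates formatted input with the @n and +n pointer controls. The dataset name demo_fmt, the variable names and the sample records are assumptions made for illustration. It reads the name before the id to show that the order in the INPUT statement need not match the order in the data, and uses the COMMA7. informat for a nonstandard numeric value.

```sas
DATA demo_fmt;   /* hypothetical dataset name */
    INPUT @5  name  $10.     /* jump to column 5, read 10 characters */
          @1  id    $3.      /* go back to column 1 */
          @16 sales COMMA7.  /* nonstandard number in columns 16-22 */
          +1  age   2.;      /* skip one column, then read 2 digits */
    DATALINES;
001 Alan Kay   1,253.5 18
002 Fred Brook   600.0 21
;
RUN;
```

As with column input, the data lines have to start in column 1 so that the column positions match.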

Input Statement (continued)

4. MODIFIED LIST INPUT
   A mixture of list and formatted input. The modifiers are:
   a) The ampersand (&) format modifier reads character data values that contain embedded blanks with list input, and reads until it encounters more than one consecutive delimiter.
   b) The colon (:) format modifier, with an informat following the variable name, tells SAS to read the data value with the informat and to read until it encounters the specified delimiter or reaches the width specified by the informat, whichever comes first.
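A short sketch of both modifiers. The dataset name demo_mod and the sample records are made up for illustration. The name is followed by two blanks in each record, which is what ends a value read with the & modifier; the : modifier lets COMMA10. stop at the end of the value instead of reading a fixed 10 columns.

```sas
DATA demo_mod;   /* hypothetical dataset name */
    /* & reads name up to two consecutive blanks; : reads sales up to the next blank */
    INPUT id name & $20. sales : COMMA10.;
    DATALINES;
1 Alan Kay  1,253
2 Richard Stallman  600
;
RUN;
```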

Options for INFILE Statement-1

By default, the delimiter between data values is a blank
This can be changed with the DLM option in the INFILE statement
Some commonly used delimiters are , and |

/* Use the given character as the delimiter. */
DATA demo3;
    INFILE DATALINES DLM=",";
    INPUT id name $ age;
    DATALINES;
1,A,18
2,BQ,22
3,C,21
4,DS,21
;
RUN;

Options for INFILE Statement-2

Problem
  Your names are in the form LastName, FirstName and so you want the , to be stored as part of the name too
  Use quotes around your character data, with the DSD option in the INFILE statement

/* Use the DSD option with quotes to embed the delimiter in character data. */
DATA demo4;
    INFILE DATALINES DLM="," DSD;
    INPUT id name $ age;
    DATALINES;
1,"A,P",18
2,"B,Q",22
3,"C,R",21
4,"D,S",21
;
RUN;

The three uses of DSD

The DSD option has three functions when reading delimited raw data
  Strip off any quotes that surround values in the raw data
  Treat consecutive delimiters as a missing value
  Assume the delimiter is a , (if not, then use the DLM option)

/* Use of DSD assumes that the delimiter is a , . */
DATA demo5;
    INFILE DATALINES DSD;
    INPUT id name $ age;
    DATALINES;
1,"A,P",18
2,"B,Q",22
3,"C,R",21
4,"D,S",21
;
RUN;

Options for INFILE Statement-3

Normally, SAS moves to the next line if it cannot find values for all variables in a single line of raw data
The MISSOVER option prevents this and assigns missing values to the variables for which the line contains no data
For delimiter-separated data, the MISSOVER option is not necessary if there are enough delimiters in the raw data
The two programs below are equivalent

DATA demo6;
    INFILE DATALINES DLM="," DSD MISSOVER;
    INPUT id name $ age;
    DATALINES;
1,"A,P"
2,"B,Q",22
3,"C,R",21
4,"D,S",21
;
RUN;

DATA demo7;
    INFILE DATALINES DLM="," DSD;
    INPUT id name $ age;
    DATALINES;
1,"A,P",
2,"B,Q",22
3,"C,R",21
4,"D,S",21
;
RUN;

Length, Formats, Informats - 1

Consider the following SAS program

DATA demo8;
    INFILE DATALINES DLM="," DSD MISSOVER;
    INPUT id name $ age;
    DATALINES;
1,"Kay, Alan",18
2,"Stallman, Richard",22
3,"Brooks, Fred",21
4,"Wozniak, Steve",21
;
RUN;

On running the program and viewing the contents of the dataset, we find that each name has been truncated to 8 characters
Run the same program using a LENGTH statement
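One way to do this is sketched below. The dataset name demo8_len and the length of 20 are choices made for this illustration. The LENGTH statement has to appear before the INPUT statement first uses the variable.

```sas
DATA demo8_len;   /* hypothetical dataset name */
    LENGTH name $ 20;   /* reserve 20 bytes for name instead of the default 8 */
    INFILE DATALINES DLM="," DSD MISSOVER;
    INPUT id name $ age;
    DATALINES;
1,"Kay, Alan",18
2,"Stallman, Richard",22
3,"Brooks, Fred",21
4,"Wozniak, Steve",21
;
RUN;
```

Because name is now the first variable the compiler encounters, it also becomes the first column of the dataset.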

Length, Formats, Informats - 2

Each variable in SAS has a length, which is the number of bytes used to store each of the variable's values in a SAS data set
By default, the length of numeric and character variables is 8 bytes
  Thus SAS truncates each name to 8 characters
Besides the LENGTH statement, the length of a variable can be set in two ways
  Using an appropriate informat in the INPUT statement
  By assigning a constant value to the variable, in which case the length is set by the first constant value encountered for the variable

Informat
  Instructions that SAS uses when reading data values
  If no informat is specified, then it is BEST12. for numeric variables, and $w. for character variables, where w is either 8 or the length of the first constant value encountered
You can specify an informat
  for numeric variables by using w.d, where w is the total width and d is the number of places after the decimal
  for character variables by using $w., where w is the maximum number of characters for the variable

Length, Formats, Informats - 3

Suppose you want to read the following data into SAS

Day , Units , Sales
1,1,120,$1,253.50
2,47,$53
3,555,$600
4,1,023,$1,100
5,110,$125

While the Units and Sales columns are numeric, they have embedded commas and dollar signs
SAS has special informats and formats to read and display such values

Length, Formats, Informats - 4

DATA demo11;
    INFILE DATALINES DLM="," DSD MISSOVER FIRSTOBS = 2;
    INPUT day units :COMMA10. sales :DOLLAR10.2;
    DATALINES;
Day , Units , Sales
1,"1,120","$1,253.50"
2,47,$53.00
3,555,$600.00
4,"1,023","$1,100.00"
5,110,$125.00
;
RUN;

The COMMA10. informat is used to read numbers with embedded commas
The DOLLAR10.2 informat is used to read numbers with embedded commas and a preceding dollar sign
The FIRSTOBS = 2 option in the INFILE statement tells SAS that the observations start from the 2nd line of raw data
The : modifier is used with each informat to ensure that SAS stops reading a value on encountering the next delimiter

Length, Formats, Informats - 5

DATA demo11;
    INFILE DATALINES DLM="," DSD MISSOVER FIRSTOBS = 2;
    INPUT day units :COMMA10. sales :DOLLAR10.2;
    FORMAT units COMMA10. sales DOLLAR10.2;
    DATALINES;
Day , Units , Sales
1,"1,120","$1,253.50"
2,47,$53.00
3,555,$600.00
4,"1,023","$1,100.00"
5,110,$125.00
;
RUN;

Using a FORMAT statement, we can ensure that we view the values with the embedded commas and dollar sign

Missing Values - 1

A missing value indicates that no data value is stored for the variable in the current observation
There are three kinds of missing values
  numeric: a single . character is used to indicate a numeric missing value
  character: a blank is used to indicate a character missing value
  special numeric: a type of numeric missing value that enables you to represent different categories of missing data by using the letters A-Z or an underscore

DATA demo16;
    MISSING X I;
    INPUT Id $4. Foodpr1 Foodpr2 Foodpr3 Coffeem1 Coffeem2;
    DATALINES;
1001 115 45 65 I 78
1002 86 27 55 72 86
1004 93 52 X 76 88
1015 73 35 43 112 108
1027 101 127 39 76 79
;
RUN;

Missing Values - 2

Order of missing values
  A numeric missing value is smaller than any other number; effectively it is equal to minus infinity
  A character missing value is smaller than any printable character

Generation of missing values
  Missing values are created by arithmetic operations on missing values
  Missing values are generated by illegal operations such as division by zero or taking the logarithm of zero
  Using an expression that produces a number too large to be represented as a floating point number
  Illegal character to numeric conversions
    normally SAS converts a character variable to numeric if it is used in an arithmetic expression
    the result of the expression will be missing if the character variable does not contain numeric data
  Use of special missing values in a numeric expression will result in a period (i.e. a numeric missing value) in the result of the expression
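A few of these cases can be tried in a single DATA step. This is only a sketch: the dataset name demo_miss and all variable names are made up for illustration. SAS is expected to write notes about the invalid operations to the log.

```sas
DATA demo_miss;   /* hypothetical dataset name */
    x = .;
    y = x + 1;                /* arithmetic on a missing value gives missing */
    z = 10 / 0;               /* division by zero gives missing */
    c = "abc";
    n = c * 2;                /* "abc" is not numeric data, so n is missing */
    is_small = (x < -1E300);  /* 1, because missing sorts below any number */
RUN;
```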

Creating SAS datasets using the SET statement

The SET statement is used to create new SAS datasets from existing SAS datasets
The simplest form of the SET statement is

DATA new_dataset;
    SET existing_dataset;
RUN;

The above code
  reads in one observation of the existing dataset
  processes the observation as instructed by the statements following the SET statement
  writes the new observation to the new dataset
You can use the same dataset in the DATA and SET statements
  This is commonly used to create new variables from the existing variables of the dataset

Creation of new variables - 1

The common ways of creating a new variable in the DATA step are
  using an assignment statement
  reading data with the INPUT statement
  specifying a new variable in a FORMAT statement
When creating variables using the assignment statement
  the variables which appear in the expression on the right hand side of the assignment statement must be created before this step

DATA test;
    x = y;  /* WRONG: both x and y are created and set to missing here */
    y = 12; /* y is set to 12, but x will still have a missing value */
RUN;

You can use arithmetic or other operators and built-in SAS functions to create new variables

Creation of new variables - 2

The commonly used arithmetic operators in SAS are + (addition), - (subtraction), * (multiplication), / (division), ** (exponentiation)

DATA demo18;
    INFILE DATALINES;
    INPUT id $3. m1 m2 m3;
    DATALINES;
001 27 44 32
002 28 26 29
003 44 40 48
004 37 33 41
005 18 25 33
;
DATA demo19;
    SET demo18;
    total = m1 + m2 + m3;
    per = (total / 150) * 100;
    FORMAT per 4.1;
RUN;

DROP, KEEP and RENAME statements

Use the DROP, KEEP and RENAME statements to decide which variables go into your output dataset, or to rename them

DATA demo21;
    x = 1;
    y = "Hello";
DATA demo22; /* This dataset will have a single column y. */
    SET demo21;
    DROP x;
DATA demo23; /* This dataset will have a single column x. */
    SET demo21;
    KEEP x;
DATA demo24; /* This dataset will have two columns x_new and y. */
    SET demo21;
    RENAME x = x_new;
RUN;

DROP, KEEP and RENAME dataset options

Use the DROP, KEEP and RENAME dataset options to decide which variables go into your output dataset, or to rename them

DATA demo21;
    x = 1;
    y = "Hello";
DATA demo25; /* This dataset will have a single column y. */
    SET demo21(DROP = x);
DATA demo26; /* This dataset will have a single column x. */
    SET demo21(KEEP = x);
DATA demo27; /* This dataset will have two columns x_new and y. */
    SET demo21(RENAME = (x = x_new));
RUN;

The datasets demo25, demo26 and demo27 are the same as demo22, demo23 and demo24 respectively
However, they are processed differently by SAS
Understanding the difference in processing is the key to writing correct and efficient SAS programs

SAS Data Step Internals - 1

When you submit a DATA step for execution, it is first compiled and then executed

The Compilation Phase
  Checks the syntax of each statement
  Identifies the type and length of each variable (important!!)
  Creates an input buffer if reading from raw data
    The input buffer holds one record of raw data at a time
  Creates a program data vector (PDV)
    a logical area in memory used to build the output dataset
    holds the dataset variables, the computed variables and two special variables, _N_ and _ERROR_
    _N_ holds the observation number of the current record being processed
    _ERROR_ is 1 if an error has occurred, 0 otherwise
  Creates descriptor information about the output SAS dataset

The Execution Phase
  Iterates once for each observation that is being created
  At the beginning of each iteration, the value of _N_ is incremented by 1
  Sets all the newly created program variables to missing
  Reads one observation from an existing SAS dataset or the input buffer
  Executes any programming statements for the current record
  Writes an observation to the output dataset and begins the next iteration

SAS Data Step Internals - 2 (Program Data Vector)

A DATA step is processed in two phases: first it is compiled, and then executed. During the compile phase the program lines are read line by line. When the SET statement is read, the compiler looks at the data set named in the SET statement and determines its variables. The compiler reserves a place in the PDV for each variable as it is encountered.

DATA demo27b;
    put '1' _all_;
    SET demo27a;
    put '2' _all_;
    D=2* ;
    put '3' _all_;
RUN;

1) The DATA step is an implied loop (when an INPUT, SET, MERGE, UPDATE or MODIFY statement is present).
2) SAS executes the statements in order, line by line, for each observation in the data set being read.
3) _N_ counts iterations of the implied loop of the DATA step.
4) At the bottom of the DATA step, SAS, by default, writes an observation to a SAS data set.
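The implied loop and the PDV can be watched in the log with PUT _ALL_. The sketch below is self-contained; the dataset names pdv_in and pdv_out and the variables a, b and c are made up for illustration.

```sas
DATA pdv_in;          /* hypothetical input dataset with two observations */
    INPUT a b;
    DATALINES;
1 10
2 20
;
RUN;

DATA pdv_out;         /* hypothetical output dataset */
    PUT "top:    " _ALL_;   /* PDV at the start of the iteration */
    SET pdv_in;             /* loads a and b from the next observation */
    c = a + b;              /* computed variable */
    PUT "bottom: " _ALL_;   /* PDV just before the implicit output */
RUN;
```

On a typical run the "top" line should appear three times and the "bottom" line twice, because the third iteration stops when SET finds no more observations. The values of a and b read by SET are still present at the top of the next iteration, while the computed c is back to missing.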

SAS Data Step Internals - 3

A DROP, KEEP or RENAME dataset option determines whether a variable is loaded into the PDV and under what name
A DROP, KEEP or RENAME statement causes the variable to be loaded into the PDV, but controls whether it is written to the output or not
You can speed up your programs by
  keeping only the required variables
  dropping variables which are not required
  avoiding a DATA step just to rename variables unless absolutely necessary
    It will iterate through the entire dataset and make your programs slow
To ensure that your programs are correct
  Make sure that a variable is available for processing

DATA demo28;
    SET demo28a(DROP = x);
    z = x + 2; /* Both x and z will be set to missing in the output dataset. */
RUN;

Controlling the built-in DATA loop: RETAIN

The built-in DATA loop sets all values to missing at the beginning of each iteration of the DATA step
The RETAIN statement forces the built-in loop to keep values from the previous observation
Variables named in the RETAIN statement keep their values until they are reset by an INPUT or assignment statement

For example, suppose you want to calculate the cumulative sum of a variable x having the values (1,2,3,4,5,6,7,8,9,10) in a dataset named Test

DATA demo29;
    put '1:' _all_;
    SET Test;
    put '2:' _all_;
    RETAIN y 0;
    y = y + x;
    put '3:' _all_;
RUN;

Controlling the built-in DATA loop: SUM

A special case is a plus sign in a statement without an equal sign (the sum statement)
It implicitly retains the previous value of the variable and adds the current value to it

DATA demo30;
    INFILE DATALINES;
    INPUT num;
    total + num; /* What will happen if we write total = total + num? */
    DATALINES;
12
11
17
13
14
;
RUN;
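The question in the comment can be explored with a side-by-side sketch. The dataset name demo_sum and the variables running and plain are made up for illustration.

```sas
DATA demo_sum;   /* hypothetical dataset name */
    INPUT num;
    running + num;        /* sum statement: running starts at 0 and is retained */
    plain = plain + num;  /* assignment: plain is reset to missing on every iteration */
    DATALINES;
12
11
17
;
RUN;
```

Here running should hold 12, 23 and 40, while plain stays missing in every observation, since missing plus a number is missing and plain is never retained.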

IF-THEN-ELSE

Used for conditional assignment of variables

IF condition THEN action;
ELSE IF condition THEN action;
ELSE action;

DATA demo31;
    marks = 77;
    FORMAT performance $10.;
    IF marks > 80 THEN performance = "GOOD";
    ELSE IF marks >= 50 THEN performance = "AVERAGE";
    ELSE performance = "POOR";
RUN;

IF-THEN-DO-END

A DO-END combined with IF-THEN can be used to execute multiple actions for each condition

IF condition THEN DO;
    action;
    action;
END;
ELSE IF condition THEN DO;
    action;
    action;
END;
ELSE DO;
    action;
    action;
END;

DATA demo32;
    marks = 77;
    FORMAT performance $10.;
    IF marks > 80 THEN DO;
        performance = "GOOD";
        grade = "A";
    END;
    ELSE IF marks >= 50 THEN DO;
        performance = "AVERAGE";
        grade = "B";
    END;
    ELSE DO;
        performance = "POOR";
        grade = "C";
    END;
RUN;

More operators

You can use the following comparison operators in conditions

= or EQ (equal to)
^= or NE (not equal to)
> or GT (greater than)
>= or GE (greater than or equal to)
< or LT (less than)
<= or LE (less than or equal to)

10; /* IF num 50;

WHERE joindate >= "14APR2006"d;

The CONTAINS operator can be used to select observations by searching for a specified set of characters in a variable

WHERE company CONTAINS "bay";
WHERE company NOT CONTAINS "bay";

WHERE expression processing - 2

The IS NULL operator is used to select observations where the value of the specified variable is missing

WHERE id IS NULL;

The LIKE operator is used to select observations by comparing the values of a character variable to a specified pattern

WHERE lastname LIKE "N%";        /* % matches any number of characters. */
WHERE firstname NOT LIKE "D_a%"; /* _ matches exactly one character. */

WHERE expression processing - 3

WHERE can be used either as a statement or as a dataset option

DATA demo37;
    INPUT n;
    DATALINES;
1
2
3
4
;
DATA demo38;
    SET demo37;
    WHERE n > 3;
DATA demo39;
    SET demo37(WHERE = (n > 3));
RUN;

WHERE and IF

It may appear that there is no difference between using the WHERE and the IF statement to subset observations
Both WHERE and IF can be used on variables present in the original dataset
You must use IF on automatic variables like _N_ and on variables newly created in the DATA step
You must use WHERE to use special operators like CONTAINS or LIKE
WHERE is more efficient, as observations not satisfying the given condition are not loaded into the PDV
Take care when using the OBS = dataset option
  When used with IF, SAS first subsets the data based on OBS = and then applies the IF condition
  When used with WHERE, SAS first applies the WHERE condition and then restricts the output dataset with the OBS = condition
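The interaction with OBS = can be seen with a small sketch. The dataset names nums, with_if and with_where are made up for illustration, and the filter is written as a subsetting IF (an IF with no THEN).

```sas
DATA nums;   /* hypothetical input dataset holding 1 to 6 */
    INPUT n;
    DATALINES;
1
2
3
4
5
6
;
RUN;

/* OBS= limits the rows read to 1, 2, 3; the IF then keeps only 3. */
DATA with_if;
    SET nums(OBS = 3);
    IF n > 2;
RUN;

/* WHERE filters first (3, 4, 5, 6); OBS= then keeps the first three of those. */
DATA with_where;
    SET nums(OBS = 3);
    WHERE n > 2;
RUN;
```

Under the rules above, with_if should end up with one observation and with_where with three.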

The OUTPUT statement

There is an implicit OUTPUT statement at the end of each DATA step, which instructs SAS to write the current observation (all of its variables) to the output SAS dataset
The OUTPUT statement can be used explicitly in the DATA step
  Once an OUTPUT statement is specified, the implicit OUTPUT is removed and all writing of observations must be specified by the user
  It is useful for creating multiple datasets from an existing one

DATA demo40;
    INPUT n;
    DATALINES;
1
2
3
4
;
DATA demo41 demo42;
    SET demo40;
    IF n > 2 THEN OUTPUT demo41;
    ELSE OUTPUT demo42;
RUN;
