Business Analytics Using SAS


© Orangetree Business Solutions Private Limited, 2012-13

No part of this book may be referenced or copied without the prior permission of the company.


A FEW WORDS TO THE STUDENTS

Analytics is becoming a popular tool for managerial decision making. It is still not so widespread in countries like India, but in the West it has become a standard practice. Previously, studying analytics involved an in-depth knowledge of statistics and programming languages. The widespread availability of statistical software packages has changed that reality to some extent. More emphasis is now given to applying the techniques to solve business problems, so there is a need to understand the meaning of the statistical procedures. This book has been written to cater to that need.

In this book, all the necessary concepts have been explained keeping the business problem in mind. Also, to remove the apathy for statistics, the use of mathematical expressions has been limited. That does not imply that we do not have to study the mathematics; the intention is to put substance first. As students get accustomed to these statistical concepts, they can go for further investigation using various mathematical and statistical techniques. A list of suggested books and links is given in the appendix.

This book is directly related to the instructor's presentation, so it is highly advisable that students go through this material at the end of each class. For general reading, the reader is advised to follow the chapters in order; they have been arranged in order of increasing complexity, so the initial chapters are very important.

In this book, the statistical procedures have been implemented in SAS. The explanations of the code are written from the perspective of a data modeler. For the perspective of a programmer, students are advised to go through the documentation of the procedures on the SAS website.

In the end, statistical concepts are a way of thinking. The more you recognize the thinking pattern, the quicker you will learn.

Best of Luck!

Team OTG


CONTENTS

1. Introduction to Analytics and Basic Statistics

2. Introduction to Probability Theory

3. Sampling Theory and Estimation

4. Important Tests of Statistical Significance (Part I)

5. Understanding The Association Between The Variables

6. Important Tests of Statistical Significance (Part II)

7. Exploratory Factor Analysis

8. Cluster Analysis

9. Linear Regression

10. Logistic Regression

11. Time Series Analysis


Appendix: Suggested Books and References


Chapter 1

INTRODUCTION TO ANALYTICS AND BASIC STATISTICS

Business analytics (BA) refers to the skills, technologies, applications and practices for continuous, iterative exploration and investigation of past business performance to gain insight and drive business planning.

There are three main categories of analytics:

1. Descriptive - the use of data to find out what happened in the past and what is happening now.
2. Predictive - the use of data to find out what could happen in the future.
3. Prescriptive - the use of data to prescribe the best course of action for the future.

Analytics Domains:

1. Retail sales analytics
2. Financial services analytics
3. Risk & Credit analytics
4. Talent analytics
5. Marketing analytics
6. Behavioral analytics
7. Collections analytics
8. Fraud analytics
9. Pricing analytics
10. Telecommunications
11. Supply Chain analytics
12. Transportation analytics

According to the McKinsey Global Institute (MGI), the amount of data in our world has been exploding, and analyzing large data sets, so-called big data, will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. MGI studied big data in five domains: healthcare in the United States, the public sector in Europe, retail in the United States, and manufacturing and personal-location data globally.

Big data can generate value in each. For example, a retailer using big data to the full could increase its operating margin by more than 60 percent. Harnessing big data in the public sector has enormous potential, too. If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year. Two-thirds of that would be in the form of reducing US healthcare expenditure by about 8 percent. In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data, not including using big data to reduce fraud and errors and boost the collection of tax revenues. And users of services enabled by personal-location data could capture $600 billion in consumer surplus.


Common Business Problems in Telecom:

Customer churn is a common term, used both in academia and in practice, to denote customers with a propensity to leave for competing companies. According to various estimates, in European mobile service markets the churn rate reaches twenty-five to thirty percent annually. On the other hand, financial analyses and economic studies agree that acquiring a new customer is about five times as expensive as retaining an existing one.

Common Business Problems in Retail:

1. Increase customer value and overall revenues
2. Reduce costs and increase operational efficiency
3. Develop successful new products and services
4. Determine profitable sites for new stores and improve existing stores
5. Communicate effectively between departments for better decision making

Snapshot of Companies Using Analytics:

MoneyGram International uses analytics to detect and prevent money transfer fraud before it impacts customers; it has prevented more than US$37.7 million in fraudulent transactions and reduced customer fraud complaints by 72 percent.

Primerica provides its representatives with the ability to drill down into sales data in order to increase productivity and boost revenue. Primerica has more than 142,000 licensed sales representatives.

T-Mobile USA uses analytics to detect the influencers in its network and design lucrative customized offers for them. In this way, it reduced the churn rate by 25%.

Dillard's uses analytics to improve its customer relationship management and merchandise management to deliver the right product at the right store.

Seton Healthcare Family uses analytics to detect patients who are at considerable risk.

Del Monte Foods uses analytics to understand macro variables like inflation and how these variables impact the cost structure of the company.

Reliance Capital uses analytics to retain customers in its mutual fund business, to confirm continual premium payment in the life insurance business, to design products with a high claim ratio in the general insurance business, and, finally, for credit scoring in the mortgage finance business.


Types of Data Analysis

Exploratory Data Analysis (EDA) makes few assumptions, and its purpose is to suggest hypotheses. An OEM manufacturer was experiencing customer complaints. A team wanted to identify and remove the causes of these complaints, so they asked customers for usage data from which the team could calculate defect rates. This started an Exploratory Data Analysis. The investigation established that a supplier had used the wrong raw material. Discussions with the supplier and team members motivated further analysis of the raw material and its composition; this decision to analyze the raw material completed the Exploratory Data Analysis, which used both data analysis and the process knowledge possessed by team members. The supplier and the company then conducted a series of designed experiments which identified an improved raw material composition. Using this composition, the defect rate improved from 0.023% to 0.004%. The experimental design and its analysis were Confirmatory Data Analysis (CDA). Note that the experimental design required a hypothesis generated by the Exploratory Data Analysis.

In short, Exploratory Data Analysis uncovers statements or hypotheses for Confirmatory Data Analysis to consider.

Properties of Measurement

Identity: Each value on the measurement scale has a unique meaning.

Magnitude: Values on the measurement scale have an ordered relationship to one another; that is, some values are larger and some are smaller.

Equal intervals: Scale units along the scale are equal to one another. This means, for example, that the difference between 1 and 2 would be equal to the difference between 19 and 20.

Absolute zero: The scale has a true zero point, below which no values exist.

Scales of Measurement

Nominal Scale: The nominal scale of measurement only satisfies the identity property of measurement. Values assigned to variables represent a descriptive category, but have no inherent numerical value with respect to magnitude. Gender is an example of a variable that is measured on a nominal scale. Individuals may be classified as "male" or "female", but neither value represents more or less "gender" than the other. Religion and political affiliation are other examples of variables that are normally measured on a nominal scale.


Ordinal Scale: The ordinal scale has the properties of both identity and magnitude. Each value on the ordinal scale has a unique meaning, and it has an ordered relationship to every other value on the scale. An example of an ordinal scale in action would be the results of a horse race, reported as "win", "place", and "show". We know the rank order in which the horses finished the race: the horse that won finished ahead of the horse that placed, and the horse that placed finished ahead of the horse that showed. However, we cannot tell from this ordinal scale whether it was a close race or whether the winning horse won by a mile.

Interval Scale: The interval scale of measurement has the properties of identity, magnitude, and equal intervals. A perfect example of an interval scale is the Fahrenheit scale used to measure temperature. The scale is made up of equal temperature units, so that the difference between 40 and 50 degrees Fahrenheit is equal to the difference between 50 and 60 degrees Fahrenheit. With an interval scale, you know not only whether different values are bigger or smaller, you also know how much bigger or smaller they are. For example, suppose it is 60 degrees Fahrenheit on Monday and 70 degrees on Tuesday. You know not only that it was hotter on Tuesday; you also know that it was 10 degrees hotter.

Ratio Scale: The ratio scale of measurement satisfies all four properties of measurement: identity, magnitude, equal intervals, and an absolute zero. The weight of an object would be an example of a ratio scale. Each value on the weight scale has a unique meaning, weights can be rank ordered, units along the weight scale are equal to one another, and there is an absolute zero. Absolute zero is a property of the weight scale because objects at rest can be weightless, but they cannot have negative weight.

Types of Data

Quantitative Data: In most cases, we will find ourselves using numeric data. This is the type of data that contains numbers. For example, delivery times in minutes: 19, 10, 17, 15, 18, 16, 12, 16, 16, 18, 15, 15, 16, 18, 13, 15, 19, 17, 14, 10, 13, 12, 13, 16.


Qualitative Data: The other type of data is string data. A string is simply a line of text and could represent comments about a certain participant, or other information that you do not wish to analyze as a grouping variable.

Categorical Data: The third type of data is categorical data, represented by a grouping variable. For example, you insert a variable called gender and enter 'Male' or 'Female' under this variable as observations. In this case we can group the entire data set with respect to gender; here gender is a grouping variable.

Presentation of Qualitative Data

Tabular Presentation

Cube #   Touch            See      Smell
1        Rough            Brown    Wood
2        Rough            Silver   Metallic
3        Slightly Rough   Silver   Metallic
4        Smooth           Gold     No Smell
5        Smooth           Brown    No Smell
6        Smooth           Brown    No Smell
7        Rough            Brown    Wood
8        Smooth           Gold     No Smell

Color    Number of Items
Brown    4
Gold     2
Silver   2

Subcategory   Frequency   Percent   Cumulative Frequency   Cumulative Percent
Chocolate     491         32.73     491                    32.73
Fruit         170         11.33     661                    44.07
Gum           194         12.93     855                    57.00
Mixed         92          6.13      947                    63.13
Soft          365         24.33     1312                   87.47
Sweet         188         12.53     1500                   100.00

Graphical Presentation: Simple Bar Chart, Pie Chart


Horizontal Bar Chart: good for geographical data
Stacked Bar Chart: good for intra analysis
Multiple Bar Chart: good for inter analysis

Presentation of Quantitative Data

Tabular Presentation

Graphical Presentation

Histogram: understanding the distribution of the data
Scatter Plot: understanding the relationship between two numerical variables


Various Types of Scatter Plots

Positively Related: as one increases, the other also increases
Negatively Related: as one increases, the other decreases
Undefined: no clear relation

Measures of Central Tendency

The vice president of marketing of a fast-food chain is studying the sales performance of the 100 stores in the eastern part of the country. He has constructed the following frequency distribution of annual sales:

Sales (000s)   Frequency
700 - 799      4
800 - 899      7
900 - 999      8
1000 - 1099    10
1100 - 1199    12
1200 - 1299    17
1300 - 1399    13
1400 - 1499    10
1500 - 1599    9
1600 - 1699    7
1700 - 1799    2
1800 - 1899    1

He would be looking at the distribution with an eye toward getting information about the central tendency, to compare the eastern part with other parts of the country. Central tendency is basically the central-most value of a distribution. Now how do we know which one is the central-most value? There are precisely three ways to find the central value: arithmetic mean, median and mode.

The arithmetic mean is the simple average of the data. The problem with the arithmetic mean is that it is influenced by extreme values. Suppose you take a sample of 10 persons whose monthly incomes are 10k, 12k, 14k, 12.5k, 14.2k, 11k, 12.3k, 13k, 11k, 10k. The average income turns out to be 12k, which is a good representation of the data. Now if you replace the last value with 100k, the average turns out to be 21k, which is absurd, as 9 out of 10 people earn well below that mark.

This problem of the arithmetic mean can be reduced through the use of the geometric and harmonic means, but the effect of outliers can be almost nullified by the use of the median. The median is the mark where the entire data set is split into exact halves, that is, 50% of the data lie above the mark and the rest lie below. In an intuitive sense, it is the proper measure of central tendency, but for various computational reasons the arithmetic mean is the most popular measure.

Whereas the median looks for the halfway mark, the mode looks for the value with the highest frequency, that is, the highest number of occurrences.

So, using central tendency, we are trying to find a value around which all the data cluster. This property of the data can be used to deal with missing values. Suppose some of the income data are missing; then you can replace the missing values with the mean or the median. If some city names are missing, one may replace them using the mode, that is, the city which appears most often.
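In SAS, these measures can be obtained with PROC MEANS (or PROC UNIVARIATE). The following is a minimal sketch, assuming we key in the ten monthly incomes from the example above as a hypothetical work.income data set; it is not part of the course data.

data income;
   input income @@;   /* monthly income in thousands, values from the example */
   datalines;
10 12 14 12.5 14.2 11 12.3 13 11 10
;
run;

proc means data=income mean median mode;
   var income;
run;

Replacing the last value with 100 and rerunning the step shows the mean jumping to about 21 while the median barely moves, which is exactly the outlier sensitivity described above.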

Measures of Dispersion

As the name says, here we are trying to assess how dispersed the data are. A measure of central tendency without any idea about the measures of dispersion does not make much sense. Why is that so? Look at the following charts. The horizontal line is the central value in both cases. In the first case, where the data are less dispersed, the data really are clustered around the central line, whereas in the second case the data are so dispersed that the central value is not that meaningful: you cannot say that the horizontal line is a true representative of the data. So there is a need to measure the dispersion in the data.

Broadly, there are two kinds of measures of dispersion: absolute measures like the range or the variance, and relative measures like the coefficient of variation.

The range is the simplest measure. It is basically the difference between the maximum and the minimum value in the data. The other absolute measure, the variance, is a bit more complicated to express in plain words. It comes from the sum of the squared differences of each data point from the arithmetic mean of the data. Now, as you go on increasing the number of data points, this sum keeps increasing, so we take the average. If you then take the square root (e.g. the square root of 9 is 3), we get the standard deviation of the data. If you like, you can memorize the following expression for the sample standard deviation:

s = sqrt( sum of (x - mean)^2 / (n - 1) )

Some of you might find difficulty with the denominator being n - 1 instead of n. The reason is that here we are calculating the sample standard deviation; if it had been the population standard deviation, we could have used n. We will discuss populations and samples in the coming chapters.

Apart from describing the dispersion in the data, the standard deviation can be used for transforming the data. Suppose we want to compare two variables, such as the amount of money people earn and the number of pairs of shoes their wives have; then it is better to express those data in terms of standard deviations. That is, we simply divide the data by their respective standard deviations. Here the standard deviation acts as a unit, or we make the data unit free.

Now, if you want to understand which data set is more volatile, personal income or pairs of shoes, you should use the coefficient of variation. As mentioned earlier, it is a relative measure of dispersion and is expressed as the standard deviation per unit of central value, i.e. per unit of the mean. If you have income in dollar terms and income in rupee terms, and the first data set has a lower coefficient of variation than the second one, use the first data set for analysis: you will find more meaningful information.
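As a minimal sketch, PROC MEANS can report these dispersion measures directly; the CV keyword gives the coefficient of variation (standard deviation as a percentage of the mean). The hypothetical work.income data set from the earlier sketch is assumed here.

proc means data=income n range var std cv;
   var income;
run;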

Measures of Location

Using measures of location, we can get a bird's-eye view of the data. The measures of central tendency also come under the measures of location, and the minimum and maximum are measures of location as well. Other measures are percentiles, deciles, and quartiles. For example, if the 90th percentile is the number 86, it is implied that 90% of the students have got marks below 86. The 90th percentile is also the 9th decile.

For quartiles, we are basically dividing the total data into four equal parts, so we are looking for three points: Q1, Q2, and Q3. The other name for Q2 is the median. So we have 25% of the data below Q1, 25% between Q1 and Q2, similarly 25% between Q2 and Q3 and, finally, the remaining 25% above Q3.
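These location measures can be requested from PROC MEANS with statistic keywords. A minimal sketch, assuming the course data set day1.candy_sales_summary used later in this chapter is available:

proc means data=day1.candy_sales_summary min p10 q1 median q3 p90 max;
   var sale_amount;
run;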

Statistics Related to the Shape of the Distribution

As we look at the shape of the histogram of numeric data, we gain various insights into the distribution of the data. There are two statistics related to the shape of the distribution: skewness and kurtosis. If the distribution has a longer left tail, the data are negatively skewed; the opposite holds for positively skewed data. So we are basically detecting whether the data are symmetric about the central value of the distribution. In options markets, the difference in implied volatility at different strike prices represents the market's view of skew, and is called volatility skew. (In the pure Black-Scholes model, implied volatility is constant with respect to strike and time to maturity.) Skewness causes skewness risk in statistical models that are built out of variables which are assumed to be symmetrically distributed.

Kurtosis, on the other hand, measures the peakedness of the distribution as well as the heaviness of the tails. Generally, heavy-tailed distributions do not have a finite variance; in other words, we cannot calculate the variance for these distributions. Now, if we assume that the distribution is not heavy tailed and build the model on this assumption, it can lead to kurtosis risk in the model. For instance, Long-Term Capital Management, a hedge fund cofounded by Myron Scholes, ignored kurtosis risk to its detriment. After four successful years, this hedge fund had to be bailed out by major investment banks in the late 90s because it understated the kurtosis of many financial securities underlying the fund's own trading positions. There can be several situations, as shown in the chart: the value of kurtosis for a mesokurtic distribution is zero, for a platykurtic distribution it is negative, and for a leptokurtic distribution it is positive. Kurtosis is sometimes referred to as the volatility of volatility, or the risk within risk.
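Both statistics are available from PROC MEANS. A minimal sketch on the same course data set (the kurtosis SAS reports is centered so that a normal distribution gives a value near zero):

proc means data=day1.candy_sales_summary n mean std skewness kurtosis;
   var sale_amount;
run;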


Box Plot for Detecting Outliers

An outlier is a score very different from the rest of the data. When we analyze data we have to be aware of such values because they bias the model we fit to the data. A good example of this bias can be seen by looking at a simple statistical model such as the mean. Suppose a film gets ratings from 1 to 5. Seven people saw the film and rated it 2, 5, 4, 5, 5, 5, and 5. All but one of these ratings are fairly similar (mainly 5 and 4), but the first rating, a 2, was quite different from the rest. This is an example of an outlier.

Boxplots tell us something about the distribution of scores. A boxplot shows the lowest score (the bottom horizontal line) and the highest score (the top horizontal line). The distance between the lowest horizontal line and the lower edge of the tinted box is the range within which the lowest 25% of scores fall (the bottom quartile). The box (the tinted area) shows the middle 50% of scores (known as the interquartile range); i.e. 50% of the scores are bigger than the lowest part of the tinted area but smaller than the top part of the tinted area. The distance between the top edge of the tinted box and the top horizontal line shows the range within which the top 25% of scores fall (the top quartile). In the middle of the tinted box is a slightly thicker horizontal line that represents the value of the median. Like histograms, boxplots also tell us whether the distribution is symmetrical or skewed: for a symmetrical distribution, the whiskers on either side of the box are of equal length. Finally, you will notice some small circles above each boxplot. These are the cases that are deemed to be outliers. Each circle has a number next to it that tells us in which row of the data editor to find that case.

Correcting Problems in the Data

Generally, we find problems related to the distribution or to outliers while exploring the data. Suppose you detect outliers in the data. There are several options for reducing the impact of these values. However, before you do any of these things, it is worth checking whether the data you have entered are correct. If the data are correct, then the three main options you have are:

Remove the case: This entails deleting the data from the person who contributed the outlier. However, this should be done only if you have good reason to believe that this case is not from the population that you intended to sample. For example, if you were investigating factors that affect how much babies cry and one baby did not cry at all, this would likely be an outlier. Upon inspection, if you discovered that this baby was actually a 10-year-old boy, then you would have grounds to exclude this case because it comes from a different population.

Transform the data: If you have a non-normal distribution then this should be done anyway (and skewed distributions will, by their nature, generally have outliers because it is these outliers that skew the distribution). Such a transformation should reduce the impact of the outliers. For the transformation we use the compute variable facility; a small SAS sketch is given after this list.

   Log transformation (log Xi): Taking the logarithm of a set of numbers squashes the right tail of the distribution. However, you cannot take the log of zero or of negative numbers, so if your data tend to zero or produce negative numbers you need to add a constant to all the data before you do the transformation.

   Square root transformation (√Xi): Taking the square root of large values has more of an effect than taking the square root of small values. Consequently, taking the square root of each of your scores will bring large scores closer to the center, so this can be a very useful way to reduce positively skewed data. But we still have the problem with negative numbers.

   Reciprocal transformation (1/Xi): Dividing 1 by each of the scores reduces the impact of large scores. The transformed variable will have a lower limit of zero. One thing to bear in mind with this transformation is that it reverses the scores, in the sense that scores that were originally large in the data set become small after the transformation, while scores that were originally small become big after the transformation.

Change the score: If transformation fails, then you can consider replacing the score. On the face of it this may seem like cheating (you are changing the data from what was actually collected); however, if the score you are changing is very unrepresentative and biases your statistical model anyway, then changing the score is helpful. There are several options for how to change the score. The first is the next highest value plus one. We can also replace outliers with the mean plus three times the standard deviation derived from the rest of the data; a variation of this method is to use two instead of three times the standard deviation.
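The sketch below applies the three transformations in a data step; it assumes the day1.candy_sales_summary data set and adds 1 before the log and reciprocal purely as an illustrative constant to guard against zeros (an assumption, not a value from the text).

data transformed;
   set day1.candy_sales_summary;
   log_sale   = log(sale_amount + 1);   /* log transformation */
   sqrt_sale  = sqrt(sale_amount);      /* square root transformation */
   recip_sale = 1 / (sale_amount + 1);  /* reciprocal transformation */
run;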

SAS IMPLEMENTATION

BAR GRAPH

proc gchart data=day1.candy_sales_summary;
   vbar subcategory;
run;

gchart is the procedure used to generate a bar chart. The data set used here is candy_sales_summary. The bar chart is generated using the keyword vbar. This presentation is used to represent the qualitative variable subcategory: the code generates a bar graph showing the frequency of occurrence of the different subcategories.

proc gchart data=day1.candy_sales_summary;
   vbar3d subcategory;
run;

This code generates a 3D bar graph for 'subcategory', a better way of representing qualitative data; 'vbar3d' is the keyword for generating a three-dimensional bar graph.

proc gchart data=day1.candy_sales_summary;
   hbar3d subcategory;
run;

This code generates a horizontal 3D bar graph for the variable 'subcategory'; 'hbar3d' is the keyword for generating the horizontal 3D bar graph. This form of representing the data is useful for spatial (geographical) data.

proc gchart data=day1.candy_sales_summary;
   vbar3d subcategory / sum sumvar=sale_amount;
run;

This code generates a 3D vertical bar graph for the variable 'subcategory', but the height of each bar is now the total sale amount for that subcategory, and the total is printed on top of each bar.

proc gchart data=day1.candy_sales_summary;
   vbar3d subcategory / sumvar=sale_amount;
run;

This code produces the same chart as the code above but does not display the sum at the top of each bar; the 'sum' keyword is responsible for that display.


proc gchart data=day1.candy_sales_summary;
   vbar3d subcategory / sum sumvar=sale_amount
                        group=fiscal_year subgroup=fiscal_quarter;
run;

This code generates a subdivided multiple bar diagram. The group= option generates a separate set of bars for each fiscal year, showing the sales for each subcategory within a given fiscal year, while subgroup= splits each bar by fiscal quarter.

goptions vsize=6in hsize=20in;

Graphs are drawn according to the dimensions specified with the 'goptions' keyword. This is a global statement which holds throughout the rest of the session: every graph constructed by the software thereafter will have these dimensions.

proc gchart data=day1.candy_sales_summary;
   vbar3d subcategory / sum sumvar=sale_amount
                        group=fiscal_year subgroup=fiscal_quarter;
run;

The multiple bar diagram generated by the earlier code appears cramped on screen. To make it look better, we space the bars out by setting the vertical and horizontal dimensions with goptions and rerunning the procedure. Since goptions is a global statement, any graphical output from here onwards takes these dimensions as given.

PIE-CHART

proc gchart data=day1.candy_sales_summary;
   pie3d subcategory;
run;

This code generates a three-dimensional pie chart using the keyword 'pie3d'; 'gchart' is the procedure that generates the chart. The pie chart represents each 'subcategory' as a slice of the pie, i.e. as a percentage of 360 degrees.

proc gchart data=day1.candy_sales_summary;
   pie3d subcategory / discrete value=inside;
run;

This is a variation of the previous pie chart. It generates a pie chart in which the frequency value of the respective 'subcategory' is placed inside its slice: 'value=inside' keeps the frequency values in the slices along with the names of the subcategories, and each subcategory is shown as a slice of a different color.

proc gchart data=day1.candy_sales_summary;
   pie3d subcategory / discrete value=inside percent=inside slice=outside;
run;

This code generates the pie chart such that the frequency value and the percentage frequency of each subcategory appear inside the slice, while the name of the subcategory appears outside the slice.

proc gchart data=day1.candy_sales_summary;
   pie3d subcategory / discrete value=inside percent=inside slice=outside
                       freq=sale_amount;
run;

This version weights the pie chart by sale amount: freq=sale_amount makes each slice represent the total sales for that subcategory rather than the simple count of observations. The value and percentage are shown inside the slice and the name of the subcategory outside the slice.

HISTOGRAM

proc univariate data=day1.candy_sales_summary;
   var sale_amount;
   histogram sale_amount;
run;

This is a representation of quantitative data. The 'univariate' procedure generates all the key descriptive statistics related to a particular variable; here the variable under consideration is 'sale_amount'. The statement that generates the histogram is 'histogram'. If no option is mentioned, the chart is, by default, a two-dimensional diagram.

proc univariate data=day1.candy_sales_summary noprint;
   var sale_amount;
   histogram sale_amount;
   class subcategory;
run;

The 'univariate' procedure again generates the descriptive statistics associated with the variable 'sale_amount' in the candy_sales_summary data set (suppressed in the listing here by the noprint option), and the 'histogram' statement constructs a histogram for the same variable. Because of the 'class' statement, a separate histogram of sale_amount is produced for each subcategory.

SCATTER PLOT

proc gplot data=day1.candy_sales_summary;
   plot sale_amount*units;
run;

'gplot' is the procedure used to plot two quantitative variables. The scatter plot for the variables sale_amount and units is generated using the keyword 'plot'. The variable on the left-hand side of the * goes on the y-axis and the variable on the right-hand side goes on the x-axis.

NORMALITY CHECK

proc univariate data=day1.class;
   var height;
run;

The 'univariate' procedure generates all the descriptive statistics associated with the variable 'height' in the data set 'class'. The descriptive statistics associated with a distribution help in identifying whether the distribution is normal; normality implies an element of symmetry in the distribution. In this data set the mean, median and mode are all approximately 62, the standard deviation is quite low (about 5) compared to the mean, and the skewness and kurtosis lie in the neighborhood of zero. A basic analysis therefore suggests that the variable 'height' is normally distributed in the data set 'class'.

proc univariate data=day1.class normal plot;
   var height;
   qqplot height / normal (mu=est sigma=est color=green);
run;

The qqplot (quantile-quantile plot) is an alternative technique for examining whether a variable is normally distributed. 'normal plot' requests a normal plot of the variable, and the 'qqplot' statement generates a plot which compares a hypothetical normal line (with estimated mean and standard deviation) against the actual points of the distribution. If the actual points lie along the green-colored normal line, then normality of the variable holds.

proc univariate data=day1.candy_sales_summary normal plot;
   var sale_amount;
   qqplot sale_amount / normal (mu=est sigma=est color=green);
run;


This is the same code executed for a different data set, candy_sales_summary. The mean of the variable sale_amount (4951.97) is significantly different from its median (4040.525) and mode (0.00). Also, the average fluctuation in the data, represented by the standard deviation, is very high (about 3986). This means that the mean is not a 'good' representative value for the data set, as there is very high fluctuation in the data. It is easy to conclude that the variable sale_amount is not normally distributed.

BOXPLOT AND THE EXISTENCE OF OUTLIERS

The quality of the measures of central tendency and dispersion is affected adversely by the presence of outliers. The box plot is widely used to examine the existence of outliers in a data set. Our reference data set is a hypothetical one consisting of students' marks and the name of the subject. Two important facts must be kept in mind for a box plot:

1. The number of observations in the data set must be at least five.
2. If there is more than one category, the data set must be sorted according to the category.

proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Tanmoy\Book1.csv"
   out=day1.boxplot dbms=csv replace;
run;

A data set containing the marks of 5 students in the subjects English and Maths exists in csv format. The file is imported into the SAS library using 'proc import'. The logic of this code is to read the file in its existing format, convert it to SAS format, and replace any previously imported file that has the same name.

proc boxplot data=day1.boxplot;
   plot marks*subject / boxstyle=schematic;
run;

'boxplot' is the procedure for generating a box plot. The plot is drawn for the marks obtained by the students against the subject. Outliers in the data set appear as points outside the box. 'boxstyle' is a keyword to select a particular format of the box plot.


Chapter 2

INTRODUCTION TO PROBABILITY THEORY

Future events are far from certain in the business world. Most managers who use probabilities are concerned with two conditions:

The case when one event or another will occur
The situation where two or more events will both occur

We are interested in the first case when we ask, "What is the probability that today's demand will exceed our inventory?" To illustrate the second situation, we could ask, "What is the probability that today's demand will exceed our inventory and that more than 10% of our sales force will not report for work?" Probability is used throughout business to evaluate financial and decision-making risks. Every decision made by management carries some chance of failure, so probability analysis is conducted both formally ("math") and informally ("I hope").

Consider, for example, a company considering entering a new business line. If the company needs to generate $500,000 in revenue in order to break even, and its probability distribution tells it that there is a 10 percent chance that revenues will be less than $500,000, the company knows roughly what level of risk it is facing if it decides to pursue that new business line.

Three Approaches Towards Probability

Classical Approach:

Probability of an event = (Number of outcomes where the event occurs) / (Total number of possible outcomes)

Relative Frequency Approach:

Suppose we are tossing a coin. Initially, the ratio of the number of heads to the number of trials remains volatile. As the number of trials increases, the ratio converges to a fixed number (say 0.5), so the probability of getting a head is 0.5. This concept is shown in the following chart.

[Chart: the ratio of heads to the number of trials, plotted for 20 to 260 trials, settling near 0.5]
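The convergence can be illustrated with a small simulation; the following is a minimal sketch (the seed and the 260 tosses are arbitrary choices, not from the text), written in the style of the gplot examples used later in the book.

data coin;
   call streaminit(12345);
   heads = 0;
   do trial = 1 to 260;
      heads + rand('bernoulli', 0.5);   /* 1 for a head, 0 for a tail */
      ratio = heads / trial;            /* running proportion of heads */
      output;
   end;
run;

proc gplot data=coin;
   plot ratio*trial;
run;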


Axiomatic Approach:

A) 0 ≤ P(A) ≤ 1, for every event A
B) ∑ P(A) = 1, where the sum is taken over all the mutually exclusive outcomes of the experiment

Apart from all these, there is the concept of subjective probability, which is based on an individual's past experience and intuition. Most higher-level social and managerial decisions are concerned with specific, unique situations, and decision makers at this level make considerable use of subjective probability.

Concept of Random Variable

Informally, a random variable is the value of a measurement associated with an experiment, e.g. the number of heads in n tosses of a coin. More formally, a random variable over a sample space is a function that maps every sample point (i.e. outcome) to a real number.

The picture shows all the outcomes when two dice are rolled. We can define a random variable X as the sum of the points appearing on the two dice. Then X can assume values from 2 to 12, and each of these numbers represents a set of outcomes; elements of such a set have the same outer color, e.g. for X = 5 we have the outcomes in the yellow boxes.

Based on the events that we have, there can be two types of random variables: discrete random variables and continuous random variables. In the previous example, we are talking about a discrete random variable. On the other hand, Jon Brower Minnoch had a weight of 635 kg; let us say this is the upper limit of human weight, so the weight of a person lies between 0 and 635 kg. Here the random variable weight is continuous.

Probability Mass Function

The probability mass function (pmf) is a function that gives the probability that a discrete random variable is exactly equal to some value. The probability mass function is often the primary means of defining a discrete probability distribution.

Suppose that S is the sample space of all outcomes of a single toss of a fair coin, and X is the random variable defined on S assigning 0 to "tails" and 1 to "heads". Since the coin is fair, the probability mass function is given by:

P(X = 0) = P(X = 1) = 1/2

The probability mass function of a fair die is shown in the chart: all the numbers on the die have an equal chance (1/6) of appearing on top when the die is rolled.


Probability Density Function

The probability density function (pdf), or density, of a continuous random variable is a function that describes the relative likelihood for this random variable to take on a given value. The probability of the random variable falling within a particular region is given by the integral of this variable's density over the region.

If f(x) is the density function, then the probability that X falls between a and b is given by

P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx

If you put this concept into a chart, it is the area under the probability density function curve between a and b.
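In SAS, such areas can be obtained from the cdf function without integrating by hand. A minimal sketch for a normal variable with mean 3 and standard deviation 8 (the same parameters used in the normal-distribution code later in this chapter), taking a = -5 and b = 11 as arbitrary illustrative bounds:

data area;
   prob = cdf('normal', 11, 3, 8) - cdf('normal', -5, 3, 8);   /* P(-5 <= X <= 11) */
run;

proc print data=area;
run;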

Expectation of A Random Variable

Suppose you toss a coin 10 times and get 7 heads. "Hmm, strange," you say. You then ask a friend to try tossing the coin 20 times; she gets 15 heads and 5 tails. So you have, in all, 22 heads and 8 tails out of 30 tosses. What did you expect? Was it something close to 15 heads and 15 tails (half and half)? Now suppose you turn the tossing over to a machine and get 792 heads and 208 tails out of 1000 tosses of the same coin. Then you might be suspicious of the coin, because it didn't live up to what you expected. To obtain the expected value of a discrete random variable, we multiply each value that the random variable can assume by the probability of occurrence of that value and sum these products. Again, remember that an expected value of 108.02 doesn't imply that tomorrow exactly 108.02 patients will visit the clinic.

Number of Patients (1)   Probability (2)   (1) x (2)
100                      0.01              1.00
101                      0.02              2.02
102                      0.03              3.06
103                      0.05              5.15
104                      0.06              6.24
105                      0.07              7.35
106                      0.09              9.54
107                      0.10              10.70
108                      0.12              12.96
109                      0.11              11.99
110                      0.09              9.90
111                      0.08              8.88
112                      0.06              6.72
113                      0.05              5.65
114                      0.04              4.56
115                      0.02              2.30
Expected Number of Patients                108.02
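The same expected value can be reproduced in SAS with a short data step; this is a minimal sketch that keys in the table above and sums the products.

data patients;
   input n prob @@;
   product = n*prob;   /* each value times its probability */
   datalines;
100 0.01 101 0.02 102 0.03 103 0.05 104 0.06 105 0.07 106 0.09 107 0.10
108 0.12 109 0.11 110 0.09 111 0.08 112 0.06 113 0.05 114 0.04 115 0.02
;
run;

proc means data=patients sum;
   var product;        /* the sum is the expected number of patients, 108.02 */
run;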


Probability Distributions

Probability distributions are related to frequency distributions; we can think of a probability distribution as a theoretical frequency distribution, one that describes how outcomes are expected to vary. Because these distributions deal with expectations, they are useful models for making inferences and decisions under conditions of uncertainty. A probability distribution is a listing of the probabilities of all the possible outcomes that could result if the experiment were done.

As the random variable is of two types, probability distributions are also of two types, namely discrete and continuous. The probability distribution for the sum of the points on two dice rolled is as follows:

Sum          2     3     4     5     6     7     8     9     10    11    12
Probability  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Common Probability Distributions

Related to real-valued quantities that grow linearly (e.g. errors, offsets): Normal distribution

Related to positive real-valued quantities that grow exponentially (e.g. prices, incomes, populations): Log-normal distribution, Pareto distribution

Related to real-valued quantities that are assumed to be uniformly distributed over a (possibly unknown) region: Uniform distribution

Related to Bernoulli trials (yes/no events with a given probability): Bernoulli distribution, Binomial distribution

Related to events in a Poisson process (events that occur independently at a given rate): Poisson distribution, Exponential distribution

Binomial Distribution

The binomial distribution describes discrete data resulting from an experiment known as a Bernoulli process. The tossing of a fair coin a fixed number of times is a Bernoulli process, and the outcomes of such tosses can be represented by the binomial probability distribution. The success or failure of interviewees on an aptitude test may also be described by a Bernoulli process. On the other hand, the frequency distribution of the lives of fluorescent lights in a factory would be measured on a continuous scale of hours and would not qualify as a binomial distribution. The probability mass function, the mean and the variance are as follows:

P(X = k) = C(n, k) p^k (1 - p)^(n - k),  k = 0, 1, ..., n
Mean = np
Variance = np(1 - p)

where n is the number of trials, p is the probability of success in a single trial, and C(n, k) is the number of ways of choosing k successes out of n trials.


Characteristics of a Binomial Distribution

There can be only two possible outcomes: heads or tails, yes or no, success or failure.

Each Bernoulli process has its own characteristic probability. Take the situation in which, historically, seven-tenths of all people who applied for a certain type of job passed the job test. We would say that the characteristic probability here is 0.7, but we could describe our testing results as Bernoulli only if we felt certain that the proportion of those passing the test (0.7) remained constant over time.

At the same time, the outcome of one test must not affect the outcome of the other tests.

Poisson Distribution

The Poisson distribution is used to describe a number of processes, including the distribution of telephone calls going through a switchboard system, the demand of patients for service at a health institution, the arrivals of trucks and cars at a tollbooth, and the number of accidents at an intersection. These examples all have a common element: they can be described by a discrete random variable that takes on integer values (0, 1, 2, 3, 4, and so on). The number of patients who arrive at a physician's office in a given interval of time will be 0, 1, 2, 3, 4, 5, or some other whole number. Similarly, if you count the number of cars arriving at a tollbooth on a highway during some 10-minute period, the number will be 0, 1, 2, 3, 4, 5, and so on. The probability mass function, the mean and the variance are as follows:

P(X = k) = e^(-λ) λ^k / k!,  k = 0, 1, 2, ...
Mean = λ
Variance = λ

where λ is the average number of occurrences per interval.

Characteristics of a Poisson Distribution

If we consider the example of the number of cars, then the average number of vehicles that arrive per rush hour can be estimated from past traffic data. If we divide the rush hour into intervals of one second each, we will find the following statements to be true:

The probability that exactly one vehicle will arrive at the single booth per second is a very small number and is constant for every one-second interval.

The probability that two or more vehicles will arrive within a one-second interval is so small that we can assign it a value of zero.

The number of vehicles that arrive in a given one-second interval is independent of the time at which that one-second interval occurs during the rush hour.

The number of arrivals in any one-second interval is not dependent on the number of arrivals in any other one-second interval.

Normal Distribution

The normal distribution has applications in many areas of business administration. For example:

Modern portfolio theory commonly assumes that the returns of a diversified asset portfolio follow a normal distribution.

In operations management, process variations often are normally distributed.

In human resource management, employee performance sometimes is considered to be normally distributed.

The probability density function, mean, and variance are given by

f(x) = (1 / (σ √(2π))) exp( -(x - μ)² / (2σ²) )
Mean = μ
Variance = σ²

Is The Distribution Normal?

The following conditions should be satisfied by a distribution in order for it to be treated as normal:

The mean, median and mode should be almost equal.
The standard deviation should be low.
Skewness and kurtosis should be close to zero.
The median should lie exactly midway between the upper and lower quartiles.

Normal Probability Plot

The normal probability plot is a graphical technique for normality testing: assessing whether or not a data set is approximately normally distributed. Here we are basically comparing the observed cumulative probability with the theoretical cumulative probability. If the observed data really are from the normal distribution, then we should get a straight line, as shown in the chart.


Q - Q Plot

The points in this graph are obtained by inverting the cumulative distribution function. Here we are comparing the quantiles of the observed distribution to those of the theoretical distribution at the same probability levels. Again, if the data come from the theoretical distribution, the plot will be a straight line.

Standard Normal Distribution

It is a normal distribution with mean 0 and standard deviation 1. For a normal distribution, 68.2% of the data lies within the (mean - standard deviation, mean + standard deviation) range.
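This 68.2% figure can be checked with the cdf function; a minimal sketch for the standard normal (mean 0 and standard deviation 1 are the defaults when they are omitted):

data coverage;
   p_1sd = cdf('normal', 1) - cdf('normal', -1);   /* about 0.6827 */
run;

proc print data=coverage;
run;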

SAS IMPLEMENTATION

BINOMIAL DISTRIBUTION

data binom;
   binom_prob = pdf('binomial', 50, 0.6, 100);
run;

This code computes the probability of getting fifty successes in 100 trials of a binomial experiment in which the probability of success in a single trial is 0.6. The code creates a new data set named 'binom' in the work library; the data set could also be created in a permanent library by assigning a library name. binom_prob is the variable that stores the probability associated with the above-mentioned outcome. 'pdf' stands for probability density function; it returns the probability associated with a given outcome, given the parameters of the distribution. 'pdf' is the general function for calculating the probabilities associated with the various points of a distribution (discrete or continuous), since SAS does not have a separate 'pmf' (probability mass function) function.

data binom_plot;
   do x = 0 to 20;
      binom_prob = pdf('binomial', x, 0.5, 20);
      output;
   end;
run;

This code generates a schedule of the probabilities associated with the various numbers of successes in the binomial distribution. A loop is created to generate the schedule: the number of successes is the loop variable, while the parameters of the distribution, namely the number of trials (20) and the probability of success (0.5), are fixed. The loop is terminated using the 'end' keyword, and the keyword 'output' writes out an observation at each iteration.

proc gplot data=binom_plot;
   plot binom_prob*x;
run;

This command plots the binomial probability distribution with the probabilities on the vertical axis and the number of successes on the horizontal axis.

data binom_plot;
   do x = 0 to 20;
      binom_prob = pdf('binomial', x, 0.3, 20);
      output;
   end;
run;

This is the command to generate the binomial probability distribution for 0 to 20 successes with a much lower probability of success. Examining the nature of the distribution for changing values of the probability of success gives us a fair idea of the skewness of the distribution. If the probability of obtaining a success in a particular trial is low, then the chance of getting a very high number of successes is low and the chance of getting a low number of successes is high: given this specification of the parameters, the distribution is positively skewed.

proc gplot data=binom_plot;
   plot binom_prob*x;
run;

This command plots the binomial probability distribution for the newly specified parameters, with the probabilities on the vertical axis and the number of successes on the horizontal axis. The graphical representation displays the changed skewness of the distribution very distinctly.

POISSON DISTRIBUTION

data day1.poisson;
   pois_prob = pdf('poisson', 12, 10);
run;

This is a data step in which a data set named 'poisson' is created in the permanent library day1. For the Poisson distribution the syntax of the 'pdf' function is:

new variable = pdf('poisson', number of events, mean of the distribution)

This code calculates the probability of observing 12 events in a Poisson experiment whose mean (the Poisson parameter) is 10.

data day1.pois_plot;
   do x = 0 to 25;
      pois_prob = pdf('poisson', x, 10);
      output;
   end;
run;

This code generates a schedule of Poisson probabilities for 0 to 25 events with mean 10; the 'output' keyword writes out an observation at each iteration.

proc gplot data=day1.pois_plot;
   plot pois_prob*x;
run;

This command plots the Poisson probability distribution with the probabilities on the vertical axis and the number of events on the horizontal axis. The following set of codes is used for analyzing the skewness associated with the Poisson distribution:

data day1.pois_plot;
   do x = 0 to 25;
      pois_prob = pdf('poisson', x, 10.5);
      output;
   end;
run;

proc gplot data=day1.pois_plot;
   plot pois_prob*x;
run;

This last pair of codes can be used to analyze the nature of the skewness of the Poisson distribution: the skewness can be studied by changing the parameter of the distribution.

NORMAL DISTRIBUTION

data day1.normal;
   do x = -12 to 18 by 0.05;
      normal_prob = pdf('normal', x, 3, 8);
      output;
   end;
run;

This is a data step which creates a new data set 'normal' in the user-defined library day1 and generates the normal probability density. The values of the respective probability densities are stored in the variable 'normal_prob'. For the normal distribution the syntax of the function is:

name of the variable = pdf('normal', value of x, mean, standard deviation)

The mean and standard deviation must be specified for a proper characterization of the normal distribution. The schedule of densities corresponding to the different values of x is generated using the 'do' loop. Since the normal distribution is continuous, x should take closely spaced values; by default the loop would increase x in steps of 1, which would make it look discrete, so we increase x in steps of 0.05 instead. The result of each iteration is written out using the 'output' keyword.

proc gplot data=day1.normal;
   plot normal_prob*x;
run;

played using the ‗output‘ keyword. proc gplot data=day1.normal; run;

This command directly plots the normal probability distribution with probabilities on the

vertical axis and the number of trials on the horizontal axis. The graph obtained from

the data ‗normal‘ is symmetric in nature.


proc univariate data=day1.class normal plot;
   var height;
   histogram height / normal (mu=est sigma=est color=green);
run;

The ‗proc univariate‘ is the procedure for listing out all the descriptive statistics associ-

ated with the variable ‗height‘ which is our analysis variable. The keyword ‗histogram‘

is used for generating a histogram over which a normal curve is super-imposed. The

normal curve here is a green coloured curve, specified by the estimated mean and

the estimated standard deviation. Super imposition of the normal curve over the histo-

gram gives us an idea whether the variable is normally distributed. If the ‗normal

curve‘ fits on nicely to the histogram then we say that the variable is ‗normally distribut-

ed‘. The variable ‗height‘ in the data set ‗class‘ has a normal plot. The normality of the

variable can be clearly observed in the resulting histogram.

proc univariate data=day1.candy_sales_summary normal plot;
   var sale_amount;
   histogram sale_amount / normal (mu=est sigma=est color=green);
run;

This is the same code as above which has been used on a separate variable on a dif-

ferent data set. The variable ‗sale_amount‘ is not normally distributed and the normal

curve does not fit symmetrically on the histogram.


Chapter 3

SAMPLING THEORY AND ESTIMATION

Sampling is concerned with the selection of a subset of individuals from within a

population to estimate characteristics of the whole population. Researchers

rarely survey the entire population because the cost of a census is too high. The

three main advantages of sampling are that the cost is lower, data collection is

faster, and since the data set is smaller it is possible to ensure homogeneity and to im-

prove the accuracy and quality of the data.

Concept of Population

A population is the entire group of individuals or objects about which we want to draw conclusions; a sample is the subset of that population which is actually observed. The characteristics of the population are then estimated from this sample.

Techniques of Sampling

There are two broader techniques of sampling:

Probability Sampling or Random Sampling and Non-

probability sampling, among which only Random

Sampling can be used for statistical investigation.

Probability Sampling or Random Sampling

Probability sampling, or random sampling, is a sampling technique in which the proba-

bility of getting any particular sample may be calculated. Examples of random sam-

pling include:
Simple Random Sampling

Without Replacement: One deliberately avoids choosing any member of the pop-

ulation more than once.

With Replacement: One member can be chosen more than once.
Systematic Sampling

Systematic sampling relies on arranging the target population according to some

ordering scheme and then selecting elements at regular intervals through that or-


dered list. Suppose you are taking data from every 10th person entering into a
mall.
Stratified Sampling

Where the population embraces a number of distinct categories or "strata‖, each

stratum is then sampled as an independent sub-population, out of which individual

elements can be randomly selected.

For example, suppose an organization has the following staff:

male, full-time: 90

male, part-time: 18

female, full-time: 9

female, part-time: 63

Total: 180

and we are asked to take a sample of 40 staff, stratified according to the above cate-

gories.

The first step is to find the total number of staff (180) and calculate the percentage in

each group.

% male, full-time = 90 / 180 = 50%

% male, part-time = 18 / 180 = 10%

% female, full-time = 9 / 180 = 5%

% female, part-time = 63 / 180 = 35%

This tells us that of our sample of 40,

50% should be male, full-time.

10% should be male, part-time.

5% should be female, full-time.

35% should be female, part-time.

50% of 40 is 20.

10% of 40 is 4.

5% of 40 is 2.

35% of 40 is 14.

Another easy way without having to calculate the percentage is to multiply each

group size by the sample size and divide by the total population size (size of entire

staff):

male, full-time = 90 x (40 / 180) = 20

male, part-time = 18 x (40 / 180) = 4

female, full-time = 9 x (40 / 180) = 2

female, part-time = 63 x (40 / 180) = 14
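As a minimal SAS sketch of this allocation (the data set name staff, the output name staff_sample and the stratum variable status are hypothetical stand-ins for the staff list above), proc surveyselect can allocate a total sample of 40 proportionally across the strata:

proc sort data=staff;              /* hypothetical staff data set                       */
   by status;                      /* surveyselect needs the data sorted by the stratum */
run;

proc surveyselect data=staff out=staff_sample method=srs n=40;
   strata status / alloc=prop;     /* proportional allocation: roughly 20, 4, 2 and 14  */
run;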

Non-Probability Sampling

In non – probability sampling, we cannot assign any probability to the selected sam-

ple. Nonprobability sampling techniques cannot be used to infer from the sample to

the general population.

Examples of nonprobability sampling include:

Convenience, Haphazard or Accidental sampling - members of the population are


chosen based on their relative ease of access. To sample friends, co-workers, or

shoppers at a single mall, are all examples of convenience sampling.

Judgmental sampling or Purposive sampling - The researcher chooses the sample

based on who they think would be appropriate for the study. This is used primarily

when there is a limited number of people that have expertise in the area being re-

searched.

Sampling Bias

In statistics, sampling bias is when a sample is collect-

ed in such a way that some members of the intend-

ed population are less likely to be included than oth-

ers. It results in a biased sample, a non-random sam-

ple of a population (or non-human factors) in which

all individuals, or instances, were not equally likely to

have been selected. If this is not accounted for, re-

sults can be erroneously attributed to the phenome-

non under study rather than to the method of sam-

pling.

Sampling Distribution

The sampling distribution of a statistic is the distribution of

that statistic, considered as a random variable, when de-

rived from a random sample of size n. It may be considered

as the distribution of the statistic for all possible samples

from the same population of a given size.

Population Parameters and The Estimation Theory

A statistical parameter is a parameter that indexes a family

of probability distributions. It can be regarded as a numeri-

cal characteristic of a population or a model. For example, the family of normal distri-

butions has two parameters, the mean μ and the variance σ^2: if these are specified,

the distribution is known exactly. The family of Poisson distributions, on the other hand,

has only one parameter, the mean λ.

In statistics, our purpose is to learn about the population by studying the samples. Esti-

mation refers to the process by which one makes inferences about a population,

based on information obtained from a sample. Statisticians use sample statistics to esti-

mate population parameters. For example, sample means are used to estimate popu-

lation means. So, the sample mean is an estimator here and the value of that mean is the
estimate. Thus, as a parameter is to the population, a statistic is to a sample.

Types of Estimator

There are two types of estimator: Point Estimator and

Interval Estimator.

The point estimators yield single-valued results, whereas

an interval estimator results in a range of plausible val-

ues.

Properties of Estimator

Unbiased: An estimator is an unbiased estimator of a population parameter if and only if the expectation
of the estimator is equal to that parameter.

Consistency: An estimator is called consistent if increasing the sample size increases

the probability of the estimator being close to the population parameter.

Efficiency: Among unbiased estimators, there often exists one with the lowest vari-

ance, called the minimum variance unbiased estimator (MVUE) or an efficient esti-

mator.

Sufficiency: An estimator is called sufficient if no other statistic which can be calcu-

lated from the same sample provides any additional information as to the value of

the parameter.

Testing of Statistical Hypothesis

Statistical hypotheses are statements about real relationships; and like all hypotheses,

statistical hypotheses may match the reality, or they may fail to do so. Statistical hy-

potheses have the special characteristic that one ordinarily attempts to test them (i.e.,

to reach a decision about whether or not one believes the statement is correct, in the

sense of corresponding to the reality) by observing facts relevant to the hypothesis in a

sample. This procedure, of course, introduces the difficulty that the sample may or

may not represent well the population from which it was drawn.

Types of Hypotheses

Null Hypothesis (H0): Hypothesis testing works by collecting data and measuring how

likely the particular set of data is, assuming the null hypothesis is true. If the data-set is

very unlikely, defined as being part of a class of sets of data that only rarely will be ob-

served, the experimenter rejects the null hypothesis concluding it (probably) is false.

The null hypothesis can never be proven; the only thing we can do is to reject it or not re-

ject it.

Alternative Hypothesis (H1 or HA): The alternative hypothesis (or maintained hypothesis


or research hypothesis) and the null hypothesis are the two rival hypotheses which are

compared by a statistical hypothesis test. An example might be where water quality in

a stream has been observed over many years and a test is made of the null hypothesis

that there is no change in quality between the first and second halves of the data

against the alternative hypothesis that the quality is poorer in the second half of the

record.

Examples of Statistical Hypotheses

The mean age of all Calcutta University students is 23.4 years.

The proportion of Calcutta University students who are women is 50 percent.

The heights of all the male students of Calcutta University are normally distributed.

Types of Errors in Testing of Hypothesis

There are two types of error as follows:

Type I Error: A type I error, also known as an error of the first kind, occurs when the null

hypothesis (H0) is true, but is rejected. It is asserting

something that is absent, a false hit. In terms of

folk tales, an investigator may be "crying wolf"

without a wolf in sight (raising a false alarm) (H0:

no wolf).

Type II Error: A type II error, also known as an error

of the second kind, occurs when the null hypoth-

esis is false, but it is erroneously accepted as true.

It is missing to see what is present, a miss. A type II

error may be compared with a so-called false

negative (where an actual 'hit' was disregarded

by the test and seen as a 'miss') in a test checking

for a single condition with a definitive result of true or false. A Type II error is committed

when we fail to believe a truth.

Consequences of Type I and Type II Errors

Both types of errors are problems for individuals, corporations, and data analysis.

Based on the real-life consequences of an er-

ror, one type may be more serious than the

other. For example, NASA engineers would

prefer to throw out an electronic circuit that is

really fine (null hypothesis H0: not broken; reali-

ty: not broken; action: thrown out; error: type I,

false positive) than to use one on a space-

craft that is actually broken (null hypothesis H0:

not broken; reality: broken; action: use it; error: type II, false negative). In that situation

a type I error raises the budget, but a type II error would risk the entire mission.


Level of Significance

Statistical significance is a statistical assessment of whether observations reflect a pat-

tern rather than just chance, the fundamental challenge being that any partial picture

is subject to observational error. In statistical testing, a result is deemed statistically sig-

nificant if it is unlikely to have occurred by chance, and hence provides enough evi-

dence to reject the hypothesis of 'no effect'. As used in statistics, significant does not

mean important or meaningful, as it does in everyday speech.

The significance level is usually denoted by the Greek symbol α. Popular levels of signif-

icance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of sig-

nificance gives a p-value lower than the significance level α, the null hypothesis is re-

jected.

Confidence Interval

In statistics, a confidence interval (CI) is a kind of interval estimate of a population pa-

rameter and is used to indi-

cate the reliability of an es-

timate. Confidence inter-

vals consist of a range of

values (interval) that act as

good estimates of the un-

known population parame-

ter. However, in rare cases,

none of these values may

cover the value of the pa-

rameter. The level of confi-

dence of the confidence

interval would indicate the

probability that the confi-

dence range captures this true population parameter given a distribution of sam-

ples. If a corresponding hypothesis test is performed, the confidence level corre-

sponds with the level of significance, i.e. a 95% confidence interval reflects a signifi-

cance level of 0.05, and the confidence interval contains the parameter values that,

when tested, should not be rejected with the same sample.


SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT

proc surveyselect data=day1.employee_satisfaction out=day1.emp1 method=srs n=50; run;

Surveyselect is the procedure for executing a sampling procedure. The data set that

we consider here is ‗employee_satisfaction‘. The method of sampling specified here is

‗simple random sampling without replacement‘ (SRS). We have pre-specified the sam-

ple size to be 50. This is a proc step which generates a report. Some important con-

cepts generated in the report are:

Random Number Seed: An integer used to set the starting point for generating a se-

ries of random numbers. The seed sets the generator to a random starting point. A

unique seed returns a unique random number sequence. Given the seed a series

of random numbers is generated. If no random number seed is specified, then the

numerical value of the system time is used for generating the subsequent random

numbers.

Selection Probability: This shows the probability of selecting a sample of ‗n‘ obser-

vations from a total of 'N' observations (N > n). Each of the observations is equally
likely to be drawn from the population, and a sample observation, once drawn,
is not returned to the population.

Sampling Weight: A sampling weight is a statistical correction factor that compen-

sates for a sample design that tends to over- or under-represent various segments

within a population. In some samples, small subsets of the population, such as reli-

gious, ethnic, or racial minorities, may be oversampled in order to have enough

cases to analyze. When these subsamples are combined with the larger sample,

their disproportionately large numbers must be diluted by a sampling weight. This is

just the reciprocal of the selection probability of a sample.

SIMPLE RANDOM SAMPLING WITH REPLACEMENT

proc surveyselect data=day1.employee_satisfaction out=day1.emp2 method=urs n=50; run;

This code describes an alternate technique of sampling. The method ‗urs‘ or unrestrict-

ed sampling refers to the type of random sampling where the sample points are re-

turned to the population once the observations are recorded. This process of sampling

is also called the simple random sampling with replacement. In the final data set that

we get, there might not be 50 unique observations, since repetition may occur in the

selection of the sample observations. In this form of sampling, the report generated

contains, in addition to the concepts introduced in srs, another concept called the ex-

pected number of hits.


The concept of the expected number of hits in sampling with replacement is analogous to the concept of selec-
tion probability in simple random sampling without replacement. This measure repre-
sents the average number of times a particular observation is selected in the process
of random sampling with replacement. The sampling weight, in this context, is the
reciprocal of the expected number of hits made in the procedure.

STRATIFIED RANDOM SAMPLING

STEP 1: SORTING THE DATA SET ACCORDING TO THE SUB_CATEGORY

proc sort data=day1.candy_sales_summary out=day1.candy_sort;
   by subcategory;
run;

The command sorts the data set according to the variable ‗subcategory‘. The sorting

of the data set is important because it divides the data set according to the available

strata. The variable ‗subcategory‘ act as the strata in the given data set.

STEP 2: SAMPLING USING THE STRATIFICATION TECHNIQUE

proc surveyselect data=day1.candy_sort n=(5 7 15 10 12 8) method=seq out=day1.candy_seq;
   strata subcategory;
run;

The method of sampling applied for each stratum is the sequential random sampling

technique. The observations to be chosen from each stratum are specified using ‗n‘.

SYSTEMATIC OR ORDERED SAMPLING

The sample, in this technique, is drawn from the population, based on a particular or-

der. For example, if a department store wants to know about the level of customer
satisfaction, it needs to survey its customers. If, in a day, the store expects a foot-
fall of 1,000 customers and the required sample size is 100, then the store
can question every 10th person walking in through the door.

proc surveyselect data= day1.candy_sort out=day1.candy_seq n=30 method=sys; run;

This command ‗method=sys‘ is used to execute the systematic sampling process. The

sampling interval, i.e. the gap between successive sampled observations, is calculated as K = N/n,
where n = size of the sample and N = size of the population. So, for getting a sample size of

30, every 50th observation should be surveyed.


Chapter 4

IMPORTANT TESTS OF STATISTICAL SIGNIFICANCE (PART I)

Concept of Parametric Data

A parametric test is one that requires data from one of the large catalogue of

distributions that statisticians have described and for data to be parametric

certain assumptions must be true. If you use a parametric test when your da-

ta is not parametric then the results are likely to be inaccurate. Therefore, it is

very important that we check the assumptions before deciding which statistical test is

appropriate.

Assumptions of Parametric Test

Normally Distributed Data: It is assumed that the data are from one or more nor-

mally distributed populations. The rationale behind hypothesis testing relies on nor-

mally distributed populations and so if this assumption is not met then the logic be-

hind hypothesis testing is flawed. Most researchers eyeball their sample data using

a histogram and if the sample data look roughly normal, then the researchers as-

sume that the populations are also.

Homogeneity of Variance: The assumption means that the variance should be the

same throughout the data. In designs in which you test several groups of partici-

pants, this assumption means that each of these samples comes from populations

with the same variance.

Interval Data: Data should be measured at least at the interval level. This means

that the distance between points of your scale should be equal at all parts along

the scale. For example, if you had a 10 point anxiety scale, then the difference in

anxiety represented by a change in score from 2 to 3 should be the same as that

represented by a change in score from 9 to 10.

Independence: This assumption is that data from different participants are inde-

pendent, which means that the behavior of one participant does not influence the

behavior of another.

The assumptions of interval data and independent measurement are tested only by

common sense. The assumption of homogeneity of variance is tested in different ways

for different procedures.

Z Test

A Z-test is any statistical test for which the distribution of the test statistic under the null

hypothesis can be approximated by a normal distribution.


Suppose that in a particular geographic region,

the mean and standard deviation of scores on a

reading test are 100 points, and 12 points, re-

spectively. Our interest is in the scores of 55 stu-

dents in a particular school who received a

mean score of 96. We can ask whether this

mean score is significantly lower than the region-

al mean — that is, are the students in this school

comparable to a simple random sample of 55

students from the region as a whole, or are their

scores surprisingly low?
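As a minimal sketch of this calculation in a SAS data step (the data set name ztest_example is arbitrary; the figures are the ones quoted above), the z statistic and its two-sided p-value can be obtained as follows:

data ztest_example;
   xbar  = 96;   mu0 = 100;               /* sample mean and hypothesised population mean */
   sigma = 12;   n   = 55;                /* known population SD and sample size          */
   z = (xbar - mu0) / (sigma / sqrt(n));  /* z statistic, roughly -2.47                   */
   p_two_sided = 2 * probnorm(-abs(z));   /* two-sided p-value from the standard normal   */
run;

proc print data=ztest_example;
run;

A z value this far below zero corresponds to a small p-value, so the school's mean score of 96 would be judged surprisingly low relative to the regional mean of 100.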

Assumptions

The parent population from which the sample is drawn should be normal

The sample observations are independent, i.e., the given sample is random

The population standard deviation σ is known

T Test

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t dis-

tribution if the null hypothesis is supported. Among the most frequently used t-tests are:

A one-sample location test of whether the mean of a normally distributed popula-

tion has a value specified in a null hypothesis.

A two sample location test of the null hypothesis that the means of two normally

distributed populations are equal.

A test of the null hypothesis that the difference between two responses measured

on the same statistical unit has a mean value of zero.

A test of whether the slope of a regression line differs significantly from zero.

Assumptions

Most t statistics have the form t= Z∕s.

Z follows a standard normal distribution under the null hypothesis or the parent pop-

ulation from which the sample is drawn should be normal

The sample observations are independent, i.e., the given sample is random

The population standard deviation σ is unknown

Two Independent Samples T Test

Suppose you have conducted a survey that studied the commitment to change in
your organization, and you want to find out whether there is any difference in the com-
mitment to change between male and female staff members. Or, for instance, a re-
searcher wants to find out if middle-level employees are more satisfied than top-level
employees. In this case the researcher needs the satisfaction scores for middle and
top management. Here again we can see that one variable (satisfaction) is divided
into two groups (middle and top level). So, in summary, when we need to compare
two groups on one numeric variable,
we would use the two independent samples t-test; here
the two samples are drawn from one single variable.
An assumption for the two independent samples t-test is
that the data is normally distributed.

Forming The Research Hypotheses

Now, for instance, suppose we have conducted a survey that
recorded the salary of each respondent, and we want to
check whether there is any difference in the salary of male
and female employees in the business organization.
Example of research question: Are there any differences in the earnings of male and
female employees?

What you need: One categorical independent variable with only two groups (e.g. sex:

males/ females). One continuous dependent variable (e.g. Salary).

Hypotheses of Two Independent Samples t Test:

H0: The two population means are equal, i.e. there is no difference in earnings

H1: The two population means are not equal, i.e. there is difference in earnings

Paired Sample T Test

A company markets an eight week long weight loss program and claims that at the

end of the program on average a participant will have lost

5 pounds. On the other hand, you have studied the pro-

gram and you believe that their program is scientifically un-

sound and shouldn't work at all. You want to test the hy-

pothesis that the weight loss program does not help people

lose weight. Your plan is to get a random sample of people

and put them on the program. You will measure their

weight at the beginning of the program and then measure

their weight again at the end of the program. Based on

some previous research, you believe that the standard de-

viation of the weight difference over eight weeks will be 5

pounds.

Assumptions

The assumptions underlying the paired samples t-test are similar to the one-sample t-

test but refer to the set of difference scores.

The observations are independent of each other

The dependent variable is measured on an interval scale

The differences are normally distributed in the population

Hypotheses of Paired Sample t Test:

H0: The two population means are equal

H1: The two population means are not equal

In summary, a paired sample t test tries to assess whether an action is effective or

not.


A SINGLE VARIABLE T-TEST

The case study on a single variable t-test pertains to a leading hospital in the city. The

baseline blood pressures for 60 patients belonging to different age groups were rec-

orded. The data set contains three variables namely: the subject (id variable), Age

(numeric variable) and Baseline bp (numeric variable).

The objective of the case study is to check whether there has been a statistically signif-

icant change in the average blood pressure over a span of 45 days. We use the t-test

in this case. However, before using the test we need to test for the assumption of nor-

mality.

STEP 1: CHECK FOR NORMALITY

proc univariate data=day1.bp normal plot;
   var baselinebp;
   qqplot baselinebp / normal (mu=est sigma=est color=pink);
run;

The univariate procedure generates all the vital descriptive statistics associated with

the variable ‗baseline bp‘. The qq-plot of ‗baseline-bp‘ shows that observations of the

variable lie very close to the hypothetical pink-coloured normal line. Therefore, base-

linebp is normally distributed.

STEP 2: TESTING THE SIGNIFICANCE OF THE HYPOTHESIS

proc ttest data=day1.bp h0=96 alpha=0.05;
   var baselinebp;
run;

The procedure ‗ttest‘ is used to run the student‘s t-test. The null-hypothesis (h0) is speci-

fied to be equal to 96. This implies that any differences observed in the readings of the

average blood pressure are caused due to sampling fluctuations in the data set. The

keyword ‗alpha‘ is used to denote the level of significance which shows the probability

of committing a Type I error.

The t-test generates the following tables:

Statistics: This table generates the vital statistics associated with the variable base-

linebp. It displays the sample mean, variance and standard error associated with

the sample.

T-test: This table reports the results associated with the t-test. The most important

component of this table is the p-value which is shown at the end of the table. The p

-value shows a value of 0.2688 which is much higher than the level of significance.

Therefore, an analyst knows that he runs a very high chance of committing a Type-I

error if he rejects the Null-Hypothesis. Thus, it is in the interest of the analyst to ac-

cept the H0. So, in this situation it can be inferred that minor fluctuations observed

in the mean blood pressure are due to sampling fluctuations.


TWO INDEPENDENT SAMPLE T-TEST

The two-independent sample t-test is useful for examining significant differences in the

mean of two data sets. The present case study considers two renowned pizza compa-

nies: ABC and XYZ. The manager of the XYZ company is apprehensive of the falling

sales compared to its competitor ABC. The absolute delivery time for Pizza company

ABC is less than that of XYZ, but delivery time would be considered a crucial factor in explaining the de-
clining sales of XYZ only if the mean delivery time of company ABC is
significantly less than the mean delivery time of XYZ.

STEP 1: IMPORTING THE REQUIRED FILE

The file containing the required information does not initially exist in the SAS data base.

The original file is in a csv format and so, we first import the data set using proc import.
This imports the data set into the SAS library day1 and names it 'twoind_sample'.

proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\twoindsample.csv"
   out=day1.twoind_sample dbms=csv replace;
run;

STEP 2: RUNNING THE T-TEST

proc ttest data=day1.twoind_sample;
   class company;
   var waiting_time__in_minutes_;
run;

The t-test is executed using the procedure ‗ttest‘. Since this is a t-test to check the dif-

ference of mean between two groups, we introduce the ‗class‘ keyword to identify

the two pizza companies. The variable in terms of which the t-test is to be carried out is

the variable ‗waiting_time__in_minutes_‘. Three important tables are generated once

the code is executed:

STATISTICS: This table describes the vital statistics associated with the two pizza com-

panies. This table gives us a clear idea that the delivery time of the pizza company

ABC is distinctly less than the delivery time of the company XYZ. How can we say

so? This can be said so from the confidence intervals within which the sample

means of the two companies lie.

EQUALITY OF VARIANCES: To compare the means of two different sets it is neces-
sary to check the variances of the two sets. The population variances of the
two data sets must be identical in nature. This implies that the mean-difference test

is executed under the assumption that the variance remains constant across the

two data sets. The equality of variances is tested using the Folded F-test. This is de-

fined as: F = max (s12,s2

2)/min(s12,s2

2) where s12 and s2

2 are variances of category 1

and category 2.


The hypothesis tested is:

H0: The population variances are equal

V/s

HA: The population variances are unequal

The decision rule used is the p-value rule whereby the null hypothesis is accepted if

the exact probability of committing the type I error exceeds the benchmark proba-

bility as prescribed by the level of significance. Here, the p-value associated with

the folded F-statistic is 0.38. This is much greater than the level of significance.

Hence, rejecting the null hypothesis here would carry a very high chance of committing a type I error, so
we do not take that risk and we accept the null hypothesis.
Therefore, it is safe to conclude that the population variances of the two pizza
companies are not significantly different from each other.

T-TESTS: This table displays the results of the t-test corresponding to the difference in

the mean delivery time of pizzas. The results are displayed under two sub-headings:

Pooled Variance and Unequal variance. We consider the results corresponding to

the Pooled variance for the t-test analysis. The p-value corresponding to the t-

statistic is 0.0003 which is less than the prescribed level of significance. Therefore, it

is easy to conclude that the difference in the mean delivery time of the pizza com-

panies ABC and XYZ are significantly different from one another.

PAIRED SAMPLE T-TEST

To analyze the impact of e-learning on the students, the Ministry of the Human Re-

source Development of the Government of India performed an exploratory study on

the a sample of 50 students. The students were first taught in the traditional method of

teaching and then through the method of e-learning without the presence of any

teachers. The marks were recorded for the students before the e-learning and after

the e-learning. The marks were then compared to analyze the impact of the e-

learning on the performance of the students.

STEP1: IMPORTING THE DATA FILE

The first step in this part is to import the required datafile using the proc import key-

word. The original file is in the csv format.

proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\pairedsample.csv"
   out=day1.pairedsample dbms=csv replace;
run;

STEP 2: RUNNING THE PAIRED SAMPLE T-TEST

proc ttest data=day1.pairedsample;
   paired before*after;
run;

The keyword ‗paired‘ is used to execute the paired t-test between the marks ‗before‘

and marks ‗after‘. The hypothesis set up is:

H0: The ex-post and ex-ante means are not significantly different

v/s

HA: The ex-post and ex-ante means are significantly different

The results for this t-test are displayed through the following tables:

STATISTICS: The statistic table shows that the mean marks of students have in-

creased after incorporating the e-learning process. The question that arises from

the table above is: Is the rise in the mean marks post the e-learning a significant

rise? To test the significance of the change we use the t-test table.

T-TEST: The t-test table details the significance of the difference of the paired

means. The p-value rule is used for deciding whether the null hypothesis should

be accepted or not. The p-value generated (0.4539) within the model is greater

than the level of significance. This means that the differences in the means are not

statistically significant.

Therefore, the analysis shows that the mean of the performance of the students post

the e-learning process did not change significantly. Hence, e-learning employed by

the ministry of education did not prove to be effective as a strategy.


Chapter 5

UNDERSTANDING THE ASSOCIATION BETWEEN THE VARIABLES

Chi Square Test for Independence of Attributes

Consider the following questions:
Is there any association between income level and brand preference?
Is there any association between family size and size of washing machine

bought?

Are the attributes educational background and type of job chosen independent?

The solutions to the above questions need the help of Chi-Square test of independ-

ence in a contingency table. Please note that the variables involved in Chi-Square

analysis are nominally scaled. Nominal data are also known by two names - categori-

cal data and attribute data.

Contingency Table: Is there any relation between age and investment?

Assumptions

The data should be categorical variables
Total frequency should be reasonably large, say greater than 50

The observations of the sample are independent, i.e., the samples are random

The theoretical frequency of any category or class should not be less than 5

Hypotheses of the test are

H0: There is no association between the variables

H1: There is an association between the variables

Calculation of Chi Square Statistic

                     Investment
Age          Stock    Bond    Cash    Total
25 - 34         30      10       1       41
35 - 44         35      25       2       62
45 - 54         38      35       4       77
55 - 70         22      30       4       56
Total          125     100      11      236
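For reference, the test compares the observed cell counts Oij with the theoretical (expected) counts Eij using the standard Pearson chi-square statistic:

Chi-square = ∑ (Oij – Eij)^2 / Eij, with (r – 1)(c – 1) degrees of freedom,

where r and c are the numbers of rows and columns of the contingency table (here r = 4 and c = 3, giving 6 degrees of freedom).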


Calculation of Theoretical Frequency
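The theoretical (expected) frequency of each cell is obtained from the marginal totals of the contingency table:

Eij = (total of row i × total of column j) / overall total.

For example, for the cell (Age 25 - 34, Stock) in the table above, E = (41 × 125) / 236 ≈ 21.7.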

Remember, Chi square test of independence only checks whether there is any associ-

ation between the attributes, but it does not tell us the nature of the association.

Correlation Analysis

The simplest way to look at whether two variables are associated is to look at whether

they covary. To understand what covariance is, we first need to think back to the con-

cept of variance.

Variance = ∑ (xi – mx)^2 / (N – 1) = ∑ (xi – mx)(xi – mx) / (N – 1)

The mean of the sample is represented by mx, xi is the data point in question and N is

the number of observations. If we are interested in whether two variables are related,

then we are interested in whether

changes in one variable are met with

similar changes in the other variable.

When there are two variables, rather

than squaring each difference, we

can multiply the difference for one var-

iable by the corresponding difference

for the second variable. As with the

variance, if we want an average value

of the combined differences for the

two variables, we must divide by the

number of observations (we actually

divide by N – 1). This averaged sum of

combined differences is known as the

covariance: Cov(x,y) = ∑ (xi - mx) (yi - my)/ (N – 1)

There is, however, one problem with covariance as a measure of the relationship be-

tween variables and that is that it depends upon the scales of measurement used. So,

covariance is not a standardized measure. To overcome the problem of dependence

on the measurement scale, we need to convert the covariance into a standard set of

units. This process is known as standardization.

Therefore, we need a unit of measurement into

which any scale of measurement can be con-

verted. The unit of measurement we use is the

standard deviation.

The standardized covariance is known as a cor-

relation coefficient.

r = cov(x,y) / (sx sy) = ∑ (xi – mx)(yi – my) / [(N – 1) sx sy]

which always lies in between –1 and 1.

Remember, correlation doesn‘t necessarily imply

causation.
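As a small SAS aside (a sketch using the same data set that appears in the SAS implementation later in this chapter), the cov option of proc corr prints the covariance matrix alongside the correlation matrix, which makes the standardization described above easy to see on real data:

proc corr data=day1.correlation cov;   /* cov requests the covariance matrix as well */
   var Education Experience;
run;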


Test of Hypotheses for Correlation

For pairs from an uncorrelated bivariate normal distribution, the sampling distribution of

Pearson's correlation coefficient follows Student's t-distribution with degrees of freedom

n − 2. Specifically, if the underlying variables have a bivariate normal distribution, the

variable t = r √(n – 2) / √(1 – r^2)
has a Student's t-distribution in the null case (zero correlation).

Partial Correlation

A correlation between two variables in which the effects of other variables are held

constant is known as partial correlation. The partial correlation for 1 and 2 with control-

ling variable 3 is given by:

r12.3 = (r12 – r13 r23) / [√(1 – r13^2) √(1 – r23^2)]

For example, we might find the ordinary correlation between blood pressure and

blood cholesterol might be a high, strong positive correlation. We could potentially

find a very small partial correlation between these two variables, after we have taken

into account the age of the subject. If this were the case, this might suggest that both

variables are related to age, and the observed correlation is only due to their com-

mon relationship to age.

Correlations (Pearson correlation coefficients with two-tailed significance)

                              Duration of   Professional   Salary in Dollar   Age of the
                              Education     Experience     per Hour           Person
Duration of Education
  Pearson Correlation            1            -.308*          .115              -.238
  Sig. (2-tailed)                              .017           .381               .067
Professional Experience
  Pearson Correlation           -.308*          1             .121               .985**
  Sig. (2-tailed)                .017                         .358               .000
Salary in Dollar per Hour
  Pearson Correlation            .115          .121            1                 .180
  Sig. (2-tailed)                .381          .358                              .169
Age of the Person
  Pearson Correlation           -.238          .985**         .180                1
  Sig. (2-tailed)                .067          .000           .169

*  Correlation is significant at the 0.05 level (2-tailed).
** Correlation is significant at the 0.01 level (2-tailed).
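As a quick, hedged check of the partial correlation formula above, r12.3 for Education and Experience controlling for Age can be recomputed in a data step from the pairwise correlations reported in the table (the data set name partial_check is arbitrary):

data partial_check;
   r12 = -0.308;   /* Education vs Experience */
   r13 = -0.238;   /* Education vs Age        */
   r23 =  0.985;   /* Experience vs Age       */
   r12_3 = (r12 - r13*r23) / (sqrt(1 - r13**2) * sqrt(1 - r23**2));
   put r12_3=;     /* writes the result to the log */
run;

The value written to the log should agree, up to rounding of the inputs, with the output of the partial statement of proc corr shown in the implementation below.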


CORRELATION

proc corr data=day1.correlation;
   var Education Experience Age Wage_dollars_per_hour_;
run;

proc corr is used to calculate correlation between two or more quantitative variables.

The var option identifies the variables whose correlation coefficients are to be quanti-

fied. The output to this code generates a 4x4 correlation matrix. Each element in this

matrix shows the correlation coefficient between two variables. Associated with each

correlation coefficient is a p-value which shows the statistical significance of the corre-

lation coefficient.

PARTIAL CORRELATION

proc corr data=day1.correlation;
   var Education Experience;
   partial Age;
run;

This code produces the correlation between the two variables Education and Experi-

ence. The option partial is used to adjust the correlation coefficient value between Ed-

ucation and Experience for the impact of the variable ‗Age‘. This adjustment is im-

portant to find out the extent exactly to which Education and Experience are correlat-

ed.

MATRIX PLOT

ods html;
ods graphics on;

proc corr data=day1.correlation noprint plots=matrix;
   var Education Experience Wage_dollars_per_hour_ Age;
run;

ods graphics off;
ods html close;

For a matrix view of the correlations we first set the ods (Output Delivery System) to

html. Then we turn on the graphics mode. In the proc corr we use the options noprint

to suppress the output in the output window. At the same time, we set the type of the

plot to matrix. After running the code, we turn off the graphics mode and reset the

output delivery system.


CHI SQUARE TEST FOR INDEPENDENCE OF ATTRIBUTES

Here we are trying to find out whether there is any association between the Frequen-

cy_of_Readership and Level_of_Educational_Achievement. This test is done under the

procedure freq and we request a chi square test in the tables statement.

proc freq data=day1.chi;
   tables Frequency_of_Readership * Level_of_Educational_Achievement / chisq;
run;


Chapter 6

IMPORTANT TESTS OF STATISTICAL SIGNIFICANCE (PART II)

One Way ANOVA

A manager wants to raise the productivity at his company by increasing the

speed at which his employees can use a particular spreadsheet program. As

he does not have the skills in-house, he employs an external agency which

provides training in this spreadsheet program. They offer 3 packages - a be-

ginner, intermediate and advanced course. He is unsure which course is needed for

the type of work they do at his company so he sends 10 employees on the beginner

course, 10 on the intermediate and 10 on the advanced course. When they all return

from the training he gives them a problem to solve using the spreadsheet program

and times how long it takes them to complete the problem. He wishes to then com-

pare the three courses (beginner, intermediate, advanced) to see if there are any dif-

ferences in the average time it took to complete the problem.
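In SAS, this comparison could be set up along the same lines as the implementation shown later in this chapter; a minimal sketch, assuming a hypothetical data set training with a group variable course and a numeric variable time:

proc anova data=training;     /* training, course and time are hypothetical names */
   class course;              /* beginner, intermediate or advanced               */
   model time = course;       /* do the mean completion times differ by course?   */
run;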

Assumptions

Response variable are normally distributed (or approximately normally distributed)

Samples are independent

Variances of populations are equal

Responses for a given group are independent and identically distributed normal

random variables

The hypotheses for the test are:

H0: The population means are equal

H1: At least one of the population means is different

The name 'One Way ANOVA' implies that the number of independent variables is one.

Here the inter-group variation is basically systematic variation and the intra-group vari-


ation is unsystematic. We are then checking whether the inter-group variation is signifi-
cantly larger than the intra-group variation.
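In symbols, the one-way ANOVA test statistic is the ratio of these two sources of variation:

F = [SS(between) / (k – 1)] / [SS(within) / (N – k)],

where k is the number of groups and N is the total number of observations; a large F indicates that the inter-group (systematic) variation dominates the intra-group (unsystematic) variation.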

Two Way ANOVA

The two-way analysis of variance (ANOVA) test is an extension of the one-way ANOVA

test that examines the influence of different categorical independent variables on one

dependent variable. While the one-way ANOVA measures the significant effect of one

independent variable (IV), the two-way ANOVA is used when there are more than one

IV and multiple observations for each IV. The two-way ANOVA can not only determine

the main effect of contributions of each IV but also identifies if there is a significant in-

teraction effect between the IVs.

Example

A researcher was interested in whether an individual's interest in politics was influenced

by their level of education and their gender. They recruited a random sample of par-

ticipants to their study and asked them about their interest in politics, which they

scored from 0 - 100 with higher scores meaning a greater interest. The researcher then

divided the participants by gender (Male/Female) and then again by level of educa-

tion (School/College/University).

What is Interaction?

When gender and level of education interact, we find 6 different groups, namely,

Male – School, Female – School, Male – College, Female – College, Male – University

and Female – University. Using two way ANOVA, we are trying to understand whether

any of the group is significantly different from the rest. If the interaction levels don‘t

show any significant differences, nor will the main factors for their levels.

Assumptions

As with other parametric tests, we make the following assumptions when using two-

way ANOVA:

The populations from which the samples are obtained must be normally distributed

Sampling is done correctly. Observations for within and between groups must be

independent

The variances among populations must be equal (homogeneity)

Data are interval or nominal

The Hypotheses for the test are:

For each factor and interaction,

H0: Means of all groups are equal

H1: There is one significant difference


ONE-WAY ANOVA

We demonstrate one way anova through a case study. The case that we consider is

that of three production plants: Maruti, Hyundai and Tata. The associated processing

time of cars in each of these plants is mentioned along with them. The objective of the

analyst is to find out whether there exists a significant difference between the mean

processing time of the plant.

proc anova data=day1.anova;
   class plant;
   model processing_time = plant;
run;

‗anova‘ is the procedure used in analysis of variance when the data is balanced.

‗Class‘ is the keyword for specifying the different groups in the problem. In this case,

the class variables are the respective production plants of the companies. The ‗model‘

keyword is used for executing functions which involve an independent and a depend-

ent variable. The left-hand side of the equality is the dependent variable and the right-

hand side represents the independent variable. The code generates the following ta-

bles:

First table shows the statistics associated with the overall goodness of the model.

This table displays the variations across the groups (Mean Model Sum of Squares)

and within the groups (Mean Squares of Errors). The F-statistic is calculated as a ra-

tio of the Explained variation in the model to the unexplained variation. The p-

value rule is employed to check the significance of the F-value. The p-value for the

F-statistic in this study is 0.1447, which is significantly greater than the level of signifi-

cance. Thus it can be concluded that there is no significant difference in the pro-

cessing time of cars in the three plants.

The second table generates all the descriptive statistics corresponding to the varia-

ble mean_processing_time_of_plant.

The mean processing times of plants of the three companies are not significantly differ-

ent from each other. One problem with the one-way anova is that it does not include

any interaction effect between the independent variables. This problem is addressed

by two-way anova.

TWO-WAY ANOVA

A survey studied the weight gained by men as a result of two factors, viz., the
amount of food consumed by the men and the type or nature of diet. Ten repre-
sentative men were randomly selected and each of them was fed each type of
diet in the two specified diet amounts (i.e. "High" and "Low" respectively). The weight
gained by the men was measured in grams. There are three variables with a total of 60
observations.


The numeric variable Weight Gain denotes the weight gained by the men. The two

separate samples of pre- and post-treatment weight are not taken; rather, a single sam-

ple of actual weight gain is considered. The variable Diet Amount denotes the amount

of diet. It is a categorical variable recording two responses; 1 for ‗High‘ and 2 for ‗Low‘

amounts of diet. Also, the variable Diet Type denotes the type of diet consumed which

is also a categorical variable. It records three responses: 1 for Vegetarian diet, 2 for

non-vegetarian diet and 3 for a mixed diet.

The objective of the study is to locate the factors which most significantly affect the

weight gain in individuals. The code for two-way anova is:

proc glm data=day1.twowayanova;
   class Diet_Amount Diet_type;
   model Weight_gain = Diet_Amount Diet_type Diet_Amount*Diet_type;
   means Diet_Amount Diet_type / tukey;
run;

This can also be done using proc anova. But anova works well when the data is bal-

anced, i.e. the interaction groups are equal in size. Also, we are more interested in
the Type III sum of squares. So we prefer proc glm over proc anova.
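One optional way to eyeball the interaction before interpreting the glm output is to plot the cell means; the following is a sketch on the same data set, where the intermediate data set name cell_means is arbitrary:

proc means data=day1.twowayanova mean nway noprint;
   class Diet_Amount Diet_type;
   var Weight_gain;
   output out=cell_means mean=mean_gain;   /* one row per Diet_Amount x Diet_type cell */
run;

proc gplot data=cell_means;
   plot mean_gain*Diet_Amount=Diet_type;   /* a separate series for each diet type     */
run;

Roughly parallel series in this plot suggest little interaction, while series that cross or diverge point to an interaction effect worth examining in the model.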


Chapter 7

EXPLORATORY FACTOR ANALYSIS

Suppose we are interested in consumers' evaluation of a brand of coffee. We
take a random sample of consumers who were given a cup of coffee. They

were not told which brand of coffee they were given. After they had drunk the

coffee, they were asked to rate it on 14 semantic – differential scales. The 14 at-

tributes which were investigated are shown below:

1. Pleasant Flavor – Unpleasant Flavor

2. Stagnant, muggy taste – Sparkling, Refreshing Taste

3. Mellow taste – Bitter taste

4. Cheap taste – Expensive taste

5. Comforting, harmonious – Irritating, discordant

6. Smooth, friendly taste – Rough, hostile taste
7. Dead, lifeless, dull taste – Alive, lively, peppy taste
8. Tastes artificial – Tastes like real coffee
9. Deep distinct flavor – Shallow indistinct flavor
10. Tastes warmed over – Tastes just brewed
11. Hearty, full-bodied, full flavor – Warm, thin, empty flavor
12. Pure, clear taste – Muddy, swampy taste
13. Raw taste – Stale taste
14. Overall preference: Excellent quality – Very poor quality

A factor analysis of the ratings given by consumers indicated that four factors could
summarize the 14 attributes. These factors were: comforting quality, heartiness,
genuineness and freshness.

Here we are only exploring the factors, but we cannot confirm whether these are the

only factors, hence the name Exploratory Factor Analysis.

Factor                    Attributes
A. Comforting Quality     1. Pleasant flavor; 3. Mellow taste; 5. Comforting taste; 12. Pure, clear taste
B. Heartiness             9. Deep distinct flavor; 11. Hearty, full-bodied, full flavor
C. Genuineness            2. Sparkling taste; 4. Expensive taste; 6. Smooth, friendly taste; 7. Alive, lively, peppy taste; 8. Tastes like real coffee; 14. Overall preference
D. Freshness              10. Tastes just brewed; 13. Raw taste


Principal Component Analysis

Principal component analysis was developed by Pearson and adapted for factor

analysis by Hotelling. A goal for the user of PCA is to summarize the interrelationships

among a set of original variables in terms of a smaller set of uncorrelated principal

components that are linear combinations of the original variables.

Estimating The Initial Communalities

PCA assumes that there is as much variance to be analyzed as the number of ob-

served variables and that all of the variance in an item can be explained by the ex-

tracted factors. Communality means the variance that the items and factors share in

common.

Eigenvalues and Eigen Vectors

PCA has been described as Eigen analysis or seeking of the solution to the characteris-

tic equation of the correlation matrix. An Eigen value represents the amount of vari-

ance in all of the items that can be explained by a given principal component or factor.

An Eigen vector of a correlation matrix is a column of weights.
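In symbols, an Eigen value λ and its Eigen vector v of the correlation matrix R satisfy R v = λ v; solving this characteristic equation for every pair (λ, v) is what the extraction step of PCA does.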

Is Factor Analysis Feasible?

Correlation Matrix Check: Is it a combination of high and low correlations?

KMO MSA Check: The Kaiser – Meyer – Olkin Measure of Sampling Adequacy tests

whether the partial correlations among variables are small.

Bartlett‘s Test of Sphericity: It tests whether the correlation matrix is an identity ma-

trix, which could indicate that the factor model is inappropriate.

Factor Loadings

To obtain a principal component, each of the weights of an Eigen vector is multiplied

by the square root of the principal component‘s associated Eigen value. These newly

generated weights are called factor loadings and represent the correlation of each

item with the given principal component.
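In symbols, loading(i, j) = v(i, j) × √λ(j), where v(i, j) is the weight of item i in the j-th Eigen vector and λ(j) is the Eigen value associated with the j-th principal component.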

Deciding The Number of Factors

A Priori Criterion: Number of Factors to extract is pre-decided

Eigen Value Criterion:

Min Eigen Criterion: We decide the floor of Eigen value. If the floor is 0.6 and there

are 3 Eigen values above that mark, then we are looking for 3 factors.

Proportional and Cumulative Variance: We consider how much information is ex-

plained by an individual factor and on aggregate by the selected factors.

Scree Plot: This is basically graphical presentation of proportional variance

So, PCA explains the entire variance and EFA explains a part of it. In EFA we are basi-


cally trying to explain the common variance among the variables.

Factor Analysis is an Interdependence technique. In interdependence techniques the

variables are not classified as dependent or independent; rather, the whole set of in-

terdependence relationships is examined.

Problems of Factor Loadings and Solutions

Initially, the weights are distributed across all the variables. So it is not possible to under-

stand the underlying factor of one or more variables. To remove this problem, we ap-

ply rotation to the axes.

We mainly deal with two types of rotation:

Orthogonal Rotation: Varimax

Oblique Rotation: Promax

The problem with oblique rotation is that it makes the factors correlated. Varimax rota-

tion is used in principal component analysis so that the axes are rotated to a position in

which the sum of the variances of the loadings is the maximum possible.

[Scree plot: Eigen value on the vertical axis against the number of factors (1 to 8) on the horizontal axis]


EXPLORATORY FACTOR ANALYSIS

Here we are concerned with the underlying factors of employee satisfaction.
The name of the data set is employee_satisfaction. Let's first look at the variables

in the data set.

proc contents data=day1.employee_satisfaction position short;
run;
/* Employee Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work */

So apart from the variable employee which is basically the identification of the em-

ployee, all the variables contribute to the satisfaction of the employee. Using factor

analysis we are going to find out the underlying factors of the employee satisfaction

and see which variable belongs to which factor.

But first we have to see whether factor analysis is feasible or not.

proc factor data=day1.employee_satisfaction corr msa scree;
   var Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work;
run;

The corr option on the proc factor statement produces the correlation matrix of the variables mentioned in the var statement. If the correlations between the variables are very near to zero (say within +/- 0.2), then the variables are independent. So they

themselves are the factors. The other option msa produces a KMO MSA Check. The

scree option produces a scree plot.

Now suppose we want to produce 4 factors. Then we set the value of n to 4.

proc factor data=day1.employee_satisfaction corr msa scree n = 4 rotate = varimax;
   var Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work;
run;


The rotate option specifies the type of rotation that we give. Here we have assigned

Varimax rotation.

If we want to calculate all the scoring coefficients, then we mention the option score.

proc factor data=day1.employee_satisfaction corr msa scree score mineigen = 0.5;
   var Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work;
run;

The mineigen = 0.5 option implies we want to retain those factors only that have eigen

values greater than 0.5.

For individual factor scores, we specify the option out = day1.factor_scores.

proc factor data=day1.employee_satisfaction n = 4 out = day1.factor_scores;
   var Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work;
run;
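The out= data set keeps all the original variables and, by SAS convention, appends the estimated scores as Factor1, Factor2, and so on (here Factor1-Factor4, since n = 4). A minimal sketch for inspecting the first few scored employees (the obs= limit of 10 is our own choice):

proc print data=day1.factor_scores (obs=10);
   var Employee Factor1-Factor4;   /* Factor1-Factor4 are the score variables created by proc factor */
run;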


Chapter 8

CLUSTER ANALYSIS

C luster analysis groups individuals or objects into clusters so that objects in the same cluster are more similar to one another than they are to objects in other clusters. The attempt is to maximize the homogeneity of objects within the clusters while also maximizing the heterogeneity between the clusters. Like factor analysis, cluster analysis is also an interdependence technique.

A Simple Example

Suppose, you have done a pilot marketing of a candy on a randomly selected sample

of consumers. Each of the consumers was given a candy and was asked whether they

liked it and whether they will buy it. The respondents were then grouped into the following four clusters: LIKED and WILL BUY, LIKED but WILL NOT BUY, NOT LIKED but WILL BUY, and NOT LIKED and WILL NOT BUY.

Now the "NOT LIKED, WILL BUY" group is a bit unusual. But people can buy for others. From a strategy point of view, the group "LIKED, WILL NOT BUY" is important, because they are potential customers. A possible change in the pricing policy may change their purchasing decision.

What Exactly Are We Looking for?

From the example, it is very clear that we must have some objective on the basis of

which we want to create clusters. The following questions need to be answered:

What kind of similarity are we looking for? Is it pattern or proximity?
How do we form the groups?
How many groups should we form?
What's the interpretation of each cluster?
What's the strategy related to each of these clusters?

[Figure: Customer Profiling]


Problems with Cluster Analysis

Cluster analysis does not have a theoretical statistical basis. So no inference can be

made from the sample to the population. It‘s only an exploratory technique. Noth-

ing guarantees unique solutions.

Cluster analysis will always create clusters, regardless of actual existence of any

structure in the data. Just because clusters can be found doesn‘t validate their ex-

istence.

The Cluster solution cannot be generalized because it is totally dependent upon

the variables used as the basis for similarity measure. This criticism can be made

against any statistical technique. With cluster variate completely specified by the

researcher, the addition of spurious variables or the deletion of relevant variables

can have substantial impact on the resulting solution. As a result, the researcher

must be especially cognizant of the variables used in the analysis, ensuring that

they have a strong conceptual support.

Types of Cluster Analysis

[Diagram: Types of cluster analysis — Hierarchical (Agglomerative, Divisive) and Non-Hierarchical (K Means)]

In data mining, hierarchical clustering is a method of

cluster analysis which seeks to build a hierarchy of clus-

ters. Strategies for hierarchical clustering generally fall

into two types:

Agglomerative: This is a "bottom up" approach: each

observation starts in its own cluster, and pairs of clusters

are merged as one moves up the hierarchy.

Divisive: This is a "top down" approach: all observations

start in one cluster, and splits are performed recursively

as one moves down the hierarchy.

Metric and linkage

In order to decide which clusters should be combined (for

agglomerative), or where a cluster should be split (for divi-

sive), a measure of dissimilarity between sets of observations

is required. In most methods of hierarchical clustering, this is

achieved by use of an appropriate metric (a measure of dis-

tance between pairs of observations), and a linkage criterion

which specifies the dissimilarity of sets as a function of the

pairwise distances of observations in the sets. Generally the

distance metric is the Euclidean distance. As for linkages,

there are single linkage (the shortest distance between two

clusters), complete linkage (the longest distance), and average linkage (the average

of all the distances between the two clusters).
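For reference, a minimal sketch of these quantities in standard notation (the symbols are ours: x and y are observation vectors, A and B are clusters):

\[
d(x, y) = \sqrt{\textstyle\sum_{k} (x_k - y_k)^2}
\]
\[
d_{\text{single}}(A, B) = \min_{x \in A,\, y \in B} d(x, y), \qquad
d_{\text{complete}}(A, B) = \max_{x \in A,\, y \in B} d(x, y)
\]
\[
d_{\text{average}}(A, B) = \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x, y)
\]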


Ward’s Minimum Variance Cluster Analysis

Ward's minimum variance criterion minimizes the total within-cluster variance.

At each step the pair of clusters with minimum cluster distance are merged.

To implement this method, at each step find the pair of clusters that leads to mini-

mum increase in total within-cluster variance after merging.

Related Statistics

Semi Partial R Squared: The semi-partial R-squared (SPR) measures the loss of homoge-

neity due to merging two clusters to form a new cluster at a given step. If the value is

small, then it suggests that the cluster solution obtained at a given step is formed by

merging two very homogeneous clusters.

R Square: R-Square (RS) measures the heterogeneity of the cluster solution formed at a

given step. A large value represents that the clusters obtained at a given step are

quite different (i.e. heterogeneous) from each other, whereas a small value would sig-

nify that the clusters formed at a given step are not very different from each other.

Related Charts

Dendrogram: It‘s a chart showing which two clusters are merging at which distance.

Icicle: It's a chart showing which case is being merged into a cluster at which level.


K Means Clustering

In data mining, k-means clustering is a method of clus-

ter analysis which aims to partition n observations into

k clusters in which each observation belongs to the

cluster with the nearest mean. A key limitation of k-

means is its cluster model. The concept is based on

spherical clusters that are separable in a way so that

the mean value converges towards the cluster center.

The clusters are expected to be of similar size, so that

the assignment to the nearest cluster center is the cor-

rect assignment. Researchers generally use Hierar-

chical Methods to find out the optimal number of clus-

ters and then use K Means method to determine the actual clusters.
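As a sketch of how the second step of that strategy could be completed in SAS: once the hierarchical step (shown in the SAS implementation below) suggests, say, 4 clusters, PROC FASTCLUS can carry out the K-means partition on the same standardized data. The data set and variable names are the ones used later in this chapter; the output data set name kmeans_result is our own choice:

proc fastclus data=day1.iplstandard maxclusters=4 out=day1.kmeans_result;
   id player;
   var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;

The out= data set carries a CLUSTER variable giving each player's assigned cluster, which can then be profiled with proc means just as for the hierarchical solution.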


CLUSTER ANALYSIS

proc import datafile="C:\Documents and Settings\Asus\Desktop\SAS MATERIALS\Analytics data sets and case studies\IPL_Cluster.csv"
   out=day1.iplcluster dbms=csv replace;
run;

The original data set IPL_cluster is in the csv format and therefore we need to import it.

We use the ‗proc import‘ code to import this file in the sas library.

proc contents data=day1.iplcluster position short;
run;
/* Player Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes */

This data set contains a variety of variables which needs to be used repeatedly for our

analysis. The 'proc contents' procedure helps us to get the list of all the variables in their cre-

ation order.

proc standard data=day1.iplcluster out=day1.iplstandard mean=0 std=1;
   var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;

In cluster analysis, the idea is to club together the related observations. In order to club

together the related homogeneous observations, we need some sort of a composite

weight. In this dataset, it does not make any sense to add up the ‗runs scored‘ with the

number of ‗not outs‘ or the ‗number of sixes hit‘. So the first step towards segmenting

the ipl data set is to standardize the entire data set, so that all the variables become

free of units. This code creates a standardized data set free of units.

proc cluster data=day1.iplstandard outtree=day1.cluster_tree method=ward;
   id player;
   var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;

In this code we are applying the cluster procedure on the dataset ‗iplstandard‘. We

use an ‗outtree‘ command to generate a dataset by the name ‗cluster_tree‘ which

can be used to generate the dendrogram. The ‗method=ward‘ stands for ‗Ward‘s

minimum variance method‘. It clubs down those observations which would induce the

minimum increase in the error sum of squares or within the group variation. The com-

mand ‗id player‘ retains the player variable in the data set without performing any

mathematical function on the variable.


This code generates a cluster history. The cluster history contains the following im-

portant components which can be explained as follows:

Semi Partial R Squared: The semi-partial R-squared (SPR) measures the loss of homo-

geneity due to merging two clusters to form a new cluster at a given step. If the

value is small, then it suggests that the cluster solution obtained at a given step is

formed by merging two very homogeneous clusters.

R Square: R-Square (RS) measures the heterogeneity of the cluster solution formed

at a given step. A large value represents that the clusters obtained at a given step

are quite different (i.e. heterogeneous) from each other, whereas a small value

would signify that the clusters formed at a given step are not very different from

each other.

TIED: It implies that the performance of the two observations clustered together is

not unique. There are also other pairs who have performed in a similar manner. It is

to be observed that the pair we choose is going to affect the cluster formation.

Therefore, it is our discretion to choose the pair, which we want. SAS behaves like a

default user where it uses a tiebreaker rule to get the pairs of the cluster.

proc tree data=day1.cluster_tree;
run;

This code is used to generate the dendrogram. Proc tree is the procedure to generate

the dendrogram. Now, suppose we want to retain 4 clusters for our analysis. The follow-

ing code is used to do so:

proc tree data=day1.cluster_tree nclusters=4 out=day1.cluster_result;
   id player;
   copy Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;

The keyword ‗nclusters‘ is used for specifying the number of clusters that is to be re-

tained. The data set cluster_result has two new variables by the name ‗cluster‘ and

‗clus_name‘. The column ‗cluster‘ has observations like 1,2,3,4 representing that which

player belongs to which particular cluster. The variable ‗clus_name‘ has the observa-

tions CL4, CL5, CL7 and CL8.

data day1.cluster1;
   set day1.cluster_result (keep=player cluster);
run;

We use the ‗keep‘ command to retain only the ‗player‘ and ‗cluster‘ in the data set.

The newly created data set is cluster1. The following three steps are for arranging

the data to make meaningful decisions:

The first step is to sort the data set cluster1 by player and output the results to


cluster 2. We sort the data by the variable ‗players‘. Also, we sort the ‗iplcluster‘ data

set by player and display the results in cluster3. We are sorting the dataset as we want

to merge dataset cluster2 and cluster3 together. We are doing this in order to add a

new column to the original data set 'iplcluster'.

proc sort data=day1.cluster1 out=day1.cluster2;
   by player;
run;
proc sort data=day1.iplcluster out=day1.cluster3;
   by player;
run;
data day1.merge_cluster;
   merge day1.cluster2 day1.cluster3;
   by player;
run;

Next, we want to study the properties of the four clusters:

proc means data=day1.merge_cluster mean std;
   var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
   class cluster;
run;

This code is to show the properties of the cluster. By examining the descriptive statistics

associated with each of the clusters, we can identify which cluster is the best.

proc print data=day1.merge_cluster;
   where cluster=4;
run;

The objective of this code is to print the observations belonging to a particular cluster (here, cluster 4), from which we can form our decision about the best cluster.


Chapter 9

LINEAR REGRESSION

I n regression analysis we fit a predictive model to our data and use that model to

predict values of the dependent variable from one or more independent variables.

Simple regression seeks to predict an outcome variable from a single predictor vari-

able whereas multiple regression seeks to predict an outcome from several predic-

tors. We can predict any data using the following general equation:

Outcomei = (Model)i + Errori

The model that we fit here is a linear model.

Linear model just means a model based on

a straight line. One can imagine it as trying

to summarize a data set with a straight line.

Some Important Features of a Straight Line

A straight line can be defined by two things:

1)The slope or gradient of the line (b1)

2)The point at which the line crosses the ver-

tical axis of the graph, also known as the in-

tercept of the line (b0). So our general equa-

tion becomes: Yi = (b0 + b1Xi) + εi

Here Yi is the outcome that we want to predict and Xi is the i-th score on the predictor

variable. The intercept b0 and the slope b1 are the parameters in the model and are

known as regression coefficients. There is a residual term εi which represents the differ-

ence between the score predicted by the line and the i-th score in reality of the de-

pendent variable. This term is the proof of the fact that our model will not fit perfectly

the data collected. With regression we strive to find the line that best describes the da-

ta.

[Figures: Same intercept, but different slope | Same slope, but different intercept]


Difference between Correlation and Regression

Correlation analysis is concerned with knowing whether there is a relationship between variables and how strong the relationship is. Regression analysis is concerned with finding

a formula that represents the relationship between variables so as to find

an approximate value of one variable from the value of the other(s).

Assumptions of Simple Linear Regression

A unilinear relationship between an independent and a dependent variable can be represented by a linear regression.

The independent variable must be non-stochastic in nature, i.e. the variable

doesn‘t have any distribution associated with it.

The model must be linear in parameters.

The independent variable should not be correlated with the error term.

The error terms must be independent of each other, i.e. occurrence of one error

term should not influence the occurrence of other error terms.

The Method of Least Square

The method of least squares is a way of finding the line

that best fits the data. Of all the possible lines that could

be drawn, the line of best fit is the one which results in

the least amount of difference between the observed

data points and the line.

The figure shows that when any line is fitted to a set of

data, there will be small differences between the line

and the actual data. We are interested in the vertical

differences between the line and the actual data because we are using the line to

predict the values of Y from the values of X. Some of these differences are positive

(they are above the line, indicating that the model underestimates their value) and

some are negative (they are below the line, indicating that the model overestimates

their value).
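In symbols (standard least-squares notation, not reproduced from the text): the line of best fit minimizes the sum of squared vertical differences, and this minimization gives the familiar estimates

\[
\min_{b_0, b_1} \sum_i \varepsilon_i^2 = \sum_i (Y_i - b_0 - b_1 X_i)^2
\quad\Rightarrow\quad
b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{X}
\]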

Understanding The Goodness of Fit

The goodness of fit of a statistical mod-

el describes how well it fits a set of ob-

servations. Measures of goodness of fit

typically summarize the discrepancy

between observed values and the val-

ues expected under the model in

question. In linear regression, the fit is expressed through R2. Apart from R2, there is another measure of goodness of fit known as Adjusted R Square. The

R2 value for a regression can be made


arbitrarily high simply by including more and more predictors in the model. Adjusted R2

takes into account the number of independent variables in the model.
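In symbols (standard definitions, notation ours): with n observations and k predictors,

\[
R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2},
\qquad
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
\]

so adding a predictor raises the adjusted R-square only if it improves the fit by more than the degree of freedom it costs.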

From Sample to Population

Like other statistical methods, using regression we are trying to discover a relationship

between the dependent and independent variable(s) from a sample and try to draw

inference on the population. So there comes the tests of significance in linear regres-

sion.

The equation of the estimated line is: Ŷi = α + βXi. Here α (alpha) and β (beta) are the estimated values of the intercept and the slope respec-

tively. The tests of significance are related to these two estimates.

Test of Significance of The Estimated Parameters

Global Test

H0: All the parameters are equal to zero simultaneously

H1:At least one is non – zero

This test is conducted by using a F statistic similar to that we saw in ANOVA.

Local Test

For each individual parameters,

H0: The parameter value is zero

H1: The value is non – zero

This test is conducted by using a t statistic similar to a one sample t test.

For Simple Linear Regression there is no difference between the Global and Local tests

as there is only one independent variable.


Multiple Linear Regression

The Multiple Linear Regression (MLR) is basically an extension of Simple Linear Regres-

sion. Single linear regression has single explanatory variable whereas multiple regres-

sion considers more than one independent variables to explain the dependent varia-

ble. So from a realistic point of view, MLR is more attractive than the simple linear re-

gression. For example,

(Salary)i = a + b1(Education)i + b2(Experience)i + b3(Productivity)i + b4 (Work Experi-

ence)i + ei

Assumptions

The relationship between the dependent and the independent variables is linear.

Scatter plots should be checked as an exploratory step in regression to identify pos-

sible departures from linearity.

The errors are uncorrelated with the independent variables. This assumption is

checked in residuals analysis with scatter plots of the residuals against individual

predictors.

The expected value of residuals is zero. This is not a problem because the least

squares method of estimating regression equations guarantees that the mean is

zero.

The variance of residual is constant. An example of violation is a pattern of residuals

whose scatter (variance) increases over time. Another aspect of this assumption is

that the error variance should not change systematically with the size of the pre-

dicted values. For example, the variance of errors should not be greater when the

predicted value is large than when the predicted value is small.

The residuals are random or uncorrelated in time.

The error term is normally distributed. This assumption must be satisfied for conven-

tional tests of significance of coefficients and other statistics of the regression equa-

tion to be valid.

Concept of Multicollinearity

The predictors in a regression model are often called

the ―independent variables‖, but this term does not

imply that the predictors are themselves independ-

ent statistically from one another. In fact, for natural

systems, the predictors can be highly inter-

correlated. ―Multicolinearity‖ is a term reserved to

describe the case when the inter-correlation of pre-

dictor variables is high. It has been noted that the

variance of the estimated regression coefficients

depends on the inter-correlation of predictors. How-

ever, multicolinearity does not invalidate the regres-

sion model in the sense that the predictive value of

the equation may still be good as long as the pre-

dictions are based on combinations of predictors


within the same multivariate space used to calibrate the equation. But there are sever-

al negative effects of multicolinearity. First, the variance of the regression coefficients

can be inflated so much that the individual coefficients are not statistically significant –

even though the overall regression equation is strong and the predictive ability good.

Second, the relative magnitudes and even the signs of the coefficients may defy inter-

pretation. Third, the values of the individual regression coefficients may change radi-

cally with the removal or addition of a predictor variable in the equation. In fact, the

sign of the coefficient might even switch.

Signs of Multicollinearity

High correlation between pairs of predictor variables

Regression coefficients whose signs or magnitudes do not make good physical

sense

Statistically non-significant regression coefficients on important predictors

Extreme sensitivity of sign or magnitude of regression coefficients to insertion or de-

letion of a predictor variable

What is VIF?

The Variance Inflation Factor (VIF) is a statistic that can be used to identify multicolin-

earity in a matrix of predictor variables. ―Variance Inflation‖ refers here to the men-

tioned effect of multicolinearity on the variance of estimated regression coefficients.

Multicolinearity depends not just on the bivariate correlations between pairs of predic-

tors, but on the multivariate predictability of any one predictor from the other predic-

tors. Accordingly, the VIF is based on the multiple coefficient of determination in re-

gression of each predictor in multivariate linear regression on all the other predictors:

VIFi = 1 / (1 - Ri²)

where Ri² is the multiple coefficient of determination in a regression of the i-th predictor

on all other predictors, and VIFi is the variance inflation factor associated with the i-th

predictor. Note that if the i-th predictor is independent of the other predictors, the vari-

ance inflation factor is one, while if the i-th predictor can be almost perfectly predict-

ed from the other predictors, the variance inflation factor approaches infinity. In that

case the variance of the estimated regression coefficients is unbounded. Multicolin-

earity is said to be a problem when the variance inflation factors of one or more pre-

dictors become large. How large is too large is a subjective judgment. Some re-

searchers use a VIF of 5 and others use a VIF of 10 as a critical threshold. The VIF is

closely related to a statistic called the tolerance, which is 1/VIF.
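A small numerical illustration (the value 0.90 is our own, chosen only for the arithmetic):

\[
R_i^2 = 0.90 \;\Rightarrow\; VIF_i = \frac{1}{1 - 0.90} = 10, \qquad \text{tolerance} = \frac{1}{VIF_i} = 0.10
\]

so a predictor that is 90% explained by the other predictors already reaches the commonly used VIF threshold of 10.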

Analysis of The Residuals

Analysis of residuals consists of examining graphs and statistics of the regression residu-

als to check that model assumptions are satisfied. Some frequently used residuals tests

are listed below. All these are to check whether the error terms are identically inde-

pendently distributed.

Time series plot of residuals: The time series plot of residuals can indicate such prob-

lems as non-constant variance of residuals, and trend or autocorrelation in residu-


als.

Scatter plot of residuals against predicted values: The residuals are assumed to be

uncorrelated with the predicted values. Violation is indicated by some noticeable

pattern of dependence in the scatter plots.

Scatter plots of residuals against individual predictors: The residuals are assumed to

be uncorrelated with the individual predictors. Violation of these assumptions

would be indicated by some noticeable pattern of dependence in the scatter

plots, and might suggest transformation of the predictors.

Histogram of residuals: The residuals are assumed to be normally distributed. Ac-

cordingly, the histogram of residuals should resemble a normal probability density

function curve.

Lag-1 scatter plot of residuals: This plot also deals with the assumption of independ-

ence of residuals. The residuals at time t should be independent of the residuals at

time t-1. The scatter plot should therefore resemble a formless cluster of points.

The Idea of Autocorrelation

Autocorrelation is a mathematical representa-

tion of the degree of similarity between a given

time series and a lagged version of itself over

successive time intervals. It is the same as calcu-

lating the correlation between two different time

series, except that the same time series is used

twice - once in its original form and once lagged

one or more time periods. Autocorrelation is cal-

culated to detect patterns in the data. In the

chart, the first series is random, whereas the se-

cond one shows patterns.

Durbin-Watson (D-W) Statistic

The Durbin-Watson (D-W) statistic tests for autocorrelation of residuals, specifically lag-1

autocorrelation. The D-W statistic tests the null hypothesis of no first-order autocorrela-

tion against the alternative hypothesis of positive first-order autocorrelation. The alter-

native hypothesis might also be negative first-order autocorrelation. Assume the residu-

als follow a first-order autoregressive process

et = p·et-1 + nt

where nt is random and p is the first-order autocorrelation coefficient of the residuals. If

the test is for positive autocorrelation of residuals, the hypotheses for the D-W test can

be written as H0: p = 0 against H1: p > 0

The D-W statistic is given by d = Σ(et - et-1)² / Σet²

It can be shown that if the residuals follow a first-order autoregressive process, d is re-

lated to the first-order autocorrelation coefficient, p, as d = 2 (1 – p).

The above equation implies that

d = 2 if no autocorrelation (p = 0)

d = 0 if 1st order autocorrelation is 1

d = 4 if 1st order autocorrelation is -1


The simple linear regression model explains the causality relationship between the de-

pendent variable and a single independent variable. We use the Walmart case study

to explain the different important characteristics of the model. The case study analyses

various factors which explain customer satisfaction for the retail giant, Walmart. This is

basically a rating data set where the customers have rated various departments of

Walmart. Based on this data Walmart will try to understand how it can improve its cus-

tomer satisfaction.

proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\walmart.csv"
   out=day1.walmart dbms=csv replace;
run;

This code is meant for importing the file ‗Walmart‘ from the folder Analytics data sets

and case studies. This file is initially in the csv format and is being imported in order to

convert it to the SAS format. Post importing, we need to make a list of the variables so

that we can use them as and when necessary in our analysis. We make a list of the

variables in the data set using ‗position short‘ keyword.

proc contents data=day1.walmart position short;
run;
/* Advertising Competitive_Pricing Complaint_Resolution Customer_Satisfaction Delivery_Speed E_Commerce Order_Billing Packaging Price_Flexibility Product_Line Product_Quality Salesforce_Image Technical_Support Warranty_Claims */

proc reg data=day1.walmart;
   model customer_satisfaction=product_quality;
run;
quit;

‗reg‘ is the procedure used to execute regression analysis. The key word ‗model‘ is

used for creating a model with customer_satisfaction as the dependent variable and

product_quality as the independent variable.

The results generated show the following tables:

The first table shows that the number of observations read in the model is equal to

the number of observations used in the model. The Walmart data set has 200 ob-

servations and we are also using 200 observations. Therefore, this data set is free

from the problems of missing observations.

The next table is the ANOVA table. This table gives us an idea about the overall

goodness of fit of the model. The overall fit is defined by the p-value associated

with the F-statistic. Here, the p-value is less than the level of significance so the null-

hypothesis is rejected. The hypothesis for testing the significance of the model is:


H0: The slope coefficient of product_quality is insignificant

V/s

HA: The slope coefficient of product_quality is significant.

Therefore, we can conclude that product quality has a significant impact in ex-

plaining the variations in the customer satisfaction.

The next table shows the summary statistics associated with the fitted model. The R2 statistic gives the goodness of fit associated with

the model. This measure reflects the proportion of variation in the dependent varia-

ble explained by the independent variable. Root MSE is the standard deviation of

the error term. Dependent Mean is the mean of the actual values of the depend-

ent variable.

In the parameter estimates table, we check the individual null hypothesis. From the individual null hypothesis we find that the p-value is less than the level

of significance (0.05); which means that we are rejecting the individual null hypoth-

esis and declaring that the variable product_quality is a significant explanatory var-

iable for explaining the variation in the customer_satisfaction.

Although product quality affects customer satisfaction significantly, the percent-

age of variation explained by this variable is only 27%. This means that only this variable

is not sufficient for explaining the variations in Customer_satisfaction. Therefore, we

need to introduce more explanatory variables. Hence, we turn to the Multiple Linear

Regression Model.

MULTIPLE LINEAR REGRESSION ANALYSIS

This is a standard model we follow when we would like to apply the regression tech-

nique and there are certain assumptions, which need to be satisfied if the Classical Lin-

ear Regression Model (CLRM) is to be valid. However, before going into the assump-

tions we would split the data set into two parts: Training and Validation data sets. The

objective of splitting the data set into two parts is to check for the robustness of the re-

sult obtained. The selection of the observations in the data set must be random and

therefore we use the ‗ranuni‘ function to break the data set into the two parts. The

code below shows how to break the data set into two parts:

data day1.walmart1;
   set day1.walmart;
   rannum=ranuni(0);
run;
quit;

This is a data step whereby we create a new data set ‗walmart1‘ from the set

‗Walmart‘. The newly created dataset ‗Walmart1‘ has a set of random numbers at-

tached to every observation in the data set. These random numbers are generated

using the ‗ranuni‘ command. This function generates random numbers from a uniform


distribution. Now, given these random numbers we would break the data set into two

parts using:

data day1.waltraining day1.walvalidation;
   set day1.walmart1;
   if rannum<=0.7 then output day1.waltraining;
   else output day1.walvalidation;
run;
quit;

Instead of conducting the entire regression analysis on the original Walmart data set, we split it into two parts: a Training data set which would contain 70% of the observations and a Validation da-

ta set which would contain 30% of the total observations. This entire technique of

breaking the data set into two parts is called the Robust regression technique, since

the regression is initially conducted on the training data set and then the result ob-

tained in the training data set is validated in the Validation data set to check for the

robustness of the results. The entire purpose of this check is to ensure the reliability of

this model. In this method of creating the data set each and every observation in the

original data set has an equal probability of being selected in the training or in the val-

idation data set. The observations with corresponding random numbers greater than

0.7 are assigned to the validation data set and those with less than or equal to 0.7 are

assigned to the training data set.

Before proceeding with any predictions using the CLRM we must examine whether the

assumptions to the model are satisfied. The three most important assumptions are:

There should not be any multicollinearity among the explanatory variables

There should not be autocorrelation among the error terms

The variance of the error terms should be constant

The first check that we perform is for checking multicollinearity:

proc reg data=day1.waltraining;
   model customer_satisfaction=Advertising Competitive_Pricing Complaint_Resolution E_Commerce Order_Billing Packaging Price_Flexibility Product_Line Product_Quality Salesforce_Image Technical_Support Warranty_Claims / vif;
run;
quit;

We use the Variance Inflation Factor for checking the existence of multicollinearity.

The variance inflation factor is executed through the ‗vif‘ keyword. The VIF is measured

using the auxiliary regression, i.e. the regression of one independent variable on the

other independent variables. If VIF is greater than 10, then there is severe multicolline-

arity. The variable with the highest value above 10 is dropped. In this case, it is the vari-

able Delivery_Speed which had a vif of 65 approx. Then all other variables except De-

livery_Speed are included.


The next objective is to choose the best model. This is done by the choosing the model

with the highest Adjusted R-square. The adjusted R2 takes into account the number of

independent variables in the model and adjusts for the loss of degrees of freedom. So,

we need to select the model with the highest adjusted R2. The code for it is as follows:

proc reg data=day1.waltraining;
   model customer_satisfaction=Advertising Competitive_Pricing Complaint_Resolution E_Commerce Order_Billing Packaging Price_Flexibility Product_Line Product_Quality Salesforce_Image Technical_Support Warranty_Claims / selection=adjrsq;
run;
quit;

The keyword ‗selection=adjrsq‘ is used for selecting the model with the highest adjust-

ed R-square. The model with the highest adjusted R-square is considered to be the

model with the greatest explanatory capacity. The models are listed in the ascending

order of the goodness of fit of the model, i.e. the model with the highest goodness of

fit is listed first and the one with the lowest goodness of fit is the last model in line. Mod-

el here comprises of a combination of variables such that it yields an adjusted R-

square measure. The 'adjrsq' table displays the number of variables in the model corresponding to each reading of adjusted R2. From now on we use the varia-

bles that are prescribed by the model with the highest ‗adjrsq‘.

/*Competitive_Pricing Complaint_Resolution E_Commerce Packaging Price_Flexibility Product_Line Product_Quality Salesforce_Image Warranty_Claims*/

The original model contained 13 variables. Among them the variable Delivery_Speed

was dropped to solve the problem of multicollinearity. The model with the highest

‗adjrsq‘ contains nine variables. This means three more variables have been dropped

to reach the model with the best fit. So, a part of the explanatory capacity of the

model is foregone which might enter the error or the unexplained part. This might cre-

ate a systematic behavior among the error terms. So, we need to check whether the

error terms are identically independently distributed or not. This check requires us to

check the existence of: (a) Autocorrelation (b) Heteroscedasticity. To check for auto-

correlation we use the Durbin-Watson test statistic. The code for checking the autocor-

relation is as follows:

proc reg data=day1.waltraining;
   model customer_satisfaction=Competitive_Pricing Complaint_Resolution E_Commerce Packaging Price_Flexibility Product_Line Product_Quality Salesforce_Image Warranty_Claims / dw;
run;
quit;

‗dw‘ is the key word for generating the Durbin Watson test statistics associated with this

model. The dw test measures the extent of correlation between the error terms.


However, the statistic reveals the value of the autocorrelation between the error terms.

It does not talk about the significance of the autocorrelation. For doing so, we need to

use the DW critical-value tables. Since SAS does not produce these tables, no concrete

conclusions can be formed about the nature of autocorrelation between the error

terms. An alternative method is to use a technique which can test simultaneously for

the existence of autocorrelation and heteroscedasticity. This test is called the Spec test

or the Specification test.

proc reg data=day1.waltraining;
   model customer_satisfaction=Competitive_Pricing Complaint_Resolution E_Commerce Packaging Price_Flexibility Product_Line Product_Quality Salesforce_Image Warranty_Claims / spec;
run;
quit;

The Specification test is executed using the keyword ‗spec‘. This test aims to check the

following hypothesis:

H0: The error terms are identically and independently distributed

V/s

HA: The error terms are not identically and independently distributed

The null hypothesis is accepted if the p-value associated with the test is greater than

the level of significance. Since, SAS by default, considers the 5% level of significance,

therefore, if the p-value associated with this test is greater than 0.05, then we accept

the null hypothesis that the error term is random. Once we have confirmed that the as-

sumptions to the classical linear regression model are satisfied we next obtain the pre-

dicted value of the customer satisfaction. The following code is used for this purpose:

proc reg data=day1.waltraining;
   model customer_satisfaction=E_commerce Price_Flexibility Product_Quality Product_Line Salesforce_image;
   output out=day1.reg_final predicted=pred_sat residual=error;
run;
quit;

The command ‗output out‘ is used for creating a new data set where the outputs on

predicted satisfaction and residual are added in addition to the information on the

existing variables. The command ‗predicted‘ is used to generate the predicted cus-

tomer_satisfaction values and these are saved in ‗pred_sat‘. Similarly, the command

‗residual‘ is used to calculate the residual values. These values are stored in the varia-

ble name ‗error‘.

proc corr data=day1.reg_final;
   var pred_sat customer_satisfaction;
run;

In this step, we check the correlation between the predicted and the actual value of

customer satisfaction. The higher the correlation between the two values, the better is

the prediction of customer satisfaction.

Now, our next task is to create the validation data set using the estimates of the pa-

rameters from the training data set. Using these estimates we estimate the correlation

between them and compare this with the results obtained in the training data set. If

the difference in the correlation coefficient of the training and the validation data set

is within about 5%-6%, then we know that the results we have obtained are ro-

bust.

TESTING PARAMETER ESTIMATES USING THE VALIDATION DATA SET

data day1.wal_valid;
   set day1.walvalidation;
   pred_sat = -3.16541 - 0.30252*E_Commerce + 0.34468*Price_Flexibility + 0.45464*Product_Quality
              + 0.47807*Product_Line + 0.63257*Salesforce_Image;
run;

The numeric estimates used here have been obtained from the 'parameter estimates' table in

the training data set.

CHECKING THE CORRELATION BETWEEN ACTUAL AND PREDICTED SATISFACTION

proc corr data=day1.wal_valid;
   var pred_sat customer_satisfaction;
run;


Chapter 10

LOGISTIC REGRESSION

I n a nutshell, logistic regression is multiple regression but with an outcome variable

that is a categorical dichotomy and predictor variables that are continuous or categorical. In plain English, this simply means that we can predict which of two catego-

ries a person is likely to belong to given certain other information.

Example: Will the Customer Leave the Network?

This example is related to the Telecom Industry. The market is saturated. So acquiring

new customers is a tough job. A study for the European market shows that acquiring a

new customer is five times costlier than retaining an existing customer. In such a situa-

tion, companies need to take proactive measures to maintain the existing customer

base. Using logistic regression, we can predict which customer is going to leave the

network. Based on the findings, company can give some lucrative offers to the cus-

tomer. All these are a part of Churn Analysis.

Example: Will the Borrower Default?

Non - Performing Assets are big problems for the banks. So the banks as lenders try to

assess the capacity of the borrowers to honor their commitments of interest payments

and principal repayments. Using a Logistic Regression model, the managers can get

an idea of a prospective customer defaulting on payment. All these are a part of

Credit Scoring.

Example: Will the Lead become a Customer?

This is a key question in Sales Practices. Conventional salesman runs after, literally, eve-

rybody everywhere. This leads to a wastage of precious resources, like time and mon-

ey. Using logistic regression, we can narrow down our search by finding those leads

who have a higher probability of becoming a customer.

Example: Will the Employee Leave the Company?

Employee retention is a key strategy for HR managers. This is important for the sustaina-

ble growth of the company. But in some industries, like Information Technology, em-

ployee attrition rate is very high. Using Logistic regression we can build some models

which will predict the probability of an employee leaving the organization within a

given span of time, say one year. This technique can be applied on the existing em-

ployees. Also, it can be applied in the recruitment process.


So, we are basically talking about the probability of occurrence or non – occurrence

of something.

The Principles Behind Logistic Regression

In simple linear regression, we saw that the outcome variable Y is predicted from the

equation of a straight line: Yi = b0 + b1 X1 + εi in which b0 is the intercept and b1 is the

slope of the straight line, X1 is the value of the predictor variable and εi is the residual

term. In multiple regression, in which there are several predictors, a similar equation is

derived in which each predictor has its own coefficient.

In logistic regression, instead of predicting the value of a variable Y from predictor vari-

ables, we calculate the probability of Y = Yes given known values of the predictors.

The logistic regression equation bears many similarities to the linear regression equa-

tion. In its simplest form, when there is only one predictor variable, the logistic regres-

sion equation from which the probability of Y is predicted is given by:

P(Y = Yes) = 1/ [1+ exp{ - (b0 + b1 X1 + εi )}]


Why Can’t We Use Linear Regression?

One of the assumptions of linear regression is that the relationship between variables is

linear. When the outcome variable is dichotomous, this assumption is usually violated.

The logistic regression equation described above expresses the multiple linear regres-

sion equation in logarithmic terms and thus overcomes the problem of violating the

assumption of linearity. On the other hand, the resulting value from the equation is a proba-

bility value that varies between 0 and 1. A value close to 0 means that Y is very unlikely

to have occurred, and a value close to 1 means that Y is very likely to have occurred.

Look at the data points in the following charts. The first one is for Linear Regression and

the second one for Logistic Regression.


How Do We Get Equation?

In case of linear regression, we used ordinary least square method to generate the

model. In Logistic Regression, we use a technique called Maximum Likelihood Estima-

tion to estimate the parameters. This method estimates coefficients in such a way that

makes the observed values highly probable, i.e. the probability of getting the ob-

served values becomes very high.
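In symbols (for the simple one-predictor case, with Yi coded 1/0 and Pi denoting the modelled probability of Yi = 1; the notation is ours): maximum likelihood chooses b0 and b1 to maximize

\[
\ln L(b_0, b_1) = \sum_i \Big[\, Y_i \ln P_i + (1 - Y_i)\ln(1 - P_i) \,\Big],
\qquad
P_i = \frac{1}{1 + e^{-(b_0 + b_1 X_i)}}
\]

This log-likelihood is the same quantity used to assess the model later in the chapter.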

Comparison: Discriminant Analysis and Logistic Regression

Discriminant Analysis deals with the issue of which group

an observation is likely to belong to. On the other hand,

logistic regression commonly deals with the issue of how

likely an observation is to belong to each group, i.e. it esti-

mates the probability od an observation belonging to a

particular group. Discriminant Analysis is more of a classifi-

cation technique like Cluster analysis.

Comparison: Linear Probability Model and Logistic Regression

The simplest binary choice model is the linear probability model, where as the name

implies, the probability of the event occurring is assumed to be a linear function of a

set of explanatory variables as follows:

P(Y = Yes) = b0 + b1 X1 + εi

whereas the equation of logistic regression is as follows:

P(Y = Yes) = 1/ [1+ exp{ - (b0 + b1 X1 + εi )}]

You may find a great resemblance of Linear regression with the Linear Probability Mod-

el. From expectation theory, it can be shown that, if you have two outcomes like yes or

no, and we regress those values on an independent variable X, we get a LPM. In this

case, we code yes and no as 1 and 0 respectively.

Why Can’t We Use Linear Probability Model?

The reason is the same as why we cannot use linear regression for a dichotomous outcome variable, discussed above. Moreover, you may find some negative prob-

abilities and some probabilities greater than 1! And the error term will make you crazy.

So we have to again study Logistic Regression. NO CHOICE!


Time for Mathematics

If we code yes as 1 and no as 0, the logistic regression equation can be written as follows:

P(Y = 1) = 1 / [1 + exp{-(b0 + b1X1)}]    and    P(Y = 0) = 1 - P(Y = 1) = 1 / [1 + exp{(b0 + b1X1)}]

Now if we divide probability of yes by the probability of no, then we get a new meas-

ure called ODDS. Odds shouldn‘t be confused with probability. Odds is simply the ratio

of probability of success to probability of failure. Like we may say, what‘s the odds of

India winning against Pakistan. Then we are basically comparing the probability of In-

dia winning to probability of Pakistan winning.

Dividing the two expressions gives Odds = P(Y = 1) / P(Y = 0) = exp(b0 + b1X1). If we take the natural logarithm on both sides, we have:

ln(Odds) = ln[ P(Y = 1) / P(Y = 0) ] = b0 + b1X1

This log of the odds is called the logit, which is why Logistic Regression is also known as the Binary Logit Model.

Change In Odds

If we change X by one unit, then the change in odds is given by:

Odds(X) = exp(b0 + b1X)               ...(1)
Odds(X + 1) = exp(b0 + b1(X + 1))     ...(2)

Now if we divide the 2nd relation by the 1st one, we get e^b1. So if we change X by 1 unit, then the odds change by a multiple of e^b1, and the expression (e^b1 - 1) * 100% gives the percentage change in the odds. Remember, this kind of interpretation is valid only when X is continuous. When X is categorical, we refer to the Odds Ratio.

Odds Ratio

Suppose we are comparing the odds for a Poor Vision Person getting hit by a car to

the odds for a Good Vision Person getting hit by a car.

Suppose, the accident is encoded as 1.

Let P(Y =1| Poor Vision) = 0.8

& P(Y =1| Good Vision) = 0.4


So, P(Y = 0| Poor Vision) = 1 – 0.8 = 0.2

& P(Y =0| Good Vision) = 1 – 0.4 = 0.6

So, Odds( Poor Vision getting hit by a car) = 0.8 / 0.2 = 4

& Odds( Good Vision getting hit by a car) = 0.4 / 0.6 = 0.67

So, Odds Ratio = 4/ 0.67 = 6

The Odds ratio implies as we move from a good vision person to a poor vision person,

the odds of getting hit by a car becomes 6 times.

Developing The Model: Model Convergence

In order to estimate the logistic regression model, the likelihood maximization algorithm

must converge. The term infinite parameters refers to the situation when the

likelihood equation does not have a finite solution (or in other words, the maximum

likelihood estimate does not exist). The existence of maximum likelihood estimates for

the logistic model depends on the configurations of the sample points in the observa-

tion space. There are three mutually exclusive and exhaustive categories: complete

separation, quasi-complete separation, and overlap.

Assessing the Model

We saw in multiple regression that if we want to assess whether a model fits the data,

we can compare the observed and the predicted values of the outcome by using R2 .

Likewise, in logistic regression, we can use the observed and predicted values to assess

the fit of the model. The measure we use is the log – likelihood.

The log-likelihood is therefore based on summing the probabilities associated with the

predicted and actual outcomes. The log – likelihood statistic is analogous to the resid-

ual sum of squares in multiple regression in the sense that it is an indicator of how much

unexplained information is there after the model has been fitted. It‘s possible to calcu-

late a log-likelihood for different models and to compare these models by looking at


the difference between their log-likelihoods. One use of this is to compare the state of

a logistic regression against some kind of baseline model. The baseline model that‘s

usually used is the model when only the constant is included. If we then add one or

more predictors to the model, we can compute the improvement of the model as follows:

chi-square = [-2LL(baseline)] - [-2LL(new)] = 2[ LL(new) - LL(baseline) ]

which is compared with a chi-square distribution whose degrees of freedom equal the number of parameters added.

Now, what should we do with the rest of the variables, which are not in the equation? For that we have a statistic called the Residual Chi-Square Statistic. This statistic tells us whether the coefficients for the variables not in the model are significantly different from

zero, in other words, that the addition of one or more of these variables to the model

will significantly affect its predictive power.

Testing of Individual Estimated Parameters

The testing of individual estimated parameters or coefficients for significance is similar

to that in multiple regression. In this case, the significance of the estimated coefficients

is based on Wald‘s statistic. The statistic is a test of significance of the logistic regression

coefficient based on the asymptotic normality property of maximum likelihood esti-

mates and is estimated as:

Wald = [ b / SE(b) ]²

i.e. the squared ratio of the estimated coefficient to its standard error. The Wald statistic is chi-square distributed with 1 degree of freedom if the variable is

metric and the number of categories minus 1 if the variable is non-metric.

The Hosmer-Lemeshow Goodness-of-Fit Test

The Hosmer and Lemeshow goodness of fit (GOF) test is a way to assess whether there

is evidence for lack of fit in a logistic regression model. Simply put, the test compares

the expected and observed number of events in bins defined by the predicted proba-

bility of the outcome. The null hypothesis is that the data are generated by the model

developed by the researcher.

Hosmer-Lemeshow test statistic:

HL = Σi (Oi - Ni·πi)² / [ Ni·πi·(1 - πi) ]

where Oi is the observed frequency of events in the i-th bin, Ni is the total frequency of the i-th bin, and πi is the average estimated probability of the i-th bin. The statistic is compared with a chi-square distribution whose degrees of freedom equal the number of bins minus 2.
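In SAS, the Hosmer-Lemeshow test can be requested with the lackfit option on the MODEL statement of PROC LOGISTIC (the rsq option similarly prints the generalized R-square measures discussed in the next section). A minimal sketch, assuming a hypothetical churn data set day1.churn with a binary response churn_flag and two predictors tenure and monthly_charges:

proc logistic data=day1.churn descending;
   /* descending models the probability of the higher response value, i.e. the event */
   model churn_flag = tenure monthly_charges / lackfit rsq;
run;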

Statistics Related to Log-likelihood

AIC (Akaike Information Criterion) = -2 log L + 2(k + s), where k is the total number of response levels minus 1 and s is the number of explanatory variables.

SC (Schwarz Criterion) = -2 log L + (k + s) log(Σj fj), where fj is the frequency of the j-th observation.

Cox and Snell (1989, pp. 208–209) propose the following generalization of the coeffi-

cient of determination to a more general linear model:

R² = 1 - [ L(0) / L(β) ]^(2/n)

where n is the sample size, L(0) is the likelihood of the intercept-only model and L(β) is the likelihood of the fitted model. So, the largest value R² can reach is

R²max = 1 - [ L(0) ]^(2/n)

which is less than one. Nagelkerke (1991) proposes the following adjusted coefficient, which can achieve a maximum value of one:

R̃² = R² / R²max

All these statistics have similar interpretation as the R2 in Linear Regression. So, in this

part we are trying to assess how much information is reflected through the model.

Understanding the Relation between the Observed and Predicted Outcomes

It is very important to understand the relation between the observed and predicted

outcome. The performance of the model can be benchmarked against this relation.

Simple Concepts

Let us consider the following table.

In this table, we are working with unique observations. The model was developed for Y

= Yes. So it should show high probability for the observation where the real outcome

has been Yes and a low probability for the observation where the real outcome has

been No.

Consider the observations 1 and 2. Here the real outcomes are Yes and No respective-

ly, and the probability of the Yes event is greater than the probability of the No event.

Such pairs of observations are called Concordant Pairs. This is in contrast to the obser-

vations 1 and 4. Here we get the probability of the No is greater than the probability of

Yes. But the data was modeled for P(Y = Yes). Such a pair is called a Discordant Pair.

Now consider the pair 1 and 3. The probability values are equal here, although we

have opposite outcomes. This type of pair is called a Tied Pair. For a good model, we

would expect the number of concordant pairs to be fairly high.

Related Measures

Let nc, nd and t be, respectively, the number of concordant pairs, the number of discordant pairs, and the total number of pairs with different observed responses (one Yes and one No) in a dataset of N observations. Then (t - nc - nd) is the number of tied pairs.

Observation     1      2      3      4
Outcome (Y)     Yes    No     No     No
P(Y = Yes)      0.59   0.24   0.59   0.72


Then we have the following statistics:
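In the notation above, the association measures commonly reported (for example, in SAS output) are:

$$\text{Percent Concordant} = 100\,\frac{n_c}{t}, \quad \text{Percent Discordant} = 100\,\frac{n_d}{t}, \quad \text{Percent Tied} = 100\,\frac{t - n_c - n_d}{t}$$

$$c = \frac{n_c + 0.5\,(t - n_c - n_d)}{t}, \quad \text{Somers' } D = \frac{n_c - n_d}{t}, \quad \text{Gamma} = \frac{n_c - n_d}{n_c + n_d}, \quad \text{Tau-a} = \frac{n_c - n_d}{0.5\,N(N - 1)}$$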

In the ideal case, all the Yes events should have a very high probability and the No events a very low probability, as shown in the left chart. But the reality looks more like the right chart: we have some Yes events with very low probability and some No events with very high probability.

What Should be the Cut-point Probability Level?

It is a very subjective issue to decide on the cut-point probability level, i.e. the proba-

bility level above which the predicted outcome is an Event i.e., Yes. A Classification

Table can help the researcher in deciding the cutoff level.

A Classification table has several key concepts.

Event: It is our targeted outcome, e.g. will the customer churn? Yes is the event.

Non-Event: The opposite of Event. In the previous example, No is the non-event.

Correct Event: For a probability level, the prediction is an event and the observed outcome is also an event.

Correct Non-Event: For a probability level, the prediction is a non-event and the observed outcome is also a non-event.

Incorrect Event: For a probability level, the prediction is an event but the observed outcome is a non-event.

Incorrect Non-Event: For a probability level, the prediction is a non-event but the observed outcome is an event.

Correct: Percentage of correct predictions out of total predictions.


Other Measures Related to Classification Table

Receiver Operating Characteristic Curves

Receiver operating characteristic (ROC) curves are

useful for assessing the accuracy of predictions. In a

ROC curve the Sensitivity is plotted in function of 100-

Specificity for different cut-off points of a parameter.

Each point on the ROC curve represents a sensitivity/

specificity pair corresponding to a particular decision

threshold. The area under the ROC curve (AUC) is a

measure of how well a parameter can distinguish be-

tween two groups. In our case the parameter is the

probability of the event.

ROC curve shows sensitivity on the Y axis and 100 mi-

nus Specificity on the X axis. If predicting events (not non-events) is our purpose, then

on Y axis we have Proportion of Correct Prediction out of Total Occurrence and on the

X axis we have proportion of Incorrect Prediction out of Total Non-Occurrence for dif-

ferent cut-points.

If the ROC curve turns out to be the 45-degree diagonal straight line, then it implies that the model is segregating cases randomly.

Sensitivity: Measures the ability to predict an event correctly, calculated as: (Correctly predicted as events / Total number of observed events) * 100

Specificity: Measures the ability to predict a non-event correctly, calculated as: (Correctly predicted as non-events / Total number of observed non-events) * 100

False Positive: (Incorrectly predicted as event / Total predictions as Event) * 100

False Negative: (Incorrectly predicted as non-event / Total predictions as Non-Event) * 100
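As a small numerical illustration with hypothetical counts: suppose that, at a given cut-point, there are 80 observed events of which 60 are predicted as events, and 20 observed non-events of which 15 are predicted as non-events. Then Sensitivity = (60/80)*100 = 75% and Specificity = (15/20)*100 = 75%, so the value plotted on the X axis of the ROC curve, 100 minus Specificity, is 25%.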


LOGISTIC REGRESSION

This case study aims to study the credit risk faced by a German bank while extending

loans to borrowers. The credibility of the borrower is entirely private information of the

borrower himself. So, the bank needs to design measures to screen the good credible

borrowers from the bad defaulters. The objective of the analyst is to design a model so

that given any customer who comes up to the bank, the bank is able to predict

whether the customer is ‗good‘ or ‗bad‘. Since, the type of the customers is dichoto-

mous in nature, the best technique of predictive tools that we can use is: Logistic Re-

gression. The dichotomous situation can be represented as:

Y = 1 if the customer is a good customer

= 0 if the customer is a bad customer

Since the data set is in CSV format, we import it using the code below:

proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\German Bank data.csv"
    out=day1.german_bank dbms=csv replace;
run;

The new data set in the SAS format is stored in the library ‗day1‘ by the name

‗german_bank‘. The data set is a combination of categorical, character and numeri-

cal variables which describe the status of the customers in some way or the other. Let

us explain some of the variables in this data set:

CHK_ACCNT: This is a categorical variable which shows the amount of money that a

customer has with the bank in the check account. This indicates the credibility of the

customer. If a customer belongs to the category 0, then it speaks adversely about the

credibility of the customer as a customer who belongs to this category has negative

CHK_ACCNT balance. A customer who belongs to category 1 is relatively more reliable compared to one in category 0. Therefore, this variable is categorized in ascending order. The highest category (3) implies that the customer does not have any checking

account with the bank.

HISTORY: This is a categorical variable classified into five categories (0 to 4). Category 0 represents those customers who have a very clean credit history, in that they have never taken any loans. Category 4 represents the worst possible case: a customer in this category has a critical account.

DURATION: This is a numerical variable which describes the time period for which a

loan has been taken.

Each of these variables explains one side or the other of a borrower's credibility. We again follow the same train-and-validate approach used earlier, this time for logistic regression. The german_bank data set is decomposed into training and validation data sets using the


‗ranuni‘ function.

data day1.german_bank1;
    set day1.german_bank;
    rannum = ranuni(0);
run;

This data step is used to create a new data set ‗german_bank1‘ containing an addi-

tional column on random numbers generated by the ranuni code. The random num-

bers have been stored in the variable ‗rannum‘. These random numbers would now be

used for splitting the data set into the training and the validation data set using the fol-

lowing code:

data day1.gertraining day1.gervalidation;
    set day1.german_bank1;
    if rannum < 0.7 then output day1.gertraining;
    else output day1.gervalidation;
run;

The two data sets formed by splitting the original data set are: gertraining, the training

data sets where all statistical and empirical techniques are applied to derive the pri-

mary results and gervalidation where the results obtained in the training data sets are

validated.

proc contents data=day1.german_bank1 position short;
run;
/* AGE AMOUNT CHK_ACCT CO_APPLICANT DURATION EDUCATION EMPLOYMENT FOREIGN FURNITURE GUARANTOR HISTORY INSTALL_RATE JOB MALE_DIV MALE_MAR_or_WID MALE_SINGLE NEW_CAR NUM_CREDITS NUM_DEPENDENTS OBS OTHER_INSTALL OWN_RES PRESENT_RESIDENT PROP_UNKN_NONE RADIO_TV REAL_ESTATE RENT RESPONSE RETRAINING SAV_ACCT TELEPHONE USED_CAR rannum */

The position short statement allows us to make a list of the variables that we need to

use repeatedly for our analysis.

proc reg data=day1.gertraining;
    model response = CHK_ACCT DURATION HISTORY NEW_CAR USED_CAR FURNITURE RADIO_TV EDUCATION
                     RETRAINING AMOUNT SAV_ACCT EMPLOYMENT INSTALL_RATE MALE_DIV MALE_SINGLE
                     MALE_MAR_or_WID CO_APPLICANT GUARANTOR PRESENT_RESIDENT REAL_ESTATE
                     PROP_UNKN_NONE AGE OTHER_INSTALL RENT OWN_RES NUM_CREDITS JOB
                     NUM_DEPENDENTS TELEPHONE FOREIGN / vif;
run;
quit;

The main objective of this step is to check for the existence of multicollinearity among

the independent variables. The explanation and operation of this step is similar to the


one performed in the multiple linear regression model.

proc logistic data=day1.gertraining desc;
    model response = CHK_ACCT DURATION HISTORY NEW_CAR USED_CAR FURNITURE RADIO_TV EDUCATION
                     RETRAINING AMOUNT SAV_ACCT EMPLOYMENT INSTALL_RATE MALE_DIV MALE_SINGLE
                     MALE_MAR_or_WID CO_APPLICANT GUARANTOR PRESENT_RESIDENT REAL_ESTATE
                     PROP_UNKN_NONE AGE OTHER_INSTALL RENT OWN_RES NUM_CREDITS JOB
                     NUM_DEPENDENTS TELEPHONE FOREIGN / selection=stepwise;
run;

The ‗proc logistic‘ is used to generate the logistic regression. The ‗model‘ keyword is

used to generate the logistic regression model where ‗response‘ is treated to be the

dependent variable. The ‗desc‘ keyword is used to model the probability of Y=1 as by

default SAS models for the lowest value (here zero). The selection= stepwise is the tech-

nique of selection used for entering the variables. At the first step, the intercept is en-

tered. This acts as the baseline or the reference model. Then at each step an addition-

al variable is added. This allows us to capture the impact of each variable inde-

pendently. This code also generates the Maximum likelihood Estimation table. This

gives us the log of the odds ratio for each variable. This table by itself does not provide many insights for interpretation, but it is used to form an idea about the parameter estimates when the results are validated using the validation data set. The more important table from the interpretation point of view is the Odds Ratio Estimates table.

This table gives us the value e^β, the multiplicative change in the odds of the outcome for a one-unit change in an independent variable, and so it describes the responsiveness of the dependent variable to the independent variables. For example: if the duration of credit increases by 1 unit, then the log-odds of getting a good credit rating change by -0.026. The corresponding odds ratio for duration is 0.974, so if the duration increases by 1 unit, the odds of getting a good credit rating fall by about 2.6%.

proc logistic data=day1.gertraining desc;
    model response = CHK_ACCT DURATION HISTORY NEW_CAR USED_CAR FURNITURE RADIO_TV EDUCATION
                     RETRAINING AMOUNT SAV_ACCT EMPLOYMENT INSTALL_RATE MALE_DIV MALE_SINGLE
                     MALE_MAR_or_WID CO_APPLICANT GUARANTOR PRESENT_RESIDENT REAL_ESTATE
                     PROP_UNKN_NONE AGE OTHER_INSTALL RENT OWN_RES NUM_CREDITS JOB
                     NUM_DEPENDENTS TELEPHONE FOREIGN / selection=stepwise ctable pprob=(0 to 1 by 0.1);
run;

The option ctable generates the classification table. The keyword 'pprob' specifies the cut-off probability levels for classifying a customer as a good customer. For every cut-off probability level, we have a certain number of correctly and incorrectly classified events and non-events. For example, if the cut-off probability level is 0.1, then any customer with a predicted probability of being a 'good customer' greater than 0.1 will be classified as a 'good customer'. So, as the cut-off probability level changes, the number of correctly and incorrectly classified events and non-events will change accordingly.

proc logistic data=day1.gertraining desc;
    model response = CHK_ACCT DURATION HISTORY NEW_CAR USED_CAR FURNITURE RADIO_TV EDUCATION
                     RETRAINING AMOUNT SAV_ACCT EMPLOYMENT INSTALL_RATE MALE_DIV MALE_SINGLE
                     MALE_MAR_or_WID CO_APPLICANT GUARANTOR PRESENT_RESIDENT REAL_ESTATE
                     PROP_UNKN_NONE AGE OTHER_INSTALL RENT OWN_RES NUM_CREDITS JOB
                     NUM_DEPENDENTS TELEPHONE FOREIGN / selection=stepwise ctable pprob=(0 to 1 by 0.1)
                     lackfit outroc=day1.roc;
run;

The keyword 'lackfit' generates the goodness-of-fit measures; the goodness-of-fit statistic that we use here is the Hosmer-Lemeshow test statistic. Outroc is the keyword to generate the ROC measures, also known as the Receiver Operating Characteristics, which are used to measure the accuracy of our predictions. The ROC measures are: Sensitivity, Specificity, False Positive, False Negative, etc. The two measures that we use extensively are Sensitivity and 1-Specificity. Sensitivity measures the accuracy of the model, while 1-Specificity reflects the weakness of the model: it tells us, out of the total number of observed non-events, how many were wrongly classified as events by the model.

proc contents data=day1.roc position short;
run;
/* _STEP_ _PROB_ _POS_ _NEG_ _FALPOS_ _FALNEG_ _SENSIT_ _1MSPEC_ */

proc gplot data=day1.roc;
    plot _SENSIT_*_1MSPEC_;
run;

In this code, we are trying to plot the ROC measures to understand the goodness of fit


of the model. The plot of these two measures gives us a concave curve: as 1-Specificity increases, Sensitivity increases, but at a diminishing rate. Given the concave curve, the area under the curve will be greater than 0.5. The 'c'-value, or the value of the concordance index, measures the area under the ROC curve. If c = 0.5, the model discriminates between 0 and 1 responses no better than chance; that is, it cannot tell who is a 'good' customer and who is a 'bad' customer. The closer c is to 1, the better the model discriminates between '0' and '1'. The 'c'-value for our model is 0.81, which is far greater than the benchmark value of 0.5. So, we can safely regard our model as a 'good' model.

The next step in our model construction is to find out the predicted values of the prob-

ability associated with a customer for being a good customer. We use the following

code:

proc logistic data=day1.gertraining desc;
    model response = CHK_ACCT DURATION HISTORY NEW_CAR USED_CAR FURNITURE RADIO_TV EDUCATION
                     RETRAINING AMOUNT SAV_ACCT EMPLOYMENT INSTALL_RATE MALE_DIV MALE_SINGLE
                     MALE_MAR_or_WID CO_APPLICANT GUARANTOR PRESENT_RESIDENT REAL_ESTATE
                     PROP_UNKN_NONE AGE OTHER_INSTALL RENT OWN_RES NUM_CREDITS JOB
                     NUM_DEPENDENTS TELEPHONE FOREIGN / selection=stepwise ctable pprob=(0 to 1 by 0.1)
                     lackfit outroc=day1.roc;
    output out=day1.logistic p=predicted;
run;

Output out is the keyword that helps to generate a dataset from the proc steps. ‗p‘ will

give the predicted probability of a borrower being a good borrower. In the data set

the last column is the predicted probability values. However, we need a cut-off value

of probability to decide on the ‗really good customers‘. The next code is used to set a

limit for the borrowers to be a ‗good borrower‘.

data day1.logistic1;
    set day1.logistic;
    status = (predicted > 0.5);
run;

In essence, this code states that if the predicted probability is greater than 0.5, then the value of status is 1; otherwise it is 0. Now we want to understand the percentage of predictions which match the observed outcomes in the data set. We use a contingency (frequency) table for this analysis.

proc freq data=day1.logistic1;
    tables Response*Status / norow nocol;
run;

The proc freq keyword is used to generate the contingency table. Here we compare

the (1-1) and (0-0) pairs. The total percentage of frequency in these two boxes repre-

sents the percentage of correct predictions made by the model. This model shows

that approximately 77.1% of the predictions are correct. Now, we need to validate the

results obtained in the training data set. The codes in (a) and (b) are the set of proce-

dures that we follow for validating the data sets:

The parameter estimates used below are the parameter estimates obtained from the training data set. A regression equation is constructed so that, given the values of the variables in the data set, we can obtain a predicted value of the probability that Y = 1.

(a)
data day1.gervalidation1;
    set day1.gervalidation;
    z = 0.3626 + 0.5486*CHK_ACCT - 0.0260*DURATION + 0.4938*HISTORY
        - 0.9548*NEW_CAR - 1.1585*EDUCATION - 0.00012*AMOUNT + 0.3377*SAV_ACCT
        - 0.3124*INSTALL_RATE + 0.4367*MALE_SINGLE + 1.2503*GUARANTOR + 0.0269*AGE
        - 0.6323*OTHER_INSTALL - 0.4558*NUM_CREDITS;
    predicted = exp(z) / (1 + exp(z));
    status = (predicted > 0.5);
run;

The predicted value of the probability of Y=1 is obtained using the standard formula

from the logistic regression literature, which has already been discussed.

(b)
proc freq data=day1.gervalidation1;
    table Response*Status / norow nocol;
run;

This step is used for creating a contingency table for checking the correctness of prediction. The (1-1) and (0-0) pairs are checked to see the extent to which the model agrees with the observations in the data set. This percentage of correctness is tallied with that in the training data set. A difference of around 5%-6% between the two sets is considered acceptable. This code is the final check of the goodness of fit of the obtained model.


Chapter 11

TIME SERIES ANALYSIS

Let's consider the following table. The table shows the quarterly sales of a company. Our purpose is to predict the sales figure of the next quarter. So, in a time series we are trying to express the dependent variable (Sales) as a function of the time period.

Chart of the Data: Look at the X and Y axis

Formal Definition

A time series is a sequence of data points, measured typically at successive times,

spaced at uniform time intervals. Now the period or the uniform time interval can be

as large as a century if you collect geological data, can be as small as a second if you

collect biological data, and can be a quarter if you consider economic data. The im-

portant issue is that the time interval must be uniform.

Time series analysis comprises methods for analyzing time series data in order to ex-

tract meaningful statistics and other characteristics of the data.

Time series forecasting is the use of a model to forecast future events based on known

past events: to predict data points before they are measured.

Assumptions of Time Series Analysis

Pattern of past will continue into the future.

Discrete time series data should be equally spaced over time. There should be no missing values in the training data set (or they should be handled using appropriate missing value treatment in the data preparation process).

Time Series Analysis cannot be used to predict the effect of random events (for example, terrorist attacks, or acts of God such as tsunamis and other disasters).

Year    Quarter    Sales
2008    I          10.2
2008    II         12.4
2008    III        14.8
2008    IV         15.0
2009    I          11.2
2009    II         14.3
2009    III        18.4
2009    IV         18.0


What is a Trend Component?

The trend is the long term pattern of a time series. A trend can be positive or negative

depending on whether the time series exhibits an increasing long term pattern or a de-

creasing long term pattern. If a time series does not show an increasing or decreasing

pattern then the series is stationary in the mean. For example, population increases

over a period of time, price increases over a period of years, production of goods of

the country increases over a period of years. These are the examples of upward trend.

The sales of a commodity may decrease over a period of time because of better

products coming to the market. This is an example of declining trend or downward

trend.

What is a Cyclical Component?

Any pattern showing an up and down movement around a given trend is identified as

a cyclical pattern. The duration of a cycle depends on the type of business or industry

being analyzed. A business cycle showing these oscillatory movements has to pass

through four phases: prosperity, recession, depression and recovery.


What is a Seasonal Component?

Seasonality occurs when the time series exhibits regular fluctuations during the same

month (or months) every year, or during the same quarter every year. This continues to

repeat year after year. The major factors that are responsible for the repetitive pattern

of seasonal variations are weather conditions and customs of people. More woolen clothes are sold in winter than in summer. Regardless of the trend, we can observe that in each year more ice cream is sold in summer and very little in winter. Sales in departmental stores are higher during festive seasons than on normal days.

What is a Random Component?

This component is unpredictable. Every time series has some unpredictable compo-

nent that makes it a random variable. These variations are fluctuations in time series

that are short in duration, erratic in nature and follow no regularity in the occurrence

pattern. These variations are also referred to as residual variations since by definition

they represent what is left out in a time series after trend, cyclical and seasonal variations. Irregular fluctuations result from the occurrence of unforeseen events like floods, earthquakes, wars, famines, etc.


Since economic cycles are very hard to predict, most time series pattern are de-

scribed in terms of trend and seasonality. The irregular or the random events can be

smoothed out by using Simple, Weighted, or Exponential Moving Averages.

Formula
Simple Moving Average of n periods
Weighted Moving Average of n periods
Exponential Moving Average of n periods
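In standard textbook notation, these can be written as:

$$SMA_t = \frac{Y_t + Y_{t-1} + \cdots + Y_{t-n+1}}{n}$$

$$WMA_t = \frac{w_1 Y_t + w_2 Y_{t-1} + \cdots + w_n Y_{t-n+1}}{w_1 + w_2 + \cdots + w_n}$$

$$EMA_t = \alpha\,Y_t + (1 - \alpha)\,EMA_{t-1}, \qquad 0 < \alpha < 1$$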

For the Exponential Moving Average, a small α indicates that we are giving less weight to recent periods and more to previous periods; as a result, we get a slower moving average.
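As a small illustrative sketch (not part of the case studies), the simple and exponential moving averages could be computed in a SAS data step as below; the data set work.sales_ts and the variable sales are hypothetical names used only for this example, and alpha is set to 0.3:

data work.sales_ma;
    set work.sales_ts;                               /* hypothetical input data set with a numeric variable sales */
    retain ema;                                      /* keep the previous EMA value across observations */
    sma3 = mean(sales, lag1(sales), lag2(sales));    /* 3-period simple moving average; the first two rows average fewer terms because the lagged values are missing */
    if _n_ = 1 then ema = sales;                     /* start the EMA at the first observed value */
    else ema = 0.3*sales + (1 - 0.3)*ema;            /* EMA(t) = alpha*Y(t) + (1 - alpha)*EMA(t-1) */
run;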

Understanding Different Types of Moving Averages: Which One is Faster?

Understanding Moving Averages of Different Periods: Find out the Trends


Now How Do We Make Predictions?

The overall idea is that we extract a trend part, adjust the trend for seasonal compo-

nent, and make forecast. Now there can be two variations:

Y = T + C + S + e, Or Y = T * C * S * e

Where T = Trend Component

C = Cyclical Component

S = Seasonal Component

and e is the random part

These two variations are respectively known as Additive and Multiplicative Models.

Various Trends

Different Approaches

STEPAR (Stepwise Auto-regression)

Here we fit a time trend model to the series and take the difference between each

value and the estimated trend. This process is called DETRENDING. Then, the remaining

variation is fit using an autoregressive model.

We will learn about the Autoregressive Model in the coming segments.

EXPONENTIAL

Exponential smoothing forecasts are forecasts for an integrated moving-average pro-

cess; however, the weighting parameter is speci-

fied by the user rather than estimated from the

data. As a general rule, smaller smoothing weights

are appropriate for series with a slowly changing

trend, while larger weights are appropriate for vol-

atile series with a rapidly changing trend.

WINTERS METHOD

The WINTERS method uses updating equations

similar to exponential smoothing to fit parameters

for the model.

xt = (a + bt) s(t) + εt
where a and b are trend parameters and the function s(t) selects the seasonal parameters.


Stochastic Processes

A random or stochastic process is a collection of random variables ordered in time. An

example of the continuous stochastic process is an electrocardiogram and an exam-

ple of the discrete stochastic process is GDP.

In what sense can we regard GDP as a stochastic process?

Consider for instance the GDP of $2872.8 billion for 1970–I. In theory, the GDP figure for

the first quarter of 1970 could have been any number, depending on the economic

and political climate then prevailing. The figure of 2872.8 is a particular realization of all

such possibilities. You can think of the value of $2872.8 billion as the mean value of all

possible values of GDP for the first quarter of 1970. Just as we use sample data to draw

inferences about a population, in time series we use the realization to draw inferences

about the underlying stochastic process.

Stationary Stochastic Processes

A stochastic process is said to be stationary if its mean and variance are constant over

time and the value of the covariance be-

tween the two time periods depends only on

the distance or gap or lag between the two

time periods and not the actual time at

which the covariance is computed. Such a

time series will tend to return to its mean

(called mean reversion) and fluctuations

around this mean (measured by its variance)

will have a broadly constant amplitude. If a

time series is not stationary in the sense just defined, it is called a non-stationary time

series.

Why are stationary time series so important?

If the time series is non-stationary, then each set of time series data will have its own characteristics. So we cannot generalize the behavior of one set to other sets.

Non- Stationary Process:

Variance is Changing

Non- Stationary Process:

Mean is Changing


Random Walk Model (RWM)

This is a type of Non-Stationary Process. The term random walk is often compared with

a drunkard‘s walk. Leaving a bar, the drunkard moves a random distance ut at time t,

and, continuing to walk indefinitely, will eventually drift farther and farther away from

the bar. The same is said about stock prices. Today‘s stock price is equal to yesterday‘s

stock price plus a random shock. There are two types of Random Walk:

Random Walk Without Drift (No Constant Term): Yt = Y(t-1) + ut, where ut is a white noise error term with mean = 0 and variance = σ².

Random Walk With Drift (With Drift Parameter): Yt = δ + Y(t-1) + ut, where δ is the drift parameter.

Random Walk Without Drift

Random Walk With Drift (δ = 0.06)


Unit Root Problem

A stochastic process expressed by Yt = ρY(t-1) + ut, where -1 ≤ ρ ≤ 1, is known as a first-order autoregressive process; when ρ = 1, it is called a unit root stochastic process. If ρ is in fact 1, we face what is known as the unit root problem, that is, the process is non-stationary; we already know that in this case the variance of Yt is not stationary. The name unit root is due to the fact that ρ = 1. Thus the terms non-stationarity, random walk, and unit root can be treated as synonymous. If, however, |ρ| < 1, that is, if the absolute value of ρ is strictly less than one, then it can be shown that the time series Yt is stationary. In practice, then, it is important to find out if a time series possesses a unit root.

Unit Root Process (ρ = 1): Non-stationary

Stationary Process (-1 < ρ < 1)

Putting Them All Together

The distinction between stationary and non-stationary stochastic processes (or time

series) has a crucial bearing on whether the trend (the slow long-run evolution of the

time series under consideration) observed in the constructed time series is deterministic

or stochastic. Broadly speaking, if the trend in a time series is completely predictable

and not variable, we call it a deterministic trend, whereas if it is not predictable, we

call it a stochastic trend. To make the definition more formal, consider the following


model of the time series Yt: Yt = β1 + β2 t + β3 Y(t-1) + ut

Case 1: Pure Random Walk

If β1=0, β2=0 and β3=1, we get, Yt= Y(t-1)+ ut which is nothing but a RWM without drift and

is therefore non-stationary. Again, ΔYt= (Yt - Y(t-1) )= ut which is stationary. Hence, a

RWM without drift is a difference stationary process.

Case 2: Random Walk With Drift

If β1≠0, β2=0 and β3=1, we get, Yt= β1+Y(t-1)+ ut which is a RWM with drift and is therefore

non-stationary. Again, ΔYt= (Yt - Y(t-1) )= β1+ ut which means that Yt will exhibit a positive

(β1>0)or negative (β1<0) trend. Such a trend is called a Stochastic Trend. Again this is a

difference stationary process as ΔYt is stationary.

Case 3: Deterministic Trend

If β1≠0, β2≠0 and β3=0, we get, Yt=β1+ β2 t+ ut which is called a Trend Stationary Process.

Although the mean of the process β1+ β2 t is not constant, its variance is. Once the val-

ues of β1 and β2 are known, the mean can be forecasted perfectly. Therefore, if we

subtract the mean from Yt, the resulting series will be stationary, hence the name trend

stationary. This procedure of removing the (deterministic) trend is called detrending.

Case 4: Random Walk With Drift and Deterministic Trend

If β1≠0, β2≠0 and β3=1, we get, Yt=β1+ β2 t+Yt-1+ ut we have a random walk with drift and

a deterministic trend. ΔYt= (Yt - Y(t-1) )=β1+ β2 t+ ut, which implies that Yt is non – station-

ary.

Case 5: Deterministic Trend with Stationary Component

If β1≠0, β2≠0 and β3<1, we get, Yt=β1+ β2 t+β3 Yt-1+ ut which is stationary around a deter-

ministic trend.

Deterministic Versus Stochastic Trend


How Do We Test Whether The Time Series Is Stationary?

By now, all of us probably have a good idea about the nature of stationary stochastic

processes and their importance. There are broadly three ways to find out whether the

Time Series under consideration is stationary or not. The first way is to plot the time series

and look into the chart. In the last few topics, we have seen how a non- stationary se-

ries looks like. For example, if the chart is showing an upward trend, it‘s suggesting that

the mean of the data is changing. This may suggest that the data is non – stationary.

Such an intuitive feel is the starting point of more formal tests of stationarity. The other

methods of checking stationarity are the Autocorrelation Function (or Correlogram) and the Unit Root Test.

What is Autocorrelation Function or Correlogram?

Autocorrelation refers to the correlation of a time series with its own past and future

values. The first-order autocorrelation coefficient is the simple correlation coefficient of the first N - 1 observations x1, x2, x3, ..., x(N-1) and the next N - 1 observations x2, x3, x4, ..., xN. Similarly, we can define higher order autocorrelation coefficients. So for different orders or lags, we will get different autocorrelation coefficients. As a result, we can define the autocorrelation coefficients as a function of lag. This function is known as the Autocorrelation Function and its graphical presentation is known as the Correlogram. A rule of thumb is to compute the ACF up to one-third to one-quarter the length of the time series. The statistical significance of any autocorrelation coefficient can be judged by its standard error. Bartlett has shown that if a time series is purely random, that is, it exhibits white noise, the sample autocorrelation coefficients follow a normal distribution with mean = 0 and variance = 1/N, where N is the sample size.

Correlogram of a Stationary Process
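From Bartlett's result, an approximate 95% confidence band for an individual sample autocorrelation is ±1.96/√N; sample autocorrelation coefficients falling outside this band are judged to be significantly different from zero.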


The Unit Root Test

As we learnt from the unit root stochastic process

Yt= ρY(t-1)+ut, where -1≤ρ≤1

Also we learnt that in the case of unit root, ρ=1.

For theoretical reasons, we convert the equation as follows:

Yt - Y(t-1)=(ρ-1) Y(t-1)+ ut

Or, ΔYt=δY(t-1)+ ut

Where δ=(ρ-1) and Δ is the first difference operator.

Here, we are going to test the null hypothesis that δ=0, that is, ρ=1 which means that

the time series under consideration is non – stationary.

Now the problem is that under null hypothesis, the t value of the estimated coefficient

of Yt−1does not follow the t distribution even in large samples. Dickey and Fuller have

shown that, under the null hypothesis, the estimated t value of the coefficient of Yt−1

follows the τ (Tau) Statistic. In honor of the discoverer, this test is known as Dickey –

Fuller (DF) Test. In conducting DF test, we assumed that the error terms ut are uncorre-

lated. But in case the ut are correlated, Dickey and Fuller have developed a test,

known as the Augmented Dickey–Fuller (ADF) test.

Problems of Unit Root Tests

Apart from ADF, there are other unit root tests. There are limitations to these tests in-

cluding ADF. Most tests of the Dickey Fuller type tend to accept the null of unit root

more frequently than is warranted. That is, these tests may find a unit root even when

none exists.

How Do We Transform A Non- Stationary Time Series?

Difference-Stationary Process (DSP): If a time series has a unit root, the first differ-

ences of such time series are stationary.

Trend-Stationary Process (TSP): A Trend Stationary Process is stationary around the

trend. Hence, the simplest way to make such a time series stationary is to regress it

on time and the residuals from this regression will then be stationary.

It should be pointed out that, if a time series is DSP but we treat it as TSP, this is called

Under-Differencing. On the other hand, if a time series is TSP but we treat it as DSP, this

is called Over-Differencing. The consequences of these types of specification errors

can be serious.

Integrated Stochastic Process

If a time series has to be differenced once to make it stationary, we call such a time series

Integrated of Order 1. Similarly, if a time series has to be differenced twice (i.e., take

the first difference of the first differences) to make it stationary, we call such a time se-

ries integrated of order 2. For example, pure random walk is non-stationary, but its first

difference is stationary. So we call random walk without drift integrated of order 1.

In general, if a (non-stationary) time series has to be differenced d times to make it sta-


tionary, that time series is said to be integrated of order d and is denoted as Yt∼I(d). If

a time series is stationary at the beginning, then it is said to be integrated of order zero.

Most economic time series are generally I(1); that is, they generally become stationary

only after taking their first differences.

Now How Do We Forecast?

Before forecasting we need to model the time series. If a time series is stationary, we

can model it in a variety of ways.

Autoregressive (AR) Process

Let Yt represent GDP at time t. If we model Yt as
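$$(Y_t - \delta) = \alpha_1\,(Y_{t-1} - \delta) + u_t$$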

where δ is the mean of Y and ut is an uncorrelated

random error term with zero mean and constant variance σ2 , then we say Yt follows a

first order autoregressive, or AR(1) stochastic process. Here the value of Y at time t de-

pends on its value in the previous time period and a random term; the Y values are ex-

pressed as deviations from their mean value. In general, we can have an AR(p) as fol-

lows:
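$$(Y_t - \delta) = \alpha_1\,(Y_{t-1} - \delta) + \alpha_2\,(Y_{t-2} - \delta) + \cdots + \alpha_p\,(Y_{t-p} - \delta) + u_t$$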

Moving Average (MA) Process

Suppose we model Y as:
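$$Y_t = \mu + \beta_0\,u_t + \beta_1\,u_{t-1}$$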

where μ is a constant and u, as before, is the white noise stochastic error term. Here Y

at time t is equal to a constant plus a moving average of the current and past error

terms. Thus, in the present case, we say that Y follows a first-order moving average, or

an MA(1) process. More generally, a MA(q) process is expressed as:

In short, a moving average process is simply a linear combination of white noise error

terms.

Autoregressive Moving Average (ARMA) Process

Of course, it is quite likely that Y has characteristics of both AR and MA and is therefore

ARMA. Thus, Yt follows an ARMA(1, 1) process if it can be written as:
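$$Y_t = \theta + \alpha_1\,Y_{t-1} + \beta_0\,u_t + \beta_1\,u_{t-1}$$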

where θ represents a constant term. Again this expression can be generalized for an

ARMA(p, q) process.


Autoregressive Integrated Moving Average (ARIMA) Process

Now in the last segment we learnt about the Integrated Stochastic Process of order d,

which implies we have to difference a time series d times to make it stationary. So, giv-

en a time series, we first have to difference it d and then apply an ARMA(p, q) to mod-

el it. Then we say the original time series is ARIMA(p, d, q). Thus, an ARIMA(2, 1, 2) time

series has to be differenced once(d=1)before it becomes stationary and the (first-

differenced) stationary time series can be modeled as an ARMA(2, 2) process, that is, it

has two AR and two MA terms. Of course, if d=0 (i.e., a series is stationary to begin

with), ARIMA(p, d=0,q)= ARMA(p, q). Note that an ARIMA(p, 0, 0) process means a

purely AR(p)stationary process; an ARIMA(0, 0,q) means a purely MA(q) stationary pro-

cess. Given the values of p, d, and q, one can tell what process is being modeled.

The Box–Jenkins Methodology

The objective of B–J [Box–Jenkins] is to identify and estimate a statistical model which

can be interpreted as having generated the sample data. If this estimated

model is then to be used for forecasting we must assume that the features of this mod-

el are constant through time, and particularly over future time periods. Thus we must

have either a stationary time series or a time series that is stationary after one or more

differencing.

What actually are we looking for?

The most important question while modeling a time series is how does one know

whether it follows a purely AR process (and if so, what is the value of p) or a purely MA

process (and if so, what is the value of q) or an ARMA process (and if so, what are the

values of p and q) or an ARIMA process, in which case we must know the values of p,

d, and q. The BJ methodology comes in handy in answering the preceding question.

The method consists of four steps:

Identification: Find out the appropriate values of p, d, and q

Estimation: Having identified the appropriate p and q values, the next stage is to

estimate the parameters of the autoregressive and moving average terms includ-

ed in the model

Diagnostic Checking: Having chosen a particular ARIMA model, and having esti-

mated its parameters, we next see whether the chosen model fits the data reason-

ably well
Forecasting

So, How Do We Find p, d and q?

The chief tools in identification are the autocorrelation function (ACF),the partial auto-

correlation function (PACF),and the resulting correlograms, which are simply the plots

of ACFs and PACFs against the lag length.

Identifying d

If the series has positive autocorrelations out to a high number of lags, then it prob-


ably needs a higher order of differencing

If the lag-1 autocorrelation is zero or negative, or the autocorrelations are all small

and patternless, then the series does not need a higher order of differencing. If the

lag-1 autocorrelation is -0.5 or more negative, the series may be overdifferenced

The optimal order of differencing is often the order of differencing at which the

standard deviation is lowest

Identifying AR(p)

ACF decays exponentially or with dampened sine wave pattern or both and PACF has

significant spikes through lags p.

Identifying MA(q)

The PACF decays exponentially and the ACF has significant spikes through lags q.

How Do We Estimate The Parameters?

Estimating the parameters for the Box–Jenkins models is a quite complicated non-

linear estimation problem. For this reason, the parameter estimation should be left to a

high quality software program that fits Box–Jenkins models. The main approaches to

fitting Box–Jenkins models are non-linear least squares and maximum likelihood estima-

tion. Maximum likelihood estimation is generally the preferred technique.

How Good Is The Model?

One simple diagnostic is to obtain residuals from the model developed and obtain

the ACF and PACF of these residuals, say, up to lag 25. If the correlograms of both the autocorrelation and partial autocorrelation functions give the impression that the residuals estimated from the model are purely random, the chosen model fits the data reasonably well.


TIME SERIES ANALYSIS

data day1.sales_time (drop=date);
    set day1.salest;
    date1 = input(date, monyy5.);
    format date1 monyy5.;
run;

This code converts the variable ‗date‘ from a ‗character‘ variable to a ‗numeric‘ vari-

able. The input function is used to make this conversion.

proc forecast data=day1.sales_time out=day1.sales11 lead=10 interval=month;
    id date1;
    var sales;
run;

The procedure FORECAST is used to generate the forecasted output for the following ten periods. So, date1 will contain the upcoming 10 time points. _TYPE_ in sales11 indicates that a row is a forecasted value, because here we have generated the forecasted values of Sales. _LEAD_ gives the number of periods ahead for each of the 10 forecast points, and 'sales' holds the corresponding point forecasts of Sales. However, we want an interval estimate rather than just a point estimate. To obtain the interval estimation we use the following code:

proc forecast data=day1.sales_time out=day1.sales11 lead=10 interval=month outlimit alpha=0.01;
    id date1;
    var sales;
run;

In this code the confidence intervals are computed at the 99% level, i.e. the level of significance is 0.01, which has been specified using the 'alpha' option. The outlimit option restricts the sales11 data set to show only the forecasted values along with their limits. But if we use outfull instead of outlimit, then this output will be accompanied by the past actual values and their fitted values from the model.

proc forecast data=day1.sales_time out=day1.sales11 lead=10 interval=month outfull alpha=0.01;
    id date1;
    var sales;
run;

data day1.sales12;
    set day1.sales11 (obs=50);
run;


We use the ‗outfull‘ keyword to generate the forecasted values for the given time peri-

od. This would help us to compare the actual sales and forecasted sales figures for the

given time periods, i.e. from July 89 to July 91.

data day1.actual day1.forecast;
    set day1.sales12;
    if _type_ = "ACTUAL" then output day1.actual;
    else output day1.forecast;
run;

Here we split the data set into the actual and forecasted data set. Next we create a

data set ‗merge_time‘.

data day1.merge_time;
    merge day1.actual (rename=(sales=actual_sales))
          day1.forecast (rename=(sales=forecasted_sales));
    by date1;
run;

merge_time data set merges the actual and forecasted value.

proc contents data=day1.merge_time position short; run;

This code creates a list of the variables in the data set as in the list below:

/* date1 _TYPE_ _LEAD_ actual_sales forecasted_sales */

data day1.merge_time (drop=_type_ _lead_);
    set day1.merge_time;
run;

From this data set we drop _TYPE_ and _LEAD_, which are no longer needed once the actual and forecasted values sit side by side. Now, to plot the actual and forecasted sales, we first set up the legend, symbol and axis statements:

legend1 label=(" ") value=("original" "forecast");
symbol1 c=blue v=dot H=0.7 i=join l=1;
symbol2 c=green v=dot H=0.7 i=join l=1;
axis1 label=(angle=90 "actual vs. forecast");
axis2 label=("date");

These are global statements, meaning that for the remaining SAS session these settings will also be applied to the other graphs. Now, legend1 is the name of the legend that we construct here, and 'label' refers to the label that will be applied to the legend. The value=("original" "forecast") option assigns the text labels shown in the legend. 'symbol1' refers to the symbol assigned to the first plot, i.e. the graph for actual sales: the colour given is blue, the graph is shown in terms of dots (bubbles), the size of the bubbles is 0.7 (given by H), i=join means that the points are joined by straight lines, and l=1 specifies the line type. The same goes for symbol2, where the colour of the line is green. 'axis1' refers to the vertical axis, whose label 'actual vs. forecast' is written at an angle of 90 degrees, and 'axis2' refers to the horizontal axis, whose label is 'date'.

proc gplot data=day1.merge_time;
    plot actual_sales*date1=1 forecasted_sales*date1=2 / overlay vaxis=axis1 haxis=axis2 legend=legend1;
run;
quit;

The data set used here is merge_time, and we apply proc gplot since we are plotting numeric variables. Here date1 is plotted on the horizontal axis and the actual and forecasted sales values are plotted on the vertical axis. The 'overlay' keyword has been used to impose one graph over the other. Axis1 refers to the vertical axis and Axis2 refers to the horizontal axis. legend=legend1 means that we are applying the pre-specified legend option, and the quit statement is given to end the interactive procedure.

At time point of June 90, we have the original sales exceeding the forecasted sales. At

the time point of April 91 we have forecasted sales greater than the actual sales and a

time point of August 1990; we have both the actual and forecasted sales almost equal

to each other. So, whenever there is any difference between the actual and forecast-

ed sales we call it a prediction error, which can be either positive or negative.

STATIONARITY OF TIME SERIES

The case study involves the study of a data set ‗timeseriesairline‘ where there are 144

observations on the number of passengers for different dates. Now, we would like to

see if the time series data on passengers is stationary.

proc contents data=day1.timeseriesairline position short; run; /*Date Passengers */

The ‗position short‘ keyword is used to list the variables in their creation order.

proc gplot data=day1.timeseriesairline; plot Passengers*Date; run; quit;

The ‗gplot‘ function is used to plot the variables passengers and date. ‗gplot‘ is used


since both the variables plotted are numeric variables. The plot shows that both the mean and the variance exhibit an upward rising trend or, in other words, the series exhibits clear non-stationarity. So now, let's see what adjustment can be made to make the data stationary. So, our main objective is to make the data set stationary in terms of mean

and variance. We apply the differencing technique to solve this problem. The tech-

niques of simple differencing and log differencing are used to restore stationarity.

data day1.timeseriesairline1;
    set day1.timeseriesairline;
    passlog = log(Passengers);
    pass1 = dif1(Passengers);
    pass2 = dif1(passlog);
run;

This data step creates a new data set with the variables passlog, pass1 and pass2 along with the other existing variables. Passlog is the log of the passenger value. Pass1 is the first difference applied to the number of passengers; that is, Pass1 is equal to (passengers in February 1949 - passengers in January 1949). Pass2 is the first-order differencing applied to 'passlog'; therefore, Pass2 equals (passlog in February 1949 - passlog in January 1949). Now, we plot passlog, pass1 and pass2 against time to see which of them is stationary.

proc gplot data=day1.timeseriesairline1; plot Passlog*Date; run; quit;

In this graph where we are plotting (passlog) with respect to the date we are getting a

non-stationary graph, i.e. where the mean and variance are fluctuating over time and

are not constant.

proc gplot data=day1.timeseriesairline1;
    plot Pass1*Date;
run;
quit;

In this graph where we plot Pass1 with respect to date we get a mean stationary pro-

cess, where the mean is constant and the variance is not. Though the mean is con-

stant, the variance is increasing over time. Thus, the simple first order differencing is not

sufficient to restore stationarity of the time series. So, we try to apply the log differenc-

ing.

proc gplot data=day1.timeseriesairline1; plot Pass2*Date; run;


In the graph where we plot Pass2 with respect to date, we get a variance stationary

process, where both variance and mean are constant, i.e. they do not change over

time. After checking for stationarity, we come to the time series modeling part:

proc arima data=day1.timeseriesairline;
    identify var=passengers stationarity=(adf=(1, 13));
run;
quit;

'proc arima' is the procedure for fitting an Auto-Regressive Integrated Moving Average (ARIMA) model to the dataset 'timeseriesairline'. The variable we consid-

er is ‗passengers‘. The ‗stationarity‘ key word is used to test whether the dataset is a

stationary or not. The stationarity is measured in terms of the number of passengers

who board the plane at different points of time during the period under consideration.

The Augmented Dickey- Fuller test statistic is used to check the stationarity of the data

variable ‗passengers‘. Here, for ADF the parameter values taken are (1, 13), i.e. since

ADF follows an AR process, we take two parameter values for that AR process, i.e. at

lag (1, 13). This means that we are trying to formulate two models here. The first, where

value in the month of February will only depend on the value in the month of January

and the second, where the value in the month of February 2012, depends on all past

values till February 2011. That the data ‗timeseriesairline‘ is non-stationary is very clear

from the pattern of the Correlogram. The Correlogram is the graphical representation

of the Autocorrelation Function. In the ACF, we find that the graph is gradually declin-

ing but it does not die down to zero quickly. Therefore, we can conclude that the data on the passengers is not stationary, and we need to take the first difference in the two models:

proc arima data=day1.timeseriesairline;
    identify var=passengers (1, 1) stationarity=(adf=(1, 13));
run;
quit;

To solve the problem associated with the stationarity, the technique of simple first or-

der differencing is used on the variable ‗passengers‘. The (1,1) shows the number of

differencing done in the two models to make them stationary. When the first model,

i.e. where the passengers at time ‗t‘ are assumed to be dependent on the number of

passengers at time ‗t-1‘ is differenced once, then it becomes stationary, whereas in

the second model, where there are a total of 13 lags, one period differencing does

not remove the seasonal component from the data. Therefore, this model remains non

-stationary.

proc arima data=day1.timeseriesairline;
    identify var=passengers (1, 1) minic stationarity=(adf=(1, 13));
run;
quit;


The question here is: Can we estimate the time series model in the presence of non-

stationarity? To answer this, our first job is to obtain the optimum number of lags in the

model using the ‗Bayesian Minimum Information Criteria‘ (BIC). This gives us the Mini-

mum possible number of lags, which are significantly required to forecast the depend-

ent time-series variable. The keyword ‗minic‘ is used to obtain the optimum number of

lags from all the existent lags. The BIC calculates that the optimum lag for the AR mod-

el should be 5 and that of the MA model is 3. The main idea here is that we want to

check whether the existence of optimum lags is sufficient for a proper estimation of a

time series model.

proc arima data=day1.timeseriesairline;
    identify var=passengers (1, 1) minic stationarity=(adf=(1, 13));
    estimate p=5 q=3;
    forecast lead=12 interval=month;
run;
quit;

This code estimates the time series model using lags p=5 and q=3. The ‗forecast‘ key-

word is used to forecast the value of the time series variable ‗passengers‘. ‗lead‘ is the

keyword for assigning the number of periods in advance for which forecast is made

and ‗interval‘ is the keyword which specifies the time interval at which forecasting is to

be made . This code tries to forecast the number of passengers that would board the

plane over the next twelve months. The estimation results generate a report whereby

we can see that the parameter estimates calculated by the model are unstable esti-

mates. Therefore, it can be inferred that the model estimation is not possible in the ab-

sence of Stationarity. Therefore, the foremost important job is to restore stationarity into

the model.

proc arima data=day1.timeseriesairline;
    identify var=passengers (1, 12) minic stationarity=(adf=(1, 13));
run;
quit;

Stationarity in this model can be restored by combining the lag-1 difference with a lag-12 (seasonal) difference, so that what remains can be modeled as a simple low-order AR process. The optimum number of lags is again calculated using the 'minic' keyword.

proc arima data=day1.timeseriesairline;
    identify var=passengers (1, 12) minic stationarity=(adf=(1, 13));
    estimate p=1 q=0;
    forecast lead=12 interval=month;
run;
quit;


Appendix

SUGGESTED BOOKS AND REFERENCES

1. Gujarati, D. N., Basic Econometrics, 5th Edition, Tata McGraw-Hill

2. Field, A., Discovering Statistics Using SPSS, 2nd Edition, Sage Publications

3. Hair, J., Anderson, R., Babin, B., Multivariate Data Analysis, 7th Edition, Prentice Hall

4. Malhotra, N. K., Dash, S., Marketing Research: An Applied Orientation, 5th Edition, Pearson Education

5. SAS OnlineDoc® Version 8 PDF files, Worcester Polytechnic Institute, http://www.math.wpi.edu/saspdf/common/mainpdf.htm

6. Rud, O. P., Data Mining Cookbook: Modeling Data for Marketing, Risk, and Customer Relationship Management, John Wiley & Sons, 2000


Head Office:

Orangetree Business Solutions Private Limited,

HB 206, Salt Lake City, Kolkata - 700 106.

Call Us:

09051563222

Mail Us:

[email protected]

Website:

www.orangetreeglobal.com