2. The Linear Regression Model


  • 2. The Linear Regression Model

    Joshua Sherman Applied Econometrics

    040693 University of Vienna

  • Regression model

    We use regression models to answer the following types of questions:

    If one variable changes in a certain way (PUNISHMENT), by how much will another variable (CRIME) change?

    Given the value of one variable, can we predict the corresponding value of another?

  • Simple linear regression model

    We begin by discussing the simple linear regression model, in which there is only one explanatory variable on the right hand side of the regression equation:

    $E(Y) = \beta_0 + \beta_1 X$

    The unknown parameters $\beta_0$ and $\beta_1$ are the intercept and slope of the regression function. We refer to them as population parameters.

    Suppose Y is sales of umbrellas and X is the amount of rainfall in centimeters in a given city. Then the slope coefficient $\beta_1$ represents the change in the average number of umbrella sales given a 1 cm change in rainfall. The intercept coefficient $\beta_0$ represents the average number of umbrella sales in a city with no rainfall.
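
    For instance, if the parameter values were (hypothetically) $\beta_0 = 20$ and $\beta_1 = 5$, then $E(Y \mid X = 10) = 20 + 5 \times 10 = 70$: a city with 10 cm of rainfall sells 70 umbrellas on average, and each additional centimeter of rainfall raises average sales by 5 umbrellas. These numbers are purely illustrative, not from the lecture.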

  • Simple linear regression model

    The number of umbrella sales for all cities in which annual rainfall is a given amount (for example, 10 cm) will be scattered around the mean.

    A probability density function (pdf) will depict how these values are scattered around the mean.

    The mean is just one descriptor of a distribution. Another important descriptor is the variance.

    The variance is defined as the average of the squared differences between the values of a distribution and the mean. It is essentially a measure of the extent to which values of a distribution are spread out. Mathematically:

    $\mathrm{var}(Y) = E\big[(Y - E(Y))^2\big]$
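
    As a quick illustration of this definition, here is a minimal Python sketch (with made-up umbrella-sales values) that computes the variance as the average squared deviation from the mean:

```python
# Minimal sketch: variance as the average of squared deviations from the mean.
# The sales figures are made-up illustrative values.
sales = [55, 60, 70, 80, 85]

mean = sum(sales) / len(sales)                                # E(Y)
variance = sum((y - mean) ** 2 for y in sales) / len(sales)   # average squared deviation

print(mean, variance)  # 70.0 130.0
```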

  • Probability density function for Y

    [Figure: probability density functions of Y at X = 10 and X = 25, centered on the regression line $E(Y|X) = \beta_0 + \beta_1 X$]

    This regression function shows the average number of umbrellas sold at different levels of rainfall, in centimeters.

    The conditional variance of Y is $\mathrm{var}(Y|X) = \sigma^2$ for all values of X.

  • Probability density function for Y

    On the previous slide, the constant variance assumption implies that at each level of rainfall X we are equally uncertain about how far values of Y will be from their average value, $E(Y|X) = \beta_0 + \beta_1 X$. Data satisfying this condition are considered to be homoskedastic. If this assumption is not satisfied, the data are considered to be heteroskedastic.

  • The error term

    An observation on Y can be decomposed into two parts:

    Systematic component:

    $E(Y) = \beta_0 + \beta_1 X$

    Random component:

    $e = Y - E(Y) = Y - \beta_0 - \beta_1 X$

    Rearranging we obtain the simple linear regression model:

    $Y = \beta_0 + \beta_1 X + e$

  • The error term

    Why do we introduce an error term?

    Unavailability of data
    Randomness in human behavior
    Net influence of a large number of small and independent causes

    As e is the random component, we know that $E(e) = 0$ because:

    $E(e) = E(Y) - \beta_0 - \beta_1 X = 0$

    We also know that the variances of Y and e are identical and equal to $\sigma^2$, because they only differ by a constant. Thus the pdfs for Y and e are identical in all respects except for their location.

  • The error term initial assumptions

    Several assumptions are required in order to run the simple linear regression model. Thus far we have assumed that:

    1. $Y = \beta_0 + \beta_1 X + e$
    2. $E(e) = 0$, which is equivalent to stating that $E(Y) = \beta_0 + \beta_1 X$
    3. $\mathrm{var}(e) = \mathrm{var}(Y) = \sigma^2$ (homoskedasticity)

    Later we will see why these assumptions (and others) are important for our purposes

  • The population vs. the sample

    In practice, the econometrician will possess a sample of Y values corresponding to some fixed X values rather than data from the entire population of values. Therefore the econometrician will never truly know the values of $\beta_0$ and $\beta_1$.

    However, we may estimate these parameters. We will denote these estimators as $\hat\beta_0$ and $\hat\beta_1$.

  • Ordinary least squares

    So how shall we find $\hat\beta_0$ and $\hat\beta_1$? We need a method or rule for how to estimate the population parameters using sample data.

    The most widely used rule is the method of least squares, or ordinary least squares (OLS). According to this principle, a line is fitted to the data that renders the sum of the squares of the vertical distances from each data point to the line as small as possible.

  • Ordinary least squares

    Therefore the fitted line may be written as:

    $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$

    The vertical distances from the fitted line to each point are the least squares residuals, $\hat e_i$. They are given by:

    $\hat e_i = Y_i - \hat Y_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i$

  • Ordinary least squares

    Mathematically, we want to find $\hat\beta_0$ and $\hat\beta_1$ such that the sum of the squared vertical distances from the data points to the line is minimized:

    $\min_{\hat\beta_0,\,\hat\beta_1} \sum_i \hat e_i^2 = \sum_i (Y_i - \hat Y_i)^2 = \sum_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$

    If you do not recall how to find the solution for $\hat\beta_0$ and $\hat\beta_1$ using partial derivatives, the steps may be found in the course text.
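
    As a brief sketch of those steps (the standard first-order conditions, not reproduced from the course text), setting the partial derivatives of the sum of squared residuals with respect to $\hat\beta_0$ and $\hat\beta_1$ equal to zero gives the two normal equations:

    $-2 \sum_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0$

    $-2 \sum_i X_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0$

    Solving these two equations simultaneously for $\hat\beta_0$ and $\hat\beta_1$ yields the estimators on the next slide.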

  • The least squares estimators

    Upon solving this minimization problem we find that:

    $\hat\beta_1 = \dfrac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}$

    $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$

    where $\bar Y = \sum_i Y_i / N$ and $\bar X = \sum_i X_i / N$ are the sample means of the observations on Y and X.
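
    A minimal Python sketch of these two formulas, using hypothetical rainfall (cm) and umbrella-sales numbers purely for illustration:

```python
# Minimal sketch: OLS slope and intercept for a simple regression of Y on X.
# The data are hypothetical rainfall (cm) and umbrella-sales figures.
X = [5, 10, 15, 20, 25]
Y = [45, 70, 90, 120, 140]

N = len(X)
x_bar = sum(X) / N
y_bar = sum(Y) / N

# slope: sum of cross deviations over sum of squared X deviations
beta1_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
            sum((x - x_bar) ** 2 for x in X)
# intercept: Y-bar minus slope times X-bar
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)  # 21.0 4.8 for these numbers
```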

  • OLS and the true parameter values

    So how are the OLS estimators $\hat\beta_0$ and $\hat\beta_1$ related to $\beta_0$ and $\beta_1$?

    If assumptions 1 and 2 from earlier hold, then $E(\hat\beta_0) = \beta_0$ and $E(\hat\beta_1) = \beta_1$ (proof provided in the text).

    That is, if we were able to take repeated samples, the expected value of the estimators $\hat\beta_0$ and $\hat\beta_1$ would equal the true parameter values $\beta_0$ and $\beta_1$.

    When the expected value of any estimator of a parameter equals the true parameter value, then that estimator is unbiased.

    Later we will explore how violation of certain assumptions will cause estimators to be biased

  • OLS and the true parameter values

    So the idea behind OLS is that if certain assumptions hold, the expected value of the estimators $\hat\beta_0$ and $\hat\beta_1$ will equal the true parameter values $\beta_0$ and $\beta_1$.
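
    A small simulation can illustrate this repeated-samples idea. The sketch below uses arbitrarily chosen true parameters and errors satisfying assumptions 1 and 2; averaged over many samples, the OLS estimates come out close to the true values:

```python
# Monte Carlo sketch: across repeated samples, the OLS estimates average out to
# the true parameters (which are chosen arbitrarily here for illustration).
import random

random.seed(1)
beta0_true, beta1_true, sigma = 2.0, 0.5, 1.0
X = [float(x) for x in range(1, 21)]   # fixed X values
N = len(X)

def ols(X, Y):
    x_bar, y_bar = sum(X) / N, sum(Y) / N
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
         sum((x - x_bar) ** 2 for x in X)
    return y_bar - b1 * x_bar, b1

estimates = [ols(X, [beta0_true + beta1_true * x + random.gauss(0, sigma) for x in X])
             for _ in range(5000)]

print(sum(b0 for b0, _ in estimates) / len(estimates))  # close to 2.0
print(sum(b1 for _, b1 in estimates) / len(estimates))  # close to 0.5
```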

  • Coefficient of determination

    We are interested in a measure that will indicate how well our sample regression line fits the data. Let us define $y_i = Y_i - \bar Y$, the deviation of a variable from its mean. Using sample data we note that $\hat y_i = \hat Y_i - \bar Y$. Then:

    $y_i = \hat y_i + \hat e_i$

    In other words, the amount by which the data deviate from the mean can be broken into an explained portion ($\hat y_i$) and an unexplained portion, $\hat e_i$.

  • Coefficient of determination

    Using $y_i = \hat y_i + \hat e_i$, we may square both sides and sum over all N observations (the cross-product term $\sum_i \hat y_i \hat e_i$ equals zero) to obtain:

    $\sum_i (Y_i - \bar Y)^2 = \sum_i (\hat Y_i - \bar Y)^2 + \sum_i \hat e_i^2$

    We may then define the coefficient of determination $R^2$ as the ratio of explained variation to total variation:

    $R^2 = \dfrac{\sum_i (\hat Y_i - \bar Y)^2}{\sum_i (Y_i - \bar Y)^2} = 1 - \dfrac{\sum_i \hat e_i^2}{\sum_i (Y_i - \bar Y)^2}$

  • Coefficient of determination

    Therefore we have:

    $\sum_i (Y_i - \bar Y)^2$: Total sum of squares (TSS). A measure of total variation in Y about the mean.

    $\sum_i (\hat Y_i - \bar Y)^2$: Explained sum of squares (ESS). The part of total variation in Y about the mean that is explained by the sample regression.

    $\sum_i \hat e_i^2$: Residual sum of squares (RSS). The part of total variation in Y about the mean that is not explained by the sample regression.
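
    Continuing with the hypothetical data from the earlier OLS sketch, $R^2$ can be computed directly from these three sums of squares:

```python
# Sketch: coefficient of determination from TSS, ESS, and RSS,
# using the same hypothetical rainfall/sales data as before.
X = [5, 10, 15, 20, 25]
Y = [45, 70, 90, 120, 140]
N = len(X)
x_bar, y_bar = sum(X) / N, sum(Y) / N

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
Y_hat = [b0 + b1 * x for x in X]

TSS = sum((y - y_bar) ** 2 for y in Y)                 # total sum of squares
ESS = sum((yh - y_bar) ** 2 for yh in Y_hat)           # explained sum of squares
RSS = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))    # residual sum of squares

print(ESS / TSS, 1 - RSS / TSS)  # the two expressions for R^2 agree
```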

  • Coefficient of determination

    It can also be shown that:

    $R^2 = \dfrac{\sum_i (\hat Y_i - \bar Y)^2}{\sum_i (Y_i - \bar Y)^2} = \hat\beta_1^2 \, \dfrac{\sum_i (X_i - \bar X)^2}{\sum_i (Y_i - \bar Y)^2} = r_{XY}^2 = \dfrac{\left[\sum_i (X_i - \bar X)(Y_i - \bar Y)\right]^2}{\left(\sum_i X_i^2 - N\bar X^2\right)\left(\sum_i Y_i^2 - N\bar Y^2\right)}$

    Its limits are $0 \le R^2 \le 1$. If $\hat Y_i = Y_i$ for each i, then $R^2 = 1$.

    How would the regression line appear graphically if 2=0? What is the intuition?

  • Coefficient of determination

    One should remain level-headed upon finding the $R^2$:

    It would not be surprising to find an $R^2$ near 1 when working with particular types of time series data that trend smoothly over time.

    It would not be surprising to find a relatively low $R^2$ when working with microeconomic data involving consumer behavior. Variations in individual behavior may be difficult to fully explain.

    There are several other measures that are important when evaluating a model:

    Signs and magnitudes of the estimates
    Precision of the estimates
    The model's predictive value

  • What makes a good estimator?

    Unbiasedness

    Earlier we stated that an estimator is unbiased if its mean is equal to the true value of the parameter being estimated

    Efficiency

    The smaller the variance of an estimator, the better the chance that the estimate is close to the actual value of the parameter, which is unknown.

  • What makes a good estimator?

    Restricting attention to estimators that are linear functions of the observations on the dependent variable makes the search for the unbiased estimator with the smallest variance manageable.

    An estimator that is linear, unbiased, and that has minimum variance among all linear unbiased estimators is called the best linear unbiased estimator (BLUE).

  • Assumptions when running OLS

    We require several assumptions in order for the OLS estimators to be BLUE:

    1. $Y = \beta_0 + \beta_1 X + e$

    2. $E(e) = 0$. It is important that the factors not explicitly included in the model, and therefore incorporated into $e$, do not systematically affect the average value of Y. That is, the positive values of $e$ cancel out the negative values so that their average effect on Y is zero.

    3. $\mathrm{var}(e) = \mathrm{var}(Y) = \sigma^2$. This is the assumption of homoskedasticity. Otherwise, our estimators will not have minimum variance.

  • Assumptions when running OLS

    4. $\mathrm{cov}(e_i, e_j) = \mathrm{cov}(Y_i, Y_j) = 0$ for $i \ne j$. That is, the covariance between any pair of random errors is zero. Otherwise, our estimators will not have minimum variance.

    5. The variable X is not random and must take at least two different values. Without this condition, we cannot run OLS. Quite simply, if there is no variation in the X variable, then we will not be able to explain variation in the Y variable.

    6. The values of $e$ are normally distributed about their mean, and therefore Y is normally distributed (this is necessary for hypothesis testing, which we will discuss in a later lecture):

    $e \sim N(0, \sigma^2)$

  • Variance

    While the econometrician can never be certain that the estimates obtained are equal or close to the true parameters of the model (as the true parameters are unknowable), finding a coefficient with relatively small variance will certainly give him or her more confidence that the estimate is good

    That is, given two different distributions of $\hat\beta_1$ with the same mean, we prefer the distribution with smaller variance

    Variance size will be shown to be crucial when testing hypotheses

  • Variance

    Given our previous definition of variance, if our assumptions (1-5) hold, it can be shown that the variances of $\hat\beta_0$ and $\hat\beta_1$ are:

    $\mathrm{var}(\hat\beta_0) = \sigma^2 \, \dfrac{\sum_i X_i^2}{N \sum_i (X_i - \bar X)^2}$

    $\mathrm{var}(\hat\beta_1) = \dfrac{\sigma^2}{\sum_i (X_i - \bar X)^2}$

    How does the extent to which X is spread out relate to these variances?

  • Variance

    In addition, we may be interested in the variance of the random error term

    The variance of the random error $e_i$ is:

    $\mathrm{var}(e_i) = \sigma^2 = E\big[(e_i - E(e_i))^2\big] = E(e_i^2)$, since $E(e_i) = 0$.

    Of course, the random errors are unobservable. So how shall we proceed?

  • Variance

    Recall that:

    $\hat e_i = Y_i - \hat Y_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i$

    We may therefore replace $e_i$ with $\hat e_i$:

    $\hat\sigma^2 = \dfrac{\sum_i \hat e_i^2}{N}$

    However, we must modify this formula slightly based on the number of regression parameters K (what is the intuition?). When dealing with only $\hat\beta_0$ and $\hat\beta_1$, K = 2. Therefore the formula that we use to ensure an unbiased estimator is:

    $\hat\sigma^2 = \dfrac{\sum_i \hat e_i^2}{N - K} = \dfrac{\sum_i \hat e_i^2}{N - 2}$

  • Variance

    Now that we have found $\hat\sigma^2$, an unbiased estimator of $\sigma^2$, we may write:

    $\widehat{\mathrm{var}}(\hat\beta_0) = \hat\sigma^2 \, \dfrac{\sum_i X_i^2}{N \sum_i (X_i - \bar X)^2}$

    $\widehat{\mathrm{var}}(\hat\beta_1) = \dfrac{\hat\sigma^2}{\sum_i (X_i - \bar X)^2}$

    The square roots of the estimated variances are the standard errors of $\hat\beta_0$ and $\hat\beta_1$.
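
    A short Python sketch of these formulas, again with the hypothetical rainfall/sales data and K = 2:

```python
# Sketch: unbiased estimate of sigma^2 and the standard errors of the
# OLS estimators (hypothetical data, K = 2 regression parameters).
X = [5, 10, 15, 20, 25]
Y = [45, 70, 90, 120, 140]
N, K = len(X), 2
x_bar, y_bar = sum(X) / N, sum(Y) / N

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(X, Y)]

sigma2_hat = sum(e ** 2 for e in residuals) / (N - K)     # unbiased estimator of sigma^2
Sxx = sum((x - x_bar) ** 2 for x in X)

var_b0 = sigma2_hat * sum(x ** 2 for x in X) / (N * Sxx)  # estimated var(beta0_hat)
var_b1 = sigma2_hat / Sxx                                 # estimated var(beta1_hat)

print(sigma2_hat, var_b0 ** 0.5, var_b1 ** 0.5)  # sigma^2 hat and the two standard errors
```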

  • Covariance

    Earlier in the lecture we defined

    $\mathrm{var}(Y) = E\big[(Y - E(Y))(Y - E(Y))\big] = E(Y^2) - [E(Y)]^2$

    By extension we may define the covariance between two random variables X and Y as:

    $\mathrm{cov}(X, Y) = \sigma_{XY} = E\big[(X - E(X))(Y - E(Y))\big] = E(XY) - E(X)E(Y)$

    Positive covariance: When X is above (below) its mean, Y is likely to be above (below) its mean, and vice versa.

    Negative covariance: When X is above (below) its mean, Y is likely to be below (above) its mean, and vice versa

  • Coefficient of correlation

    However, interpreting $\sigma_{XY}$ is difficult because its magnitude may arbitrarily increase or decrease depending on the units of measurement. We may therefore scale the covariance by the standard deviations of the variables and define the coefficient of correlation as:

    $\rho_{XY} = \dfrac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)}\sqrt{\mathrm{var}(Y)}} = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$

    Its limits are $-1 \le \rho_{XY} \le 1$, where $|\rho_{XY}| = 1$ indicates a perfect linear relationship between X and Y.
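
    As an illustration, the sample analogues of these population quantities can be computed for the same hypothetical rainfall/sales data used above:

```python
# Sketch: sample covariance and sample correlation between X and Y
# (sample analogues of the population definitions above; hypothetical data).
X = [5, 10, 15, 20, 25]
Y = [45, 70, 90, 120, 140]
N = len(X)
x_bar, y_bar = sum(X) / N, sum(Y) / N

cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / N
sd_x = (sum((x - x_bar) ** 2 for x in X) / N) ** 0.5
sd_y = (sum((y - y_bar) ** 2 for y in Y) / N) ** 0.5

r_xy = cov_xy / (sd_x * sd_y)   # unit-free, between -1 and 1
print(cov_xy, r_xy)
```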

  • Covariance

    Covariance between $\hat\beta_0$ and $\hat\beta_1$ is also a measure of the association between the two estimators:

    $\mathrm{cov}(\hat\beta_0, \hat\beta_1) = E\big[(\hat\beta_0 - E(\hat\beta_0))(\hat\beta_1 - E(\hat\beta_1))\big]$

    It can then be shown that:

    $\mathrm{cov}(\hat\beta_0, \hat\beta_1) = -\bar X \, \dfrac{\sigma^2}{\sum_i (X_i - \bar X)^2}$

    Now that we have explored the theoretical background required to appreciate OLS, let's start working with an actual data set.