topic 1 slides 2015

A Non-technical Introduction to Regression () Introductory Econometrics: Topic 1 1 / 90

Upload: adarabseh

Post on 10-Nov-2015

  • A Non-technical Introduction to Regression


  • This first set of lectures will (quickly) go through basic data analysis up to regression in a relatively non-technical fashion.

    Based on Chapters 1 and 2 of the textbook.

    This material should mostly be a review of what you know from your previous study (e.g. in your second year course).

    Since you have covered this material before, I will go through it quickly, with a focus on the most important tool of the applied economist: regression.

    Please read through Chapters 1 and 2, particularly if you need some review of this material.

  • Types of Economic Data

    This section introduces types of data used by economists and defines the notation and terminology associated with them.

    Time Series Data

    Common in macroeconomics and finance.

    Examples: Gross Domestic Product (GDP), stock prices, interest rates, exchange rates (called "time series variables" or simply "variables").

    Data is collected at specific points in time (e.g. every month, every day or every year).

    \(Y_t\) is the observation on variable Y at time t.

    A time series runs from period t = 1 to t = T.

  • Cross-sectional Data

    Characterized by individual units such as companies, people or countries.

    E.g. the wage of each of 100 individuals in a survey.

    Note: ordering does not matter (unlike with time series data).

    \(Y_i\) is the observation for individual i, for i = 1 to N.

    Note: often we have quantitative data (e.g. wages are measured in pounds, so the data will be a number).

    Sometimes data is qualitative. E.g. a survey may ask whether each worker is Male or Female.

    Econometricians convert qualitative answers into numeric data (e.g. Male=1, Female=0).

    Variables which take on only the values 0 or 1 are referred to as dummy variables.
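
    The conversion of qualitative answers into a dummy variable can be sketched in a few lines of Python. The function name `to_dummy` and the four survey answers are hypothetical, chosen only for illustration; the coding Male=1, Female=0 is the one used on the slide.

```python
def to_dummy(answers, one_value="Male"):
    """Code each qualitative answer as 1 if it equals one_value, else 0."""
    return [1 if answer == one_value else 0 for answer in answers]


# Hypothetical survey responses (not from the course data sets).
survey = ["Male", "Female", "Female", "Male"]

dummies = to_dummy(survey)
share_male = sum(dummies) / len(dummies)  # the mean of a dummy is a proportion

print(dummies)
print(share_male)
```

    A convenient side effect of 0/1 coding is that the mean of the dummy variable equals the proportion of observations coded 1.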

  • Panel Data

    Sometimes we have data with both a time series and a cross-sectional component.

    E.g. survey 100 workers every year for 5 years.

    Such data is referred to as panel data.

    We will not have time to cover panel data in this course, but read Chapter 8 of the textbook if you want to learn more (e.g. if you are using panel data in your dissertation).

  • Graphical Methods

    Time Series Graphs

    Monthly time series data from January 1947 through October 1996 on the U.K. pound/U.S. dollar exchange rate is plotted in Figure 1.1.

    [Figure 1.1: Time Series Plot of UK Pound to US Dollar Exchange Rate. Horizontal axis: Year; vertical axis: pound/dollar exchange rate.]

  • Histograms (frequency distributions)

    Commonly used with cross-sectional data.

    Example: real GDP per capita in 1992 for 90 countries.

    A frequency table counts how many countries have GDP falling in different class intervals or bins.

    Table 1.1: Frequency Table for GDP per capita Data

    Class Interval          Frequency
    $0 to $2,000            33
    $2,001 to $4,000        22
    $4,001 to $6,000        7
    $6,001 to $8,000        3
    $8,001 to $10,000       4
    $10,001 to $12,000      2
    $12,001 to $14,000      9
    $14,001 to $16,000      6
    $16,001 to $18,000      4
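
    As a rough sketch of how a frequency table like Table 1.1 is built, the Python below counts whole-dollar values into $2,000-wide class intervals. The function name and most of the sample values are hypothetical; only $408 and $17,945 (the minimum and maximum quoted later in these slides) come from the course data.

```python
def frequency_table(values, bin_width, n_bins):
    """Count whole-dollar values per class interval.

    Bin 0 covers 0 to bin_width, bin 1 covers bin_width + 1 to
    2 * bin_width, and so on, matching the layout of Table 1.1.
    """
    counts = [0] * n_bins
    for value in values:
        # (value - 1) // bin_width puts an upper edge such as 2000 in bin 0
        # and 2001 in bin 1; max() keeps a value of 0 in the first bin.
        k = max(0, (value - 1) // bin_width)
        counts[min(k, n_bins - 1)] += 1
    return counts


# Hypothetical GDP per capita figures in dollars, for illustration only.
gdp = [408, 1500, 2000, 2001, 3900, 5200, 17945]

table = frequency_table(gdp, bin_width=2000, n_bins=9)
print(table)
```

    Plotting these counts as bars, one bar per class interval, gives the histogram.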

  • Histogram makes a bar chart out of a frequency table.

    [Figure 1.2: Histogram of GDP per capita for 90 Countries. Horizontal axis: GDP per capita (thousands of US dollars); vertical axis: Frequency.]

    Histogram is closely related to the idea of a distribution (a point we will build on later).

  • XY-Plots (scatter diagrams)

    Used to shed light on the relationship between two variables.

    Example: deforestation versus population density.

    [Figure 1.3: XY-Plot of Population Density Against Deforestation. Horizontal axis: Population per thousand hectares; vertical axis: Average annual forest loss (%). One point is labelled: Nicaragua.]

  • Descriptive Statistics

    Graphs have an immediate visual impact that is useful for livening up an essay or report.

    But usually we need to be numerically precise.

    Almost all of what we do in this course will involve numerical (as opposed to graphical) summaries such as descriptive statistics.

    These require some mathematical tools.

    Before descriptive statistics, we introduce some maths from Appendix A of the textbook.

  • Mathematical Basics

    The level of mathematics used in this course is not too high, and you should have learned it in your second year course.

    But let me provide a brief summary of the basic mathematical tools used in the course.

    The Equation of a Straight Line

    Let Y and X be two variables.

    Any straight line relationship between the variables can be written in terms of the equation:

    \[ Y = \alpha + \beta X \]

    \(\alpha\) and \(\beta\) are coefficients which determine a particular line.

    Any line can be defined by its intercept and slope.

    \(\alpha\) is the intercept and \(\beta\) the slope.

    Figure A.1 has an intercept of one (\(\alpha = 1\)) and a slope of two (\(\beta = 2\)).

  • [Figure A.1: A Straight Line. Horizontal axis: X; vertical axis: Y. Annotations: Intercept = \(\alpha\); Slope of Line = \(\beta\).]

  • Mathematical Basics (cont.): Logarithms

    For reasons which will be explained in the course, sometimes we will want to work with the natural logarithm of A instead of A itself.

    Notation: \(\ln(A)\).

    Formal definition provided on page 335 of the textbook.

    Key thing: \(\ln(A)\) is a function of A that can be calculated by the computer.

    The inverse of the natural logarithm function is the exponential function, denoted by \(\exp(A)\) (i.e. \(\exp(\ln(A)) = A\)).

    Properties that we will use:

    \[ \ln(AB) = \ln(A) + \ln(B) \]

    \[ \ln\left(A^B\right) = B \ln(A) \]

    \[ \ln(1 + A) \approx A \text{ if } A \text{ is small.} \]
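
    The logarithm properties above can be checked numerically. The following Python sketch uses arbitrary illustrative values (A = 3, B = 4, and the "small" and "large" numbers 0.01 and 0.5 are assumptions, not values from the course).

```python
import math

A, B = 3.0, 4.0  # arbitrary positive numbers chosen for illustration

# ln(AB) = ln(A) + ln(B)
product_rule_gap = abs(math.log(A * B) - (math.log(A) + math.log(B)))

# ln(A^B) = B ln(A)
power_rule_gap = abs(math.log(A ** B) - B * math.log(A))

# exp(ln(A)) = A
inverse_gap = abs(math.exp(math.log(A)) - A)

# ln(1 + a) is close to a when a is small, and less close when a is larger.
small, large = 0.01, 0.5
approx_error_small = abs(math.log(1 + small) - small)
approx_error_large = abs(math.log(1 + large) - large)

print(product_rule_gap, power_rule_gap, inverse_gap)
print(approx_error_small, approx_error_large)
```

    The first three gaps are zero up to floating-point rounding, while the approximation error grows noticeably as the argument moves away from zero.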

  • Mathematical Basics (cont.): Summation

    Remember: \(Y_i\) is the observation for individual i, for i = 1 to N.

    Example: wages for 100 individuals.

    The Greek letter \(\Sigma\) is the summation (or "adding up") operator, and the superscripts and subscripts on \(\Sigma\) indicate the observations that are being added up.

    Example:

    \[ \sum_{i=1}^{100} Y_i \]

    This adds up the wages for all of the 100 individuals.

  • Properties of the Summation Operator

    Let \(X_i\) and \(Y_i\) for i = 1, .., N be observations on two variables and c be a constant.

    \[ \sum_{i=1}^{N} c X_i = c \sum_{i=1}^{N} X_i \]

    \[ \sum_{i=1}^{N} (X_i + Y_i) = \sum_{i=1}^{N} X_i + \sum_{i=1}^{N} Y_i \]

    \[ \sum_{i=1}^{N} c = cN \]
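
    A small Python check of the three summation properties, using made-up numbers (the lists `X`, `Y` and the constant `c` are hypothetical):

```python
X = [1.0, 2.0, 3.0, 4.0]      # hypothetical observations on X
Y = [10.0, 20.0, 30.0, 40.0]  # hypothetical observations on Y
c = 2.5                       # an arbitrary constant
N = len(X)

# Property 1: a constant can be pulled outside the sum.
lhs_scale = sum(c * x for x in X)
rhs_scale = c * sum(X)

# Property 2: the sum of (X_i + Y_i) is the sum of X_i plus the sum of Y_i.
lhs_add = sum(x + y for x, y in zip(X, Y))
rhs_add = sum(X) + sum(Y)

# Property 3: adding up a constant N times gives c * N.
lhs_const = sum(c for _ in range(N))
rhs_const = c * N

print(lhs_scale, rhs_scale)
print(lhs_add, rhs_add)
print(lhs_const, rhs_const)
```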

  • Back to Descriptive Statistics (or Summary Statistics)

    Example (continued): real GDP per capita across the 90 countries.

    The histogram in Figure 1.2 plots the distribution of this variable.

    It is common to present the mean and standard deviation to summarize numerically the main features of the data.

    The mean is a measure of location (intuition: mean, average, typical value, middle of distribution are similar concepts).

    Remember: \(Y_1, .., Y_N\) are N different observations on our variable (N = 90 countries).

    This is the sample.

    The mean is given by:

    \[ \bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} \]

  • For real GDP data, \(\bar{Y}\) = $5,443.80.

    Mean or average can hide variation across countries.

    Other useful summary statistics are the minimum and maximum.

    Minimum GDP per capita is $408 (Chad).

    Maximum is $17,945 (U.S.).

    The difference between minimum and maximum is one measure of dispersion.

    Dispersion = measure of variability, how spread out, how unequal, etc.

  • A more common measure of dispersion is the standard deviation:

    \[ s = \sqrt{\frac{\sum_{i=1}^{N} \left( Y_i - \bar{Y} \right)^2}{N - 1}} \]

    Note: it is also common to use the variance, which is \(s^2\).

    In the GDP example: s = $5,369.496.

    Difficult to interpret in an absolute sense, but useful for relative comparisons.

    When comparing two different distributions, the one with the smaller standard deviation will exhibit less dispersion.
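
    The mean and standard deviation formulas translate directly into code. This is a minimal Python sketch on a hypothetical sample of five wages (not course data); it also compares the hand-written standard deviation with `statistics.stdev` from the standard library, which uses the same N - 1 divisor.

```python
import math
import statistics


def mean(values):
    """Sample mean: the sum of the observations divided by N."""
    return sum(values) / len(values)


def sample_std(values):
    """Standard deviation with the N - 1 divisor, as in the formula above."""
    y_bar = mean(values)
    sum_sq_dev = sum((y - y_bar) ** 2 for y in values)
    return math.sqrt(sum_sq_dev / (len(values) - 1))


# Hypothetical wages for five individuals, for illustration only.
wages = [12.0, 15.0, 9.0, 20.0, 14.0]

y_bar = mean(wages)
s = sample_std(wages)
variance = s ** 2
dispersion_range = max(wages) - min(wages)  # maximum minus minimum

print(y_bar, s, variance, dispersion_range)
print(statistics.stdev(wages))  # library value for comparison
```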

  • Correlation

    Correlation numerically measures the degree of association or the strength of the relationship between two variables.

    Let X and Y be two variables (e.g. population density and deforestation, respectively).

    The formula to calculate the correlation between X and Y is:

    \[ r = \frac{\sum_{i=1}^{N} \left( Y_i - \bar{Y} \right) \left( X_i - \bar{X} \right)}{\sqrt{\sum_{i=1}^{N} \left( Y_i - \bar{Y} \right)^2} \sqrt{\sum_{i=1}^{N} \left( X_i - \bar{X} \right)^2}} \]

    Don't worry about memorizing this: the computer will calculate it for you.

    Note: if unclear from context, use subscripts: \(r_{XY}\) is the correlation between variables X and Y.
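
    For readers who want to see the correlation formula at work, here is a short Python sketch. The function name `correlation` and the five data points are hypothetical; the computation follows the formula term by term.

```python
import math


def correlation(xs, ys):
    """Sample correlation r between two equally long lists of numbers."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    denominator = (math.sqrt(sum((y - y_bar) ** 2 for y in ys))
                   * math.sqrt(sum((x - x_bar) ** 2 for x in xs)))
    return numerator / denominator


# Hypothetical data, for illustration only.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]

r_xy = correlation(X, Y)
r_yx = correlation(Y, X)  # correlation is symmetric in X and Y
r_xx = correlation(X, X)  # a variable is perfectly correlated with itself

print(r_xy, r_yx, r_xx)
```

    For these numbers the numerator is 8 and the denominator is 10, so r is 0.8 up to rounding.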

  • Properties of Correlation

    r lies between -1 and 1.

    Positive values of r indicate a positive correlation between X and Y. Negative values indicate a negative correlation. r = 0 indicates that X and Y are uncorrelated.

    Larger positive values of r indicate stronger positive correlation. r = 1 indicates perfect positive correlation.

    Larger negative values of r indicate stronger negative correlation. r = -1 indicates perfect negative correlation.

    The correlation between Y and X is the same as the correlation between X and Y.

    The correlation between any variable and itself is 1.

  • Example: relationship between deforestation (Y) and population density (X).

    Figure 1.3 plotted this data in an XY-plot.

    r = 0.66.

    Positive relationship between deforestation and population density.

    How do you interpret the precise number?

    \(r^2\) measures the proportion of the cross-country variability in deforestation that is explained by the variability in population density.

    Correlation is a numerical measure of the degree to which patterns in X and Y correspond.

    Since \(0.66^2 = 0.44\), 44% of cross-country variance in deforestation can be explained by variance in population density.

  • Understanding Why Variables are Correlated

    Care must always be taken with interpreting correlations (or regression results).

    Correlation does not necessarily imply causality.

    Example: smoking causes lung cancer, but drinking alcohol does not.

    How would these facts reveal themselves in a data set?

    X = number of cigarettes smoked

    Y = lung cancer rate

  • We would find \(r_{XY} > 0\).

    This correlation does reflect causality.

    Let Z be a measure of alcohol consumption.

    In practice, we often find \(r_{XZ} > 0\) (smokers tend to drink more than non-smokers).

    This will often lead to \(r_{YZ} > 0\) (correlation between drinking and lung cancer positive).

    But this correlation does not reflect causality.
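
    The smoking and drinking story can be mimicked with a tiny constructed data set. In the Python sketch below every number is invented for illustration: Y is built to depend only on X (cigarettes), never on Z (alcohol), yet Z ends up positively correlated with Y simply because Z moves with X.

```python
import math


def correlation(xs, ys):
    """Sample correlation r between two equally long lists of numbers."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    denominator = (math.sqrt(sum((y - y_bar) ** 2 for y in ys))
                   * math.sqrt(sum((x - x_bar) ** 2 for x in xs)))
    return numerator / denominator


# Hypothetical data: X = cigarettes smoked, Z = alcohol consumption.
X = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0]
Z = [1.0, 2.0, 2.0, 4.0, 3.0, 5.0]  # heavier smokers tend to drink more

# Y = lung cancer rate, constructed to depend on X only (Z plays no role).
Y = [2.0 + 0.4 * x for x in X]

r_xy = correlation(X, Y)
r_xz = correlation(X, Z)
r_yz = correlation(Y, Z)

print(r_xy, r_xz, r_yz)
```

    Here the positive correlation between Y and Z reflects nothing but the link between Z and X, which is exactly the non-causal pattern described above.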

  • Understanding Correlation Through XY-plots

    Figure 1.3 (plot of deforestation versus population density) is the type of graph we obtain with moderately correlated variables.

    Figure 1.6 shows what happens with perfectly correlated variables (with r = 1).

    All the points lie exactly on a straight line.

    Stronger correlations imply stronger patterns (less scattering of data points) in an XY-plot.

  • [Figure 1.6: XY-Plot of Two Perfectly Correlated Variables. Horizontal axis: X; vertical axis: Y.]

  • Figure 1.7 plots two completely uncorrelated variables (r = 0).

    Note that the points are randomly scattered over the entire graph.

    [Figure 1.7: XY-Plot of Uncorrelated Variables. Horizontal axis: X; vertical axis: Y.]

  • Correlation Between Several Variables

    When we have many variables we often want to obtain the correlation between each pair.

    E.g. with X, Y and Z, there are three possible correlations (\(r_{XY}\), \(r_{XZ}\) and \(r_{YZ}\)).

    Can put them in a correlation matrix.

    Table 1.2: A Correlation Matrix

         X        Y        Z
    X    1.000
    Y    r_XY     1.000
    Z    r_XZ     r_YZ     1.000

  • Regression

    Regression is the most important tool of the applied economist.

    Used to understand the relationships between many variables.

    We begin with simple regression to understand the relationship between two variables, X and Y.

    Regression can be understood as a "Best Fitting Line".

    Example: see Figure 2.1, which is an XY-plot of X = output versus Y = costs of production for 123 electric utility companies in the U.S. in 1970.

    The microeconomist will want to understand the relationship between output and costs.

    Regression fits a line through the points in the XY-plot that best captures the relationship between output and costs.

  • [Figure 2.1: XY-plot of Output versus Costs. Horizontal axis: Output (Millions of KwH); vertical axis: Costs (Millions of $).]

  • Simple Regression: Some Theory

    Question: What do we mean by "best fitting" line?

    Assume a linear relationship between X = output and Y = costs:

    \[ Y = \alpha + \beta X \]

    where \(\alpha\) is the intercept of the line and \(\beta\) its slope.

  • Even if the straight line relationship were true, we would never get all points on an XY-plot lying precisely on it, due to measurement error.

    The true relationship is probably more complicated; a straight line may just be an approximation.

    Important variables which affect Y may be omitted.

    Due to these factors we add an error, \(\varepsilon\), which yields the regression model:

    \[ Y = \alpha + \beta X + \varepsilon \]

  • What we know: X and Y.

    What we do not know: \(\alpha\), \(\beta\) and \(\varepsilon\).

    Regression analysis uses data (X and Y) to make a guess or estimate of what \(\alpha\) and \(\beta\) are.

    Notation: \(\hat{\alpha}\) and \(\hat{\beta}\) are the estimates of \(\alpha\) and \(\beta\).

  • Distinction Between Errors and Residuals

    We have data for i = 1, .., N individuals (or countries, or companies, etc.).

    Individual observations are denoted using subscripts: \(Y_i\) for i = 1, .., N and \(X_i\) for i = 1, .., N.

    True regression line holds for every observation:

    \[ Y_i = \alpha + \beta X_i + \varepsilon_i \]

    The error for the i-th individual can be written as:

    \[ \varepsilon_i = Y_i - \alpha - \beta X_i \]

  • If we replace \(\alpha\) and \(\beta\) by estimates, we get the fitted (or estimated) regression line:

    \[ \hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i \]

    and residuals are given by

    \[ \hat{\varepsilon}_i = Y_i - \hat{\alpha} - \hat{\beta} X_i \]

    Residuals measure the distance that each observation is from the fitted regression line.

    A good fitting regression line will have observations lying near the regression line and, thus, residuals will be small.

  • Derivation of OLS Estimator

    How do we choose \(\hat{\alpha}\) and \(\hat{\beta}\)?

    A regression line which fits well will make residuals as small as possible.

    The usual way of measuring the size of the residuals is the sum of squared residuals (SSR), which can be written in the following (equivalent) ways:

    \[ SSR = \sum_{i=1}^{N} \hat{\varepsilon}_i^2 = \sum_{i=1}^{N} \left( Y_i - \hat{\alpha} - \hat{\beta} X_i \right)^2 = \sum_{i=1}^{N} \left( Y_i - \hat{Y}_i \right)^2 \]

    The ordinary least squares (OLS) estimator finds values of \(\hat{\alpha}\) and \(\hat{\beta}\) which minimize SSR.

    The formula for the OLS estimator will be discussed later (in practice, econometrics software packages will calculate \(\hat{\alpha}\) and \(\hat{\beta}\)).
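
    As a preview of what the software does, the Python sketch below computes OLS estimates for a simple regression and checks that they give a smaller SSR than nearby alternatives. The closed-form expressions used for the slope and intercept are the standard textbook result, not a formula taken from this slide (which defers it), and the five data points are hypothetical.

```python
def ols_simple(xs, ys):
    """Return (alpha_hat, beta_hat) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Standard closed form: slope = sum of cross-deviations / sum of squared
    # X deviations; intercept makes the line pass through (x_bar, y_bar).
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def ssr(xs, ys, alpha, beta):
    """Sum of squared residuals for the line Y = alpha + beta * X."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.1, 4.9, 7.2, 8.8, 11.0]

alpha_hat, beta_hat = ols_simple(X, Y)
ssr_ols = ssr(X, Y, alpha_hat, beta_hat)

# Any other line should fit no better than the OLS line.
ssr_steeper = ssr(X, Y, alpha_hat, beta_hat + 0.1)
ssr_shifted = ssr(X, Y, alpha_hat + 0.1, beta_hat)

print(alpha_hat, beta_hat, ssr_ols)
```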

  • Jargon of Regression

    Y = dependent variable.

    X = explanatory (or independent) variable.

    \(\alpha\) and \(\beta\) are coefficients.

    \(\hat{\alpha}\) and \(\hat{\beta}\) are OLS estimates of the coefficients.

    "Run a regression of Y on X".

  • Interpreting OLS Estimates

    Remember the fitted regression line is

    \[ \hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i \]

    Interpretation of \(\hat{\alpha}\) is the estimated value of Y if X = 0.

    Example: X = lot size, Y = house price. \(\hat{\alpha}\) = estimated value of a house with lot size = 0 (not of interest since houses with lot size equal to zero do not exist).

    \(\hat{\beta}\) is usually (but not always) the coefficient of most interest.

    The following are a few different ways of interpreting \(\hat{\beta}\).

    \(\hat{\beta}\) is the slope of the best fitting straight line through an XY-plot such as Figure 2.2.

  • [Figure 2.2: XY-plot of Output versus Costs with Fitted Regression Line. Horizontal axis: Output (Millions of KwH); vertical axis: Costs (Millions of $).]

  • \(\hat{\beta}\) can be interpreted as a derivative:

    \[ \hat{\beta} = \frac{d \hat{Y}_i}{d X_i} \]

    \(\hat{\beta}\) is the marginal effect of X on Y.

    It is a measure of how much the explanatory variable influences the dependent variable.

    \(\hat{\beta}\) is a measure of how much Y tends to change when X is changed by one unit.

    The definition of "unit" depends on the particular data set being studied.

  • Example: Costs of production in the electric utility industry data set

    Using the data set in Figures 2.1 and 2.2 we find \(\hat{\beta} = 4.79\).

    A measure of how much costs tend to change when output changes by a small amount.

    Costs are measured in millions of dollars and output is measured in millions of kilowatt hours of electricity produced.

    Thus: if output is increased by one million kilowatt hours (i.e. a change of one unit in the explanatory variable), costs will tend to increase by $4,790,000.

  • Measuring the Fit of a Regression Model

    The most common measure of fit is \(R^2\).

    Intuition: "variability" = (e.g.) how costs vary across companies.

    Total variability in dependent variable Y = variability explained by the explanatory variable (X) in the regression + variability that cannot be explained and is left as error.

    \(R^2\) measures the proportion of the variability in Y that can be explained by X.

  • Formalizing the Definition of R-squared

    Remember that variance is a measure of dispersion or variability.

    The variance of any variable can be estimated by:

    \[ var(Y) = \frac{\sum_{i=1}^{N} \left( Y_i - \bar{Y} \right)^2}{N - 1} \]

    where \(\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}\) is the mean, or average value, of the variable.

    Total sum of squares (TSS) is proportional to the variance of the dependent variable:

    \[ TSS = \sum_{i=1}^{N} \left( Y_i - \bar{Y} \right)^2 \]

  • The following is not hard to prove:

    \[ TSS = RSS + SSR \]

    RSS is the regression sum of squares, a measure of the explanation provided by the regression model:

    \[ RSS = \sum_{i=1}^{N} \left( \hat{Y}_i - \bar{Y} \right)^2 \]

    SSR is the sum of squared residuals.

    Formalizes the idea that variability in Y can be broken into explained and unexplained parts.

    We can now define our measure of fit:

    \[ R^2 = \frac{RSS}{TSS} \]

    or, equivalently,

    \[ R^2 = 1 - \frac{SSR}{TSS}. \]

  • Note that TSS, RSS and SSR are all sums of squared numbers and, hence, are all non-negative.

    This implies \(TSS \geq RSS\) and \(TSS \geq SSR\). Using these facts, it can be seen that \(0 \leq R^2 \leq 1\).

    Intuition: small values of SSR indicate that residuals are small and, hence, the regression model is fitting well.

    Thus, values of \(R^2\) near 1 imply a good fit, and \(R^2 = 1\) implies a perfect fit.

    Intuition: RSS measures how much of the variation in Y the explanatory variables explain. If RSS is near zero, then we have little explanatory power (a bad fit) and \(R^2\) is near zero.

  • Example: In the regression of Y = cost of production on X = output for the 123 electric utility companies, \(R^2 = 0.92\). The fit of the regression line is quite good.

    92% of the variation in costs across companies can be explained by the variation in output.

    In simple regression (but not multiple regression), \(R^2\) is the correlation between Y and X squared.
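
    The decomposition TSS = RSS + SSR and the two equivalent definitions of R-squared can be verified numerically. The Python sketch below uses hypothetical data and the standard closed-form OLS estimates for simple regression (an assumption, since the slides leave the OLS formula for later); it also checks that R-squared equals the squared correlation in the simple regression case.

```python
import math

# Hypothetical data, for illustration only.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.1, 4.9, 7.2, 8.8, 11.0]
N = len(X)

x_bar = sum(X) / N
y_bar = sum(Y) / N
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)

# Standard OLS estimates for simple regression.
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar
fitted = [alpha_hat + beta_hat * x for x in X]

tss = sum((y - y_bar) ** 2 for y in Y)                     # total variability
rss = sum((y_hat - y_bar) ** 2 for y_hat in fitted)        # explained part
ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(Y, fitted)) # unexplained part

r2_from_rss = rss / tss
r2_from_ssr = 1 - ssr / tss

# Correlation between X and Y, to compare r squared with R-squared.
r = s_xy / math.sqrt(s_xx * tss)

print(tss, rss, ssr)
print(r2_from_rss, r2_from_ssr, r ** 2)
```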

  • Basic Statistical Concepts in the Regression Model

    \(\hat{\alpha}\) and \(\hat{\beta}\) are only estimates of \(\alpha\) and \(\beta\). How accurate are the estimates?

    This can be investigated through confidence intervals.

    Closely related to the confidence interval is the concept of a hypothesis test.

    Intuition relating to confidence intervals and hypothesis tests is given here; formal derivation is provided later.

  • Confidence Intervals

    Example: \(\hat{\beta} = 4.79\) is the point estimate of \(\beta\) in the regression of costs of production on output using our electric utility industry data.

    The point estimate is the best guess of what \(\beta\) is.

    Confidence intervals provide interval estimates which give a range in which you are highly confident that \(\beta\) must lie.

    Example: if the confidence interval is [4.53, 5.05]:

    "We are confident that \(\beta\) is greater than 4.53 and less than 5.05."

  • We can obtain different confidence intervals corresponding to different levels of confidence.

    95% confidence interval: "we are 95% confident that \(\beta\) lies in the interval".

    90% confidence interval: "we are 90% confident that \(\beta\) lies in the interval", etc.

    The degree of confidence (e.g. 95%) is referred to as the confidence level.

    Example: for the electric utility data set, the 95% confidence interval for \(\beta\) is [4.53, 5.05].

    "We are 95% confident that the marginal effect of output on costs is at least 4.53 and at most 5.05".

  • Hypothesis Testing

    Hypothesis testing involves specifying a hypothesis to test. This is referred to as the null hypothesis, \(H_0\).

    It is compared to an alternative hypothesis, \(H_1\).

    E.g. \(H_0: \beta = 0\) vs. \(H_1: \beta \neq 0\) is common (and Gretl will print out results for this hypothesis test).

    Many economic questions of interest have the form: "Does the explanatory variable have an effect on the dependent variable?" or, equivalently, "Does \(\beta = 0\) in the regression of Y on X?"

  • Aside on Confidence Intervals and Hypothesis Testing

    Hypothesis testing and confidence intervals are closely related.

    Can test whether \(\beta = 0\) by checking whether the confidence interval for \(\beta\) contains zero.

    If it does not then we can "reject" the hypothesis that \(\beta = 0\)

    or conclude "X has significant explanatory power for Y"

    or "\(\beta\) is significantly different from zero" or "\(\beta\) is statistically significant".

    If the confidence interval does include zero then change the word "reject" to "accept" and "has significant explanatory power" to "does not have significant explanatory power", and so on.

  • The confidence interval approach to hypothesis testing is equivalent to the approach to hypothesis testing discussed next.

    Just as confidence intervals came with various levels of confidence (e.g. 95%), hypothesis tests come with various levels of significance.

    Level of significance is 100% minus the confidence level.

    E.g. if the 95% confidence interval does not include zero, then you may say "I reject the hypothesis that \(\beta = 0\) at the 5% level of significance" (i.e. 100% - 95% = 5%).

  • Hypothesis Testing (continued)

    First step: specify a hypothesis to test and choose a significance level.

    E.g. \(H_0: \beta = 0\) and the 5% level of significance.

    Second step: calculate a test statistic and compare it to a critical value (a concept we will define in Chapter 3).

    E.g. for \(H_0: \beta = 0\), the test statistic is known as a t-statistic (or t-ratio or t-stat):

    \[ t = \frac{\hat{\beta}}{s_b} \]

    where we will explain \(s_b\) later.

  • The idea underlying hypothesis testing is that we accept \(H_0\) if the value of the test statistic is consistent with what could plausibly happen if \(H_0\) is true.

    If \(H_0\) is true, then we would expect \(\hat{\beta}\) to be small (i.e. if \(\beta = 0\) then expect \(\hat{\beta}\) near zero).

    But if \(\hat{\beta}\) is large this is evidence against \(H_0\).

    Formally, the test statistic is "large" or "small" relative to a critical value taken from statistical tables of the Student-t distribution (defined later).

  • For empirical practice, we often do not need the critical value, since the P-value for this and other tests is produced by computer packages.

    The P-value is the level of significance at which you can reject \(H_0\).

    E.g. with a 5% level of significance, if the software package gives a P-value of 0.05 then reject \(H_0\).

    If the P-value is less than 0.05 then you can also reject \(H_0\).

    Students often want to interpret the P-value as measuring the probability that \(\beta = 0\).

    E.g. if the P-value is less than 0.05 one wants to say "There is less than a 5% probability that \(\beta = 0\) and, since this is very small, I can reject the hypothesis that \(\beta = 0\)."

    This is not formally correct. But it does provide some informal intuition to motivate why small P-values lead you to reject \(H_0\).

  • Hypothesis Testing involving R-squared: The F-statistic

    Another popular hypothesis to test is \(H_0: R^2 = 0\).

    If \(R^2 = 0\) then X does not have any explanatory power for Y.

    Note: for simple regression, this is equivalent to a test of \(\beta = 0\).

    However, for multiple regression (which we will discuss shortly), the test of \(R^2 = 0\) will be different from tests of whether regression coefficients equal zero.

    Same strategy: calculate a test statistic and compare it to a critical value.

    Or most software will also calculate a P-value, which directly gives a measure of the plausibility of \(H_0: R^2 = 0\).

  • The test statistic is called the F-statistic:

    \[ F = \frac{(N - 2) R^2}{1 - R^2} \]

    The appropriate statistical table for obtaining the critical value is the F-distribution (to be explained later).

    Or if the P-value for the F-test is less than 5% (i.e. 0.05), we conclude \(R^2 \neq 0\).

    If the P-value for the F-test is greater than or equal to 5%, we conclude \(R^2 = 0\).

    Of course, you can use levels of significance other than 5%.
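
    To make the F-statistic formula concrete, the Python sketch below plugs in the figures quoted for the electric utility example (N = 123, R-squared = 0.92). The function name is hypothetical, and because 0.92 is a rounded value the result should be read as approximate rather than as the number a software package would print.

```python
def f_statistic(r_squared, n_obs):
    """F-statistic for H0: R-squared = 0 in a simple regression."""
    return (n_obs - 2) * r_squared / (1 - r_squared)


# Values quoted in the slides for the electric utility regression.
f_electric = f_statistic(r_squared=0.92, n_obs=123)

# A regression with almost no explanatory power gives an F-statistic near zero.
f_weak = f_statistic(r_squared=0.01, n_obs=123)

print(f_electric, f_weak)
```

    A large value such as the first one is strong evidence against the hypothesis that R-squared is zero, in line with the tiny P-value reported for this data set.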

  • Computer packages typically provide the following:

    \(\hat{\beta}\), the OLS estimate of \(\beta\).

    The 95% confidence interval, which gives an interval where we are 95% confident \(\beta\) will lie.

    Standard deviation (or standard error) of \(\hat{\beta}\), \(s_b\), which is a measure of how accurate \(\hat{\beta}\) is.

    The test statistic, t, for testing \(H_0: \beta = 0\).

    The P-value for testing \(H_0: \beta = 0\).

    \(R^2\), which measures the proportion of the variability in Y explained by X.

    The F-statistic and P-value for testing \(H_0: R^2 = 0\).

  • Example: Cost of Production in the Electric Utility Industry

    Regression of Y = the costs of production on X = output of electricity by 123 electric utility companies.

    Table 2.1 presents regression results in the form they would be produced by most software packages.

    Table 2.1: Regression Results Using Electric Utility Data Set

    Variable     Coeff.    Stand. Error    t-stat    P-value        95% conf. interval
    Intercept    2.19      1.88            1.16      0.25           [-1.53, 5.91]
    Output       4.79      0.13            36.36     5 × 10^-67     [4.53, 5.05]

    \(R^2 = 0.92\) and the P-value for testing \(H_0: R^2 = 0\) is \(5.4 \times 10^{-67}\).
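
    The columns of Table 2.1 are linked to each other, and a short Python sketch can show how. It assumes the usual form of a confidence interval, estimate plus or minus a critical value times the standard error, and an approximate 5% two-sided Student-t critical value of 1.98 for 121 degrees of freedom; neither is stated on these slides (the formal derivation comes later), so both are labelled assumptions. The coefficients and standard errors are the rounded values from the table.

```python
# Assumed approximate critical value (Student-t, 121 degrees of freedom,
# 5% two-sided). Not given in the slides.
CRITICAL_VALUE = 1.98


def t_statistic(coefficient, standard_error):
    """t-statistic for testing that the coefficient equals zero."""
    return coefficient / standard_error


def confidence_interval(coefficient, standard_error, critical=CRITICAL_VALUE):
    """Assumed form: estimate plus or minus critical value times std. error."""
    half_width = critical * standard_error
    return coefficient - half_width, coefficient + half_width


# (coefficient, standard error) pairs as printed in Table 2.1.
rows = {"Intercept": (2.19, 1.88), "Output": (4.79, 0.13)}

results = {}
for name, (coef, se) in rows.items():
    low, high = confidence_interval(coef, se)
    results[name] = (t_statistic(coef, se), round(low, 2), round(high, 2))
    print(name, results[name])
```

    The intervals reproduce the table to two decimals. The t-statistic for Output comes out near 36.8 rather than the printed 36.36, presumably because the table was computed from unrounded estimates.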

  • Multiple Regression

    Multiple regression is the same as simple regression except that there are many explanatory variables.

    The intuition and derivation of multiple and simple regression are very similar.

    We will emphasise only the few differences between simple and multiple regression.

  • Example: Explaining House Prices

    Data on N = 546 houses sold in Windsor, Canada.

    Dependent variable, Y, is the sales price of the house in Canadian dollars.

    Four explanatory variables:

    \(X_1\) = the lot size of the property (in square feet)

    \(X_2\) = the number of bedrooms

    \(X_3\) = the number of bathrooms

    \(X_4\) = the number of storeys (excluding the basement).

  • OLS Estimation of the Multiple Regression Model

    With k explanatory variables the model is:

    \[ Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i \]

    i subscripts denote observations, i = 1, .., N.

    With multiple regression we have to estimate \(\alpha\) and \(\beta_1, .., \beta_k\).

    OLS estimates are found by choosing the values of \(\hat{\alpha}\) and \(\hat{\beta}_1, \hat{\beta}_2, .., \hat{\beta}_k\) that minimize the SSR:

    \[ SSR = \sum_{i=1}^{N} \left( Y_i - \hat{\alpha} - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i} - \dots - \hat{\beta}_k X_{ki} \right)^2 \]

    Computer packages like Gretl will calculate OLS estimates.
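
    For the curious, here is a minimal pure-Python sketch of what a package does behind the scenes. It solves the so-called normal equations, the standard linear system satisfied by the SSR-minimizing coefficients; this route is an assumption for illustration and is not how the slides (or necessarily Gretl) present OLS. The function names and the data, generated from Y = 1 + 2 X1 + 3 X2 with no error term, are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def ols_multiple(x_columns, ys):
    """Return [alpha_hat, beta_1_hat, ..., beta_k_hat]."""
    n = len(ys)
    # A column of ones carries the intercept; then one column per X_j.
    columns = [[1.0] * n] + [list(col) for col in x_columns]
    # Normal equations: (Z'Z) coefficients = Z'Y.
    ztz = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns]
           for ci in columns]
    zty = [sum(p * y for p, y in zip(ci, ys)) for ci in columns]
    return solve_linear_system(ztz, zty)


# Hypothetical data built from Y = 1 + 2*X1 + 3*X2 with no error term.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

coefficients = ols_multiple([x1, x2], y)
print(coefficients)
```

    Because the data contain no error, the estimates recover the intercept and slopes used to build Y, up to floating-point rounding.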

  • Statistical Aspects of Multiple Regression

    Largely the same as for simple regression.

    Formulae for confidence intervals, test statistics, etc. have only minor modifications.

    \(R^2\) is still a measure of fit.

    Can test \(R^2 = 0\) in the same manner as for simple regression.

    If you find \(R^2 \neq 0\) then conclude that the explanatory variables together provide significant explanatory power (Note: this does not necessarily mean each individual explanatory variable is significant).

    Confidence intervals can be calculated for each individual coefficient as before.

    Can test \(\beta_j = 0\) for each individual coefficient (j = 1, 2, .., k) as before.

    Note: we have a confidence interval and a test statistic for each coefficient.

  • Interpreting OLS Estimates in the Multiple Regression Model

    Mathematical intuition: total vs. partial derivative.

    Simple regression:

    \[ \beta = \frac{dY}{dX} \]

    Multiple regression:

    \[ \beta_j = \frac{\partial Y}{\partial X_j} \]

    for the j-th coefficient, j = 1, .., k.

  • Interpreting OLS Estimates in the Multiple Regression Model

    Verbal intuition: with simple regression, \(\beta\) is the marginal effect of X on Y.

    Multiple regression: \(\beta_j\) is the marginal effect of \(X_j\) on Y, ceteris paribus.

    \(\beta_j\) is the effect of a small change in the j-th explanatory variable on the dependent variable, holding all the other explanatory variables constant.

  • Example: Explaining House Prices (continued)

    Multiple regression results using the house price data set:

    Table 2.2: Multiple Regression Using House Price Data Set

    Variable     Coefficient    t-stat    P-value        95% conf. interval
    Intercept    -4009.55       -1.11     0.27           [-11087, 3068]
    Lot Size     5.43           14.70     2 × 10^-41     [4.70, 6.15]
    # bedrm      2824.61        2.33      0.02           [439, 5211]
    # bathrm     17105.17       9.86      3 × 10^-21     [13698, 20512]
    # storeys    7634.90        7.57      1 × 10^-13     [5655, 9615]

    Furthermore, \(R^2 = 0.54\) and the P-value for testing \(H_0: R^2 = 0\) is \(1.2 \times 10^{-88}\).

  • Example: Explaining House Prices (continued)

    How can we interpret the fact that \(\hat{\beta}_1 = 5.43\)?

    An extra square foot of lot size will tend to add $5.43 onto the price of a house, ceteris paribus.

    For houses with the same number of bedrooms, bathrooms and storeys, an extra square foot of lot size will tend to add $5.43 onto the price of a house.

    If we compare houses with the same number of bedrooms, bathrooms and storeys, those with larger lots tend to be worth more. In particular, an extra square foot in lot size is associated with an increased price of $5.43.

  • Confidence interval for \(\beta_1\): "I am 95% confident that the marginal effect of lot size on house price (holding other explanatory variables constant) is at least $4.70 and at most $6.15."

    Hypothesis testing: "Since the P-value for testing \(H_0: \beta_1 = 0\) is less than 0.05, we can conclude that \(\beta_1\) is significant at the 5% level of significance."

    Can make similar statements for the other coefficients.

    Since \(R^2 = 0.54\) can say: "54% of the variability in house prices can be explained by the four explanatory variables."

    Since the P-value for testing \(H_0: R^2 = 0\) is less than 0.05, we can conclude that "the explanatory variables (jointly) have significant explanatory power at the 5% level of significance."
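
As a rough check on how the reported numbers fit together, the sketch below backs out the standard error of the lot size coefficient from its t-stat and rebuilds the 95% interval. It assumes the large-sample critical value 1.96 in place of the exact Student-t value, which depends on the sample size (not given on this slide), so the result should match the reported interval only up to rounding.

```python
# Reported values for the lot size coefficient in Table 2.2.
coef = 5.43
t_stat = 14.70

# The t-stat for H0: beta = 0 is coef / standard error, so the standard
# error can be backed out from the two reported numbers.
std_err = coef / t_stat

# Assumption: 1.96 (large-sample normal critical value) stands in for the
# exact Student-t critical value, which depends on the sample size.
CRITICAL_VALUE_95 = 1.96

ci_lower = coef - CRITICAL_VALUE_95 * std_err
ci_upper = coef + CRITICAL_VALUE_95 * std_err

# Reject H0: beta = 0 at the 5% level when |t| exceeds the critical value,
# which is equivalent to the P-value being below 0.05.
reject_h0 = abs(t_stat) > CRITICAL_VALUE_95

print(f"standard error ~ {std_err:.3f}")
print(f"95% interval  ~ [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"reject H0     : {reject_h0}")
```

The same three steps (standard error, interval, comparison with the critical value) apply to every row of the table.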

  • Which Explanatory Variables to Choose in a Multiple Regression Model?

    We will relate this question to the topics of omitted variables bias and multicollinearity.

    First note that there are two important considerations which pull in opposite directions.

    It is good to include all variables which help explain the dependent variable (include as many explanatory variables as possible).

    Including irrelevant variables (i.e. ones with no explanatory power) will lead to less precise estimates (include as few explanatory variables as possible).

    Playing off these two competing considerations is an important aspect of any empirical exercise. Hypothesis testing procedures can help with this.

  • Omitted Variables Bias

    To illustrate this problem we use the house price data set.

    A simple regression of Y = house price on X = number of bedrooms yields a coefficient estimate of 13,269.98.

    But in the multiple regression (see Table 2.2), the coefficient on number of bedrooms was 2,824.61.

    Why are these two coefficients on the same explanatory variable so different? That is, 13,269.98 is much bigger than 2,824.61.

  • Answer 1: They just come from two different regressions which control for different explanatory variables (different ceteris paribus conditions).

    Answer 2:

    Imagine a friend asked: "I have 2 bedrooms and I am thinking of building a third, how much will it raise the price of my house?"

    Simple regression: "Houses with 3 bedrooms tend to cost $13,269.98 more than houses with 2 bedrooms."

    Does this mean adding a 3rd bedroom will tend to raise the price of the house by $13,269.98? Not necessarily: other factors influence house prices.

  • Houses with three bedrooms also tend to be desirable in other ways (e.g. bigger, with larger lots, more bathrooms, more storeys, etc.). Call these "good houses".

    Simple regression notes that "good houses" tend to be worth more than others.

    Number of bedrooms is acting as a proxy for all these "good house" characteristics and hence its coefficient becomes very big (13,269.98) in simple regression.

    Multiple regression can estimate separate effects due to lot size, number of bedrooms, bathrooms and storeys.

    Tell your friend: "Adding a third bedroom will tend to raise your house price by $2,824.61."

    A multiple regression which contains all (or most) house characteristics will tend to be more reliable than a simple regression which only uses one characteristic.

  • Take a look at the correlation matrix for this data set:

    Table 2.3: Correlation Matrix for House Price Data Set

                Price   Lot Size   # bed   # bath   # storey
    Price       1
    Lot Size    0.54    1
    # bed       0.37    0.15       1
    # bath      0.52    0.19       0.37    1
    # storey    0.42    0.08       0.41    0.32     1

    Positive correlations between explanatory variables indicate that houses with more bedrooms also tend to have larger lot size, more bathrooms and more storeys.

  • Omitted Variable Bias

    Omitted variable bias is a statistical term for these issues.

    IF

    1. We exclude explanatory variables that should be present in the regression,

    AND

    2. these omitted variables are correlated with the included explanatory variables,

    THEN

    3. the OLS estimates of the coefficients on the included explanatory variables will be biased.
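
The IF/AND/THEN logic above can be illustrated with a small simulation. This is a sketch on simulated data, not the house price data set: the variable names, the true coefficients (2 and 3) and the strength of the correlation are all assumptions chosen for the illustration. The last printed line uses the standard result that, with one omitted variable, the bias in the simple regression slope is the omitted variable's coefficient times cov(x1, x2) / var(x1).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated (not real) data: x2 plays the role of an omitted variable such
# as lot size, and x1 (think "bedrooms") is positively correlated with it.
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)

# True model chosen for this illustration: both variables matter.
TRUE_BETA1, TRUE_BETA2 = 2.0, 3.0
y = 1.0 + TRUE_BETA1 * x1 + TRUE_BETA2 * x2 + rng.normal(size=n)


def ols(y, *regressors):
    """Return OLS estimates (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Simple regression omits x2; multiple regression includes it.
beta1_simple = ols(y, x1)[1]
beta1_multiple = ols(y, x1, x2)[1]

# Predicted bias from omitting x2: beta2 * cov(x1, x2) / var(x1).
expected_bias = TRUE_BETA2 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

print(f"simple regression slope on x1   : {beta1_simple:.2f}")
print(f"multiple regression slope on x1 : {beta1_multiple:.2f}")
print(f"true slope plus predicted bias  : {TRUE_BETA1 + expected_bias:.2f}")
```

In this setup the simple regression slope absorbs part of the effect of the omitted variable, in the same way that number of bedrooms acts as a proxy for other house characteristics.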

  • Example: Explaining House Prices (continued)

    Simple regression used Y = house prices and X = number of bedrooms.

    Many important determinants of house prices omitted.

    Omitted variables were correlated with number of bedrooms.

    Hence, the OLS estimate from the simple regression of 13,269.98 was biased.

  • Practical Advice for Selecting Explanatory Variables

    Include (insofar as possible) all explanatory variables which you think might possibly explain your dependent variable. This will reduce the risk of omitted variable bias.

    However, including irrelevant explanatory variables reduces accuracy of estimation and widens confidence intervals.

    So do t-tests (or other hypothesis tests) to decide whether variables are significant. Run a new regression omitting the explanatory variables which are not significant.
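
The two-step advice above (test, then re-run without the insignificant variables) might be sketched as follows. The data are simulated, the helper `ols_with_t_stats` is hypothetical, and 1.96 is used as an approximate 5% critical value. By construction `x3` is irrelevant, but note that an irrelevant variable will still pass a 5% test about one time in twenty.

```python
import numpy as np


def ols_with_t_stats(y, X):
    """OLS estimates and t-stats; X must already contain a column of ones."""
    n, k = X.shape
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefs
    sigma2 = residuals @ residuals / (n - k)   # error variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance of the estimates
    return coefs, coefs / np.sqrt(np.diag(cov))


# Simulated (not real) data: x1 and x2 matter, x3 is irrelevant by design.
rng = np.random.default_rng(1)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

names = ["x1", "x2", "x3"]
columns = {"x1": x1, "x2": x2, "x3": x3}
X_full = np.column_stack([np.ones(n), x1, x2, x3])
_, t_full = ols_with_t_stats(y, X_full)

# Keep a variable when |t| > 1.96 (approximate 5% two-sided critical value).
kept = [name for name, t in zip(names, t_full[1:]) if abs(t) > 1.96]

# Second regression: intercept plus the retained variables only.
X_kept = np.column_stack([np.ones(n)] + [columns[name] for name in kept])
coefs_kept, t_kept = ols_with_t_stats(y, X_kept)

print("kept:", kept)
print("estimates:", np.round(coefs_kept, 2))
```

Dropping variables purely on the basis of t-tests trades off against the omitted variable bias concern, so the threshold is a judgment call rather than a mechanical rule.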

  • Multicollinearity

    Intuition: If explanatory variables are highly correlated with one another then the regression model has trouble telling which individual variable is explaining Y.

    Symptom: Individual coefficients may look insignificant, but the regression as a whole may look significant (e.g. \(R^2\) big, F-stat big, but t-stats on individual coefficients small).

    Looking at a correlation matrix for the explanatory variables can often be helpful in revealing the extent and source of a multicollinearity problem.
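
The symptom described above can be reproduced on simulated data. In this sketch `x2` is constructed to be almost a copy of `x1`; the helper `ols_summary` and all numbers are assumptions made for the illustration, not results from any data set in these slides.

```python
import numpy as np


def ols_summary(y, X):
    """Return (coefficients, standard errors, R^2); X includes a ones column."""
    n, k = X.shape
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefs
    sigma2 = residuals @ residuals / (n - k)
    std_errs = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    r2 = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
    return coefs, std_errs, r2


# Simulated (not real) data in which x2 is almost a copy of x1.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
y = 1.0 + x1 + x2 + rng.normal(size=n)

both = np.column_stack([np.ones(n), x1, x2])
only_x1 = np.column_stack([np.ones(n), x1])

_, se_both, r2_both = ols_summary(y, both)
_, se_single, r2_single = ols_summary(y, only_x1)

print("corr(x1, x2):", round(np.corrcoef(x1, x2)[0, 1], 4))
print("std. error on x1 with x2 included:", round(se_both[1], 2))
print("std. error on x1 with x2 dropped :", round(se_single[1], 2))
print("R^2 with both / with x1 only     :", round(r2_both, 2), round(r2_single, 2))
```

The fit of the regression as a whole barely changes when `x2` is dropped, while the standard error on `x1` shrinks sharply: the data cannot separate the two variables, but together they explain Y well.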

  • Example of Multicollinearity

    Y = exchange rate

    Explanatory variable(s) = interest rate

    X1 = bank prime rate

    X2 = Treasury bill rate

    Using both X1 and X2 will probably cause a multicollinearity problem.

    Solution: Include either X1 or X2 but not both.

    In some cases this "solution" will be unsatisfactory if it causes you to drop explanatory variables which economic theory says should be there.

  • Multiple Regression with Dummy Variables

    Dummy variable is either 0 or 1.

    Use to turn qualitative (Yes/No) data into 1/0.

    Example: Explaining House Prices (continued)

    Data set has 5 potential dummy explanatory variables

    D1 = 1 if the house has a driveway (= 0 if it does not)

    D2 = 1 if the house has a recreation room (= 0 if not)

    D3 = 1 if the house has a basement (= 0 if not)

    D4 = 1 if the house has gas central heating (= 0 if not)

    D5 = 1 if the house has air conditioning (= 0 if not)

  • Simple Regression with a Dummy Variable

    One dummy explanatory variable, D:

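    \[ Y_i = \alpha + \beta D_i + \varepsilon_i \]

    for \(i = 1, \ldots, N\) observations.

    OLS estimation produces \(\hat{\alpha}\) and \(\hat{\beta}\), and the fitted regression line is:

    \[ \hat{Y}_i = \hat{\alpha} + \hat{\beta} D_i \]

    Since \(D_i\) is either 0 or 1, we either have \(\hat{Y}_i = \hat{\alpha}\) or \(\hat{Y}_i = \hat{\alpha} + \hat{\beta}\).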
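
A consequence of the two possible fitted values is that, in a regression on a single dummy, the intercept estimate equals the sample mean of Y for the D = 0 group and intercept plus slope equals the sample mean for the D = 1 group. The sketch below checks this on a handful of made-up numbers (the prices and dummy values are hypothetical, not taken from the house price data set).

```python
import numpy as np

# Hypothetical prices and a 0/1 dummy (1 = has the feature); not real data.
price = np.array([50.0, 62.0, 58.0, 80.0, 91.0, 84.0, 70.0])
dummy = np.array([0, 0, 0, 1, 1, 1, 0])

# OLS of price on an intercept and the dummy.
X = np.column_stack([np.ones(len(price)), dummy])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, price, rcond=None)

# Group means computed directly, without any regression.
mean_without = price[dummy == 0].mean()
mean_with = price[dummy == 1].mean()

print("alpha_hat            :", round(alpha_hat, 2), "| mean, D = 0:", mean_without)
print("alpha_hat + beta_hat :", round(alpha_hat + beta_hat, 2), "| mean, D = 1:", mean_with)
```

This is why the slope on a dummy can be read as the difference in average Y between the two groups.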

  • Example: Explaining House Prices (continued)

    Regress Y = house price on D = dummy for air conditioning (= 1 if house has air conditioning, = 0 otherwise).

    Fitted regression line is:

    \[ \hat{Y}_i = 59884.85 + 25995.74 D_i \]

    Average price of house with air conditioning is $85,881.

    Average price of house without air conditioning is $59,885.

    Remember, however, omitted variables bias (this simple regression no doubt suffers from it).

  • Multiple Regression with Dummy Variables

    \[ Y_i = \alpha + \beta_1 D_{1i} + \dots + \beta_k D_{ki} + \varepsilon_i \]

    Example: Explaining House Prices (continued)

    Regress Y = house price on D1 = driveway dummy and D2 = rec room dummy.

  • Fitted regression line:

    \[ \hat{Y}_i = 47099.08 + 21159.91 D_{1i} + 16023.69 D_{2i} \]

    Putting in either 0 or 1 values for the dummy variables, we obtain the fitted values for Y for the four categories of houses:

    Houses with a driveway and recreation room (\(D_1 = 1\) and \(D_2 = 1\)) have \(\hat{Y}_i = 47099 + 21160 + 16024 = \$84{,}283\).

    Houses with a driveway but no recreation room (\(D_1 = 1\) and \(D_2 = 0\)) have \(\hat{Y}_i = 47099 + 21160 = \$68{,}259\).

    Houses with a recreation room but no driveway (\(D_1 = 0\) and \(D_2 = 1\)) have \(\hat{Y}_i = 47099 + 16024 = \$63{,}123\).

    Houses with no driveway and no recreation room (\(D_1 = 0\) and \(D_2 = 0\)) have \(\hat{Y}_i = \$47{,}099\).

    Multiple regression with dummy variables may be used to classify the houses into different groups and to find average house prices for each group.
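
The four fitted values can be generated mechanically by plugging every 0/1 combination into the fitted line. The sketch below uses the estimates reported on this slide; the function and constant names are made up for the illustration.

```python
from itertools import product

# Estimates reported on the slide for the driveway / rec room regression.
ALPHA_HAT = 47099.08
BETA1_HAT = 21159.91   # driveway dummy, D1
BETA2_HAT = 16023.69   # recreation room dummy, D2


def fitted_price(d1, d2):
    """Fitted house price for a given pair of 0/1 dummy values."""
    return ALPHA_HAT + BETA1_HAT * d1 + BETA2_HAT * d2


# Enumerate all four (D1, D2) combinations.
fitted = {(d1, d2): fitted_price(d1, d2) for d1, d2 in product((0, 1), repeat=2)}

for (d1, d2), value in fitted.items():
    print(f"D1 = {d1}, D2 = {d2}: fitted price ~ ${value:,.0f}")
```

With k dummies the same loop would enumerate 2^k groups, which is one reason to think carefully about what the groups are before interpreting the coefficients.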

  • Multiple Regression with Dummy and non-Dummy Explanatory Variables

    E.g. one dummy variable (D) and one regular non-dummy explanatory variable (X):

    \[ Y_i = \alpha + \beta_1 D_i + \beta_2 X_i + \varepsilon_i \]

    Example: Explaining House Prices (continued)

    Regress Y on D = air conditioning dummy and X = lot size.

    Obtain \(\hat{\alpha} = 32693\), \(\hat{\beta}_1 = 20175\) and \(\hat{\beta}_2 = 5.638\).

    Get two different fitted regression lines:

    \[ \hat{Y}_i = \hat{\alpha} + \hat{\beta}_1 + \hat{\beta}_2 X_i = 52868 + 5.638 X_i \]

    if \(D_i = 1\) (i.e. the \(i\)th house has an air conditioner), and

    \[ \hat{Y}_i = \hat{\alpha} + \hat{\beta}_2 X_i = 32693 + 5.638 X_i \]

    if \(D_i = 0\) (i.e. the house does not have an air conditioner).

    Note that the two regression lines have the same slope and only differ in their intercepts.
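
To see the "same slope, different intercept" point numerically, the sketch below evaluates both fitted lines at a few lot sizes. The coefficient values are the ones reported on this slide; the lot sizes and the function name are hypothetical and serve only to trace out the two lines.

```python
# Estimates reported on the slide (air conditioning dummy and lot size).
ALPHA_HAT = 32693.0
BETA1_HAT = 20175.0   # air conditioning dummy, D
BETA2_HAT = 5.638     # lot size, X


def fitted_price(lot_size, has_air_con):
    """Fitted price from Y-hat = alpha + beta1 * D + beta2 * X."""
    d = 1 if has_air_con else 0
    return ALPHA_HAT + BETA1_HAT * d + BETA2_HAT * lot_size


# Hypothetical lot sizes chosen only to trace out the two lines.
for lot_size in (2000, 5000, 8000):
    with_ac = fitted_price(lot_size, True)
    without_ac = fitted_price(lot_size, False)
    gap = with_ac - without_ac
    print(f"lot = {lot_size}: with = {with_ac:,.0f}, without = {without_ac:,.0f}, gap = {gap:,.0f}")
```

The gap between the two lines is the coefficient on the dummy at every lot size, which is what it means for the lines to be parallel.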

  • Interacting Dummy with non-Dummy Explanatory Variables

    Consider the following regression model:

    \[ Y_i = \alpha + \beta_1 D_i + \beta_2 X_i + \beta_3 Z_i + \varepsilon_i \]

    where D and X are dummy and non-dummy explanatory variables and \(Z = DX\).

    How do we interpret results from a regression of Y on D, X and Z?

    Note that \(Z_i\) is either 0 (for observations with \(D_i = 0\)) or \(X_i\) (for observations with \(D_i = 1\)).

    Fitted regression lines for individuals with \(D_i = 0\) and \(D_i = 1\) are:

    If \(D_i = 0\) then \(\hat{Y}_i = \hat{\alpha} + \hat{\beta}_2 X_i\)

    If \(D_i = 1\) then \(\hat{Y}_i = (\hat{\alpha} + \hat{\beta}_1) + (\hat{\beta}_2 + \hat{\beta}_3) X_i\)

    Two different regression lines corresponding to \(D = 0\) and \(D = 1\) exist and have different intercepts and different slopes.

    The marginal effect of X on Y is different for observations with \(D_i = 0\) than with \(D_i = 1\).

  • Example: Explaining House Prices (continued)

    Regress Y = house price on D = air conditioner dummy, X = lot size and \(Z = D \times X\).

    \(\hat{\alpha} = 35684\), \(\hat{\beta}_1 = 7613\), \(\hat{\beta}_2 = 5.02\) and \(\hat{\beta}_3 = 2.25\).

    The marginal effect of lot size on house price is 7.27 (i.e. \(\hat{\beta}_2 + \hat{\beta}_3\)) for houses with air conditioners and only 5.02 for houses without.
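
With an interaction term, both the intercept and the slope depend on the dummy. The sketch below collects the four reported estimates into a group-specific intercept and slope; the function names are made up for the illustration.

```python
# Estimates reported on the slide for the interaction regression.
ALPHA_HAT = 35684.0
BETA1_HAT = 7613.0    # air conditioner dummy, D
BETA2_HAT = 5.02      # lot size, X
BETA3_HAT = 2.25      # interaction, Z = D * X


def fitted_line(d):
    """Return (intercept, slope) of the fitted line for dummy value d (0 or 1)."""
    intercept = ALPHA_HAT + BETA1_HAT * d
    slope = BETA2_HAT + BETA3_HAT * d
    return intercept, slope


def fitted_price(lot_size, d):
    """Fitted price, computed directly from the four-term regression."""
    z = d * lot_size
    return ALPHA_HAT + BETA1_HAT * d + BETA2_HAT * lot_size + BETA3_HAT * z


for d in (0, 1):
    intercept, slope = fitted_line(d)
    print(f"D = {d}: intercept = {intercept:,.0f}, marginal effect of lot size = {slope:.2f}")
```

Evaluating `fitted_price` and the corresponding `fitted_line` at the same lot size gives the same number, which confirms that the four-term regression is just two separate straight lines written as one equation.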

  • Working with Dummy Dependent Variables

    Example: Dependent variable is a transport choice.

    1 = Yes I take my car to work

    0 = No I do not take my car to work

    We will not discuss this case in this course.

    Note only the following points:

    There are some problems with OLS estimation. But OLS estimationmight be adequate in many cases.

    Better estimation methods are "Logit" and "Probit", which are available in many software packages.

  • Chapter Summary

    With this non-technical introduction to regression, you should be able to get started in actually doing some empirical work (at least with cross-sectional data).

    The major points covered in this chapter include:

    Economic data comes in many forms. Common types are time series, cross-sectional and panel data.

    Simple graphical techniques, including histograms and XY-plots, are useful ways of summarizing the information in a data set.

    Many numerical summaries can be used. The most important are the mean, a measure of the location of a distribution, and the standard deviation, a measure of how spread out or dispersed a distribution is.

    Correlation is a numerical measure of the relationship or association between two variables.

    There are many reasons two variables might be correlated with each other. However, correlation does not necessarily imply causality between two variables.

  • Simple regression quantifies the effect of an explanatory variable, X, on a dependent variable, Y, through a regression line \(Y = \alpha + \beta X\).

    Estimation of \(\alpha\) and \(\beta\) involves choosing estimates which produce the "best fitting" line through an XY graph. These are called ordinary least squares (OLS) estimates, are labelled \(\hat{\alpha}\) and \(\hat{\beta}\) and are obtained by minimizing the sum of squared residuals (SSR).

    Regression coefficients should be interpreted as marginal effects (i.e. as measures of the effect on Y of a small change in X).

    \(R^2\) is a measure of how well the regression line fits the data.

    The confidence interval provides an interval estimate of any coefficient (e.g. an interval for \(\beta\) in which you can be confident \(\beta\) lies).

    A hypothesis test of \(\beta = 0\) can be used to find out whether an explanatory variable belongs in the regression. A hypothesis test can be done either by comparing a test statistic (i.e. the t-stat) to a critical value taken from statistical tables or by examining the P-value. If the P-value is less than 0.05 then you can reject the hypothesis at the 5% level of significance.

  • The multiple regression model has more than one explanatory variable. The basic intuition (e.g. OLS estimates, confidence intervals, etc.) is the same as for the simple regression model. However, with multiple regression the interpretation of regression coefficients is subject to ceteris paribus conditions.

    If important explanatory variables are omitted from the regression and are correlated with included explanatory variables, omitted variables bias occurs.

    If explanatory variables are highly correlated with one another, coefficient estimates and statistical tests may be misleading. This is referred to as the multicollinearity problem.

  • The statistical techniques associated with the use of dummy explanatory variables are exactly the same as with non-dummy explanatory variables.

    A regression involving only dummy explanatory variables classifies the observations into various groups (e.g. houses with air conditioners and houses without). Interpretation of results is aided by careful consideration of what the groups are.

    A regression involving dummy and non-dummy explanatory variables classifies the observations into groups and says that each group will have a regression line with a different intercept. All these regression lines have the same slope.

    Regression involving dummy, non-dummy and interaction (i.e. dummy times non-dummy variables) explanatory variables classifies the observations into groups and says that each group will have a different regression line with different intercept and slope.
