Econometrics Lecture 1: Introduction and Review of Statistics
Chau, Tak Wai
Shanghai University of Finance and Economics
Spring 2014



  • Introduction

    - This course is about Econometrics. "Metrics" means ways of measurement, so econometrics is about the measurement methods used by economists.

    - We learn about our world from data using statistical methods, together with economic theory.

    - We investigate the comovement of different variables by the method of regression.

    - We use econometric tools on empirical data to test theories about causal relationships, to quantify the size of effects, and to make forecasts about things that have not yet happened.

  • Econometrics

    - By estimating econometric models with data, we can answer questions such as:

    - How do variables vary across individuals, across places, or over time?

    - e.g. How does the wage differ across people with different age, education, or gender, or across countries?

    - e.g. How does the unemployment rate change over different phases of the business cycle?

    - We are interested in the sign (positive or negative) and the size (how large) of the effects.

    - Is more education associated with a higher or lower wage? If so, by how much?

    - Making predictions or forecasts: the average wage of university graduates; GDP growth next year.

  • Econometrics

    - Further, is there any causal effect between variables?

    - Example: What causes people to earn higher wages? Does having more education cause a higher wage in general?

    - Example: What factors cause the price of apartments to increase or decrease?

    - Caution: in general, comovement (correlation) does not necessarily mean causation.

    - Example: the signaling hypothesis of education: well-educated people earn more simply because those who get into higher education have higher ability.

    - We need to combine theory, statistical methods, and sometimes research design to ensure that a causal effect is obtained.

  • Econometrics

    - In this course, we introduce the statistical techniques (econometrics) needed to answer the above questions.

    - Statistics involves things that are random or stochastic: they vary across different realizations, are unknown beforehand, and the chances of different outcomes are described by a probability distribution.

    - In real-world data, economic variables vary over individuals, places, or time, and they are not fully determined by other observable variables.

    - So it is natural to use statistical techniques, where we treat the variation due to unknown or unobserved determinants as a draw from a probability distribution.

  • Econometrics

    - Data types:

    - Cross-sectional data: data from a number of different units in a particular period of time (e.g. a survey of households or firms in a month).

    - Time series: data from the same unit observed repeatedly in different periods of time (e.g. GDP, the stock price of a company).

    - Panel/longitudinal data: data from a number of units, where each unit is observed for multiple periods of time (e.g. a survey of the same households or firms once a year for a few years).

    - Different types of data are analyzed with different methods and models.

  • Cross-sectional Data [figure: example of a cross-sectional data set]

  • Time Series Data [figure: example of a time series data set]

  • Panel Data [figure: example of a panel data set]

  • Econometrics

    - Observational data vs experimental data.

    - Experimental data: an experiment can control other variables, leaving only the one under study to differ across groups. Such data are easier to analyze.

    - However, often we cannot conduct experiments, so we have to take the data as they are; these are called observational data.

    - Then we need to control for variation due to other factors using statistical/econometric techniques.

    - Example: it is hard to randomly force people not to get an education when they have the chance, so experiments are not feasible.

    - So we have to rely on real-life data, where those who have the chance to get more education may be quite different from those who do not.

    - Thus, some econometric tools are required.

  • Econometrics

    - A typical econometric model:

    yi = f(xi, εi)

    - yi is the dependent variable, the outcome variable of individual i (use t for time);

    - xi is a vector of regressors, independent variables, or explanatory variables, which explain the variation in yi;

    - εi is the disturbance/error, which represents the component that cannot be explained by the regressors. It is unobserved, and is the key stochastic component of the model;

    - f is a function describing how x and ε affect y.

    - e.g. y: wage; x: education, age, gender, etc.

  • Econometrics

    - x and f are mainly determined by theory; f is also affected by convenience in estimation and inference.

    - The most fundamental functional form is the linear regression model

    yi = β1 + β2 x2i + ... + βk xki + εi,   i = 1, ..., n

    where β is a vector of parameters to be estimated.

    - We will start with this model and the OLS approach, then discuss complications that arise when some basic assumptions are violated.

    - Then we will consider special approaches for specific types of data (time series, panel data, binary dependent variables).
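As a preview, the linear regression model above can be simulated and fitted in a few lines. This is a hypothetical sketch (the coefficient values and noise scale are made up for illustration), using least squares to recover the betas:

```python
import numpy as np

# Hypothetical illustration: simulate the linear model
#   y_i = beta1 + beta2 * x2_i + eps_i
# and recover the betas by least squares.
rng = np.random.default_rng(0)
n = 1000
beta = np.array([1.0, 2.0])            # true (beta1, beta2), chosen arbitrarily
x2 = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)    # the unobserved disturbance
y = beta[0] + beta[1] * x2 + eps

X = np.column_stack([np.ones(n), x2])  # regressor matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                        # close to [1.0, 2.0]
```

With n = 1000 observations the estimates land very close to the true parameters; how and why this works is what the rest of the course develops.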

  • Procedure of an Econometric Analysis

    - Determine your research question, understand the related economic theory, and collect data.

    - Choose appropriate econometric models based on economic theory and the nature of the data (mainly a linear regression model for this class). Determine which explanatory variables x should be included in the model, based on theory and data.

    - Estimate the models using appropriate methods. If we suspect problems, re-estimate the model in ways that can remedy them; carry out specification tests if necessary.

    - Interpret the parameter estimates of your model, perform hypothesis testing (related to economic theory), and answer the research question.

    - Further analysis such as forecasting or policy analysis.

  • Matrix algebra

    - Notation from matrix (linear) algebra is used to shorten expressions.

    - Where there is no risk of confusion, vectors and matrices are not written in bold here.

    - We usually work with column vectors, e.g.

    x = (x1, x2, ..., xn)′  (a column vector)

    We also call this an n × 1 matrix.

    - Sometimes we have row vectors, e.g.

    b = (b1  b2  ···  bn)

  • Matrix algebra

    - When an array has two dimensions, we call it a matrix:

    A = [ a11  a12  ···  a1n
          a21  a22  ···  a2n
          ...
          am1  am2  ···  amn ]

    We call this an m × n matrix.

    - We call a matrix with just one element a scalar; it is the same as a real number in the usual context.

    - The transpose of a matrix is formed by switching rows and columns. The transpose of A is denoted A′ (sometimes A^T). An element of the transposed matrix is a′ij = aji. The dimension of A′ is n × m.

    - The transpose of a column vector is a row vector.

    - (AB)′ = B′A′

  • Matrix algebra

    - Matrix addition/subtraction is done element by element. Both matrices (or vectors) must have the same size.

    - Multiplication of a scalar and a matrix: multiply each element of the matrix by the scalar ((pA)ij = p aij for a scalar p).

    - Multiplication of two matrices: for two matrices A (n × p) and B (p × m), AB is an n × m matrix whose (i, j) element is

    (AB)ij = ∑_{k=1}^{p} aik bkj

    - Note that AB ≠ BA in general, even when both are defined and of the same size.

    - The identity matrix I has 1 on the main diagonal and 0 elsewhere.

    - For the identity matrix I and a square matrix A of the same size, IA = AI = A.
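These rules can be checked numerically. A minimal sketch with numpy (the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[0., 1.],
              [1., 0.]])

# Transpose rule: (AB)' = B'A'
assert np.allclose((A @ B).T, B.T @ A.T)

# Matrix multiplication is not commutative in general
print(A @ B)   # [[2., 1.], [4., 3.]]
print(B @ A)   # [[3., 4.], [1., 2.]]

# Identity matrix: IA = AI = A
I = np.eye(2)
assert np.allclose(I @ A, A) and np.allclose(A @ I, A)
```

Here `@` is matrix multiplication; note that `A @ B` and `B @ A` give different matrices even though both are 2 × 2.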

  • Matrix algebra

    - Multiplication of two vectors: if a and b are column vectors of the same size n, then

    a′b = ∑_{j=1}^{n} aj bj ;   ab′ = [ a1b1  a1b2  ···  a1bn
                                        a2b1  a2b2  ···  a2bn
                                        ...
                                        anb1  anb2  ···  anbn ]

    The former is a scalar; the latter is an n × n matrix.

    - Therefore x1i β1 + x2i β2 + ... + xki βk can be written as x′i β.

    - So, for the same column vector a,

    a′a = ∑_{j=1}^{n} a²j

    which is the sum of squares of the elements, and is always non-negative.

  • Matrix algebra

    - If V is an n × n square matrix and c is an n × 1 column vector, then c′Vc is a scalar.

    - If c′Vc > 0 for any non-zero vector c, we call V a positive definite matrix.

    - If c′Vc ≥ 0 for any vector c, we call V a positive semi-definite matrix.

    - Note: the diagonal elements of a positive definite matrix must be positive. Why?

    - A square matrix A is invertible or non-singular if its inverse A⁻¹ exists, where A⁻¹A = AA⁻¹ = I.

    - (AB)⁻¹ = B⁻¹A⁻¹ if both A and B are square invertible matrices.

    - A square matrix is invertible if it is of full rank, i.e. no column (row) can be expressed as a linear combination of the other columns (rows).
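A quick numerical sketch of these definitions (the matrix V is an arbitrary example; for a symmetric matrix, positive definiteness is equivalent to all eigenvalues being positive):

```python
import numpy as np

V = np.array([[2., 1.],
              [1., 2.]])   # symmetric example matrix

# Symmetric V is positive definite iff all eigenvalues are > 0
eigvals = np.linalg.eigvalsh(V)
print(eigvals)             # [1. 3.] -> positive definite

# c'Vc > 0 for a (non-zero) test vector c
c = np.array([1., -1.])
print(c @ V @ c)           # 2.0

# Inverse: V^{-1} V = I, and (AB)^{-1} = B^{-1} A^{-1}
Vinv = np.linalg.inv(V)
assert np.allclose(Vinv @ V, np.eye(2))
A = np.array([[1., 2.], [0., 1.]])
assert np.allclose(np.linalg.inv(A @ V),
                   np.linalg.inv(V) @ np.linalg.inv(A))
```

One test vector does not prove positive definiteness, of course; the eigenvalue check does, for symmetric matrices.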

  • Statistics Review

    - A random variable (r.v.) takes a numerical value that corresponds to a certain set of random outcomes. As the outcome is random, the random variable differs across realizations (draws from the distribution).

    - It can be discrete, taking a finite or countable number of values.

    - It can be continuous, taking any value on an interval.

    - e.g. X = 1 if it rains tomorrow, X = 0 if it does not.

    - e.g. X is the highest temperature tomorrow.

    - For practical purposes, we usually treat an r.v. as continuous in this class unless it mainly takes a small number of discrete values.

  • A Discrete Distribution [figure: example probability mass function]

  • A Continuous Distribution [figure: example probability density function]

  • Statistics Review

    - The distribution function, or cumulative distribution function (cdf), of an r.v. X is

    F(x) = Pr(X ≤ x)

    - The probability density function (pdf) of a continuous r.v. X is

    f(x) = dF(x)/dx

    - So the probability of falling between a and b is

    Pr(a < X ≤ b) = ∫_a^b f(x) dx = F(b) − F(a)

    - Note that for a continuous r.v., at each particular point, Pr(X = a) = 0.

    - It is only meaningful to talk about the probability of an interval, say (a, b].
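The cdf/pdf relation can be verified numerically. A sketch using scipy's standard normal (the interval (−1, 1] is an arbitrary choice): the probability of the interval computed as F(b) − F(a) matches the integral of the density.

```python
# For a continuous r.v. (here standard normal),
# Pr(a < X <= b) = F(b) - F(a) = integral of f over (a, b].
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 1.0
p_cdf = norm.cdf(b) - norm.cdf(a)
p_int, _ = quad(norm.pdf, a, b)          # numerically integrate the density
print(round(p_cdf, 4), round(p_int, 4))  # both ~0.6827
```

The single-point probability is zero: `norm.cdf(a) - norm.cdf(a)` is exactly 0, consistent with Pr(X = a) = 0 for a continuous r.v.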

  • Statistics Review

    - The joint distribution of two random variables X and Y is

    F(x, y) = Pr(X ≤ x, Y ≤ y)

    - The corresponding density function is

    f(x, y) = ∂²F / ∂x∂y

    - Conditional distribution (density):

    f_{Y|X}(y|x) = f(x, y) / f_X(x)

    so the joint density can be expressed as

    f(x, y) = f_X(x) f_{Y|X}(y|x)

    - Two variables are statistically independent if

    f(x, y) = f_X(x) f_Y(y) for all x, y.

  • Statistics Review

    - Expectation (general): for a function g of an r.v. X,

    E(g(X)) = ∫ g(x) f(x) dx   or   ∑ g(x) Pr(X = x)

    - Mean (first moment):

    E(X) = ∫ x f(x) dx   or   ∑ x Pr(X = x) = μ_X

    - Variance (second central moment):

    Var(X) = E(X − E(X))² = E(X²) − [E(X)]² = σ²_X

    - Skewness (third standardized moment): E(X − E(X))³ / σ³_X (non-zero if the distribution is asymmetric: the two sides are not the same).

    - Kurtosis (fourth standardized moment): E(X − E(X))⁴ / σ⁴_X (large if the tails are heavy).


  • Statistics Review

    - Covariance:

    Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = σ_XY

    This is a measure of the linear relation between two variables.

    - Correlation:

    ρ_XY = Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)) = σ_XY / (σ_X σ_Y)

    which lies between −1 and 1.

    - If the two variables always lie on an upward-sloping straight line, then ρ_XY = 1.

    - If they always lie on a downward-sloping straight line, then ρ_XY = −1.

    - If the covariance/correlation is zero, X and Y are uncorrelated.

    - Covariance/correlation only measures linear relationships. A non-linear relationship may give rise to zero correlation.
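The last point is worth seeing numerically. A sketch (the data-generating choices are made up for illustration): a linear relation gives a correlation near 1, while Y = X² depends on X perfectly yet has near-zero correlation with it, because Cov(X, X²) = E(X³) = 0 for a symmetric zero-mean X.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)

# Linear relation: strong positive correlation
y = 2 * x + rng.normal(scale=0.1, size=x.size)
r_lin = np.corrcoef(x, y)[0, 1]
print(round(r_lin, 3))   # close to 1

# Nonlinear relation, (near) zero correlation: y = x^2 is a
# deterministic function of x, but the relation is not linear
y2 = x ** 2
r_sq = np.corrcoef(x, y2)[0, 1]
print(round(r_sq, 3))    # close to 0
```

So "uncorrelated" does not mean "unrelated"; correlation only picks up the linear part of a relationship.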


  • Statistics Review

    - Conditional expectation: the expectation of Y given the value of X,

    E(Y | X = x) = ∫_{−∞}^{∞} y f_{Y|X}(y|x) dy   or   ∑ y Pr(Y = y | X = x)

    - If there is some comovement between the variables, this conditional distribution changes with x.

    - Finding out how, and by how much, the conditional mean of Y varies with x is one of the central issues of this class.

    - e.g. E(wage | educ).

    - If the two random variables are independent, E(Y | X = x) = E(Y) = μ_Y.


  • Statistics Review

    - Properties of expectation:

    - E(aX + bY) = a E(X) + b E(Y)

    - Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)

    - By applying the above iteratively, sums of any finite number of terms can be handled similarly.

    - More generally, E(a g(X) + b h(Y)) = a E(g(X)) + b E(h(Y)).

    - However, the expectation cannot be passed inside non-linear functions:

    E(g(X)) ≠ g(E(X))

  • Statistics Review

    - Linear function of random variables in matrix form.

    - Note that for a constant (column) vector a = (a1, ..., aK)′ and a random (column) vector x = (x1, ..., xK)′,

    a′x = ∑_{k=1}^{K} ak xk

    which is a scalar. Then

    E(a′x) = E(∑_{k=1}^{K} ak xk) = ∑_{k=1}^{K} ak E(xk) = a′E(x) = a′μ = ∑_{k=1}^{K} ak μk

  • Statistics Review

    - Variance-covariance matrix:

    Var(x) = E[(x − μ)(x − μ)′] = Σ

    So the (i, j)th element of the matrix is

    Σij = E[(xi − μi)(xj − μj)] = Cov(xi, xj)
    Σii = E(xi − μi)² = Var(xi)

    - Thus the variance-covariance matrix of a vector of size K is a K × K square matrix.

    - The variance matrix is symmetric and positive definite.

    - An outer product of the form E(uu′) is the variance of a vector u (with E(u) = 0).

    - Note the two meanings of Σ: it may denote a variance matrix or a summation. A summation involves a range (i = 1, ..., n), though we sometimes omit it.

  • Statistics Review

    - For a linear combination in vector form, the variance is

    Var(a′x) = E[a′(x − μ)(x − μ)′a] = a′E[(x − μ)(x − μ)′]a = a′Var(x)a = a′Σa

    where Σ = Var(x) = E[(x − μ)(x − μ)′] is the K × K variance-covariance matrix of x.

    - Σ being positive definite implies that the variance of the above linear combination of x must be positive.

    - The variance expression can also be written in summation form:

    a′Σa = ∑_{k=1}^{K} a²k Var(xk) + ∑_{j≠k} aj ak Cov(xj, xk)

    which agrees with what we had before.
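The identity Var(a′x) = a′Σa can be checked by simulation. A sketch with made-up numbers: Σ and a are arbitrary, and for this choice a′Σa = 1·1 + 4·2 + 2·(1)(−2)·0.5 = 7.

```python
import numpy as np

# Simulation check (hypothetical numbers) that Var(a'x) = a' Sigma a.
rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
mu = np.array([0.0, 0.0])
a = np.array([1.0, -2.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)  # 200000 draws of a 2-vector
sample_var = np.var(x @ a)      # empirical variance of a'x across draws
theory_var = a @ Sigma @ a      # = 7.0 for these numbers
print(round(theory_var, 1), round(sample_var, 2))
```

The empirical variance of the scalar a′x across many draws lands close to the theoretical a′Σa.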

  • Statistics Review

    - Consider a matrix A consisting of L rows of K-vectors:

    A = [ a1
          ...
          aL ]  (L × K)

      = [ a11  a12  ···  a1K
          a21  a22  ···  a2K
          ...
          aL1  aL2  ···  aLK ]

    - An L-vector of linear combinations of the elements of x is

    Ax = ( a1 x, ..., aL x )′  (L × 1) = ( ∑_{j=1}^{K} a1j xj , ..., ∑_{j=1}^{K} aLj xj )′

    - Since a1, ..., aL are row vectors, we do not need to transpose them.

  • Statistics Review

    - So,

    E(Ax) = A E(x) = Aμ

    which is an L × 1 vector, and

    Var(Ax) = E[A(x − μ)(x − μ)′A′] = A E[(x − μ)(x − μ)′] A′ = AΣA′

    which is an L × L matrix.

    - Remember: if a random vector is of dimension L, its variance matrix is L × L.

  • Statistics Review

    A Few Common Distributions

    - Normal distribution: N(μ, σ²)

    f(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

    - When μ = 0 and σ² = 1 (denoted N(0, 1)), we call this the standard normal distribution.

    - Its density is bell-shaped and symmetric around the mean.

    - If X ~ N(μ, σ²), then

    z = (X − μ) / σ ~ N(0, 1)


  • Statistics Review

    A Few Common Distributions

    - Chi-square distribution with degrees of freedom p: χ²(p).

    - It can be constructed as

    χ² = ∑_{i=1}^{p} Z²i

    where p is a positive integer, Zi ~ N(0, 1), and the Zi are mutually independent.

    - As it is a sum of squares, it takes only non-negative values.


  • Statistics Review

    A Few Common Distributions

    - T distribution (Student's t distribution) with degrees of freedom ν: T(ν).

    - It can be constructed as

    T = Z / √(P/ν)

    where Z ~ N(0, 1), P ~ χ²(ν), and Z and P are independent.

    - It is like the standard normal: bell-shaped and symmetric around zero, but with thicker tails (higher densities at the two ends).

    - As ν → ∞, it converges to the standard normal distribution.

    - When ν > 100, it is practically close to the normal distribution.


  • Statistics Review

    A Few Common Distributions

    - F distribution with degrees of freedom p and q: F(p, q).

    - It can be constructed as

    F = (P/p) / (Q/q)

    where P ~ χ²(p), Q ~ χ²(q), and P and Q are independent.

    - As it is a ratio of two (scaled) chi-square variables, it takes only non-negative values.

    - As q → ∞, Q/q → 1 and so pF → P ~ χ²(p).

    - We will see how these distributions are useful in constructing statistical tests later in this class.


  • Statistics Review

    Population and Sample

    - The population is the world/nature we want to learn about.

    - Traditional view: the population is all observations in the domain we investigate (e.g. all people in a country, when we study the height, weight, or wage distribution of that country).

    - Statistical view: a data generating process. There is an underlying statistical distribution or model that generates each observation; each observation is a realization of this data generating process (e.g. each person's height is a draw from a height distribution, say with mean μ and variance σ², or wages are generated by a regression model).

    - It is the properties of, and relationships in, the population that we want to learn about.

  • Statistics Review

    - We learn about the world through data.

    - Very often we do not have all the data about the world (in the traditional view), so we need to draw a sample from the population.

    - A sample is the part of the data we draw from the population in order to understand the world (say, the heights of 100 people).

    - From this sample, we want to estimate parameters of the population (the mean μ, β in a regression model, etc.) and to test hypotheses about the population.

    - Because the sample is a random draw from the population, each sample is different, and so are the statistics computed from it.

  • Statistics Review

    - Suppose we want to estimate the (population) mean μ of a certain variable (e.g. the height of adult males).

    - We denote the random variable X.

    - Population mean E(X) = μ and population variance Var(X) = σ².

    - Assume a random sample is drawn from the population, meaning each observation is an independent and identically distributed (iid) draw from the population.

    - For each observation i, E(Xi) = μ and Var(Xi) = σ².

    - A straightforward estimator for the mean μ is the sample mean X̄:

    X̄ = (1/n) ∑_{i=1}^{n} Xi

    - Since the Xi are random variables, X̄ is also a random variable, and it takes different values in different samples.

  • Statistics Review

    - Properties of the estimator:

    E(X̄) = (1/n) ∑ E(Xi) = (1/n) · n · μ = μ

    Var(X̄) = (1/n²) ∑ Var(Xi) = (1/n²) · n · σ² = σ²/n

    Note that the covariance terms disappear. Why?

    - We usually look for certain good properties in an estimator.

    - Unbiasedness: E(X̄) = μ (its expectation equals the true value of the parameter we want to estimate).

    - Consistency: as the sample size n goes to infinity, X̄n (the sample mean of a sample of size n) converges in probability to the true mean μ (notation: X̄n →p μ).

    - This means that as n becomes larger and larger, the distribution becomes denser and denser around the true value μ, and in the limit as n → ∞ it collapses to the point μ.
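The two moment results E(X̄) = μ and Var(X̄) = σ²/n can be illustrated by repeated sampling. A sketch with made-up parameters (μ = 5, σ = 2, n = 50): draw many samples, compute X̄ for each, and look at the mean and variance of the X̄'s.

```python
import numpy as np

# Simulation sketch: many replications of a size-n sample, one X-bar each.
rng = np.random.default_rng(3)
mu, sigma, n, reps = 5.0, 2.0, 50, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)        # one sample mean per replication

print(round(xbar.mean(), 2))       # ~5.0  (unbiasedness: E(X-bar) = mu)
print(round(xbar.var(), 3))        # ~0.08 (= sigma^2 / n = 4 / 50)
```

Each row is one "sample" from the population; across rows, X̄ behaves exactly as the algebra above predicts.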


  • Statistics Review

    - A sufficient condition for X̄n →p μ is that E(X̄n) = μ and Var(X̄n) → 0 as n → ∞, which are clearly satisfied here.

    - An estimator is consistent if both of the following hold:
      1. its distribution narrows down to a point (converges) as n goes to infinity;
      2. that point is the true value.

    - Inconsistent estimators:

    - Example: using only the first observation X1, no matter how large the sample is.

    - Example: (∑_{i=1}^{n} 1.2 Xi)/n, if μ ≠ 0.

  • Statistics Review

    - Law of Large Numbers (LLN). The most basic form: if the observations Xi are iid draws from a population with finite variance, then

    (1/n) ∑_{i=1}^{n} Xi →p E(Xi)

    That is, the sample mean converges in probability to the population mean.

    - Since we can define another r.v. by letting Yi = g(Xi) for a continuous function g, we also have

    (1/n) ∑_{i=1}^{n} g(Xi) →p E(g(Xi))
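A small simulation sketch of the LLN (the exponential distribution with mean 3 is an arbitrary, deliberately non-normal choice):

```python
import numpy as np

# LLN sketch: the sample mean of iid draws settles near E(X) as n grows.
rng = np.random.default_rng(4)
draws = rng.exponential(scale=3.0, size=1_000_000)  # E(X) = 3, far from normal

for n in (10, 1_000, 1_000_000):
    print(n, round(draws[:n].mean(), 3))
# The running mean drifts toward 3. The same holds for g(X):
# (draws ** 2).mean() approaches E(X^2) = Var + mean^2 = 9 + 9 = 18 here.
mean_all = draws.mean()
```

The second remark in the slide is the g(X) version: averaging any continuous transform of the draws converges to the corresponding population expectation.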

  • Statistics Review

    - Efficiency.

    - Recall that for X̄ as an estimator of μ, we have Var(X̄) = σ²/n.

    - The lower the variance of an estimator, the more likely its sample value is to be close to the true value.

    - An estimator θ̂ is more efficient than another estimator θ̃, both unbiased (or consistent) for θ, if

    Var(θ̂) < Var(θ̃)

    - For a vector of estimators, we replace the inequality above by the requirement that

    D = Var(θ̃) − Var(θ̂) is positive definite,

    so that a′Da > 0 for any non-zero vector a, which means Var(a′θ̃) − Var(a′θ̂) = a′Var(θ̃)a − a′Var(θ̂)a > 0.

    - So the variance of any linear combination of θ̂ is smaller than that of the same linear combination of θ̃.

  • Statistics Review

    - Estimating the standard error.

    - Recall Var(X̄) = σ²/n, so the standard deviation of the estimator is σ/√n.

    - However, we usually do not know σ² to start with.

    - An unbiased and consistent estimator for σ² is the sample variance:

    s² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)²

    We divide by n − 1 to adjust for the degree of freedom lost in estimating X̄.

    - The (estimated) standard error is

    s_X̄ = se(X̄) = √(s²/n) = s/√n
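A quick sketch of these formulas (the eight data points are made up for illustration). Note numpy's `ddof=1` option, which gives the n − 1 divisor the slide calls for:

```python
import numpy as np

# Sample variance with the n-1 divisor, and the standard error s/sqrt(n).
x = np.array([4.8, 5.1, 5.6, 4.9, 5.3, 5.2, 4.7, 5.4])
n = x.size

s2 = np.var(x, ddof=1)          # divides by n - 1, not n
se = np.sqrt(s2 / n)            # estimated standard error of x-bar
print(round(s2, 4), round(se, 4))

# Equivalent by hand, matching the formula on the slide:
assert np.isclose(s2, ((x - x.mean()) ** 2).sum() / (n - 1))
```

The default `np.var(x)` divides by n and would be biased downward for σ²; `ddof=1` is the unbiased version used here.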

  • Statistics Review

    - Sampling distribution / distribution of the estimator.

    - Knowledge of the distribution of the estimator is essential for performing hypothesis testing and constructing confidence intervals later in this section.

    - We need to distinguish between the cases where the population is normally distributed and where it is not.

    - If the population is normally distributed, so that X ~ N(μ, σ²), then by the property that a linear combination of normally distributed random variables is still normal, we have

    X̄n = (1/n) ∑_{i=1}^{n} Xi ~ N(μ, σ²/n)   or   √n (X̄n − μ)/σ ~ N(0, 1)

    - However, we cannot use this argument to justify the normal distribution if the population is not normal.

  • Statistics Review

    - Central Limit Theorem.

    - Luckily, we have a rather amazing result about sample means.

    - The Central Limit Theorem says that if the observations X1, X2, ..., Xn drawn from the population are iid with finite variance, then as the sample size n gets large, the distribution of the sample mean X̄n becomes closer and closer to a normal distribution; in the limit as n → ∞, it is exactly normal, regardless of the population distribution.

    - This is true even if the population itself is far from normal, say uniform or binomial.

  • Samples from a Bernoulli distribution with Pr(X = 1) = 0.7, for n = 2, 5, 10, 100. [figure: four histograms of the sample mean, one per n; the distribution becomes increasingly bell-shaped as n grows]
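The idea behind that figure can be reproduced in a few lines. This is a sketch, not the original code: it draws Bernoulli(0.7) samples for the same n values and tracks the skewness of the standardized sample mean, which shrinks toward 0 (the value for the normal limit) as n grows.

```python
import numpy as np

# CLT sketch: sample means of Bernoulli(0.7) draws look more and more
# normal as n grows, even though each X_i only takes the values 0 and 1.
rng = np.random.default_rng(5)
p, reps = 0.7, 50_000

skews = {}
for n in (2, 5, 10, 100):
    means = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
    z = (means - p) / np.sqrt(p * (1 - p) / n)  # standardize the sample mean
    skews[n] = np.mean(z ** 3)                  # skewness; 0 for the normal limit
    print(n, round(skews[n], 2))
```

For this distribution the skewness of the standardized mean is (1 − 2p)/√(np(1 − p)), so it shrinks like 1/√n, exactly the convergence the histograms display.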

  • Statistics Review

    - Technically, we can use the notion of convergence in distribution.

    - After normalization, the Central Limit Theorem implies

    √n (X̄n − μ) →d N(0, σ²)   or   (X̄n − μ)/(σ/√n) →d N(0, 1)

    when the Xi are iid, regardless of the distribution of Xi.

    - Making use of this, for a large but finite n we can approximate the sampling distribution by

    X̄n ~approx N(μ, σ²/n)

    - The resulting distribution is the same as the exact result for a normal population, but here it is an approximation for large samples when the population is not normal.

    - For simple problems, a sample size of 30 is regarded as large. For more complicated problems, it is harder to say.

  • Statistics Review

    - Similarly for a vector of random variables: if we have an iid sample of K random variables, then

    √n (X̄n − μ) →d N(0, Σ)

    where

    X̄n = (1/n) ∑_{i=1}^{n} Xi (a K × 1 vector of sample means),   μ = (μ1, μ2, ..., μK)′

    and Σ is the variance matrix of the population variables (X1, ..., XK).

    - If a is a column vector of non-random scalars, then

    √n (a′X̄n − a′μ) →d N(0, a′Σa)

    Q: What is the asymptotic distribution of X̄1 − X̄2?

  • Statistics Review

    - So we know that if σ² is known, then

    z = (X̄ − μ)/(σ/√n) ~ N(0, 1)

    - This is exact if the population X follows a normal distribution, and approximately true in large samples if the population follows another distribution.

    - However, most of the time σ² is unknown.

    - Consider the t-ratio

    t = (X̄ − μ)/(s/√n) = (X̄ − μ)/se(X̄)

    where we replace σ by the sample estimate s introduced before.

  • Statistics Review

    - When the population is normally distributed, it can be shown that

    t = [(X̄ − μ)/(σ/√n)] / √[(n − 1)(s²/σ²)/(n − 1)] ~ T(n − 1)

    - The numerator is distributed as standard normal, while the quantity under the square root in the denominator is a chi-square random variable with n − 1 degrees of freedom (i.e. χ²(n − 1)) divided by n − 1.

    - It can also be shown that the two are independent, so by definition t is Student-t distributed with n − 1 degrees of freedom. This is true regardless of the sample size.

    - Notice that when n is large (e.g. > 100), the T distribution and the normal are close, so you may use the normal distribution directly.

  • Statistics Review

    - For a non-normal population, we can (only) obtain an asymptotic approximation:

    t = [(X̄ − μ)/√(σ²/n)] / √(s²/σ²)

    - The numerator converges in distribution to N(0, 1), and the denominator converges in probability to 1, since s² →p σ² and the square root is a continuous function.

    - Therefore

    t →d N(0, 1)

    - So, for a large enough sample, t is approximately distributed as standard normal.

    - We may also use the T distribution as an approximation when the sample size is between 30 and 100; this allows a thicker tail than the standard normal.

    - For n > 100, we may directly use the normal.

  • Statistics Review

    - Hypothesis testing.

    - Beyond the point estimate, say X̄, we may also want to test whether some belief (hypothesis) we hold about the population is true.

    - Making use of our knowledge of the distribution of a statistic (e.g. t above), we can perform hypothesis testing.

    - Hypotheses are always statements about the population.

    - Two-sided hypothesis: H0: μ = μ0 vs H1: μ ≠ μ0.

    - One-sided hypothesis: H0: μ = μ0 vs H1: μ < μ0 (or μ > μ0).

    - The idea: if the null were true and the X̄ we observe would be too extreme a sample outcome under it, then we have evidence that the null is unlikely to be true. Otherwise, we do not have enough evidence that the null is false.

  • Statistics Review

    - H0: μ = μ0 vs H1: μ ≠ μ0.

    - We use the information in our sample, in particular X̄ or the corresponding t, to judge whether the null is true.

    - In this case, as X̄ and t are approximately normally distributed, there is no region of values that X̄ cannot take if the null is true.

    - So we have to allow for some error.

    - We choose the proportion α of the distribution under the null (H0) that is most favorable to H1, and set this as the rejection region.

    - α is the probability of a Type I error (H0 is true, but we decide to reject H0).

    - α is also commonly known as the significance level, or size, of the test.

  • Statistics Review

    - In this case we make use of t = (X̄ − μ0)/(s/√n).

    - If the null is true, the true mean is μ0, and t is distributed as T(n − 1).

    - So if X̄ is so far from μ0 that |t| lies in the most extreme α/2 area on either side (i.e. |t| > t_{α/2, n−1}), then we reject the null (H0).

    - Otherwise, we cannot reject H0: we do not have enough evidence that H0 is false.

    - We call t_{α/2, n−1} the critical value for significance level α.

    - Usually we use α = 5%; sometimes 10% or 1%.

  • Statistics Review

    - Example: suppose we want to learn about the height of females aged 20-25 in Shanghai.

    - Suppose we have a random sample of the heights of 100 females in this age range.

    - From the sample: X̄ = 165.3, s² = 50.23, n = 100.

    - We want to test H0: μ = 162 vs H1: μ ≠ 162.

    - The t statistic is

    t = (165.3 − 162)/√(50.23/100) = 4.656

    - The critical value for a two-sided test at α = 5% is about 1.96 using the standard normal, and 1.99 using T(99). Our t is clearly more extreme, so we can reject H0 at the 5% level.
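The arithmetic of this example can be reproduced directly from the slide's numbers:

```python
import numpy as np

# The slide's example: X-bar = 165.3, s^2 = 50.23, n = 100, H0: mu = 162.
xbar, s2, n, mu0 = 165.3, 50.23, 100, 162.0

t = (xbar - mu0) / np.sqrt(s2 / n)
print(round(t, 3))   # 4.656, well beyond the 5% critical value of ~1.99
```

Since 4.656 > 1.99, H0 is rejected at the 5% level, as stated.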

  • Statistics Review

    - Another way to make the judgement is to use the p-value.

    - This is the probability of observing a statistic as or more extreme than the one actually observed, given that H0 is true.

    - If T_{n−1} ~ T(n − 1), then

    p = Pr(|T_{n−1}| ≥ |t|) = Pr(T_{n−1} ≥ |t|) + Pr(T_{n−1} ≤ −|t|)

    where t is the t-ratio in the observed sample.

    - The larger p is, the less extreme the sample is under the null. Thus we reject when p is very small; the rule is to reject when p ≤ α.

    - In the example above,

    p = Pr(|T99| ≥ 4.656) ≈ 1 × 10⁻⁵

    so we reject the null.

    - It is usually easiest to obtain p-values through computer software.
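A sketch of how software computes this p-value, using scipy's T distribution survival function for the example's numbers:

```python
from scipy.stats import t as t_dist

# Two-sided p-value for the slide's example: t = 4.656 with 99 degrees
# of freedom.
t_stat, df = 4.656, 99
p = 2 * t_dist.sf(abs(t_stat), df)   # Pr(|T_99| >= |t|)
print(p)                             # on the order of 1e-5, so reject H0
```

`sf` is the upper-tail probability 1 − F(t); doubling it gives the two-sided p-value, matching the formula on the slide.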

  • Statistics Review

    - One-sided hypothesis: H0: μ = μ0 vs H1: μ < μ0 (or μ > μ0).

    - The only difference is that we reject only if what we observe is on the side favorable to the alternative.

    - If we use the t-ratio, we need to adjust the critical value.

    - If we use the p-value, we only use the probability on the side favorable to the alternative.

    - Example: if instead H1: μ > 162, we use the critical value t_{α, n−1} = 1.66 (or z_α = 1.645); clearly we reject the null, and we have evidence that μ is larger than 162.

    - The p-value is now Pr(T99 ≥ 4.656) ≈ 5 × 10⁻⁶, again much smaller than α = 0.05, so we reject the null hypothesis.

    - What if the alternative were μ < 162?

  • Statistics Review

    - There is also a Type II error: given that H1 is true, we fail to reject the null.

    - Given limited data, there is a trade-off between Type I and Type II errors.

    - If we want to reduce the Type I error by making it harder to reject, we are also more likely to commit a Type II error.

    - If the standard error is large, it is hard to tell whether the null is indeed satisfied.

    - e.g. If the standard error of X̄ is 10, it is difficult to distinguish whether it comes from a distribution with μ = 160 or one with μ = 165.

    - The only way to reduce both types of error is to reduce the standard error of your estimator, either by using your data more efficiently (a more efficient estimator) or by increasing your sample size (recall se = s/√n).

  • Statistics Review

    - We can also construct a confidence interval for the population parameter μ using X̄ and the related critical values.

    - In 1 − α proportion of repeated samples, a confidence interval constructed this way would include the true value μ:

    Pr(−t_{α/2, n−1} < (X̄ − μ)/(s/√n) < t_{α/2, n−1}) = 1 − α

    Rearranging, we have

    Pr(X̄ − t_{α/2, n−1} s/√n < μ < X̄ + t_{α/2, n−1} s/√n) = 1 − α

    - This gives you an idea of the likely range of the true parameter.

    - If the (1 − α) confidence interval does not include the value μ0 under H0, then we reject the null of the test.

    - Example: a 95% confidence interval for the height of females is

    165.3 ± 1.99 × √(50.23/100) = (163.9, 166.7)
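The example interval can be reproduced from the slide's numbers (scipy supplies the exact T(99) critical value, approximately 1.98, which the slide rounds to 1.99):

```python
import numpy as np
from scipy.stats import t as t_dist

# 95% CI for the height example: X-bar = 165.3, s^2 = 50.23, n = 100.
xbar, s2, n, alpha = 165.3, 50.23, 100, 0.05

crit = t_dist.ppf(1 - alpha / 2, n - 1)   # ~1.98 for T(99)
half = crit * np.sqrt(s2 / n)             # half-width: critical value x se
lo, hi = xbar - half, xbar + half
print(round(lo, 1), round(hi, 1))         # 163.9 166.7
```

Since 162 lies outside (163.9, 166.7), this confirms the earlier rejection of H0: μ = 162 at the 5% level.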

  • Statistics Review

    - In this lecture we went through the statistical techniques used to learn about the population mean μ using X̄.

    - We will do something similar, but concerning relations between variables.

    - In this course:

    - 1. We introduce the basic models and their underlying assumptions.

    - 2. The method of estimation: Ordinary Least Squares (OLS).

    - 3. Statistical hypothesis testing on parameters that represent economic relations.

    - 4. What should be done if some basic assumptions are violated.