

Chapter 1

What Is Econometrics

1.1 Economics

1.1.1 The Economic Problem

Economics concerns itself with satisfying unlimited wants with limited resources. As such it proposes solutions that involve the production and consumption of goods using a particular allocation of resources. At the microeconomic level, these solutions are often the results of optimizing behavior by economic agents such as consumers or producers, possibly impacted by regulators. At the macroeconomic level, they can also be the results of decisions made by policymakers such as central bankers, the executive branch, and the legislative branch.

1.1.2 Some Economic Questions

As economic agents arrive at their decisions they encounter and must answer a host of economic questions: Should the Eurozone countries coordinate their fiscal as well as monetary policies? Relatedly, should Greece default on its debt and form a new Greek currency? What is the best time for Professor Brown to retire? Should an undergraduate at Rice major in economics or mathematical economics? Why did the savings rate decline during the Great Expansion of the nineties?

As economists we are in the business of finding answers to such questions. Hopefully, the answers are justifiable and replicable so we are in some sense consistent. Moreover, we hope that our answers will be borne out by subsequent experience. In this chapter, we discuss a structure for going about constructing such answers. This structure is how we do business in economics.

1.1.3 Economics as a Science

Science can be defined as the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment. The purpose of the experiment is to control for various factors that might vary when the environment is not managed as in an experiment. This allows the observer to focus on the relationship between a limited number of variables.

So, to be a science, economics must be founded on the notion of studying and explaining the world through observation and experimentation. Economics is classified as a social science since it is concerned with social phenomena such as the organization and operation of the household, firm, or society.

1.1.4 Providing Scientific Answers

As such, economics would attempt to provide answers to the questions posed above, for example, using scientific approaches that would yield similar answers in similar circumstances. In that sense, it attempts to bring the replicability of scientific experiments into the analysis and formulation of answers. This is accomplished by combining economic theory with the analysis of pre-existing relevant data using appropriate statistical techniques.

1.2 Data

1.2.1 Accounting

Most of the data used in econometrics is accounting data. Accounting data is routinely recorded data (that is, records of transactions) collected as part of market or institutional activities. Accounting data has many shortcomings, since it is not collected specifically for the purposes of the econometrician. Oftentimes the available data may not measure the phenomenon we are interested in explaining but a related one.

Of course this characterization is not universal. For example, we may be interested in the relationship of birth weight or mortality to the income level. In both cases, the variable we are looking at is not transactional and hence not accounting data.

1.2.2 Nonexperimental

Moreover, the data is typically nonexperimental. The econometrician does not have any control over nonexperimental data. In this regard economics is in much the same situation as meteorology and astronomy. This nonexperimental nature of the data causes various problems. Econometrics is largely concerned with how to manage these problems in such a fashion that we can have the same confidence in our findings as when experimental data is used.

Here again, the characterization is not universal. For example, there have been income maintenance experiments where the impact on labor effort in response to guaranteeing a base level of income was examined directly with application of a negative income tax to participants. And there are numerous laboratory experiments that have helped advance our understanding of game-theoretic equilibria.

1.2.3 Data Types

It is useful to give a typology of the different forms that economic data can take:

Definition 1.1. Time-series data are data that represent repeated observations on some variable in adjacent time periods. A time-series variable is often subscripted with the letter t. Examples are the familiar aggregate economic variables such as interest rates, gross domestic product, and the consumer price index, which are available at annual, quarterly, and monthly intervals. Stock prices are another good example, available at the tick interval.

Definition 1.2. Cross-sectional data are data that represent a set of observations of some variable at one specific instant for a number of agents or organizations. A cross-sectional variable is often indexed with the letter i. Most often, this data is microeconomic. For example, a firm's employment data at a particular point in time has observations for each employee.

Definition 1.3. Time-series cross-sectional data are data that are both time-series and cross-sectional. Such data is typically indexed by t for the time period and i for the agent.

Definition 1.4. A special case of time-series cross-sectional data is panel data. Panel data are observations on the same set of agents, or a panel, over time. A leading example is the Survey of Income and Program Participation, which provides information about the incomes of American individuals and households and their participation in income transfer programs.

1.2.4 Empirical Regularities

We need to look at the data to detect regularities or stylized facts. These regularities that persist across time or individuals beg for explanation of some sort. Why do they occur and why do they persist? An example would be the observation that the savings rate in the U.S. was very low during the record expansion of the nineties and rose during the subsequent great recession. In economics we seek to explain the regularities with models.

1.3 Models

1.3.1 Simplifying Assumptions

Models are simplifications of the real world. As such they introduce simplifying assumptions that hold a number of factors constant and allow the analysis of the relationship between a limited number of variables. These simplifying assumptions are the factors that would be controlled if we were conducting a controlled experiment to verify the model.

Models are sometimes criticized as unrealistic, but if they were realistic they would not be models; they would in fact represent reality. The real question is whether the simplifications introduced by the assumptions are useful for dealing with the phenomena we attempt to explain. Hopefully, we will provide some metric of what is a useful model.

1.3.2 Economic Theory

As a result of optimizing behavior by consumers, producers, or other agents in the economy, economic theory will suggest how certain variables will be related. For example, demand for a consumer good can be represented as a function of its own price, prices of complementary goods, prices of competing goods, and the income level of the consumer.

1.3.3 Equations

Economic models are usually stated in terms of one or more equations. For example, in a simple macroeconomic model we might represent C_t (aggregate consumption at time t) as a linear function of D_t (aggregate disposable income):

\[ C_t = \alpha + \beta D_t \]

where α and β are parameters of interest whose values are the targets of inquiry. The choice of the linear model is usually a simplification, but the variables that go into the model will typically follow directly from economic theory.

1.3.4 Error Component

As an example, consider the following scatter plot of aggregate disposable income versus aggregate consumption for annual data over the period 1946-1958. The data reveal a clear increasing relationship between income and consumption. Moreover, average consumption, which is represented by the slope of a ray from the origin to a data point, declines as income increases. These both suggest that the linear relationship has a positive intercept and a slope less than one. However, the relationship is not exact for any choice of the parameters α and β.

Because none of our models are exact, we include an error component in our equations, usually denoted u_t. Thus we have

\[ C_t = \alpha + \beta D_t + u_t \]

This error term could be viewed as an approximation error, a misspecification error, or the result of errors in optimization by economic agents. In any event, the linear parametric relationship is not exact for any choice of the parameters.


Figure 1.1: Aggregate Consumption vs. Income, 1946-1958

1.4 Statistics

1.4.1 Stochastic Component

In order to conduct analysis of the model once the error term has been introduced, we assume that it is a random variable and subject to distributional restrictions. For example, we might assume that u_t is an i.i.d. variable with

\[ E[u_t \mid D_t] = 0, \qquad E[u_t^2 \mid D_t] = \sigma^2. \]

Again, this is a modeling assumption and may be a useful simplification or not. We have combined economic theory to explain the systematic part of the behavior and statistical theory to characterize the unsystematic part.

1.4.2 Statistical Analysis

The introduction of the stochastic assumptions on the error component allows the use of statistical techniques to perform various tasks in analyzing the combined model. This analysis will involve statistical inference, whereby we use the distributional behavior implied by our economic model and statistical assumptions to make inferences about properties of the underlying model based on the observed behavior generated by the model.


Implicitly, in the conditional expectations assumptions used in the aggregate consumption example above, the explanatory variable is taken as fixed or nonstochastic. Sometimes the explanatory variables are just as obviously random as the dependent variables, in which case we will need to make stochastic assumptions regarding their joint behavior. This will sometimes complicate matters considerably.

1.4.3 Tasks

In the sequel we will concern ourselves with three main tasks in applying statistical analysis to the model in question:

1. Estimating the parameters of the model. These parameters would include the parameters of the systematic part of the model and the parameters of the distribution of the error component. These estimates are typically point estimates, though they can be interval estimates.

2. Hypothesis tests concerning the behavior of the model. This is where the approach really becomes scientific. We might be interested in whether accumulation of wealth is an adequate explanation for the low saving rate situation mentioned above. This could be tested using an hypothesis test.

3. Forecasting the behavior of the dependent variable outside the sample using the estimated model. Such forecasts can be point forecasts, such as expected behavior, or interval forecasts, which give a range of values with an attached probability of occurrence. Usually, these forecasts will be conditional on knowledge of the values of the explanatory variables.

1.5 Econometrics

Econometrics involves the integration of these three components: data, economic theory, and statistical theory to model an economic phenomenon. Since our data is usually nonexperimental, econometrics makes use of economic and statistical theory to adjust for the lack of proper data. Economic theory allows us to focus on a reduced set of variables and possible knowledge of how they interact. If we were conducting an experiment, we would be able to control the various other factors and focus on the variables whose relationship interests us. And statistical assumptions allow us to effectively control for the effect of other factors not suggested by economic theory. This integration of data, economic theory, and statistical theory to answer economic questions is the essence of econometrics.

1.5.1 Interactions

Often the phenomenon we are seeking to explain is a set of stylized facts that characterize how the various economic variables move together. This may lead to economic theories that possibly explain these stylized facts. The question of whether or not the theory adequately explains the data is measured through the application of appropriate statistical techniques to estimate the model based on the available data. Hopefully, there are hypotheses (behaviors) that are predicted by the economic theory that can be tested using statistical techniques. Based on the outcome of this exercise, the theory may need modification, or additional or different data may need to be gathered to address the problem at hand.

Economists speak of "bringing the model to the data" in this exercise. It is the responsibility of the econometrician to choose or develop an appropriate statistical technique that matches the needs of the data and the theory. As was mentioned above, economic data are largely nonexperimental. And our economic models typically have multiple variables determined simultaneously by the model. Consequently, we focus on and develop techniques for dealing with these properties and others that may need to be handled.

Figure 1.2: Interactions

Some of this interaction is illustrated in Figure 1.2. Observations on aggregate consumption and aggregate disposable income show a generally positive relationship in the plot in the Data circle. This leads to the Duesenberry-type linear consumption equation C_t = α + βD_t in the Theory circle. Since the model does not exactly fit the data, an error term u_t is appended. Assumptions are made about the statistical properties of the observable data and the error term, and an appropriate statistical technique is applied to obtain the estimated model, which is represented as the fitted green line in the Statistics circle. There are feedbacks between the three circles, with the needs or results of the statistical analysis leading to modifications in the data and/or theory. Likewise, the theoretical analysis may lead to changes in which and what type of data is gathered.

1.5.2 Scientific Method

In the scientific method, we take an hypothesis, construct an experiment with which to test the hypothesis, and based on the results of the experiment we either reject the hypothesis or not. The purpose of the experiment is to control for various factors that might impact the outcome but are not of interest. This allows us to focus our attention on the relationship between the variables of interest that result from the experiment. Based on the outcomes for these variables of interest, we either reject or not.

In econometrics, we let the systematic model and the statistical model, or some specified component of them, play the role of the hypothesis. The available data, ex post, plays the role of the outcome of an experiment. Statistical analysis of the data using the combined model enables us to reject the hypothesis or not. So econometrics is what allows us to use the scientific method and hence is critical in economics being treated as a science.

1.5.3 Choosing A Model

In testing an hypothesis above, we are implicitly choosing between a model which satisfies the hypothesis and one which doesn't. So the hypothesis test implicitly is a criterion for choosing between models. If we reject the hypothesis, then we are choosing the model which doesn't satisfy the hypothesis. This is a clear-cut decision framework. In any event, we are choosing the model which best agrees with the data.

Sometimes, we are faced with a situation where the data do not give a strong preference between two competing models. In this case, the simplest model that fits the data is often the best choice. This is known as using Occam's razor.

1.6 Example

As an illustration of the concepts discussed above, we now consider an extended example. A question of interest that was mentioned above was the falling saving rate during the "Great Expansion" of the 90's. In fact, as Figure 1.3 reveals, the saving rate began to trend downward in 1976 and continued to fall, with cyclical fluctuations, until about 2005. There was a great deal of alarm toward the end of the great expansion concerning the long-run implications of such a low saving rate for the growth of the aggregate economy, due to the link between aggregate investment and saving. It was suggested that we had a new, more profligate generation of consumers that was the source of this trend.

Figure 1.3

We will attempt to model aggregate consumption and hence saving behavior and see if the observed movements can be explained. We will not conduct a formal specification search at this point but just "eye-ball" how well our model matches up with the data. When confronted with a poor match, we will use economic theory to suggest additional refinements for the model. We will estimate each of the alternative models using least squares regression to obtain parameter estimates and thereby generate a series of fitted values to compare to the actual values that occurred. In the sequel, we will study the statistical theory and application of least squares and how to formally choose between models.

From before, consider first the Duesenberry consumption function

\[ C_t = \alpha + \beta D_t + u_t \]

where C_t is aggregate consumption expenditure at time t and D_t is aggregate disposable income. Substituting this into the saving identity S_t = D_t − C_t, we obtain

\[ S_t = -\alpha + (1 - \beta)D_t - u_t \]

and the saving rate is given by

\[ S_t/D_t = (1 - \beta) - \alpha(1/D_t) - u_t/D_t. \]

Ignoring the error term, for 1 > β > 0 and α > 0, as D_t increases over time we would expect the saving rate to rise, as happened between 1947-1974. If α < 0 then the saving rate will fall, as happened for 1975-2005.

We will now quantify and verify this explanation with regression analysis. Using annual data for the period 1947-2014 yields the following (standard errors in parentheses):

\[ C_t = \underset{(11.959)}{-49.949} + \underset{(0.00218)}{0.91607}\, D_t + e_t \]

with SSE = 331545 and R² = 0.99962, where e_t is the regression residual. Plugging these estimates of α and β into the saving rate equation above and evaluating over the sample yields the following plot:

Although the fitted values decline steadily during the period 1975-2005, they do not match the historical rise during the period 1947-1975. We also present results for regressions over the two sub-samples 1947-1975 and 1976-2014. These sub-sample results capture the respective trends quite well. Obviously something has changed between the two periods.
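For readers who want to see the mechanics, here is a minimal sketch of the bivariate regression in Python. The data are synthetic stand-ins generated to mimic the fitted equation above; the text's actual estimates use U.S. annual data for 1947-2014, which is not reproduced here.

```python
# Sketch of the bivariate (Duesenberry) regression C_t = alpha + beta*D_t + u_t.
# D and C below are simulated placeholders, not the actual aggregate series.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
D = np.linspace(1500.0, 13000.0, 68)           # stand-in for disposable income
C = -50.0 + 0.916 * D + rng.normal(0, 70, 68)  # stand-in for consumption

X = sm.add_constant(D)                         # regressors: [1, D_t]
fit = sm.OLS(C, X).fit()                       # least squares estimates

alpha_hat, beta_hat = fit.params
implied_saving_rate = (1 - beta_hat) - alpha_hat / D   # S_t/D_t, error ignored
print(fit.params, fit.bse, fit.ssr, fit.rsquared)      # estimates, s.e.'s, SSE, R^2
```

The same call with additional columns in X produces the trivariate and quadrivariate regressions reported below.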

In making the decision to save or increase wealth, consumers consider the current stock of wealth. If the stock is substantial, they are less inclined to save from current income. This leads to the following modified specification

\[ C_t = \alpha + \beta D_t + \gamma W_t + u_t \]

where W_t is a measure (household net wealth) of current wealth holdings. In fact, this specification was put forward in Ando and Modigliani's life-cycle model of aggregate consumption (with intercept suppressed). A regression on this specification yields:

\[ C_t = \underset{(7.78132)}{-27.528} + \underset{(0.01183)}{0.79601}\, D_t + \underset{(0.00198)}{0.02024}\, W_t + e_t \]

with SSE = 127240 and R² = 0.99986. Ignoring the error term and plotting the implied saving rate over the sample yields:


Figure 1.4: Bivariate Regression

Figure 1.5: Trivariate Regression


The saving rate values implied by this regression do a much better job matching the actual values for the last half of the sample but still miss badly for the 1946-1975 period. In particular, the fluctuations around the trend in the latter part of the sample are captured well. Regressions over the two sub-samples now capture much of both the trend and the cyclical movements for each.

An important factor in deciding between current consumption and saving, or future consumption, is the return on the saving. The opportunity cost of current consumption is the interest earned on saving and the resulting increase in future consumption. This leads to a further modification of the specification

\[ C_t = \alpha + \beta D_t + \gamma W_t + \delta R_t + u_t \]

where R_t is a measure (federal funds rate) of the interest rate. A regression on this specification yields:

\[ C_t = \underset{(15.127)}{-7.5759} + \underset{(0.01208)}{0.80123}\, D_t + \underset{(0.00202)}{0.01924}\, W_t - \underset{(1.8566)}{4.2058}\, R_t + e_t \]

with SSE = 110446 and R² = 0.99986. Since the intercept in this regression is insignificantly different from zero, we follow Ando-Modigliani and suppress the intercept to obtain:

\[ C_t = \underset{(0.01178)}{0.80008}\, D_t + \underset{(0.00200)}{0.01933}\, W_t - \underset{(1.08462)}{4.95789}\, R_t + e_t \]

with SSE = 110941 and R² = .99986. Note that this regression uses 8 fewer observations, for which R_t is not available. Plotting the saving rates implied by this regression yields:


Figure 1.6: Quadrivariate Regression

The fitted values from this regression do a much better job explaining the fluctuations in the saving rate. The rising saving rate during the 1955-1975 period is explained by the rising interest rate, and the falling saving rate during the 1976-2005 period is explained by the falling interest rate and the increases in wealth. Far from being a puzzle, the declining saving rate during this period is very predictable. Note that the two sub-sample regressions yield similar plots, particularly for the second sub-sample.


Chapter 2

Some Useful Distributions

Usually, interest in an economic phenomenon is piqued from viewing sample moments of some important variable. So naturally, we will give a great deal of attention to sample moments in the sequel. Ultimately, though, to understand the behavior of such sample moments in either large or small samples we will need to resort to distributional theory. We will only consider continuous distributions in this chapter, but the discrete cases are easily handled in a similar fashion. For reasons that become apparent, the normal distribution and distributions associated with it play a central role in much of our statistical analysis in both small and large samples.

2.1 Introduction

2.1.1 Inference

As statisticians, we are often called upon to answer questions or make statements concerning certain random variables. For example: is a coin fair (i.e., is the probability of heads 0.5)? What is the expected value of GNP for the quarter?

Typically, answering such questions requires knowledge of the underlying (population) distribution of the random variable. Unfortunately, we usually do not know this distribution (although we may have a strong idea, as in the case of the coin).

In order to gain knowledge of the distribution, we draw several realizations of the random variable. The notion is that the observations in this sample contain information concerning the population distribution.

Definition 2.1. The process by which we make statements concerning the population distribution based on the sample observations is called inference. □

Example 2.1. We decide whether a coin is fair by tossing it several times and observing whether it seems to be heads about half the time. □


2.1.2 Random Samples

Definition 2.2. Suppose we draw n observations of a random variable, denoted {x₁, x₂, ..., xₙ}, and each x_i is independent and has the same (marginal) distribution; then {x₁, x₂, ..., xₙ} constitute a simple random sample. □

Example 2.2. We toss a coin three times. Supposedly, the outcomes are independent. If x_i counts the number of heads for toss i, then we have a simple random sample. □

Note that not all samples are simple random samples. The observations can be either non-independent or non-identically distributed or both.

Example 2.3. We are interested in the income level for the population in general. The n observations available in this class are not identically distributed since the higher-income individuals will tend to be more variable. □

Example 2.4. Consider the aggregate consumption level. The n observations available in this set are not independent since a high consumption level in one period is usually followed by a high level in the next. □

2.1.3 Sample Statistics

Definition 2.3. Any function of the observations in the sample which is the basis for inference is called a sample statistic. □

Example 2.5. In the coin tossing experiment, let S count the total number of heads and P = S/3 the sample proportion of heads. Both S and P are sample statistics. □

2.1.4 Sample Distributions

A sample statistic is a random variable; its value will vary from one experiment to another. As a random variable, it is subject to a distribution.

Definition 2.4. The distribution of the sample statistic is the sample distribution of the statistic. □

Example 2.6. The statistic S introduced above has a binomial sample distribution. Specifically, Pr(S = 0) = 1/8, Pr(S = 1) = 3/8, Pr(S = 2) = 3/8, and Pr(S = 3) = 1/8. □

2.2 The Sample Mean

2.2.1 Sample Mean

Definition 2.5. Given a sample {x₁, x₂, ..., xₙ} of observations on a random variable x, then
\[ \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i \]


is the sample mean or average of the sample. □

This form plays a predominant role in statistics and econometrics. All sample moments can be expressed as sample averages of functions of the data. And, at least in large samples, the behavior of many estimators and statistics depends on their relationship to sample averages. Consequently, the behavior of this form is well studied and well known.

2.2.2 Sample Sum

Consider the simple random sample {x₁, x₂, ..., xₙ}, where x measures, say, the height of an adult female. We will assume that E[x_i] = µ and Var(x_i) = σ² for all i = 1, 2, ..., n.

Let S = \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n denote the sample sum. It follows that
\[ E[S] = E[x_1 + x_2 + \cdots + x_n] = E[x_1] + E[x_2] + \cdots + E[x_n] = n\mu. \]
Moreover,
\[
\begin{aligned}
\mathrm{Var}(S) &= E[(S - E[S])^2] \\
&= E\Big[\Big(\sum_{i=1}^{n} x_i - n\mu\Big)^2\Big] \\
&= E\Big[\Big(\sum_{i=1}^{n} (x_i - \mu)\Big)^2\Big] \\
&= E\Big[\sum_{i=1}^{n}\sum_{j=1}^{n} (x_i - \mu)(x_j - \mu)\Big] \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} E[(x_i - \mu)(x_j - \mu)] \\
&= n\sigma^2, \qquad (2.1)
\end{aligned}
\]
since E[(x_i − µ)(x_i − µ)] = σ² for i = 1, 2, ..., n and E[(x_i − µ)(x_j − µ)] = 0 for i ≠ j, by independence.

2.2.3 Moments Of The Sample Mean

The mean of the sample mean is
\[ E[\bar{x}_n] = E[S/n] = E[S]/n = n\mu/n = \mu. \qquad (2.2) \]
The variance of the sample mean is
\[
\begin{aligned}
\mathrm{Var}(\bar{x}_n) &= E[(\bar{x}_n - \mu)^2] \\
&= E[(S/n - \mu)^2] \\
&= E[((S - n\mu)/n)^2] \\
&= E[(S - n\mu)^2]/n^2 \\
&= n\sigma^2/n^2 \\
&= \sigma^2/n. \qquad (2.3)
\end{aligned}
\]
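Equations (2.2) and (2.3) are easy to check by simulation. The sketch below is not from the text; the values of µ, σ, n, and the replication count are arbitrary choices.

```python
# Verify E[xbar_n] = mu and Var(xbar_n) = sigma^2/n by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 25, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)          # one sample mean per replication

print(xbar.mean())                   # close to mu = 2.0
print(xbar.var())                    # close to sigma^2/n = 9/25 = 0.36
```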


2.2.4 Sampling Distribution

We have been able to establish the mean and variance of the sample mean. However, in order to know its complete distribution precisely, we must know the probability density function (pdf) of the random variable x. This distribution is easier to obtain for some distributions than others.

2.3 The Normal Distribution

2.3.1 Density Function

Definition 2.6. A continuous random variable x with the density function
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2} \qquad (2.4) \]
follows the normal distribution with mean µ and variance σ², and is denoted x ∼ N(µ, σ²). □

The normal density function is the familiar "bell-shaped" curve, as shown in Figure 2.1 for µ = 0 and σ² = 1. It is symmetric about the mean µ, which therefore indicates the center or location of the distribution. The standard deviation σ indicates the spread of the distribution, with approximately 2/3 of the probability mass lying within ±σ of µ and about .95 lying within ±2σ. The density has inflection points at µ ± σ. We say the normal is characterized by its mean and variance, since the complete behavior of the density depends only on these parameters.

The normal distribution plays a central role in econometrics and statistics, where it is oftentimes referred to as the "Gaussian" distribution. There are at least three reasons given to justify this prominence. First, there are numerous examples of random variables that have this shape or something close to it. Second, it is analytically tractable, as we will see below. Third, and perhaps most important, sample averages are well approximated by the normal distribution in large samples.

2.3.2 Linear Transformation

Consider the linearly transformed random variable
\[ y = a + bx \]
where E[x] = µ_x and Var(x) = σ_x². It is easy to see that
\[ E[y] = a + b\mu_x = \mu_y \]
and
\[ \mathrm{Var}(y) = E[(y - \mu_y)^2] = b^2\sigma_x^2 = \sigma_y^2. \]
Moreover, if x is normally distributed, then it turns out that y ∼ N(µ_y, σ_y²).


Figure 2.1: The Standard Normal Distribution

Now, consider the linear combination of two independent random variables
\[ y = a + bx + cz \]
where E[x] = µ_x, Var(x) = σ_x², E[z] = µ_z, and Var(z) = σ_z². Then
\[ E[y] = a + b\mu_x + c\mu_z = \mu_y \]
and
\[ \mathrm{Var}(y) = b^2\sigma_x^2 + c^2\sigma_z^2 = \sigma_y^2. \]
Similarly to the above, if x ∼ N(µ_x, σ_x²) and z ∼ N(µ_z, σ_z²) are independent, then it can be shown that y ∼ N(µ_y, σ_y²). Note that this result means that any linear combination of any number of independent normals is itself normal.

These results concerning the preservation of normality for linear transformations and linear combinations of normals will be formally demonstrated in a more general setting in the next chapter.

2.3.3 Distribution Of The Sample Mean

If, for each i = 1, 2, ..., n, the x_i's are independent, identically distributed normal random variables with mean µ and variance σ², written x_i ∼ i.i.d. N(µ, σ²), then
\[ \bar{x}_n = \frac{1}{n}x_1 + \frac{1}{n}x_2 + \cdots + \frac{1}{n}x_n \sim N(\mu, \sigma^2/n). \]


2.3.4 The Standard Normal

The distribution of x̄_n will vary with different values of µ and σ², which is inconvenient. Rather than dealing with a unique distribution for each case, we perform the following transformation:
\[ z = \frac{\bar{x}_n - \mu}{\sqrt{\sigma^2/n}} = \frac{\sqrt{n}}{\sigma}(\bar{x}_n - \mu). \]

Now,
\[ E[z] = E\Big[\frac{\sqrt{n}}{\sigma}(\bar{x}_n - \mu)\Big] = \frac{\sqrt{n}}{\sigma}E[\bar{x}_n - \mu] = 0. \]
Also,
\[ E[z^2] = E\Big[\Big(\frac{\sqrt{n}}{\sigma}(\bar{x}_n - \mu)\Big)^2\Big] = \frac{n}{\sigma^2}E[(\bar{x}_n - \mu)^2] = \frac{n}{\sigma^2}\cdot\frac{\sigma^2}{n} = 1. \]
Thus z ∼ N(0, 1). This transformation is commonly called the "z-transformation"; if we know the variance σ², it can be used to test the mean.

Definition 2.7. The N(0, 1) distribution is the standard normal and is well tabulated. The probability density function for the standard normal distribution evaluated at point z is
\[ f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2} = \phi(z). \qquad (2.5) \]
□

Example 2.7. Suppose that x_i ∼ i.i.d. N(µ, σ²) with σ² known and we seek to test the null hypothesis H₀: µ = µ₀ against the alternative H₁: µ = µ₁ ≠ µ₀. Then
\[ z = \frac{\bar{x}_n - \mu_0}{\sqrt{\sigma^2/n}} \sim N(0, 1) \]
under the null hypothesis but will have nonzero mean (µ₁ − µ₀)/√(σ²/n) with unit variance under the alternative. Choosing a size α = .05 and corresponding critical value 1.96 for a two-sided test, we have a .05 chance of obtaining a realization of z that exceeds the critical value under the null but an increased chance under the alternative due to the shifted distribution. The behavior under the alternatives will be treated carefully in the next chapter. □
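A minimal sketch of this two-sided z-test on simulated data follows; the values of µ₀, σ, and n are arbitrary, and the data are drawn with a true mean away from µ₀ so the shift under the alternative is visible.

```python
# Two-sided z-test of H0: mu = mu0 with known variance, as in Example 2.7.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu0, sigma, n = 0.0, 2.0, 50
x = rng.normal(0.5, sigma, n)        # true mean is 0.5, so H0 is false here

z = (x.mean() - mu0) / np.sqrt(sigma**2 / n)
reject = abs(z) > norm.ppf(0.975)    # 1.96 critical value for size .05
print(z, reject)
```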


2.4 The Central Limit Theorem

2.4.1 Normal Theory

As stated above, one of the reasons for the prominence of the normal distribution is that most any sample mean appears normal as the sample size increases. Subject to a few weak conditions, this result applies for the average of even highly non-normal distributions. Specifically, we have

Theorem 2.1. Suppose (i) {x₁, x₂, ..., xₙ} is a simple random sample, (ii) E[x_i] = µ, and (iii) E[(x_i − µ)²] = σ²; then as n → ∞, the distribution of
\[ z = \frac{\bar{x}_n - \mu}{\sqrt{\sigma^2/n}} \]
converges to the standard normal. □

What this means is that, for f_n(·) the probability density function (pdf) of the standardized mean z,
\[ \lim_{n \to \infty} f_n(c) = \phi(c) \qquad (2.6) \]
point-wise for any point of evaluation c. Likewise, for F_n(·) denoting the analogous cumulative distribution function (cdf) of z and Φ(c) denoting the cdf of the standard normal,
\[ \lim_{n \to \infty} F_n(c) = \Phi(c) \]
point-wise at any point of evaluation c. This is the Lindeberg-Lévy form of the central limit theorem and will be treated more extensively in Chapter 4. Notationally, we write
\[ z = \frac{\bar{x}_n - \mu}{\sqrt{\sigma^2/n}} = \frac{\sqrt{n}(\bar{x}_n - \mu)}{\sigma} \to_d N(0, 1) \]
or equivalently √n(x̄_n − µ) →_d N(0, σ²).

Example 2.8. The panels in Figure 2.2 demonstrate the power of the result for an x drawn from a triangular distribution, normalized to have mean zero and unit variance. We draw 10,000,000 replications of samples of size 25. The first panel displays a histogram of the first observation of each replication; this is the triangular distribution we are working with. It has limited support and is highly asymmetric. The second panel is a histogram of the z-transformation of the means for the 10,000,000 replications. At a sample size of 25 it is clear that the standard normal distribution is already a very good approximation to the distribution. □
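A scaled-down version of this experiment can be run as below. This sketch is not the text's actual code: it uses 100,000 replications rather than 10,000,000 and assumes a right-triangular distribution on [0, 1], standardized analytically to mean zero and unit variance.

```python
# Central limit theorem demonstration with a triangular distribution.
import numpy as np

rng = np.random.default_rng(2)
reps, n = 100_000, 25

# Triangular on [0, 1] with mode 1: mean 2/3, variance 1/18. Standardize.
x = rng.triangular(0.0, 1.0, 1.0, size=(reps, n))
x = (x - 2.0 / 3.0) * np.sqrt(18.0)

z = np.sqrt(n) * x.mean(axis=1)      # z-transformation with mu = 0, sigma = 1
print(z.mean(), z.var())             # close to 0 and 1
# A histogram of z is already close to the standard normal density at n = 25.
```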


Figure 2.2: Central Limit Theorem and Triangular Distribution

2.5 Distributions Associated With The Normal Distribution

2.5.1 The Chi-Squared Distribution

Definition 2.8. Suppose that {z₁, z₂, ..., z_m} is a simple random sample and z_i ∼ i.i.d. N(0, 1). Then
\[ w = \sum_{i=1}^{m} z_i^2 \sim \chi^2_m, \]
or w has the chi-squared distribution with m degrees of freedom. □

The probability density function for the χ²_m distribution is
\[ f_{\chi^2}(x, m) = \frac{1}{2^{m/2}\,\Gamma(m/2)}\, x^{m/2-1} e^{-x/2}, \quad x > 0, \qquad (2.7) \]
where Γ(·) is the gamma function. See Figure 2.3 for plots of the chi-squared distribution for various degrees of freedom. The mean of w is m, the variance is 2m, and the mode is max(m − 2, 0). Note how the distribution begins to attain the familiar bell shape but is stretched out to the right as the degrees of freedom grow larger. This is a manifestation of the central limit theorem, since w/m would be an average. In fact, (w − m)/√(2m) is approximately standard normal for large m.

The chi-squared distribution will prove useful in testing hypotheses on both the variance of a single variable and the (conditional) means of several. Typically, we are only interested in the content of the right-hand tail in conducting tests, for reasons that will become apparent with the multivariate usage, which will be explored in the next chapter. But the left-hand tail could be interesting if we are testing whether the true variance is a specific value, say σ₀², and are interested in the possibility that the true value is either larger or smaller than this value.

Figure 2.3: Some Chi-Squared Distributions

Example 2.9. Consider the unbiased estimate of σ²,
\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}; \]
then for x_i ∼ i.i.d. N(µ, σ²), we can show that
\[ \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}, \qquad (2.8) \]
independent of x̄. □
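A short simulation (not from the text) makes (2.8) concrete: (n − 1)s²/σ² should have mean n − 1 and variance 2(n − 1). The parameter values below are arbitrary.

```python
# Monte Carlo check that (n-1)s^2/sigma^2 behaves like chi-squared with n-1 df.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)           # unbiased variance estimate per sample
w = (n - 1) * s2 / sigma**2

print(w.mean(), w.var())             # close to n-1 = 9 and 2(n-1) = 18
```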

2.5.2 The t Distribution

Definition 2.9. Suppose that z ∼ N(0, 1), y ∼ χ²_m, and that z and y are independent. Then
\[ \frac{z}{\sqrt{y/m}} \sim t_m, \qquad (2.9) \]
where m is the degrees of freedom of the t distribution. □


Figure 2.4: Some t Distributions

The probability density function for a t random variable with m degrees of freedom is
\[ f_t(x, m) = \frac{\Gamma\big(\frac{m+1}{2}\big)}{\sqrt{m\pi}\,\Gamma\big(\frac{m}{2}\big)}\Big(1 + \frac{x^2}{m}\Big)^{-(m+1)/2}, \qquad (2.10) \]
for −∞ < x < ∞. See Figure 2.4 for plots of the distribution for various choices of m.

The t (also known as Student's t) distribution is named after William Gosset, who published under the pseudonym "Student." It is useful in testing hypotheses concerning the (conditional) mean when the variance is estimated. Note that the distribution is symmetric around zero and becomes less fat-tailed as the degrees of freedom increase. In fact, as m → ∞, the distribution approaches the standard normal. This just reflects the fact that the denominator y/m converges in probability to one.

Example 2.10. Consider the sample mean from a simple random sample of normals. We know that x̄ ∼ N(µ, σ²/n) and
\[ z = \frac{\bar{x} - \mu}{\sqrt{\sigma^2/n}} \sim N(0, 1). \]
Also, we know that
\[ y = (n-1)\frac{s^2}{\sigma^2} \sim \chi^2_{n-1}, \]
where s² is the unbiased estimator of σ². Thus, if z and y are independent (which, in fact, is the case), then
\[ \frac{z}{\sqrt{y/(n-1)}} = \frac{\dfrac{\bar{x}-\mu}{\sqrt{\sigma^2/n}}}{\sqrt{(n-1)\dfrac{s^2}{\sigma^2}\Big/(n-1)}} = \frac{\bar{x}-\mu}{\sqrt{(\sigma^2/n)(s^2/\sigma^2)}} \qquad (2.11) \]
\[ = \frac{\bar{x}-\mu}{\sqrt{s^2/n}} \sim t_{n-1}. \qquad (2.12) \]
□
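The studentized mean in (2.12) can likewise be checked by simulation. In the sketch below (not from the text), tail frequencies of the statistic are compared with exact t_{n−1} tail probabilities; the parameter values are arbitrary.

```python
# Monte Carlo check that (xbar - mu)/sqrt(s^2/n) follows t with n-1 df.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 1.5, 8, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
tstat = (x.mean(axis=1) - mu) / np.sqrt(x.var(axis=1, ddof=1) / n)

c = 2.0                                      # an arbitrary cutoff
print((np.abs(tstat) > c).mean())            # simulated two-sided tail frequency
print(2 * t.sf(c, df=n - 1))                 # exact t_{n-1} tail probability
```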

2.5.3 The F Distribution

Definition 2.10. Suppose that y ∼ χ²_m, w ∼ χ²_n, and that y and w are independent. Then
\[ \frac{y/m}{w/n} \sim F_{m,n}, \qquad (2.13) \]
where m and n are the degrees of freedom of the F distribution. □

The probability density function for an F random variable with m and n degrees of freedom is
\[ f_F(x, m, n) = \frac{\Gamma\big(\frac{m+n}{2}\big)(m/n)^{m/2}}{\Gamma\big(\frac{m}{2}\big)\Gamma\big(\frac{n}{2}\big)}\; \frac{x^{(m/2)-1}}{(1 + mx/n)^{(m+n)/2}}. \qquad (2.14) \]

The F distribution is named after the great statistician Sir Ronald A. Fisher and is used in many applications, most notably in the analysis of variance. This situation will arise when we seek to test multiple (conditional) mean parameters with estimated variance. Note that when x ∼ t_n then x² ∼ F_{1,n}. Also note that as the denominator degrees of freedom grows large, which is frequently the case, the denominator converges in probability to one and the ratio is approximated by the behavior of the numerator: a chi-squared divided by its degrees of freedom. Some plots of the F distribution for various choices of numerator and denominator degrees of freedom can be seen in Figure 2.5.


Figure 2.5: Some F Distributions


Chapter 3

Multivariate Distributions

In economics we are typically interested in relationships that involve a number of variables: for example, aggregate consumption, disposable income, and wealth in a macroeconomic context, or a list of stock prices that make up a portfolio in a microeconomic situation. Consequently, the statistical analysis will of necessity be multivariate in nature. In this chapter we will develop a number of concepts that prove useful in multivariate statistics. As in the univariate case, the normal distribution plays a central role in multivariate statistics and will be fully developed. In this treatment, the list of random variables of interest will be arrayed into a vector for notational convenience.

3.1 Matrix Algebra Of Expectations

3.1.1 Expectations

Consider an m × 1 vector of random variables
\[ x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}. \]

Each element of the vector is a scalar random variable of the type discussed in the previous chapter. Since we have multiple random variables, probability statements regarding more than one of the random variables must be based on the joint density
\[ f(x; \theta) = f(x_1, x_2, \ldots, x_m; \theta_1, \theta_2, \ldots, \theta_p), \]
where θ′ = (θ₁, θ₂, ..., θ_p) is a p × 1 vector of shape parameters of the density. Oftentimes the probability statement of interest can be represented as the expectation of a function of the variables, say g(x), which is given by
\[ E[g(x)] = \int g(x)\, f(x; \theta)\, dx = \int\!\Big[\int\!\Big[\cdots\int g(x_1, x_2, \ldots, x_m)\, f(x; \theta)\, dx_1\Big] dx_2\Big] \cdots dx_m. \]

3.1.2 Moments of Random Vectors

The leading example of such expectations is the moments of the random vector x. For many densities, such as the multivariate normal studied below, some or all of the parameters of the density may be interpreted as moments of the data.

The first moment or expectation of the random vector is
\[ E[x] = \begin{pmatrix} E[x_1] \\ E[x_2] \\ \vdots \\ E[x_m] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} = \mu. \qquad (3.1) \]
Note that µ is also an m × 1 column vector. We see that the mean of the vector is the vector of the means.

Next, we evaluate the centered second moments:
\[
\begin{aligned}
E[(x-\mu)(x-\mu)'] &= E\begin{pmatrix} (x_1-\mu_1)^2 & (x_1-\mu_1)(x_2-\mu_2) & \cdots & (x_1-\mu_1)(x_m-\mu_m) \\ (x_2-\mu_2)(x_1-\mu_1) & (x_2-\mu_2)^2 & \cdots & (x_2-\mu_2)(x_m-\mu_m) \\ \vdots & \vdots & \ddots & \vdots \\ (x_m-\mu_m)(x_1-\mu_1) & (x_m-\mu_m)(x_2-\mu_2) & \cdots & (x_m-\mu_m)^2 \end{pmatrix} \\
&= \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{m1} & \sigma_{m2} & \cdots & \sigma_{mm} \end{pmatrix} = \Sigma. \qquad (3.2)
\end{aligned}
\]
The matrix Σ is an m × m matrix of variance and covariance terms called the covariance matrix. The variance σ_i² = σ_ii of x_i is along the diagonal, while the off-diagonal cross-product terms σ_ij = E[(x_i − µ_i)(x_j − µ_j)] represent the covariance between x_i and x_j.


3.1.3 Properties Of The Covariance Matrix

3.1.3.1 Symmetric

The variance-covariance matrix Σ is a symmetric matrix. This can be shown by noting that
\[ \sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)] = E[(x_j - \mu_j)(x_i - \mu_i)] = \sigma_{ji}. \]
Due to this symmetry, Σ will have only m(m + 1)/2 unique elements.

3.1.3.2 Positive Semidefinite

Σ is a positive semidefinite matrix. Recall that any symmetric m × m matrix is positive semidefinite if and only if it meets any of the following three equivalent conditions:

1. λ′Σλ ≥ 0 for all m × 1 vectors λ ≠ 0;

2. Σ = PP′ for some m × m matrix P;

3. all the principal minors are nonnegative.

The first condition is the usual definition of positive semidefiniteness and the latter two are equivalent to it. The last condition (actually, we use negative definiteness) is useful in the study of utility maximization, while the first two are useful in econometric analysis.

The necessity of the first condition for a covariance matrix is the easiest to demonstrate. Let λ ≠ 0. Then we have
\[
\begin{aligned}
\lambda'\Sigma\lambda &= \lambda' E[(x-\mu)(x-\mu)']\lambda \\
&= E[\lambda'(x-\mu)(x-\mu)'\lambda] \\
&= E[\{\lambda'(x-\mu)\}^2] \ge 0,
\end{aligned}
\]
since the term inside the expectation is a quadratic. Hence, Σ is a positive semidefinite matrix. The expression λ′Σλ = \sum_{i=1}^{m}\sum_{j=1}^{m} \sigma_{ij}\lambda_i\lambda_j is called a quadratic form in λ with Σ as the weight matrix, since it involves the squares and cross-products of λ′ = (λ₁, λ₂, ..., λ_m).

Note that P satisfying the second relationship is not unique. Let D be any m × m orthonormal matrix; then DD′ = I_m and P* = PD yields P*P*′ = PDD′P′ = PI_mP′ = Σ. A consequence of this decomposition is det(Σ) = det(PP′) = det(P)·det(P′) = det(P)² ≥ 0. Usually, we will choose P to be an upper or lower triangular matrix with m(m + 1)/2 nonzero elements, which will then make such a P unique if Σ is nonsingular. If the elements of Σ are all real and we choose lower triangularity, then Σ = PP′ is the Cholesky decomposition of the matrix.
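A small numeric sketch of the decomposition Σ = PP′ with P lower triangular, using NumPy's Cholesky routine; the covariance matrix below is an arbitrary positive definite example, not one from the text.

```python
# Cholesky factorization: Sigma = P P' with P lower triangular.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.6],
                  [2.0, 3.0, 0.4],
                  [0.6, 0.4, 1.0]])               # arbitrary positive definite matrix

P = np.linalg.cholesky(Sigma)                     # lower triangular factor
print(np.allclose(P @ P.T, Sigma))                # True: Sigma = P P'
print(np.linalg.det(P)**2, np.linalg.det(Sigma))  # det(Sigma) = det(P)^2
```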


Necessity of the third condition follows directly. The principal minors are the determinants of all the principal submatrices obtained by symmetrically deleting trailing rows and columns. Since such submatrices are also covariance matrices, they must also be positive semidefinite and have a nonnegative determinant.

3.1.3.3 Positive Definite

The matrix Σ is positive definite if and only if λ′Σλ > 0 for all λ ≠ 0. But this means λ′Σλ = λ′PP′λ = (P′λ)′(P′λ) > 0, and hence P′λ ≠ 0 for all λ ≠ 0. Thus det(P) ≠ 0 and det(Σ) = det(PP′) = det(P)·det(P′) = det(P)² > 0. And all the principal minors will be positive.

3.1.4 Linear Transformations

Let y = b + Bx, where y and b are m × 1 vectors and B is m × m. Then
\[ E[y] = b + BE[x] = b + B\mu = \mu_y. \qquad (3.3) \]
Thus, the mean of a linear transformation is the linear transformation of the mean.

Next, we have
\[
\begin{aligned}
E[(y-\mu_y)(y-\mu_y)'] &= E\{[B(x-\mu)][B(x-\mu)]'\} \\
&= BE[(x-\mu)(x-\mu)']B' \\
&= B\Sigma B' \qquad (3.4) \\
&= \Sigma_y, \qquad (3.5)
\end{aligned}
\]
where we use the result (ABC)′ = C′B′A′, provided conformability holds.

3.2 Change Of Variables

3.2.1 Univariate

Let x be a scalar random variable and f_x(·) be the probability density function of x. Now, define y = h(x), where
\[ h'(x) = \frac{d\,h(x)}{dx} > 0. \]
That is, h(x) is a strictly monotonically increasing function, and so y is a one-to-one transformation of x. Now, we would like to know the probability density function of y, f_y(y). To find it, we note that

\[ \Pr(y \le h(a)) = \Pr(x \le a), \qquad (3.6) \]
\[ \Pr(x \le a) = \int_{-\infty}^{a} f_x(x)\, dx = F_x(a), \qquad (3.7) \]
and
\[ \Pr(y \le h(a)) = \int_{-\infty}^{h(a)} f_y(y)\, dy = F_y(h(a)), \qquad (3.8) \]
for all a.

Assuming that the cumulative distribution function is differentiable, we use (3.6) to combine (3.7) and (3.8), and take the total differential, which gives us
\[ dF_x(a) = dF_y(h(a)) \]
\[ f_x(a)\, da = f_y(h(a))\, h'(a)\, da \]
for all a. Thus, for small perturbations,
\[ f_x(a) = f_y(h(a))\, h'(a) \qquad (3.9) \]
for all a. Also, since y is a one-to-one transformation of x, we know that h(·) can be inverted for any y. Choosing y = h(a) and hence a = h⁻¹(y), we can rewrite (3.9) as
\[ f_x(h^{-1}(y)) = f_y(y)\, h'(h^{-1}(y)). \]
Therefore, the probability density function of y is
\[ f_y(y) = f_x(h^{-1}(y))\, \frac{1}{h'(h^{-1}(y))}. \qquad (3.10) \]
Note that f_y(y) has the property of being nonnegative, since h′(·) > 0. If h′(·) < 0, (3.10) can be corrected by taking the absolute value of h′(·), which will assure that we have only positive values for our probability density function.

Example 3.1. Let y = b₀ + b₁x, where x, b₀, and b₁ are scalars. Then
\[ x = \frac{y - b_0}{b_1} \quad \text{and} \quad \frac{dy}{dx} = b_1. \]
Therefore,
\[ f_y(y) = f_x\Big(\frac{y - b_0}{b_1}\Big)\, \frac{1}{|b_1|}. \]
□
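The formula in Example 3.1 is easy to verify numerically. In the sketch below (not from the text), x is taken to be standard normal, so f_y should be the N(b₀, b₁²) density; b₀, b₁, and the evaluation point are arbitrary.

```python
# Check f_y(y) = f_x((y - b0)/b1) / |b1| against the direct normal density.
import numpy as np
from scipy.stats import norm

b0, b1, y = 1.0, -2.0, 0.5
fy_formula = norm.pdf((y - b0) / b1) / abs(b1)
fy_direct = norm.pdf(y, loc=b0, scale=abs(b1))   # y ~ N(b0, b1^2)
print(np.isclose(fy_formula, fy_direct))         # True
```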


Figure 3.1: Change of Variables

3.2.2 Geometric Interpretation

Consider the graph of the relationship shown in Figure 3.1. We know that
\[ \Pr[h(b) > y > h(a)] = \Pr(b > x > a). \]
Also, we know that
\[ \Pr[h(b) > y > h(a)] \simeq f_y[h(b)]\,[h(b) - h(a)] \]
and
\[ \Pr(b > x > a) \simeq f_x(b)(b - a). \]
So,
\[ f_y[h(b)]\,[h(b) - h(a)] \simeq f_x(b)(b - a) \]
\[ f_y[h(b)] \simeq f_x(b)\, \frac{1}{[h(b) - h(a)]/(b - a)}. \qquad (3.11) \]
Now, as we let a → b, the denominator of (3.11) approaches h′(b). Substituting y = h(x) for h(b) and h⁻¹(y) = x for b yields the same formula as (3.10).

3.2.3 Multivariate

Let x be an m × 1 random vector with x ∼ f_x(x).


Define a one-to-one transformation
\[ y = h(x), \]
where y is also m × 1. Since h(·) is a one-to-one transformation, it has an inverse:
\[ x = h^{-1}(y). \]

We also assume that ∂h(x)/∂x′ exists. This is the m × m Jacobian matrix, where
\[ \frac{\partial h(x)}{\partial x'} = \begin{pmatrix} \frac{\partial h_1(x)}{\partial x_1} & \frac{\partial h_1(x)}{\partial x_2} & \cdots & \frac{\partial h_1(x)}{\partial x_m} \\ \frac{\partial h_2(x)}{\partial x_1} & \frac{\partial h_2(x)}{\partial x_2} & \cdots & \frac{\partial h_2(x)}{\partial x_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial h_m(x)}{\partial x_1} & \frac{\partial h_m(x)}{\partial x_2} & \cdots & \frac{\partial h_m(x)}{\partial x_m} \end{pmatrix} = J_h(x). \qquad (3.12) \]

Given this notation, the multivariate analog to (3.11) can be shown to be
\[ f_y(y) = f_x[h^{-1}(y)]\, \frac{1}{|\det(J_h[h^{-1}(y)])|}. \qquad (3.13) \]
Since h(·) is differentiable and one-to-one, we assume det(J_h[h⁻¹(y)]) ≠ 0.

Example 3.2. Let y = b + Bx, where y is an m × 1 vector and det(B) ≠ 0. Then
\[ x = B^{-1}(y - b) \quad \text{and} \quad \frac{\partial y}{\partial x'} = B = J_h(x). \]
Thus,
\[ f_y(y) = f_x\big(B^{-1}(y - b)\big)\, \frac{1}{|\det(B)|}. \]
□

3.3 Multivariate Normal Distribution

3.3.1 Spherical Normal Distribution

Definition 3.1. An m × 1 random vector z is said to be spherically normally distributed if
\[ f(z) = \frac{1}{(2\pi)^{m/2}}\, e^{-\frac{1}{2}z'z}. \]
□


Such a random vector can be seen to be a vector of independent standard normals. Let z₁, z₂, ..., z_m be i.i.d. random variables such that z_i ∼ N(0, 1); that is, z_i has the pdf given in (2.5), for i = 1, ..., m. Then, by independence, the joint distribution of the z_i's is given by
\[
\begin{aligned}
f(z_1, z_2, \ldots, z_m) &= f(z_1)f(z_2)\cdots f(z_m) \\
&= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z_i^2} \\
&= \frac{1}{(2\pi)^{m/2}}\, e^{-\frac{1}{2}\sum_{i=1}^{m} z_i^2} \\
&= \frac{1}{(2\pi)^{m/2}}\, e^{-\frac{1}{2}z'z}, \qquad (3.14)
\end{aligned}
\]
where z′ = (z₁, z₂, ..., z_m).

3.3.2 Multivariate Normal

Definition 3.2. The m × 1 random vector x with density
\[ f_x(x) = \frac{1}{(2\pi)^{m/2}[\det(\Sigma)]^{1/2}}\, e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)} \qquad (3.15) \]
is said to be distributed multivariate normal with mean vector µ and positive definite covariance matrix Σ. □

Such a distribution for x is denoted by x ∼ N(µ, Σ). The spherical normal distribution is seen to be a special case where µ = 0 and Σ = I_m.

There is a one-to-one relationship between the multivariate normal random vector and a spherical normal random vector. Let z be the m × 1 spherical normal random vector defined above, and let
\[ x = \mu + Az, \]
where A is m × m with det(A) ≠ 0. Then
\[ E[x] = \mu + AE[z] = \mu, \qquad (3.16) \]
since E[z] = 0. Also, we know that

\[ E[zz'] = E\begin{pmatrix} z_1^2 & z_1z_2 & \cdots & z_1z_m \\ z_2z_1 & z_2^2 & \cdots & z_2z_m \\ \vdots & \vdots & \ddots & \vdots \\ z_mz_1 & z_mz_2 & \cdots & z_m^2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = I_m, \qquad (3.17) \]
since E[z_i z_j] = 0 for all i ≠ j and E[z_i²] = 1 for all i. Therefore,
\[ E[(x - \mu)(x - \mu)'] = E[Azz'A'] = AE[zz']A' = AI_mA' = \Sigma, \qquad (3.18) \]


where Σ is a positive definite matrix (since det(A) ≠ 0).

Next, we need to find the probability density function f_x(x) of x. We know that
\[ z = A^{-1}(x - \mu), \qquad z' = (x - \mu)'A^{-1\prime}, \qquad J_h(z) = A, \]
so we use (3.13) to get
\[
\begin{aligned}
f_x(x) &= f_z[A^{-1}(x - \mu)]\, \frac{1}{|\det(A)|} \\
&= \frac{1}{(2\pi)^{m/2}}\, e^{-\frac{1}{2}(x-\mu)'A^{-1\prime}A^{-1}(x-\mu)}\, \frac{1}{|\det(A)|} \\
&= \frac{1}{(2\pi)^{m/2}}\, e^{-\frac{1}{2}(x-\mu)'(AA')^{-1}(x-\mu)}\, \frac{1}{|\det(A)|}, \qquad (3.19)
\end{aligned}
\]
where we use the results (ABC)⁻¹ = C⁻¹B⁻¹A⁻¹ and A′⁻¹ = A⁻¹′. However, Σ = AA′, so det(Σ) = det(A)·det(A′), and |det(A)| = [det(Σ)]^{1/2}. Thus we can rewrite (3.19) as
\[ f_x(x) = \frac{1}{(2\pi)^{m/2}[\det(\Sigma)]^{1/2}}\, e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)} \qquad (3.20) \]
and we see that x ∼ N(µ, Σ) with mean vector µ and covariance matrix Σ. Since this development is completely reversible, the relationship is one-to-one.

The multivariate normal distribution has a number of useful special properties, which will be stated without proof: (i) The conditional distribution of a subvector, say x₁, given the remainder x₂, is normally distributed; specifically, x₁ | x₂ ∼ N(µ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − µ₂), Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ′₁₂). (ii) The unconditional distribution of any subvector, say x₁, is multivariate normal, so x₁ ∼ N(µ₁, Σ₁₁). (iii) Consequently, the elements of any subvector, say x₁, are jointly normal and independent of the remainder x₂ if and only if Σ₁₂ = 0. This also means that independent normals are jointly normal with zero covariance.
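The construction x = µ + Az also gives a practical recipe for generating multivariate normal draws: take A to be the Cholesky factor of Σ, so that AA′ = Σ. The sketch below (not from the text) checks the sample mean and covariance of such draws; µ and Σ are arbitrary.

```python
# Generate N(mu, Sigma) draws from spherical normal z via x = mu + A z.
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

A = np.linalg.cholesky(Sigma)                  # A A' = Sigma
z = rng.standard_normal(size=(100_000, 2))     # spherical normal draws
x = mu + z @ A.T                               # each row is mu + A z

print(x.mean(axis=0))                          # close to mu
print(np.cov(x, rowvar=False))                 # close to Sigma
```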

3.3.3 Linear Transformations

Theorem 3.1. Suppose x ∼ N(µ, Σ) with det(Σ) ≠ 0 and y = b + Bx with B square and det(B) ≠ 0; then y ∼ N(µ_y, Σ_y) for µ_y = E[y] = b + Bµ and Σ_y = E[(y − µ_y)(y − µ_y)′] = BΣB′. □

Proof: To find the probability density function f_y(y) of y, we again use (3.13), which gives us
\[
\begin{aligned}
f_y(y) &= f_x[B^{-1}(y - b)]\, \frac{1}{|\det(B)|} \\
&= \frac{1}{(2\pi)^{m/2}[\det(\Sigma)]^{1/2}}\, e^{-\frac{1}{2}[B^{-1}(y-b)-\mu]'\Sigma^{-1}[B^{-1}(y-b)-\mu]}\, \frac{1}{|\det(B)|} \\
&= \frac{1}{(2\pi)^{m/2}[\det(B\Sigma B')]^{1/2}}\, e^{-\frac{1}{2}(y-b-B\mu)'(B\Sigma B')^{-1}(y-b-B\mu)} \\
&= \frac{1}{(2\pi)^{m/2}[\det(\Sigma_y)]^{1/2}}\, e^{-\frac{1}{2}(y-\mu_y)'\Sigma_y^{-1}(y-\mu_y)}. \qquad (3.21)
\end{aligned}
\]
So y ∼ N(µ_y, Σ_y). □

Thus we see, as asserted in the previous chapter, that linear transformations of multivariate normal random variables are also multivariate normal random variables. And any linear combination of independent normals will also be normal since, by (iii) in the previous subsection, independent normals are also jointly normal.

3.3.4 Quadratic Forms

Theorem 3.2. Let x ∼ N(µ, Σ), where det(Σ) ≠ 0; then (x − µ)′Σ⁻¹(x − µ) ∼ χ²_m. □

Proof: Let Σ = PP′, which is possible since Σ is positive definite. Then
\[ (x - \mu) \sim N(0, \Sigma) \]
and
\[ z = P^{-1}(x - \mu) \sim N(0, I_m). \]
Therefore,
\[
\begin{aligned}
z'z = \sum_{i=1}^{m} z_i^2 &= [P^{-1}(x-\mu)]'[P^{-1}(x-\mu)] \\
&= (x-\mu)'P^{-1\prime}P^{-1}(x-\mu) \\
&= (x-\mu)'\Sigma^{-1}(x-\mu) \sim \chi^2_m. \qquad (3.22)
\end{aligned}
\]
□

Σ⁻¹ is called the weight matrix. Below, we will use the sample mean and the χ²_m distribution to make inferences about the vector mean µ.
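Theorem 3.2 is straightforward to check by simulation; in the sketch below (not from the text), the mean and variance of the quadratic form are compared with the χ²_m values m and 2m, with µ and Σ arbitrary.

```python
# Monte Carlo check that (x - mu)' Sigma^{-1} (x - mu) is chi-squared with m df.
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.5, -1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
m, reps = len(mu), 200_000

x = rng.multivariate_normal(mu, Sigma, size=reps)
d = x - mu
q = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)   # quadratic forms

print(q.mean(), q.var())             # close to m = 3 and 2m = 6
```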

3.3.5 Singular Multivariate Normal

On occasion we will encounter an m × 1 vector of random variables x with mean vector µ that is jointly normal but has a singular covariance matrix Σ. Suppose Σ has rank m₁; then, without loss of generality, we can reorder the columns (and rows) of Σ, and correspondingly the elements of x, so the first m₁ are linearly independent. Thus the last m₂ columns (rows) of Σ are linearly dependent on the first m₁ and we can write

\[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{11}B' \\ B\Sigma_{11} & B\Sigma_{11}B' \end{pmatrix} = \begin{pmatrix} I_{m_1} \\ B \end{pmatrix} \Sigma_{11} \begin{pmatrix} I_{m_1} & B' \end{pmatrix} \]
for some m₂ × m₁ matrix B, where Σ₁₁ is m₁ × m₁ and nonsingular. Define, conformably, x′ = (x₁′, x₂′) and µ′ = (µ₁′, µ₂′). This singularity implies that (x₂ − µ₂) = B(x₁ − µ₁), since v = (x₂ − µ₂) − B(x₁ − µ₁), which must also be multivariate normal, has covariance matrix 0 and hence must be identically zero.

Consider a square matrix A; then A⁺ is a generalized inverse if AA⁺A = A. Such generalized inverses are not unique when A is singular, so we typically append additional restrictions to obtain a unique solution, such as the Moore-Penrose pseudo-inverse. For the problem at hand, it is easily verified that

\[ \Sigma^{+} = \begin{pmatrix} I_{m_1} \\ 0 \end{pmatrix} \Sigma_{11}^{-1} \begin{pmatrix} I_{m_1} & 0 \end{pmatrix} = \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} \]
is a generalized inverse of Σ. Now consider the quadratic form of the previous section but with the generalized inverse used as the weight matrix:

\[ (x-\mu)'\Sigma^{+}(x-\mu) = \begin{pmatrix} (x_1-\mu_1)' & (x_2-\mu_2)' \end{pmatrix} \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix} = (x_1-\mu_1)'\Sigma_{11}^{-1}(x_1-\mu_1) \sim \chi^2_{\mathrm{rank}(\Sigma)}. \qquad (3.23) \]
Thus weighting by the generalized inverse reduces the quadratic form to a quadratic form involving the linearly independent sub-vector x₁. The value of the quadratic form will be the same regardless of which generalized inverse we choose. Note that (3.22), the result of the previous section where Σ is nonsingular with rank m, is a special case of (3.23).

Writing the probability density function of a singular normal takes some care. The joint density of x₁, which is multivariate normal with nonsingular covariance, is
\[ f_{x_1}(x_1) = \frac{1}{(2\pi)^{m_1/2}[\det(\Sigma_{11})]^{1/2}}\, e^{-\frac{1}{2}(x_1-\mu_1)'\Sigma_{11}^{-1}(x_1-\mu_1)}. \]
For x satisfying B(x₁ − µ₁) − (x₂ − µ₂) = 0, the joint probability density function becomes
\[ f_x(x) = \frac{1}{(2\pi)^{m_1/2}[\det{}^{+}(\Sigma)]^{1/2}}\, e^{-\frac{1}{2}(x-\mu)'\Sigma^{+}(x-\mu)} \]
where det⁺(Σ) = det(Σ₁₁) is the pseudo-determinant. For x not satisfying this restriction the density is zero. The joint density does not have support everywhere in the m-space of x, but only on an m₁-dimensional manifold. Typically, we will dispense with the linearly dependent variables x₂ and focus on the linearly independent variables x₁. Using the generalized inverse does this automatically with the quadratic forms.


3.4 Normality and the Sample Mean

3.4.1 Moments of the Sample Mean

Consider m × 1 vectors x_i ∼ i.i.d. jointly, with m × 1 mean vector E[x_i] = µ and m × m covariance matrix E[(x_i − µ)(x_i − µ)′] = Σ. Define x̄_n = (1/n)\sum_{i=1}^{n} x_i as the vector sample mean, which is also the vector of scalar sample means. The mean of the vector sample mean follows directly:
\[ E[\bar{x}_n] = \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \mu. \]
Alternatively, this result can be obtained by applying the scalar results element by element. The second moment matrix of the vector sample mean is given by
\[
\begin{aligned}
E[(\bar{x}_n - \mu)(\bar{x}_n - \mu)'] &= \frac{1}{n^2} E\Big[\Big(\sum_{i=1}^{n} x_i - n\mu\Big)\Big(\sum_{i=1}^{n} x_i - n\mu\Big)'\Big] \\
&= \frac{1}{n^2} E[\{(x_1-\mu) + (x_2-\mu) + \cdots + (x_n-\mu)\} \\
&\qquad\quad \cdot \{(x_1-\mu) + (x_2-\mu) + \cdots + (x_n-\mu)\}'] \\
&= \frac{1}{n^2}\, n\Sigma = \frac{1}{n}\Sigma,
\end{aligned}
\]
since the covariances between different observations are zero.

3.4.2 Distribution of the Sample Mean

Suppose xi ∼ i.i.d.N(µ,Σ) then they are all jointly normal with zero covariancebetween observations. It follows from joint multivariate normality that xn mustalso be multivariate normal since it is a linear transformation. Specifically, wehave

xn ∼ N(µ,1

nΣ)

or equivalently

xn − µ ∼ N(0,1

nΣ)

√n(xn − µ) ∼ N(0,Σ)

√nΣ−1/2(xn − µ) ∼ N(0, Im)

where Σ = Σ1/2Σ1/2′ and Σ−1/2 = (Σ1/2)−1 for Σ nonsingular. This transfor-mation is the multivariate generalization of the “z-transformation” for scalars.

In order to make a probability statement about this vector, we must considera scalar transformation. Using the results from above, we find that the quadratic

Page 38: What Is Econometrics - Rice University

38 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS

form

n · (xn − µ)′Σ−1(xn − µ) = n · (xn − µ)′(Σ1/2Σ1/2′)−1(xn − µ)

= [n · (xn − µ)′Σ−1/2′Σ−1/2(xn − µ)

= [√nΣ−1/2(xn − µ)]′[

√nΣ−1/2(xn − µ)]

∼ χ2m.

Under a null hypothesis regarding all the means, say H0 : µ = µ0 , we haven · (xn −µ0)′Σ−1(xn −µ0) ∼ χ2

m. Under an alternative we will have a shift ornoncentrality which is studied below. This quadratic form approach is centralto testing hypothesis involving multiple parameters.

3.4.3 Multivariate Central Limit Theorem

Theorem 3.3. Suppose that (i) xi ∼ i.i.d jointly, (ii) E[xi] = µ, and (iii)E[(xi − µ)(xi − µ)

′] = Σ, then

√n(xn − µ)→d N(0,Σ) �.

Equivalently, we can write, analogous to the scalar z-transformation,

z =√nΣ−1/2(xn − µ)→d N(0, Im)

for Σ = Σ1/2Σ1/2′. These results apply even if the original underlying distri-bution is not normal and follow directly from the scalar results applied to anylinear combination of xn.

Correspondingly, under the assumptions of the theorem, the following quadraticform

n · (xn − µ)′Σ−1(xn − µ)→d χ2m.

This quadratic form is convenient for testing an hypothesis that specifies thevalue of all the means, i.e. H0 : µ = µ0. If

√n(xn − µ) →d N(0,Σ) and Σ

is singular, then we can use the generalized inverse as the weight matrix in thequadratic form and the limiting distribution will be χ2

rank(Σ).

3.5 Noncentral Distributions

When we conduct an hypothesis test we have a stated null hypotheis H0 and a,at least implicitly, stated alternative hypothesis H1. As a matter of convention,we carefully establish the behavior of our test statistic under the null, such asthe standard normal, for example. The fact is that it is quite easy to developa statistic with a known distribution and hence known size, which is the prob-ability of rejecting under the null. What is interesting is what happens to thestatistic when the alternative applies. Hopefully, the behavior of the test underthe alternative is different enough to to suggest the null is incorrect. This means

Page 39: What Is Econometrics - Rice University

3.5. NONCENTRAL DISTRIBUTIONS 39

we need to establish the power of a test, which is the probability of rejectingunder the alternative. We choose between equal sized tests on the basis of whichhas the most power. And we prefer consistent tests whose power improves withsample size and reject false nulls with probability one in the limit.

3.5.1 Noncentral Scalar Normal

Definition 3.3. Suppose x ∼ N(µ, σ2), then, for µ∗ = µ/σ,

z∗ =x

σ∼ N(µ∗, 1) (3.24)

has a noncentral normal distribution with noncentrality parameter µ∗. �

Example 3.3. Suppose xi v i.i.d. N(µ, σ2) for σ2 known and i = 1, 2, ..., n.When we do a hypothesis test of the mean, with known variance, we have, underthe null hypothesis H0 : µ = µ0,

Tn =xn − µ0√σ2/n

∼ N(0, 1) (3.25)

and, under the alternative H1 : µ = µ1 > µ0,

Tn =xn − µ0√σ2/n

=xn − µ1√σ2/n

+µ1 − µ0√σ2/n

= N(0, 1) +µ1 − µ0√σ2/n

∼ N

(µ1 − µ0√σ2/n

, 1

). (3.26)

Thus, the behavior of Tn under the the null hypothesis follows the standardnormal distribution but under the alternative hypothesis follows a noncentralnormal distribution. Choosing a size α = .05 and corresponding critical valuec.05 = 1.645 for a one-sided test, we have .05 chance of obtaining a realizationof Z that exceeds the critical value under the null but an increased chanceunder the alternative since the distribution is shifted right by the noncentralityµ∗ = (µ1 − µ0)/

√σ2/n. This situation is depicted graphically in Figure 3.2.

Note how the noncentrality is increasing at the rate√n so the test will be

consistent and reject a false null with certainty in large samples. �As this example makes clear, the noncentral normal distribution is especially

useful when carefully exploring the behavior of the alternative hypothesis. Italso demonstrates the choices we face in a standard one-sided hyposthesis test.If we obtain a large realization of the statistic that exceeds the chosen criticalvalue, it could result from a rare event under the null hyposthesis or a typicalevent under the alternative. How rare and how typical depends on the distri-bution, the choice of size α, and the noncentrality under the alternative. Weusually choose α small enough that we prefer to reject the null hypothesis be-cause the large value is so unlikely. But whether or not it is typical under thealternative depends on the noncentrality and the distribution that applies underthe alternative.

Page 40: What Is Econometrics - Rice University

40 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS

-5 -4 -3 -2 -1 0 * 3 4 5

z

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

f(z)

Pr[ reject | H1 ]

Pr[ reject | H0 ]

c.05

f(z | H0)

f(z | H1)

Figure 3.2: Shift with known variance

3.5.2 Noncentral t

Definition 3.4. Suppose z∗∼N(µ∗, 1), w ∼ χ2k, and let z∗ and w be indepen-

dent, then

T =z∗√w/k

∼ tk(µ∗) (3.27)

has a noncentral t distribution with noncentrality parameter µ∗. �

The noncentral t distribution is used in calculating the power of tests of themean, when the variance is unknown and must be estimated.

The density function of the noncentral t is quite complicated and will not bepresented here but its’ salient properties will be discussed. Unlike the noncentralnormal, the noncentral t is more than a central t, which has zero mean, shiftedby µ∗. Let z = z∗ − µ∗, then, as before, we can write

z∗√w/k

=z√w/k

+µ∗√w/k

.

The distribution after the shift is now a rather complicated function of both µ∗

and k. The first term is central t but the second or shift term is now stochasticso the distribution involves more than a shift by the constant µ∗.

Specifically, the distribution of the second term is bell-shaped but skewed tothe right so the sum is also asymmetric with right-hand tail heavier than the leftunless µ∗ = 0. Consequently, the median exceeds µ∗ with the positive differenceconverging toward zero as k becomes large. In this sense, the noncentral tdistribution is more right-shifted than the noncentral normal, which has medianµ∗. For small values of k, however, both the first and second terms will havefatter distributions as will their sum so the net shift in probability to the right ofa critical value will not increase. As k grows large, the distribution becomes moresymmetric about µ∗ and the distribution converges to the noncentral normalsince w/k →p 1.

Page 41: What Is Econometrics - Rice University

3.5. NONCENTRAL DISTRIBUTIONS 41

Example 3.4. Consider the Example 3.3 above but we don’t know σ2 and usethe unbiased estimator s2=

∑ni=1(xi − xn)2/(n− 1). Now from the last chapter

we have

Y = (n− 1)s2

σ2∼ X 2

n−1

independent of xn while from above

Z =xn − µ0√σ2/n

∼ N

(µ1 − µ0√σ2/n

, 1

),

thus under the alternative hypothesis

Tn =Z√

Y/(n− 1)

=(xn − µ0)/

√σ2/n√

(n− 1) s2

σ2 /(n− 1)

=xn − µ0√s2/n

∼ tn−1

(µ1 − µ0√σ2/n

).

Note that the alternative distribution uses σ2 rather than s2 in the noncentralityparameter, which increases at the rate

√n. Geometrically, as depicted in Figure

3.3, the situation is much as before in the case of known variance σ2 except weare now dealing with a central t-distribution under the null and a noncentralt under the alternative. Consequently, both distributions are fatter and thealternative distribution is skewed slightly to the right. The critical value for thet-distribution with 10 degrees of freedom is 1.812 as opposed to 1.645 for thenormal. �

-5 -4 -3 -2 -1 0 * 3 4 5

z

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

f(z)

Pr[ reject | H1 ]

Pr[ reject | H0 ]

c.05

f(z | H1)

f(z | H0)

Figure 3.3: Shift with estimated variance, d.f.=10

It is instructive to examine the power of a two-sided 5% test implied forthis example for various values of the noncentrality as given in Figure 3.4. Letd = (µ1−µ0)/σ then the noncentrality is

√nd and we can plot the probablility

Page 42: What Is Econometrics - Rice University

42 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS

of exceeding the critical value τ.025,n−1 in absolute value for various choices of dand n. Notice for each choice of n the curves have a minimum around d = 0 atthe ostensible size under the null of .05. And they asymptote to 1 as d divergesfrom zero in the positive or negative direction. This is the typical shape ofthe power curve for a two-sided test. Also note that the curves become morecompact around zero as the sample size n increases. In the limit it will approacha spike at zero which reflects the fact that the test is consistent.

-3 -2 -1 0 1 2 3

d = (1 -

0)/

0

0.2

0.4

0.6

0.8

1

1.2

Pr(

abs(T

n)>

d)

Power: Sample Mean, 5% Size

n=5

n=10

n=25

n=100

Figure 3.4: Power of t-test

3.5.3 Noncentral Chi-Squared

Definition 3.5. Suppose z∗∼N(µ∗, Im), then

V = z∗’z∗ ∼ X 2m(δ), (3.28)

has a noncentral chi-squared distribution, where δ = µ∗′µ∗ is the noncentralityparameter. �

As with the noncentral t, the noncentral chi-squared is more than a centralchi-squared shifted by a constant. Let z = z∗ − µ∗, then

z∗’z∗ = (z + µ∗)′(z + µ∗) = z′z + 2z′µ∗ + µ∗′µ∗.

The first term is a central chi-squared and the last is the noncentality parameterδ which shifts the distribution to the right, but the middle term is N(0, 4δ) andcan shift the distribution to the right or left with the sum restricted to be

Page 43: What Is Econometrics - Rice University

3.5. NONCENTRAL DISTRIBUTIONS 43

non-negative. The density is given by

fV (v; k, δ) =

∞∑i=0

e−δ/2 (δ/2)i

i!fYk+2i

(v),

where fYq (·) is the density of a central chi-square with q degress of freedom.The weights in the summation are the weights of a Poisson distribution withmean and variance δ/2. For a given k, increases in δ will shift more and moreof the distribution to the right as the larger weights of the Poisson, which hasmode at floor[δ/2], shift to the right. The mean of v is k + δ and the varianceis 2(k + 2δ).

Example 3.5. Consider a test of the means for xi ∼ i.i.d.N(µ,Σ) for Σ non-singular. For H0 : µ = µ0, we have

n · (xn − µ0)′Σ−1(xn − µ0) ∼ X 2m (3.29)

Let z∗=P−1√n(xn−µ0) for Σ = PP′, then for H1 : µ = µ1 6= µ0, we have

z∗ ∼ N(√nP−1(µ1−µ0), Im)

and

n · (xn − µ0)′Σ−1(xn − µ0) =√n(xn − µ0)′P−1′P−1

√n(xn − µ0)

= z∗′z∗

∼ X 2m[n · (µ1 − µ0)′Σ−1(µ1 − µ0)]. (3.30)

Note that the noncentrality is always non-negative so the probability mass ofthe distribution under the alternative is shifted to the right. Consequently, wefocus only on the right-hand side tail of the null distribution. Moreover, thenoncentrality will increase at the rate n, so the test will be consistent. �

3.5.4 Noncentral F

Definition 3.6. Let Y ∼ χ2m(δ), W ∼χ2

n, and let Y and W be independentrandom variables. Then

Y/m

W/n∼ Fm,n(δ), (3.31)

has a noncentral F distribution, where δ > 0 is the noncentrality parameter. �

The density is given by

fFm.n.δ(x) =

∞∑i=0

e−δ/2 (δ/2)i

i!

[Γ(

2i+m+n2

)(m/n)(2i+m)/2

Γ(

2i+m2

)Γ(n2

) x(2i+m/2)−1

(1 +mx/n)(2i+m+n)/2

],

which simplifies to the central Fm,n for δ = 0. Notice that the summation hasthe same Poisson weights as appeared in the noncentral chi-squared density. For

Page 44: What Is Econometrics - Rice University

44 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS

i = 0, the term in brackets in the summation is the density of the central Fm,n,so this density reduces to the central when δ = 0. For i > 0 the additional termsshift the density to the right when δ > 0.

The noncentral F distribution is used in tests of mean vectors, where thescale of the covariance matrix is unknown and must be estimated. As withthe other noncentral distributions the noncentrality parameter determines theamount the distribution is shifted (to the right in this case) when nonzero.Such a shift would arise under an alternative hypothesis as with the chi-squaredexample above. Examples for both the central and noncentral F distributionare deferred until Chapter 8.

Page 45: What Is Econometrics - Rice University

Chapter 4

Asymptotic Theory

In Chapter 2 we found, via the central limit theorem that the behavior of anaverage, under very general assumptions and when properly normalized, tendstoward the standard normal distribution even when the underlying distributionis non-normal. In Chapter 3, we extended this result to multivariate averages.These are both examples of depending on the large-sample limiting or asymp-totic behavior of the average to simplify our analysis. As we develop appropriatetools for more complicated models in later chapters we will encounter numerouscases of estimators that, as a result of non-normality of the underlying vari-ables or nonlinearity with respect these variables, have distributions sufficientlycomplicated to render finite sample inference infeasible. In most of these cases,however, the large-sample limiting or asymptotic behavior is tractable. Accord-ingly, in this chapter, we develop a number of asymptotically appropriate toolsthat will be useful for these cases.

4.1 Convergence Of Random Variables

4.1.1 Sequences Of Random Variables

In the following we will consider the behavior of an estimator or statistic, basedon a sample of size n, as the sample size grows large. Let yn denote a realizationof this statistic. This realization is a random draw from the underlying distribu-tion of yn which may well change as the sample size n increases. For the variousvalues of n, these draws form a sequence of random variables {y1, y2, ..., yn, ...}.An example of such a sequence is the sample average yn = xn = 1

n

∑ni=1 xi.

Each realization from this sequence is a random draw and cannot be known apriori, however, probability measures based on the distribution of each realiza-tion are real numbers and hopefully exhibit regularity and even simplicity asthe sample size increases. Since such probability measures map each draw inthe sequence onto the real line, they form a sequence of real numbers, whosebehavior we now examine.

45

Page 46: What Is Econometrics - Rice University

46 CHAPTER 4. ASYMPTOTIC THEORY

4.1.2 Convergence and Limits

We first examine the familiar concept of the limit from calculus.

Definition 4.1. A sequence of real numbers a1, a2, . . . , an, . . ., is said to beconvergent to or have a limit of α if for every δ > 0, there exists a positive realnumber N such that for all n > N , | an − α | < δ. This is denoted as

limn→∞

an = α . �

Example 4.1. Let { an } = 3 + 1/n, then clearly

limn→∞

an = 3,

since |an − 3| = |1/n| can be made arbitrarily small for sufficiently large n. �

We frequently encounter sequences of functions of variables whose value atconvergence depends on the value of the variables. For example, the sequence offunctions { an(x) } = x(3 + 1/n) converges to α(x) = 3x for any given x. Here,the value of n needed to make | an(x) − α(x) | = |x/n| < δ will depend on x.This dependence of the limiting process on the point of evaluation necessitatesthe introduction of the following uniformity definition

Definition 4.2. A sequence of functions {fn}, n = 1, 2, 3, ..., is said to beuniformly convergent to the function f for a set E of values of x if, for eachδ > 0, an integer N can be found such that |fn(x)− f(x)| < δ for n > N andx ε E.

Notationally, this means that we can define an = supx |fn(x)− f(x)| and α = 0and apply the standard definition of convergence to choose δ and the corre-sponding n. For the simple example { an(x) } = x(3 + 1/n) we define αn =supx ε E |x/n| = supx ε E |x| /n which will be finite for bounded x so E cansimply be a closed interval on the real line.

4.1.3 Orders of Sequences

We now introduce the notion of orders of a sequence, which will play an impor-tant role in the sequel. A sequence of interest may not have a limit and, in fact,may be divergent but a simple transformation may yield a limit and therebyyield insight into the rate at which the sequence diverges.

Definition 4.3. A sequence of real numbers { an } is said to be of order at mostnk, and we write { an } is O(nk), if

limn→∞

1

nkan = c,

where c is any real constant. �

Page 47: What Is Econometrics - Rice University

4.1. CONVERGENCE OF RANDOM VARIABLES 47

Example 4.2. Let { an } = 3 + 1/n, and { bn } = 4 − n2. Then, { an } isO(1) = O(n0), since, from the previous example,

limn→∞

an = 3

and { bn } is O(n2), since limn→∞1n2 bn=-1. �

Definition 4.4. A sequence of real numbers { an } is said to be of order smallerthan nk, and we write { an } is o(nk), if

limn→∞

1

nkan = 0. �

Example 4.3. Let { an } = 6/n. Then, { an } is o(1), since limn→∞1n0 an=0.

More informatively, we see that { an } is also O(n−1). Note that negative valuesof the exponent are possible, which indicates the rate at which the sequenceconverges to its’ limit point. �

The order of a sequence, like its’ limit, is a simplification relative to thesequence itself but more informative than the limit since it reveals the rate atwhich the divergence or convergence occurs. There is an algebra for orders ofsequences that will be useful later. Suppose an = O(nk) and bn = O(n`), thenit is straightforward to show that cn = an · bn = O(nk) · O(n`) = O(nk+`) andcn = an + bn = O(nk) + O(n`) = O(nmax(k,`)). Similarly, for an = o(nk) andbn = o(n`), then cn = an · bn = o(nk) · o(n`) = o(nk+`) and cn = an + bn =o(nk) + o(n`) = o(nmax(k,`)). Finally, if an = O(nk) and bn = o(n`), thencn = an · bn = O(nk) · o(n`) = o(nk+`) however cn = an + bn = O(nk) + o(n`) =1(l > k) · o(nl) + 1(l ≤ k) ·O(nk), where 1(·) is the indicator function.

4.1.4 Convergence In Probability

We now turn to the large-sample limiting behavior of sequences of randomvariables such as the average. A probability measure on such a random sequencemaps each realization onto the real line and forms a sequence of real numberssuch as those above. We first look at a sequence of random variables whosedistribution is collapsing about some limit point.

Definition 4.5. A sequence of random variables y1, y2, . . . , yn, . . ., with distri-bution functions F1(·), F2(·), . . . , Fn(·), . . ., is said to converge (weakly) in prob-ability to some constant c if

limn→∞

Pr[ | yn − c | > ε ] = 0 (4.1)

for every real number ε > 0. �

Weak convergence in probability is denoted by

plimn→∞

yn = c,

Page 48: What Is Econometrics - Rice University

48 CHAPTER 4. ASYMPTOTIC THEORY

or sometimes,

ynp−→ c,

oryn →p c.

This definition is equivalent to saying that we have a sequence of tail proba-bilities (of being greater than c+ε or less than c−ε), and that the tail probabili-ties approach 0 as n→∞, regardless of how small ε is chosen. Equivalently, theprobability mass of the distribution of yn is collapsing about the point c. Theleading example of such behavior is given by the laws of large numbers whichare discussed in a subsequent subsection.

Definition 4.6. A sequence of random variables y1, y2, . . . , yn, . . ., is said toconverge strongly in probability to some constant c if

Pr[ limn→∞

yn = c] = 1. � (4.2)

Strong convergence is also called almost sure convergence, which is denoted

yna.s.−→ c,

oryn →a.s. c.

Another obvious terminology is to say that yn converges to c with probabilityone.

This definition would seem to involve a probability statement about a non-probabilistic sequence of reals with the limit statement. We can think about allthe possible realizations of the sequence of random variables. Each realizationhas a probability measure attached to it. We can place the possible outcomesinto the set of those that converge to a limit and another set of those that donot. Strong convergence in probability states that the total probability measureof the set of realizations that do not possess a limit is zero.

The relationship between weak and strong convergence is best seen by com-parison to an alternative and equivalent (but not obviously) definition for strongconvergence. Suppose

limN→∞

Pr[ supn>N

| yn − c | > ε ] = 0 (4.3)

for any ε > 0. Consider the equivalent statement

limN→∞

Pr[ supn>N

| yn − c | < ε ] = 1

for any ε > 0, which, using the definition of a limit, can be rewritten as∣∣∣∣Pr[ supn>N

| yn − c | < ε ]− 1

∣∣∣∣ < δ

Page 49: What Is Econometrics - Rice University

4.1. CONVERGENCE OF RANDOM VARIABLES 49

for any δ > 0 and N > N∗ for N∗ sufficiently large. Now for any ε > 0 and Nsufficiently large, the condition within the Pr for this formulation can, by thedefinition of a limit, also be written limn→∞ yn = c. Thus for N sufficientlylarge and any δ > 0 we have∣∣∣Pr[ lim

n→∞yn = c]− 1

∣∣∣ < δ,

which implies (4.2). Since these steps can be reversed, then (4.3) implies and isimplied by the definition of strong convergence (4.2).

Comparison of the weak convergence condition (4.1) with the alternativestrong convergence condition (4.3), reveals that, within the limit, weak conver-gence concerns itself with the probability of a single realization yn meeting thecondition | yn − c | < ε while strong convergence involves the joint probabil-ity of yn and all subsequent realizations meeting the condition. Obviously, ifa sequence of random variables converges strongly in probability, it convergesweakly in probability. Equally obviously, the reverse is not necessarily true.Most of our attention in the sequel will focus on weak convergence so when asequence is said to converge in probability it can be taken as weak convergenceunless otherwise noted.

Definition 4.7. A sequence of random variables y1, y2, . . . , yn, . . ., is said toconverge in quadratic mean if

limn→∞

E[ yn ] = c

andlimn→∞

Var[ yn ] = 0. �

For a random variable x with mean µ and variance σ2, Chebyshev’s inequal-ity states Pr(|x− µ| ≥ kσ) ≤ 1

k2 for k > 0. For arbitrary ε > 0, let µn and σ2n

denote the mean and variance of yn, and for n sufficiently large that |c− µn| < ε,consider

Pr[ | yn − c | > ε ] = 1− Pr[ c− ε < yn < c+ ε ]

= 1− Pr[ (c− µn)− ε < yn − µn < (c− µn) + ε ]

≤ 1− Pr[ |c− µn| − ε < yn − µn < − |c− µn|+ ε ], since the interval is narrower

= 1− Pr[−(an + ε) < yn − µn < an + ε ], where an = − |c− µn|= Pr[ | yn − µn | > an + ε ]

= Pr[ | yn − µn | >an + ε

σnσn ]

(1

an+εσn

)2

=σ2n

(an + ε)2→ 0

by Chebyshev’s inequality. Thus convergence in quadratic mean implies conver-gence in probability. The reverse is not necessarily true. There is no directionof implication between strong convergence and convergence in quadratic mean.

Page 50: What Is Econometrics - Rice University

50 CHAPTER 4. ASYMPTOTIC THEORY

The random sequences we consider will sometimes be a function of a variable(and/or parameter) and the constants ε and δ will depend on the point ofevaluation of the function. Analogous to uniform convergence of a function weintroduce the following

Definition 4.8. A sequence of random functions {fn = fn(x)}, n = 1, 2, 3, ...,is said to be uniformly convergent in probability to the function f for a set Eof values of x if, for each ε > 0 and 1 > δ > 0, an integer N can be found suchthat Pr(|fn(x)− f(x)| > ε) < δ for n > N and x ε E. �

This definition differs from pointwise convergence in probability in that ε andδ do not depend on the value of x. In pointwise convergence these values candiffer depending on the point of evaluation for x. Uniform convergence willsometimes be denoted by

supx ε E

|fn(x)− f(x)| −→p 0.

which assures we are using the same constants ε and δ across the set E.

4.1.5 Orders In Probability

Beyond knowing that the distribution is collapsing around a limit point as thesample size grows large we may be interested in the rate at which it collapses.This is sometimes important in comparing estimators and also in determiningthe appropriate normalization for finding the limiting distribution in the nextsubsection.

Definition 4.9. Let y1, y2, . . . , yn, . . . be a sequence of random variables. Thissequence is said to be bounded in probability if for any 1 > δ > 0, there exist a∆ <∞ and some N sufficiently large such that

Pr(| yn | > ∆) < δ,

for all n > N . �

The value δ provides a bound for the probability content of the tail to theright of ∆ and the left of −∆ once we are far enough into the sequence. Theseconditions require that the tail behavior of the distributions of the sequencenot be pathological. Specifically, the tail mass of the distributions cannot bedrifting away into the tails we move out in the sequence. The leading caseoccurs with the central limit theorem discussed below which shows that thesample average, suitably normalized, has a limiting normal distribution, whichsatisfies the conditions for being bounded in probability.

Definition 4.10. The sequence of random variables { yn } is said to be oforder in probability at most nλ, and is denoted Op(n

λ), if n−λyn is bounded inprobability. �

Page 51: What Is Econometrics - Rice University

4.1. CONVERGENCE OF RANDOM VARIABLES 51

Example 4.4. Suppose z ∼ N(0, 1) and yn = 3 +n · z, then n−1yn = 3/n+ z isa bounded random variable since the first term is asymptotically negligible andwe see that yn = Op(n). �

Example 4.5. Suppose zi ∼ i.i.d. N(0, 1) for i = 1, 2, ..., n then yn =∑ni=1 z

2i ∼

χ2n, which has E[yn] = n and var[yn] = 2n. The sequence yn does not sat-

isfy the definition of bounded in probability since the probability mass is be-ing strung out to the right as n → ∞. However, by the univariate CLT√n(yn/n − 1) →d N(0, 2), which is bounded in probability and hence Op(1).

Thus (yn/n − 1) = Op(1/√n), (yn − n) = Op(

√n), and yn = (yn − n) + n =

Op(√n) +O(n) = Op(n). �

Definition 4.11. The sequence of random variables { yn } is said to be of order

in probability smaller than nλ, and is denoted op(nλ), if n−λyn

p−→ 0. �

Example 4.6. Convergence in probability can be represented in terms of order

in probability. Suppose that ynp−→ c or equivalently yn− c

p−→ 0, then n0(yn−c)

p−→ 0 and yn − c = op(1). �

As with the order of sequences of real numbers, there is an algebra forthe order in probability of sequences of random variables. Specifically, for therandom sequences yn = Op(n

k) and zn = O(n`), then wn = yn · zn = Op(nk) ·

Op(n`) = Op(n

k+`) and wn = yn + zn = Op(nk) + Op(n

`) = Op(nmax(k,`)).

Similarly, for zn = op(nk) and wn = op(n

`), then yn = zn ·wn = op(nk)·op(n`) =

op(nk+`) and yn = zn + wn = op(n

k) + op(n`) = op(n

max(k,`)). Finally, ifyn = Op(n

k) and zn = op(n`), then wn = zn · wn = Op(n

k) · op(n`) = op(nk+`)

and yn = zn + wn = Op(nk) + op(n

`) = 1(l > k) · op(nl) + 1(l ≤ k) · Op(nk).Moreover, for mixtures of sequences of reals and randoms, a similar algebraholds except the results are always orders in probability since the mixtures willbe random.

4.1.6 Convergence In Distribution

The previous sections have provided tools for describing the limiting point ofthe distributions of a random sequence and also the rate at which the sequenceis tending toward a bounded random variable. Inevitably, we also want to knowthe specific distribution toward which suitably transformed random sequence istending, if such is the case.

Definition 4.12. A sequence of random variables y1, y2, . . . , yn, . . ., with cu-mulative distribution functions F1(·), F2(·), . . . , Fn(·), . . ., is said to converge indistribution to a random variable y with a cumulative distribution functionF (y), if

limn→∞

Fn(·) = F (·),

for every point of continuity of F (·). The distribution F (·) is said to be thelimiting distribution of this sequence of random variables. �

Page 52: What Is Econometrics - Rice University

52 CHAPTER 4. ASYMPTOTIC THEORY

For notational convenience, we often write ynd−→ F (·) or yn →d F (·) if a

sequence of random variables converges in distribution to F (·). Note that themoments of elements of the sequence do not necessarily converge to the momentsof the limiting distribution. There are well-known examples where the limitingdistribution has moments of all orders but certain moments of the finite-sampledistribution do not exist.

Note that the pointwise convergence in distribution above allows for theconstant δ used in the limit to depend on the point of evaluation, say c, and(possibly) the values of the underlying parameters of the generating process,say θ. It will be sometimes useful to utilize cases where the same constant canbe utilized across the set of possible c and (possibly) θ. For simplicity,we ignorethe possible dependence on θ since the definition easily generalizes.

Definition 4.13. A sequence of random variables y1, y2, . . . , yn, . . ., with cumu-lative distribution functions F1(·), F2(·), . . . , Fn(·), . . ., is said to converge uni-formly in distribution to a random variable y with a cumulative distributionfunction F (y), if

supc ε C

|Fn(c)− F (c)| −→p 0,

for C being the set of feasible points c of evaluation for the function F (·). �

Uniform convergence implies pointwise convergence but the reverse is not gener-ally true. One important exception occurs when F (c) is continuous over compactC, whereupon pointwise convergence of Fn(c) to F (c) implies uniform conver-gence.

4.1.7 Some Useful Propositions

In the sequel, we will sometimes need to work with combinations of two ormore sequences or transformations of a sequence. Consequently, the followingpropositions will prove useful, where xn and yn denote two sequences of randomvectors.

Proposition 4.1. If xn − yn converges in probability to zero, and yn has alimiting distribution, then xn has a limiting distribution, which is the same. �

Proposition 4.2. If yn has a limiting distribution and plimn→∞

xn = 0, then for

zn = yn′xn, plim

n→∞zn = 0. �

Proposition 4.3. Suppose that yn converges in distribution to a random vari-able y, and plim

n→∞xn = c. Then xn

′yn converges in distribution to c′y. �

Proposition 4.4. If g(·) is a continuous function, and if xn−yn converges inprobability to zero, then g(xn)− g(yn) converges in probability to zero. �

Page 53: What Is Econometrics - Rice University

4.2. ESTIMATION THEORY 53

Proposition 4.5. If g(·) is a continuous function, and if xn converges in prob-ability to a constant c, then zn = g(xn) converges in distribution to the constantg(c). �

Proposition 4.6. If g(·) is a continuous function, and if xn converges in dis-tribution to a random variable x, then zn = g(xn) converges in distribution toa random variable g(x). �

The first two propositions follow from the third, which follows from the algebraof Op, and the fourth and fifth follow from the sixth, which is known as thecontinous mapping theorem.

4.2 Estimation Theory

4.2.1 Estimators

Typically, our interest centers on certain properties of the data that can becharacterized as parameters of the model used. An example is the marginalpropensity to consume in a model of aggregate consumption. Our econometricanalysis of this property usually begins with a point estimate of the parameterbased on the available data. The formula or function that was used to obtainthe point estimate is an estimator. The estimate is a realization drawn from thedistribution that follows when the estimator is applied to the sample distributionof the underlying variables of the model. Thus as the sample size is increased,the estimator forms a sequence of random variables. The simple example givenbefore was the sample mean xn as the sample size n is increased.

In the sequel, we will entertain various estimators that result from opti-mization of a function of the data with respect to parameters of interest. Suchestimators comprise the class of extremum estimators. Examples are: leastsquares, which finds a minimum; maximum likelihood, which finds the maxi-mum; and generalized method of moments (GMM), which finds a minimum. Ineach case, we obtain first order conditions for the objective function, in termsof the unknown parameters, and use the solution of these as estimators.

Without loss of generality, we will consider estimators that result from mini-mizing a function of the data. Denote this function by ψn(θ) =ψ(z1, z2, ...zn;θ),where z1, z2, ...zn are n observations on an m× 1 vector of random variables zand θ is the p× 1 vector of unknown parameters defined on a set Θ. Formally,our estimator is

θn = argminθ∈Θ

ψn(θ). (4.4)

Typically, this estimator will be obtained by setting the first-order conditions(FOC), namely the partial derivatives of the function to be minimized, to zero

0 =∂ψn(θn)

∂θ

Page 54: What Is Econometrics - Rice University

54 CHAPTER 4. ASYMPTOTIC THEORY

and solving for θn. In some cases such a solution is easy to obtain and in othersit is not. It is possible for the FOC to be nonlinear in θ and the resultingsolution to the FOC to be very complicated functions of the data.

In Appendix A to this chapter, we will develop results which establish thelarge-sample asymptotic properties of such estimators under very general con-ditions.

4.2.2 Properties Of Estimators

Definition 4.14. An estimator θn of the p×1 parameter vector θ is a functionof the sample observations z1, z2, ...zn. �

It follows that θ1, θ2, . . . , θn form a sequence of random variables.

Definition 4.15. The estimator θn is said to be unbiased if E θn = θ, for alln. �

Definition 4.16. The estimator θn is said to be asymptotically unbiased iflimn→∞

E θn = θ. �

Note that an estimator can be biased in finite samples, but asymptoticallyunbiased.

Definition 4.17. The estimator θn is said to be consistent if plimn→∞

θn = θ. �

Asymptotic unbiasedness implies that a measure of central tendency of theestimator, the mean, is the target value so in some sense the distribution ofthe estimator is centered around the target in large samples. Consistency im-plies that the distribution of the estimator is collapsing about the target as thesample size increases. Consistency neither implies nor is implied by asymptoticunbiasedness, as demonstrated by the following examples.

Example 4.7. Let

θn =

{θ, with probability 1− 1/nθ + nc, with probability 1/n

We have E θn = θ+ c, so θn is a biased estimator, and limn→∞ E θn = θ+ c, soθn is asymptotically biased as well. However, limn→∞ Pr(| θn − θ | > ε) = 0, so

θn is a consistent estimator. �

Example 4.8. Suppose xi ∼ i.i.d.N(µ, σ2) for i = 1, 2, ..., n, ... and let xn = xnbe an estimator of µ. Now E[xn] = µ so the estimator is unbiased but

Pr(|xn − µ| > 1.96σ) = .05

so the probability mass is not collapsing about the target point µ so the esti-mator is not consistent. �

Page 55: What Is Econometrics - Rice University

4.2. ESTIMATION THEORY 55

4.2.3 Laws Of Large Numbers And Central Limit Theo-rems

Most of the large sample properties of the estimators considered in the sequelderive from the fact that the estimators involve sample averages and that theasymptotic behavior of averages is well studied. In addition to the central limittheorems presented in the previous two chapters we have the following two lawsof large numbers:

Theorem 4.1. If x1, x2, . . . , xn is a simple random sample, that is, the xi’s arei.i.d., and Exi = µ and Var(xi) = σ2, then by Chebyshev’s Inequality,

plimn→∞

xn = µ �

Theorem 4.2. Suppose that x1, x2, . . . , xn are i.i.d. random variables, suchthat for all i = 1, . . . , n, Exi = µ, then,

plimn→∞

xn = µ �

Both of these results apply element-by-element to vectors of estimators. Notethat Khitchine’s Theorem does not require second moments for the underlyingrandom variable.

For sake of completeness we repeat the following scalar central limit theorem.

Theorem 4.3. Suppose that x1, x2, . . . , xn are i.i.d. random variables, such that

for all i = 1, . . . , n, Exi = µ and Var(xi) = σ2, then√n(xn − µ)

d−→ N(0, σ2),

or for any z limn→∞ fn(z) = 1√2πσ2

e1

2σ2z2 . �

This result is easily generalized to obtain the multivariate version given in The-orem 3.3.

Theorem 4.4. Suppose that m×1 random vectors x1,x2, . . . ,xn are (i) jointlyi.i.d., (ii) E xi = µ, and (iii) Cov(xi) = Σ, then

√n(xn − µ)

d−→ N(0,Σ) �

4.2.4 Asymptotic Linearity

The importance of the average and its’ asymptotic regularity as given by the lawsof large numbers and the central limit theorem arise because a large number ofestimators and most of those considered in the sequel are asymptotically linear.

Definition 4.18. An estimator is said to be asymptotically linear if, for zi i.i.d.,there exists v(·) such that

√n(θn − θ) =

1√n

n∑i=1

v(zi;θ) + op(1) �

Page 56: What Is Econometrics - Rice University

56 CHAPTER 4. ASYMPTOTIC THEORY

Typically, the form of v(zi;θ) arises from setting first-order conditions tozero that might arise from optimization of an objective function. Accordingly,we will have E[v(zi;θ)] = 0, and Cov(v(zi;θ)) = V at θ = θ0, whereupon wecan apply the central limit theorem to obtain

√n(θn − θ0)

d−→ N(0,V).

In order to obtain asymptotic linearity we will typically have to impose addi-tional assumptions on the data and v(·), as with the general consistency andasymptotic normality theorems given in the appendix to this chapter.

Example 4.9. Suppose xi ∼ i.i.d.N(µ, σ2) for i = 1, 2, ..., n and we obtainXn = 1

n

∑ni=1 xi, σ

2 = 1n

∑ni=1(xi − Xn)2, and form the feasible statistic

Tn =Xn − µ√σ2/n

=Xn − µ√σ2/n

√σ2/n√σ2/n

=Xn − µ√σ2/n

√σ2

σ2

=Xn − µ√σ2/n

+

(Xn − µ√σ2/n

)(√σ2

σ2− 1

).

Now the first term converges in distribution to a standard normal and is henceOp(1). It will be shown that σ2 →p σ

2, so the second term is the product ofOp(1) and op(1) factors and hence op(1). Thus,

Tn =Xn − µ√σ2/n

+ op(1)

=1√n

n∑i=1

xi − µσ

+ op(1)

and the statistic is seen to be asymptotically linear with the limiting behaviorgoverned by the central limit theorem applied to the normalized average ofv(zi;θ) = (xi − µ)/σ. �

4.2.5 CUAN And Efficiency

We often encounter more than one candidate estimator for a particular model.It behooves us to have some metric, which will motivate a choice between thealternatives. Such a comparison requires a little more smoothness of the esti-mator.

Definition 4.19. An estimator is said to be consistently uniformly asymp-totically normal (CUAN) if it is consistent, and if

√n(θn − θ) converges in

Page 57: What Is Econometrics - Rice University

4.3. ASYMPTOTIC INFERENCE 57

distribution to N(0,V), and if the convergence is uniform over some compactsubset of the parameter space Θ. �

Uniformity is introduced in order to rule out pathological cases where the limit-ing distribution changes as we vary the limit point θ. In such pathological cases,the preferred estimator can change as we move away from the pathological point.

Suppose that√n(θn−θ) converges in distribution to N(0,V). Let θn be an

alternative estimator such that√n(θn−θ) converges in distribution to N(0,W).

Definition 4.20. If θn is CUAN with asymptotic covariance V and θn is CUANwith asymptotic covariance W, then θn is asymptotically efficient relative to θn

if W −V is a positive semidefinite matrix. �

Among other properties asymptotic relative efficiency implies that the diagonalelements of V are no larger than those of W, so the asymptotic variances of θn,iare no larger than those of θn,i. And a similar result applies for the asymptoticvariance of any linear combination.

Sometimes an estimator can be shown to be efficient relative to all alterna-tives.

Definition 4.21. A CUAN estimator θn is said to be asymptotically efficientif it is asymptotically efficient relative to any other CUAN estimator. �

In the next chapter we find that when the Maximum Likelihood Estimator(MLE) is CUAN then it is efficient within the class.

4.3 Asymptotic Inference

4.3.1 Inference

Strictly speaking, inference is the process by which we infer properties of theunderlying model and distribution from the behavior of the observable variables.Point estimation of parameters of interest is certainly a leading example of suchinference. Beyond point estimation, we are interested in making probabilitystatements concerning the properties of interest or choosing between alterna-tive models. Such statements or choices are usually based on hypothesis tests,interval estimates, or probability values, all of which require the distribution ofthe estimator. In cases where the small sample distribution of the estimator isunavailable or formidably complex, we must rely on the large-sample limitingor asymptotic distribution of the estimator to provide answers.

4.3.2 Normal Ratios

To fix ideas, we begin with inference in the simple mean model. Under the con-

ditions of the central limit theorem, this model yields√n(xn − µ)

d−→ N(0, σ2),so

(xn − µ)√σ2/n

d−→ N(0, 1) (4.5)

Page 58: What Is Econometrics - Rice University

58 CHAPTER 4. ASYMPTOTIC THEORY

Suppose that σ2 is a consistent estimator of σ2. Then, by Proposition 4.3, wealso have

(xn − µ)√σ2/n

=

√σ2/n√σ2/n

(xn − µ)√σ2/n

=

√σ2

σ2

(xn − µ)√σ2/n

d−→ N(0, 1)

since the term under the square root converges in probability to one and theremainder converges in distribution to N(0, 1).

Most typically, such ratios will be used for inference in testing a hypothesis.Now, for H0 : µ = µ0, we have

(xn − µ0)√σ2/n

d−→ N(0, 1), (4.6)

while under H1 : µ = µ1 6= µ0, we find that

(xn − µ0)√σ2/n

=

√n(xn − µ1)√

σ2+

√n(µ1 − µ0)√

σ2

= N(0, 1) + Op(√n)

In conducting an hypothesis test, we choose the critical value c so that the sizeα, or probability our statistic exceeding c in absolute value, is sufficiently smallthat we prefer the alternative hypothesis where such extreme events are morelikely, particularly in large samples.

Such ratios are of interest in estimation and inference with regard to moregeneral parameters. Suppose that θ is an estimator of the parameter vector θand

√n( θp×1− θ)

d−→ N(0, Vp×p

).

Then, if θj is the parameter of particular interest, we consider

(θj − θj)√vjj/n

d−→ N(0, 1),

and duplicating the arguments for the sample mean

(θj − θj)√vjj/n

d−→ N(0, 1), (4.7)

where V is a consistent estimator of V. This ratio will have a similar behaviorunder a null and alternative hypothesis with regard to θj .

Page 59: What Is Econometrics - Rice University

4.3. ASYMPTOTIC INFERENCE 59

4.3.3 Asymptotic Chi-Square

The ratios just discussed are only appropriate for inference with respect to asingle parameter. Often, some question to be answered using inference involvesmore than one parameter at a time. In a leading example, we have one modelembedded in another as a special case then the more complex model reduces tothe special case if a set of parameters take on a specific value, such as zero.

Suppose that √n(θ − θ)

d−→ N(0,V),

where Ψ is a consistent estimator of the nonsingular p×p matrix Ψ. Then, fromthe previous chapter we have

√n(θ − θ)V−1

√n(θ − θ) = n · (θ − θ)V−1(θ − θ)

d−→ χ2p

andn · (θ − θ)V−1(θ − θ)

d−→ χ2p (4.8)

for V a consistent estimator of V.A similar result can also be obtained for any subvector of θ. Let

θ =

(θ1

θ2

),

where θ2 is a p2 × 1 vector. Then

√n(θ2 − θ2)

d−→ N(0,V22),

where V 22 is the lower right-hand p2 × p2 submatrix of Ψ and,

n · (θ2 − θ2)V−122 (θ2 − θ2)

d−→ χ2p1 (4.9)

where V 22 is the analogous lower right-hand submatrix.This result can be used to conduct inference by testing the sub-vector. If

H0 : θ2 = θ02, then

n · (θ2 − θ02)V−1

22 (θ2 − θ02)

d−→ χ2p2 ,

and large positive values are rare events. while for H1 : θ2 = θ12 6= θ0

2, we canshow (later)

n · (θ2 − θ02)V−1

22 (θ2 − θ02) = n · ((θ2 − θ1

2)

+(θ12 − θ

02))′V−1

22 ((θ2 − θ12) + (θ1

2 − θ02))

= n · (θ2 − θ12)V−1

22 (θ2 − θ12)

+2n · (θ12 − θ

02)V−1

22 (θ2 − θ12)

+n · (θ12 − θ

02)V−1

22 (θ12 − θ

02)

= χ2p2 + Op(

√n) + Op(n) (4.10)

Page 60: What Is Econometrics - Rice University

60 CHAPTER 4. ASYMPTOTIC THEORY

Thus, if we obtain a large value of the statistic, we may take it as evidence thatthe null hypothesis is incorrect.

The ratio approach of the previous section is a special case of the subvectorapproach. If the hypothesis to be tested only concerns the last element of theparameter vector so H0 : θp = θ0

p, then we can test using the ratio

(θj − θj)√vjj/n

d−→ N(0, 1),

or the quadratic form, which for this case is

n · (θp − θ0p)′v−1pp (θp − θ0

p) = n · (θp − θ0p)

2/vppd−→ χ2

1.

But the second is just the square of the first and the distribution consulted isthe chi-squared with 1 degree of freedom which results from a standard normalsquared. Note that large negative and positive excursions of θp from θ0

p impactthe quadratic statistic equally so it can only be used for two-sided tests.

4.3.4 Tests Of General Restrictions

We can use similar results to test general nonlinear restrictions. Suppose thatr(·) is a q × 1 continuously differentiable function, and

√n(θ − θ)

d−→ N(0,V).

By the mean value theorem we can obtain the exact Taylor’s series representa-tion

r(θ) = r(θ) +∂r(θ∗)

∂θ′(θ − θ)

or equivalently

√n(r(θ)− r(θ)) =

∂r(θ∗)

∂θ′√n(θ − θ)

= R(θ∗)√n(θ − θ)

where R(θ∗) = ∂r(θ∗)∂θ′ and θ∗ lies between θ and θ. Now θ →p θ so θ∗ →p θ

and R(θ∗)→p R(θ). Thus, we have

√n[ r(θ)− r(θ)]

d−→ N(0,R(θ)VR′(θ)).

Consider a test of H0 : r(θ) = 0. Assuming R(θ)VR′(θ) is nonsingular, wehave

n · r(θ)[R(θ)VR′(θ)]−1r(θ)d−→ χ2

q,

where q is the length of r(·). In practice, we substitute the consistent estimates

R(θ) for R(θ) and V for V to obtain, following the arguments given above

n · r(θ)[R(θ)VR′(θ)]−1r(θ)d−→ χ2

q, (4.11)

The behavior under the alternative hypothesis will be Op(n) as above.

Page 61: What Is Econometrics - Rice University

4.3. ASYMPTOTIC INFERENCE 61

4.A Appendix: Limit Theorems for ExtremumEstimators

In this section, we will develop results which establish the consistency andasymptotic normality of extremum estimators as defined by (4.4) under verygeneral conditions. These conditions are high-level, in the sense that they en-compass many models and must be verified for the specific model of interest.The assumptions on the specific model that lead to verification of the high-levelassumptions are low-level or primitive assumptions.

We first consider a result for consistency of the extremum estimator. Fornotational simplicity we will suppress the subscript n on the estimator θn.

Theorem 4.5. Suppose (C1) Θ ⊂Rp compact, (C2) ∃ψ0(θ) : θ −→ R s.t.ψn(θ) −→p ψ0(θ) uniformly over θ ∈ Θ, (C3) ψ0(θ) continuous over θ ∈ Θ,

(C4) for θ ∈ θ, ψ0(θ) attains a unique minimum for θ=θ0, then θ −→p θ0 �

Proof: The following proof uses the properties that the inf and sup of a compactset must exist, that a closed subset of a compact set is also compact, andthat the image set of a continuous function on a compact set is also compact.Consider any open neighborhood Θ0 of θ0, then the compliment Θ1 = Θ−Θ0

is a closed subset of Θ0 and hence compact by (C1). For θ∈ Θ1 we haveψ0(θ) − ψ0(θ0) > 0, by (C4). So, by continuity of ψ0(θ) and compactness ofΘ1, we find

δ∗ = infθ∈Θ1

(ψ0(θ)− ψ0(θ0))

exists and positive. Therefore, ∃δ s.t. δ∗ > δ > 0 and ψ0(θ) − ψ0(θ0) > δ forθ ∈ Θ1. Next, for any ε > 0, we find, by (C2), that |ψn(θ)− ψ0(θ)| ≤ ε forθ ∈ Θ with Pr approaching 1. Thus

ψ0(θ)− ψ0(θ0) = [ψ0(θ)− ψn(θ)] + [ψn(θ)− ψ0(θ0)]

= [ψ0(θ)− ψn(θ)] + [ψn(θ)− ψn(θ0)] + [ψn(θ0)− ψ0(θ0)]

≤ [ψ0(θ)− ψn(θ)] + [ψn(θ0)− ψ0(θ0)]

since [ψn(θ)− ψn(θ0)] ≤ 0 by definition

≤ |ψ0(θ)− ψn(θ)|+ |ψn(θ0)− ψ0(θ0)|≤ 2ε,

with Pr approaching 1. But if we choose ε = δ/2, then with Pr approaching 1,

ψ0(θ)−ψ0(θ0) ≤ δ and θ /∈ Θ1, which means θ ∈ Θ0, for any neighborhood ofθ0. �

The assumptions for this theorem are very high level and verification is some-times easy and sometimes not. The compactness assumption applies directlyto the parameter space and hence should not be difficult to verify. The conti-nuity assumption will usually be met if the functions of the underlying modelare themselves smooth but should be verified nonetheless. The global minimumassumption is not so restrictive as it might seem since we can always restrict θ

Page 62: What Is Econometrics - Rice University

62 CHAPTER 4. ASYMPTOTIC THEORY

to a smaller space where there is only one minimum. In practice, the uniformconvergence in probability assumption can become the most difficult to verifywhen the underlying model is complicated.

ψn(θ)

ψ0(θ)

ψ0(θ)+δ

θ0

θb

θa

ψmax ψ

0(θ)−δ

Figure 4.1: Consistency From Minimization

With a little more smoothness, we can give a geometric proof of the consis-tency theorem for the single parameter case. Consider Figure 4.1 above. Forsimplicity, we consider the limiting function ψ0(θ) to be approximately quadratic(more will be said on this below). For an arbitrary choice of δ > 0, we knowby the uniform convergence assumption that the function we are minimizingψn(θ), represented here by the slightly wiggly line, will lie in the sleeve betweenψ0(θ)− δ and ψ0(θ)+ δ with probability approaching one as n grow large. Now,if ψn(θ) lies in this region its’ minimum cannot exceed ψmax which is the min-imum of ψ0(θ) + δ since ψn(θ) lies below ψ0(θ) + δ. At the same time, thisminimum must also lie above ψ0(θ) − δ, which means that the value of θ that

minimizes ψn(θ), namely θ, must lie between θa and θb. For sufficiently largen, we can make δ arbitrarily small, whereupon the sleeve formed by ψ0(θ) − δand ψ0(θ) + δ about ψ0(θ) grows tighter and by continuity of ψ0(θ), θa and θbmust be closer and closer to θ0, with probability approaching one, which provesconsistency.

If the function ψ0(θ) is twice differentiable then in the neighborhood of theminimum it will be approximately quadratic and this proof will apply. Sincethere is a unique global minimum by assumption, the global minimum will be

Page 63: What Is Econometrics - Rice University

4.3. ASYMPTOTIC INFERENCE 63

in the neighborhood of θ0 and the proof applies.We now present a normality result for the extremum estimator.

Theorem 4.6. Suppose (N1) θ −→p θ0, θ0 interior Θ, (N2) ψn(θ) twicecontinuously differentiable in neighborhood N of θ0, (N3) ∃Ψ(θ), continuousat θ0, s.t. ∂2ψn(θ)/∂θ∂θ′ −→p Ψ(θ) uniformly over θ ∈ N , (N4) Ψ =

Ψ(θ0) nonsingular, (N5)√n∂ψn(θ0)/∂θ −→d N(0,V ), then

√n(θ−θ0) −→d

N(0,Ψ−1VΨ−1) �

Proof: By (N1) θ ∈ N with probability one, and we can use (N2) to expandthe first-order conditions in a Taylor’s series and add and subtract terms toyield

0 =∂ψn(θ)

∂θ

=∂ψn(θ0)

∂θ+∂2ψn(θ∗)

∂θ∂θ′(θ − θ0)

=∂ψn(θ0)

∂θ+ Ψ(θ0)(θ − θ0) + [Ψ(θ∗)−Ψ(θ0)](θ − θ0) +

[∂2ψn(θ∗)

∂θ∂θ′−Ψ(θ∗)

](θ − θ0)

for θ∗ between θ and θ0. It follows that θ∗ −→p θ0 so, by continuity and the

Slutzky Theorem, Ψ(θ∗) − Ψ(θ0) = op(1), and by (N3) ∂2ψn(θ∗)∂θ∂θ′ − Ψ(θ∗) =

op(1). Thus

−√n∂ψn(θ0)

∂θ= Ψ(θ0)

√n(θ − θ0) + op(

∥∥∥√n(θ − θ0)∥∥∥)

which, using (N4), can be multiplied through by Ψ(θ0)−1 and solving to obtain

√n(θ − θ0) = −Ψ(θ0)−1

√n∂ψn(θ0)

∂θ+ op(

∥∥∥√n(θ − θ0)∥∥∥)

−→ d N(0,Ψ−1VΨ−1) by (N5). �

The assumptions of this theorem are fairly standard but should be verified forthe model in question. The nonsingularity assumption will usually follow fromthe identification assumption introduced for the consistency theorem. In orderto perform inference based on the estimator θ, we need a consistent estimatorof the covariance matrix Ψ−1VΨ−1. The uniform convergence result for thesecond derivatives yields Ψ=∂2ψn(θ)/∂θ∂θ′ as an obvious consistent estimatorof θ. An appropriate consistent estimator of V will depend on the specifics ofthe model in question and may require additional assumptions.

4.B Appendix: Limit Theorems for DependentProcesses

In Section 4.2, we presented laws of large numbers (LLN) and central limittheorems CLT) for i.i.d. processes. Economic outcomes frequently depend on

Page 64: What Is Econometrics - Rice University

64 CHAPTER 4. ASYMPTOTIC THEORY

forward-looking decision-making which involves optimization over a future timehorizon that is predicted on the basis of past behaviour. To study such modelsinvolves analysis of data series that are repeated measures of the same processover time. The assumption of i.i.d. is questionable or sometimes obviouslyinappropriate for many of these models. Consequently, we will present limittheorems in this section that are appropriate for what are called dependentprocesses. We will only look at processes that involve discrete time so the timeindex is informative. A sequence of such random variables x1, x2, ... is calleda time-series in the statistical literature and a dependent process because therandom variables are not necessarily stochastically independent.

Since we will relax the i.i.d. assumption it will need to be replaced by somealternative assumptions. The secret of the limit theorems for i.i.d. processeswas that each additional observation brought more information to bear on theprocess, whose distribution was constant. In order to mimick these properties ina dependent process. we will need some notion of constancy and also additionalinformation from each observation.

Definition 4.22. A process is strictly stationary if for any fixed integer mthe joint distribution of (xn+1, xn+2, ..., xn+m) has the same distribution for allnonnegative integers n.

Thus if we have repeated observations on m continguous random variables,we have to notion of a constant target in the distribution.

Definition 4.23. A process having second moments is covariance stationaryif µ = E[xn] is the same for all positive integers n and for each nonnegativeinteger m, γm = E[(xn − µ)(xn+m − µ)] is the same for all n.

Note that strict stationarity implies covariance stationarity if the moments exist.Now we need some notion that additional information about the distribution

is obtained with additional time-series observations. Accordingly, we introducethe concept of ergodicity.

Definition 4.24. A stationary process is ergotic if the sample moments overtime converge in probability to the population moments at one point in time,which are always the same, due to stationarity.

This definition can also be stated as the following theorem:

Theorem 4.7. Suppose x1, x2, ... is strictly stationary and ergotic, then n−1∑nt=1 a(xt)

a.s.−→E[a(x1)] for any measurable functions a(x) with E[|a(x1)|] <∞.

There is a sense in which this theorem assumes away the problem. That is,ergodicity is equivalent to consistency, which is more or less true. Findingconditions for ergodicity to hold is the same as finding conditions for consistencyto hold so it is often just assumed.

More insight can be gained by looking at a mean ergotic process which isa stationary process with consistency of the first sample moment. We further

Page 65: What Is Econometrics - Rice University

4.3. ASYMPTOTIC INFERENCE 65

restrict our attention to covariance stationary processes. Note that E[xt] = µfor all t, so for xn = 1

n

∑nt=1 xt,we find

E[xn] = µ

and the sample mean is unbiased. If we can further show the variance convergesto zero then we will have convergence in quadratic mean and hence consistency.

Accordingly, consider the variance of the normalized and centered average√n(xn − µ), which has a more complicated variance than before, due to the

dependence,

σ2n = var[

√n(xn − µ)]

= E

[1√n

n∑t=1

(xt − µ)1√n

n∑t=1

(xt − µ)

]

=1

n

n∑t=1

n∑s=1

E[(xt − µ)(xs − µ)]

=1

n

n∑t=1

E[(xt − µ)2] +

n∑t=1

∑s6=t

E[(xt − µ)(xs − µ)]

.

From the definition above, we have γt−s = E[(xt − µ)(xs − µ)] for t > s, so

σ2n =

1

n

[nγ0 + 2

n∑t=2

t−1∑s=1

γt−s

]

=1

n

[nγ0 + 2

n−1∑s=1

(n− s)γs

]

= γ0 + 2

n−1∑s=1

n− sn

γs

= γ0 + 2

n−1∑s=1

(1− s

n)γs.

Since 2 > 2(1− s/n) > 0, we can apply the triangle inequality to find that

σ2n ≤ γ0 + 2

n−1∑s=1

(1− s

n) |γs|

≤ γ0 + 2

n−1∑s=1

|γs| .

Thus, the absolute summability condition∑ns=1 |γs| <∞ assures that σ2

n <∞,so var(xn) = σ2

n/n→ 0, whereupon we have proved the following result:

Page 66: What Is Econometrics - Rice University

66 CHAPTER 4. ASYMPTOTIC THEORY

Theorem 4.8. Suppose x1, x2, ... is covariance stationary and∑ns=1 |γs| <∞,

then xn = n−1∑nt=1 xt

p−→ µ = E[x1] .

This condition means the covariances must decline faster than $1/s$, in some sense. The memory of the process is declining, in the second-moment sense, so adding observations adds information that is not correlated with previous realizations of the process. Existence of higher moments, say fourth, and strict stationarity enable us to apply this result to similarly consistently estimate second moments of the process such as $\gamma_s = \mathrm{E}[(x_t - \mu)(x_{t-s} - \mu)]$ for finite $s$.
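As an illustration of Theorem 4.8 (added here, not part of the original text), the Python sketch below simulates a stationary AR(1) process, whose autocovariances $\gamma_s = \phi^s\gamma_0$ are absolutely summable for $|\phi| < 1$, and shows the sample mean settling toward the population mean; all parameter choices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
phi, mu, sigma_e = 0.7, 10.0, 1.0   # AR(1): x_t - mu = phi*(x_{t-1} - mu) + e_t

def sample_mean_ar1(n, burn=500):
    """Sample mean of a simulated stationary AR(1) path of length n."""
    x = mu
    total = 0.0
    for t in range(n + burn):
        x = mu + phi * (x - mu) + rng.normal(0.0, sigma_e)
        if t >= burn:
            total += x
    return total / n

for n in (100, 1_000, 10_000, 100_000):
    print(n, sample_mean_ar1(n))    # drifts toward mu = 10 as n grows
```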

Asymptotic normality requires a stronger notion of independence, such as the sequence being mixing of some form. Mixing places explicit restrictions on the degree of dependence between two collections of random variables that are sufficiently far apart. The easiest form of mixing to understand is $\rho$-mixing, which requires that the sequence $\{x_t\}_{t=-\infty}^{\infty}$ be strictly stationary. Let $\mathcal{A}_1^m = g(x_1, x_2, \ldots, x_m)$ represent a family of measurable functions of the first $m$ random variables in the stochastic process and $\mathcal{A}_{m+n}^{\infty} = h(x_{m+n}, x_{m+n+1}, \ldots)$ represent a family of measurable functions of the remainder of the variables starting at $m+n$, where both families are square-integrable random variables. Define $\rho_n = \sup_{w\in\mathcal{A}_1^m,\, v\in\mathcal{A}_{m+n}^{\infty}} |\operatorname{cor}(w, v)|$; then the sequence is $\rho$-mixing if $\lim_{n\to\infty}\rho_n = 0$. So the elements of the one collection are uniformly uncorrelated with the elements of the other collection when they are sufficiently far apart. Since $g(\cdot)$ and $h(\cdot)$ can be any measurable functions, the two collections are asymptotically independent.

Given this form of mixing we can now state a central limit theorem. Specifically, we have, from Ibragimov [1975],

Theorem 4.9. Suppose $x_1, x_2, \ldots$ is strictly stationary and $\rho$-mixing, $\mathrm{E}[|x_t - \mu|^{2+\delta}] < \infty$ for some $\delta > 0$, and $n\sigma_n^2 \to \infty$; then $\sigma_n^2$ converges with $\sigma^2 = \lim_{n\to\infty}\sigma_n^2$ and
$$\sqrt{n}(\bar{x}_n - \mu) \xrightarrow{d} N(0, \sigma^2).$$

Thus replacing the i.i.d. assumption with strict stationarity and $\rho$-mixing still requires a little more than existence of the mean and variance; specifically, we need slightly more than second moments. This is a condition that also crops up in central limit theorems for independent but not identically distributed processes.
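Theorem 4.9 can likewise be illustrated by simulation (an added sketch with arbitrary settings, not part of the original text): for the AR(1) process above, the limit $\sigma^2 = \gamma_0 + 2\sum_{s\ge 1}\gamma_s$ works out to $\sigma_e^2/(1-\phi)^2$, and the Monte Carlo variance of $\sqrt{n}(\bar{x}_n - \mu)$ should be close to it:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, mu, sigma_e = 0.7, 10.0, 1.0
n, burn, reps = 1_000, 500, 4_000

# Simulate `reps` independent AR(1) paths in parallel and record sample means.
x = np.full(reps, mu)
sums = np.zeros(reps)
for t in range(n + burn):
    x = mu + phi * (x - mu) + rng.normal(0.0, sigma_e, reps)
    if t >= burn:
        sums += x
z = np.sqrt(n) * (sums / n - mu)

print("Monte Carlo variance of sqrt(n)(xbar - mu):", z.var())
print("long-run variance sigma_e^2/(1-phi)^2    :", sigma_e**2 / (1 - phi)**2)
```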


Chapter 5

Maximum Likelihood Methods

5.1 Maximum Likelihood Estimation (MLE)

5.1.1 Motivation

Suppose we have a model for the random variable $y_i$, for $i = 1, 2, \ldots, n$, with unknown $(p \times 1)$ parameter vector $\theta$. In many cases, the model is sufficiently well specified to imply a distribution $f(y_i|\theta)$ for each realization of the variable $y_i$ and choice of the parameter vector $\theta$.

A basic premise of statistical inference is to avoid unlikely or rare models as, for example, in hypothesis testing. If we have a realization of a statistic that exceeds the critical value, then it is a rare event under the null hypothesis. Under the alternative hypothesis, however, such a realization is much more likely to occur, and we reject the null in favor of the alternative. Thus in choosing between the null and alternative, we prefer the model that makes the realization of the statistic more likely to have occurred.

Carrying this idea over to estimation, we select values of $\theta$ such that the corresponding values of $f(y_i|\theta)$ implied by the data are not unlikely. After all, we do not want a model that disagrees strongly with the data. Maximum likelihood estimation is merely a formalization of this notion that the model chosen should not be unlikely. Specifically, we choose the values of the parameters that make the realized data most likely to have occurred. This approach does, however, require that the model be specified in enough detail to imply a distribution for the variable of interest.


5.1.2 The Likelihood Function

Suppose that the random variables $y_1, y_2, \ldots, y_n$ are i.i.d. Then the joint density function for $n$ realizations is
$$f(y_1, y_2, \ldots, y_n|\theta) = f(y_1|\theta)\cdot f(y_2|\theta)\cdots f(y_n|\theta) = \prod_{i=1}^{n} f(y_i|\theta) \qquad (5.1)$$
for $\theta \in \Theta$, where $\Theta$ denotes the set of allowed $\theta$. Given values of the parameter vector $\theta$, this function allows us to assign local probability measures for various values of the random variables $y_1, y_2, \ldots, y_n$. This is the function which must be integrated to obtain probability statements concerning the joint outcomes of $y_1, y_2, \ldots, y_n$.

Given a set of realized values of the random variables, we use this same function to establish the probability measure associated with various choices of the parameter vector $\theta$.

Definition 5.1. The likelihood function of the parameters, for a particular sample $y_1, y_2, \ldots, y_n$, is the joint density function considered as a function of $\theta$ given the $y_i$'s. That is,
$$L(\theta|\mathbf{y}) = \prod_{i=1}^{n} f(y_i|\theta). \qquad (5.2)$$ □

5.1.3 Maximum Likelihood Estimation

For a particular choice of the parameter vector $\theta$, the likelihood function gives a probability measure for the realizations that occurred. As the values of $\theta$ are varied, the probability measure associated with the realizations may change. Consistent with the approach used in hypothesis testing, and using this function as the metric, we choose the $\theta$ that makes the realizations most likely to have occurred.

Definition 5.2. The maximum likelihood estimator of $\theta$ is the estimator obtained by maximizing $L(\theta|y_1, y_2, \ldots, y_n)$ with respect to $\theta$. That is,
$$\hat{\theta} = \operatorname*{argmax}_{\theta\in\Theta} L(\theta|\mathbf{y}), \qquad (5.3)$$
where $\hat{\theta}$ is called the MLE of $\theta$. □

Equivalently, since $\ln(\cdot)$ is a strictly monotonic transformation, we may find the MLE of $\theta$ by solving
$$\hat{\theta} = \operatorname*{argmax}_{\theta\in\Theta} \mathcal{L}(\theta|\mathbf{y}), \qquad (5.4)$$
where
$$\mathcal{L}(\theta|\mathbf{y}) = \ln L(\theta|\mathbf{y})$$


denotes the log-likelihood function. In practice, we obtain $\hat{\theta}$ by solving the first-order conditions (FOC)
$$\frac{\partial\mathcal{L}(\theta|\mathbf{y})}{\partial\theta} = \sum_{i=1}^{n}\frac{\partial\ln f(y_i|\theta)}{\partial\theta} = 0.$$
The motivation for using the log-likelihood function is apparent since the summation form will result, after division by $n$, in estimators that are approximately averages, about which we know a lot. This advantage is particularly clear in the following example.
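When the FOC have no closed-form solution, the log-likelihood is maximized numerically. The following is a minimal Python sketch of this (an illustration added here, not part of the original text), fitting a normal model by minimizing the negative log-likelihood with scipy; the simulated data and starting values are arbitrary choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(10.0, 5.0, size=100)          # simulated sample

def neg_loglik(params):
    """Negative log-likelihood of an i.i.d. N(mu, sigma^2) sample."""
    mu, log_sigma = params                   # optimize log(sigma) so sigma > 0
    return -norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_loglik, x0=[0.0, 0.0], method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                     # close to the closed-form MLE below
```

Parameterizing in $\ln\sigma$ keeps the scale positive without explicit constraints.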

Example 5.1. Suppose that $y_i \sim \text{i.i.d.}\; N(\mu, \sigma^2)$, for $i = 1, 2, \ldots, n$. Then,
$$f(y_i|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i-\mu)^2},$$
for $i = 1, 2, \ldots, n$. Using the likelihood function (5.2), we have
$$L(\mu,\sigma^2|\mathbf{y}) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i-\mu)^2}. \qquad (5.5)$$

Next, we take the logarithm of (5.5), which gives us
$$\begin{aligned}
\ln L(\mu,\sigma^2|\mathbf{y}) &= \sum_{i=1}^{n}\left[-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i-\mu)^2\right] \\
&= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2. \qquad (5.6)
\end{aligned}$$

We then maximize (5.6) with respect to both $\mu$ and $\sigma^2$. That is, we solve the following first-order conditions:

(A) $\dfrac{\partial\ln L(\cdot)}{\partial\mu} = \dfrac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0$;

(B) $\dfrac{\partial\ln L(\cdot)}{\partial\sigma^2} = -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - \mu)^2 = 0$.

By solving (A), we find that $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}_n$. Substituting this result into (B) and solving gives us $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y}_n)^2$. □

Note that $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{\mu})^2 \neq s^2$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \hat{\mu})^2$ is the familiar unbiased estimator of $\sigma^2$. So the maximum likelihood estimator $\hat{\sigma}^2$ is biased.

Example 5.2. Some notion of the attraction of maximum likelihood can be gained by a numerical example of the same problem just considered, where $y_i \sim \text{i.i.d.}\; N(\mu,\sigma^2)$. Choosing $\mu = 10$ and $\sigma = 5$, we randomly generate a sample of size $n = 100$. In the absence of an explicit choice of a distribution we are left with nonparametric approaches, such as the histogram, to assess probabilities.

For the sample generated, the resulting histogram given below in Figure 5.1 is skewed and does not particularly appear normal. We then estimate the model by MLE and plot the true pdf and the estimated pdf with $\hat{\mu} = 10.4058$ and $\hat{\sigma} = 5.1463$.

Figure 5.1: Histogram and MLE of Normal Distribution

Obviously, the maximum likelihood estimate of the distribution is much closer to the truth than the histogram. One could adopt more sophisticated nonparametric techniques that would result in a smoother estimated distribution than the histogram, but they could not begin to approach the accuracy of the MLE approach when we have chosen the correct distribution to estimate.
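The closed-form estimators from Example 5.1 are equally easy to compute directly; the sketch below (illustrative, with an arbitrary seed, so the estimates will not match the figures quoted above) also contrasts $\hat{\sigma}^2$ with the unbiased $s^2$:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(10.0, 5.0, size=100)            # mu = 10, sigma = 5, n = 100

mu_hat = y.mean()                              # MLE of mu
sigma2_hat = ((y - mu_hat) ** 2).mean()        # MLE of sigma^2 (divides by n)
s2 = ((y - mu_hat) ** 2).sum() / (len(y) - 1)  # unbiased estimator of sigma^2

print(mu_hat, np.sqrt(sigma2_hat), s2)
```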

5.2 Asymptotic Behavior of MLE

5.2.1 Assumptions

For the asymptotic results we will derive in the following sections, we will make five assumptions. These assumptions are specific to maximum likelihood and are introduced for pedagogic reasons. They are shown to satisfy the higher-level assumptions given in the Appendix to the previous chapter.


(M1) The $y_i$'s are i.i.d. random variables with density function $f(y_i|\theta) > 0$ for $i = 1, 2, \ldots, n$;

(M2) $\ln f(y|\theta)$, and hence $f(y|\theta)$, possesses continuous derivatives with respect to $\theta$ up to the third order for $\theta \in \Theta$;

(M3) The range of $y_i$ is independent of $\theta$, and differentiation to second order with respect to $\theta$ under the integral of $f(y|\theta)$ is possible;

(M4) The parameter vector $\theta \in \Theta$ is globally identified by the density function at the true parameter point $\theta_0$, which is interior to $\Theta$, and $\mathcal{I} = \mathrm{E}[\partial^2\ln f(y|\theta_0)/\partial\theta\,\partial\theta']$ is nonsingular;

(M5) $\partial^3\ln f(y_i|\theta)/\partial\theta_i\,\partial\theta_j\,\partial\theta_k$ is bounded in absolute value by some positive function $H_{ijk}(y)$ for all $y$ and $\theta \in \Theta$, which, in turn, has a finite expectation for all $\theta \in \Theta$.

The first assumption is fundamental and the basis of the estimator. If it is not satisfied then we are misspecifying the model and there is little hope of obtaining correct inferences, at least in finite samples. The second assumption is a regularity condition that is usually satisfied and easily verified. The first part of the third assumption is also easily verified and is guaranteed to be satisfied in models where the dependent variable has smooth and infinite support. The fourth assumption must be verified, which is easier in some cases than others. The last assumption is crucial; it really should be verified before MLE is undertaken but is usually ignored. Conditions for interchanging the order of differentiation and integration of $f(y|\theta)$, to satisfy the second part of the third assumption, require boundedness of the integrated norms of $\partial f(y|\theta)/\partial\theta$ and $\partial^2 f(y|\theta)/\partial\theta\,\partial\theta'$ and are discussed in the Appendix to this chapter.

Example 5.3. It is instructive to examine these assumptions with respect to the simple example just given, where we estimated the mean $\mu$ and variance $\sigma^2$ for $y_i \sim \text{i.i.d.}\; N(\mu,\sigma^2)$. The first two assumptions are obviously satisfied, and the third is met since the random variable has infinite support. Since the normal distribution is completely characterized by its mean and variance, knowing the distribution means knowing the parameters, and the fourth assumption is satisfied. The four unique third derivatives are:

$$\begin{aligned}
\partial^3\ln f(y_i|\theta)/\partial\mu^3 &= 0 \\
\partial^3\ln f(y_i|\theta)/\partial(\sigma^2)^3 &= -\frac{1}{(\sigma^2)^3} + \frac{3}{(\sigma^2)^4}(y_i-\mu)^2 \\
&= -\frac{1}{(\sigma^2)^3} + \frac{3}{(\sigma^2)^4}\left((y_i-\mu_0)^2 + 2(\mu_0-\mu)(y_i-\mu_0) + (\mu_0-\mu)^2\right) \\
\partial^3\ln f(y_i|\theta)/\partial(\sigma^2)^2\,\partial\mu &= \frac{1}{(\sigma^2)^3}(y_i-\mu) \\
&= \frac{1}{(\sigma^2)^3}(\mu_0-\mu) + \frac{1}{(\sigma^2)^3}(y_i-\mu_0) \\
\partial^3\ln f(y_i|\theta)/\partial\mu^2\,\partial(\sigma^2) &= \frac{1}{(\sigma^2)^2}.
\end{aligned}$$

For $\mu$ and $\sigma^2$ bounded (above and below), these are all bounded in absolute value by a polynomial in $|y_i - \mu_0|$, each term of which has a finite expectation, so Assumption (M5) is also met. From this we see that Assumption (M5) may implicitly place restrictions on the parameter space $\Theta$, such as the compactness assumption introduced in the Appendix. □

5.2.2 Some Preliminaries

In the following results we exploit Assumptions (M2) and (M3) to take differentiation inside the integral. Let $N$ be a neighborhood about the true parameter value $\theta_0$. Now, we know that, for any value of the parameter vector $\theta \in N$,
$$\int f(y|\theta)\,dy = 1 \qquad (5.7)$$

and hence
$$0 = \frac{\partial}{\partial\theta}\int f(y|\theta)\,dy,$$
where integration is understood to be the multiple integral and the limits of integration are $\pm\infty$. Interchanging the order of differentiation and integration and multiplying and dividing by the nonzero density yields

$$\begin{aligned}
0 &= \int\frac{\partial f(y|\theta)}{\partial\theta}\,dy \\
&= \int\frac{\partial f(y|\theta)/\partial\theta}{f(y|\theta)}\,f(y|\theta)\,dy \\
&= \int\frac{\partial\ln f(y|\theta)}{\partial\theta}\,f(y|\theta)\,dy, \qquad (5.8)
\end{aligned}$$
for any $\theta \in N$, and
$$0 = \mathrm{E}\left[\frac{\partial\ln f(y|\theta_0)}{\partial\theta}\right]$$

for any value of the true parameter vector $\theta_0 \in N$, where $\mathrm{E}[\cdot]$ indicates integration of the expression with respect to the true density $f(y|\theta_0)$. The first derivative of the log-likelihood function is known as the score function, and the result just given is the mean zero property of scores at the true values of the parameters.

Differentiating (5.8) again and bringing the differentiation inside the integral yields
$$0 = \int\left[\frac{\partial^2\ln f(y|\theta)}{\partial\theta\,\partial\theta'}\,f(y|\theta) + \frac{\partial\ln f(y|\theta)}{\partial\theta}\frac{\partial f(y|\theta)}{\partial\theta'}\right]dy. \qquad (5.9)$$

Since
$$\frac{\partial f(y|\theta)}{\partial\theta'} = \frac{\partial\ln f(y|\theta)}{\partial\theta'}\,f(y|\theta), \qquad (5.10)$$
then, for $\theta = \theta_0$, we can rewrite (5.9) as
$$0 = \mathrm{E}\left[\frac{\partial^2\ln f(y|\theta_0)}{\partial\theta\,\partial\theta'}\right] + \mathrm{E}\left[\frac{\partial\ln f(y|\theta_0)}{\partial\theta}\frac{\partial\ln f(y|\theta_0)}{\partial\theta'}\right] \qquad (5.11)$$
and define
$$\vartheta(\theta_0) = \mathrm{E}\left[\frac{\partial\ln f(y|\theta_0)}{\partial\theta}\frac{\partial\ln f(y|\theta_0)}{\partial\theta'}\right] = -\mathrm{E}\left[\frac{\partial^2\ln f(y|\theta_0)}{\partial\theta\,\partial\theta'}\right] \qquad (5.12)$$
for any value of the true parameter vector $\theta_0 \in N$. The matrix $\vartheta(\theta_0)$ is called the information matrix and the relationship given in (5.12) the information matrix equality.

Finally, we note, under Assumption (M1), that
$$\mathrm{E}\left[\frac{\partial^2\ln L(\theta_0|\mathbf{y})}{\partial\theta\,\partial\theta'}\right] = \mathrm{E}\left[\sum_{i=1}^{n}\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta\,\partial\theta'}\right] = \sum_{i=1}^{n}\mathrm{E}\left[\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta\,\partial\theta'}\right] = -n\vartheta(\theta_0), \qquad (5.13)$$

since the distributions are identical, and
$$\begin{aligned}
\mathrm{E}\left[\frac{\partial\ln L(\theta_0|\mathbf{y})}{\partial\theta}\frac{\partial\ln L(\theta_0|\mathbf{y})}{\partial\theta'}\right] &= \mathrm{E}\left[\sum_{i=1}^{n}\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta}\sum_{i=1}^{n}\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta'}\right] \\
&= \sum_{i=1}^{n}\mathrm{E}\left[\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta}\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta'}\right] = n\vartheta(\theta_0), \qquad (5.14)
\end{aligned}$$
since the covariances between different observations are zero. These last results will suggest two alternative estimators for $\vartheta(\theta_0)$.
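Indeed, (5.13) and (5.14) suggest estimating $\vartheta(\theta_0)$ either by the negative of the average Hessian or by the average outer product of the scores. A numerical sketch of this comparison for the normal model (an added illustration with arbitrary parameter values, not part of the original text; the derivatives are those of Example 5.1):

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sig2_0 = 10.0, 25.0
y = rng.normal(mu0, np.sqrt(sig2_0), size=100_000)

# Per-observation scores of ln f(y | mu, sigma^2) at the true parameters.
s_mu = (y - mu0) / sig2_0
s_sig2 = -0.5 / sig2_0 + 0.5 * (y - mu0) ** 2 / sig2_0**2
scores = np.column_stack([s_mu, s_sig2])

print(scores.mean(axis=0))            # mean zero property of scores (approx.)
print(scores.T @ scores / len(y))     # outer-product estimate of information

# Per-observation Hessian entries (analytical), averaged and sign-flipped.
h_mumu = -np.ones_like(y) / sig2_0
h_ss = 0.5 / sig2_0**2 - (y - mu0) ** 2 / sig2_0**3
h_mus = -(y - mu0) / sig2_0**2
H = np.array([[h_mumu.mean(), h_mus.mean()], [h_mus.mean(), h_ss.mean()]])
print(-H)                             # negative-Hessian estimate of information
print(np.diag([1 / sig2_0, 0.5 / sig2_0**2]))  # theoretical information
```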


5.2.3 Asymptotic Properties

5.2.3.1 Consistent Root Exists

We will first establish consistency of the maximum likelihood estimator under Assumptions (M1)-(M5) given above. These assumptions and the proof below appear different from the general approach given in the Appendix of Chapter 4. However, as will be shown in the Appendix to this chapter, these assumptions, suitably modified, satisfy those for the more general approach. Moreover, the approach taken here is particularly instructive in revealing the role of nonlinearity in estimation.

For notational simplicity, consider the case where $p = 1$, so $\theta$ is scalar. Then, dividing by $n$, Assumption (M2) allows us to expand in a Taylor series, and use the mean value theorem on the quadratic term, to yield
$$\begin{aligned}
\frac{1}{n}\frac{\partial\ln L(\theta|\mathbf{y})}{\partial\theta} &= \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta} + \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta^2}(\theta - \theta_0) \\
&\quad + \frac{1}{2}\,\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^3\ln f(y_i|\theta^*)}{\partial\theta^3}(\theta - \theta_0)^2 \qquad (5.15)
\end{aligned}$$

where $\theta^*$ lies between $\theta$ and $\theta_0$. Since, by Assumption (M5), every term in the left summation that follows is smaller in absolute value than the corresponding term in the right summation, we have
$$\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^3\ln f(y_i|\theta^*)}{\partial\theta^3} = \frac{k}{n}\sum_{i=1}^{n}H(y_i), \qquad (5.16)$$

for some $|k| \le 1$. So,
$$\frac{1}{n}\frac{\partial\ln L(\theta|\mathbf{y})}{\partial\theta} = a\delta^2 + b\delta + c, \qquad (5.17)$$
where
$$\begin{aligned}
\delta &= \theta - \theta_0, \\
a &= \frac{k}{2}\,\frac{1}{n}\sum_{i=1}^{n}H(y_i), \\
b &= \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta^2}, \text{ and} \\
c &= \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta}. \qquad (5.18)
\end{aligned}$$


We can apply a law of large numbers to the averages in (5.18) to obtain $\operatorname{plim}_{n\to\infty} c = 0$ and $\operatorname{plim}_{n\to\infty} b = -\vartheta(\theta_0) \neq 0$, under the preliminary results that followed from Assumptions (M1)-(M4), and $|a| = |k|\,\frac{1}{2}\frac{1}{n}\sum_{i=1}^{n}H(y_i) \le \frac{1}{2}\frac{1}{n}\sum_{i=1}^{n}H(y_i) = \frac{1}{2}\mathrm{E}[H(y_i)] + o_p(1) = O_p(1)$ under Assumption (M5).

Now, since $\partial\ln L(\hat\theta|\mathbf{y})/\partial\theta = 0$, at the MLE we have $a\hat\delta^2 + b\hat\delta + c = 0$ for $\hat\delta = \hat\theta - \theta_0$. There are two possibilities. If $a = 0$, then the FOC are linear in $\hat\delta$, whereupon $\hat\delta = -\frac{c}{b}$ and $\hat\delta \xrightarrow{p} 0$. If $a \neq 0$ with probability 1, which will occur when the FOC are nonlinear in $\hat\delta$, then
$$\hat\delta = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \qquad (5.19)$$

Since $ac = o_p(1)$, then $\hat\delta \xrightarrow{p} 0$ for the plus root, while $\hat\delta \xrightarrow{p} \vartheta(\theta_0)/\alpha$ for the negative root if $\operatorname{plim}_{n\to\infty} a = \alpha \neq 0$ exists. If the FOC are nonlinear but asymptotically linear, then $a \xrightarrow{p} 0$, so $ac$ in the numerator of (5.19) will go to zero faster than $a$ in the denominator, and again $\hat\delta \xrightarrow{p} 0$ for the plus root, but the negative root will be unbounded, which is ruled out if we consider a compact parameter space. Thus there exists at least one consistent solution $\hat\theta$ which satisfies
$$\operatorname{plim}_{n\to\infty}(\hat\theta - \theta_0) = 0, \qquad (5.20)$$
and in the asymptotically nonlinear case there is also a possibly inconsistent solution.

We can use the same style of proof to show consistency for the case of $\theta$ a vector. For $\theta' = (\theta_1, \theta_2)$, for example, we expand $0 = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\ln f(y_i|\theta_1,\theta_2)}{\partial\theta_1}$ in a quadratic around $\theta_1^0$, taking $\theta_2$ as given. By Assumption (M5), the coefficient on the quadratic term will still be bounded, and all the above results still apply for $\theta_1$ given any $\theta_2$. Since the order of the coefficients was arbitrary, this must apply for all the elements of $\theta$. The complication is the possible proliferation of roots, since $\theta_1$ may have multiple roots for each $\theta_2$, which may itself have multiple roots. In any case, we have demonstrated the existence of a consistent root.

5.2.3.2 Consistent Root Is Global Maximum

In the event of multiple roots, we are left with the problem of selecting between them. By Assumption (M4), the parameter $\theta$ is globally identified by the density function. Formally, this means that
$$f(y|\theta) = f(y|\theta_0) \qquad (5.21)$$
for all $y$ implies that $\theta = \theta_0$. Now,

$$\mathrm{E}\left[\frac{f(y|\theta)}{f(y|\theta_0)}\right] = \int_{-\infty}^{+\infty}\frac{f(y|\theta)}{f(y|\theta_0)}\,f(y|\theta_0)\,dy = 1. \qquad (5.22)$$


Thus, by Jensen's inequality, we have
$$\mathrm{E}\left[\ln\frac{f(y|\theta)}{f(y|\theta_0)}\right] < \ln\mathrm{E}\left[\frac{f(y|\theta)}{f(y|\theta_0)}\right] = 0, \qquad (5.23)$$
unless $f(y|\theta) = f(y|\theta_0)$ for all $y$, that is, unless $\theta = \theta_0$. Therefore, $\mathrm{E}[\ln f(y|\theta)]$ achieves a maximum if and only if $\theta = \theta_0$. This well-known result is called the information inequality.

However, we are seeking to maximize
$$\frac{1}{n}\sum_{i=1}^{n}\ln f(y_i|\theta) \xrightarrow{p} \mathrm{E}[\ln f(y|\theta)], \qquad (5.24)$$
where the convergence in probability is uniform for $\theta \in \Theta$. If we choose an inconsistent root, the objective function will converge in large samples to a smaller value; the consistent root will converge to a larger value. Accordingly, we choose the root with the larger value of the objective function, which will be appropriate asymptotically. This choice of the global root has added appeal since it is, in fact, the MLE among the possible alternatives and hence the choice that makes the realized data most likely to have occurred.

There are complications in finite samples, since the values of the likelihood function for alternative roots may cross over as the sample size increases. That is, the global maximum in small samples may not be the global maximum in larger samples. An added problem is to identify all the alternative roots so we can choose the global maximum. Sometimes a solution is available in a simple consistent estimator, which may be used to start the nonlinear MLE optimization.
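In practice this suggests a multi-start strategy, sketched below under assumed conventions (not from the original text): begin the optimizer from a simple consistent estimator, such as a method-of-moments estimate, plus a few random points, and keep the root with the largest log-likelihood. The normal model used here actually has a unique root, so the extra starts are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(10.0, 5.0, size=200)

def neg_loglik(params):
    mu, log_sigma = params
    return -norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)).sum()

# Start from a consistent method-of-moments estimate plus random points.
starts = [np.array([y.mean(), np.log(y.std())])] + \
         [rng.normal(0.0, 3.0, size=2) for _ in range(5)]
fits = [minimize(neg_loglik, x0, method="BFGS") for x0 in starts]
best = min(fits, key=lambda r: r.fun)  # smallest negative = largest log-likelihood
print(best.x[0], np.exp(best.x[1]))
```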

5.2.3.3 Asymptotic Normality

We now establish the asymptotic normality of the maximum likelihood estimator. We first focus on the scalar case ($p = 1$). Recall that $a\hat\delta^2 + b\hat\delta + c = 0$, so
$$\hat\delta = \hat\theta - \theta_0 = \frac{-c}{a\hat\delta + b} \qquad (5.25)$$
and
$$\sqrt{n}(\hat\theta - \theta_0) = \frac{-\sqrt{n}\,c}{a(\hat\theta - \theta_0) + b} = \frac{-1}{a(\hat\theta - \theta_0) + b}\,\sqrt{n}\,c. \qquad (5.26)$$
Now since $a = O_p(1)$ and $\hat\theta - \theta_0 = o_p(1)$, then $a(\hat\theta - \theta_0) = o_p(1)$ and
$$a(\hat\theta - \theta_0) + b \xrightarrow{p} -\vartheta(\theta_0). \qquad (5.27)$$


And by the CLT we have
$$\sqrt{n}\,c = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta} \xrightarrow{d} N(0, \vartheta(\theta_0)). \qquad (5.28)$$
Substituting these two results into (5.26), we find
$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, 1/\vartheta(\theta_0)).$$

A similar approach can be applied when $p > 1$. Expanding $\frac{\partial\mathcal{L}(\hat{\boldsymbol\theta})}{\partial\boldsymbol\theta} = 0$ in a Taylor series about $\boldsymbol\theta_0$ and dividing by $n$, we obtain
$$\begin{aligned}
\frac{1}{n}\frac{\partial\mathcal{L}(\hat{\boldsymbol\theta})}{\partial\boldsymbol\theta} &= \frac{1}{n}\frac{\partial\mathcal{L}(\boldsymbol\theta_0)}{\partial\boldsymbol\theta} + \frac{1}{n}\frac{\partial^2\mathcal{L}(\boldsymbol\theta_0)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta'}(\hat{\boldsymbol\theta} - \boldsymbol\theta_0) \\
&\quad + \frac{1}{2}\sum_{j}\frac{1}{n}\frac{\partial^3\mathcal{L}(\boldsymbol\theta^*)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta'\,\partial\theta_j}(\hat{\boldsymbol\theta} - \boldsymbol\theta_0)(\hat\theta_j - \theta_j^0)
\end{aligned}$$

for $\boldsymbol\theta^*$ between $\hat{\boldsymbol\theta}$ and $\boldsymbol\theta_0$. Let $A_j = \frac{1}{2}\frac{1}{n}\frac{\partial^3\mathcal{L}(\boldsymbol\theta^*)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta'\,\partial\theta_j}$, $B = \frac{1}{n}\frac{\partial^2\mathcal{L}(\boldsymbol\theta_0)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta'}$, $c = \frac{1}{n}\frac{\partial\mathcal{L}(\boldsymbol\theta_0)}{\partial\boldsymbol\theta}$, and $\delta = \hat{\boldsymbol\theta} - \boldsymbol\theta_0$; then we have the vector analog of the quadratic formula
$$0 = c + B\delta + \sum_{j=1}^{p} A_j\,\delta\,\delta_j,$$
which can be rearranged to obtain, following multiplication by $\sqrt{n}$,
$$\sqrt{n}\,\delta = -\left(B + \sum_{j=1}^{p} A_j\,\delta_j\right)^{-1}\sqrt{n}\,c.$$

Now $B \xrightarrow{p} -\vartheta(\boldsymbol\theta_0)$ and $\sqrt{n}\,c \xrightarrow{d} N(0, \vartheta(\boldsymbol\theta_0))$, and, by Assumption (M5), $A_j = O_p(1)$ elementwise and $\delta_j = o_p(1)$, so the summation is $o_p(1)$. Thus we have
$$\sqrt{n}(\hat{\boldsymbol\theta} - \boldsymbol\theta_0) \xrightarrow{d} N(0, \vartheta^{-1}(\boldsymbol\theta_0)). \qquad (5.29)$$
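Result (5.29) is what justifies the standard errors reported by ML software: the covariance of $\hat\theta$ is estimated by the inverse Hessian of the negative log-likelihood at $\hat\theta$. A minimal sketch (an added illustration with simulated data; the optimizer's built-in inverse-Hessian approximation is rough, and a careful analysis would compute the Hessian exactly):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(10.0, 5.0, size=500)

def neg_loglik(params):
    mu, sigma = params
    return -norm.logpdf(y, loc=mu, scale=sigma).sum()

res = minimize(neg_loglik, x0=[1.0, 1.0], method="L-BFGS-B",
               bounds=[(None, None), (1e-6, None)])
# res.hess_inv approximates the inverse Hessian of the negative log-likelihood,
# i.e. an estimate of the asymptotic covariance of (mu_hat, sigma_hat).
cov = res.hess_inv.todense()           # L-BFGS-B returns a LinearOperator
print(res.x, np.sqrt(np.diag(cov)))    # estimates and standard errors
```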

5.2.3.4 Cramer-Rao Lower Bound

In addition to being the covariance matrix of the MLE, $\vartheta^{-1}(\theta_0)$ defines a lower bound for covariance matrices of estimators with certain desirable properties. Let $\hat\theta_n(\mathbf{y})$ be any unbiased estimator; then
$$\mathrm{E}\,\hat\theta_n(\mathbf{y}) = \int_{-\infty}^{+\infty}\hat\theta_n(\mathbf{y})\,f(\mathbf{y};\theta_0)\,d\mathbf{y} = \theta_0 \qquad (5.30)$$


for any underlying $\theta_0$. Note that the expectation is with respect to the joint density and involves integrating with respect to all the $y_i$'s. Differentiating both sides of this relationship with respect to $\theta$ yields
$$\begin{aligned}
I_p = \frac{\partial\,\mathrm{E}\,\hat\theta_n(\mathbf{y})}{\partial\theta'} &= \int_{-\infty}^{+\infty}\hat\theta_n(\mathbf{y})\frac{\partial f(\mathbf{y};\theta_0)}{\partial\theta'}\,d\mathbf{y} \\
&= \int_{-\infty}^{+\infty}\hat\theta_n(\mathbf{y})\frac{\partial\ln f(\mathbf{y};\theta_0)}{\partial\theta'}\,f(\mathbf{y};\theta_0)\,d\mathbf{y} \\
&= \mathrm{E}\left[\hat\theta_n(\mathbf{y})\frac{\partial\ln f(\mathbf{y};\theta_0)}{\partial\theta'}\right] = \mathrm{E}\left[(\hat\theta_n(\mathbf{y}) - \theta_0)\frac{\partial\ln f(\mathbf{y};\theta_0)}{\partial\theta'}\right]. \qquad (5.31)
\end{aligned}$$

Next, we define
$$C(\theta_0) = \mathrm{E}\left[(\hat\theta_n(\mathbf{y}) - \theta_0)(\hat\theta_n(\mathbf{y}) - \theta_0)'\right] \qquad (5.32)$$
as the covariance matrix of $\hat\theta_n(\mathbf{y})$ and note from (5.14) that $\mathrm{E}\left[\frac{\partial\ln L}{\partial\theta}\frac{\partial\ln L}{\partial\theta'}\right] = n\vartheta(\theta_0)$, whereupon
$$\operatorname{Cov}\begin{pmatrix}\hat\theta_n(\mathbf{y}) \\ \frac{\partial\ln L}{\partial\theta}\end{pmatrix} = \begin{pmatrix}C(\theta_0) & I_p \\ I_p & n\vartheta(\theta_0)\end{pmatrix}, \qquad (5.33)$$
which, as a covariance matrix, is positive semidefinite. Now, for any $(p\times 1)$ vector $a$, we have
$$\begin{pmatrix}a' & -a'\frac{1}{n}\vartheta(\theta_0)^{-1}\end{pmatrix}\begin{pmatrix}C(\theta_0) & I_p \\ I_p & n\vartheta(\theta_0)\end{pmatrix}\begin{pmatrix}a \\ -\frac{1}{n}\vartheta(\theta_0)^{-1}a\end{pmatrix} = a'\left[C(\theta_0) - \frac{1}{n}\vartheta(\theta_0)^{-1}\right]a \ge 0.$$

Thus, any unbiased estimator $\hat\theta_n(\mathbf{y})$ has a covariance matrix that exceeds $\frac{1}{n}\vartheta(\theta_0)^{-1}$ by a positive semidefinite matrix. So if the MLE is unbiased and has $\frac{1}{n}\vartheta(\theta_0)^{-1}$ as its covariance matrix, then it is efficient within the class of unbiased estimators.

It is entirely possible that the MLE will be biased in small samples but will still have optimality properties in large samples. As we have shown, under Assumptions (M1)-(M5), $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \vartheta(\theta_0)^{-1})$, and the MLE is, moreover, CUAN (consistent and uniformly asymptotically normal). In much the same fashion as above for the class of unbiased estimators, we can demonstrate that, in large samples, any other CUAN estimator will have a limiting distribution with a covariance matrix exceeding $\vartheta(\theta_0)^{-1}$. Thus MLE is asymptotically efficient within this class, and $\vartheta(\theta_0)^{-1}$ is called the Cramer-Rao lower bound.


Example 5.4. Consider again the maximum likelihood problem posed in Example 5.1, where $y_i \sim \text{i.i.d.}\; N(\mu, \sigma^2)$, for $i = 1, 2, \ldots, n$. Then,
$$\ln f(y|\mu,\sigma^2) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - \mu)^2,$$
with first derivatives
$$\frac{\partial\ln f(\cdot)}{\partial\mu} = \frac{1}{\sigma^2}(y - \mu), \qquad \frac{\partial\ln f(\cdot)}{\partial\sigma^2} = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(y - \mu)^2.$$
For $\theta' = (\mu, \sigma^2)$, then
$$\vartheta(\theta_0) = \mathrm{E}\left[\frac{\partial\ln f(y|\theta_0)}{\partial\theta}\frac{\partial\ln f(y|\theta_0)}{\partial\theta'}\right] = \begin{pmatrix}1/\sigma_0^2 & 0 \\ 0 & 1/(2\sigma_0^4)\end{pmatrix}$$
and
$$\vartheta(\theta_0)^{-1} = \begin{pmatrix}\sigma_0^2 & 0 \\ 0 & 2\sigma_0^4\end{pmatrix}.$$
Thus $\sigma_0^2/n$ is the lower bound for unbiased estimation of the mean, and it is attained by the sample average $\bar{y}_n$. The estimator $\hat\sigma^2$ is biased but asymptotically unbiased, with $\sqrt{n}(\hat\sigma^2 - \sigma_0^2) \xrightarrow{d} N(0, 2\sigma_0^4)$, and hence attains the bound asymptotically. □
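A quick Monte Carlo check of these bounds (an added sketch with arbitrary settings, not from the original text): the sampling variance of $\bar{y}_n$ should be near $\sigma_0^2/n$ and that of $\hat\sigma^2$ near $2\sigma_0^4/n$ for the settings below:

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sig0, n, reps = 10.0, 5.0, 400, 20_000

ybars = np.empty(reps)
sig2_hats = np.empty(reps)
for r in range(reps):
    y = rng.normal(mu0, sig0, size=n)
    ybars[r] = y.mean()
    sig2_hats[r] = ((y - y.mean()) ** 2).mean()

print(ybars.var(), sig0**2 / n)            # Monte Carlo vs bound for the mean
print(sig2_hats.var(), 2 * sig0**4 / n)    # Monte Carlo vs asymptotic bound
```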

5.3 Maximum Likelihood Inference

5.3.1 Likelihood Ratio Test

Suppose we wish to test $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$. Then we define
$$L_u = \max_{\theta} L(\theta|\mathbf{y}) = L(\hat\theta|\mathbf{y}) \qquad (5.34)$$
and
$$L_r = L(\theta_0|\mathbf{y}), \qquad (5.35)$$
where $L_u$ is the unrestricted likelihood and $L_r$ is the restricted likelihood. We then form the likelihood ratio
$$\lambda = \frac{L_r}{L_u}. \qquad (5.36)$$

Note that the restricted likelihood can be no larger than the unrestricted likelihood, since the latter maximizes the function.

As with estimation, it is more convenient to work with the logs of the likelihood functions. It will be shown below that, under $H_0$,
$$\mathrm{LR} = -2\ln\lambda = -2\left[\ln\frac{L_r}{L_u}\right] = 2[\mathcal{L}(\hat\theta|\mathbf{y}) - \mathcal{L}(\theta_0|\mathbf{y})] \xrightarrow{d} \chi^2_p, \qquad (5.37)$$

where $\hat\theta$ is the unrestricted MLE and $\theta_0$ is the restricted MLE. If $H_1$ applies, then $\mathrm{LR} = O_p(n)$. Large values of this statistic indicate that the restrictions make the observed values much less likely than does the unrestricted model, so we prefer the unrestricted model and reject the restrictions.

In general, for $H_0: r(\theta) = 0$ and $H_1: r(\theta) \neq 0$, we have
$$L_u = \max_{\theta} L(\theta|\mathbf{y}) = L(\hat\theta|\mathbf{y}) \qquad (5.38)$$
and
$$L_r = \max_{\theta:\, r(\theta) = 0} L(\theta|\mathbf{y}) = L(\tilde\theta|\mathbf{y}), \qquad (5.39)$$

where $\tilde\theta$ is the restricted maximum likelihood estimator. Under $H_0$,
$$\mathrm{LR} = 2[\mathcal{L}(\hat\theta|\mathbf{y}) - \mathcal{L}(\tilde\theta|\mathbf{y})] \xrightarrow{d} \chi^2_q, \qquad (5.40)$$
where $q$ is the length of $r(\cdot)$. Note that, in the general case, the likelihood ratio test requires calculation of both the restricted and the unrestricted MLE.
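As a concrete sketch (an added illustration, not the text's own code), the LR statistic for $H_0: \mu = \mu_0$ in the normal model, with $\sigma^2$ unrestricted so that $q = 1$, uses the closed-form restricted and unrestricted MLEs:

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(0)
y = rng.normal(10.0, 5.0, size=100)
mu_0 = 9.0                                   # hypothesized mean under H0

def loglik(mu, sig2):
    return norm.logpdf(y, loc=mu, scale=np.sqrt(sig2)).sum()

# Unrestricted MLE (closed form) and restricted MLE under H0: mu = mu_0.
mu_hat, sig2_hat = y.mean(), ((y - y.mean()) ** 2).mean()
sig2_tilde = ((y - mu_0) ** 2).mean()

LR = 2 * (loglik(mu_hat, sig2_hat) - loglik(mu_0, sig2_tilde))
print(LR, chi2.sf(LR, df=1))                 # statistic and asymptotic p-value
```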

5.3.2 Wald Test

The asymptotic normality of the MLE may be used to obtain a test based only on the unrestricted estimates.

Now, under $H_0: \theta = \theta_0$, we have
$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \vartheta^{-1}(\theta_0)). \qquad (5.41)$$
Thus, using the results on the asymptotic behavior of quadratic forms from the previous chapter, we have
$$W = n(\hat\theta - \theta_0)'\,\vartheta(\theta_0)\,(\hat\theta - \theta_0) \xrightarrow{d} \chi^2_p, \qquad (5.42)$$
which is the Wald test. As we discussed for quadratic tests in general, under $H_1: \theta_1 \neq \theta_0$ we would have $W = O_p(n)$.

The statistic $W$ is generally infeasible, since it requires knowledge of $\vartheta(\theta_0)$. In practice, since under Assumptions (M2) and (M5) we can show
$$\frac{1}{n}\frac{\partial^2\mathcal{L}(\hat\theta|\mathbf{y})}{\partial\theta\,\partial\theta'} = \sum_{i=1}^{n}\frac{1}{n}\frac{\partial^2\ln f(y_i|\hat\theta)}{\partial\theta\,\partial\theta'} \xrightarrow{p} -\vartheta(\theta_0), \qquad (5.43)$$
we instead use the feasible statistic
$$W^* = -n(\hat\theta - \theta_0)'\,\frac{1}{n}\frac{\partial^2\mathcal{L}(\hat\theta|\mathbf{y})}{\partial\theta\,\partial\theta'}\,(\hat\theta - \theta_0) = -(\hat\theta - \theta_0)'\frac{\partial^2\mathcal{L}(\hat\theta|\mathbf{y})}{\partial\theta\,\partial\theta'}(\hat\theta - \theta_0) \xrightarrow{d} \chi^2_p. \qquad (5.44)$$
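A sketch of the feasible Wald statistic (5.44) for the simple hypothesis $\theta = \theta_0$ in the normal model (so $p = 2$); the central-difference Hessian helper is an assumption of this illustration, not something supplied by the text or by scipy:

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(0)
y = rng.normal(10.0, 5.0, size=100)
theta_0 = np.array([9.0, 25.0])              # H0 values for (mu, sigma^2)

def loglik(th):
    return norm.logpdf(y, loc=th[0], scale=np.sqrt(th[1])).sum()

theta_hat = np.array([y.mean(), ((y - y.mean()) ** 2).mean()])

def hessian(f, x, h=1e-3):
    """Central-difference Hessian of f at x (simple illustrative helper)."""
    p = len(x)
    H = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            e_i, e_j = np.eye(p)[i] * h, np.eye(p)[j] * h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h * h)
    return H

d = theta_hat - theta_0
W_star = -d @ hessian(loglik, theta_hat) @ d
print(W_star, chi2.sf(W_star, df=2))         # statistic and asymptotic p-value
```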


Aside from having the same asymptotic distribution, the likelihood ratio and Wald tests are asymptotically equivalent in the sense that
$$\operatorname{plim}_{n\to\infty}(\mathrm{LR} - W^*) = 0. \qquad (5.45)$$

This is shown by expanding $\mathcal{L}(\theta_0|\mathbf{y})$ in a Taylor series about $\hat\theta$. That is,
$$\begin{aligned}
\mathcal{L}(\theta_0) &= \mathcal{L}(\hat\theta) + \frac{\partial\mathcal{L}(\hat\theta)}{\partial\theta'}(\theta_0 - \hat\theta) + \frac{1}{2}(\theta_0 - \hat\theta)'\frac{\partial^2\mathcal{L}(\hat\theta)}{\partial\theta\,\partial\theta'}(\theta_0 - \hat\theta) \\
&\quad + \frac{1}{6}\sum_i\sum_j\sum_k\frac{\partial^3\mathcal{L}(\theta^*)}{\partial\theta_i\,\partial\theta_j\,\partial\theta_k}(\theta_i^0 - \hat\theta_i)(\theta_j^0 - \hat\theta_j)(\theta_k^0 - \hat\theta_k), \qquad (5.46)
\end{aligned}$$

where the last term applies the intermediate value theorem for $\theta^*$ between $\hat\theta$ and $\theta_0$. Now $\partial\mathcal{L}(\hat\theta)/\partial\theta = 0$, and the last term can be shown to be $O_p(1/\sqrt{n})$ under Assumptions (M2) and (M5), whereupon we have
$$\mathcal{L}(\hat\theta) - \mathcal{L}(\theta_0) = -\frac{1}{2}(\theta_0 - \hat\theta)'\frac{\partial^2\mathcal{L}(\hat\theta)}{\partial\theta\,\partial\theta'}(\theta_0 - \hat\theta) + O_p(1/\sqrt{n}) \qquad (5.47)$$
and
$$\mathrm{LR} = W^* + O_p(1/\sqrt{n}). \qquad (5.48)$$

This means that, in large samples, the two tests will reject and accept together under the null hypothesis.

In general, we may test $H_0: r(\theta) = 0$ with
$$W^* = -r(\hat\theta)'\left[R(\hat\theta)\left(\frac{\partial^2\mathcal{L}(\hat\theta)}{\partial\theta\,\partial\theta'}\right)^{-1} R'(\hat\theta)\right]^{-1} r(\hat\theta) \xrightarrow{d} \chi^2_q, \qquad (5.49)$$
where $R(\theta) = \partial r(\theta)/\partial\theta'$.

5.3.3 Lagrange Multiplier

Alternatively, but in the same fashion, we can expand $\mathcal{L}(\hat\theta)$ about $\theta_0$ to obtain
$$\mathcal{L}(\hat\theta) = \mathcal{L}(\theta_0) + \frac{\partial\mathcal{L}(\theta_0)}{\partial\theta'}(\hat\theta - \theta_0) + \frac{1}{2}(\hat\theta - \theta_0)'\frac{\partial^2\mathcal{L}(\theta_0)}{\partial\theta\,\partial\theta'}(\hat\theta - \theta_0) + O_p(1/\sqrt{n}). \qquad (5.50)$$

Likewise, we can also expand $\frac{1}{n}\frac{\partial\mathcal{L}(\hat\theta)}{\partial\theta}$ about $\theta_0$, which yields
$$0 = \frac{1}{n}\frac{\partial\mathcal{L}(\hat\theta)}{\partial\theta} = \frac{1}{n}\frac{\partial\mathcal{L}(\theta_0)}{\partial\theta} + \frac{1}{n}\frac{\partial^2\mathcal{L}(\theta_0)}{\partial\theta\,\partial\theta'}(\hat\theta - \theta_0) + O_p(1/n), \qquad (5.51)$$


or
$$(\hat\theta - \theta_0) = -\left(\frac{\partial^2\mathcal{L}(\theta_0)}{\partial\theta\,\partial\theta'}\right)^{-1}\frac{\partial\mathcal{L}(\theta_0)}{\partial\theta} + O_p(1/n). \qquad (5.52)$$

Substituting (5.52) into (5.50) gives us
$$\mathcal{L}(\hat\theta) - \mathcal{L}(\theta_0) = -\frac{1}{2}\frac{\partial\mathcal{L}(\theta_0)}{\partial\theta'}\left(\frac{\partial^2\mathcal{L}(\theta_0)}{\partial\theta\,\partial\theta'}\right)^{-1}\frac{\partial\mathcal{L}(\theta_0)}{\partial\theta} + O_p(1/\sqrt{n}), \qquad (5.53)$$
and $\mathrm{LR} = \mathrm{LM} + O_p(1/\sqrt{n})$, where
$$\mathrm{LM} = -\frac{\partial\mathcal{L}(\theta_0)}{\partial\theta'}\left(\frac{\partial^2\mathcal{L}(\theta_0)}{\partial\theta\,\partial\theta'}\right)^{-1}\frac{\partial\mathcal{L}(\theta_0)}{\partial\theta} \qquad (5.54)$$

is the Lagrange multiplier test. Thus, under $H_0: \theta = \theta_0$,
$$\operatorname{plim}_{n\to\infty}(\mathrm{LR} - \mathrm{LM}) = 0 \qquad (5.55)$$
and
$$\mathrm{LM} \xrightarrow{d} \chi^2_p. \qquad (5.56)$$
Note that the Lagrange multiplier test only requires the restricted values of the parameters.

In general, we may test $H_0: r(\theta) = 0$ with the analogous form
$$\mathrm{LM} = -\frac{\partial\mathcal{L}(\tilde\theta)}{\partial\theta'}\left(\frac{\partial^2\mathcal{L}(\tilde\theta)}{\partial\theta\,\partial\theta'}\right)^{+}\frac{\partial\mathcal{L}(\tilde\theta)}{\partial\theta} \xrightarrow{d} \chi^2_q, \qquad (5.57)$$
where $\mathcal{L}(\cdot)$ is the unrestricted log-likelihood function and $\tilde\theta$ is the restricted MLE. Thus we are testing whether the unrestricted derivatives of the log-likelihood, or scores, have mean zero at the restricted estimates. Consequently, the Lagrange multiplier test is sometimes called the score test. Note that we utilize a generalized inverse, since the covariance matrix will generally be singular due to the restrictions.
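Finally, a score-test sketch for the same normal-mean hypothesis as in the LR example (an added illustration; it uses the estimated information matrix $n\vartheta(\tilde\theta)$ from Example 5.4, which is nonsingular here, in place of the generalized inverse described above):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
y = rng.normal(10.0, 5.0, size=100)
mu_0, n = 9.0, 100

# Restricted MLE under H0 (mu fixed at mu_0, sigma^2 free).
sig2_t = ((y - mu_0) ** 2).mean()

# Scores of the unrestricted log-likelihood at the restricted estimates;
# the sigma^2 score is zero by construction of sig2_t.
score = np.array([(y - mu_0).sum() / sig2_t, 0.0])

# Estimated full-sample information n * theta(theta_tilde), diagonal
# for the normal model as in Example 5.4.
info = n * np.diag([1 / sig2_t, 1 / (2 * sig2_t**2)])

LM = score @ np.linalg.inv(info) @ score
print(LM, chi2.sf(LM, df=1))           # statistic and asymptotic p-value
```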

5.3.4 Comparing the Tests

In the scalar case the relationship between the three tests can be seen graphically in Figure 5.2. With the parameter values on the horizontal axis and the log-likelihood values on the vertical, we plot an example log-likelihood function. The maximum occurs at $\hat\theta$ with the corresponding log-likelihood $\mathcal{L}(\hat\theta)$. The restricted parameter value given by the null hypothesis is $\theta_0$, with corresponding log-likelihood $\mathcal{L}(\theta_0)$, which is smaller. The score function $\frac{\partial\mathcal{L}(\theta_0)}{\partial\theta}$ is represented by the slope of the tangent line at $\mathcal{L}(\theta_0)$.

Figure 5.2: Comparison of LR, W, and LM Statistics

The Wald test is based on the (squared) difference $\hat\theta - \theta_0$, with an adjustment made for the variability of $\hat\theta$ as measured by the inverse of the estimated information matrix (ignoring the $n$ factor). The likelihood ratio is based on the difference $\mathcal{L}(\hat\theta) - \mathcal{L}(\theta_0)$, with no adjustment needed. Obviously, for a particular value of $(\hat\theta - \theta_0)^2$, the more pronounced the peak of the log-likelihood, the more precise the estimates and the smaller the estimated variances we divide by in the Wald statistic, which makes the adjusted statistic larger, as would be the case with the log-likelihood difference. The Lagrange multiplier test is based on the slope of the log-likelihood function at the restricted value of the parameter, which will be larger in the more pronounced peak case. This (squared) slope is adjusted by dividing by the second-order curvature, which is inversely related to the variance and hence larger in the more peaked case. In any event, larger values of $(\hat\theta - \theta_0)^2$ will occur with larger values of $\left(\frac{\partial\mathcal{L}(\theta_0)}{\partial\theta}\right)^2$ and of $\mathcal{L}(\hat\theta) - \mathcal{L}(\theta_0)$.

The above analysis demonstrates that the three tests (likelihood ratio, Wald, and Lagrange multiplier) are asymptotically equivalent. In large samples, not only do they have the same limiting distribution, but they will accept and reject together under $H_0$. This is not the case in finite samples, where one can reject when another does not. This might lead a cynical analyst to use one rather than the other by choosing the one that yields the results (s)he wants to obtain. Making an informed choice based on their finite-sample behavior is beyond the scope of this course.

In many cases, however, one of the tests is a much more natural choice than the others. Recall that the Wald test only requires the unrestricted estimates, while the Lagrange multiplier test only requires the restricted estimates. In some cases the unrestricted estimates are much easier to obtain than the restricted, and in other cases the reverse is true. In the first case we might be inclined to use the Wald test, while in the latter we would prefer the Lagrange multiplier.

Another issue is the possible sensitivity of the test results to how the restrictions are written. For example, $\theta_1 + \theta_2 = 0$ can also be written $-\theta_2/\theta_1 = 1$. The Wald test, in particular, is sensitive to how the restriction is written. This is yet another situation where a cynical analyst might be tempted to choose the "normalization" of the restriction to force the desired result. The Lagrange multiplier test, as presented above, is also sensitive to the normalization of the restriction but can be modified to avoid this difficulty. The likelihood ratio test, however, will be unaffected by the choice of how to write the restriction.

5.A Appendix

We will now establish that the conditions for Theorem 4.5 are met under the maximum likelihood Assumptions (M1)-(M5), with one addition. We need to explicitly add the condition (C1) that the parameter space is compact, which, for the scalar case and the space being the real line, means that the space is closed and bounded. As was demonstrated in the simple example, such an assumption may be implicit in bounding the third derivatives.

Again, for simplicity, we restrict attention to the scalar parameter case. Define
$$\psi_n(\theta) = \frac{1}{n}\mathcal{L}(\theta|y_1, y_2, \ldots, y_n) = \frac{1}{n}\sum_{i=1}^{n}\ln f(y_i|\theta)$$
and
$$\psi_0(\theta) = \mathrm{E}[\ln f(y_i|\theta)];$$
then the condition (C4) follows from Assumption (M4) by the arguments given regarding the consistency of the global maximum given above.

This leaves us with the continuity (C3) and uniform convergence in probability (C2) to establish. In proving these conditions are met, we will utilize the following uniform law of large numbers from Newey and McFadden (N-M):

Lemma 2.4 (Newey and McFadden). If the data $z_i$ are i.i.d., $a(z_i,\theta)$ is continuous at each $\theta \in \Theta$ with probability one, and there is $d(z)$ with $\|a(z_i,\theta)\| \le d(z)$ for all $\theta \in \Theta$ and $\mathrm{E}[d(z)] < \infty$, then $\mathrm{E}[a(z,\theta)]$ is continuous and $\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{i=1}^{n} a(z_i,\theta) - \mathrm{E}[a(z,\theta)]\right\| \xrightarrow{p} 0$.


Assumption (M2) allows us to expand in a Taylor series to obtain
$$\begin{aligned}
\psi_n(\theta) &= \frac{1}{n}\sum_{i=1}^{n}\ln f(y_i|\theta_0) + \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta}(\theta - \theta_0) + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta^2}(\theta - \theta_0)^2 \\
&\quad + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{6}\frac{\partial^3\ln f(y_i|\theta^*)}{\partial\theta^3}(\theta - \theta_0)^3
\end{aligned}$$

where $\theta^*$ lies between $\theta$ and $\theta_0$. As was shown in the preliminaries of this chapter, expectations of the first two terms being averaged exist under (M2) and (M3). And, by (M5), the third is bounded by a function with finite expectation, so its expectation exists for all $\theta^*$. Thus taking expectations of $\psi_n(\theta)$ and subtracting yields

$$\begin{aligned}
\psi_n(\theta) - \psi_0(\theta) &= \frac{1}{n}\sum_{i=1}^{n}\left\{\ln f(y_i|\theta_0) - \mathrm{E}\left[\ln f(y_i|\theta_0)\right]\right\} \\
&\quad + \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta} - \mathrm{E}\left[\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta}\right]\right\}(\theta - \theta_0) \\
&\quad + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\left\{\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta^2} - \mathrm{E}\left[\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta^2}\right]\right\}(\theta - \theta_0)^2 \\
&\quad + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{6}\left\{\frac{\partial^3\ln f(y_i|\theta^*)}{\partial\theta^3} - \mathrm{E}\left[\frac{\partial^3\ln f(y_i|\theta^*)}{\partial\theta^3}\right]\right\}(\theta - \theta_0)^3.
\end{aligned}$$

Applying the sup operator to this for $\theta \in \Theta$ yields
$$\begin{aligned}
\sup_{\theta\in\Theta}\|\psi_n(\theta) - \psi_0(\theta)\| &\le \sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^{n}\left\{\ln f(y_i|\theta_0) - \mathrm{E}\left[\ln f(y_i|\theta_0)\right]\right\}\right\| \\
&\quad + \sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta} - \mathrm{E}\left[\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta}\right]\right\}(\theta - \theta_0)\right\| \\
&\quad + \frac{1}{2}\sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta^2} - \mathrm{E}\left[\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta^2}\right]\right\}(\theta - \theta_0)^2\right\| \\
&\quad + \frac{1}{6}\sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\partial^3\ln f(y_i|\theta^*)}{\partial\theta^3} - \mathrm{E}\left[\frac{\partial^3\ln f(y_i|\theta^*)}{\partial\theta^3}\right]\right\}(\theta - \theta_0)^3\right\|.
\end{aligned}$$

Now the average in the first line is a function only of $y_i$ and converges in probability to zero. Likewise, the averages in the second and third lines, plus the fact that $(\theta - \theta_0)$ is bounded, mean each line converges in probability to zero. For the last line, we define $a(z,\theta) = \frac{\partial^3\ln f(y|\theta)}{\partial\theta^3}$, which by Assumption (M5) is bounded in absolute value by $H(y)$ with finite expectation. Applying the Lemma to this result, together with $(\theta - \theta_0)$ being bounded, means the last line will converge in probability to zero. But this means that $\sup_{\theta\in\Theta}\|\psi_n(\theta) - \psi_0(\theta)\| \xrightarrow{p} 0$ and, moreover, $\psi_0(\theta)$ is continuous, which means (C2) and (C3) are satisfied.


A similar approach can be used to verify that Assumptions (M1)-(M5) satisfy the conditions of Theorem 4.6.

Using Lemma 2.4, it is possible to obtain a consistency result for maximum likelihood that does not require the existence of derivatives.

Theorem 5.1. Suppose (i) $y_i$ ($i = 1, 2, \ldots, n$) are i.i.d. with p.d.f. $f(y_i|\theta)$, (ii) $f(y_i|\theta) \neq f(y_i|\theta_0)$ for $\theta \neq \theta_0$, (iii) $\theta \in \Theta$, which is compact, (iv) $\ln f(y_i|\theta)$ is continuous at each $\theta \in \Theta$ with probability one, and (v) $\mathrm{E}[\sup_{\theta\in\Theta}|\ln f(y_i|\theta)|] < \infty$; then $\hat\theta \xrightarrow{p} \theta_0$. □

Proof: We proceed by verifying the conditions of Theorem 4.5. (C1) is assured by (iii). (C4) is guaranteed by (ii) and the arguments yielding (5.23). Define $a(z_i,\theta) = \ln f(y_i|\theta)$ and $\psi_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ln f(y_i|\theta)$; then (iv) and (v) allow us to apply Lemma 2.4 of N-M to show (C2) and (C3) are satisfied. □

We now turn to sufficient conditions for interchanging differentiation and

integration of a function $g(y,\theta)$. These conditions are given as Theorem 1.3.2 in Amemiya (1985). Suppose (i) $\frac{\partial}{\partial\theta}g(y,\theta)$ is continuous in $\theta \in N$ and $y$, where $N$ is an open set, (ii) $\int g(y,\theta)\,dy$ exists, and (iii) $\int\left\|\frac{\partial}{\partial\theta}g(y,\theta)\right\|dy < M < \infty$ for $\theta \in N$; then $\frac{\partial}{\partial\theta}\int g(y,\theta)\,dy = \int\frac{\partial}{\partial\theta}g(y,\theta)\,dy$ for $\theta \in N$. This means the integrated norm of $\frac{\partial}{\partial\theta}g(y,\theta)$ is bounded by a constant. Verification of (M3) in the chapter requires verification of these conditions for $f(y,\theta)$ and is a similar exercise to verifying (M5). Note that these conditions are on $f(y,\theta)$ rather than $\ln f(y,\theta)$.

With these sufficient conditions we can adapt the asymptotic normality results of Theorem 4.6 to the specific case of maximum likelihood and obtain a result with more primitive assumptions. In the following, $\mathrm{E}[\cdot]$ indicates integration weighted by the true density $f(y,\theta_0)$ and $\mathrm{E}_\theta[\cdot]$ indicates integration with respect to $f(y,\theta)$ with $\theta$ allowed to vary as an argument.

Theorem 5.2. Suppose (i) $y_i$ i.i.d. $f(y,\theta)$, (ii) $\hat\theta \xrightarrow{p} \theta_0$, $\theta_0$ interior to $\Theta$, (iii) $f(y,\theta)$ twice continuously differentiable and $f(y,\theta) > 0$ in a neighborhood $N$ of $\theta_0$, (iv) $\sup_{\theta\in N}\int\left\|\frac{\partial f(y,\theta)}{\partial\theta}\right\|dy = \sup_{\theta\in N}\mathrm{E}_\theta\left[\left\|\frac{\partial\ln f(y,\theta)}{\partial\theta}\right\|\right] < \infty$ and $\sup_{\theta\in N}\int\left\|\frac{\partial^2 f(y,\theta)}{\partial\theta\,\partial\theta'}\right\|dy = \sup_{\theta\in N}\mathrm{E}_\theta\left[\left\|\frac{\partial^2\ln f(y,\theta)}{\partial\theta\,\partial\theta'} + \frac{\partial\ln f(y,\theta)}{\partial\theta}\frac{\partial\ln f(y,\theta)}{\partial\theta'}\right\|\right] < \infty$, (v) $\vartheta = \mathrm{E}\left[\frac{\partial\ln f(y,\theta_0)}{\partial\theta}\frac{\partial\ln f(y,\theta_0)}{\partial\theta'}\right]$ exists and is nonsingular, and (vi) $\mathrm{E}\left[\sup_{\theta\in N}\left\|\frac{\partial^2\ln f(y,\theta)}{\partial\theta\,\partial\theta'}\right\|\right] < \infty$; then $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \vartheta^{-1})$. □

Proof: We proceed by verifying the conditions of Theorem 4.6. Define $\psi_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ln f(y_i|\theta)$; then (N1) and (N2) are assured by (ii) and (iii). Now (iv) assures that the zero-mean score and information matrix equality results apply, so we have $\mathrm{E}\left[\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta}\right] = 0$ and $\mathrm{E}\left[\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta}\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta'}\right] = \vartheta = -\mathrm{E}\left[\frac{\partial^2\ln f(y_i|\theta_0)}{\partial\theta\,\partial\theta'}\right]$. Define $a(z_i,\theta) = \frac{\partial^2\ln f(y,\theta)}{\partial\theta\,\partial\theta'}$; then by (i), (iii), and (vi) we can apply Lemma 2.4 to find that $\Psi(\theta) = \mathrm{E}\left[\frac{\partial^2\ln f(y,\theta)}{\partial\theta\,\partial\theta'}\right]$ is continuous and $\sup_{\theta\in N}\left\|\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2\ln f(y_i,\theta)}{\partial\theta\,\partial\theta'} - \Psi(\theta)\right\| \xrightarrow{p} 0$, so (N3) is satisfied. But $\Psi = \Psi(\theta_0) = -\vartheta$ by the information matrix equality and is nonsingular by (v), so (N4) is met. And by (i) we can apply Lindeberg-Levy

to find $\sqrt{n}\,\frac{\partial\psi_n(\theta_0)}{\partial\theta} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial\ln f(y_i|\theta_0)}{\partial\theta} \xrightarrow{d} N(0, \vartheta)$, so (N5) is satisfied with $V = \vartheta$. Since the conditions of Theorem 4.6 are met and $\Psi^{-1}V\Psi^{-1} = \vartheta^{-1}$, we have the final result. □