
Page 1: Monte Carlo Simulation and Resampling

Monte Carlo Simulation and Resampling

Tom Carsey (Instructor)
Jeff Harden (TA)

ICPSR Summer Course
Summer, 2011

Page 2: Monte Carlo Simulation and Resampling

Introductions and Overview

What do I plan for this course?
What do you want from this course?
What are the expectations for everyone involved?
Overview of syllabus

Page 3: Monte Carlo Simulation and Resampling

What is the Objective?

The fundamental objective of scientific research is inference. By that I mean we want to use the data we observe to draw broader conclusions about a process we care about that extend beyond our data.

We have a sample of data we can study, but the goal is to learn about the population from which it came.

Monte Carlo simulations and resampling methods help us meet these objectives.

Page 4: Monte Carlo Simulation and Resampling

Why Use Simulations?

Analysis where observable data are not available
Mimic the repeated sampling framework of classical frequentist statistics
Provide solutions where analytic solutions are not available or are intractable
Testing hypothetical processes
Robustness checks
Mimic an experimental lab

Page 5: Monte Carlo Simulation and Resampling

What is a Monte Carlo Simulation?

A computer simulation that generates a large number of simulated samples of data based on an assumed Data Generating Process (DGP) that characterizes the population from which the simulated samples are drawn.

Patterns in those simulated samples are then summarized and described.

Such patterns can be evaluated in terms of substantive theory or in terms of the statistical properties of some estimator.

Page 6: Monte Carlo Simulation and Resampling

What is a DGP?

A DGP describes how the values of a variable of interest are produced in the population.

Most DGPs of interest include a systematic component and a stochastic component.

We use statistical analysis to infer characteristics of the DGP by analyzing observable data sampled from the population.

In applied statistical work, we never know the DGP; if we did, we wouldn't need statistical estimates of it.

In Monte Carlo simulations, we do know the DGP because we create it.

Page 7: Monte Carlo Simulation and Resampling

What is Resampling?

Like Monte Carlo simulations, resampling methods use a computer to generate a large number of simulated samples of data.

Also like Monte Carlo simulations, patterns in these simulated samples are then summarized, and the results used to evaluate substantive theory or statistical estimators.

What is different is that the simulated samples are generated by drawing new samples (with replacement) from the sample of data you have.

In resampling methods, the researcher DOES NOT know or control the DGP, but the goal of learning about the DGP remains the same.

Page 8: Monte Carlo Simulation and Resampling

Simulations as Experiments

Experiments rest on control of the research environment.

Control is achieved by balanced (often randomized) assignment of observations to groups.

Then, all members of all groups are treated equally except for one factor.

If differences emerge between groups, causality is attributed to that factor, which is generally called the treatment effect.

Examples in applied research

Page 9: Monte Carlo Simulation and Resampling

Simulations as Experiments (2)

Computer simulations follow the same logic.

The computer is the "lab" and the researcher controls how simulated samples are generated.

One factor is varied across groups of simulated samples, and any differences that appear are attributed to that factor (again, generally called the treatment effect).

Page 10: Monte Carlo Simulation and Resampling

Simulations as Experiments (3)

The power of experiments rests in their control of the environment and the resulting claims of causality.

Of course, finding out that the treatment causes some response does not necessarily explain why that response emerges.

The limitations of experimental work include:
They can quickly become very complex.
Results may not generalize well to the (necessarily more complex) real world outside of the lab.

Page 11: Monte Carlo Simulation and Resampling

Populations and Samples

The distinction between the population DGP and the sample(s) of data we generate or have available to us is critical.

If the goal is inference (descriptive or causal), then we are attempting to make statements about the population based on some sample data.

The fundamental difference between Monte Carlo simulation and resampling is that we create/control the population DGP in Monte Carlo simulations, but not in resampling.

Both methods allow us to evaluate theoretical and/or statistical assumptions.

Both methods offer opportunities to relax or eliminate some statistical assumptions.

Page 12: Monte Carlo Simulation and Resampling

Monte Carlo Simulation of OLS

Ordinary Least Squares (OLS) regression assumes some dependent variable (often labeled Y) is a linear function of some set of independent variables (often labeled as X's), plus some stochastic (random) component (often labeled as ε).

A set of parameters describes the relationship between the X's and Y. They are often represented as β's.

The model might be represented like this:

Y_i = β_0 + β_1 X_1i + β_2 X_2i + ... + ε_i   (1)

Or like this in matrix notation:

Y = Xβ + ε   (2)

Page 13: Monte Carlo Simulation and Resampling

Figure: Component Parts of a Simple Regression. A plot of the independent variable X (0 to 5) against the dependent variable Y, with the fitted line y_i = β_0 + β_1 x_i, the intercept β_0, the slope β_1, and one residual (ε_4) marked.

Page 14: Monte Carlo Simulation and Resampling

Monte Carlo Simulation of OLS (2)

Next we need to specify more about the stochastic component of the model.

In OLS, we generally assume that the residual follows a normal distribution with a mean of zero and a constant variance. This can be expressed as:

ε_i ~ f_N(e_i | 0, σ²)   (3)

where σ² represents a constant variance.

We have now specified the systematic and the stochastic components of Y.

Page 15: Monte Carlo Simulation and Resampling

Monte Carlo Simulation of OLS (3)

We can rewrite these two components as follows:

Y ~ f_N(y_i | μ_i, σ²)   (4)
μ_i = Xβ   (5)

This set-up models the randomness in Y directly, and makes clear that the conditional mean of Y is captured by Xβ.

The value of this set-up is that it can be generalized, like this:

Y ~ f(y | θ, σ²)   (6)
θ = g(X, β)   (7)

This makes clear that the functions f and g must be clearly specified as part of the DGP for Y.

Monte Carlo simulations focus on all the nitty-gritty of specifying these functions.

Page 16: Monte Carlo Simulation and Resampling

Know Your Assumptions

To simulate a DGP with the goal of evaluating a statistical estimator, you need to know the assumptions of that estimator.

For OLS, the key ones are:
Independent variables are fixed in repeated samples
The model's residuals are independently and identically distributed (iid)
The residuals are distributed normally
No perfect collinearity among the independent variables

These assumptions must be properly incorporated into the simulation, but then can be examined one by one through repeating the simulation.

Page 17: Monte Carlo Simulation and Resampling

"Fixed in Repeated Samples – Really?"

In experimental analysis, this assumption is plausible. Researchers often fix the exact values of the treatment variable.

In observational analysis (like most of social science), it is not. The X's are random variables just like Y. Thus, there is some DGP out there for the X's as well.

The key element of this assumption boils down to assuming that the X's are uncorrelated with the residual (ε) from the regression model. In short, the DGP for the X's must be uncorrelated with the DGP for the residuals.

We'll see how measurement error in X messes this up.

Page 18: Monte Carlo Simulation and Resampling

Simulating OLS in R

set.seed(123456)         # Set the seed for reproducible results
sims <- 500              # Set the number of simulations at the top of the script
alpha.1 <- numeric(sims) # Empty vector for storing the simulated intercepts
B.1 <- numeric(sims)     # Empty vector for storing the simulated slopes
a <- .2                  # True value for the intercept
b <- .5                  # True value for the slope
n <- 1000                # Sample size
X <- runif(n, -1, 1)     # Create a sample of n observations on the variable X.
                         # Note that this variable is outside the loop, because X
                         # should be fixed in repeated samples.

for(i in 1:sims){                 # Start the loop
  Y <- a + b*X + rnorm(n, 0, 1)   # The true DGP, with N(0, 1) error
  model <- lm(Y ~ X)              # Estimate OLS model
  alpha.1[i] <- model$coef[1]     # Put the estimate for the intercept
                                  # in the vector alpha.1
  B.1[i] <- model$coef[2]         # Put the estimate for X in the vector B.1
}                                 # End loop

Page 19: Monte Carlo Simulation and Resampling

Figure: Simulated Distribution of Intercept and Simulated Distribution of Slope (density plots of the estimated values of the parameters across the 500 simulated samples).

Page 20: Monte Carlo Simulation and Resampling

What Did We Learn?

We see that the estimated intercepts and slopes vary from one simulated sample to the next.

We see that they tend to be centered very near the true values we specified in the DGP.

We see that their distributions are at least bell-shaped, if not perfectly normal.

We can learn a lot more, however, if we manipulate features of the DGP, re-run the simulation, and then observe what, if anything, changes.

I'll leave the nuts and bolts to lab, but let's look at one example.

Page 21: Monte Carlo Simulation and Resampling

Multicollinearity in OLS

What is multicollinearity?
What does it do to OLS results?
Let's investigate this with a simulation.

Page 22: Monte Carlo Simulation and Resampling

Multicollinearity Simulation

Model with 2 independent variables, correlated at .1, .5, .9, and -.9
Each sample size is 1,000
I draw 1,000 simulated samples at each level of correlation
True values are β_0 = 0, β_1 = .5, and β_2 = .5

Here is what I get.
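The slides do not show the code for this simulation; the following is a minimal sketch of one way to set it up, assuming the two predictors are drawn jointly normal with the stated correlation (MASS::mvrnorm is my choice of tool, and the variable names are illustrative, not from the original script).

set.seed(3759)
library(MASS)                       # for mvrnorm()

sims <- 1000                        # simulated samples per correlation level
n <- 1000                           # observations per sample
b0 <- 0; b1 <- .5; b2 <- .5         # true parameter values
rhos <- c(.1, .5, .9, -.9)          # correlations between X1 and X2

# Array to hold the estimates: sims rows, 3 coefficients, one slice per rho
results <- array(NA, dim = c(sims, 3, length(rhos)))

for(r in seq_along(rhos)){
  # Draw X1 and X2 jointly normal with correlation rhos[r]; fixed across samples
  Sigma <- matrix(c(1, rhos[r], rhos[r], 1), nrow = 2)
  X <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
  for(i in 1:sims){
    Y <- b0 + b1*X[, 1] + b2*X[, 2] + rnorm(n, 0, 1)
    results[i, , r] <- coef(lm(Y ~ X[, 1] + X[, 2]))
  }
}

# Compare the spread of the B1 estimates across correlation levels
apply(results[, 2, ], 2, sd)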

Page 23: Monte Carlo Simulation and Resampling

Figure: Density estimates of B0, B1, and B2 by level of multicollinearity (legend: Low Corr, Medium Corr, High Corr; x-axes: estimated values of each parameter; y-axes: density).

Page 24: Monte Carlo Simulation and Resampling

Figure: Scatterplots of the estimate of Beta 1 against the estimate of Beta 2 at population correlations of 0, 0.5, and 0.9, plus a plot of the estimate of Beta 1 against the difference in the correlation of X1 with Y compared to X2 with Y, at population correlation 0.9.

Page 25: Monte Carlo Simulation and Resampling

Randomness and Probability

Making inference requires the use of probability and probability distributions.

We draw a sample, but we want to speak about the larger population.

We can make those statements if we have a sense of the probability of drawing the sample that we have.

The key element to drawing a useful sample is randomness.

Page 26: Monte Carlo Simulation and Resampling

Randomness

For a sample to be random, every element in the larger population must have had a fair or equal chance of being selected.

If the sample is large enough, it will include mostly "typical" cases. It will also include some "odd" cases.

When the sample is large, it will have enough cases that are odd in different ways to cancel out, and enough typical cases to outweigh the few odd ones.

Thus, large random samples give us a great deal of statistical power.

Page 27: Monte Carlo Simulation and Resampling

Probability Model

To make inferences, we need to develop a probability model for the data.

We need to generate a belief about the probability that the population of possible cases would produce a sample of observations that looks like the one we have.

A probability model for a single variable describes the range of possible values that variable could have and the probability, or likelihood, of the various possible values occurring in a random sample.

Note that OLS (logit, probit, or really any single-equation model) is really a probability model about a single variable, Y. It's just a probability that is conditional on some X's.

Page 28: Monte Carlo Simulation and Resampling

Probability Model (cont.)

A random variable represents a random draw from the population. Each data point is a particular realization of a random variable.

That means that its value for the variable under consideration is just one observed value from the whole range of possible values that could have been observed.

The range of all possible values for a random variable is called the distribution of that variable.

The shape of that distribution describes how likely we are to observe particular values of a random variable if we were to draw one out by chance. This is the probability distribution for the random variable in question.

Page 29: Monte Carlo Simulation and Resampling

Drawing a Random Sample

Each observation is just one of many we could have selected. Each entire sample is just one of many we could have selected.

We could learn a lot about the probability distribution that describes the population if we could draw lots and lots of samples.

In observational work, we usually only have one sample in our hands, which is why we end up making some assumptions about the probability distribution that describes the larger population from which our one sample was taken.

But in simulations, we can generate lots and lots of samples.

Page 30: Monte Carlo Simulation and Resampling

What is Probability?

Probability involves the study of randomness and uncertainty.

At a fundamental level, a probability is a number between 0 and 1 that describes how likely something is to occur.

Another way to think about it is how frequently an outcome will result if an action, or trial, is repeated many times. This so-called "Frequentist" view of probability lies at the heart of classical statistical theory and the notion of repeated samples.

The idea of an "expected" outcome is how we test hypotheses. We compare what we observe to what we expected given some set of assumptions, and we try to decide how likely it was to observe what we observed.

Page 31: Monte Carlo Simulation and Resampling

Example: Flipping a Coin

Suppose I have a coin and I toss it in the air. What is the probability that it will come up "Heads"?

We can make an assumption about the coin being fair and assert, based on that assumption, that the probability is .5.

Or we could flip the coin a lot of times and see how frequently we get Heads. If the coin is fair, it should be about half the time.

The first approach defines a probability as a logical consequence based on assumptions.

The second approach relies on the law of large numbers to approach the true probability. The law of large numbers says that increasing the number of observations leads the observed average to converge toward the true average.

The coin toss example is shown on the next slide.

Page 32: Monte Carlo Simulation and Resampling

Figure: Cumulative proportion of coin flips that come up Heads, plotted against the number of trials (0 to 500); the proportion converges toward .5 as trials accumulate.
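A minimal R sketch of how a plot like the one above can be produced (my own illustration, not the original course code): simulate Bernoulli trials and track the running proportion of heads.

set.seed(42)
n.trials <- 500

# 1 = Heads, 0 = Tails, each with probability .5
flips <- rbinom(n.trials, size = 1, prob = .5)

# Running proportion of heads after each trial
running.prop <- cumsum(flips) / (1:n.trials)

plot(1:n.trials, running.prop, type = "l",
     xlab = "Number of Trials",
     ylab = "Cumulative Proportion of Success Outcomes")
abline(h = .5, lty = 2)   # the true probability the proportion converges to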

Page 33: Monte Carlo Simulation and Resampling

Randomness Again

Random does NOT mean haphazard or chaotic.

Any one observation or trial might be hard to predict.

However, random variables are systematic. They follow rules and they show stability in the long run.

Again, the way to think about this is that the range of possible values, and how likely each one is to occur, is described by a probability distribution. You either need to assume what that distribution is, find a way to uncover it, or find methods that are robust to various distributions.

Page 34: Monte Carlo Simulation and Resampling

Properties of Probabilities

All probabilities fall between 0 and 1. The probability of some event, E, happening, often written as P(E), satisfies 0 ≤ P(E) ≤ 1.

The sum of the probabilities of all possible outcomes must equal 1.

If E is a set of possible outcomes for an event, then P(E) will equal the sum of the probabilities of all of the events included in set E.

Finally, for any set of outcomes E, P(E) + P(not E) = 1. In other words, P(E) + (1 - P(E)) = 1.

Page 35: Monte Carlo Simulation and Resampling

Conditional Probability

So far, we have been dealing with independent events.

When the probability of one outcome changes depending on some other factor, then the probability is conditional on that other factor.

For example, the probability that a citizen might turn out to vote could depend upon whether that person lives in a place where the campaign is close and hotly contested.

The conditional probability of event E happening, given that event F has happened, is generally written like this: P(E|F).

Page 36: Monte Carlo Simulation and Resampling

Conditional Probability (cont.)

A conditional probability can be computed like this:

P(E|F) = P(E ∩ F) / P(F)

From this, we can say E is independent of F if and only if P(E|F) = P(E) (which also implies that P(F|E) = P(F)).

Another way to think about independence is that two events E and F are independent if:

P(E ∩ F) = P(E)P(F)

This second expression captures what is called the "multiplicative" rule regarding conditional probabilities.
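As a quick illustration (my own, not from the slides), one can check the multiplicative rule by simulation: for two independent dice, the empirical frequency of both events should approximate the product of the marginal frequencies.

set.seed(7)
n <- 100000

die1 <- sample(1:6, n, replace = TRUE)
die2 <- sample(1:6, n, replace = TRUE)

E <- die1 == 6        # event E: first die shows a 6
F <- die2 %% 2 == 0   # event F: second die is even

mean(E & F)           # empirical P(E and F), about 1/12 = .0833
mean(E) * mean(F)     # product of marginals, also about .0833
mean(E[F])            # empirical P(E|F), about 1/6; equals P(E) under independence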

Page 37: Monte Carlo Simulation and Resampling

Probability Distributions

A probability distribution describes the range of possible values and the probability of observing those values in a random draw (with replacement).

PDF: Probability Distribution Function (discrete) or Probability Density Function (continuous)
CDF: Cumulative Distribution Function

The total area under a PDF sums to 1.

The CDF records the accumulated probability as it approaches 1.

Page 38: Monte Carlo Simulation and Resampling

PDF of the Normal

Figure: PDF of a random variable, X, distributed normally with mean = 0 and sd = 1 (X from -3 to 3; the density peaks near 0.4 at the mean).

Page 39: Monte Carlo Simulation and Resampling

CDF of the Normal

Figure: CDF of a random variable, X, distributed normally with mean = 0 and sd = 1 (X from -3 to 3; the curve rises from 0 to 1).

Page 40: Monte Carlo Simulation and Resampling

PDFs and CDFs

The sum of the area under a PDF equals 1. The CDF shows this summation across the range of the variable.

Note the mass of the PDF centered around the mean. That is because the expected value of a random variable is its mean.

OLS is about estimating the probability that Y (the dependent variable) takes on some value conditional on the values of the X, or independent, variables. These conditional probability models are predicting the expected value of the dependent variable, which is the mean.

Page 41: Monte Carlo Simulation and Resampling

Random Variables

Two types of random variables: discrete and continuous. These are roughly similar to categorical and continuous.

Discrete random variables can only take on distinct, countable values:
"Heads" or "Tails"
"Strongly Agree", "Agree", "Disagree", "Strongly Disagree"
A count of objects or events: How many "Heads" or how many "Wars"?

Continuous random variables can take on any value on the real number line.

Page 42: Monte Carlo Simulation and Resampling

Discrete Example

Toss a coin three times and record the number of "Heads" (a count).

One possible sequence of how three tosses might come out is (H,T,T).

The set of all possible outcomes, then, is the set: (H,H,H), (H,H,T), (H,T,H), (H,T,T), (T,H,H), (T,H,T), (T,T,H), (T,T,T).

If X is the number of Heads, it can be 0, 1, 2, or 3. If the coin we are tossing is fair, then each of the eight possible outcomes is equally likely.

Thus, P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, and P(X = 3) = 1/8.

Page 43: Monte Carlo Simulation and Resampling

Discrete (cont.)

This describes the discrete probability distribution function of X.

Note that the individual probabilities of all of the possible events sum to 1.

The cumulative probability distribution function would be represented like this: P(X ≤ 0) = 1/8, P(X ≤ 1) = 4/8, P(X ≤ 2) = 7/8, and P(X ≤ 3) = 8/8.

You can use the sample() function in R to generate observations of a discrete random variable:

My.Sample <- sample(k, size = n, prob = p, replace = TRUE)

Page 44: Monte Carlo Simulation and Resampling

Example using sample()

Tossing 3 coins 800 times:

set.seed(23212)   # Allows results to be reproduced
n <- 800          # Sample size I want to draw
k <- c("0 Heads", "1 Head", "2 Heads", "3 Heads")   # Possible outcomes
p <- c(1, 3, 3, 1)/8   # Probability of getting 0, 1, 2, or 3 Heads
My.Sample <- sample(k, size = n, prob = p, replace = TRUE)
table(My.Sample)

My.Sample
0 Heads  1 Head 2 Heads 3 Heads
     94     312     293     101

Page 45: Monte Carlo Simulation and Resampling

Probability Distributions

Common discrete distributions include: Bernoulli, Binomial, Multinomial, Poisson, Negative Binomial.

Common continuous distributions include: Uniform, Normal, Chi-squared (or χ²), F, and Student's t.

The last three are sampling distributions that have a degrees of freedom parameter.

PDFs of discrete distributions are represented as spike plots, while PDFs of continuous distributions are represented as density plots.

Page 46: Monte Carlo Simulation and Resampling

Spike Plot of a Binomial

Figure: PDF of a Binomial random variable with n = 8 and p = .5, drawn as a spike plot.

Page 47: Monte Carlo Simulation and Resampling

Continuous PDFs

Continuous PDFs don't really describe the probability of getting any precise value, because the probability of getting any precise value is effectively 0.

Rather, they are used to describe the probability of getting a value that falls between an upper and lower bound.

We can consider one tail of the distribution, two tails, or all but the tails.

An example with the Normal is shown on the next slide; a sketch of the corresponding R computations follows it.

Page 48: Monte Carlo Simulation and Resampling

Areas Under a Normal PDF

Figure: Shaded areas under a normal distribution: the lower tail P(X ≤ -1.5); the upper tail 1 - P(X ≤ 1.5); and the two-tailed area P(X ≤ -1.5) + (1 - P(X ≤ 1.5)), each marked at X = -1.5 and X = 1.5.
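A minimal sketch (my own, assuming the standard normal from the figure) of how these tail areas are computed in R with pnorm():

# Lower tail: P(X <= -1.5)
pnorm(-1.5)                      # about 0.0668

# Upper tail: P(X > 1.5) = 1 - P(X <= 1.5)
1 - pnorm(1.5)                   # about 0.0668

# Both tails: P(X <= -1.5) + (1 - P(X <= 1.5))
pnorm(-1.5) + (1 - pnorm(1.5))   # about 0.1336

# Everything but the tails
pnorm(1.5) - pnorm(-1.5)         # about 0.8664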

Page 49: Monte Carlo Simulation and Resampling

Conclusions

Probability is about uncertainty and randomness.

Randomness does not mean haphazard.

Random variables follow probability distributions that can be defined with some assumptions or through frequentist repeated samples.

The expected value of a random variable is the mean of the distribution from which it was drawn.

Thus, we build probability and conditional probability models for data based on classical probability theory.

Page 50: Monte Carlo Simulation and Resampling

Generating Random Variables

R has many functions that generate random variables that follow many types of distributions:

runif()    # Random Uniform distribution
rnorm()    # Random Normal distribution
rt()       # Random Student's t distribution
rf()       # Random F distribution
rchisq()   # Random Chi-squared distribution
rbinom()   # Random Binomial distribution

If you type help(Distributions) in R, you will get a complete listing of those built into R. Many others are available in other packages.

Page 51: Monte Carlo Simulation and Resampling

Generating Random Variables (2)

However, you are not limited to only those distributions already programmed into R.

You can simulate a random draw from any PDF if you know the formula for the PDF (a sketch of one general approach is given below).

When thinking about a probability distribution, you need to consider the number of parameters that describe its location and shape. Key elements to consider include:
Mean
Variance
Range of valid values
Skewness (symmetry of the distribution)
Kurtosis ("peakedness" of the middle; "heaviness" of the tails)

When selecting a distribution function for generating a random variable, you have to make sure it is producing the type of variable you want.
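The slides don't name a specific method; a common one consistent with this point is inverse-transform sampling, sketched here (my own illustration) for the exponential distribution, whose CDF F(x) = 1 - e^(-λx) inverts to F⁻¹(u) = -log(1 - u)/λ:

set.seed(101)
n <- 10000
lambda <- 2

# Inverse-transform sampling: feed Uniform(0,1) draws through the inverse CDF
u <- runif(n)
x <- -log(1 - u) / lambda   # inverse CDF of the Exponential(lambda)

# Compare with R's built-in generator; both means should be near 1/lambda = .5
mean(x)
mean(rexp(n, rate = lambda))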

Page 52: Monte Carlo Simulation and Resampling

Examples of Random Variables

Suppose I want a vector of 10 values randomly and uniformly distributed between 0 and 1:

> Random10 <- runif(10)
> Random10
[1] 0.51932983 0.03848523 0.29820136 0.15254877 0.26798912
[6] 0.28751082 0.82063644 0.92177149 0.30496555 0.60416280

Now, what if I repeat the command? Will they be the same?

> Random10 <- runif(10)
> Random10
[1] 0.19536123 0.87823454 0.49350686 0.67970321 0.03955143
[6] 0.75172914 0.56054510 0.33262119 0.35444109 0.19775575

Why are they different? Well, they are random (100,000,001 numbers between 0 and 1, inclusive, out to 8 decimal places). Actually, they are pseudo-random numbers.

Page 53: Monte Carlo Simulation and Resampling

Pseudo-Random Number Generators

Pseudo-random number generators are actually complex computer formulas that generate long strings of numbers that behave as if they were random.

They insert a starting value, called a seed, into the formula, and then it cycles.

So, you can re-create a "random" sequence by starting with the same seed.

R picks a new seed when you start a new session. STATA picks the same seed at the start of every session.

Page 54: Monte Carlo Simulation and Resampling

Setting the Seed

You can set the seed in R using the set.seed() function:

> set.seed(682879)
> Random5 <- runif(5)
> Random5
[1] 0.3506136 0.9191146 0.6758455 0.9105095 0.8402629
> Random5 <- runif(5)
> Random5
[1] 0.32960079 0.70555853 0.23793750 0.68339820 0.10286161
> set.seed(682879)
> Random5 <- runif(5)
> Random5
[1] 0.3506136 0.9191146 0.6758455 0.9105095 0.8402629
> Random5 <- runif(5)
> Random5
[1] 0.32960079 0.70555853 0.23793750 0.68339820 0.10286161

Page 55: Monte Carlo Simulation and Resampling

Setting the Seed (2)

It is very important to know how your software sets the seed and whether it resets it automatically or not.

As noted, R resets its seed to something new every time you open the software, but STATA resets its seed to the same value every time the software is opened.

A website for a group called Random.org (http://www.random.org/) offers truly random numbers based on atmospheric noise, along with a discussion of them.

Page 56: Monte Carlo Simulation and Resampling

Example of a Random Normal Variable

Using the rnorm() function:

> set.seed(17450)
> Normal500 <- rnorm(500, mean = 5, sd = 2)
> Normal500
[1] 3.3420737 5.7393314 7.2544833 1.2088551 7.5385345 4.7587521
.
.
[493] 4.8450481 4.4310396 5.8419443 4.3937012 4.5475633 8.5981948
[499] 3.0382390 6.1508388

The first argument sets N. The second sets the mean, and the third sets the standard deviation.

We can check the mean and SD like this:

> mean(Normal500)
[1] 5.052378
> sd(Normal500)
[1] 1.940603

Page 57: Monte Carlo Simulation and Resampling

Conclusions

Simulations as experiments give researchers new leverage we don't have in observational analysis.

Interesting models have both systematic and stochastic components.

Getting the distribution of the stochastic component right is critical for inference and is a major focus of simulations.

Monte Carlo simulations let you define the population DGP. Resampling methods do not.

Page 58: Monte Carlo Simulation and Resampling

Properties of Statistical Estimators

There are three basic properties of statistical estimators that researchers might want to evaluate using Monte Carlo simulations:
Bias
Efficiency
Consistency

Bias is about getting the right answer on average.

Efficiency is about minimizing the variance around an estimate.

Consistency is about getting closer and closer to the right answer as your sample size increases.

Page 59: Monte Carlo Simulation and Resampling

Figure: Illustration of bias and inefficiency of parameter estimates (four panels: unbiased and inefficient, biased and inefficient, biased and efficient, unbiased and efficient).

Page 60: Monte Carlo Simulation and Resampling

Properties of Statistical Estimators (2)

In the OLS/GLM context, most tend to equate bias with the estimates of the β's and efficiency with the estimates of their standard errors.

At one level, this makes sense. We want to know if our point estimates are unbiased, and the standard errors measure their distribution (which we generally want to be small).

This is O.K. in some settings, but it is not exactly right. The parameters and their standard errors that are computed using sample data are both estimates of something. Either could be biased (e.g., systematically wrong) or inefficient (estimated with less precision than we'd like).

Monte Carlo simulation can be used to evaluate both.

Page 61: Monte Carlo Simulation and Resampling

Evaluating Bias

Again, bias is about systematically getting the wrong answer.

One way to measure it is absolute bias: abs(True Parameter - Simulated Parameter).

You can repeat the simulation multiple times and compute the mean of this difference, and also show its distribution.

Next, you might vary some feature of the simulation and show how changing that feature affects absolute bias.

An example follows; a sketch of the measurement-error simulation behind it appears after the figure.

Page 62: Monte Carlo Simulation and Resampling

Figure: Impact of measurement error on absolute bias in simple OLS (absolute bias plotted against measurement error variance, 0 to 1).
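A minimal sketch of a simulation along these lines (my own reconstruction from the description on the next slide: true β1 = .5, true X uniform on -1 to 1, observed X contaminated with normal measurement error):

set.seed(5150)
sims <- 1000
n <- 1000
b1 <- .5                        # true slope
err.var <- seq(0, 1, by = .1)   # measurement error variances to try
abs.bias <- numeric(length(err.var))

X.true <- runif(n, -1, 1)       # true X, fixed in repeated samples

for(j in seq_along(err.var)){
  b.hat <- numeric(sims)
  for(i in 1:sims){
    Y <- b1*X.true + rnorm(n, 0, 1)                   # DGP uses the true X
    X.obs <- X.true + rnorm(n, 0, sqrt(err.var[j]))   # we only observe noisy X
    b.hat[i] <- coef(lm(Y ~ X.obs))[2]
  }
  abs.bias[j] <- mean(abs(b1 - b.hat))   # mean absolute bias at this error level
}

plot(err.var, abs.bias, type = "b",
     xlab = "Measurement Error Variance", ylab = "Absolute Bias")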

Page 63: Monte Carlo Simulation and Resampling

Evaluating Bias (2)

In this example, the initial impact of measurement error appears to be small. It then grows more rapidly, but that growth rate appears to slow down.

In this example, true β1 = .5, true X ranges from -1 to 1, but observed X has random measurement error distributed normally with a mean of zero and a variance that grows to 1.

Correct interpretation of the previous figure requires knowing the scale of all of these bits of information. If true β1 equalled 27, then absolute bias that never exceeds .4 is not bad.

What about the ratio of variance in observed X due to true X versus measurement error? In this case, the maximum error variance of 1 results in a variance in observed X of about 1.3, while the variance in true X equals about .33.

Page 64: Monte Carlo Simulation and Resampling

Evaluating Bias (3)

Simulations for bias must consider the plausible ranges of values for X and for the factor that might cause bias.

One option would be to re-label each axis in the figure to express the relative bias and the proportion of variance in X due to measurement error.

Another thing to notice in the figure is how the distribution of absolute bias changes as the level of measurement error changes. The variance is lowest at very low and very high levels of measurement error. Why?

At low values of error, the parameter is consistently estimated near the true value. At high values of error, the parameter is consistently estimated to be near zero.

This is clearer if I double the maximum variance of the measurement error from 1 to 2.

Page 65: Monte Carlo Simulation and Resampling

Figure: Impact of measurement error on absolute bias in simple OLS, with the measurement error variance now extended from 0 to 2.

Page 66: Monte Carlo Simulation and Resampling

Efficient Estimates of Parameters

There might be several ways to estimate a parameter. How can we evaluate their efficiency?

In a case like multicollinearity, we can see that slopes are less efficiently estimated as multicollinearity increases by looking at standard error estimates.

However, it is not accurate to say, as a general rule, that the method with the smallest standard error is the most efficient. We can get standard errors that are wrongly estimated to be small.

Violations of OLS assumptions that have efficiency implications DO NOT always inflate standard error estimates.

It is better to look at the distribution of the simulated values of the parameter in question.

Page 67: Monte Carlo Simulation and Resampling

Efficient Estimation of the "Average"

Two very common methods of measuring the "average" or central tendency of a variable are the Mean and the Median.
The Mean is the sum of all values divided by N.
The Median is the middle value: the 50th percentile.

In a single-variable case with a symmetric distribution, both will provide unbiased estimates of the central tendency of the variable.

Which is more efficient?

Page 68: Monte Carlo Simulation and Resampling

A Simulation Study

set.seed(89498)
Sims <- 10000
N <- 100
Results <- matrix(NA, nrow = Sims, ncol = 2)

for(i in 1:Sims){
  Y <- runif(N)
  Results[i, 1] <- mean(Y)     # store the sample mean
  Results[i, 2] <- median(Y)   # store the sample median
}

Page 69: Monte Carlo Simulation and Resampling

Figure: Measures of central tendency from a Uniform(0,1) variable (density plots of the simulated Means and Medians).

Page 70: Monte Carlo Simulation and Resampling

Figure: Measures of central tendency from a standard Normal variable (density plots of the simulated Means and Medians).

Page 71: Monte Carlo Simulation and Resampling

Results of Simulation

We clearly see that in either case, the Mean is a more efficient estimator of central tendency than the Median.

They look more similar when the underlying distribution from which the sample is being drawn is normal rather than uniform, but that's also a function of scales, so be careful.

Notice that we used the distribution of the estimates themselves; we did not compute a standard error.

What we've done regarding bias and efficiency for parameter estimates could also be applied to estimates of standard errors (they too can be right or wrong, and they too can be widely dispersed or tightly clustered in repeated samples).

Epilogue: is the Mean always more efficient?

Page 72: Monte Carlo Simulation and Resampling

Figure: Measures of central tendency from a Laplace(0,1) variable (density plots of the simulated Medians and Means).

Page 73: Monte Carlo Simulation and Resampling

Performance of Standard Errors

A standard error is meant to serve as a measure of the uncertainty of a parameter estimate.

It can be thought of as an estimate of the standard deviation of all possible estimates of a given parameter based on equally sized samples randomly drawn from the same population.

We generally use standard errors for hypothesis testing and the construction of confidence intervals.

Still, any analytic computation of a standard error relies on some assumptions; if those assumptions are not met, the formula will not produce a proper estimate of the standard error.

If the standard error is wrong, our hypothesis tests and confidence intervals will be wrong.

Page 74: Monte Carlo Simulation and Resampling

What is a Confidence Interval?

Suppose we run a regression and see the following results:

           Coefficient   Standard Error
Constant   0.5           0.2
X1         1.3           0.4
X2         2.8           1.6

Assuming a large sample, a normal distribution, etc., we could compute a 95% confidence interval for the coefficient operating on X1 like this:
95% CI = 1.3 ± 1.96*0.4
95% CI = 0.516 to 2.084

I can do the same for the coefficient operating on X2:
95% CI = 2.8 ± 1.96*1.6
95% CI = -0.336 to 5.936

How would you interpret these results?

Page 75: Monte Carlo Simulation and Resampling

Confidence Intervals (cont.)

The 95% CI has the estimated parameter at its center, and extends ± 1.96 standard errors if we assume the coefficient estimates are normally distributed.

How to interpret this?

If I had a lot of samples drawn from the same population, 95% of the CIs I computed like this would contain the True value of the parameter.

In any one sample, the CI either does or does not include the True parameter; you can't make a probabilistic statement about it (e.g., you do not have a 95% chance that your CI includes the true value).

What it does suggest is a plausible range of values for the parameter.

Page 76: Monte Carlo Simulation and Resampling

Performance of Standard Errors (2)

Thus, one way to evaluate the performance of standard errors in a Monte Carlo simulation is to determine whether they meet their intended definition: in a large number of repeated samples, a CI set at XX% should include the true population parameter XX% of the time.

If it includes the True parameter more often than it should, the confidence interval is too large and you risk accepting a Null hypothesis when it is False.

If it includes the True parameter less often than it should, the confidence interval is too small and you risk rejecting a Null hypothesis when it is True.

Page 77: Monte Carlo Simulation and Resampling

Coverage Probabilities

In R, what you need to do is compute a confidence interval at a given level (let's say XX%) each time through the simulation (each of the 1,000 iterations).

At each iteration, check to see whether that confidence interval contains the True population parameter or not (and you set the Truth, so you know what it is).

Record a 1 when it does and a 0 when it does not.

The percentage of times you score a 1 equals the percentage of times that your confidence interval included the True value.

If this percentage is approximately equal to XX%, your standard error estimates are accurate. A sketch of this computation follows below.
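A minimal sketch of a coverage-probability check for the slope in a simple OLS simulation (my own illustration following the recipe above; the 95% level and the DGP values are arbitrary choices):

set.seed(2121)
sims <- 1000
n <- 200
b0 <- .2; b1 <- .5             # true parameter values
covered <- numeric(sims)       # 1 if the CI contains the truth, 0 if not

X <- runif(n, -1, 1)           # fixed in repeated samples

for(i in 1:sims){
  Y <- b0 + b1*X + rnorm(n, 0, 1)
  model <- lm(Y ~ X)
  est <- coef(model)[2]
  se <- sqrt(vcov(model)[2, 2])   # standard error of the slope
  lower <- est - 1.96*se
  upper <- est + 1.96*se
  covered[i] <- as.numeric(lower <= b1 & b1 <= upper)
}

mean(covered)   # should be close to .95 if the standard errors are accurate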

Page 78: Monte Carlo Simulation and Resampling

What About Type II Error?

Coverage probabilities describe the proportion of estimated confidence intervals that contain the true population parameter. An accurate 95% CI corresponds to a 5% probability of Type I error: rejecting a Null hypothesis that is True.

What about Type II error: the failure to reject a Null hypothesis that is false?

For Type I error, there is only one true parameter to compare to the CI that is computed. For Type II error, there are an infinite number of False Null hypotheses.

Pick a plausible one (say, one exactly 1.96 standard errors away from the True parameter), then compute the proportion of times your simulated CI includes that plausible False Null.

Page 79: Monte Carlo Simulation and Resampling

Choosing between Bias or Inefficiency?

Which should I worry about more, bias or inefficiency?

Classical Frequentists, Shrinkage Models, Bayesians

In any given sample, your parameter estimates might deviate from the Truth because they are biased or because there is variance in their estimation.

One way to approach this is to adopt a strategy that considers both factors: Mean Squared Error.

Page 80: Monte Carlo Simulation and Resampling

Mean Squared Error

Mean Squared Error (MSE) is exactly what it sounds like: you compute a series of errors or differences, you square each of those differences, and you compute the mean.

This is commonly reported for OLS models as the MSE of the regression, computed from the model residuals.

But this can be applied to anything, including parameter estimates.

In a Monte Carlo simulation, I can estimate lots of slope coefficients. Each time, I can compute the difference between the estimated value and the True value and then square that difference. The mean of those squared differences is the MSE.

Page 81: Monte Carlo Simulation and Resampling

Mean Squared Error (2)

If the MSE = 0, then the estimator always perfectly recovers the population parameter. Of course, that is not realistic.

If the estimator is unbiased, then the observed squared errors capture only sampling variance, which is our uncertainty about the parameter estimate.

If the estimator is biased, then the observed squared errors capture both this bias and sampling variance.

Specifically, MSE(θ̂) = Var(θ̂) + Bias(θ̂, θ)²

So MSE is a method of comparison that considers both bias and inefficiency in evaluating performance, where smaller MSE is better.
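A minimal sketch (my own) of computing MSE, and its variance/bias split, from a vector of simulated slope estimates such as B.1 from the earlier OLS simulation:

# Suppose b.true is the true slope and b.hat is a vector of simulated estimates
b.true <- .5
b.hat <- rnorm(1000, mean = .48, sd = .05)   # stand-in for simulation output

mse <- mean((b.hat - b.true)^2)              # mean squared error

# Decomposition: MSE = variance + squared bias
bias2 <- (mean(b.hat) - b.true)^2
variance <- mean((b.hat - mean(b.hat))^2)    # population-form variance
mse
bias2 + variance                             # matches mse exactly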

Page 82: Monte Carlo Simulation and Resampling

Limitations of MSE

It is a loss function that considers both bias and inefficiency, but just one specific loss function: a quadratic one.

The implied weighting of bias and inefficiency might not be the ratio you desire.

MSE is sensitive to outliers. Means are more sensitive to outliers than medians, and squaring differences also emphasizes large differences.

Alternatives include using the mean of absolute errors rather than squared errors, or using methods that rely on medians rather than means.

You will see examples in Lab.

Page 83: Monte Carlo Simulation and Resampling

Consistency

Consistency is about an estimator converging toward the true value as the sample size increases.

This assumption gets scant attention in OLS, but is fundamental to MLE, where the small-sample properties are unknown.

This raises a more general concern with the finite-sample properties of estimators compared to their asymptotic properties (e.g., Beck and Katz, 1995).

Simulations can be extremely valuable in revealing finite-sample properties.

This is the same as saying that an easy factor to vary in a Monte Carlo simulation is the size of each simulated sample that you draw.

Page 84: Monte Carlo Simulation and Resampling

Other Performance Evaluations

You can evaluate the performance of models on all sorts of other factors. These might include:
Explained variance
Within-sample predictive accuracy
Out-of-model forecasting

You can add a parsimony discount factor (or use things like AIC or BIC).

The burden is on the researcher to identify a characteristic that is appropriate and a way to measure performance on that characteristic.

The trick is to make sure your simulation is doing what you think it is doing.

Page 85: Monte Carlo Simulation and Resampling

Simulation Error

Simulation error can emerge from a number of places:

The most common is operator error: you make a mistake in your program OR in your logic.

You stumble across an oddity in the pseudo-random number generator. I generally run simulations several times starting from different seeds to guard against this.

Simulations themselves are probabilistic. You randomly draw some finite sample of data, and you randomly draw some finite number of those samples. Larger N at either stage can have implications for your study, though some might limit the idea of simulation error just to the number of samples you draw.

Page 86: Monte Carlo Simulation and Resampling

Other Reasons to do Simulations

Evaluate distributional assumptions of estimators
Evaluate the range of DGPs that might produce a variable
Evaluate the behavior of a statistic that has no or weak support from analytic theory
Assess the robustness of sample estimates to different distributional assumptions

Page 87: Monte Carlo Simulation and Resampling

Distributional Assumptions

Monte Carlo simulations are well suited to evaluating the distributional assumptions of models.

Since you control the DGP, you can vary the distributional assumption and observe if/how the results change.

The important question is often one of magnitude.

Examples

Page 88: Monte Carlo Simulation and Resampling

Normal Distribution in OLS

OLS assumes the residuals of the model are drawn from a normal distribution.

What if the distribution has high kurtosis? You can look at different distributions like the Laplace, or you can vary it using the Student's t at different degrees of freedom, or the Pearson Type VII.

What if the distribution is skewed? You can draw a vector of size N from a Chi-squared distribution, then standardize the values of that vector. The result will be a vector of random observations with a mean of zero and a variance of 1, but with a positive skew that shrinks as the degrees of freedom of the Chi-squared distribution grow. A sketch of this standardization follows below.
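A minimal sketch (my own, following the Chi-squared approach just described) of generating standardized, right-skewed errors for an OLS simulation. A Chi-squared(df) draw has mean df and variance 2*df, so subtracting df and dividing by sqrt(2*df) standardizes it:

set.seed(909)
n <- 10000
df <- 2

raw <- rchisq(n, df = df)
e <- (raw - df) / sqrt(2*df)   # standardized: mean 0, variance 1, positive skew

mean(e); var(e)                # close to 0 and 1
plot(density(e), main = "Standardized Chi-squared errors (df = 2)")

# These draws can replace rnorm(n, 0, 1) as the error term in the OLS DGP
# to study how skewed residuals affect the estimates.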

Page 90: Monte Carlo Simulation and Resampling

Figure: Simulated standardized Chi-squared distributions (density plots for DF = 1, 2, 5, and 20; the skew diminishes as DF increases).

Page 91: Monte Carlo Simulation and Resampling

Distributional Assumptions

Sometimes we have multiple estimators we could use to estimate a model, and they might vary in their distributional assumptions. How can we tell which one to use?

An example is Jeff's work on using OLS or Median Regression (MR) to estimate a linear model.

OLS estimates the conditional mean of Y and assumes the residuals are drawn from a normal distribution.

MR estimates the conditional median of Y and assumes the residuals are drawn from a Laplace distribution.

We'll save it for Lab.

Page 92: Monte Carlo Simulation and Resampling

The Beta Distribution

The Beta distribution is particularly useful if you want to explore a range of distributions over the 0-1 space.

The Beta distribution is governed by two parameters, often called a and b, or α and β.
A Beta(α = 1, β = 1) is the Uniform(0,1) distribution
A Beta(α < 1, β < 1) is U-shaped
A Beta(α < 1, β ≥ 1) is strictly decreasing
A Beta(α > 1, β > 1) is unimodal
A Beta where both α and β are positive, but with α < β, will have positive skew; α > β will have negative skew.

This makes it extremely flexible for exploring how estimators behave over different distributional shapes; a short sketch follows below.
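A minimal sketch (my own) of drawing from a few of these Beta shapes in R with rbeta() and comparing their densities:

set.seed(314)
n <- 10000

shapes <- list(uniform    = c(1, 1),     # flat
               u.shaped   = c(.5, .5),   # U-shaped
               decreasing = c(.5, 2),    # strictly decreasing
               pos.skew   = c(2, 6),     # unimodal, positive skew
               neg.skew   = c(6, 2))     # unimodal, negative skew

draws <- lapply(shapes, function(s) rbeta(n, shape1 = s[1], shape2 = s[2]))

# Overlay the estimated densities
plot(density(draws$pos.skew), xlim = c(0, 1), ylim = c(0, 3),
     main = "Beta distributions of different shapes")
for(d in draws) lines(density(d))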

Page 93: Monte Carlo Simulation and Resampling

Figure: Examples of different Beta distributions

Page 94: Monte Carlo Simulation and Resampling

Substantive Example using Beta

Mooney (1997, p. 72-77)

Lijphart and Crepaz (1991) score the United States as a -1.341 on a standardized scale of corporatism.

According to Lijphart and Crepaz (1991, p. 235), corporatism "... refers to an interest group system in which groups are organized into national, specialized, hierarchical and monopolistic peak organizations."

Since this is a standardized score, it should have a mean of zero and a standard deviation of 1.

The question is: is the level of corporatism in the U.S. significantly lower than average?

Page 95: Monte Carlo Simulation and Resampling

Example (2)

There are two problems:
We don't have a good theory about the proper probability distribution for the DGP.
Even if we did, they only measured 18 countries.

Thus, an analytic approach is unwise, because such approaches depend on strong theory and/or large samples (e.g., asymptotic properties).

Mooney shows that we can use a simulation to get a sense of how likely a score of -1.341 is, and what sorts of probability distributions for the DGP are likely or unlikely to produce such scores.

Page 96: Monte Carlo Simulation and Resampling

Example (3)

Mooney's simulation:
Defines a range of Beta distributions
Draws a very large sample from each distribution
Standardizes the sample (thus, a mean of zero like the original scale)
Records attributes of the sample (level of α, β, kurtosis, and skewness)
Computes the proportion of observations that fall below -1.341
Treats that proportion as the probability of Type 1 error (e.g., the level of statistical significance)

Page 97: Monte Carlo Simulation and Resampling

Example (4)

I ran this simulation ranging both α and β from 1 through 30.

NOTE: a Beta(1,1) is a uniform distribution; a Beta(30,30) is effectively a normal distribution. Those in between have various levels of skewness and kurtosis.

I drew samples of 10,000 from each distribution.

I plotted the resulting levels of Type 1 error as a function of these attributes of the various Beta distributions. A sketch of the core of this replication follows below.
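A minimal sketch (my own reconstruction of the steps described above; the grid and sample size follow the slides, the plotting choice is mine):

set.seed(1997)
n <- 10000
grid <- expand.grid(a = 1:30, b = 1:30)

type1 <- numeric(nrow(grid))
for(j in 1:nrow(grid)){
  x <- rbeta(n, grid$a[j], grid$b[j])
  z <- (x - mean(x)) / sd(x)     # standardize: mean 0, sd 1
  type1[j] <- mean(z < -1.341)   # proportion below the U.S. score
}

# Type 1 error as a function of alpha, averaged over beta
plot(1:30, tapply(type1, grid$a, mean), type = "b",
     xlab = "Level of A", ylab = "Probability of Type 1 Error")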

Page 98: Monte Carlo Simulation and Resampling

Figure: Mooney Replication, Slide 1 (probability of Type 1 error plotted against the level of A, 0 to 30).

Page 99: Monte Carlo Simulation and Resampling

Figure: Mooney Replication, Slide 2 (probability of Type 1 error plotted against the level of B, 0 to 30).

Page 100: Monte Carlo Simulation and Resampling

Figure: Mooney Replication, Slide 3 (a 3-D surface of the probability of Type 1 error over the levels of A and B).

Page 101: Monte Carlo Simulation and Resampling

Figure: Mooney Replication, Slide 4 (probability of Type 1 error plotted against the level of kurtosis).

Page 102: Monte Carlo Simulation and Resampling

Figure: Mooney Replication, Slide 5 (probability of Type 1 error plotted against the level of skewness).

Page 103: Monte Carlo Simulation and Resampling

Figure: Mooney Replication, Slide 6 (a 3-D surface of the probability of Type 1 error over the levels of kurtosis and skewness).

Page 104: Monte Carlo Simulation and Resampling

What Did We Learn?

Under a wide range of distributions, a score of -1.341 or lower occurs more than 5% of the time in large samples.

A score of -1.341 is rare when α is relatively low, but this result is not responsive to β.

More specifically, such scores are rarest when there is a positive skew to the DGP.

This makes sense, since a variable with a positive skew has a short tail on the left and a long tail on the right. If the mean is 0, then a short tail on the negative side makes -1.341 relatively rare.

Is the U.S. significantly below average? Only if there is a strong positive skew to the DGP.

Page 105: Monte Carlo Simulation and Resampling

Statistics with No Analytic Support

Some statistics have weak or no theoretical/analytic support at all, and many others have weak support in small samples. Mooney (1997) notes a few:
The ratio of two correlated regression coefficients (Bartels 1993)
Jackman's (1994) estimator of legislative vote-to-seat bias
The difference between two medians

You know enough now to imagine the approach:
Simulate data with plausible characteristics (e.g., define the systematic and stochastic components of the DGP)
Compute the statistic in question
Examine the simulated distribution of the statistic
Alter one feature at a time in your DGP, repeat the simulation, and observe any changes in the pattern describing your statistic

Page 106: Monte Carlo Simulation and Resampling

Your Results and Your Data

Another great use of simulations centers on evaluating the robustness of your findings from the analysis of your sample of data.

In this sort of study, your sample data and initial analysis provide the information you need to define the population DGP for the simulation study:
Use the actual values of the X's in your data, or generate X's that look like the observed X's in your sample.
Use your OLS estimates of the β's as the values of your population parameters.
Simulate the stochastic component of Y based on the observed residuals of your model.
Use all of this to generate simulated samples of Y.

Your simulation then re-runs your analysis using your X's and the simulated Y's. A sketch of this workflow follows below.
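A minimal sketch of this workflow (my own illustration; here the "real" data are themselves simulated so the example is self-contained, and the stochastic component is drawn as normal with the residual standard error):

set.seed(808)

# Stand-in for your real data and initial analysis
n <- 500
X <- rnorm(n)
Y <- 1 + .7*X + rnorm(n, 0, 2)
fit <- lm(Y ~ X)

b.hat <- coef(fit)               # treated as the population parameters
sigma.hat <- summary(fit)$sigma  # residual standard error

# Re-run the analysis on simulated Y's built from the fitted DGP
sims <- 1000
b1.sim <- numeric(sims)
for(i in 1:sims){
  Y.sim <- b.hat[1] + b.hat[2]*X + rnorm(n, 0, sigma.hat)
  b1.sim[i] <- coef(lm(Y.sim ~ X))[2]
}

mean(b1.sim)                     # does the simulation recover the "true" slope?
quantile(b1.sim, c(.025, .975))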

Page 107: Monte Carlo Simulation and Resampling

Your Results and Your Data (2)

You can evaluate the simulation in two basic ways:
Does your simulation recover the "True" parameters?
How similar are your simulated Y's to your actual observed Y's?

Of course, the real power comes when you begin to manipulate features of your simulation and then re-evaluate the performance of your statistical analysis as noted above.

Features you might vary include: N, attributes of X, attributes of the stochastic component of the model, etc.

This allows you to determine how sensitive your original results are to different assumptions or different features in the sample of data you have.

Page 108: Monte Carlo Simulation and Resampling

Other Kinds of Data

Everything thus far has involved continuous variables and continuous probability distributions. Of course, there are lots of variables (and associated probability distributions) that are not continuous in the population DGP, or are at least not observed as continuous. Common ones include:
Dichotomous variables
Ordered categorical variables
Unordered categorical variables
Count variables

And other kinds of data structures, including:
Clustered/multi-level data
Panel and Time Series Cross Section (TSCS) data

We'll let Jeff do that in Lab! (except ...)

Page 109: Monte Carlo Simulation and Resampling

Time Series Cross Section Data

Data that have observations for the same set of multiple units across multiple time periods (e.g., 50 states over 30 years).

There is tremendous debate on the proper way to analyze such data:
Political Analysis (2007, 15(2)) Special Issue
Political Analysis (2011, 19(2)) Symposium on Fixed-Effects Vector Decomposition

Simple question: Does a lagged value of Y capture unit fixed effects?

Page 110: Monte Carlo Simulation and Resampling

TSCS (2)

The TSCS model looks like this:

Y_it = X_it β + e_it   (8)

With a residual like this:

e_it = μ_i + α_t + ε_it   (9)

Unit effects (μ_i) capture the history of each unit.

They may be viewed as fixed or random (too much to sort out now).

But, does the previous value of Y also capture that history?

Page 111: Monte Carlo Simulation and Resampling

A Simulation Study

I simulated Y_it for 50 units over 50 years.

I set Y_it as a function of its own past value (parameter = 0.5).

I varied the degree of unit effect by adding a random unit effect drawn from a normal distribution with a zero mean and a variance that ranged from 0 to 4 in increments of 0.5.

I drew 1,000 samples at each level of unit effect.

I estimated two models: Y regressed just on its lagged value, and Y regressed on its lagged value plus a full set of unit fixed effects.

I plot the simulated parameters operating on the lagged value of Y for both models at each level of unit effect. A sketch of one iteration of this design follows below.
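A minimal sketch of one iteration of this design (my own reconstruction from the description above; the reshaping approach and variable names are mine):

set.seed(5050)
n.units <- 50
n.years <- 50
rho <- 0.5           # true coefficient on lagged Y
unit.sd <- sqrt(2)   # one of the unit-effect variances (here, 2.0)

# Generate one simulated panel with random unit effects
mu <- rnorm(n.units, 0, unit.sd)   # unit effects
Y <- matrix(0, n.units, n.years)
for(t in 2:n.years){
  Y[, t] <- rho*Y[, t-1] + mu + rnorm(n.units, 0, 1)
}

# Reshape to unit-year rows with a lagged Y (column-major stacking keeps
# the unit index aligned across y and y.lag)
dat <- data.frame(unit  = factor(rep(1:n.units, n.years - 1)),
                  y     = c(Y[, 2:n.years]),
                  y.lag = c(Y[, 1:(n.years - 1)]))

coef(lm(y ~ y.lag, data = dat))[2]          # no unit effects
coef(lm(y ~ y.lag + unit, data = dat))[2]   # with unit fixed effects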

Page 112: Monte Carlo Simulation and Resampling

Figure: Density plots of the estimated B1 (the coefficient on the lagged value of Y) for models with and without unit fixed effects, at unit-effect variances from 0.0 to 4.0 in increments of 0.5 (legend: No Unit Effects, Unit Effects).

Page 113: Monte Carlo Simulation and Resampling

Results of TSCS Simulation

Failure to account for fixed effects when they are present results in positive bias in the estimate of the coefficient operating on the lagged value of Y.

Not shown, but it is clear that controlling for fixed effects when they are present significantly improves the fit of the model. In other words, just including the lagged value of Y is NOT sufficient to capture unit effects.

There is some evidence that the coefficient operating on the lagged value of Y is biased slightly downward when a full set of fixed effects is included.

The variance of the parameter is larger when fixed effects are included (likely due to multicollinearity).

Page 114: Monte Carlo Simulation and Resampling

Wrapping Up Monte Carlos

Monte Carlo simulations as experiments.
Vast array of applications for substantive and methodological research.
Great teaching tool.
Can get very complex very quickly.
Be careful: make sure your simulation is doing what you think it is doing.