Econ 140

Univariate Populations

Lecture 3

Today’s Plan

• Univariate statistics - distribution of a single variable

• Making inferences about population parameters from sample statistics (for future reference: how can we relate the ‘a’ and ‘b’ parameters from last lecture to sample data?)

• Dealing with two types of probability

– ‘A priori’ classical probability

– Empirical classical probability

A Priori Classical Probability

• Characterized by a finite number of known outcomes

• The expected value of Y can be defined as

\[ E(Y) = \mu_Y = \sum_k Y_k \, p_k \]

• The expected value will always be the mean value

\(\mu_Y\) is the population mean

\(\bar{Y}\) is the sample mean

• The outcome of an experiment is a randomized trial

Flipping Coins

First coin   Second coin   Number of Heads
T            T             0
T            H             1
H            T             1
H            H             2

• Example: flipping 2 fair coins

– Possible outcomes are:

HH, TT, HT, TH

– We know there are only 4 possible outcomes

– We get discrete outcomes because there is a finite number of possible outcomes

– We can represent known outcomes in a matrix

Flipping Coins (2)

• The probability of some event A is

\[ \Pr(A) = \frac{m}{n} \]

– where m is the number of outcomes consistent with event A and n is the total number of possible outcomes.

– If A is the number of heads when flipping 2 coins, we can represent the probability distribution function like this:

Number of Heads   Probability Distribution Function (PDF)
0                 0.25 = 1/4
1                 0.50 = 1/2
2                 0.25 = 1/4
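The counting rule Pr(A) = m/n can be checked by brute-force enumeration. The sketch below is an illustration rather than part of the lecture materials: the function name `heads_pdf` is hypothetical, and it assumes fair coins, so that every sequence of heads and tails is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_pdf(n_coins):
    """Return {number of heads: probability} for n_coins fair coins.

    Assumes every sequence of H/T is equally likely (fair coins).
    """
    outcomes = list(product("HT", repeat=n_coins))  # e.g. ('H', 'T')
    n = len(outcomes)                               # total possible outcomes
    counts = Counter(outcome.count("H") for outcome in outcomes)
    # Pr(A) = m / n, where m is the number of outcomes consistent with A
    return {heads: Fraction(m, n) for heads, m in sorted(counts.items())}


pdf = heads_pdf(2)
for heads, prob in pdf.items():
    print(heads, prob, float(prob))
```

Using `Fraction` keeps the probabilities exact, so the check that they sum to 1 is not affected by floating-point rounding.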

Flipping Coins (3)

• If we graph the PDF we get

[Figure: Probability Distribution Function; horizontal axis: Number of Heads (-1 to 3), vertical axis: Probability (0.00 to 1.00)]

• The expected value is

\[ E(Y) = \mu_Y = \sum_k Y_k \, p_k = 0(0.25) + 1(0.5) + 2(0.25) = 1 \]
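As a quick check of the expected-value calculation for the two-coin example, here is a minimal sketch. The helper name `expected_value` is hypothetical; it simply sums each outcome times its probability and refuses a PDF whose probabilities do not sum to 1.

```python
def expected_value(pdf):
    """E(Y) = sum over k of Y_k * p_k, for a discrete PDF given as {Y_k: p_k}."""
    total_prob = sum(pdf.values())
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(y * p for y, p in pdf.items())


# PDF for the number of heads when flipping 2 fair coins
two_coin_pdf = {0: 0.25, 1: 0.50, 2: 0.25}
mean_heads = expected_value(two_coin_pdf)
print(mean_heads)
```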

Empirical Classical Probability

• Characterized by an infinite number of possible outcomes

• With empirical classical probability, we use sample data to make inferences about underlying population parameters

– Most of the time, we don’t know what the population values are, so we need to use a sample

• Example: GPAs in the Econ 140 population

– We can take a sample of every 5th person in the room

– Assuming that our sample is random (that Econ 140 students do not sit in some systematic fashion), we’ll have a representative sample of the population

• Statisticians/economists collect sample data for many other purposes

• The Current Population Survey (CPS) is another example: sampling occurs at the household level

• CPS uses weights to correct data for oversampling

– Over-sampling would be if we picked 1 in 3 in the front of the room and only 1 in 5 in the back of the room. In that case we would over-sample the front

– There’s a spreadsheet example on the course website (the weighted mean is our best guess of the population mean, whereas the unweighted mean is the sample mean)
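To make the front-of-room and back-of-room example concrete, the sketch below uses made-up head counts. It also assumes that each sampled person's weight is the inverse of the sampling rate, which is a common convention but is not spelled out in the lecture.

```python
# Hypothetical room: the head counts below are made up for illustration.
front_population, back_population = 30, 50

# Sampling rates from the example: 1 in 3 at the front, 1 in 5 at the back.
front_sampled = front_population // 3   # 10 people
back_sampled = back_population // 5     # 10 people

# Assumption: each sampled person's weight is the inverse of the sampling
# rate, i.e. the number of people in the room that person stands for.
front_weight, back_weight = 3, 5

# Unweighted, the front makes up half the sample ...
sample_share_front = front_sampled / (front_sampled + back_sampled)

# ... but weighting restores its true share of the room.
weighted_front = front_sampled * front_weight
weighted_back = back_sampled * back_weight
weighted_share_front = weighted_front / (weighted_front + weighted_back)

print(sample_share_front, weighted_share_front)
```

With these numbers the front is half of the sample but only 30 of the 80 people in the room, and the weights recover that smaller share.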

• On the course website you’ll find an Excel spreadsheet that we will use to calculate the following:

– Expected value

– PDF and CDF

– Weights to translate sample data into population estimates

– Examine the difference between the sample (unweighted) mean and the estimated population (weighted) mean:

Weighted mean = sum(EARNWKE*EARNWT)/sum(EARNWT)

• This is our estimate of the population mean
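The weighted-mean formula can be written as a small function. This is a sketch only: the rows below are made-up numbers standing in for the EARNWKE and EARNWT spreadsheet columns, not the actual course data, and the function name `weighted_mean` is hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean = sum(value * weight) / sum(weight)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up rows standing in for the spreadsheet columns (not the real data).
earnwke = [250.0, 400.0, 310.0, 520.0]   # weekly earnings
earnwt = [1200.0, 800.0, 1500.0, 500.0]  # sampling weights

sample_mean = sum(earnwke) / len(earnwke)             # unweighted
population_estimate = weighted_mean(earnwke, earnwt)  # weighted
print(sample_mean, population_estimate)
```

When all weights are equal, the weighted mean reduces to the ordinary sample mean.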

Empirical Classical Probability (3)

• So how do we construct a PDF for our spreadsheet example?

– Pick sensible earnings bands (e.g., 10 bands of $100 each)

– We can pick as many bands as we want: the greater the number of bands, the more closely the shape of the PDF matches that of the ‘true population’. More bands = more calculation!

• Constructing PDFs:

– Count the number of observations in each band to get an absolute frequency

– Use weights to translate sample frequencies into estimates of the population frequencies

– Calculate relative frequencies for each band by dividing the absolute frequency for the band by the total frequency
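The three steps above can be sketched as follows. The data are made up (they are not the course spreadsheet), the function name `banded_pdf` is hypothetical, and the sketch assumes non-negative earnings and places anything above the top band into the last band.

```python
def banded_pdf(earnings, weights, band_width=100.0, n_bands=10):
    """Return (unweighted, weighted) relative frequencies per earnings band.

    Band k covers [k * band_width, (k + 1) * band_width). Earnings are
    assumed non-negative; observations at or above the top of the last band
    are placed in the last band (an assumption of this sketch).
    """
    counts = [0.0] * n_bands        # absolute sample frequencies
    weight_sums = [0.0] * n_bands   # weighted (population) frequencies
    for y, w in zip(earnings, weights):
        k = min(int(y // band_width), n_bands - 1)
        counts[k] += 1
        weight_sums[k] += w
    n = sum(counts)
    total_weight = sum(weight_sums)
    unweighted = [c / n for c in counts]
    weighted = [ws / total_weight for ws in weight_sums]
    return unweighted, weighted


# Made-up observations (not the course spreadsheet data).
earnings = [80.0, 120.0, 180.0, 260.0, 310.0, 340.0, 450.0, 990.0]
weights = [2.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0, 1.0]

p_unweighted, p_weighted = banded_pdf(earnings, weights)
print(p_unweighted)
print(p_weighted)
```

Both lists should sum to 1, which is a useful sanity check on the banding.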

– A weighted way to approximate the PDF:

\[ \text{Avg weights} = \frac{\text{avg of weights within band}}{\text{avg of all weights}} \]

– When we have k bands, always check:

\[ \sum_k p_k = 1 \]

if the probabilities don’t sum to 1, we’ve made a mistake!

• Going back to our expected value…

The expected value of Y will be:

\[ E(Y) = \sum_k Y_k \, p_k \]

– The \(p_k\) are relative frequencies, and they can be weighted or not

– The \(Y_k\) are the earnings band midpoints (50, 150, 250, and so on in the spreadsheet)

• From our spreadsheet example our weighted mean was $316.63 and the unweighted mean was $317.04

– Since the sample is so large, there is little difference between the sample (unweighted) mean and the population (weighted) mean

• We can also calculate the weighted and unweighted expected values:

E(Weighted value): $326.85

E(Unweighted value): $327.31

• Why are the expected values different from the means?

– We lose some information (bands for the wage data) in calculating the expected values!

• So why would we want to weight the observations?

– With a small sample of what we think is a large population, we might not have sampled randomly. We use weights to make the sample more closely resemble the population.
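The information loss from banding can be seen in a small sketch with made-up earnings (not the course data). The sample mean uses every observation exactly, while the expected value replaces each observation with the midpoint of its band.

```python
# Made-up weekly earnings (not the course spreadsheet data).
earnings = [105.0, 110.0, 120.0, 215.0, 230.0, 305.0]
band_width = 100.0

# Sample mean uses every observation exactly.
sample_mean = sum(earnings) / len(earnings)

# Expected value uses only the band each observation falls in:
# E(Y) = sum_k Y_k * p_k, with Y_k the band midpoint.
band_counts = {}
for y in earnings:
    band = int(y // band_width)
    band_counts[band] = band_counts.get(band, 0) + 1

expected_value = sum(
    (band + 0.5) * band_width * count / len(earnings)
    for band, count in band_counts.items()
)

print(round(sample_mean, 2), round(expected_value, 2))
```

Here most observations sit near the bottom of their bands, so the midpoint-based expected value comes out above the sample mean. With other data the gap could go the other way.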

• The mean is the first moment of the distribution of earnings

• We may also want to consider how variable earnings are

– We can do this by finding the variance, or the standard deviation

• Calculate the variance

– In our example, the unweighted variance is:

\[ \sigma_Y^2 = \sum_k (Y_k - \mu_Y)^2 \, p_k = 30{,}353.78 \]

– The weighted variance is 29,730.34

– The difference between the two is 623.44
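A minimal sketch of the variance calculation from a banded PDF follows. The PDF below is made up for illustration (it does not reproduce the spreadsheet values), and the helper names are hypothetical.

```python
import math


def pdf_mean(pdf):
    """E(Y) = sum_k Y_k * p_k for a PDF given as {Y_k: p_k}."""
    return sum(y * p for y, p in pdf.items())


def pdf_variance(pdf):
    """Var(Y) = sum_k (Y_k - E(Y))^2 * p_k."""
    mu = pdf_mean(pdf)
    return sum((y - mu) ** 2 * p for y, p in pdf.items())


# Made-up PDF over band midpoints (not the course spreadsheet values).
earnings_pdf = {50: 0.1, 150: 0.2, 250: 0.4, 350: 0.2, 450: 0.1}

variance = pdf_variance(earnings_pdf)
std_dev = math.sqrt(variance)  # standard deviation = sqrt(variance)
print(round(pdf_mean(earnings_pdf), 2), round(variance, 2), round(std_dev, 2))
```

Passing weighted instead of unweighted relative frequencies as the `p_k` gives the weighted variance by the same formula.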

The weighted PDF is pink

It’s tough to see, but the weighting scheme makes the population distribution tighter

• We can use our PDF to answer:

– What is the probability that someone earns between $300 and $400?

• But we can’t use this PDF to answer:

– What is the probability that someone earns between $253 and $316?

• Why?

– The second question can’t be answered using our PDF because $253 and $316 fall somewhere within the earnings bands, not at the endpoints

Standard Normal Curve

• We need to calculate something other than our PDF, using the sample mean, the sample variance, and an assumption about the shape of the distribution function

• We will examine this assumption later

• The standard normal curve (also known as the Z table) will approximate the probability distribution of almost any continuous variable as the number of observations approaches infinity

Standard Normal Curve (2)

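• The standard deviation (measures the distance from the mean) is the square root of the variance:

\[ \sigma = \sqrt{\sigma^2} \]

[Figure: normal curve centered at \(\mu_y\), with 68% of the area under the curve within \(\pm\sigma\) of the mean, 95% within \(\pm 2\sigma\), and 99.7% within \(\pm 3\sigma\)]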
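The 68%, 95% and 99.7% areas can be reproduced numerically. The sketch below uses the error function from Python's standard library; the helper name `prob_within` is hypothetical.

```python
import math


def prob_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The printed areas are about 0.6827, 0.9545 and 0.9973, which round to the 68%, 95% and 99.7% quoted for one, two and three standard deviations.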

• Properties of the standard normal curve

– The curve is centered around \(\mu_y\)

– The curve reaches its highest value at \(\mu_y\) and tails off symmetrically at both ends

– The distribution is fully described by the expected value and the variance

• You can convert any distribution for which you have estimates of \(\mu_y\) and \(\sigma^2\) to a standard normal distribution

• A distribution only needs to be approximately normal for us to convert it to the standardized normal.

• The mass of the distribution must fall in the center, but the shape of the tails can be different

[Figure: two example curves, labeled 1 and 2, centered at \(\mu_y\)]

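• If we want to know the probability that someone earns at most $C, we are asking:

\[ P(Y \le C) = ? \]

We can rearrange terms to get:

\[ P(Y - \mu \le C - \mu) = P\!\left(\frac{Y - \mu}{\sigma} \le \frac{C - \mu}{\sigma}\right) = P(Z \le Z^*) = ? \]

where

\[ Z = \frac{Y - \mu}{\sigma}, \qquad Z^* = \frac{C - \mu}{\sigma} \]

• Properties for the standard normal variate Z:

– It is normally distributed with a mean of zero and a variance of 1, written in shorthand as Z~N(0,1)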

• If we have some variable Y we can assume that Y will be normally distributed, written in shorthand as Y~N(µ,σ²)

• We can use Z to convert Y to a standard normal distribution

• Look at the Z standardized normal distribution handout

– You can calculate the area under the Z curve from the mean of zero to the value of interest

– For example: read down the left-hand column to 1.6 and along the top row to .04, and you’ll find that the area under the curve between Z=0 and Z=1.64 is 0.4495
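The table lookup for Z = 1.64 can be reproduced without the handout. The sketch below computes the area between 0 and z from the standard normal CDF, written in terms of the error function; the function names are hypothetical.

```python
import math


def std_normal_cdf(z):
    """P(Z <= z) for Z ~ N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area_from_mean(z):
    """Area under the Z curve between the mean of zero and z."""
    return std_normal_cdf(z) - 0.5


print(round(area_from_mean(1.64), 4))
```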

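• Going back to our earlier question: What is the probability that someone earns between $300 and $400 [P(300 ≤ Y ≤ 400)]?

\[ \mu = 316.6, \qquad \sigma^2 = 25608, \qquad \sigma = \sqrt{25608} \approx 160 \]

\[ Z_1 = \frac{300 - 316.6}{160} = -0.104 \]

\[ Z_2 = \frac{400 - 316.6}{160} = 0.52 \]

\[ P(-0.104 \le Z \le 0) = 0.0418 \]

\[ P(0 \le Z \le 0.52) = 0.1985 \]

\[ P(-0.104 \le Z \le 0.52) = 0.0418 + 0.1985 = 0.2403 \]

[Figure: normal curve centered at 316.6, with the area P(300 ≤ Y ≤ 400) between 300 (Z1) and 400 (Z2) marked]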
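The same probability can be computed directly instead of from the Z table. This is a sketch under the normality assumption, using the mean of 316.6 and variance of 25608 from the worked example. The function names are hypothetical, and the result may differ from the table-based answer in the fourth decimal place because the table values are rounded.

```python
import math


def std_normal_cdf(z):
    """P(Z <= z) for Z ~ N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_between(lower, upper, mean, variance):
    """P(lower <= Y <= upper), assuming Y ~ N(mean, variance)."""
    sigma = math.sqrt(variance)
    z1 = (lower - mean) / sigma   # standardize the lower bound
    z2 = (upper - mean) / sigma   # standardize the upper bound
    return std_normal_cdf(z2) - std_normal_cdf(z1)


# Mean and variance as used in the worked example
p = prob_between(300.0, 400.0, mean=316.6, variance=25608.0)
print(round(p, 3))
```

Because the function standardizes arbitrary bounds, it also handles a question such as the probability of earning between $253 and $316, which the banded PDF could not answer.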

What we’ve done

• ‘A priori’ classical probability

– There are a finite number of possible outcomes

– Flipping coins example

• Empirical classical probability

– There are an infinite number of possible outcomes

– Difference between sample and population means

– Difference between sample and population expected values

– Difference in calculating PDFs of a univariate population

• Use of standard normal distribution.
