
Sociology 5811: Lecture 6: Probability, Probability Distributions, Normal Distributions

Copyright © 2005 by Evan Schofer

Do not copy or distribute without permission

Announcements

• Problem Set #1 due today

• Problem Set #2 handed out today; due in a week

• Class schedule
– Done with univariate stats
– Starting probability today

Review: Z-Score

• The Z-score: One way to assess relative placement of cases in a distribution
– Can be used for comparisons, like quantiles

• Converts all values of variables to a new scale, with mean of zero, S.D. of 1
– Scores typically run from about –3 to +3

• Formula:

$$Z_i = \frac{d_i}{s_Y} = \frac{Y_i - \bar{Y}}{s_Y}$$
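As a minimal sketch of the Z-score formula (each case's deviation from the mean, divided by the standard deviation), the snippet below converts a small sample to Z-scores. The data values are invented for illustration, and the use of the n - 1 form of the standard deviation is an assumption; the slide does not say which form is intended.

```python
from statistics import mean, stdev

# Hypothetical sample of years of education (illustrative values only)
years = [12, 16, 14, 12, 18, 10, 12, 16]

y_bar = mean(years)   # sample mean, Y-bar
s_y = stdev(years)    # sample standard deviation s_Y (n - 1 denominator, an assumption)

# Z_i = (Y_i - Y-bar) / s_Y
z_scores = [(y - y_bar) / s_y for y in years]

for y, z in zip(years, z_scores):
    print(f"Y = {y:2d}   Z = {z:+.2f}")
```

After the conversion the Z-scores have a mean of zero and a standard deviation of 1, whatever the original scale was.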

Probability Defined

• Definition: “The probability of a particular outcome is the proportion of times that outcome would occur in a long run of repeated observations” (Agresti & Finlay 1997, p. 81)

• Probability of event A defined as p(A):

$$p(A) = \frac{\text{outcomes in which A occurs}}{\text{total number of outcomes}}$$

• Example: Coin flip… probability of “heads”
– 1 outcome is “heads”, 2 total possible outcomes
– p(“heads”) = 1 / 2 = .5

Probability

• Question: What is the probability of picking a red marble out of a bowl with 2 red and 8 green?

$$p(\text{red}) = \frac{\text{outcomes in which red occurs}}{\text{total number of outcomes}}$$

There are 2 outcomes that are red

There are 10 total possible outcomes

p(red) = 2 divided by 10

p(red) = .20

Frequencies and Probability

• Note: The probability of picking a color relates to the frequency of each color in the jar
– 8 green marbles, 2 red marbles, 10 total
– p(Green) = .8, p(Red) = .2

• For nominal or ordinal variables:

$$p(x) = \frac{f(x)}{N}$$

• Where f(x) is the frequency of x in a sample

Frequency Charts and Probability

[Figure: frequency chart of GSS data (N=2904). X-axis: highest year of school completed (0 to 20). Y-axis: frequency (0 to 1000).]

Note that 392 individuals have 16 years of education

Note that the total N is 2904

Probabilities: Nominal/Ordinal

• Height of bars in a frequency chart reflects the probability of choosing cases from our dataset

• Suppose we pulled one case at random from our data

• What is the probability of choosing a person from the dataset with 16 years of education?

• Notation: p(Y=16)

• Computed as number of people with 16 years of education (frequency) divided by total N:

$$p(Y=16) = \frac{f(Y=16)}{N} = \frac{392}{2904} = .135$$
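The rule p(x) = f(x) / N can be sketched in a few lines. The helper name `probability_distribution` is hypothetical; the marble counts (2 red, 8 green) and the GSS figures (392 of 2904) come from the slides.

```python
from collections import Counter

def probability_distribution(values):
    """Convert raw observations to probabilities: p(x) = f(x) / N."""
    counts = Counter(values)   # f(x) for each distinct value x
    n = len(values)            # total N
    return {x: f / n for x, f in sorted(counts.items())}

# Marble example from the slides: 2 red, 8 green
marbles = ["red"] * 2 + ["green"] * 8
p = probability_distribution(marbles)
print(p)

# GSS example from the slides: 392 of 2904 respondents have 16 years of education
p_16 = 392 / 2904
print(round(p_16, 3))  # 0.135
```

Because every frequency is divided by the same N, the probabilities sum to 1 and the shape of the chart is unchanged.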

Probability Distributions

• In a frequency plot, the height of bars reflects frequency

• Dividing each value by N converts a chart to a “probability distribution”

• Indicating the probability of choosing an individual with a given value of Y

• Entire plots can be converted to probability distributions

• Shape of the distribution is preserved

• Height of bar represents probabilities rather than frequencies.

Probability Distribution Example

[Figure: probability distribution of highest year of school completed (0 to 20). Y-axis: percent (0 to 40), equivalently probability (0 to .4).]

As we calculated, p(Y=16) = .135

Probability: Continuous Variables

• Continuous measures can take on an infinite number of values

• So, it doesn’t make sense to think of the probability of picking any exact value

• 1. Typically, only one case has a given value
– The sample may contain a case with 16.238908 years of education: p(Y=16.238908) = 1/N

• 2. Most exact values have a frequency of 0
– Ex: 0 cases with 16.48900242 years of education
– So p(Y=16.48900242) is zero.

Continuous Distributions

• Continuous distributions can be approximated by connecting peaks of a histogram:

Line approximates height of bars for all values of Y

Continuous Probability Distributions

• For continuous probability distributions:

• Probabilities are not associated with single values
– e.g., the probability that Y=16

• Instead, probabilities are associated with a range of values
– e.g., the probability that Y is between 15 and 20

• These are visually represented by the area under a distribution between 15 and 20

$$p(Y \text{ in a range}) = \frac{\text{Area under curve in range}}{\text{Total area under curve}}$$
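To make the "area in range divided by total area" idea concrete, here is a rough numerical sketch. The triangular curve is a made-up stand-in for a continuous distribution of years of education, and the trapezoid-rule helper `area` is a hypothetical name, not something from the lecture.

```python
def area(curve, lo, hi, steps=10_000):
    """Approximate the area under `curve` between lo and hi (trapezoid rule)."""
    width = (hi - lo) / steps
    total = 0.5 * (curve(lo) + curve(hi))
    for i in range(1, steps):
        total += curve(lo + i * width)
    return total * width

def curve(y):
    """Hypothetical curve: rises from Y = 0 to a peak at Y = 12, falls to zero at Y = 20."""
    if y < 0 or y > 20:
        return 0.0
    return y / 12 if y <= 12 else (20 - y) / 8

# p(15 < Y < 20) = area under curve in range / total area under curve
p_15_to_20 = area(curve, 15, 20) / area(curve, 0, 20)
print(round(p_15_to_20, 3))
```

Dividing by the total area means the curve does not need to be scaled in advance; only the relative areas matter.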

Continuous Probability Distributions

p(Red) = Red Area / (Red Area + Blue Area)

Probability Distributions: Notation

• Notation:
– Greek alpha (α) is used to refer to a probability for a continuous distribution

• Notation: p(15<Y<20) = α
– α = probability of variable Y between 15 and 20

• You can also choose an open-ended range
– p(Y>.4) = α

• Or multiple ranges
– p(.2<Y<.4 or Y>8) = α

• Question: If p(Y>MdnY) = α, what is α?

Continuous Probability Distributions Examples

• p(a<Y<b) = α

Continuous Probability Distributions Examples

• p(Y<a) = α

Continuous Probability Distributions Examples

• p(Y<a, Y>b) = α

The “Normal” Distribution

• A particular shape of symmetrical distribution that comes up a lot

• Some biological phenomena have this distribution, such as height, cholesterol levels

• Certain statistical regularities take this form

• It is a “bell-shaped” distribution
– Note: not all bell-shaped curves are normal distributions.

Example of a Normal Curve

[Figure: Normal Curve, Mean = .5, SD = .7]

Normal Curves

• Normal Curves are a “family” of curves

• They all share the same general curvature and formula

• But, there are infinite variations with different means, standard deviations

• They have different centers (means) and are more or less spread out.

• Examples of different normal curves:
– Mean for male height = 70 inches, S.D. = 4
– Mean for cholesterol = 182, S.D. = 38.

Formula for Normal Curves

• The shape of a normal probability distribution can be expressed as a function:

$$p(Y) = \frac{e^{-(Y-\mu_Y)^2 / 2\sigma_Y^2}}{\sqrt{2\pi\sigma_Y^2}}$$

• Where:
– e refers to a constant (2.718)
– π refers to a constant (3.142)
– μ_Y refers to the mean of the normal curve
– σ_Y refers to the standard deviation of the normal curve.
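The normal curve function (e raised to minus the squared distance from the mean over twice the variance, divided by the square root of 2π times the variance) can be written out directly. This is a sketch: the function name `normal_pdf` is hypothetical, and the mean of 70 and S.D. of 4 are the male-height figures from the slides. The result is compared with `statistics.NormalDist.pdf` from the Python standard library (Python 3.8+).

```python
import math
from statistics import NormalDist

def normal_pdf(y, mu, sigma):
    """Height of the normal curve at y, for mean mu and standard deviation sigma."""
    exponent = -((y - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / math.sqrt(2 * math.pi * sigma ** 2)

# Height of the male-height curve (mean 70, S.D. 4) at its center
height_at_mean = normal_pdf(70, mu=70, sigma=4)
print(round(height_at_mean, 4))  # 0.0997

# The standard library gives the same value one S.D. below the mean
print(math.isclose(normal_pdf(66, 70, 4), NormalDist(70, 4).pdf(66)))  # True
```

Choosing a different mean shifts the curve left or right; choosing a different standard deviation makes it more or less spread out.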

Properties of Normal Curves

• If you choose a mean and s.d., you can plot a corresponding normal curve

• Probability distributions can also be normal
– Remember: the proportion of area under the curve in a given range is equal to the probability of picking someone in that range

• Normal curves are useful because:
– The probabilities of cases falling in a certain range on a normal curve are well known
– Thus, it is easy to determine p(a<Y<b)!

Properties of Normal Curves

• Normal curves have well-known properties:
– About 68% of area under the curve (and thus cases) falls within 1 standard deviation of the mean
– About 95% of cases fall within 2 standard deviations
– More than 99% of cases (about 99.7%) fall within 3 standard deviations

• In fact, the percentage can be easily determined for any number of standard deviations (e.g., 1.5 or 2.3890)

• Note: This is only true of normal curves
– You can’t apply these rules to non-normal distributions.
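The percentages for 1, 2, and 3 standard deviations, and for in-between values such as 1.5 or 2.3890, can be checked with the standard library's `statistics.NormalDist` (Python 3.8+). The helper name `share_within` is hypothetical; this is a sketch rather than the method used in the lecture.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal curve's area within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3, 1.5, 2.3890):
    print(f"within {k} S.D.: {share_within(k):.4f}")
```

The first three values come out as 0.6827, 0.9545, and 0.9973, which is where the rounded 68%, 95%, and 99% figures come from.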

Properties of Normal Curves

• The predictable link between standard deviations and percent of cases falling near the mean makes normal curves very useful

• 1. You can determine the probability associated with any range of values around the mean

• e.g., there is a .95 probability that a person randomly chosen will fall within 2 SD of mean

• 2. You can convert Z-scores (# standard deviations) into something like a percentile

• If a case falls 3 standard deviations above the mean, it must be in the 99th percentile.

Properties of Normal Curves

• Visually:

[Figure: normal curve with regions labeled Z, 2Z, and 3Z; x-axis from 55 to 85]

Question: Why are these referred to as Z, 2Z, 3Z?

Normal Distribution: Example

• Male height is normally distributed
– Distribution: mean = 70 inches, S.D. = 4 inches

Question: Is this a frequency distribution or a probability distribution?

Normal Distribution: Example

• Male height is normally distributed
– Distribution: mean = 70 inches, S.D. = 4 inches

• What is the range of heights that encompasses 99% of the population?
– Hint: that’s +/- 3 standard deviations

• Answer: 70 +/- (3)(4) = 70 +/- 12
– Range = 58" to 82"

• This is very useful information
– Ex: If you are designing a car to comfortably fit most people.
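A short sketch of the height calculation, using the mean of 70 inches and S.D. of 4 inches from the slide. It computes the +/- 3 S.D. range and the share of cases that range covers, then, as an extra comparison not in the lecture, the exact central 99% range from `NormalDist.inv_cdf` (Python 3.8+).

```python
from statistics import NormalDist

height = NormalDist(mu=70, sigma=4)   # male height, from the slide

# Range within 3 standard deviations of the mean
low, high = 70 - 3 * 4, 70 + 3 * 4
coverage = height.cdf(high) - height.cdf(low)
print(low, high, round(coverage, 4))  # 58 82 0.9973

# Exact central 99% range: cut off 0.5% in each tail
lo99 = height.inv_cdf(0.005)
hi99 = height.inv_cdf(0.995)
print(round(lo99, 1), round(hi99, 1))
```

The 3 S.D. rule of thumb is slightly conservative: 58" to 82" covers a bit more than 99% of cases, so the exact 99% range is somewhat narrower.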

Normal Distribution: Example

• More than 99% of cases fall within 3 S.D. of mean

[Figure: normal curve of male height; x-axis from 55 to 85 inches]

A total of less than 1% (about 0.3%) fall above 82 inches or below 58 inches

Normal Distributions and Inference

• The link between normal distributions and probabilities allows us to draw conclusions

• Example: Suppose you are a detective

• You suspect that a person is taking an illegal drug
– One side-effect of the drug is that it raises cholesterol to extremely high levels

• Strategy: Take a sample of blood from person
– Compare with known distribution for normal people

• Observation: Blood cholesterol is 5 standard deviations above the mean…

Normal Distributions and Inference

• What can you tell by knowing cholesterol is 5 standard deviations above the mean?

• More than 99% are within 3 standard deviations; fewer than 1% are not

• A much lower percentage fall 5 S.D.’s from the mean

• Based on properties of a normal curve:
– Only .000000287 of cases fall 5 or more S.D.’s above the mean

• Conclusion: It is highly improbable that a person who is not taking the drug would have cholesterol this high

• But, in a world of 6 billion people, about 1,722 such people would be expected – you can’t be absolutely certain…
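The tail figure can be reproduced as a rough check. The sketch below computes the share of a normal curve lying 5 or more standard deviations above the mean and multiplies it by the 6 billion world population used on the slide. Using the upper tail only is an assumption, chosen because it matches the slide's .000000287.

```python
from statistics import NormalDist

# Share of cases 5 or more standard deviations ABOVE the mean (one tail)
upper_tail = 1 - NormalDist().cdf(5)
print(f"{upper_tail:.3g}")   # roughly 2.87e-07, matching the slide's .000000287

# Expected number of such people in a world of 6 billion
people = 6_000_000_000
expected = upper_tail * people
print(round(expected))

# The slide's 1,722 comes from the rounded proportion
print(round(0.000000287 * people))  # 1722
```

The unrounded tail probability gives a count slightly below 1,722; the difference is only rounding in the proportion.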

Samples and Populations

• Issue: As social scientists, we wish to describe and understand large sets of people (or organizations or countries)
– School achievement of American teenagers
– Fertility of individuals in Indonesia
– Behavior of organizations in the auto industry

• Problem: It is seldom possible to collect data on all relevant people (or organizations or countries) that we hope to study.

Samples and Populations

• How can we calculate the mean or standard deviation for a population, without data on most individuals?
– Without even knowing the total N of the population?

• Are we stuck?

• IDEA: Maybe we can gain some understanding of large groups, even if we have information about only some of the cases within the group
– We can examine part of the group and try to make intelligent guesses about what the entire group is like.

Populations Defined

• Population: The entire set of persons, objects, or events that have at least one common characteristic of interest to a researcher (Knoke, p. 15)

• Populations (and things we’d like to study)
– Voting age Americans (their political views)
– 6th grade students attending a particular school (their performance on a math test)
– People (their response to a new AIDS drug)
– Small companies (their business strategies).

Population: Defined

• People in those populations have one common characteristic, even if they are different in many other ways
– Example: Voting age Americans may differ wildly, but they share the fact that they are voting aged Americans

• Beyond literal definition, a population is the general group that we wish to study and gain insight into.

Sample: Defined

• Sample: A subset of a population
– Any subset, chosen in any way
– But, manner of choosing makes some samples more useful than others
– Datasets are usually samples of a larger population

• Beyond literal definition, sample often means “the group that we have data on”.

Statistical Inference: Defined

• Our Goal: to describe populations

• However, we only have data on a sample (a subset) of the population

• We hope that studying a sample will give us some insight into the overall population

• Statistical Inference: making statistical generalizations about a population from evidence contained in a sample (Knoke, p. 77).

Statistical Inference

• When is statistical inference likely to work?

• 1. When a sample is large
– If a sample approaches the size of the population, it is likely to be a good reflection of that population

• 2. When a sample is representative of the entire population
– As opposed to a sample that is atypical in some way, and thus not reflective of the larger group.

Random Samples

• One way to get a representative sample is by choosing one randomly

• Definition: A sample chosen from a population such that each observation has an equal chance of being selected (Knoke, p. 77)
– Probability of selection:

$$p(\text{selection}) = \frac{1}{N}$$

• Randomness is one strategy to avoid “bias”, the circumstance when a sample is not representative of the larger population.
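As an illustration of equal-chance selection, the sketch below draws one case at random many times from a made-up population of N = 10 and tallies how often each case comes up; each proportion should sit near 1/N. The population, seed, and number of draws are arbitrary assumptions.

```python
import random
from collections import Counter

random.seed(5811)   # arbitrary seed so the run is repeatable

population = list(range(1, 11))   # hypothetical population, N = 10
N = len(population)

# Draw one case at random, many times, and tally how often each case is chosen
draws = 100_000
tally = Counter(random.choice(population) for _ in range(draws))

for case in population:
    print(case, round(tally[case] / draws, 3))   # each should be near 1/N = 0.1

# A simple random sample of 3 distinct cases from the same population
sample = random.sample(population, 3)
print(sample)
```

`random.choice` picks a single case with probability 1/N, while `random.sample` draws several distinct cases without replacement.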

Biased Samples: Examples

• Biased samples can lead to false conclusions about characteristics of populations

• What are the problems with these samples?
– Internet survey asking people the number of CDs they own (population = all Americans)
– Telephone survey conducted during the day of political opinions (pop = voting age Americans)
– Survey of an Intro Psych class on causes of stress and anxiety (pop = all humans)
– Survey of Fortune 500 firms on reasons that firms succeed (pop = all companies).

Statistical Inference

• Statistical inference involves two tasks:

• 1. Using information from a sample to estimate properties of the population

• 2. Using laws of statistics and information from the sample to determine how close our estimate is likely to be
– We can determine whether or not we are confident in our assessment of a population

Statistical Inference Example

• Population: Students in the United States

• Sample: Individuals in this classroom

• Question: What is the mean number of CDs owned by students in the US?
– Goal #1: Use information on students in this class to guess the mean number of CDs owned by students in the US
– Goal #2: Try to determine how close (or far off) our estimate of the population mean might be. Estimate the quality of the guess.

• Goal #2 helps prevent us from drawing inappropriate conclusions from Goal #1

Population and Sample Notation

• Characteristics of populations are called parameters

• Characteristics of a sample are called statistics

• To keep things straight, mathematicians use Greek letters to refer to populations and Roman letters to refer to samples
– Mean of sample is: Y-bar
– Mean of population is Greek mu: μ
– Standard deviation of sample is: s
– Standard deviation of a population is lower case Greek sigma: σ

Population and Sample Notation

• An estimate of a population parameter based on information from a sample is called a “point estimate”
– Example of a point estimate: Based on this sample, I’d guess that the mean # of CDs owned by students in the U.S. is 47.

• Formulas to estimate a population parameter from a sample are “estimators”

Estimation: Notation

• We often wish to estimate population parameters, using information from a sample we have

• We may use a variety of formulas to do this

• Mathematicians identify estimates of population parameters in formulas by placing a caret (“^”) over the parameter
– The caret is called a “hat”
– An estimate of σ is called “sigma-hat”
– Symbol: σ̂

Population and Sample Distributions

[Figure: population and sample distributions, with the sample statistics Y-bar and s labeled]

Populations and Samples

• Population parameters (μ, σ) are constants
– There is one true value, but it is unknown

• Sample statistics (Y-bar, s) are variables
– Up until now we’ve treated them as constants
– There are many possible samples, and thus many possible values for each
– In fact, the range of possible values makes up a distribution: the “sampling distribution”

• This provides insight into the probable location of the population mean
– Even if you only have one single sample to look at
– This “trick” lets us draw conclusions!!!
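The idea that a sample statistic varies from sample to sample can be simulated. In this sketch the population (CD counts for 10,000 students, drawn uniformly from 0 to 100), the sample size of 50, and the 2,000 repetitions are all invented for illustration; none of these numbers come from the lecture.

```python
import random
from statistics import mean, stdev

random.seed(6)   # arbitrary seed so the run is repeatable

# Hypothetical population: number of CDs owned by each of 10,000 students
population = [random.randint(0, 100) for _ in range(10_000)]
mu = mean(population)   # population parameter (unknown in real research)

# Draw many random samples of n = 50 and record each sample mean (Y-bar)
sample_means = [mean(random.sample(population, 50)) for _ in range(2_000)]

print("population mean (mu):     ", round(mu, 2))
print("mean of the sample means: ", round(mean(sample_means), 2))
print("S.D. of the sample means: ", round(stdev(sample_means), 2))
```

The sample means cluster around the population mean, and their spread is much smaller than the spread of individual cases, which is why a single sample can still say something about where μ probably lies.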