Basic Statistics Introduction to Inferential Statistics

Upload: felicia-sanders

Post on 28-Dec-2015


Page 1: Basic Statistics Introduction to Inferential Statistics

Basic Statistics

Introduction to Inferential Statistics

Page 2: Basic Statistics Introduction to Inferential Statistics

STRUCTURE OF STATISTICS

STATISTICS

• DESCRIPTIVE
– TABULAR
– GRAPHICAL
– NUMERICAL

• INFERENTIAL
– ESTIMATION
– TESTS OF HYPOTHESIS

Page 3: Basic Statistics Introduction to Inferential Statistics

Introduction to Inferential Statistics

• Inferential statistics about the population mean are usually used to answer one of two types of questions.
– The first question is, “What is the average ‘something’?” This is Estimation.

• “Something” could be hours spent studying by online students, speed driven by teenagers, distance people commute to work or school, or any number of other things.

Page 4: Basic Statistics Introduction to Inferential Statistics

Introduction to Inferential Statistics

• The second type of question about the population mean is:
– “Am I right or wrong if I guess (hypothesize) the mean ‘something’ to be {value}?” This is Hypothesis Testing.

• Again, “something” could be hours spent studying by online students (10 hours), speed driven by teenagers (too fast*), distance people commute to work or school (12 miles), or any number of other things.

*The hypothesized value must be a value, not a value judgment!

Page 5: Basic Statistics Introduction to Inferential Statistics

Inferential Statistics

• Estimation
– Confidence Intervals

• Hypothesis Testing
– t-tests

Page 6: Basic Statistics Introduction to Inferential Statistics

Relating to the Textbook

• Your textbook treats these two types of questions as distinctly different, with Hypothesis Testing taking the predominant role.

• I see them as closely linked; in fact, I will show you how to do both things with one technique.

Page 7: Basic Statistics Introduction to Inferential Statistics

REMINDER!!

• Much of what we will cover from here until the end of the course is not in sequence with your book. The material is all there but I will be referring you to many sections of many chapters as we progress. You will need to pay careful attention to the PowerPoint lessons and be able to use your textbook as a reference.

Page 8: Basic Statistics Introduction to Inferential Statistics

Some Definitions for Estimation

• Estimation: Using sample statistics to estimate population parameters.

• Point Estimate: Use of a single number as the estimate for the unknown parameter (almost never exactly correct!).

• Interval Estimate: A range of values as the estimate for the unknown parameter.

• Confidence Interval: An interval estimate accompanied by a specific level of probability.

Page 9: Basic Statistics Introduction to Inferential Statistics

An Example of Estimation

Suppose a university administrator is interested in determining the average IQ of all professors at her university. It is too costly to test all professors, so she selects a random sample of 20 professors. Each is given an IQ test and the results show a sample mean of 135. Since the test is nationally standardized, she knows that $\sigma$ for the population is 15.

How would the administrator estimate the average IQ for ALL professors at her university?

Page 10: Basic Statistics Introduction to Inferential Statistics

Constructing a Confidence Interval

• The general formula for a confidence interval uses information from the sample and our knowledge of the sampling distribution from the Central Limit Theorem.

• We then construct an interval in which we think the population parameter will be.

Page 11: Basic Statistics Introduction to Inferential Statistics

Confidence Interval Formula

$CI = \bar{X} \pm Z\,\frac{\sigma}{\sqrt{n}}$

In words, the confidence interval is determined by adding and subtracting the bound on the estimate (the z score representing the level of confidence times the standard error) to and from the mean from the sample.
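The formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `z_confidence_interval` and the sample values (mean 50, sigma 10, n 25) are made up for the example.

```python
import math

def z_confidence_interval(sample_mean, sigma, n, z=1.96):
    """Return (lower, upper) for sample_mean +/- z * sigma / sqrt(n)."""
    standard_error = sigma / math.sqrt(n)   # sigma divided by the square root of n
    bound = z * standard_error              # bound on the error of estimation
    return sample_mean - bound, sample_mean + bound

# Hypothetical inputs chosen only to make the arithmetic easy to follow.
lower, upper = z_confidence_interval(sample_mean=50.0, sigma=10.0, n=25)
print(round(lower, 2), round(upper, 2))  # 46.08 53.92
```

Here the standard error is 10/5 = 2, so the bound is 1.96 × 2 = 3.92 on each side of the sample mean.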

Page 12: Basic Statistics Introduction to Inferential Statistics

The Area Under the Normal Curve and the Sampling Distribution of the Means

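[Figure: normal curve for the sampling distribution of the mean, centered at $\mu_{\bar{X}}$; 95% of the area lies between $\mu - 1.96\,\sigma_{\bar{X}}$ and $\mu + 1.96\,\sigma_{\bar{X}}$.]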

Page 13: Basic Statistics Introduction to Inferential Statistics

The Sampling Distribution of the Means

• From the previous slide we can see that, given our knowledge of the sampling distribution of the means, 95% of all the $\bar{X}$'s that we would obtain from numerous samples will fall within 1.96 standard errors ($\sigma_{\bar{X}}$) of the unknown $\mu$.

Page 14: Basic Statistics Introduction to Inferential Statistics

Our University Administrator

Let’s return to our university administrator and see how she will estimate the average IQ ($\mu$) of all professors at her university.

She will need to know the Shape, Mean, and Standard Deviation of the Sampling Distribution!

Page 15: Basic Statistics Introduction to Inferential Statistics

What about the Shape of the Sampling Distribution?

If we had repeated taking hundreds of samples of 20 professors, what does the CLT tell us will be the shape of the distribution of sample means from these samples?

It would be approximately normal or mound-shaped.

Page 16: Basic Statistics Introduction to Inferential Statistics

What about the Mean of the Sampling Distribution?

What does the CLT tell us the mean of the sampling distribution would be?

It would be the same as the population mean, which is $\mu$ and is unknown.

Page 17: Basic Statistics Introduction to Inferential Statistics

What about the Standard Deviation of the Sampling Distribution?

What does the CLT tell us the standard deviation (standard error) of the sampling distribution would be?

It would be the same as the population standard deviation, 15, divided by the square root of the sample size, 20.

$\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} = \frac{15}{\sqrt{20}} = 3.35$
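A small simulation can make the standard error concrete. The sketch below draws many samples of 20 from a normal population with standard deviation 15 and compares the spread of the sample means with 15/√20. The population mean of 130 and the number of repetitions are assumptions made only so the simulation can run; the real $\mu$ is unknown.

```python
import math
import random
import statistics

random.seed(0)

POP_MEAN = 130.0   # assumption for the simulation only; the real mu is unknown
POP_SD = 15.0      # population standard deviation from the example
N = 20             # sample size from the example
REPEATS = 20_000   # hypothetical number of repeated samples

# Mean of each simulated sample of N professors.
sample_means = [
    statistics.fmean(random.gauss(POP_MEAN, POP_SD) for _ in range(N))
    for _ in range(REPEATS)
]

theoretical_se = POP_SD / math.sqrt(N)          # 15 / sqrt(20)
simulated_se = statistics.stdev(sample_means)   # spread of the sample means

print("theoretical standard error:", round(theoretical_se, 2))
print("simulated standard error:  ", round(simulated_se, 2))
```

The two printed values should agree closely, and the mean of the simulated sample means should sit near the assumed population mean.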

Page 18: Basic Statistics Introduction to Inferential Statistics

We can display this graphically

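[Figure: sampling distribution of $\bar{X}$ with one standard error, $\sigma_{\bar{X}} = \sigma_X/\sqrt{n} = 15/\sqrt{20} = 3.35$, marked on each side of the center.]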

Page 19: Basic Statistics Introduction to Inferential Statistics

Using the standard deviation of the sampling distribution (standard error), we can determine a bound on the estimate.

We know that 95% of the observations in a normal distribution fall within 1.96 standard deviations of the mean. Therefore, 1.96 standard errors (1.96 × 3.35 = 6.57) is the maximum distance by which our estimate will miss the population parameter (the error) 95% of the time.

Page 20: Basic Statistics Introduction to Inferential Statistics

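Using this information, we can determine the points on this graph between which the sample mean would occur 95% of the time:

$\mu - (1.96 \times 3.35) = \mu - 6.57$ and $\mu + (1.96 \times 3.35) = \mu + 6.57$

[Figure: sampling distribution of $\bar{X}$ with the central 95% of the area lying between $\mu - 6.57$ and $\mu + 6.57$.]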

Page 21: Basic Statistics Introduction to Inferential Statistics

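Let’s illustrate the computations:

First, we compute the bound on the error of estimation:

$1.96\,\sigma_{\bar{X}} = 1.96\,\frac{\sigma_X}{\sqrt{n}} = 1.96\,\frac{15}{\sqrt{20}} = 6.57$

We then subtract it from and add it to the sample mean:

$\bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}} = 135 - 1.96\,\frac{15}{\sqrt{20}} = 128.4$

$\bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}} = 135 + 1.96\,\frac{15}{\sqrt{20}} = 141.6$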

Page 22: Basic Statistics Introduction to Inferential Statistics

The Answer!

• Based on our calculations, the way to state the estimate is:

The administrator is 95% confident that the mean IQ for all professors at her university is between 128.4 and 141.6.
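The administrator's arithmetic can be checked with a short script. This is only a sketch of the calculation; the variable names are illustrative and the inputs are the values given in the example.

```python
import math

sample_mean = 135.0   # mean IQ of the 20 sampled professors
sigma = 15.0          # known population standard deviation
n = 20                # sample size
z = 1.96              # z score for 95% confidence

standard_error = sigma / math.sqrt(n)   # 15 / sqrt(20)
bound = z * standard_error              # bound on the error of estimation
lower = sample_mean - bound
upper = sample_mean + bound

print(round(standard_error, 2), round(bound, 2), round(lower, 1), round(upper, 1))
# 3.35 6.57 128.4 141.6
```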

Page 23: Basic Statistics Introduction to Inferential Statistics

We can show graphically the concept of the confidence interval.

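[Figure: sampling distribution of $\bar{X}$ with the interval from $\mu - 6.57$ to $\mu + 6.57$ marked.]

Since there is a 95% chance that the sample mean will fall within 6.57 of $\mu$, an interval of ±6.57 around the sample mean will capture the population mean ($\mu$) 95% of the time.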

Page 24: Basic Statistics Introduction to Inferential Statistics

Important Concept

• When we construct a confidence interval, we are not saying that the parameter is in the middle, merely that it is somewhere in that interval! It is like throwing a net into the sea: we hope to catch the fish, but we do not know where the fish is. If we are really hungry, we had better throw a big net (which, statistically, means choosing a higher level of confidence).

Page 25: Basic Statistics Introduction to Inferential Statistics

How often will the 95% confidence interval capture $\mu$?

[Figure: sampling distribution of $\bar{X}$ with the interval from $\mu - 6.57$ to $\mu + 6.57$ marked.]

Answer: 95% of the time

Page 26: Basic Statistics Introduction to Inferential Statistics

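Here, the sample mean is as far left as it will fall 95% of the time. Please note that the interval still captures (barely) the population mean.

[Figure: interval of ±6.57 around a sample mean located at $\mu - 6.57$; its upper end just reaches $\mu$.]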

Page 27: Basic Statistics Introduction to Inferential Statistics

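Here, the sample mean is as far right as it will fall 95% of the time. Please note that the interval still captures (barely) the population mean.

[Figure: interval of ±6.57 around a sample mean located at $\mu + 6.57$; its lower end just reaches $\mu$.]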

Page 28: Basic Statistics Introduction to Inferential Statistics

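Only 5 times in 100 samples will the obtained sample mean be so far away that a 95% confidence interval will not capture $\mu$.

[Figure: interval of ±6.57 around a sample mean $\bar{X}$ that lies more than 6.57 from $\mu$; the interval misses $\mu$.]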
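The "95% of the time" claim can be checked by simulation. The sketch below repeatedly draws a sample of 20, builds the ±6.57 interval around each sample mean, and counts how often the interval contains the true mean. The true mean of 130 and the number of trials are assumptions made only so the simulation can run.

```python
import math
import random
import statistics

random.seed(1)

TRUE_MEAN = 130.0   # assumption for the simulation; unknown in practice
SIGMA = 15.0        # population standard deviation from the example
N = 20              # sample size from the example
Z = 1.96            # z score for 95% confidence
TRIALS = 10_000     # hypothetical number of repeated samples

bound = Z * SIGMA / math.sqrt(N)   # about 6.57

captured = 0
for _ in range(TRIALS):
    xbar = statistics.fmean(random.gauss(TRUE_MEAN, SIGMA) for _ in range(N))
    # Does the interval around this sample mean contain the true mean?
    if xbar - bound <= TRUE_MEAN <= xbar + bound:
        captured += 1

coverage = captured / TRIALS
print("proportion of intervals capturing the true mean:", coverage)
```

The printed proportion should be close to 0.95, with the remaining intervals (roughly 5 in 100) missing the true mean.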

Page 29: Basic Statistics Introduction to Inferential Statistics

Summary

The sample mean is the point estimate for the population mean.

The standard deviation of the sampling distribution is also called the standard error for the estimate of the mean.

1.96 standard errors provide the 95% bound on the error of estimation. If we add and subtract this bound from the sample mean, we create a confidence interval.

Finally, we can alter the confidence limits (from 95%) depending on the distance from the mean of the distribution that we choose.

Page 30: Basic Statistics Introduction to Inferential Statistics

Summary in Symbols for Estimating μ

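Point Estimate: $\bar{X}$

Error of Estimation (standard error): $\frac{\sigma}{\sqrt{n}}$

Bound on Error of Estimation: $z\,\frac{\sigma}{\sqrt{n}}$

Note: z would be 1.96 for the 95% Bound, 2.575 for the 99% Bound, 1.645 for the 90% Bound, etc.

Confidence Interval: $\bar{X} - z\,\frac{\sigma}{\sqrt{n}}$ to $\bar{X} + z\,\frac{\sigma}{\sqrt{n}}$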
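The z values quoted for each confidence level can be recovered from the standard normal distribution rather than read from a table. A minimal sketch using Python's standard library follows; the helper name `z_for_confidence` is made up for this example.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z score that leaves (1 - level) / 2 in each tail."""
    tail = (1 - level) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))
# 0.9 1.645
# 0.95 1.96
# 0.99 2.576
```

Printed tables often round the 99% value to 2.575 or 2.58; the difference is negligible for hand calculation.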

Page 31: Basic Statistics Introduction to Inferential Statistics

One-Sample Test of Hypothesis: Hypothesis Testing on a Population Mean

[Figure: a sample with mean $\bar{X}$ drawn from a population with mean $\mu$.]

Page 32: Basic Statistics Introduction to Inferential Statistics

Research Situation

A particular test has a national mean and standard deviation of 100 and 15, respectively. The superintendent of a particular school system wants to know if the average IQ in her school system is different from the national average on this test.

Page 33: Basic Statistics Introduction to Inferential Statistics

Definitions Related to Hypothesis Testing

• Null Hypothesis: The hypothesis that we will test statistically. In a single-sample problem it is the “guess” about the population mean ($\mu$).
– Written as: Ho: $\mu$ = ‘value’.

• Alternative Hypothesis: If the null is not tenable, then the alternative must be.
– Written as: Ha: $\mu \neq$ ‘value’, or $\mu <$ ‘value’, or $\mu >$ ‘value’.

Page 34: Basic Statistics Introduction to Inferential Statistics

Step by Step: The One-Sample Test of Hypothesis Using the z-test.

1. State Research Question

2. Establish the Hypotheses

3. Establish Level of Significance

4. Collect Data

5. Calculate Statistical Test

6. Interpret the Results

Page 35: Basic Statistics Introduction to Inferential Statistics

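1. Stating the Research Problem

Is the average IQ of students in that particular system different from the national average?

[Figure: a sample with mean $\bar{X}$ is drawn (sampling) from a population with hypothesized mean $\mu_0$; is there a difference between $\bar{X}$ and $\mu_0$?]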

Page 36: Basic Statistics Introduction to Inferential Statistics

2. Establish the Research Hypothesis

Null Hypothesis: The mean IQ is 100.

$H_o: \mu = 100$

Research or Alternative Hypothesis: The mean IQ is not equal to 100.

$H_a: \mu \neq 100$

Page 37: Basic Statistics Introduction to Inferential Statistics

Alternative or Research Hypotheses

• The Alternative Hypothesis may take a non-directional form, $\mu \neq$ ‘value’.

• The Alternative Hypothesis may be a directional hypothesis, $\mu >$ ‘value’ or $\mu <$ ‘value’.

• The decision to use a directional alternative is based on the research question under investigation.

Page 38: Basic Statistics Introduction to Inferential Statistics

Errors in Decisions

Our Decision | Null is Really True | Null is Really False
Accept Null Hypothesis | Good Decision | Bad Decision: Type II Error ($\beta$)
Reject Null Hypothesis | Bad Decision: Type I Error ($\alpha$) | Good Decision (Power)

Page 39: Basic Statistics Introduction to Inferential Statistics

3. Establish the Level of Significance

$\alpha$ is the probability of rejecting a true null hypothesis. It equals the area NOT within the region where we would expect to find our sample mean (e.g., if we use 95% of the area under the curve, then $\alpha$ is .05).

$\alpha$ defines what is called the “Rejection Region” because we will reject the Null if our calculated z statistic is in that region.

Page 40: Basic Statistics Introduction to Inferential Statistics

Graphical Depiction of Rejection Region

[Figure: normal curve centered at the hypothesized $\mu$, with a rejection region in each tail.]

Page 41: Basic Statistics Introduction to Inferential Statistics

Rejection Region: Directional Hypotheses

This would represent a directional hypothesis, $\mu >$ ‘value’. The total rejection area (e.g., .05) would be on only one side, so the critical value of z would be 1.645 rather than 1.96, giving a greater likelihood of rejecting the Null Hypothesis.

[Figure: normal curve with the rejection region in the upper tail, beyond z = +1.645.]

Page 42: Basic Statistics Introduction to Inferential Statistics

Rejection Region: Directional Hypotheses

This would represent a directional hypothesis, $\mu <$ ‘value’. The total rejection area (e.g., .05) would be on only one side, and again the critical value of z would be 1.645 in magnitude (here −1.645) rather than 1.96, giving a greater likelihood of rejecting the Null Hypothesis.

[Figure: normal curve with the rejection region in the lower tail, beyond z = −1.645.]

Page 43: Basic Statistics Introduction to Inferential Statistics

Rejection Region

• It is determined by the Alpha ($\alpha$) selected. $\alpha$ defines how much of the area under the curve will be in the rejection region.

• The probability of rejecting a TRUE Null Hypothesis is equal to the area in the rejection region, since a sample mean will fall in that region only that frequently if the Null Hypothesis is true.
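The link between alpha and the rejection region can be illustrated by simulating a world in which the Null Hypothesis really is true and counting how often a two-tailed test at the .05 level rejects it anyway. This is a sketch under stated assumptions: a null mean of 100, a population standard deviation of 15, an illustrative sample size of 81, and sample means drawn directly from their normal sampling distribution as a shortcut.

```python
import math
import random

random.seed(2)

MU0 = 100.0      # the null hypothesis is TRUE in this simulation
SIGMA = 15.0     # population standard deviation
N = 81           # illustrative sample size
Z_CRIT = 1.96    # two-tailed critical value for alpha = .05
TRIALS = 10_000  # hypothetical number of repeated studies

standard_error = SIGMA / math.sqrt(N)

rejections = 0
for _ in range(TRIALS):
    # Shortcut: draw the sample mean from its sampling distribution.
    xbar = random.gauss(MU0, standard_error)
    z = (xbar - MU0) / standard_error
    if abs(z) > Z_CRIT:          # z landed in the rejection region
        rejections += 1

type_i_rate = rejections / TRIALS
print("proportion of true nulls rejected:", type_i_rate)
```

The printed proportion should be close to .05, which is the Type I error rate the chosen alpha allows.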

Page 44: Basic Statistics Introduction to Inferential Statistics

4. Collecting the Data

A random sample of 81 students was given the IQ test and their scores were recorded. The mean of the sample was 105.

Student | IQ
1 | 109
2 | 88
… | …
81 | 122

Page 45: Basic Statistics Introduction to Inferential Statistics

5. Analyzing the Data: Calculating the Test Statistic

$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$

The statistic we will calculate to determine if the Research Hypothesis is tenable is a modification of our z score, $Z = \frac{X - \bar{X}}{S}$.

Notice that the new formula uses quantities from the sampling distribution: the difference between the sample mean $\bar{X}$ and the population mean $\mu$, divided by the standard error $\sigma_{\bar{X}}$. These are exactly as we discussed in Estimation.

Page 46: Basic Statistics Introduction to Inferential Statistics

Calculating the Z Statistic

• From our sample of 81 students, we calculated the sample mean to be 105.

• The population standard deviation is 15.

• Using our Z test formula we can determine where our sample mean would fall, if the population mean $\mu$ is 100.

Page 47: Basic Statistics Introduction to Inferential Statistics

The Z Statistic

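$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{105 - 100}{15/\sqrt{81}} = \frac{5}{1.667} = 3.00$

Thus, our sample mean lies 3.00 standard deviations (standard errors) above the population mean of 100 (2.99 if the standard error is first rounded to 1.67).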
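The calculation can be reproduced in Python. This is a sketch, with illustrative variable names; the two-tailed p-value at the end is an addition not found in the slides, included only to show how far into the tail the result lies.

```python
import math
from statistics import NormalDist

sample_mean = 105.0   # mean IQ of the 81 sampled students
mu0 = 100.0           # hypothesized (national) mean
sigma = 15.0          # known population standard deviation
n = 81                # sample size

standard_error = sigma / math.sqrt(n)        # 15 / 9
z = (sample_mean - mu0) / standard_error     # distance in standard errors

# Two-tailed decision at the .05 level uses the critical value 1.96.
reject_null = abs(z) > 1.96

# Addition for illustration: probability of a z at least this extreme
# in either tail if the null hypothesis were true.
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(standard_error, 3), round(z, 2), reject_null)  # 1.667 3.0 True
```

Carried at full precision, the statistic is 3.00; rounding the standard error to 1.67 before dividing gives 2.99, and the decision is the same either way.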

Page 48: Basic Statistics Introduction to Inferential Statistics

Locating our Mean on the Sampling Distribution of Means

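[Figure: sampling distribution of means centered at 100, with one standard error (1.667) marked on each side at 98.33 and 101.67; the region containing 95% of all means is marked, and the sample mean of 105 is located on the same axis.]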

Page 49: Basic Statistics Introduction to Inferential Statistics

6. Interpreting the Results

• Since our sample mean is not in the area where we would expect 95% of all sample means from a distribution where the population mean is 100, we would reject the Null Hypothesis that $\mu = 100$ and accept the alternative that $\mu \neq 100$.

• We would state that we reject the Null Hypothesis at the .05 level of significance.

Page 50: Basic Statistics Introduction to Inferential Statistics

Problems with Z Test

• The z test requires that we know the population standard deviation, which we usually do not know.

• The z test is designed for large samples (n > 30), which, again, we don’t always have.

• What is the solution?

Page 51: Basic Statistics Introduction to Inferential Statistics

Solution

Use a t-distribution rather than the z-distribution

$z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$

(See pages 297-298)

Page 52: Basic Statistics Introduction to Inferential Statistics

Characteristics

z-distribution

• Mean of 0
• Mound-Shaped (Normal)
• SD is same as population except divided by the square root of n

t-distribution

• Mean of 0
• Mound-Shaped (Not exactly Normal)
• SD is same as the sample except divided by square root of n

• Thus, t is more variable than z; how much depends on the degrees of freedom.

Page 53: Basic Statistics Introduction to Inferential Statistics

Understanding the t-Distribution

Recall the difference between the sample and population variances.

Population: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

Sample: $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$

Other than using sample numerical indicators rather than population numerical indicators, the only difference is that the sample variance is divided by “n - 1” rather than “N”.

The reason is that if the squared deviations from the sample mean were divided by n, the result would tend to underestimate $\sigma^2$; dividing by n - 1 corrects this problem (the adjustment is commonly called Bessel's correction). The quantity n - 1 turned out to have greater significance in statistics, notably in William S. Gosset's work on small samples, and it is called the degrees of freedom.
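The underestimate is easy to see by simulation. The sketch below draws many small samples from a population whose variance is known to be 225 (standard deviation 15) and averages the two versions of the variance. The population mean of 100, the sample size of 5, and the number of trials are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(3)

SIGMA = 15.0     # population SD, so the true variance is 225
N = 5            # small illustrative sample size
TRIALS = 20_000  # hypothetical number of repeated samples

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(TRIALS):
    sample = [random.gauss(100.0, SIGMA) for _ in range(N)]
    divide_by_n.append(statistics.pvariance(sample))         # divides by n
    divide_by_n_minus_1.append(statistics.variance(sample))  # divides by n - 1

mean_div_n = statistics.fmean(divide_by_n)
mean_div_n_minus_1 = statistics.fmean(divide_by_n_minus_1)

print("average variance dividing by n:    ", round(mean_div_n, 1))
print("average variance dividing by n - 1:", round(mean_div_n_minus_1, 1))
```

Dividing by n should average out near 180 (that is, 225 × 4/5), while dividing by n - 1 should average out near the true value of 225.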

Page 54: Basic Statistics Introduction to Inferential Statistics

William Gosset

William Gosset was the quality control engineer at the Guinness Brewery in Dublin in the early 1900s. For some reason, getting samples of 30 or more of his products proved difficult. This prompted his search for a small-sample statistic that resulted in his publication of the t-test. He published it under the pen name of “Student” for a couple of reasons. First, moonlighting was frowned upon by Guinness. Second, he wanted to honor his teachers, particularly Karl Pearson.

Page 55: Basic Statistics Introduction to Inferential Statistics

Review Degrees of Freedom

Degrees of freedom are the number of observations free to vary, thus the number of observations that contribute to the variance in a sample.

Recall the formula for the sample variance:

$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$

In order to compute the variance, we first must compute the sample mean. We find the sample mean by summing the scores and dividing by n:

$\bar{X} = \frac{\sum X}{n}$

Consider a problem with n = 3 and the sample mean = 5. Thus,

$5 = \frac{\sum X}{3} = \frac{15}{3}$

And if we know the first two numbers are 3 and 8:

$5 = \frac{3 + 8 + ?}{3} = \frac{15}{3}$

The ? must be 4. Thus, only 2 of the 3 scores are free to vary. That is to say there are n – 1 degrees of freedom.
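The same reasoning can be written as a few lines of arithmetic. In this sketch, once the mean and the first two scores are fixed, the third score is fully determined; the variable names are illustrative.

```python
n = 3
sample_mean = 5
known_scores = [3, 8]   # the two scores that were free to vary

required_total = n * sample_mean                 # the scores must sum to 15
last_score = required_total - sum(known_scores)  # no freedom left for this one
degrees_of_freedom = n - 1

print(last_score, degrees_of_freedom)  # 4 2
```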

Page 56: Basic Statistics Introduction to Inferential Statistics

This concept of degrees of freedom is used for many different statistics, not just the t-statistic.

The t-distribution is presented in tables, but not complete tables like the normal curve z-scores. This is because it would take a different table for each different degree of freedom. Thus, only commonly used alpha values are tabulated.

Page 57: Basic Statistics Introduction to Inferential Statistics

New Research Situation

The current rate for producing 5 amp fuses at Moe’s Electric Company is 250 per hour. A new machine has been purchased and installed that, according to the supplier, will increase the production rate. Is the new machine faster than the old one?

Page 58: Basic Statistics Introduction to Inferential Statistics

[Figure: a sample (A) with mean $\bar{X}$ compared with a population whose mean $\mu$ is unknown.]

Is there a significant difference in the mean score of the dependent variable between the sample and the population?

Page 59: Basic Statistics Introduction to Inferential Statistics

Step by Step: The One-Sample Test of Hypotheses using the t Test

1. State Research Question

2. Establish the Hypotheses

3. Establish Level of Significance

4. Collect Data

5. Calculate Statistical Test

6. Interpret the Results

Page 60: Basic Statistics Introduction to Inferential Statistics

1. Stating Research Problem:

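Is the production rate of the new machine more than 250 per hour?

[Figure: a sample with mean $\bar{X}$ is drawn (sampling) from a population with hypothesized mean $\mu_0$; is there a difference between $\bar{X}$ and $\mu_0$?]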

Page 61: Basic Statistics Introduction to Inferential Statistics

2. Setting Hypotheses

Null Hypothesis: The mean production rate of the new machine is equal to or less than 250 per hour. Notice that the null now contains the other side of the directional alternative!

$H_o: \mu \le 250$

Research or Alternative Hypothesis: The mean production rate is greater than 250 per hour.

$H_a: \mu > 250$

One-tail test

Page 62: Basic Statistics Introduction to Inferential Statistics

3. Setting your level of significance

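[Figure: one-tailed test of significance with $\alpha = .05$ in the upper tail. The region for accepting $H_0: \mu_0 \le 250$ lies below the critical value, and the region for rejecting it in favor of $H_1: \mu_1 > 250$ lies above it. A two-tailed test of significance is shown for comparison.]

Possible Conclusion (if the result is significant at $\alpha = .05$): The sample is from a population with a mean greater than that of the null hypothesis.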

Page 63: Basic Statistics Introduction to Inferential Statistics

4. Collecting Data

A sample of 10 randomly selected hours from last month revealed the mean hourly production on the new machine was 256, with a sample standard deviation of 4.67 per hour.

Hours | Production
1 | 254
2 | 253
… | …
10 | 250

Page 64: Basic Statistics Introduction to Inferential Statistics

5. Analyzing the Data: Calculating the Test Statistic

$t = \frac{\bar{X} - \mu}{s_{\bar{X}}}$

The t-test formula uses quantities from the sampling distribution: the difference between the sample mean $\bar{X}$ and the population mean $\mu$, divided by the standard error, which is now computed from the sample standard deviation rather than the population standard deviation required in the z test. These are the same as we discussed in Estimation.

Page 65: Basic Statistics Introduction to Inferential Statistics

Steps in Calculating the t Test

• From our sample of 10 randomly selected hours, we calculated the sample mean to be 256 and the standard deviation to be 4.67.

• Using our t test formula we can determine where our sample mean would fall, if the population mean is 250.

Page 66: Basic Statistics Introduction to Inferential Statistics

Calculating the t statistic

$t = \frac{\bar{X} - \mu}{s_{\bar{X}}} = \frac{256 - 250}{4.67/\sqrt{10}} = \frac{6}{1.48} = 4.06$

Thus, our obtained mean from our sample is 4.06 standard errors above the hypothesized mean of 250. Would this value be in our rejection region?

YES!
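The t calculation can be checked with a short script. Python's standard library has no t-distribution, so this sketch hard-codes the one-tailed critical value of 1.833 (alpha = .05, 9 degrees of freedom) as read from a t-table; the variable names are illustrative.

```python
import math

sample_mean = 256.0   # mean hourly production in the 10 sampled hours
mu0 = 250.0           # hypothesized mean (the old production rate)
s = 4.67              # sample standard deviation
n = 10                # sample size

T_CRITICAL = 1.833    # one-tailed .05, df = 9, taken from a t-table

standard_error = s / math.sqrt(n)          # uses s, not sigma
t = (sample_mean - mu0) / standard_error   # distance in standard errors
degrees_of_freedom = n - 1

# Directional alternative (mu > 250): reject only for large positive t.
reject_null = t > T_CRITICAL

print(round(standard_error, 2), round(t, 2), degrees_of_freedom, reject_null)
# 1.48 4.06 9 True
```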

Page 67: Basic Statistics Introduction to Inferential Statistics

Graphical representation of our calculations

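[Figure: t-distribution with the rejection region ($\alpha = .05$) in the upper tail; the calculated t of 4.06 falls in the region for rejecting $H_0: \mu_0 \le 250$ in favor of $H_1: \mu_1 > 250$.]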

Page 68: Basic Statistics Introduction to Inferential Statistics

6. Interpreting the Results

[Figure: sampling distribution of $\bar{X}$ showing the hypothesized mean of 250 (t = 0) and the sample mean of 256 (t = 4.066). The difference between the sample mean and the population mean is 4.066 standard errors, which falls in the region for rejecting $H_0$ at $\alpha = .05$.]

Page 69: Basic Statistics Introduction to Inferential Statistics

6. Interpreting the Results

• Since our sample mean is not in the area where we would expect 95% of all sample means from a distribution where the population mean is 250, and is instead in our rejection region, we would reject the Null Hypothesis that $\mu \le 250$ and accept the alternative that $\mu > 250$.

• We would state that we reject the Null Hypothesis at the .05 level of significance.

Page 70: Basic Statistics Introduction to Inferential Statistics

Assumption Required

Since the n is too small to invoke the Central Limit Theorem, we can no longer be sure that the sampling distribution is normal. In fact, we have already learned that it is not; it is a t-distribution. In order for this to hold, we must assume that the original population is normally distributed.

The smaller the n, the more important this assumption is. For example, with an n of 4, normality might be important. However, by the time n is 30 or more, the normality assumption is not necessary as the CLT takes over. We can see this by examining a t-table (Table D).

Page 71: Basic Statistics Introduction to Inferential Statistics

Assumption Summary

For a one-sample z-test, there is only one assumption:

• Sample was obtained randomly

For a one-sample t-test, there are two assumptions:

• Sample was obtained randomly
• Original population is normally distributed

Page 72: Basic Statistics Introduction to Inferential Statistics

Comparing Estimation and Hypothesis Testing

• In estimation, we use data from a sample to estimate, with a declared level of confidence, where we think the population mean is.

• In hypothesis testing, we use data from a sample to evaluate the acceptability of the hypothesized value for the population mean.

Page 73: Basic Statistics Introduction to Inferential Statistics

Comparing the Two Formulae

$CI = \bar{X} \pm t\,\frac{s}{\sqrt{n}}$

$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$

Notice that I have updated the earlier confidence interval formula to use the t distribution. Also note that the same standard error is used in both.

Page 74: Basic Statistics Introduction to Inferential Statistics

Calculating the t Statistic and the 95% Confidence Interval

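$t = \frac{\bar{X} - \mu}{s_{\bar{X}}} = \frac{256 - 250}{4.67/\sqrt{10}} = \frac{6}{1.48} = 4.06$

$95\%\ CI = \bar{X} \pm t^{*}\,\frac{s}{\sqrt{n}} = 256 \pm 1.833\,(1.48) = 256 \pm 2.7$

*The value of 1.833 (called the critical value) is found in Table D, page 538, using one-tailed .05 and 9 degrees of freedom. We must remember that our alternative was that $\mu$ was greater than 250, which is why the one-tailed value is used. Strictly, a two-sided interval built with this value has 90% confidence, and it is the lower limit alone that carries 95% one-sided confidence; a two-sided 95% interval would use the two-tailed critical value, 2.262 for 9 degrees of freedom.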

Page 75: Basic Statistics Introduction to Inferential Statistics

Comparing the 95% Confidence Interval and the t-test.

95% CI: 256 – 2.7 = 253.3 and 256 + 2.7 = 258.7.

We are 95% confident that the population mean is between 253.3 and 258.7, based on the data from our random sample.

The calculated t statistic was 4.06, meaning that the obtained mean is 4.06 standard errors above the hypothesized mean of 250, and therefore we rejected the null of 250.

Page 76: Basic Statistics Introduction to Inferential Statistics

A Graphical Comparison

[Figure: number line showing the hypothesized value of 250, the rejection region, the calculated t = 4.06, and the 95% confidence interval from 253.3 to 258.7.]

Page 77: Basic Statistics Introduction to Inferential Statistics

Arriving at the Same Conclusion

• Notice that the confidence interval does not contain the hypothesized value [250]; thus, 250 is a bad guess.

• General rule: If the confidence interval does not contain the hypothesized value we will reject the null hypothesis, the same as if we had calculated the t statistic.

• Thus, we can conduct a test of hypothesis simply by calculating the confidence interval.
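The general rule can be demonstrated with the fuse-machine numbers. The sketch below builds the interval with the slides' critical value of 1.833 and compares "the interval excludes 250" with "the t statistic exceeds the critical value". Pairing the interval with that same critical value is what makes the two checks agree; the variable names are illustrative.

```python
import math

sample_mean = 256.0   # mean hourly production in the sample
mu0 = 250.0           # hypothesized mean
s = 4.67              # sample standard deviation
n = 10                # sample size
T_CRITICAL = 1.833    # critical value used in the slides (df = 9)

standard_error = s / math.sqrt(n)
bound = T_CRITICAL * standard_error
lower = sample_mean - bound
upper = sample_mean + bound

t = (sample_mean - mu0) / standard_error

# Two routes to the same decision.
interval_excludes_mu0 = not (lower <= mu0 <= upper)
test_rejects_null = abs(t) > T_CRITICAL

print(round(lower, 1), round(upper, 1), interval_excludes_mu0, test_rejects_null)
# 253.3 258.7 True True
```

Algebraically, 250 lies outside the interval exactly when the sample mean is more than 1.833 standard errors away from 250, which is the same condition as the t statistic exceeding 1.833 in size.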