Characterizing Variability and Comparing Patterns from Data

“Statistics”

Module 3

CHEE320 - Fall 2001

J. McLellan 2

Outline

• random samples
• notion of a statistic
• estimating the mean - sample average
• assessing the impact of variation on estimates - sampling distribution
• estimating variance - sample variance and standard deviation
• making decisions - comparisons of means, variances using confidence intervals, hypothesis tests

Random Samples

Scenario -
» we have an underlying pattern of variability for a process which we would like to characterize -- the population
» we perform a series of experiments on the process in such a way that the results are independent - the outcome of one experiment has no influence on any other experiment
» the underlying distribution in place during each experimental run is identical to that of the population
» when we run each experiment, we are collecting a value from the random variable Xi - which has uncertainty
» Xi represents the “i-th” act of sampling - referred to as a sample random variable

Definition - Random Sample

A random sample of size “n” of a population random variable X is a collection of random variables X1, …, Xn such that

» the Xi’s are independent
» the Xi’s have distributions identical to that of X, i.e.,

$$F_{X_i}(x) = F_X(x)$$

Each Xi represents a snapshot of the process. The Xi’s are referred to as sample random variables.

What do we do with these sample values?...

Sample Average

• used to estimate the mean

• given “n” samples, X1, …, Xn, compute

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

• interpretation - a rule for computing the sample average, involving sampling

• $\bar{X}$ is a random variable

• observed value

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Lower case is used to denote observed values of the sample random variables and average.

Statistics

• Sample average is an example of a “statistic”

Definition

A statistic is a function of sample random variables that is used to estimate a value of a parameter, and does not depend on any unknown parameters.

– e.g., sample average estimates mean and doesn’t depend on unknown parameters

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Sampling Distribution

A statistic is a random variable, with its own probability distribution
– distribution arises from probability distribution of underlying population, via the sample random variables
– distribution of the statistic is called the sampling distribution
– characteristics of the sampling distribution depend on:
» the form of the statistic - e.g., linear function of the sample random variables
» the distribution of the underlying population

Sampling Distribution for the Sample Average

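• determine the mean and variance of the sample average

Mean

$$E\{\bar{X}\} = E\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}E\left\{\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}\sum_{i=1}^{n} E\{X_i\} = \frac{1}{n}\sum_{i=1}^{n} \mu = \frac{n\mu}{n} = \mu$$

The expectation moves inside the sum by linearity, and E{Xi} = μ because each sample random variable has the population distribution (independence is not needed for this step; it is needed for the variance).

Value expected on average of the sample average is the true mean of the process - sample average is an UNBIASED estimator for the mean.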

Sampling Distribution for the Sample Average

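Variance

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

The variance of the sum equals the sum of the variances because the sample random variables are independent.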

Aside - Variance

If we have a sum of independent random variables, X and Y, with “a” and “b” constants, then

$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$

Variance of Sample Average

Interpretation
– variance of sample average is $\sigma^2 / n$
» as n becomes larger, variance of sample average becomes smaller
» as more data is used, estimate becomes more precise
» sample average represents a concentration of information
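A small simulation can illustrate the σ²/n result. The sketch below is not from the original slides: it assumes a Normal population with an arbitrary mean of 10 and standard deviation of 2, and the function name `variance_of_sample_average` is made up for illustration.

```python
import random
import statistics

def variance_of_sample_average(n, mu=10.0, sigma=2.0, repeats=20000, seed=1):
    """Repeat an n-point experiment many times and return the
    variance of the resulting sample averages."""
    rng = random.Random(seed)
    averages = []
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        averages.append(sum(sample) / n)
    return statistics.pvariance(averages)

for n in (1, 4, 16):
    simulated = variance_of_sample_average(n)
    predicted = 2.0 ** 2 / n  # sigma^2 / n with the assumed sigma = 2
    print(f"n={n:2d}  simulated={simulated:.3f}  sigma^2/n={predicted:.3f}")
```

The simulated values should sit close to σ²/n, shrinking by roughly a factor of four each time n is quadrupled.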

Distribution of the Sample Average

– in preceding slides, no assumption was made about distribution of population (e.g., normal, exponential)
– Central Limit Theorem implies that distribution of sample average approaches a Normal distribution when number of samples becomes large
» even if underlying population is non-Normal
» important consequences for comparing values - hypothesis tests and confidence limits
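One way to see the Central Limit Theorem at work is to track the skewness of the sample average as n grows. The sketch below is an illustration under assumptions, not material from the slides: the population is taken to be exponential with rate 1 (strongly skewed), and the helper names `skewness` and `sample_averages` are hypothetical. A Normal distribution has skewness 0.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: near 0 for a symmetric (e.g. Normal) distribution."""
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std_dev) ** 3 for v in values])

def sample_averages(n, repeats=20000, seed=2):
    """Averages of n draws from an exponential population (assumed rate 1)."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(repeats)]

for n in (1, 5, 50):
    print(f"n={n:2d}  skewness of sample average = {skewness(sample_averages(n)):.2f}")
```

The skewness should fall toward zero as n increases, even though each individual observation comes from a skewed population.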

Sample Variance

… is estimated using the following statistic:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

Observed value:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Mean of the sample variance:

$$E\{s^2\} = \sigma^2$$

Sample variance is an UNBIASED estimator of variance.

Sample Standard Deviation

… is simply the square root of the sample variance

BUT

– sample standard deviation is a biased estimator of population standard deviation
» value on average does not tend to population value

$$E\{s\} \neq \sigma$$
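The contrast between the unbiased sample variance and the biased sample standard deviation can be checked numerically. This is a sketch under assumptions not found in the slides: a Normal population with standard deviation 2, samples of 5 points, and a hypothetical helper `average_estimates`.

```python
import random
import statistics

def average_estimates(n=5, sigma=2.0, repeats=20000, seed=3):
    """Return the long-run averages of s^2 and s over many repeated samples."""
    rng = random.Random(seed)
    s2_values, s_values = [], []
    for _ in range(repeats):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)  # sample variance, divides by n-1
        s2_values.append(s2)
        s_values.append(s2 ** 0.5)
    return statistics.fmean(s2_values), statistics.fmean(s_values)

mean_s2, mean_s = average_estimates()
print(f"average s^2 = {mean_s2:.2f}  (population variance 4.00)")
print(f"average s   = {mean_s:.2f}  (population std dev 2.00)")
```

The average of s² should land near 4, while the average of s should fall noticeably below 2 for a sample this small.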

Confidence Intervals

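Consider the sample average

$$\bar{X} \sim N(\mu_X, \sigma_X^2 / n)$$

where “~” reads “is distributed as” and $N(\mu_X, \sigma_X^2/n)$ reads “Normally distributed with mean $\mu_X$ and variance $\sigma_X^2/n$”.

We can standardize this to have zero mean and unit variance:

$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$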


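Distribution for standard normal:

Start with -

$$P(-1.96 \le Z \le 1.96) = 0.95$$

and consider Z -

$$P\left(-1.96 \le \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} \le 1.96\right) = 0.95$$

$$P(-1.96\,\sigma_X/\sqrt{n} \le \bar{X} - \mu_X \le 1.96\,\sigma_X/\sqrt{n}) = 0.95$$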


Rearrange this last statement to obtain:

$$P(\bar{X} - 1.96\,\sigma_X/\sqrt{n} \le \mu_X \le \bar{X} + 1.96\,\sigma_X/\sqrt{n}) = 0.95$$

The two endpoints are RANDOM; the mean $\mu_X$ in the middle is NOT random.

Interpretation -
» limits of interval have uncertainty - if we repeated sequence of estimating average and computing the limits, the endpoints would change somewhat BUT 95% of the time, the interval would contain the true value of the mean


– this interval DOES NOT imply that the mean is uncertain

[Picture - sequence of intervals associated with repeated experimentation, shown against the true value of the mean]
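The repeated-experimentation picture can be imitated with a simulation that counts how often the interval captures the true mean. The following is a sketch only: the true mean (70), standard deviation (2.1), and sample size (10) are arbitrary assumptions, and `coverage` is a hypothetical helper name.

```python
import random

def coverage(n=10, mu=70.0, sigma=2.1, repeats=20000, seed=4):
    """Fraction of repeated experiments whose 95% interval contains mu."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5  # known-variance case
    hits = 0
    for _ in range(repeats):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # The endpoints move from run to run; mu stays fixed.
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / repeats

print(f"fraction of intervals containing the true mean: {coverage():.3f}")
```

The fraction should come out close to 0.95, which is the sense in which the interval is a 95% interval.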


General result for mean -

100(1-α)% confidence interval given by:

$$\bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} \le \mu_X \le \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n}$$

where -
» $z_{\alpha/2}$ - “fence” - value for which $P(Z > z_{\alpha/2}) = \alpha/2$
» value obtained from tables
• 95% - value is 1.96 - approximately 2
• 99% - value is 2.576
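As an alternative to tables, the fence values can be computed from the inverse of the standard Normal cumulative distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `z_fence` is an assumption made for illustration.

```python
from statistics import NormalDist

def z_fence(alpha):
    """Value z such that P(Z > z) = alpha/2 for a standard Normal Z."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

print(f"95% interval fence: {z_fence(0.05):.3f}")
print(f"99% interval fence: {z_fence(0.01):.3f}")
```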


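General Approach
» form a quantity with a known distribution that depends on the parameter of interest

$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$

» form a probability statement - choose fences (limits) with a known probability

$$P\left(-1.96 \le \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} \le 1.96\right) = 0.95$$

» re-arrange statement to obtain an interval specifying a range of values for the parameter of interest

$$P(\bar{X} - 1.96\,\sigma_X/\sqrt{n} \le \mu_X \le \bar{X} + 1.96\,\sigma_X/\sqrt{n}) = 0.95$$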

Confidence Intervals for Mean

When population variance is “known”, 100(1-α)% confidence interval is -

$$\bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} \le \mu_X \le \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n}$$

Known variance -
» knowledge of variance when process has been operating steadily for long period of time
» on basis of extensive operating experience
» “large number of data points”


What if variance is unknown?
» Estimate using sample variance s²

Follow previous approach by forming standardized quantity:

$$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$

» issue - s² is a statistic itself, and is a random variable
» this quantity no longer has a standard Normal distribution

Solution -
» what is the probability distribution of this quantity, when data are Normally distributed?

Student’s t Distribution

When the data are Normally distributed,

$$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$

follows a Student’s t distribution with n-1 degrees of freedom

Degrees of freedom -
» number of statistically independent pieces of information used to compute sample variance
» recall that in s², we divide by n-1 where n is the number of data points

Student’s t Distribution

… has a shape similar to that of Normal distribution
» symmetric
» values are available in tables
» extra parameter in tables - degrees of freedom

[Figure: Student’s t density with 3 degrees of freedom]


Variance Unknown
» estimated using sample variance
» 100(1-α)% case

$$\bar{X} - t_{\nu,\alpha/2}\,s_X/\sqrt{n} \le \mu_X \le \bar{X} + t_{\nu,\alpha/2}\,s_X/\sqrt{n}$$

» ν is the number of degrees of freedom (n-1), where n is number of data points used to compute sample variance (and average)
» obtained following identical argument used in the known variance case

Example #1

Conversion in a chemical reactor using new catalyst preparation

» data collected, average conversion computed using 10 data points is 76.1%

» prior operating history indicates that variance of conversion is 4.41 %²

» determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%

Example #1

• Confidence interval - 95%
» upper tail area is 2.5%: $z_{\alpha/2} = z_{0.025} = 1.96$
» standard devn = sqrt(4.41) = 2.1
» confidence interval

$$76.1 - (1.96)(2.1)/\sqrt{10} \le \mu_X \le 76.1 + (1.96)(2.1)/\sqrt{10}$$

$$74.8 \le \mu_X \le 77.4$$

» conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
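The arithmetic of this example can be reproduced in a few lines. This is a sketch using only the numbers stated in the example; the function name `mean_interval` is hypothetical.

```python
import math

def mean_interval(xbar, std_dev, n, fence):
    """Interval xbar +/- fence * std_dev / sqrt(n) for the mean."""
    half_width = fence * std_dev / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Known variance 4.41 %^2, 10 data points, average 76.1 %, z fence 1.96
low, high = mean_interval(76.1, math.sqrt(4.41), 10, 1.96)
print(f"{low:.1f} <= mean conversion <= {high:.1f}")
print("interval contains current conversion of 70%:", low <= 70.0 <= high)
```

Because 70% falls outside the computed interval, the check prints `False`, matching the conclusion that the change is significant.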

Example #2

Conversion in a chemical reactor using new catalyst preparation

» data collected, average conversion computed using 10 data points is 76.1%

» current data set of 10 points used to estimate sample variance, which is 5.3 %²

» determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%

Example #2

• Confidence interval - 95%
» variance UNKNOWN - need to use Student’s t distribution -- degrees of freedom = 10-1 = 9
» upper tail area is 2.5%: $t_{\nu,\alpha/2} = t_{9,0.025} = 2.262$
» standard devn = sqrt(5.3) = 2.3
» confidence interval

$$76.1 - (2.262)(2.3)/\sqrt{10} \le \mu_X \le 76.1 + (2.262)(2.3)/\sqrt{10}$$

$$74.5 \le \mu_X \le 77.7$$

» conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
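To show what the t fence does to the interval, the sketch below recomputes this example and compares the half-width against the one a Normal fence would give. The t value 2.262 is taken from the example; the variable names are illustrative, and the standard deviation is left unrounded, which does not change the interval to one decimal place.

```python
import math

T_FENCE = 2.262  # t value for 9 degrees of freedom, upper tail area 2.5%
Z_FENCE = 1.96   # standard Normal fence, for comparison only

n = 10
xbar = 76.1
s = math.sqrt(5.3)  # sample standard deviation from sample variance 5.3 %^2

half_width_t = T_FENCE * s / math.sqrt(n)
half_width_z = Z_FENCE * s / math.sqrt(n)
low, high = xbar - half_width_t, xbar + half_width_t

print(f"t interval: {low:.1f} <= mean conversion <= {high:.1f}")
print(f"half-width with t fence: {half_width_t:.2f}, with z fence: {half_width_z:.2f}")
```

The t-based interval is wider than a z-based one would be, reflecting the extra uncertainty from estimating the variance with only 10 points.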

Confidence Intervals for Variance

First, we need to know the sampling distribution of the sample variance:

• when data are Normally distributed, sample variance is the sum of squared Normal random variables

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

» squaring “folds over” the negative values of the Normal random variable and makes them positive - asymmetry

Chi-squared distribution

• is the distribution of a squared standard Normal random variable

$$Z^2 \sim \chi^2_1$$

» Chi-squared random variable with 1 degree of freedom
» degrees of freedom = number of independent standard Normal random variables being squared
» e.g., 3 degrees of freedom

$$Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$$

[Figure: Chi-squared density with 3 degrees of freedom]
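The construction of a Chi-squared random variable from squared standard Normals can be tried directly. The sketch below is illustrative and not from the slides; `chi_squared_draws` is a hypothetical helper. It relies on the standard fact that a Chi-squared variable has mean equal to its degrees of freedom.

```python
import random
import statistics

def chi_squared_draws(dof, repeats=20000, seed=5):
    """Each draw is the sum of `dof` squared standard Normal values."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))
            for _ in range(repeats)]

draws = chi_squared_draws(3)
print(f"mean    = {statistics.fmean(draws):.2f}")
print(f"median  = {statistics.median(draws):.2f}")
print(f"minimum = {min(draws):.2f}")
```

The mean should be close to 3, the median should sit below the mean (the distribution is skewed to the right), and no draw can be negative.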

Sampling distribution - sample variance

Sample variance
» is the sum of n squared Normal random variables BUT we add the sum of squared deviations from the sample average
» given value of sample average introduces constraint - given Xbar, we only have n-1 independent random variables (the n-th can be computed from the average)
» sample variance contains n-1 independent Normal random variables --> degrees of freedom for Chi-squared distribution is n-1

$$\frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2_{n-1}$$

Confidence Intervals - Sample Variance

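• Form probability statement

$$P\left(\chi^2_{n-1,1-\alpha/2} \le \frac{(n-1)\,s^2}{\sigma^2} \le \chi^2_{n-1,\alpha/2}\right) = 1-\alpha$$

• Re-arrange statement

$$P\left(\frac{(n-1)\,s^2}{\chi^2_{n-1,\alpha/2}} \le \sigma^2 \le \frac{(n-1)\,s^2}{\chi^2_{n-1,1-\alpha/2}}\right) = 1-\alpha$$

• 100(1-α)% interval is

$$\frac{(n-1)\,s^2}{\chi^2_{n-1,\alpha/2}} \le \sigma^2 \le \frac{(n-1)\,s^2}{\chi^2_{n-1,1-\alpha/2}}$$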

Confidence Limits for Variance

Notes
1) the tail areas are equal
» symmetric tail areas
however the interval can be asymmetric
» consequence of asymmetry of Chi-squared distribution

2) $\chi^2_{n-1,1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of 1-α/2 and n-1 degrees of freedom

[Figure: Chi-squared density with equal tail areas]

Variance Confidence Intervals - Example

Temperature controller has been implemented on a polymer reactor -

» variance under previous operation was 4.7 °C²
» under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
» is the variance under the new control operation significantly better?
• i.e., is variance under new operation significantly lower?


Use confidence interval for variance
» n-1 = 10-1 = 9 degrees of freedom
» form 95% confidence interval (α = 0.05)
» from tables:

$$\chi^2_{9,0.025} = 19.0 \qquad \chi^2_{9,1-0.025} = 2.7$$

» interval for variance:

$$1.52 \le \sigma^2 \le 10.67$$

» conclusion - interval contains the previous variance of 4.7, so variance reduction isn’t significant after background variation in sample variance computation is taken into account
» note that interval isn’t symmetric
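The interval endpoints follow directly from the table values. A minimal sketch using only the numbers quoted in this example (variable names are illustrative):

```python
dof = 10 - 1          # degrees of freedom
s2 = 3.2              # sample variance under new operation
chi2_upper = 19.0     # Chi-squared value with upper tail area 0.025, 9 dof
chi2_lower = 2.7      # Chi-squared value with upper tail area 0.975, 9 dof
previous_variance = 4.7

# Dividing by the larger table value gives the lower limit, and vice versa.
low = dof * s2 / chi2_upper
high = dof * s2 / chi2_lower

print(f"{low:.2f} <= variance <= {high:.2f}")
print("interval contains previous variance:", low <= previous_variance <= high)
```

The interval contains 4.7, so the data do not rule out the variance being unchanged.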


Comment – precision of the variance estimate is sensitive to degrees of freedom
» need larger number of data points to obtain precise estimate
» e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:

$$2.04 \le \sigma^2 \le 5.71$$

» cf. previous interval with 10 data points

$$1.52 \le \sigma^2 \le 10.67$$

Conclusion still doesn’t change, however.
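The effect of degrees of freedom on interval width can be compared side by side. In this sketch the 9-degree-of-freedom table values (19.0 and 2.7) come from the example; the 30-degree-of-freedom values (47.0 and 16.8) are standard Chi-squared table entries rounded to one decimal, which is an assumption about the rounding used on the slide. The function name `variance_interval` is hypothetical.

```python
def variance_interval(s2, dof, chi2_upper, chi2_lower):
    """Limits dof*s2/chi2_upper <= variance <= dof*s2/chi2_lower."""
    return dof * s2 / chi2_upper, dof * s2 / chi2_lower

cases = {
    9: variance_interval(3.2, 9, 19.0, 2.7),
    30: variance_interval(3.2, 30, 47.0, 16.8),  # assumed table rounding
}

for dof, (low, high) in cases.items():
    print(f"{dof:2d} dof: {low:.2f} <= variance <= {high:.2f}  "
          f"width = {high - low:.2f}  contains 4.7: {low <= 4.7 <= high}")
```

The interval narrows considerably with more data, yet both intervals still contain the previous variance of 4.7.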
