Characterizing Variability and Comparing Patterns from Data
“Statistics”
Module 3
CHEE320 - Fall 2001
J. McLellan
Outline
• random samples
• notion of a statistic
• estimating the mean - sample average
• assessing the impact of variation on estimates - sampling distribution
• estimating variance - sample variance and standard deviation
• making decisions - comparisons of means, variances using confidence intervals, hypothesis tests
Random Samples
Scenario -
» we have an underlying pattern of variability for a process which we would like to characterize -- the population
» we perform a series of experiments on the process in such a way that the results are independent - the outcome of one experiment has no influence on any other experiment
» the underlying distribution in place during each experimental run is identical to that of the population
» when we run each experiment, we are collecting a value from the random variable $X_i$ - which has uncertainty
» $X_i$ represents the “i-th” act of sampling - referred to as a sample random variable
Definition - Random Sample
A random sample of size “n” of a population random variable $X$ is a collection of random variables $X_1, \ldots, X_n$ such that
» the $X_i$’s are independent
» the $X_i$’s have distributions identical to that of $X$, i.e.,
$$F_{X_i}(x) = F_X(x)$$
Each $X_i$ represents a snapshot of the process. The $X_i$’s are referred to as sample random variables.
What do we do with these sample values?...
Sample Average
• used to estimate the mean
• given “n” samples, $X_1, \ldots, X_n$, compute
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
• interpretation - a rule for computing the sample average, involving sampling
• $\bar{X}$ is a random variable
• observed value
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Lower case is used to denote observed values of the sample random variables and average.
Statistics
• Sample average is an example of a “statistic”
Definition
A statistic is a function of sample random variables that is used to estimate a value of a parameter, and does not depend on any unknown parameters.
– e.g., sample average estimates mean and doesn’t depend on unknown parameters
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Sampling Distribution
A statistic is a random variable, with its own probability distribution
– distribution arises from probability distribution of underlying population, via the sample random variables
– distribution of the statistic is called the sampling distribution
– characteristics of the sampling distribution depend on:
» the form of the statistic - e.g., linear function of the sample random variables
» the distribution of the underlying population
Sampling Distribution for the Sample Average
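• determine the mean and variance of the sample average
Mean
$$E\{\bar{X}\} = E\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}E\left\{\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}\sum_{i=1}^{n} E\{X_i\} = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{n\mu}{n} = \mu$$
– the expectation moves inside the sum because expectation is linear (this step holds with or without independence); $E\{X_i\} = \mu$ because each $X_i$ has the same distribution as the population
Value expected on average of the sample average is the true mean of the process - sample average is an UNBIASED estimator for the mean.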
Sampling Distribution for the Sample Average
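Variance
$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
– the variance of the sum equals the sum of the variances because the sample random variables are independent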
Aside - Variance
If we have a sum of independent random variables, $X$ and $Y$, with “a” and “b” constants, then
$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$
Variance of Sample Average
Interpretation
– variance of sample average is $\sigma^2/n$
» as n becomes larger, variance of sample average becomes smaller
» as more data is used, estimate becomes more precise
» sample average represents a concentration of information
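The $\sigma^2/n$ result can be checked numerically. The following is a minimal sketch, assuming Normally distributed data with illustrative values $\mu = 10$ and $\sigma = 2$ (these numbers and the function name are not from the slides); it repeats the "collect n points and average" experiment many times and measures the spread of the resulting averages.

```python
import random
import statistics


def variance_of_sample_average(n, mu=10.0, sigma=2.0, repeats=10000, seed=0):
    """Empirical variance of the sample average of n Normal(mu, sigma) draws.

    mu, sigma, repeats and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        averages.append(sum(sample) / n)
    return statistics.pvariance(averages)


if __name__ == "__main__":
    for n in (1, 4, 16):
        empirical = variance_of_sample_average(n)
        print(f"n={n:2d}  empirical={empirical:.3f}  sigma^2/n={2.0 ** 2 / n:.3f}")
```

The empirical values should land close to 4, 1 and 0.25, shrinking in proportion to 1/n.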
Distribution of the Sample Average
– in preceding slides, no assumption was made about distribution of population (e.g., normal, exponential)
– Central Limit Theorem implies that distribution of sample average approaches a Normal distribution when number of samples becomes large
» even if underlying population is non-Normal
» important consequences for comparing values - hypothesis tests and confidence limits
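One way to see the Central Limit Theorem at work is to start from a strongly skewed population and watch the skewness of the sample average shrink toward zero, the value for a Normal distribution. The sketch below assumes an exponential population with rate 1; that choice, the helper names, and the sample sizes are illustrative and not from the slides.

```python
import random
import statistics


def sample_averages(n, repeats=5000, seed=1):
    """Sample averages of n draws from an exponential(rate=1) population."""
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(repeats)
    ]


def skewness(values):
    """Standardized third moment; zero for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


if __name__ == "__main__":
    for n in (1, 5, 50):
        print(f"n={n:2d}  skewness of sample average = {skewness(sample_averages(n)):.2f}")
```

The skewness is large for n = 1 (the raw exponential data) and falls steadily as n grows, which is the "approaches a Normal distribution" behaviour described above.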
Sample Variance
The population variance $\sigma^2$ is estimated using the following statistic:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
Observed value:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Mean of the sample variance:
$$E\{s^2\} = \sigma^2$$
Sample variance is an UNBIASED estimator of variance.
Sample Standard Deviation
… is simply the square root of the sample variance
BUT
– sample standard deviation is a biased estimator of population standard deviation
» value on average does not tend to population value
$$E\{s\} \ne \sigma$$
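Both claims (unbiased $s^2$, biased $s$) can be illustrated by simulation. This is a minimal sketch assuming Normal data with $\sigma = 2$ and a small sample size of n = 5; those values and the function names are illustrative assumptions, not from the slides.

```python
import math
import random
import statistics


def sample_variance(values):
    """Sample variance with the n-1 divisor, as defined on the slide."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def average_s2_and_s(n=5, sigma=2.0, repeats=20000, seed=2):
    """Average of s^2 and of s over many repeated samples of size n."""
    rng = random.Random(seed)
    s2_values = [
        sample_variance([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(repeats)
    ]
    mean_s2 = statistics.fmean(s2_values)
    mean_s = statistics.fmean([math.sqrt(v) for v in s2_values])
    return mean_s2, mean_s


if __name__ == "__main__":
    mean_s2, mean_s = average_s2_and_s()
    print(f"average s^2 = {mean_s2:.3f}  (sigma^2 = 4.0)")
    print(f"average s   = {mean_s:.3f}  (sigma   = 2.0)")
```

The average of $s^2$ should sit close to 4, while the average of $s$ falls noticeably below 2: taking the square root introduces the bias.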
Confidence Intervals
Consider the sample average
$$\bar{X} \sim N(\mu_X, \sigma_X^2/n)$$
(“$\sim$” reads “is distributed as”; $N(\mu_X, \sigma_X^2/n)$ reads “Normally distributed with mean $\mu_X$ and variance $\sigma_X^2/n$”)
We can standardize this to have zero mean and unit variance:
$$Z = \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}}$$
Confidence Intervals
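Distribution for standard normal:
Start with -
$$P(-1.96 \le Z \le 1.96) = 0.95$$
and consider Z -
$$P\left(-1.96 \le \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}} \le 1.96\right) = 0.95$$
$$P(-1.96\,\sigma_X/\sqrt{n} \le \bar{X} - \mu_X \le 1.96\,\sigma_X/\sqrt{n}) = 0.95$$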
Confidence Intervals
Rearrange this last statement to obtain:
$$P(\bar{X} - 1.96\,\sigma_X/\sqrt{n} \le \mu_X \le \bar{X} + 1.96\,\sigma_X/\sqrt{n}) = 0.95$$
(the two endpoints are RANDOM; $\mu_X$ is NOT random)
Interpretation -
» limits of interval have uncertainty - if we repeated sequence of estimating average and computing the limits, the endpoints would change somewhat BUT 95% of the time, the interval would contain the true value of the mean
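The "95% of the time" interpretation can be tested directly by repeating the experiment many times on a process whose true mean is known. A minimal sketch follows, assuming Normal data with an illustrative true mean of 70 and standard deviation of 2.1 (chosen here only for illustration) and n = 10 points per experiment.

```python
import math
import random


def interval_coverage(n=10, mu=70.0, sigma=2.1, repeats=10000, seed=3):
    """Fraction of repeated 95% intervals that contain the true mean mu."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # known-variance case
    hits = 0
    for _ in range(repeats):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # The endpoints move from run to run; mu stays fixed.
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / repeats


if __name__ == "__main__":
    print(f"fraction of intervals containing the true mean: {interval_coverage():.3f}")
```

The printed fraction should be close to 0.95, even though any single interval either contains the mean or does not.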
Confidence Intervals
– this interval DOES NOT imply that the mean is uncertain
[Picture: sequence of intervals associated with repeated experimentation, shown against the true value of the mean]
Confidence Intervals
General result for mean -
$100(1-\alpha)\%$ confidence interval given by:
$$\bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} \le \mu_X \le \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n}$$
where -
» $z_{\alpha/2}$ - “fence” - value for which $P(Z > z_{\alpha/2}) = \alpha/2$
» value obtained from tables
• 95% - value is 1.96 - approximately 2
• 99% - value is 2.576 - approximately 2.58
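The table lookup for $z_{\alpha/2}$ can also be done in software. A minimal sketch using the Python standard library's `statistics.NormalDist` is shown below; the helper name `z_fence` is hypothetical.

```python
from statistics import NormalDist


def z_fence(alpha):
    """Return z_{alpha/2}, the value with upper tail area alpha/2 under N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


if __name__ == "__main__":
    print(f"95% fence: {z_fence(0.05):.3f}")
    print(f"99% fence: {z_fence(0.01):.3f}")
```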
Confidence Intervals
General Approach
» form a quantity with a known distribution that depends on the parameter of interest
$$Z = \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}}$$
» form a probability statement - choose fences (limits) with a known probability
$$P\left(-1.96 \le \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}} \le 1.96\right) = 0.95$$
» re-arrange statement to obtain an interval specifying a range of values for the parameter of interest
$$P(\bar{X} - 1.96\,\sigma_X/\sqrt{n} \le \mu_X \le \bar{X} + 1.96\,\sigma_X/\sqrt{n}) = 0.95$$
Confidence Intervals for Mean
When population variance is “known”, $100(1-\alpha)\%$ confidence interval is -
$$\bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} \le \mu_X \le \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n}$$
Known variance -
» knowledge of variance when process has been operating steadily for long period of time
» on basis of extensive operating experience
» “large number of data points”
Confidence Intervals for Mean
What if variance is unknown?
» Estimate using sample variance $s^2$
Follow previous approach by forming standardized quantity:
$$\frac{\bar{X} - \mu_X}{s_X/\sqrt{n}}$$
» issue - $s^2$ is a statistic itself, and is a random variable
» this quantity no longer has a standard Normal distribution
Solution -
» what is the probability distribution of this quantity, when data are Normally distributed?
Student’s t Distribution
When the data are Normally distributed,
$$\frac{\bar{X} - \mu_X}{s_X/\sqrt{n}}$$
follows a Student’s t distribution with n-1 degrees of freedom
Degrees of freedom -
» number of statistically independent pieces of information used to compute sample variance
» recall that in $s^2$, we divide by n-1 where n is the number of data points
Student’s t Distribution
… has a shape similar to that of Normal distribution
» symmetric
» values are available in tables
» extra parameter in tables - degrees of freedom
[Figure: Student’s t density with 3 degrees of freedom]
Confidence Intervals for Mean
Variance Unknown
» estimated using sample variance
» $100(1-\alpha)\%$ case
$$\bar{X} - t_{\nu,\alpha/2}\,s_X/\sqrt{n} \le \mu_X \le \bar{X} + t_{\nu,\alpha/2}\,s_X/\sqrt{n}$$
» $\nu$ is the number of degrees of freedom (n-1), where n is number of data points used to compute sample variance (and average)
» obtained following identical argument used in the known variance case
Example #1
Conversion in a chemical reactor using new catalyst preparation
» data collected, average conversion computed using 10 data points is 76.1%
» prior operating history indicates that variance of conversion is 4.41 %²
» determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different from current conversion, which is known to be 70%
Example #1
• Confidence interval - 95%
» upper tail area is 2.5%: $z_{\alpha/2} = z_{0.025} = 1.96$
» standard devn = sqrt(4.41) = 2.1
» confidence interval
$$76.1 - (1.96)(2.1)/\sqrt{10} \le \mu_X \le 76.1 + (1.96)(2.1)/\sqrt{10}$$
$$74.8 \le \mu_X \le 77.4$$
» conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
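The arithmetic in Example #1 can be reproduced in a few lines. This is a minimal sketch of the known-variance interval; the function name `z_interval` is hypothetical, and the inputs are the values stated in the example.

```python
import math
from statistics import NormalDist


def z_interval(xbar, sigma, n, alpha=0.05):
    """Known-variance confidence interval for the mean."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


if __name__ == "__main__":
    low, high = z_interval(xbar=76.1, sigma=math.sqrt(4.41), n=10)
    print(f"{low:.1f} <= mu <= {high:.1f}")
    print("contains 70%?", low <= 70.0 <= high)
```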
Example #2
Conversion in a chemical reactor using new catalyst preparation
» data collected, average conversion computed using 10 data points is 76.1%
» current data set of 10 points used to estimate sample variance, which is 5.3 %²
» determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different from current conversion, which is known to be 70%
Example #2
• Confidence interval - 95%
» variance UNKNOWN - need to use Student’s t distribution -- degrees of freedom = 10-1 = 9
» upper tail area is 2.5%: $t_{\nu,\alpha/2} = t_{9,0.025} = 2.262$
» standard devn = sqrt(5.3) = 2.3
» confidence interval
$$76.1 - (2.262)(2.3)/\sqrt{10} \le \mu_X \le 76.1 + (2.262)(2.3)/\sqrt{10}$$
$$74.5 \le \mu_X \le 77.7$$
» conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
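Example #2 follows the same pattern with the t value in place of the z value. The sketch below takes the fence $t_{9,0.025} = 2.262$ from the example as a constant, because the Python standard library has no Student's t quantile function; the function name `t_interval` is hypothetical.

```python
import math

# Table value quoted in the example: t with 9 degrees of freedom,
# upper tail area 0.025.
T_9_0025 = 2.262


def t_interval(xbar, s2, n, t_value):
    """Unknown-variance confidence interval for the mean, given a t fence."""
    half_width = t_value * math.sqrt(s2 / n)
    return xbar - half_width, xbar + half_width


if __name__ == "__main__":
    low, high = t_interval(xbar=76.1, s2=5.3, n=10, t_value=T_9_0025)
    print(f"{low:.1f} <= mu <= {high:.1f}")
    print("contains 70%?", low <= 70.0 <= high)
```

The interval is wider than in Example #1 partly because the t fence (2.262) is larger than the Normal fence (1.96), reflecting the extra uncertainty from estimating the variance.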
Confidence Intervals for Variance
First, we need to know the sampling distribution of the sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
• when data are Normally distributed, sample variance is the sum of squared Normal random variables
» squaring “folds over” the negative values of the Normal random variable and makes them positive - asymmetry
Chi-squared distribution
• is the distribution of a squared standard Normal random variable
$$Z^2 \sim \chi^2_1$$
» Chi-squared random variable with 1 degree of freedom
» degrees of freedom = number of independent standard Normal random variables being squared
» e.g.,
$$Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$$
• 3 degrees of freedom
[Figure: Chi-squared density with 3 degrees of freedom]
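The construction above can be simulated directly: square independent standard Normal draws and add them up. A minimal sketch follows; the function name, number of draws and seed are illustrative assumptions.

```python
import random
import statistics


def chi_squared_draws(dof, repeats=50000, seed=4):
    """Draws of Z_1^2 + ... + Z_dof^2 for independent standard Normal Z_i."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))
        for _ in range(repeats)
    ]


if __name__ == "__main__":
    draws = chi_squared_draws(3)
    print(f"mean   = {statistics.fmean(draws):.2f}  (equals the degrees of freedom in theory)")
    print(f"median = {statistics.median(draws):.2f}  (below the mean: right-skewed)")
    print(f"min    = {min(draws):.4f}  (never negative)")
```

The median falling below the mean is the asymmetry produced by squaring.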
Sampling distribution -sample variance
Sample variance
» is a sum of n squared Normal random variables BUT the terms being summed are squared deviations from the sample average
» given value of sample average introduces constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the average)
» sample variance contains n-1 independent Normal random variables --> degrees of freedom for Chi-squared distribution is n-1
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$
Confidence Intervals - Sample Variance
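• Form probability statement
$$P\left(\chi^2_{n-1,1-\alpha/2} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{n-1,\alpha/2}\right) = 1-\alpha$$
• Re-arrange statement
$$P\left(\frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}\right) = 1-\alpha$$
• $100(1-\alpha)\%$ interval is
$$\frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}$$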
Confidence Limits for Variance
Notes
1) the tail areas are equal
» symmetric tail areas
however the interval can be asymmetric
» consequence of asymmetry of Chi-squared distribution
2) $\chi^2_{n-1,1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of $1-\alpha/2$ and n-1 degrees of freedom
[Figure: Chi-squared density with equal tail areas marked]
Variance Confidence Intervals - Example
Temperature controller has been implemented on a polymer reactor -
» variance under previous operation was 4.7 °C²
» under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
» is the variance under the new control operation significantly better?
• i.e., is variance under new operation significantly lower?
Variance Confidence Intervals - Example
Use confidence interval for variance
» n-1 = 10-1 = 9 degrees of freedom
» form 95% confidence interval ($\alpha = 0.05$)
» from tables:
$$\chi^2_{9,0.025} = 19.0, \qquad \chi^2_{9,1-0.025} = 2.7$$
» interval for variance:
$$\frac{(9)(3.2)}{19.0} \le \sigma^2 \le \frac{(9)(3.2)}{2.7}, \quad \text{i.e.,} \quad 1.52 \le \sigma^2 \le 10.67$$
» conclusion - the interval contains the previous variance of 4.7, so the variance reduction isn’t significant after background variation in sample variance computation is taken into account
» note that interval isn’t symmetric
Variance Confidence Intervals - Example
Comment – precision of the variance estimate is sensitive to degrees of freedom
» need larger number of data points to obtain precise estimate
» e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:
$$2.04 \le \sigma^2 \le 5.71$$
» cf. previous interval with 10 data points
$$1.52 \le \sigma^2 \le 10.67$$
Conclusion still doesn’t change, however: the previous variance of 4.7 still lies inside the narrower interval.
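Both variance intervals above can be reproduced with a short calculation. In this sketch the Chi-squared fences are passed in as table values, since the Python standard library has no Chi-squared quantile function. The 9-degree-of-freedom values (19.0 and 2.7) are the ones quoted in the example; the 30-degree-of-freedom values (47.0 and 16.8) are rounded values from a standard Chi-squared table and are an assumption here, as the slides do not list them. The function name is hypothetical.

```python
def variance_interval(s2, dof, chi2_upper, chi2_lower):
    """Confidence interval for the variance.

    chi2_upper: fence with upper tail area alpha/2 (the larger table value)
    chi2_lower: fence with upper tail area 1 - alpha/2 (the smaller table value)
    """
    return dof * s2 / chi2_upper, dof * s2 / chi2_lower


if __name__ == "__main__":
    previous_variance = 4.7
    cases = {
        "10 data points": (9, 19.0, 2.7),    # table values from the example
        "31 data points": (30, 47.0, 16.8),  # assumed rounded table values
    }
    for label, (dof, upper, lower) in cases.items():
        low, high = variance_interval(3.2, dof, upper, lower)
        inside = low <= previous_variance <= high
        print(f"{label}: {low:.2f} <= sigma^2 <= {high:.2f}  contains 4.7? {inside}")
```

In both cases the previous variance falls inside the interval, so neither data set supports a claim of a significant reduction.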