qt1 - 07 - estimation
Introduction
Estimation
Why Estimation ?
[ From ] Inference
For a given population
Various statistical parameters are GIVEN or KNOWN
Mean, Standard Deviation etc
Task was to interpret them and take managerial decisions
How many shirts to be stocked in the store ?
Is the machine setup faulty ? Should we fix it ?
[ To ] Estimation
For the given population or sample
Various statistical parameters are NOT KNOWN
So managerial decisions cannot be taken
UNLESS we can estimate the parameters
Two kinds of Estimates
Point Estimate
A single number that is used to estimate a given population parameter
Mean Age = 22.3
Interval Estimate
A range of values used to estimate a population parameter
Mean Age is between 21.5 and 23
Difficulty of point estimate
It is either right or wrong !
No way to know the quantum of error in the estimate
Needs to be accompanied by an estimate of the error that could have occurred !!
Estimator
Estimator
A sample statistic that is used to estimate a population parameter
Estimate
A specific value of the statistic that is observed
Criteria for a Good Estimator
Unbiased
Example : Mean of the sampling distribution of sample means taken from the population is equal to the population mean itself.
Efficient
Depends on the standard error of the statistic
Standard error = standard deviation of the sampling distribution
The lower the standard error, the more efficient the estimator
Consistent
When sample size increases the value of the statistic comes closer and closer to the value of the parameter
Sufficient
Uses all information that can be extracted from sample
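Point Estimate
Estimate of Mean
Sample Mean
$\bar{x} = \frac{\sum x}{n}$
Estimate of Standard Deviation
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ (the sample standard deviation $s$ is the square root of this)
We cannot use the statistic below
$s^2 = \frac{\sum (x - \bar{x})^2}{n}$
because it can be shown that it has a bias !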
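The two divisors can be compared directly in a short Python sketch. The sample values are made up for illustration and are not from the lecture.

```python
import statistics

# Hypothetical sample of ages (illustrative values, not from the lecture)
sample = [21, 22, 23, 22, 24, 21, 23]

n = len(sample)
mean = sum(sample) / n                       # sample mean: sum(x) / n

ss = sum((x - mean) ** 2 for x in sample)    # sum of squared deviations
var_unbiased = ss / (n - 1)                  # divisor n - 1: unbiased estimate
var_biased = ss / n                          # divisor n: systematically too small

print(f"mean = {mean:.3f}")
print(f"variance with n - 1 = {var_unbiased:.3f}")
print(f"variance with n     = {var_biased:.3f}")
```

The standard library functions `statistics.variance` and `statistics.pvariance` use the n - 1 and n divisors respectively, so they can be used in place of the manual sums.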
Where are the errors here ?
Potential number of patrons at a very popular musical concert that is always sold out ..
Estimator : Average number of tickets sold
Telephone calls are billed by whole minutes even if the duration is a fraction. What is the average length of a call ?
Estimator : Average billing for all calls made over a day / Rate per minute
Interval Estimate
An interval estimate describes a range of values within which a population parameter is likely to lie
Consider an interval estimate for the mean
Start with a point estimate
Find the likely error of this estimate
Standard error is standard deviation of the estimator
Make an interval estimate
Defined in terms of the estimate and the standard error
Find the probability that mean will fall in this interval estimate
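Example
Estimate the average life of a car battery in months
From a sample of size $n = 200$ we get $\bar{x} = 36$
Standard error of sample mean
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{200}} = 0.707$, assuming $\sigma = 10$
Now we can make interval estimates like
$\bar{x} - \sigma_{\bar{x}} < \mu < \bar{x} + \sigma_{\bar{x}}$ => $35.293 < \mu < 36.707$
$\bar{x} - 2\sigma_{\bar{x}} < \mu < \bar{x} + 2\sigma_{\bar{x}}$ => $34.586 < \mu < 37.414$
$\bar{x} - 3\sigma_{\bar{x}} < \mu < \bar{x} + 3\sigma_{\bar{x}}$ => $33.879 < \mu < 38.121$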
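The arithmetic of the battery example can be checked with a small Python sketch. The numbers are the ones in the example; the helper name `interval` is only an illustration.

```python
import math

x_bar = 36.0    # sample mean, in months
sigma = 10.0    # assumed standard deviation, in months
n = 200         # sample size

std_err = sigma / math.sqrt(n)   # standard error of the sample mean

def interval(k):
    """Return (lower, upper) for x_bar plus or minus k standard errors."""
    return x_bar - k * std_err, x_bar + k * std_err

print(f"standard error = {std_err:.3f}")
for k in (1, 2, 3):
    lower, upper = interval(k)
    print(f"{k} standard error(s): {lower:.3f} < mu < {upper:.3f}")
```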
Back to Probability
Sampling distribution of the mean is also normal with
Mean = 36.0
Standard Deviation ( Standard Error) = 0.707
So probability of the real mean lying between the limits given by the interval estimate is known !
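68.3% within 1 standard error of the mean
95.5% within 2 standard errors of the mean
99.7% within 3 standard errors of the mean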
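These three probabilities can be reproduced from the normal distribution itself. The sketch below assumes the sampling distribution is exactly normal with the mean and standard error quoted above, and uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Sampling distribution of the mean, as assumed in the battery example
sampling_dist = NormalDist(mu=36.0, sigma=0.707)

def prob_within(k):
    """Probability of falling within k standard errors of the mean."""
    lower = sampling_dist.mean - k * sampling_dist.stdev
    upper = sampling_dist.mean + k * sampling_dist.stdev
    return sampling_dist.cdf(upper) - sampling_dist.cdf(lower)

for k in (1, 2, 3):
    print(f"within {k} standard error(s): {prob_within(k):.1%}")
```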
Interval Estimate of Mean
Probabilities are as follows
68.3% => $35.293 < \mu < 36.707$
95.5% => $34.586 < \mu < 37.414$
99.7% => $33.879 < \mu < 38.121$
Here we note that the probabilities are odd, fractional kind of numbers ...
So how can we have simpler probabilities like 50% or 90% probability ?
Confidence Interval
We observe that in a normal distribution
90% of the values lie within $1.64\sigma$ of the mean
99% of the values lie within $2.58\sigma$ of the mean
So we redefine our interval estimates as
90% => $34.84 < \mu < 37.16$
99% => $34.18 < \mu < 37.82$
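The multipliers 1.64 and 2.58 come from the inverse of the normal cumulative distribution. A minimal Python sketch, assuming a two-sided interval that leaves equal area in each tail (the function name `z_for_confidence` is illustrative):

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Number of standard deviations that encloses the given central area."""
    # A central area of `confidence` leaves (1 - confidence) / 2 in each tail,
    # so the upper limit sits at cumulative probability 0.5 + confidence / 2.
    return NormalDist().inv_cdf(0.5 + confidence / 2)

x_bar, std_err = 36.0, 0.707   # battery example
for confidence in (0.90, 0.99):
    z = z_for_confidence(confidence)
    print(f"{confidence:.0%}: z = {z:.2f}, "
          f"{x_bar - z * std_err:.2f} < mu < {x_bar + z * std_err:.2f}")
```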
Confidence Intervals
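Original Limits
68.3% => $35.293 < \mu < 36.707$ (limits at $\pm 1\sigma$)
95.5% => $34.586 < \mu < 37.414$ (limits at $\pm 2\sigma$)
99.7% => $33.879 < \mu < 38.121$ (limits at $\pm 3\sigma$)
More convenient limits
90% => $34.84 < \mu < 37.16$ (limits at $\pm 1.64\sigma$)
99% => $34.18 < \mu < 37.82$ (limits at $\pm 2.58\sigma$)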
Is a higher confidence level always better ?
I am 99.999% sure that the average age of this class lies between 1 year and 50 years
Does this really help you in any way ?
I am 95% sure that the average age of this class lies between 23 and 26 years
This gives me a far better idea of where the average age of the class lies
This information is better than the first information
95% confident that the mean battery life lies between 30 and 42 months
Does NOT mean that
There is 95% probability that the mean life of all our batteries falls within the interval established from this one sample
It DOES mean that
If we select many random samples of the same size and calculate a confidence interval for each of these samples then 95% of these intervals will contain the population mean
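The repeated-sampling interpretation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with a known mean and standard deviation (the values below are hypothetical), and the function name `coverage` is illustrative.

```python
import math
import random

def coverage(trials=2000, n=50, mu=36.0, sigma=10.0, z=1.96, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean mu."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    std_err = sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one random sample of size n and compute its mean
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # Build the interval from this sample and check whether it contains mu
        if sample_mean - z * std_err < mu < sample_mean + z * std_err:
            hits += 1
    return hits / trials

print(f"share of intervals containing the population mean: {coverage():.3f}")
```

Each simulated sample gives a different interval, and the share of intervals that contain the population mean should come out close to 0.95.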
Calculation of Confidence Interval
Example
A large automotive parts wholesaler needs an estimate of the mean life that he can expect from a windshield wiper blade under normal driving conditions
It is known that the standard deviation of the population life is 6 months
Observations from one simple random sample of 100 blades are as follows
Sample Size
$n = 100$
Sample Mean
$\bar{x} = 21$ months
Population standard deviation
$\sigma = 6$ months
95% Confidence Interval
Standard Error
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{6}{\sqrt{100}} = 0.6$ months
95% confidence level will include 47.5 % on each side of the mean
Sample size is > 30 so we can assume that the sample mean follows a normal distribution
In a normal distribution
95% of the values lie within 1.96 standard deviations of the mean
95% of the values of the sample mean lie within 1.96 standard errors of the population mean
Upper Confidence Limit
$\bar{x} + 1.96\,\sigma_{\bar{x}}$
= 21 + 1.96 (0.6)
= 22.18 months
Lower Confidence Limit
$\bar{x} - 1.96\,\sigma_{\bar{x}}$
= 21 - 1.96 (0.6)
= 19.82 months
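The wiper blade calculation, written out as a short Python sketch with the multiplier 1.96 taken directly from the lecture:

```python
import math

n = 100         # sample size
x_bar = 21.0    # sample mean, in months
sigma = 6.0     # known population standard deviation, in months
z = 1.96        # multiplier for 95% confidence

std_err = sigma / math.sqrt(n)    # standard error of the sample mean
ucl = x_bar + z * std_err         # upper confidence limit
lcl = x_bar - z * std_err         # lower confidence limit

print(f"standard error = {std_err:.1f} months")
print(f"95% confidence interval: {lcl:.2f} to {ucl:.2f} months")
```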
Two major assumptions
Standard Deviation of Population is known
In reality this may not be known
The sampling distribution follows the normal distribution
As a rule of thumb, this assumption is reasonable when the sample size is more than 30.
Standard deviation is not known
When Standard Deviation Known
Standard Error of the Sample mean
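$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
When Standard Deviation is not Known
Standard Error of the sample mean
$\hat{\sigma}_{\bar{x}} = \frac{\hat{\sigma}}{\sqrt{n}}$
$\hat{\sigma} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$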
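When the population standard deviation is not known, the standard error has to be estimated from the sample. A minimal Python sketch follows; the sample values and the function name `estimated_standard_error` are hypothetical.

```python
import math
import statistics

def estimated_standard_error(sample):
    """Estimated standard error of the mean: sigma-hat / sqrt(n)."""
    # statistics.stdev uses the n - 1 divisor, matching sigma-hat above
    sigma_hat = statistics.stdev(sample)
    return sigma_hat / math.sqrt(len(sample))

# Hypothetical battery lives in months (illustrative values only)
battery_lives = [34.1, 37.5, 35.2, 36.8, 38.0, 33.9, 36.4, 35.7]
print(f"estimated standard error = {estimated_standard_error(battery_lives):.3f}")
```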
How do we get this interval ?
[Usually] we are trying to estimate the population mean $\mu$
We have an estimator E which [ in most cases ] is the sample mean.
E follows a distribution that has mean $\mu$ and standard error $\sigma$
We create a statistic $Q = (E - \mu)/\sigma$
Q follows some distribution ( normal ? t ? )
We identify two values $Q_1$, $Q_2$ such that the probability of Q falling between $-Q_2$ and $Q_1$ is equal to the required confidence P ( for a symmetric distribution $Q_1 = Q_2$ )
Interval is $E - Q_1\sigma < \mu < E + Q_2\sigma$
What is our goal ?
What is known ?
E, $\sigma$, P
What is to be calculated ?
$Q_1$, $Q_2$
What is the objective ?
To be confident that
Probability of $\mu$
Lying between $E - Q_1\sigma$ and $E + Q_2\sigma$
Is equal to P
What are the steps
Identify an estimator
What distribution does the estimator follow ?
Is the standard deviation known ?
If not what is the estimator for the standard deviation
Is the sample size big enough ?
Get a value of the estimate
Get a value of the standard error for the estimator
Set an appropriate confidence level in terms of probability
From the graph / table of the sampling distribution get the upper and lower limits in terms of estimate and the standard error
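The steps above can be collected into one small function. This is a sketch, not a definitive implementation: it assumes a symmetric sampling distribution, and by default it takes that distribution to be standard normal. The function name and the `quantile` parameter are illustrative.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_err, confidence,
                        quantile=NormalDist().inv_cdf):
    """Symmetric interval: estimate plus or minus q standard errors.

    `quantile` maps a cumulative probability to a value of the statistic.
    The default assumes the statistic is standard normal; a different
    distribution can be supplied by passing its inverse CDF instead.
    """
    q = quantile(0.5 + confidence / 2)   # same multiplier on both sides
    return estimate - q * std_err, estimate + q * std_err

# Battery example: estimate 36, standard error 0.707, 90% confidence
low, high = confidence_interval(36.0, 0.707, 0.90)
print(f"{low:.2f} < mu < {high:.2f}")   # prints 34.84 < mu < 37.16
```

Passing the inverse CDF as a parameter keeps the question "which distribution does the estimator follow ?" separate from the arithmetic of the interval.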
Confidence Intervals Revisited
Original Limits
68.3% => $35.293 < \mu < 36.707$ (limits at $\pm 1\sigma$)
95.5% => $34.586 < \mu < 37.414$ (limits at $\pm 2\sigma$)
99.7% => $33.879 < \mu < 38.121$ (limits at $\pm 3\sigma$)
More convenient limits
90% => $34.84 < \mu < 37.16$ (limits at $\pm 1.64\sigma$)
99% => $34.18 < \mu < 37.82$ (limits at $\pm 2.58\sigma$)
Point to note ...
$\mu$ and $\sigma$ come from the estimate
How do we connect
68.3%, 90%, 95.5%, 99%, 99.7%
1, 1.64, 2, 2.58, 3
By looking at the probability distribution function of the estimator
Sample Size in Estimation
With a known population standard deviation, how large should the sample be to give a confidence interval of adequate width at the required confidence level ?
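The lecture poses this question without giving a formula. One common approach, sketched here as an assumption rather than as the lecture's own method, is to require that the half-width of the interval, $z\sigma/\sqrt{n}$, be no larger than a chosen margin, which gives $n \ge (z\sigma/\text{margin})^2$. The function name and the margin of 1 month are hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, margin, confidence):
    """Smallest n for which z * sigma / sqrt(n) is at most `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)

# Wiper blades: sigma = 6 months, desired half-width 1 month, 95% confidence
print(required_sample_size(sigma=6.0, margin=1.0, confidence=0.95))   # prints 139
```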
Which distribution does the estimator follow ?
So far ... and usually .. we assume that the estimator follows the normal distribution
That is how we get
68.3% => $1.0\sigma$
90.0% => $1.64\sigma$
95.5% => $2.0\sigma$
99.0% => $2.58\sigma$
99.7% => $3.00\sigma$
The Student's t distribution
Used when
Standard deviation of the population is NOT known AND
Sample size is less than 30
When this happens we cannot use the normal distribution but must look up the tables for the t distribution
t-distribution instead of normal
=NORMSINV(D5+0.5)
=TINV($F5;G$2)
Usage of t-table
The probability that we are working with
IS NOT the probability that the estimated value will fall inside the confidence interval
INSTEAD
It is the probability that the estimated value will fall OUTSIDE the confidence interval
This probability is defined as $\alpha$
Confidence = $1 - \alpha$
Degrees of Freedom
Sample size - 1
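The lookup can be sketched in Python. The critical values below are copied from the t values tabulated in this lecture for 4, 9, 19 and 29 degrees of freedom. The sample mean and sample standard deviation in the usage lines are hypothetical, and the function name `t_critical` is illustrative.

```python
import math

# Two-tailed critical values of t from the lecture's table.
# Outer key: alpha (area outside the interval); inner key: degrees of freedom.
T_TABLE = {
    0.10: {4: 2.132, 9: 1.833, 19: 1.729, 29: 1.699},
    0.05: {4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045},
    0.01: {4: 4.604, 9: 3.250, 19: 2.861, 29: 2.756},
}

def t_critical(confidence, sample_size):
    """Look up t using alpha = 1 - confidence and df = sample size - 1."""
    alpha = round(1 - confidence, 2)   # rounding avoids float noise in the key
    df = sample_size - 1
    return T_TABLE[alpha][df]

# Hypothetical small sample: n = 10, sample mean 21, sample std deviation 6
n, x_bar, s = 10, 21.0, 6.0
t = t_critical(0.95, n)
margin = t * s / math.sqrt(n)
print(f"t = {t}, interval: {x_bar - margin:.2f} to {x_bar + margin:.2f}")
```

Only the combinations present in the table can be looked up this way; other sample sizes or confidence levels need a fuller table.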
Binomial Distribution / Proportions
We have a binomial distribution with p as the success probability
20% student population are engineers
45% of employees population are married
We need to have an estimate of p
Estimator is proportion p from sample
Assumptions
Estimator follows normal distribution
$\mu = np$
Standard error of estimate
$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$, where $q = 1 - p$
Example
Sample size $n = 75$
Fraction graduate
$p = 0.4$
Fraction not graduate
$q = 0.6$
Estimate of p = 0.4
Standard Error for Estimator
$\sqrt{\frac{0.4 \times 0.6}{75}} = 0.057$
99% confidence interval
Z = 2.58
LCL
0.4 - 0.057 * 2.58
= 0.253
UCL
0.4 + 0.057 * 2.58
= 0.547
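The proportion example can be checked with a short Python sketch. It assumes the standard error of a sample proportion is the square root of pq/n, which reproduces the 0.057 quoted above.

```python
import math

n = 75               # sample size
p_hat = 0.4          # fraction of graduates in the sample
q_hat = 1 - p_hat    # fraction of non-graduates
z = 2.58             # multiplier for 99% confidence

std_err = math.sqrt(p_hat * q_hat / n)   # standard error of the proportion
lcl = p_hat - z * std_err                # lower confidence limit
ucl = p_hat + z * std_err                # upper confidence limit

print(f"standard error = {std_err:.3f}")
print(f"99% confidence interval: {lcl:.3f} to {ucl:.3f}")
```

The limits differ from 0.253 and 0.547 in the third decimal place because the lecture rounds the standard error to 0.057 before multiplying.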
Normal Distribution
Value of X | Probability Density
33.55 | 0.0013926773648661
33.73 | 0.00318468329587139
33.90 | 0.00684972932251462
34.08 | 0.0138570887117816
34.25 | 0.026367078474001
34.43 | 0.0471892911201431
34.60 | 0.0794358088809738
34.78 | 0.125771030205113
34.95 | 0.187299380755427
35.13 | 0.262351463951498
35.30 | 0.345638466463144
35.48 | 0.428303944809365
35.65 | 0.499198769788985
35.83 | 0.547250750234345
36.00 | 0.56427479547586
36.18 | 0.547250750234345
36.35 | 0.499198769788985
36.53 | 0.428303944809365
36.70 | 0.345638466463144
36.88 | 0.262351463951498
37.05 | 0.187299380755427
37.23 | 0.125771030205113
37.40 | 0.0794358088809738
37.58 | 0.0471892911201431
37.75 | 0.026367078474001
37.93 | 0.0138570887117816
38.10 | 0.00684972932251462
38.28 | 0.00318468329587139
38.45 | 0.0013926773648661
Normal Distribution and Student t Distribution
Area within control limits (P) | Area outside control limits (area in tail) | Half of area within limits (rounded) | z | t (df = 4) | t (df = 9) | t (df = 19) | t (df = 29)
0.85 | 0.15 | 0.43 | 1.44 | 1.778 | 1.574 | 1.500 | 1.479
0.86 | 0.14 | 0.43 | 1.48 | 1.838 | 1.619 | 1.540 | 1.517
0.87 | 0.13 | 0.44 | 1.51 | 1.902 | 1.666 | 1.583 | 1.558
0.88 | 0.12 | 0.44 | 1.55 | 1.971 | 1.718 | 1.628 | 1.602
0.89 | 0.11 | 0.45 | 1.60 | 2.048 | 1.773 | 1.677 | 1.649
0.90 | 0.10 | 0.45 | 1.64 | 2.132 | 1.833 | 1.729 | 1.699
0.91 | 0.09 | 0.46 | 1.70 | 2.226 | 1.899 | 1.786 | 1.754
0.92 | 0.08 | 0.46 | 1.75 | 2.333 | 1.973 | 1.850 | 1.814
0.93 | 0.07 | 0.47 | 1.81 | 2.456 | 2.055 | 1.920 | 1.881
0.94 | 0.06 | 0.47 | 1.88 | 2.601 | 2.150 | 2.000 | 1.957
0.95 | 0.05 | 0.48 | 1.96 | 2.776 | 2.262 | 2.093 | 2.045
0.96 | 0.04 | 0.48 | 2.05 | 2.999 | 2.398 | 2.205 | 2.150
0.97 | 0.03 | 0.49 | 2.17 | 3.298 | 2.574 | 2.346 | 2.282
0.98 | 0.02 | 0.49 | 2.33 | 3.747 | 2.821 | 2.539 | 2.462
0.99 | 0.01 | 0.50 | 2.58 | 4.604 | 3.250 | 2.861 | 2.756
prithwis mukerjee
Population of Interest | Parameter of Interest | Sample Statistic used as Estimator | Estimate Made
Production at Factory | Annual Production | Production in one month | 2000 t/year
Candidates for employment | Average Age | Average age of every tenth applicant | 26y 3m
Students in Engineering | Proportion of women | Proportion of women in a sample of 100 students | 24.50%