15.Math-Review: Statistics

Upload: dana-basford

Post on 29-Mar-2015


TRANSCRIPT

Page 1: 1 15.Math-Review Statistics. 2 zLet us consider X 1, X 2,…,X n, n independent identically distributed random variables with mean and standard deviation

15.Math-Review

Statistics


Central Limit Theorem

Let us consider $X_1, X_2, \ldots, X_n$, n independent identically distributed random variables with mean $\mu$ and standard deviation $\sigma$.

And define:

$$S_n = \sum_{i=1}^{n} X_i$$


Central Limit Theorem

The Central Limit Theorem (CLT) states:

If n is large (say $n \ge 30$) then $S_n$ follows approximately a normal distribution with mean $n\mu$ and standard deviation $\sigma\sqrt{n}$.

If n is large (say $n \ge 30$) then $\bar{X} = \frac{1}{n} S_n$ follows approximately a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
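The two statements of the CLT can be checked numerically. The following is a minimal Python sketch, not part of the original slides. It assumes, purely for illustration, that each X_i is uniform on [0, 1] (so the mean is 1/2 and the standard deviation is sqrt(1/12)); the helper name `simulate_sums` is hypothetical.

```python
import math
import random
import statistics


def simulate_sums(n, trials, rng):
    """Return `trials` simulated values of S_n = X_1 + ... + X_n, X_i ~ Uniform(0, 1)."""
    return [sum(rng.random() for _ in range(n)) for _ in range(trials)]


# Assumed illustration: uniform X_i, so mu = 1/2 and sigma = sqrt(1/12).
mu, sigma = 0.5, math.sqrt(1 / 12)
n, trials = 30, 10_000
rng = random.Random(0)  # fixed seed so the run is repeatable

sums = simulate_sums(n, trials, rng)
means = [s / n for s in sums]

# Compare the simulated moments with the values the CLT predicts.
print("S_n:   mean", statistics.fmean(sums), "vs n*mu =", n * mu)
print("S_n:   sd  ", statistics.stdev(sums), "vs sigma*sqrt(n) =", sigma * math.sqrt(n))
print("X-bar: mean", statistics.fmean(means), "vs mu =", mu)
print("X-bar: sd  ", statistics.stdev(means), "vs sigma/sqrt(n) =", sigma / math.sqrt(n))
```

With 10,000 trials the simulated means and standard deviations should land close to the predicted values, although the exact figures depend on the seed.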


Central Limit Theorem

Example: sums of a Bernoulli random variable

[Figure: four frequency charts of the simulated sums, 10,000 trials each, with panels "Forecast: n=1", "Forecast: n=10", "Forecast: n=30" and "Forecast: n=50".]


Central Limit Theorem

Example: averages of a Bernoulli random variable

[Figure: four frequency charts of the simulated averages, 10,000 trials each, with panels "Forecast: n=1", "Forecast: mean, n=10", "Forecast: mean, n=30" and "Forecast: mean, n=50".]


Central Limit Theorem

Example: Compare a binomial random variable X ~ B(40, 0.2) with its normal approximation.

What is the normal approximation?

Compare P(X ≤ 10), P(X ≤ 20), P(X ≤ 30) for the binomial and the normal approximation.

        BINOMIAL   BINOMIAL
  x     P(X<=x)    P(X<x)     AVERAGE    NORMAL
  5     0.16133    0.07591    0.11862    0.11784
 10     0.83923    0.73178    0.78550    0.78540
 20     0.99999    0.99998    0.99999    1.00000
 30     1.00000    1.00000    1.00000    1.00000
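The comparison in the table can be reproduced with a short Python sketch (not from the slides). It assumes the NORMAL column is the plain normal CDF with mean np = 8 and standard deviation sqrt(np(1-p)) = sqrt(6.4), with no continuity correction, and that AVERAGE is the mean of the two binomial columns; both readings are inferred from the numbers rather than stated in the source. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

n, p = 40, 0.2
mu = n * p                          # mean of B(40, 0.2): 8
sigma = math.sqrt(n * p * (1 - p))  # standard deviation: sqrt(6.4)
normal = NormalDist(mu, sigma)      # the normal approximation N(8, 6.4)


def binom_pmf(k):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(binom_pmf(j) for j in range(k + 1))


print("  x  P(X<=x)   P(X<x)  average   normal")
for x in (5, 10, 20, 30):
    le = binom_cdf(x)      # P(X <= x)
    lt = binom_cdf(x - 1)  # P(X < x)
    print(f"{x:>3}  {le:.5f}  {lt:.5f}  {(le + lt) / 2:.5f}  {normal.cdf(x):.5f}")
```

Averaging P(X ≤ x) and P(X < x) is one way to line up a discrete distribution with a continuous one, which may be why that column sits closest to the normal values.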


Sampling

Let us consider the following example. We work at a phone company and we would like to be able to estimate the shape of the demand. We assume that monthly household telephone bills follow a certain (continuous) probability distribution. We have obtained the following data on monthly household telephone bills by interviewing 70 randomly chosen households (or rather, their inhabitants) for the month of October.


Sampling

Table:

Respondent  October      Respondent  October      Respondent  October
Number      Phone Bill   Number      Phone Bill   Number      Phone Bill
 1          $95.67       25          $79.32       49          $90.02
 2          $82.69       26          $89.12       50          $61.06
 3          $75.27       27          $63.12       51          $51.00
 4          $145.20      28          $145.62      52          $97.71
 5          $155.20      29          $37.53       53          $95.44
 6          $80.53       30          $97.06       54          $31.89
 7          $80.81       31          $86.33       55          $82.35
 8          $60.93       32          $69.83       56          $60.20
 9          $86.67       33          $77.26       57          $92.28
10          $56.31       34          $64.99       58          $120.89
11          $151.27      35          $57.78       59          $35.09
12          $96.93       36          $61.82       60          $69.53
13          $65.60       37          $74.07       61          $49.85
14          $53.43       38          $141.17      62          $42.33
15          $63.03       39          $48.57       63          $50.09
16          $139.45      40          $76.77       64          $62.69
17          $58.51       41          $78.78       65          $58.69
18          $81.22       42          $62.20       66          $127.82
19          $98.14       43          $80.78       67          $62.47
20          $79.75       44          $84.51       68          $79.25
21          $72.74       45          $93.38       69          $76.53
22          $75.99       46          $139.23      70          $74.13
23          $80.35       47          $48.06
24          $49.42       48          $44.51


Sampling

From this information we would like to be able to estimate, for example:

What is an estimate of the shape of the distribution of October household telephone bills?

What is an estimate of the percentage of households whose October telephone bill is below $45.00?

What is an estimate of the percentage of households whose October telephone bill is between $60.00 and $100.00?

What is an estimate of the mean of the distribution of October household telephone bills?

What is an estimate of the standard deviation of the distribution of October household telephone bills?


Sampling

A population (or “universe”) is the set of all units of interest.

A sample is a subset of the units of a population.

A random sample is a sample collected in such a way that every unit in the population is equally likely to be selected.

It is hard to ensure that a sample will be random.


Sampling

In our example the population corresponds to all the households in our area of coverage.

The random sample selected was the 70 households (or their inhabitants) interviewed.

For the random variables X1, X2, …, Xn corresponding to households 1, 2, …, n we observed x1 = $95.67, x2 = $82.69, …, xn = $74.13.

Note that if we had chosen a different random set of households we would have observed a different collection of values.


Sampling

To fix notation:

n will be our random sample size.

X1, X2, …, Xn are the random variables with unknown distribution f(x), which is common to our population and is what we want to study.

x1, x2, …, xn are the observations obtained by observing the outcome of our random sample. These are numbers!

We try to use these numbers to estimate the characteristics of f(x): for example, what the distribution is, and what its mean, variance, etc. are.


Sampling

To “look” at the shape of the distribution of X it is useful to create a frequency table and histogram of the sample values x1, x2, …, xn.

Interval Limit   Frequency   %         Cumulative %
-30              0           0.00%     0.00%
30-40            3           4.29%     4.29%
40-50            6           8.57%     12.86%
50-60            7           10.00%    22.86%
60-70            13          18.57%    41.43%
70-80            12          17.14%    58.57%
80-90            11          15.71%    74.29%
90-100           9           12.86%    87.14%
100-110          0           0.00%     87.14%
110-120          0           0.00%     87.14%
120-130          2           2.86%     90.00%
130-140          2           2.86%     92.86%
140-150          3           4.29%     97.14%
150-160          2           2.86%     100.00%
160-             0           0.00%     100.00%

[Figure: Histogram of Sample of October Telephone Bills. Horizontal axis: Range for Oct. Bill (-30, 30-40, …, 150-160, 160-); vertical axis: Number of households (0 to 14).]
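The frequency table can be rebuilt directly from the 70 observed bills. The Python sketch below is an illustration rather than the method used in the slides (which use Excel). It treats each interval as closed on the left and open on the right; that is an assumption, but it does not affect the counts here because no bill falls exactly on an interval limit.

```python
# October phone bills for respondents 1..70, copied from the data table.
bills = [
    95.67, 82.69, 75.27, 145.20, 155.20, 80.53, 80.81, 60.93, 86.67, 56.31,
    151.27, 96.93, 65.60, 53.43, 63.03, 139.45, 58.51, 81.22, 98.14, 79.75,
    72.74, 75.99, 80.35, 49.42, 79.32, 89.12, 63.12, 145.62, 37.53, 97.06,
    86.33, 69.83, 77.26, 64.99, 57.78, 61.82, 74.07, 141.17, 48.57, 76.77,
    78.78, 62.20, 80.78, 84.51, 93.38, 139.23, 48.06, 44.51, 90.02, 61.06,
    51.00, 97.71, 95.44, 31.89, 82.35, 60.20, 92.28, 120.89, 35.09, 69.53,
    49.85, 42.33, 50.09, 62.69, 58.69, 127.82, 62.47, 79.25, 76.53, 74.13,
]

edges = list(range(30, 170, 10))  # interval limits 30, 40, ..., 160

# Count the bills in each interval [lo, hi) (assumed convention, see above).
counts = [sum(lo <= b < hi for b in bills) for lo, hi in zip(edges, edges[1:])]

n = len(bills)
cumulative = 0
for lo, hi, count in zip(edges, edges[1:], counts):
    cumulative += count
    print(f"{lo}-{hi}: {count:>2}  {count / n:6.2%}  {cumulative / n:7.2%}")
```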


Sampling

A histogram can be obtained from Excel. The output looks something like this (left: sorted by bin; right: sorted by decreasing frequency):

Bin    Frequency   Cumulative %     Bin    Frequency   Cumulative %
30     0           .00%             70     13          18.57%
40     3           4.29%            80     12          35.71%
50     6           12.86%           90     11          51.43%
60     7           22.86%           100    9           64.29%
70     13          41.43%           60     7           74.29%
80     12          58.57%           50     6           82.86%
90     11          74.29%           40     3           87.14%
100    9           87.14%           150    3           91.43%
110    0           87.14%           130    2           94.29%
120    0           87.14%           140    2           97.14%
130    2           90.00%           160    2           100.00%
140    2           92.86%           30     0           100.00%
150    3           97.14%           110    0           100.00%
160    2           100.00%          120    0           100.00%
More   0           100.00%          More   0           100.00%


Sampling

From this analysis we can give the following (qualitative) description of the shape of this distribution:

An estimate of the shape of the distribution of October telephone bills in the site area is that it is shaped like a Normal distribution, with a peak near $65.00, except for a small but significant group in the range between $125.00 and $155.00.


Sampling

In order to answer the other relevant questions we can use the original data, counting favorable outcomes and dividing by the total number of possible outcomes (70):

P(X ≤ 45.00) = 5/70 = 0.07

P(60.00 ≤ X ≤ 100.00) = 45/70 = 0.64

Here we are approximating the continuous unknown distribution by the discrete distribution given by the outcomes of the sample.
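The counting argument can be written out as a short Python sketch (an illustration, not part of the slides). It uses the same 70 bills as the data table; the variable names are hypothetical.

```python
# October phone bills for respondents 1..70, copied from the data table.
bills = [
    95.67, 82.69, 75.27, 145.20, 155.20, 80.53, 80.81, 60.93, 86.67, 56.31,
    151.27, 96.93, 65.60, 53.43, 63.03, 139.45, 58.51, 81.22, 98.14, 79.75,
    72.74, 75.99, 80.35, 49.42, 79.32, 89.12, 63.12, 145.62, 37.53, 97.06,
    86.33, 69.83, 77.26, 64.99, 57.78, 61.82, 74.07, 141.17, 48.57, 76.77,
    78.78, 62.20, 80.78, 84.51, 93.38, 139.23, 48.06, 44.51, 90.02, 61.06,
    51.00, 97.71, 95.44, 31.89, 82.35, 60.20, 92.28, 120.89, 35.09, 69.53,
    49.85, 42.33, 50.09, 62.69, 58.69, 127.82, 62.47, 79.25, 76.53, 74.13,
]

n = len(bills)

# Favorable outcomes for each question.
below_45 = sum(b <= 45.00 for b in bills)
between_60_100 = sum(60.00 <= b <= 100.00 for b in bills)

# Empirical probabilities: favorable outcomes divided by the sample size.
p_below_45 = below_45 / n
p_between_60_100 = between_60_100 / n

print(f"P(X <= 45.00)          = {below_45}/{n} = {p_below_45:.2f}")
print(f"P(60.00 <= X <= 100.00) = {between_60_100}/{n} = {p_between_60_100:.2f}")
```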


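Sampling

Sample mean, variance and standard deviation: From our observed values x1, x2, …, xn, we can compute:

The observed sample mean,

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The observed sample variance,

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The observed sample standard deviation,

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$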


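Sampling

In our example we have:

The observed sample mean,

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{95.67 + 82.69 + \cdots + 74.13}{70} = \$79.40$$

and the observed sample standard deviation,

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\frac{(95.67 - 79.40)^2 + \cdots + (74.13 - 79.40)^2}{69}} = \$28.79$$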
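These two values can be recomputed from the data table. The Python sketch below is an illustration under the definitions above (divisor n for the mean, n - 1 for the variance); `statistics.stdev` from the standard library uses the same n - 1 divisor and is included as a cross-check.

```python
import math
import statistics

# October phone bills for respondents 1..70, copied from the data table.
bills = [
    95.67, 82.69, 75.27, 145.20, 155.20, 80.53, 80.81, 60.93, 86.67, 56.31,
    151.27, 96.93, 65.60, 53.43, 63.03, 139.45, 58.51, 81.22, 98.14, 79.75,
    72.74, 75.99, 80.35, 49.42, 79.32, 89.12, 63.12, 145.62, 37.53, 97.06,
    86.33, 69.83, 77.26, 64.99, 57.78, 61.82, 74.07, 141.17, 48.57, 76.77,
    78.78, 62.20, 80.78, 84.51, 93.38, 139.23, 48.06, 44.51, 90.02, 61.06,
    51.00, 97.71, 95.44, 31.89, 82.35, 60.20, 92.28, 120.89, 35.09, 69.53,
    49.85, 42.33, 50.09, 62.69, 58.69, 127.82, 62.47, 79.25, 76.53, 74.13,
]

n = len(bills)

# Observed sample mean: sum of the observations divided by n.
x_bar = sum(bills) / n

# Observed sample variance and standard deviation, with divisor n - 1.
s_squared = sum((x - x_bar) ** 2 for x in bills) / (n - 1)
s = math.sqrt(s_squared)

print(f"observed sample mean:               {x_bar:.2f}")
print(f"observed sample standard deviation: {s:.2f}")
print(f"statistics.stdev cross-check:       {statistics.stdev(bills):.2f}")
```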


Sampling

We will use these observed values to estimate the unknown mean $\mu$ and standard deviation $\sigma$ of our unknown underlying distribution.

In other words:

$\bar{x}$ will estimate $\mu$

$s$ will estimate $\sigma$

Also note that if we pick a different sample of the population, our observed values will be different.

We can define the random variables sample mean and sample standard deviation, of which $\bar{x}$ and $s$ are observations.


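Sampling

Before the sample is collected, the random variables X1, X2, …, Xn can be used to define:

The sample mean,

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

The sample variance,

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

The sample standard deviation,

$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}$$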


Sampling

$\bar{X}$ and $S$ are random variables.

We distinguish between the sample mean $\bar{X}$, which is a random variable, and the observed sample mean $\bar{x}$, which is a number.

Similarly, the sample standard deviation $S$ is a random variable, and the observed sample standard deviation $s$ is a number.


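Sampling

Distribution of $\bar{X}$

From the formula that defines the sample mean we see that, according to the CLT, it should follow approximately a normal distribution (if $n \ge 30$).

The mean is $E(\bar{X}) = \mu$.

The standard deviation is $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.

In summary:

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \approx N\left(\bar{x}, \frac{s^2}{n}\right)$$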

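As an illustration of this summary (not taken from the slides), the Python sketch below plugs in the phone-bill values n = 70, observed mean 79.40 and observed standard deviation 28.79. The $5 band in the last step is an arbitrary choice made only to show how the approximate distribution can be used.

```python
import math
from statistics import NormalDist

# Observed values from the phone-bill example.
n, x_bar, s = 70, 79.40, 28.79

# Estimated standard deviation of the sample mean: s / sqrt(n).
standard_error = s / math.sqrt(n)

# Approximate distribution of the sample mean, N(x_bar, s^2 / n).
sampling_dist = NormalDist(mu=x_bar, sigma=standard_error)

# Illustrative use (assumed question): the probability mass this
# approximation places within $5 of its centre.
within_5 = sampling_dist.cdf(x_bar + 5) - sampling_dist.cdf(x_bar - 5)

print(f"s / sqrt(n) = {standard_error:.3f}")
print(f"mass within $5 of the centre = {within_5:.3f}")
```

Note how much smaller the standard deviation of the sample mean (about 3.44) is than the standard deviation of a single bill (28.79).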

Sampling

Example: At two different branches of the G-Mart department store, they randomly sampled 100 customers on August 13. At Store 1, the average amount purchased was $41.25 per customer, with a sample standard deviation of $24.00. At Store 2, the average amount purchased was $45.75, with a sample standard deviation of $34.00.

Let X denote the amount of a random purchase by a single customer at Store 1 and let Y denote the amount of a random purchase by a single customer at Store 2.

Assuming that X and Y satisfy a joint normal distribution, what is the distribution of X - Y?

What is the probability that the mean of X exceeds the mean of Y?
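One possible way to work this example is sketched below in Python; the slides do not give a solution, so this is a reading under stated assumptions. It assumes that X and Y are independent (the slide only says jointly normal), that the sample of 100 applies to each store, and that the unknown means and standard deviations are replaced by the observed sample values.

```python
import math
from statistics import NormalDist

n = 100                     # customers sampled (assumed: at each store)
x_bar, s_x = 41.25, 24.00   # Store 1: observed mean and standard deviation
y_bar, s_y = 45.75, 34.00   # Store 2: observed mean and standard deviation

# X - Y for two single purchases, assuming independence:
# mean x_bar - y_bar, variance s_x^2 + s_y^2.
diff_single = NormalDist(x_bar - y_bar, math.sqrt(s_x**2 + s_y**2))

# Difference of the two means, using the N(x_bar, s^2/n) approximation for each:
# mean x_bar - y_bar, variance s_x^2/n + s_y^2/n.
diff_means = NormalDist(x_bar - y_bar, math.sqrt(s_x**2 / n + s_y**2 / n))

# Probability that the mean of X exceeds the mean of Y: P(difference > 0).
p_mean_x_exceeds = 1 - diff_means.cdf(0)

print(f"X - Y           ~ N({diff_single.mean:.2f}, sd = {diff_single.stdev:.2f})")
print(f"mean difference ~ N({diff_means.mean:.2f}, sd = {diff_means.stdev:.2f})")
print(f"P(mean of X > mean of Y) ~ {p_mean_x_exceeds:.3f}")
```

If X and Y were correlated, the variance of X - Y would also include a covariance term, which the data given here do not allow us to estimate.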


Sampling

Example: In the quality control department of our company, knobs are inspected to make sure that they meet quality standards. Since it is not practical to test every knob, we draw a random sample to test. It is essential that our knobs weigh at least 0.45 pounds. If we know that the average weight is less than 0.45 pounds, we stop the production line and reset all the machines. In a day we produce 300,000 knobs, and draw a random sample of 1,000 knobs to test.

If yesterday (Wednesday) the observed sample mean was 0.42 pounds, and the observed sample standard deviation was 0.2, how confident are you that the average weight of knobs is less than 0.45 pounds?

If the average weight of knobs produced is 0.45 pounds, with a standard deviation of 0.2, what is the probability that the average weight of the sample will be 0.42 or lower?

Are these questions the same?
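The two questions can be set side by side numerically. The Python sketch below is one possible reading, not a solution given in the slides. It uses the normal approximation for the sample mean with n = 1,000 and ignores any finite-population effect from sampling 1,000 of 300,000 knobs (an assumption).

```python
import math
from statistics import NormalDist

n = 1000
threshold = 0.45       # required average weight, in pounds
x_bar, s = 0.42, 0.2   # observed sample mean and standard deviation

standard_error = s / math.sqrt(n)

# First question: the true mean is unknown. Using the approximation
# N(x_bar, s^2/n), how much mass lies below the 0.45 threshold?
confidence_below = NormalDist(x_bar, standard_error).cdf(threshold)

# Second question: the true mean is taken to be 0.45 with sigma = 0.2.
# How likely is a sample mean of 0.42 or lower under N(0.45, sigma^2/n)?
p_sample_low = NormalDist(threshold, standard_error).cdf(x_bar)

z = (x_bar - threshold) / standard_error
print(f"z-score of the observed mean: {z:.2f}")
print(f"first question:  {confidence_below:.6f}")
print(f"second question: {p_sample_low:.2e}")
```

Under this approximation the two numbers add up to 1, but they come from different setups: the first treats the true mean as the unknown quantity, while the second fixes the true mean and treats the sample mean as random.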