the empirical (68-95-99.7) rulefisher.utstat.toronto.edu/~hadas/sta220/lecture notes... ·...


Page 1: The empirical (68-95-99.7) rulefisher.utstat.toronto.edu/~hadas/STA220/Lecture notes... · 2007-09-23 · week3 1 The empirical (68-95-99.7) rule • With a bell shaped distribution,

week3 1

The empirical (68-95-99.7) rule

• With a bell-shaped distribution, about 68% of the data fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean.

• What if the distribution is not bell-shaped? Another rule, Chebyshev's Rule, tells us that at least 75% of the data must lie within 2 standard deviations of the mean, regardless of the shape, and at least 89% within 3 standard deviations.
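The 75% and 89% figures come from Chebyshev's bound, which guarantees that at least 1 - 1/k^2 of the data lie within k standard deviations of the mean. A minimal Python sketch of that arithmetic follows; the function name `chebyshev_lower_bound` is only an illustrative choice.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1.0 - 1.0 / k**2


for k in (2, 3):
    # k = 2 gives 0.75 and k = 3 gives 8/9, i.e. about 89%
    print(f"within {k} SD: at least {chebyshev_lower_bound(k):.1%}")
```

The bound holds for any shape, which is why it is weaker than the 95% and 99.7% quoted for bell-shaped data.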


Linear transformations

• A linear transformation changes the original value x into a new variable $x_{new}$.

• $x_{new}$ is given by an equation of the form $x_{new} = a + bx$.

• Example 1.19 on page 54 in IPS.
(i) A distance x measured in km can be expressed in miles as follows: $x_{new} = 0.62x$.
(ii) A temperature x measured in degrees Fahrenheit can be converted to degrees Celsius by $x_{new} = \frac{5}{9}(x - 32) = -\frac{160}{9} + \frac{5}{9}x$.


Effect of a Linear Transformation

• Multiplying each observation in a data set by a number b multiplies the measures of center (mean, median, and trimmed means) by b and the measures of spread (range, standard deviation and IQR) by |b|, the absolute value of b.

• Adding the same number a to each observation in a data set adds a to measures of center, quartiles and percentiles but does not change the measures of spread.

• Linear transformations do NOT change the overall shape of a distribution.
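These effects can be checked numerically. The sketch below applies the Fahrenheit-to-Celsius conversion (a = -160/9, b = 5/9) to a small made-up list of temperatures and compares the mean and standard deviation before and after. The data values and the helper name `linear_transform` are assumptions for illustration only.

```python
import statistics


def linear_transform(values, a, b):
    """Apply x_new = a + b * x to every observation."""
    return [a + b * x for x in values]


temps_f = [32.0, 50.0, 68.0, 86.0, 104.0]  # hypothetical readings in Fahrenheit
a, b = -160 / 9, 5 / 9                     # Fahrenheit -> Celsius
temps_c = linear_transform(temps_f, a, b)

mean_f, sd_f = statistics.mean(temps_f), statistics.stdev(temps_f)
mean_c, sd_c = statistics.mean(temps_c), statistics.stdev(temps_c)

# The mean follows the same rule as the data; the spread is only scaled by |b|.
print(mean_c, a + b * mean_f)
print(sd_c, abs(b) * sd_f)
```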


Measure   x      xnew
Mean      x̄      a + bx̄
Median    M      a + bM
Mode      Mode   a + bMode
Range     R      |b|R
IQR       IQR    |b|IQR
Stdev     s      |b|s


Example 1

• A sample of 20 employees of a company was taken and their salaries were recorded. Suppose each employee receives a $300 raise in salary for the next year. State whether the following statements are true or false.

a) The IQR of the salaries will
i. be unchanged
ii. increase by $300
iii. be multiplied by $300

b) The mean of the salaries will
i. be unchanged
ii. increase by $300
iii. be multiplied by $300


Nonlinear transformations

• A very common nonlinear transformation in statistics is the logarithm transformation.

• Recall: ln x = log_e x, where e is the base of the natural logarithm, e ≈ 2.7183.

• If measurements on a variable x have a right-skewed distribution, the distribution of ln x is often roughly symmetric, or at least less skewed.

• If measurements on a variable x have a left-skewed distribution, the distribution of ln x will be even more left-skewed.
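The effect on right-skewed data can be illustrated with simulated values. This sketch draws hypothetical "sales" figures from a lognormal distribution (an assumption chosen because its logarithm is exactly normal) and compares a simple moment-based skewness before and after taking logs.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


random.seed(0)  # fixed seed so the run is repeatable
sales = [random.lognormvariate(6.0, 1.0) for _ in range(2000)]  # hypothetical right-skewed data
log_sales = [math.log(v) for v in sales]

print("skewness of sales:    ", skewness(sales))      # clearly positive
print("skewness of ln(sales):", skewness(log_sales))  # close to zero
```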


Example 2 - Nonlinear transformations

[Figures not reproduced: "Histogram for sales data" (x-axis: Sales, y-axis: Frequency) and "Histogram for ln(sales)" (x-axis: ln(sales), y-axis: Frequency).]


Density curves

• Using software, clever algorithms can describe a distribution in a way that is not feasible by hand, by fitting a smooth curve to the data in addition to or instead of a histogram. The curves used are called density curves.

• It is easier to work with a smooth curve, because a histogram depends on the choice of classes.

• Density curve: a density curve is a curve that
is always on or above the horizontal axis;
has area exactly 1 underneath it.

• A density curve describes the overall pattern of a distribution.


• The area under the curve and above any range of values is the relative frequency (proportion) of all observations that fall in that range of values.

• Example: The curve below shows the density curve for scores in an exam, and the area of the shaded region is the proportion of students who scored between 60 and 80.


Median and mean of Density Curve

• The median of a distribution described by a density curve is the point that divides the area under the curve in half.

• A mode of a distribution described by a density curve is a peak point of the curve, the location where the curve is highest.

• Quartiles of a distribution can be roughly located by dividing the area under the curve into quarters as accurately as possible by eye.


Normal distributions

• An important class of density curves are the symmetric unimodal bell-shaped curves known as normal curves. They describe normal distributions.

• All normal distributions have the same overall shape.

• The exact density curve for a particular normal distribution is specified by giving its mean μ and its standard deviation σ.

• The mean is located at the center of the symmetric curve and is the same as the median and the mode.

• Changing μ without changing σ moves the normal curve along the horizontal axis without changing its spread.


• The standard deviation σ controls the spread of a normal curve.


• There are other symmetric bell-shaped density curves that are not normal, e.g. the t distribution.

• The normal density curves are specified by a particular function. The height of a normal density curve at any point x is given by

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$

• Notation: A normal distribution with mean μ and standard deviation σ is denoted by N(μ, σ).
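As a quick check of the normal density formula, the height of the curve can be evaluated directly. The following is a small sketch; the function name `normal_density` is an illustrative choice.

```python
import math


def normal_density(x, mu, sigma):
    """Height of the N(mu, sigma) density curve at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# The curve is highest at the mean and symmetric about it.
print(normal_density(0.0, 0.0, 1.0))
print(normal_density(62.0, 64.5, 2.5), normal_density(67.0, 64.5, 2.5))
```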


The 68-95-99.7 rule

In the normal distribution with mean μ and standard deviation σ,

Approx. 68% of the observations fall within σ of the mean μ.
Approx. 95% of the observations fall within 2σ of the mean μ.
Approx. 99.7% of the observations fall within 3σ of the mean μ.
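The three percentages are rounded values of areas under the normal curve. One way to see the unrounded figures is through the error function, since the area within k standard deviations of the mean equals erf(k/√2). A minimal sketch using only the Python standard library:

```python
import math


def proportion_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {proportion_within(k):.4f}")
```

The printed values round to 68%, 95% and 99.7%, which is where the rule gets its name.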


Example 1.23 on p72 in IPS

• The distribution of heights of women aged 18-24 is approximately N(64.5, 2.5), that is, normal with mean μ = 64.5 inches and standard deviation σ = 2.5 inches.

• The 68-95-99.7 rule says that the middle 95% (approx.) of women are between 64.5 - 5 and 64.5 + 5 inches tall. The other 5% have heights outside the range from 59.5 to 69.5 inches, and 2.5% of the women are taller than 69.5 inches.

• Exercise:
1) The middle 68% (approx.) of women are between ____ and ____ inches tall.
2) ___% of the women are taller than 66.75.
3) ___% of the women are taller than 72.


Standardizing and z-scores

• If x is an observation from a distribution that has mean μ and standard deviation σ, the standardized value of x is given by

$z = \frac{x - \mu}{\sigma}$

• A standardized value is often called a z-score.

• A z-score tells us how many standard deviations the original observation falls away from the mean of the distribution.

• Standardizing is a linear transformation that transforms the data into the standard scale of z-scores. Therefore, standardizing does not change the shape of a distribution, but it changes the mean to 0 and the standard deviation to 1.


Example 1.24 on p73 in IPS

• The heights of women are approximately normal with mean μ = 64.5 inches and standard deviation σ = 2.5 inches.

• The standardized height is $z = \frac{\text{height} - 64.5}{2.5}$.

• The standardized value (z-score) of a height of 68 inches is $z = \frac{68 - 64.5}{2.5} = 1.4$, or 1.4 std. dev. above the mean.

• A woman 60 inches tall has standardized height $z = \frac{60 - 64.5}{2.5} = -1.8$, or 1.8 std. dev. below the mean.
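The two standardized heights in this example can be reproduced in a couple of lines. The helper name `z_score` is an illustrative choice.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


print(z_score(68, 64.5, 2.5))  # 1.4 standard deviations above the mean
print(z_score(60, 64.5, 2.5))  # 1.8 standard deviations below the mean (negative sign)
```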


The Standard Normal distribution

• The standard normal distribution is the normal distribution N(0, 1), that is, the mean is μ = 0 and the standard deviation is σ = 1.

• If a random variable X has normal distribution N(μ, σ), then the standardized variable

$Z = \frac{X - \mu}{\sigma}$

has the standard normal distribution.

• Areas under a normal curve represent proportions of observations from that normal distribution.

• There is no simple closed-form formula for areas under a normal curve. Calculations use either software or a table of areas. The table and most software calculate one kind of area: cumulative proportions. A cumulative proportion is the proportion of observations in a distribution that fall at or below a given value; it is also the area under the curve to the left of that value.


The standard normal tables

• Table A gives cumulative proportions for the standard normal distribution. The table entry for each value z is the area under the curve to the left of z; the notation used is P(Z ≤ z).

e.g. P( Z ≤ 1.4 ) = 0.9192


Standard Normal Distribution

z      .00   .01   .02   .03   .04   .05   .06   .07   .08   .09
0.0  .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.1  .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.2  .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.3  .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.4  .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.5  .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.6  .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
0.7  .7580 .7611 .7642 .7673 .7703 .7734 .7764 .7794 .7823 .7852
0.8  .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.9  .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.0  .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.1  .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.2  .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.3  .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.4  .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.5  .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
1.6  .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545
1.7  .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633
1.8  .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
1.9  .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767
2.0  .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817
2.1  .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857
2.2  .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890
2.3  .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916
2.4  .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936
2.5  .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952
2.6  .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964
2.7  .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974
2.8  .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981
2.9  .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986
3.0  .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990

The table shows area to left of ‘z’ under standard normal curve

For a negative number, -z : Area below (-z) = Area above (z) =1 – Area below (z)


The standard normal tables - Example

• What proportion of the observations from a N(0,1) distribution take values

a) less than z = 1.4 ?

b) greater than z = 1.4 ?

c) greater than z = -1.96 ?

d) between z = 0.43 and z = 2.15 ?
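The four parts can be checked against software as well as the table. The sketch below computes cumulative proportions from the error function rather than a table lookup; `phi` is an illustrative name for the standard normal cumulative proportion P(Z ≤ z).

```python
import math


def phi(z):
    """Cumulative proportion P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


less_than = phi(1.4)                  # a) area to the left of 1.4, about 0.9192
greater_than = 1 - phi(1.4)           # b) area to the right of 1.4, about 0.0808
greater_than_neg = 1 - phi(-1.96)     # c) area to the right of -1.96, about 0.9750
between = phi(2.15) - phi(0.43)       # d) area between 0.43 and 2.15, about 0.3178

for label, value in [("a", less_than), ("b", greater_than),
                     ("c", greater_than_neg), ("d", between)]:
    print(f"{label}) {value:.4f}")
```

Part c) uses the symmetry of the curve: the area above -1.96 equals the area below 1.96.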


Properties of Normal distribution

• If a random variable Z has a N(0,1) distribution, then P(Z = z) = 0: the area under the curve above any single point is 0.

• The area between any two points a and b (a < b) under the standard normal curve is given by

P(a ≤ Z ≤ b) = P(Z ≤ b) – P(Z ≤ a)

• As mentioned earlier, if a random variable X has a N(μ, σ) distribution, then the standardized variable $Z = \frac{X - \mu}{\sigma}$ has a standard normal distribution, and any calculations about X can be done using the following rules:


• $P(X \le a) = P\left(Z \le \frac{a - \mu}{\sigma}\right)$

• $P(X \ge b) = 1 - P\left(Z \le \frac{b - \mu}{\sigma}\right)$

• $P(a \le X \le b) = P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right)$

• P(X = k) = 0 for all k.

• The solution to the equation P(X ≤ k) = p is $k = \mu + \sigma z_p$, where $z_p$ is the value z from the standard normal table that has area (and cumulative proportion) p below it, i.e. $z_p$ is the 100p-th percentile of the standard normal distribution.
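These rules can be tried out with Python's `statistics.NormalDist` (available from Python 3.8). The sketch below reuses the women's heights model N(64.5, 2.5) from the earlier example; the cut-offs a = 60 and b = 68 and the proportion p = 0.90 are values picked only for illustration.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)  # X ~ N(64.5, 2.5)
std_normal = NormalDist()                 # Z ~ N(0, 1)

a, b = 60.0, 68.0
z_a = (a - heights.mean) / heights.stdev  # standardized cut-offs
z_b = (b - heights.mean) / heights.stdev

p_below_a = std_normal.cdf(z_a)                        # P(X <= a)
p_above_b = 1 - std_normal.cdf(z_b)                    # P(X >= b)
p_between = std_normal.cdf(z_b) - std_normal.cdf(z_a)  # P(a <= X <= b)

# Going backwards: the value k with P(X <= k) = p is mu + sigma * z_p.
p = 0.90
k = heights.mean + heights.stdev * std_normal.inv_cdf(p)

print(p_below_a, p_above_b, p_between, k)
```

Working on the Z scale, as here, mirrors a table lookup; calling `heights.cdf(a)` directly should give the same proportions.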


Questions

1. The marks of STA221 students have a N(65, 15) distribution. Find the proportion of students having marks
(a) less than 50.
(b) greater than 80.
(c) between 50 and 80.

2. Example 1.30 on page 79 in IPS: Scores on the SAT verbal test follow approximately the N(505, 110) distribution. How high must a student score in order to place in the top 10% of all students taking the SAT?

3. The time it takes to complete a stat220 term test is normally distributed with mean 100 minutes and standard deviation 14 minutes. How much time should be allowed if we wish to ensure that at least 9 out of 10 students (on average) can complete it? (final exam Dec. 2001)


4. General Motors of Canada has a deal: 'an oil filter and lube job in 25 minutes or the next one free'. Suppose that you worked for GM and knew that the time needed to provide these services was approximately normal with mean 15 minutes and std. dev. 2.5 minutes. How many minutes would you have recommended to put in the ad above if it was decided that about 5 free services for 100 customers was reasonable?

5. In a survey of patients of a rehabilitation hospital the mean length of stay in the hospital was 12 weeks with a std. dev. of 1 week. The distribution was approximately normal.

a) Out of 100 patients how many would you expect to stay longer than 13 weeks?
b) What is the percentile rank of a stay of 11.3 weeks?
c) What percentage of patients would you expect to be in longer than 12 weeks?
d) What is the length of stay at the 90th percentile?
e) What is the median length of stay?


Normal quantile plots and their use

• A histogram or stemplot can reveal distinctly nonnormal features of a distribution.

• If the stemplot or histogram appears roughly symmetric and unimodal, we use another graph, the normal quantile plot, as a better way of judging the adequacy of a normal model.

• Any normal distribution produces a straight line on the plot.

• Use of normal quantile plots: If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are approximately normal. Systematic deviations from a straight line indicate a nonnormal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
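The coordinates behind a normal quantile plot can be sketched as follows: sort the data and pair the i-th smallest value with the standard normal quantile of a plotting position. The position (i - 0.5)/n used here is one common convention and is an assumption of this sketch; MINITAB and other packages may use a slightly different formula. The data values are made up for illustration.

```python
from statistics import NormalDist


def normal_scores(n):
    """Standard normal quantiles at plotting positions (i - 0.5) / n, i = 1..n."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


data = [512.0, 488.0, 530.0, 471.0, 503.0, 497.0, 519.0]  # hypothetical sample
pairs = list(zip(normal_scores(len(data)), sorted(data)))

# Each pair is one point of the plot: (normal score, observed value).
for score, value in pairs:
    print(f"{score:6.3f}  {value:6.1f}")
```

If the printed points fall near a straight line when plotted, a normal model is a reasonable description of the data.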

• Histogram, the nscores plot and the normal quantile plot for data generated from a normal distribution (N(500, 20)).

[Figures not reproduced: histogram (x-axis: value, y-axis: Frequency), nscores plot (x-axis: nscores, y-axis: value) and "Normal Probability Plot for value" (x-axis: Data, y-axis: Percent). ML Estimates: Mean 500.343, StDev 17.4618.]

• Histogram, the nscores plots and the normal quantile plot for data generated from a right skewed distribution.

[Figures not reproduced: histogram (x-axis: value, y-axis: Frequency), nscores plots (nscores against value) and "Normal Probability Plot for value" (x-axis: Data, y-axis: Percent). ML Estimates: Mean 2.64938, StDev 2.17848.]

• Histogram, the nscores plots and the normal quantile plot for data generated from a left skewed distribution.

[Figures not reproduced: histogram (x-axis: value, y-axis: Frequency), nscores plots (nscore against value) and "Normal Probability Plot for value" (x-axis: Data, y-axis: Percent). ML Estimates: Mean 0.8102, StDev 0.161648.]

• Histogram, the nscores plots and the normal quantile plot for data generated from a uniform distribution (0,5).

[Figures not reproduced: histogram (x-axis: value, y-axis: Frequency), nscores plots (nscores against value) and "Normal Probability Plot for value" (x-axis: Data, y-axis: Percent). ML Estimates: Mean 2.21603, StDev 1.46678.]


Question (similar to Q5 Term test Oct, 2000)

Below are 4 normal probability (quantile) plots and 4 histograms produced by MINITAB for some data sets. The histograms are not in the same order as normal scores plots.

Match the histograms with the nscores plots.

[Figures not reproduced: four histograms (x-axis: data, y-axis: Frequency) and four normal scores plots (x-axis: nscores, y-axis: data).]


Looking at data - relationships

• Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable.

• When examining the relationship between two or more variables, we should first think about the following questions:
– What individuals do the data describe?
– What variables are present? How are they measured?
– Which variables are quantitative and which are categorical?
– Is the purpose of the study simply to explore the nature of the relationship, or do we hope to show that one variable can explain variation in the other?


Response and explanatory variables

• A response variable measures an outcome of a study. An explanatory variable explains or causes changes in the response variable.

• Explanatory variables are often called independent variables and response variables are called dependent variables. The idea behind this is that response variables depend on explanatory variables.

• We usually call the explanatory variable x and the response variable y.


Scatterplot

• A scatterplot shows the relationship between two quantitative variables measured on the same individuals.

• Each individual in the data appears as a point in the plot fixed by the values of both variables for that individual.

• Always plot the explanatory variable, if there is one, on the horizontal axis (the x axis) of a scatterplot.

• Examining and interpreting scatterplots
– Look for the overall pattern and striking deviations from that pattern.
– The overall pattern of a scatterplot can be described by the form, direction and strength of the relationship.
– An important kind of deviation is an outlier, an individual value that falls outside the overall pattern.


Example

• There is some evidence that drinking moderate amounts of wine helps prevent heart attack. A data set contains information on yearly wine consumption (litres per person) and yearly deaths from heart disease (deaths per 100,000 people) in 19 developed nations. Answer the following questions.

• What is the explanatory variable?
• What is the response variable?
• Examine the scatterplot below.

[Figure not reproduced: scatterplot of heart disease deaths (y-axis) against wine consumption (x-axis).]


• Interpretation of the scatterplot
– The pattern is fairly linear with a negative slope. No outliers.
– The direction of the association is negative. This means that higher levels of wine consumption are associated with lower death rates.
– This does not mean there is a causal effect. There could be lurking variables. For example, higher wine consumption could be linked to higher income, which would allow better medical care.

• MINITAB command for scatterplot: Graph > Plot


Categorical variables in scatterplots

• To add a categorical variable to a scatterplot, use a different colour or symbol for each category.

• The scatterplot below shows the relationship between the world record times for the 10,000m run and the year for both men and women.

[Figure not reproduced: scatterplot of Time (seconds) on the y-axis against Year on the x-axis, with separate symbols for women (F) and men (M).]


Categorical explanatory variables

• Scatterplots display the association between two quantitative variables.

• To display a relationship between a categorical explanatory variable and a quantitative response variable, make a side-by-side comparison of the distributions of the response for each category.

• A back-to-back stemplot compares two distributions.

• Side-by-side boxplots compare any number of distributions.


Example

We want to investigate the association between how much education a person has and his/her income.

Education appears as a categorical variable: 1 = did not reach high school, 2 = some high school but no high school diploma, and so on up to 6 = postgraduate degree.

Order the categories and make side-by-side boxplots for the income.


• The side-by-side boxplots show a strong positive association between education and earnings.


Correlation

• A scatterplot displays the form, direction and strength of the relationship between two quantitative variables.

• Correlation (denoted by r) measures the direction and strength of the linear relationship between two quantitative variables.

• Suppose that we have data on variables x and y for n individuals. The correlation r between x and y is given by

$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{1}{n-1}\cdot\frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{s_x s_y}$


Example

• Family income and annual savings, in thousands of dollars, for a sample of eight families are given below. C3 and C4 hold the standardized savings and income, and C5 is their product.

savings  income  C3        C4        C5
1        36      -1.42887  -1.45101  2.07331
2        39      -1.02062  -1.03144  1.05271
3        42      -0.61237  -0.61187  0.37469
4        45      -0.20412  -0.19230  0.03925
5        48       0.20412   0.22727  0.04433
6        51       0.61237   0.64684  0.39611
7        54       1.02062   1.06641  1.08840
8        56       1.42887   1.34612  1.92343

Sum of C5 = 6.99429

• r = 6.99429/7 = 0.999185

• MINITAB command: Stat > Basic Statistics > Correlation
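The calculation of r in this example can be retraced step by step: standardize each variable, multiply the standardized values pairwise, sum, and divide by n - 1. The sketch below assumes savings of 1 through 8 (thousand dollars), which is what the equally spaced standardized values in column C3 imply; the incomes are those listed in the table.

```python
import math

savings = [1, 2, 3, 4, 5, 6, 7, 8]          # assumed from the standardized column C3
income = [36, 39, 42, 45, 48, 51, 54, 56]   # thousands of dollars


def standardize(values):
    """Return z-scores using the sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores divided by n - 1."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


r = correlation(savings, income)
print(round(r, 6))  # agrees with the slide's 0.999185
```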


Properties of correlation

• Correlation requires both variables to be quantitative and makes no use of the distinction between explanatory and response variables.

• Because r uses standardized values of observations, it does not depend on the units of measurement of x and y. Correlation r has no unit of measurement.

• Positive r indicates positive association between the variables and negative r indicates negative association.

• Correlation measures the strength of only the linear relationship between two variables; it does not describe curved relationships!

• r is always a number between –1 and 1. Values of r near 0 indicate a weak linear relationship. The strength of the linear relationship increases as r moves away from 0. Values of r close to –1 or 1 indicate that the points lie close to a straight line.

• r is not resistant: it is strongly affected by a few outliers.


Question from Term test, summer 99

• MINITAB analyses of math and verbal SAT scores are given below.

Variable  N    Mean    Median  TrMean  StDev   SE Mean
Verbal    200  595.65  586.00  595.57  73.21   5.18
Math      200  649.53  649.00  650.37  66.35   4.69
GPA       200  2.6300  2.6000  2.6439  0.5803  0.0410

Variable  Minimum  Maximum
Verbal    361.00   780.00
Math      441.00   800.00
GPA       0.3000   3.9000

Stem-and-leaf of Verbal  N = 200
Leaf Unit = 10

   1  3  6
   4  4  034
  19  4  566888888889999
  52  5  000000122222222333333333444444444
(56)  5  55555555555556666666777777777777778888888888888889999999
  92  6  00000000011111111222222333333333444444444444444
  45  6  555555666666666778888888889999
  15  7  0011112244
   5  7  55568


Stem-and-leaf of Math  N = 200
Leaf Unit = 10

   1  4  4
   3  4  79
  12  5  001222234
  38  5  55555666677777778888889999
(63)  6  000000000000001111111111112222222222222222333333333344444444444
  99  6  555555555666666666666667777777777788888889999999
  51  7  000000000011111111111112222222333334444
  12  7  5566777789
   2  8  00

[Figures not reproduced: histogram of Math scores and histogram of Verbal scores (x-axis: score from 400 to 800, y-axis: Frequency).]


a) Find the 25th percentile, 75th percentile and the IQR of the math SAT scores.

b) You were one of the students of this study and your math SAT score was 532. What is your z-score and percentile standing?

c) If the math SAT scores were in fact left (negatively) skewed, but the mean was still 650, what could you say about the percentile standing of someone who obtains a score of 650?

d) What is the class width
i) of the histogram for verbal SAT scores?
ii) of the stemplot of the verbal SAT scores?

e) Describe both the verbal and math score distributions and compare one with the other.


g) Give a rough sketch of how a normal probability plot would look if the verbal scores were
i. Right (positively) skewed
ii. Uniform in shape

h) For verbal scores, aside from running through the data and tallying, can you determine the approx. percentage of scores which fall between 523 and 668? If so give the percentage.


Question (Term Test May 98)

• Descriptive statistics of scores of 3 groups of students are given below.

Variable  Group  N   Mean   Median  TrMean  StDev
Post1     B      22  6.682  6.500   6.650   2.767
          D      22  9.773  10.000  9.800   2.724
          S      22  7.773  7.000   7.750   3.927

• Using the information above, estimate the following in some reasonable way. State any assumptions that you have to make.
(a) The 90th percentile of the post1 scores using method B.
(b) The proportion of post1 scores that would be 7 or higher for those using method D.