five-number summary minimum, first quartile, second quartile, third quartile, maximum minq 1 q 2 q 3...

five-number summary Minimum, First Quartile, Second Quartile, Third Quartile, Maximum (Min, Q1, Q2, Q3, Max)


Return to the definitions in Class Handout #1:

Now return to Exercise #1(i) in Class Handout #1:

From the stem-and-leaf display in part (g) for the variable “yearly income,” find the five-number summary, the median, the range, and the interquartile range.

1.-continued (i)

[Stem-and-leaf display of “yearly income” ($1000s) from part (g)]

Min = 25
Q1 = 33
Q2 = 40.5 = median
Q3 = 60
Max = 78

range = Max – Min = 78 – 25 = 53

IQR = Q3 – Q1 = 60 – 33 = 27

outlier an observation whose value is deemed to be either unusually high or unusually low relative to the other observations in the data set; one guideline is to consider any value which is a distance of more than 1.5(IQR) below Q1 or above Q3 a potential outlier

box plot a graphical display for quantitative sample data where two edges of a rectangular box mark the values of Q1 and Q3 , a line dividing the box into two sections marks the value of Q2 , and the ends of two lines extending from the sides of the box represent the minimum and maximum.

Often, the lines are stopped at the largest and smallest values which are not potential outliers, and all potential outliers are designated with dots.

(j) From the five-number summary found in part (i) for the variable “yearly income,” decide whether or not there are any potential outliers, and construct a box plot.

(1.5)IQR = (1.5)(27) = 40.5

Since 33 – 25 = 8 < 40.5, the minimum is not a potential outlier. Since 78 – 60 = 18 < 40.5, the maximum is not a potential outlier.
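The outlier guideline can also be checked with a few lines of Python. This is only a sketch: the function name `outlier_fences` is made up for illustration, and the five numbers are the ones found in part (i).

```python
# Five-number summary for "yearly income" ($1000s), taken from part (i).
minimum, q1, q2, q3, maximum = 25, 33, 40.5, 60, 78

def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper); values outside this interval are potential outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = outlier_fences(q1, q3)

print("range =", maximum - minimum)
print("IQR =", q3 - q1)
print("fences:", lower, "to", upper)
print("minimum is a potential outlier:", minimum < lower)
print("maximum is a potential outlier:", maximum > upper)
```

Comparing each value with the two fences is the same check as comparing its distance from Q1 or Q3 with 1.5(IQR).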

[Box plot of Yearly Income ($1000s), drawn on an axis running from 25 to 80]


center value in a distribution refers to a middle or average value for the distribution

Some measures of center for a distribution are as follows:

mean (denoted \(\bar{y}\)) the arithmetic average of n quantitative observations \(y_1, y_2, \ldots, y_n\), calculated by dividing the sum of the observations by the number of observations; that is,

\[ \bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum y}{n} \]

median the second quartile Q2 in the five-number summary

mode the most frequently occurring observation(s) if any exist

(With a quantitative variable, the mean and median are often of more interest than the mode; the mode is more useful with a qualitative variable, where computing a mean or median makes no sense as a general rule.)

dispersion in a distribution refers to the amount of variation (spread) in the distribution

Some measures of dispersion for a distribution are as follows:

interquartile range (IQR) the difference between the third and first quartiles, that is, Q3 – Q1

range the difference between the largest and smallest observations, that is, Max – Min

variance (denoted \(s^2\)) the sum of the squared deviations from the mean (i.e., the sum of the squares of \(y_1 - \bar{y},\ y_2 - \bar{y},\ \ldots,\ y_n - \bar{y}\)) divided by one less than the number of observations; that is,

\[ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \]

standard deviation (denoted s) the square root of the variance

1.-continued (k)

The yearly incomes ($1000s) for the eight Democrats in the sample are

28 75 26 78 40 60 49 39

Do each of the following for these eight incomes:

List the observations in ascending order.

26 28 39 40 49 60 75 78

Find the five-number summary.

Min = 26, Q1 = 33.5, Q2 = 44.5, Q3 = 67.5, Max = 78

Find the median.

44.5

Find the mean.

\[ \bar{y} = \frac{26 + 28 + 39 + 40 + 49 + 60 + 75 + 78}{8} = \frac{395}{8} = 49.375 \]

Find the variance and standard deviation.

\[ s^2 = \frac{(26 - 49.375)^2 + (28 - 49.375)^2 + \cdots + (78 - 49.375)^2}{8 - 1} = \frac{2787.875}{7} = 398.268 \]

\[ s = \sqrt{398.268} = 19.957 \]

These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics.
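As an alternative to the spreadsheet, the same statistics for the eight incomes can be checked with a short Python sketch. The handout does not spell out a quartile rule, so the helper `five_number_summary` (a name made up here) assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data. That assumption reproduces the values 33.5 and 67.5 found above.

```python
import statistics

incomes = [28, 75, 26, 78, 40, 60, 49, 39]  # yearly incomes ($1000s)

def five_number_summary(values):
    """Return (Min, Q1, Q2, Q3, Max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the ordered data (the middle value is left out of both halves
    when the number of observations is odd).
    """
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[len(data) - half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

summary = five_number_summary(incomes)
mean = statistics.mean(incomes)
variance = statistics.variance(incomes)  # sample variance: divides by n - 1
std_dev = statistics.stdev(incomes)      # square root of the sample variance

print("five-number summary:", summary)
print("mean =", mean)
print("variance =", round(variance, 3))
print("standard deviation =", round(std_dev, 3))
```

Note that `statistics.variance` and `statistics.stdev` divide by n – 1, which matches the definition of the variance given earlier.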

Find the mode. There is no mode, since no observations are repeated.

If the largest salary of 78 ($78,000) were changed to 175 ($175,000), what would be the values for the mean and median? What does this suggest about how the values of the mean and median are influenced by symmetry and skewness in a distribution?

The mean would be 61.5, but the median would stay equal to 44.5.


This can be done using the Excel spreadsheet named Summary_Statistics.

When a distribution is nearly symmetric, the mean and median will be close in value. If the mean is smaller than the median, the distribution is negatively skewed; if the mean is larger than the median, the distribution is positively skewed.
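The effect of changing the largest salary can be illustrated with a short Python sketch. It assumes only the eight incomes from the exercise, with 78 replaced by 175.

```python
import statistics

incomes = [28, 75, 26, 78, 40, 60, 49, 39]          # original data ($1000s)
changed = [175 if y == 78 else y for y in incomes]  # largest salary replaced

for label, data in (("original", incomes), ("changed", changed)):
    print(label,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data))
```

Only the mean moves, because every observation enters the sum, while the median depends only on the middle of the ordered list.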

shape of a distribution (symmetry and skewness) involves the type and amount of symmetry or non-symmetry present in the distribution

A distribution is called positively skewed when larger values tend to be more dispersed (perhaps resulting in a few unusually high values) and called negatively skewed when smaller values tend to be more dispersed (perhaps resulting in a few unusually small values). One measure of skewness in a distribution is as follows:

\[ \text{skewness ratio} = \frac{\bar{y} - \text{median}}{s} \]

(If the skewness ratio for a distribution is less than –0.3, then the distribution can be considered very negatively skewed, and if the skewness ratio for a distribution is greater than +0.3, then the distribution can be considered very positively skewed.)


(l) From the information displayed in part (d) for the variable “political party affiliation,” find the mode.

The mode is the category “Republican,” which is repeated most often.

What does the skewness ratio suggest about the shape of the distribution of the eight incomes?

\[ \frac{49.375 - 44.5}{19.957} = +0.24 \]

This suggests a somewhat positively skewed distribution.
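The skewness ratio for the eight incomes can be recomputed with a small Python sketch. The function name `skewness_ratio` is made up for illustration, and the ±0.3 cutoff is the guideline stated in the definition of the skewness ratio.

```python
import statistics

def skewness_ratio(values):
    """(mean - median) / s, where s is the sample standard deviation."""
    return ((statistics.mean(values) - statistics.median(values))
            / statistics.stdev(values))

incomes = [28, 75, 26, 78, 40, 60, 49, 39]  # yearly incomes ($1000s)
ratio = skewness_ratio(incomes)

print("skewness ratio =", round(ratio, 2))
print("very positively skewed:", ratio > 0.3)
print("very negatively skewed:", ratio < -0.3)
```

A positive ratio that stays below +0.3 is consistent with calling the distribution somewhat, but not very, positively skewed.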

2. For each of four levels of educational attainment, the distribution of ages for Americans at least 25 years old in 1984 is organized into a frequency polygon (from data collected by the U.S. Bureau of the Census and taken from the World Almanac and Book of Facts, 1986).

[Figure: four frequency polygons of relative frequency versus age (years), one for each level of education: Did Not Complete High School, Completed High School, 1 to 3 Years of College, 4 or More Years of College. Each horizontal axis, Age (years), is marked from 15 to 95; each vertical axis, Relative Frequency, is marked from 0 to 40.]

(a) Does the distribution of ages appear to be centered at different values for the different levels of education? If yes, for which level of education does the center appear to be smallest, and for which level of education does the center appear to be largest?

The ages for those not completing high school appear to be centered at a considerably higher value than for those in each of the other three levels of education.

(b) Does the distribution of ages appear to have a different amount of dispersion for the different levels of education? If yes, for which level of education does the dispersion appear to be smallest, and for which level of education does the dispersion appear to be largest?

The variation of ages appears to be roughly the same for the four levels of education.

(c) Does the distribution of ages appear to have a different shape for the different levels of education? If yes, how does the shape appear to differ?

None of the distributions appears to be symmetric. The distribution of ages looks negatively skewed for those not completing high school and positively skewed for each of the other three levels of education.

bar chart a graphical display for qualitative data where categories are listed on a horizontal axis and the height of a bar for each category represents a raw or relative frequency as indicated by the labels on a vertical axis.

pie chart a graphical display for qualitative data where relative frequency for each category is represented as a slice of a circle (pie) and the categories listed include all possibilities

histogram a graphical display for quantitative sample data where the possible numerical values are scaled on a horizontal axis and the height of a bar for each of several intervals of values represents a raw or relative frequency as indicated by the labels on a vertical axis.

frequency polygon a graphical display for quantitative sample data where the possible numerical values are scaled on a horizontal axis and dots placed at the middle of the top of where each bar for a histogram would be are connected to produce a rough “curve” - the proportion of observations that fall within a given interval of values is represented by the corresponding area under the “curve”

probability distribution curve a “smooth curve” designed to describe quantitative population data so that the proportion, or probability, of observations falling within a given interval of values is represented by the corresponding area under the “curve”

3. Suppose each of the box plots displayed represents the distribution of sample data selected from some population. For each box plot, make a sketch of what a corresponding histogram of the data could look like, and make a sketch of what the probability distribution curve for the corresponding population could look like.

Uniform Distribution

Bell-Shaped Distribution

Positively Skewed Distribution

Negatively Skewed Distribution

Now look at Class Handout #2.

Class Handout #2 Definitions

parameter a numerical quantity which describes some characteristic of a population

statistic a numerical quantity which describes some characteristic of a sample


The symbol \(\bar{x}\) is used to represent the mean for a sample. The symbol \(\mu\) is used to represent the mean for a population. The symbol s is used to represent the standard deviation for a sample. The symbol \(\sigma\) is used to represent the standard deviation for a population. We see then that \(\mu\) and \(\sigma\) are parameters, and that \(\bar{x}\) and s are statistics.

1. For each situation, identify the experimental unit, the population, the sample, the parameter, and the statistic.

(a) A state government official computes the mean yearly amount of spending per pupil for 75 selected public school districts in the state in order to draw a conclusion about the mean yearly amount of spending per pupil for all public school districts in the state.

The experimental unit is each public school district in the state.

The population is all public school districts in the state.

The sample is the 75 selected public school districts.

The parameter is the mean yearly amount of spending per pupil for all public school districts in the state.

The statistic is the mean yearly amount of spending per pupil for the 75 selected public school districts in the state.

(b) A pollster surveys 427 selected voters in a state by phone and calculates the proportion intending to vote for the incumbent governor in order to draw a conclusion about the proportion of all voters in the state intending to vote for the incumbent governor.

The experimental unit is each voter in the state.

The population is all voters in the state.

The sample is the 427 selected voters.

The parameter is the proportion of all voters in the state intending to vote for the incumbent governor.

The statistic is the proportion of the 427 surveyed voters intending to vote for the incumbent governor.


Tchebysheff’s Theorem a statement of the following facts (requiring calculus to prove):


For any data set (population or sample), at least 75% (three fourths) of the measurements must lie within two standard deviations of the mean; that is, at least 75% (3/4) of a data set must lie between the values

mean – 2(standard deviation) and mean + 2(standard deviation).

For any data set (population or sample), at least 8/9 (about 89%) of the measurements must lie within three standard deviations of the mean; that is, about 89% or more of a data set must lie between the values

mean – 3(standard deviation) and mean + 3(standard deviation).
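Both percentages are instances of the general bound 1 – 1/k² for k standard deviations. That formula is the usual statement of the theorem rather than something given in the handout. The Python sketch below uses two illustrative helper names, `chebyshev_lower_bound` and `proportion_within`, and applies them to the eight yearly incomes from Class Handout #1.

```python
import statistics

def chebyshev_lower_bound(k):
    """Guaranteed minimum proportion within k standard deviations (k > 1)."""
    return 1 - 1 / k**2

def proportion_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = [v for v in values if mean - k * sd <= v <= mean + k * sd]
    return len(inside) / len(values)

incomes = [28, 75, 26, 78, 40, 60, 49, 39]  # yearly incomes ($1000s)

for k in (2, 3):
    print("k =", k,
          "guaranteed at least", round(chebyshev_lower_bound(k), 3),
          "observed", proportion_within(incomes, k))
```

The observed proportion can be far above the bound; the theorem only promises that it is never below it.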

2. For each data set (all of which are displayed in a stem-and-leaf display),

(i) verify that the given mean, the given standard deviation, and the given five-number summary are correct;

(ii) find the proportion of measurements which lie within two standard deviations of the mean, and compare this proportion with what Tchebysheff’s Theorem states.

(a)

10 | 9
11 | 5
12 |
13 |
14 |
15 |
16 |
17 | 8
18 | 1 5
19 | 1 4 6 8 9
20 | 1 2 4 6 9
21 | 5 9
22 | 2
23 |
24 |
25 |
26 |
27 |
28 | 5
29 | 1

mean = 200

standard deviation = 41.94

Min = 109, Q1 = 188, Q2 = 200, Q3 = 212, Max = 291

These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics.

The interval within two standard deviations of the mean is from 200 – 2(41.94) to 200 + 2(41.94), that is, from 116.12 to 283.88.

The proportion of measurements which lie within two standard deviations of the mean is 16/20 = 80%.

Tchebysheff’s Theorem states that this proportion will be at least 75%.

(b)

10 |
11 |
12 | 9
13 | 5
14 |
15 |
16 |
17 | 8
18 | 1 5
19 | 1 4 6 8 9
20 | 1 2 4 6 9
21 | 5 9
22 | 2
23 |
24 |
25 |
26 | 5
27 | 1
28 |
29 |

mean = 200

standard deviation = 33.20

Min = 129, Q1 = 188, Q2 = 200, Q3 = 212, Max = 271

These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics.

The interval within two standard deviations of the mean is from 200 – 2(33.20) to 200 + 2(33.20), that is, from 133.60 to 266.40.

The proportion of measurements which lie within two standard deviations of the mean is 18/20 = 90%.

Tchebysheff’s Theorem states that this proportion will be at least 75%.

Note that in both data sets, the two largest measurements and the two smallest measurements are all potential outliers.
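The checks for both data sets can be repeated with a short Python sketch. The individual values below are an assumption: they were read off the two stem-and-leaf displays and are not listed explicitly in the handout, although they do reproduce the stated means, standard deviations, and five-number summaries. The helper names are made up for illustration.

```python
import statistics

# Assumption: values as read from the stem-and-leaf displays in (a) and (b).
middle = [178, 181, 185, 191, 194, 196, 198, 199,
          201, 202, 204, 206, 209, 215, 219, 222]
data_a = [109, 115] + middle + [285, 291]
data_b = [129, 135] + middle + [265, 271]

def proportion_within(values, k=2):
    """Proportion of values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return sum(mean - k * sd <= v <= mean + k * sd for v in values) / len(values)

def potential_outliers(values, q1, q3):
    """Values more than 1.5(IQR) below Q1 or above Q3."""
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

for name, data in (("(a)", data_a), ("(b)", data_b)):
    print(name,
          "mean =", statistics.mean(data),
          "s =", round(statistics.stdev(data), 2),
          "within 2 s:", proportion_within(data),
          "potential outliers:", potential_outliers(data, q1=188, q3=212))
```

Both observed proportions stay above the 75% guaranteed by the theorem, even though each data set contains four potential outliers.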

For most bell-shaped (or mound-shaped) distributions with no outliers, about 95% of the measurements will lie within two standard deviations of the mean.

3. A random sample of 30 days of hotel reservations is selected to obtain information about the distribution of the daily number of “no shows”. Each value recorded below is the number of room reservation bookings where the party did not show up and did not cancel the reservation.

18 16 16 16 14 18 16 18 14 19
15 19 9 20 10 10 12 14 18 12
14 14 17 12 18 13 15 13 15 19

(a) Identify the experimental unit, the variable of interest, the population, and the sample.

The experimental unit is each day.

The variable of interest is the number of “no shows” among hotel bookings for a day.

The population is all days when hotel reservations are made.

The sample is the 30 selected days.

(b) Are the mean and standard deviation for this data parameters or statistics?

The mean and standard deviation calculated from this random sample are statistics. (Parameters are numerical quantities which describe characteristics of a population.)

(c) Use the Excel files Summary_Statistics and M214_Data to find the mean and standard deviation for this data, and comment on the type of distribution the data appears to have.

mean = \(\bar{x}\) = 15.133

standard deviation = s = 2.945

The data appears to have a somewhat bell-shaped (mound-shaped) distribution, and there are no outliers.

(d) We shall use the mean and standard deviation from part (c) as estimates of the mean and standard deviation for the population. Explain why it is reasonable to assume that about 95% of all days will have a number of “no shows” within two standard deviations of the mean, and then find this interval.

Since the data appears to have a somewhat bell-shaped (mound-shaped) distribution, and there are no outliers, we expect about 95% of all measurements to be within two standard deviations of the mean.

The interval within two standard deviations of the mean is from 15.133 – 2(2.945) to 15.133 + 2(2.945), that is, from 9.243 to 21.023.

(e) From the interval found in part (d), how many rooms per day can the hotel overbook and still feel confident that all reservations will be honored?

Since it appears that the number of “no shows” each day will almost always be at least 9 or 10, a hotel might feel confident with overbooking 9 or 10 rooms per day.
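For readers without the Excel files, the mean, standard deviation, and two-standard-deviation interval for the 30 “no show” counts can be recomputed with a short Python sketch. The variable names are illustrative. Because the sketch keeps full precision, the interval endpoints may differ in the last decimal place from those obtained with the rounded values 15.133 and 2.945.

```python
import statistics

# Daily numbers of "no shows" for the 30 sampled days.
no_shows = [18, 16, 16, 16, 14, 18, 16, 18, 14, 19,
            15, 19,  9, 20, 10, 10, 12, 14, 18, 12,
            14, 14, 17, 12, 18, 13, 15, 13, 15, 19]

mean = statistics.mean(no_shows)
sd = statistics.stdev(no_shows)          # sample standard deviation (n - 1)
low, high = mean - 2 * sd, mean + 2 * sd

# Share of the 30 sampled days that fall inside the interval.
within = sum(low <= x <= high for x in no_shows) / len(no_shows)

print("mean =", round(mean, 3))
print("standard deviation =", round(sd, 3))
print("interval: from", round(low, 3), "to", round(high, 3))
print("proportion of sampled days inside the interval:", round(within, 3))
```

In this sample only the day with 9 “no shows” falls just outside the interval, so the observed proportion is close to the 95% expected for a bell-shaped distribution.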

Before submitting Homework #1, check some of the answers (if you haven’t done so already) from the link on the course schedule:

http://srv2.lycoming.edu/~sprgene/M214/Schedule214.htm
