five-number summary minimum, first quartile, second quartile, third quartile, maximum minq 1 q 2 q 3...
Post on 21-Dec-2015
TRANSCRIPT
Return to the definitions in Class Handout #1:
five-number summary Minimum, First Quartile, Second Quartile, Third Quartile, Maximum (Min, Q1, Q2, Q3, Max)
Now return to Exercise #1(i) in Class Handout #1:
From the stem-and-leaf display in part (g) for the variable “yearly income,” find the five-number summary, the median, the range, and the interquartile range.
1.-continued (i)
[Stem-and-leaf display of yearly income ($1000s) from part (g)]
Min = 25
Q1 = 33
Q2 = 40.5 = median
Q3 = 60
Max = 78
range = Max – Min = 78 – 25 = 53
IQR = Q3 – Q1 = 60 – 33 = 27
outlier an observation whose value is deemed to be either unusually high or unusually low relative to the other observations in the data set; one guideline is to consider any value which is a distance of more than 1.5(IQR) below Q1 or above Q3 a potential outlier
box plot a graphical display for quantitative sample data where two edges of a rectangular box mark the values of Q1 and Q3 , a line dividing the box into two sections marks the value of Q2 , and the ends of two lines extending from the sides of the box represent the minimum and maximum.
Often, the lines are stopped at the largest and smallest values which are not potential outliers, and all potential outliers are designated with dots.
(j) From the five-number summary found in part (i) for the variable “yearly income,” decide whether or not there are any potential outliers, and construct a box plot.
(1.5)IQR = (1.5)27 = 40.5
Since 33 – 25 = 8 < 40.5, the minimum is not a potential outlier. Since 78 – 60 = 18 < 40.5, the maximum is not a potential outlier.
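The 1.5(IQR) guideline can also be checked with a short Python sketch. It uses only the five-number summary from part (i); the names `lower_fence`, `upper_fence`, and `is_potential_outlier` are illustrative and do not come from the handout.

```python
# Five-number summary for "yearly income" ($1000s), taken from part (i).
minimum, q1, q2, q3, maximum = 25, 33, 40.5, 60, 78

iqr = q3 - q1              # interquartile range
step = 1.5 * iqr           # the 1.5(IQR) guideline distance
lower_fence = q1 - step    # values below this are potential outliers
upper_fence = q3 + step    # values above this are potential outliers


def is_potential_outlier(value):
    """Return True when value is more than 1.5(IQR) below Q1 or above Q3."""
    return value < lower_fence or value > upper_fence


print(iqr, step)                        # 27 40.5
print(lower_fence, upper_fence)         # -7.5 100.5
print(is_potential_outlier(minimum))    # False
print(is_potential_outlier(maximum))    # False
```

Because neither the minimum nor the maximum falls outside the fences, no other observation can, so the lines of the box plot extend all the way to 25 and 78.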
[Box plot of Yearly Income ($1000s), drawn on an axis running from 25 to 80]
center value in a distribution refers to a middle or average value for the distribution
Some measures of center for a distribution are as follows:
mean (denoted $\bar{y}$) the arithmetic average of n quantitative observations $y_1, y_2, \ldots, y_n$, calculated by dividing the sum of the observations by the number of observations; that is,
$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum y}{n}$$
median the second quartile Q2 in the five-number summary
mode the most frequently occurring observation(s) if any exist
(With a quantitative variable, the mean and median are often of more interest than the mode; the mode is more useful with a qualitative variable, where computing a mean or median makes no sense as a general rule.)
dispersion in a distribution refers to the amount of variation (spread) in the distribution
Some measures of dispersion for a distribution are as follows:
interquartile range (IQR) the difference between the third and first quartiles, that is, Q3 – Q1
range the difference between the largest and smallest observations, that is, Max – Min
variance (denoted $s^2$) the sum of the squared deviations from the mean (i.e., the sum of $(y_1 - \bar{y})^2, (y_2 - \bar{y})^2, \ldots, (y_n - \bar{y})^2$) divided by one less than the number of observations; that is,
$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$
standard deviation (denoted s) the square root of the variance
1.-continued (k) The yearly incomes ($1000s) for the eight Democrats in the sample are
28 75 26 78 40 60 49 39
Do each of the following for these eight incomes:
List the observations in ascending order.
26 28 39 40 49 60 75 78
Find the five-number summary.
Min = 26, Q1 = 33.5, Q2 = 44.5, Q3 = 67.5, Max = 78
Find the median.
median = Q2 = 44.5
Find the mean.
$$\bar{y} = \frac{26 + 28 + 39 + 40 + 49 + 60 + 75 + 78}{8} = \frac{395}{8} = 49.375$$
Find the variance and standard deviation.
$$s^2 = \frac{(26 - 49.375)^2 + (28 - 49.375)^2 + \cdots + (78 - 49.375)^2}{8 - 1} = \frac{2787.875}{7} = 398.268$$
$$s = \sqrt{398.268} = 19.957$$
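As an alternative to a spreadsheet, the same figures can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the helper `five_number_summary` is a hypothetical name, and it assumes that Q1 and Q3 are taken as the medians of the lower and upper halves of the ordered data, which matches the answers for part (k). Other software may use a different quartile convention and give slightly different quartiles.

```python
import statistics

incomes = [28, 75, 26, 78, 40, 60, 49, 39]   # yearly incomes in $1000s


def five_number_summary(values):
    """Min, Q1, Q2, Q3, Max, with Q1 and Q3 as medians of the two halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                  # observations below the middle
    upper = data[len(data) - half:]      # observations above the middle
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])


summary = five_number_summary(incomes)
mean = statistics.mean(incomes)
variance = statistics.variance(incomes)   # sample variance, divides by n - 1
std_dev = statistics.stdev(incomes)       # square root of the sample variance

print(summary)               # (26, 33.5, 44.5, 67.5, 78)
print(mean)                  # 49.375
print(round(variance, 3))    # 398.268
print(round(std_dev, 3))     # 19.957
```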
These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics,
Find the mode. There is no mode, since no observations are repeated.
If the largest salary of 78 ($78,000) were changed to 175 ($175,000), what would be the values for the mean and median? What does this suggest about how the values of the mean and median are influenced by symmetry and skewness in a distribution?
The mean would be 61.5, but the median would stay equal to 44.5.
This can be done using the Excel spreadsheet named Summary_Statistics,
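The effect of changing the largest income from 78 to 175 can also be checked in a few lines of Python. This is only an illustrative sketch using the standard `statistics` module; the list name `altered` is an assumption made for the example.

```python
import statistics

incomes = [28, 75, 26, 78, 40, 60, 49, 39]            # original data ($1000s)
altered = [175 if y == 78 else y for y in incomes]    # largest income raised to 175

# The mean is pulled upward by the one unusually high value ...
print(statistics.mean(incomes), statistics.mean(altered))       # 49.375 61.5
# ... while the median depends only on the middle of the ordered data.
print(statistics.median(incomes), statistics.median(altered))   # 44.5 44.5
```

The mean moves from 49.375 to 61.5 while the median does not move at all, which is why the median is the less sensitive of the two to skewness.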
shape of a distribution (symmetry and skewness) involves the type and amount of symmetry or non-symmetry present in the distribution
A distribution is called positively skewed when larger values tend to be more dispersed (perhaps resulting in a few unusually high values) and called negatively skewed when smaller values tend to be more dispersed (perhaps resulting in a few unusually small values). One measure of skewness in a distribution is as follows:
$$\text{skewness ratio} = \frac{\bar{y} - \text{median}}{s}$$
(If the skewness ratio for a distribution is less than –0.3, then the distribution can be considered very negatively skewed, and if the skewness ratio for a distribution is greater than +0.3, then the distribution can be considered very positively skewed.)
When a distribution is nearly symmetric, the mean and median will be close in value. If the mean is smaller than the median, the distribution is negatively skewed; if the mean is larger than the median, the distribution is positively skewed.
(l) From the information displayed in part (d) for the variable “political party affiliation,” find the mode.
The mode is the category “Republican,” which is repeated most often.
What does the skewness ratio suggest about the shape of the distribution?
$$\frac{49.375 - 44.5}{19.957} = +0.24$$
suggests a somewhat positively skewed distribution.
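The skewness ratio is simple enough to compute directly. The sketch below applies the handout's formula, (mean – median) / s, to the eight incomes from part (k); the function names `skewness_ratio` and `describe` are illustrative, and the ±0.3 cutoffs are the handout's guideline rather than a universal standard.

```python
import statistics


def skewness_ratio(values):
    """(mean - median) / s, with s the sample standard deviation."""
    return ((statistics.mean(values) - statistics.median(values))
            / statistics.stdev(values))


def describe(ratio):
    """Apply the handout's guideline of -0.3 and +0.3 as cutoffs."""
    if ratio > 0.3:
        return "very positively skewed"
    if ratio < -0.3:
        return "very negatively skewed"
    return "not very skewed"


incomes = [28, 75, 26, 78, 40, 60, 49, 39]
ratio = skewness_ratio(incomes)
print(round(ratio, 2))    # 0.24
print(describe(ratio))    # not very skewed
```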
2. For each of four levels of educational attainment, the distribution of ages for Americans at least 25 years old in 1984 is organized into a frequency polygon (from data collected by the U.S. Bureau of the Census and taken from the World Almanac and Book of Facts, 1986).
[Four frequency polygons, one for each level of educational attainment (Did Not Complete High School; Completed High School; 1 to 3 Years of College; 4 or More Years of College), each plotting Relative Frequency against Age (years)]
(a) Does the distribution of ages appear to be centered at different values for the different levels of education? If yes, for which level of education does the center appear to be smallest, and for which level of education does the center appear to be largest?
The ages for those not completing high school appear to be centered at a considerably higher value than for those in each of the other three level-of-education categories.
(b) Does the distribution of ages appear to have a different amount of dispersion for the different levels of education? If yes, for which level of education does the dispersion appear to be smallest, and for which level of education does the dispersion appear to be largest?
The variation of ages appears to be roughly the same for the four level-of-education categories.
(c) Does the distribution of ages appear to have a different shape for the different levels of education? If yes, how does the shape appear to differ?
None of the distributions appear to be symmetric. The distribution of ages looks negatively skewed for those not completing high school and positively skewed for each of the other three level-of-education categories.
bar chart a graphical display for qualitative data where categories are listed on a horizontal axis and the height of a bar for each category represents a raw or relative frequency as indicated by the labels on a vertical axis.
pie chart a graphical display for qualitative data where relative frequency for each category is represented as a slice of a circle (pie) and the categories listed include all possibilities
histogram a graphical display for quantitative sample data where the possible numerical values are scaled on a horizontal axis and the height of a bar for each of several intervals of values represents a raw or relative frequency as indicated by the labels on a vertical axis.
frequency polygon a graphical display for quantitative sample data where the possible numerical values are scaled on a horizontal axis and dots placed at the middle of the top of where each bar for a histogram would be are connected to produce a rough “curve” - the proportion of observations that fall within a given interval of values is represented by the corresponding area under the “curve”
probability distribution curve a “smooth curve” designed to describe quantitative population data so that the proportion, or probability, of observations falling within a given interval of values is represented by the corresponding area under the “curve”
3. Suppose each of the box plots displayed represents the distribution of sample data selected from some population. For each box plot, make a sketch of what a corresponding histogram of the data could look like, and make a sketch of what the probability distribution curve for the corresponding population could look like.
Uniform Distribution
Bell-Shaped Distribution
Positively Skewed Distribution
Negatively Skewed Distribution
Now look at Class Handout #2.
Class Handout #2 Definitions
parameter a numerical quantity which describes some characteristic of a population
statistic a numerical quantity which describes some characteristic of a sample
The symbol $\bar{x}$ is used to represent the mean for a sample. The symbol $\mu$ is used to represent the mean for a population. The symbol s is used to represent the standard deviation for a sample. The symbol $\sigma$ is used to represent the standard deviation for a population. We see then that $\mu$ and $\sigma$ are parameters, and that $\bar{x}$ and s are statistics.
1. For each situation, identify the experimental unit, the population, the sample, the parameter, and the statistic.
(a) A state government official computes the mean yearly amount of spending per pupil for 75 selected public school districts in the state in order to draw a conclusion about the mean yearly amount of spending per pupil for all public school districts in the state.
The experimental unit is each public school district in the state.
The population is all public school districts in the state.
The sample is the 75 selected public school districts.
The parameter is the mean yearly amount of spending per pupil for all public school districts in the state.
The statistic is the mean yearly amount of spending per pupil for the 75 selected public school districts in the state.
(b) A pollster surveys 427 selected voters in a state by phone and calculates the proportion intending to vote for the incumbent governor in order to draw a conclusion about the proportion of all voters in the state intending to vote for the incumbent governor.
The experimental unit is each voter in the state.
The population is all voters in the state.
The sample is the 427 selected voters.
The parameter is the proportion of all voters in the state intending to vote for the incumbent governor.
The statistic is the proportion of the 427 surveyed voters intending to vote for the incumbent governor.
Tchebysheff’s Theorem a statement of the following facts (requiring calculus to prove):
For any data set (population or sample), at least 75% (three fourths) of the measurements must lie within two standard deviations of the mean; that is, at least 75% (3/4) of a data set must lie between the values
mean – 2(standard deviation) and mean + 2(standard deviation).
For any data set (population or sample), at least 89% of the measurements must lie within three standard deviations of the mean; that is, at least 89% of a data set must lie between the values
mean – 3(standard deviation) and mean + 3(standard deviation).
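Both percentages are instances of one general bound: for any k > 1, at least 1 – 1/k² of the measurements lie within k standard deviations of the mean. The handout states only the cases k = 2 and k = 3; the general form is the standard statement of the theorem, and the function name below is illustrative.

```python
def tchebysheff_lower_bound(k):
    """Smallest possible proportion within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


print(tchebysheff_lower_bound(2))             # 0.75
print(round(tchebysheff_lower_bound(3), 3))   # 0.889
```

For k = 3 the exact bound is 8/9, which the handout rounds to 89%.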
2. For each data set (all of which are displayed in a stem-and-leaf display),
(i) verify that the given mean, the given standard deviation, and the given five-number summary are correct;
(ii) find the proportion of measurements which lie within two standard deviations of the mean, and compare this proportion with what Tchebysheff’s Theorem states.
(a) [Stem-and-leaf display of 20 measurements, with stems 10 through 29]
mean = 200
standard deviation = 41.94
Min = 109
Q1 = 188
Q2 = 200
Q3 = 212
Max = 291
These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics,
The interval within two standard deviations of the mean is from 200 – 2(41.94) to 200 + 2(41.94), that is, from 116.12 to 283.88.
The proportion of measurements which lie within two standard deviations of the mean is 16/20 = 80%.
Tchebysheff’s Theorem states that this proportion will be at least 75%.
(b) [Stem-and-leaf display of 20 measurements, with stems 10 through 29]
mean = 200
standard deviation = 33.20
Min = 129
Q1 = 188
Q2 = 200
Q3 = 212
Max = 271
These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics,
The interval within two standard deviations of the mean is from 200 – 2(33.20) to 200 + 2(33.20), that is, from 133.60 to 266.40.
The proportion of measurements which lie within two standard deviations of the mean is 18/20 = 90%.
Tchebysheff’s Theorem states that this proportion will be at least 75%.
Note that in both data sets, the two largest measurements and the two smallest measurements are all potential outliers.
For most bell-shaped (or mound-shaped) distributions with no outliers, about 95% of the measurements will lie within two standard deviations of the mean.
3. A random sample of 30 days of hotel reservations is selected to obtain information about the distribution of the daily number of “no shows”. Each value recorded below is the number of room reservation bookings where the party did not show up and did not cancel the reservation.
18 16 16 16 14 18 16 18 14 19
15 19 9 20 10 10 12 14 18 12
14 14 17 12 18 13 15 13 15 19
(a) Identify the experimental unit, the variable of interest, the population, and the sample.
The experimental unit is each day.
The variable of interest is the number of “no shows” among hotel bookings for a day.
The population is all days when hotel reservations are made.
The sample is the 30 selected days.
(b) Are the mean and standard deviation for this data parameters or statistics?
The mean and standard deviation calculated from this random sample are statistics. (Parameters are numerical quantities which describe characteristics of a population.)
(c) Use the Excel files Summary_Statistics and M214_Data to find the mean and standard deviation for this data, and comment on the type of distribution the data appears to have.
mean = $\bar{x}$ = 15.133
standard deviation = s = 2.945
The data appears to have a somewhat bell-shaped (mound-shaped) distribution, and there are no outliers.
(d) We shall use the mean and standard deviation from part (c) as estimates of the mean and standard deviation for the population. Explain why it is reasonable to assume that about 95% of all days will have a number of “no shows” within two standard deviations of the mean, and then find this interval.
Since the data appears to have a somewhat bell-shaped (mound-shaped) distribution, and there are no outliers, we expect about 95% of all measurements to be within two standard deviations of the mean.
The interval within two standard deviations of the mean is from 15.133 – 2(2.945) to 15.133 + 2(2.945), that is, from 9.243 to 21.023.
(e) From the interval found in part (d), how many rooms per day can the hotel overbook and still feel confident that all reservations will be honored?
Since it appears that the number of “no shows” each day will almost always be at least 9 or 10, a hotel might feel confident with overbooking 9 or 10 rooms per day.
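For readers without the Excel files, the calculations in parts (c) and (d) can be reproduced with a short Python sketch using the standard `statistics` module. The variable names are illustrative. Because the code carries unrounded values through the calculation, the interval endpoints can differ from the handout's 9.243 and 21.023 in the third decimal place.

```python
import statistics

# Daily numbers of "no shows" for the 30 sampled days.
no_shows = [
    18, 16, 16, 16, 14, 18, 16, 18, 14, 19,
    15, 19,  9, 20, 10, 10, 12, 14, 18, 12,
    14, 14, 17, 12, 18, 13, 15, 13, 15, 19,
]

x_bar = statistics.mean(no_shows)    # sample mean
s = statistics.stdev(no_shows)       # sample standard deviation (n - 1)

# Interval within two standard deviations of the mean.
low, high = x_bar - 2 * s, x_bar + 2 * s

# Count the sampled days whose value falls inside that interval.
within = sum(low <= x <= high for x in no_shows)

print(round(x_bar, 3), round(s, 3))    # 15.133 2.945
print(within, len(no_shows))           # 29 30
```

In this sample 29 of the 30 days fall inside the interval (only the day with 9 "no shows" falls just below it), which is in line with the roughly 95% expected for a bell-shaped distribution.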
Before submitting Homework #1, check some of the answers (if you haven’t done so already) from the link on the course schedule:
http://srv2.lycoming.edu/~sprgene/M214/Schedule214.htm