describing data sets graphical representation of data summary

Ch2 Online homework list: Describing Data Sets; Graphical Representation of Data; Summary Statistics: Measures of Center; Box Plots, Outliers, and Standard Deviation

Post on 12-Sep-2021



2.1-2.3: Organizing Data

DEFINITIONS:

Qualitative Data are those which classify the units into categories. The

categories may or may not have a natural ordering to them. Qualitative variables

are also called categorical variables.

Quantitative variables have numerical values that are measurements (length,

weight, and so on) or counts (of how many). Arithmetic operations on such

numerical values do have meaning. We further distinguish quantitative variables

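based on whether or not the values fall on a continuum: a discrete variable takes separated, countable values, while a continuous variable can take any value within an interval.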

Let's Do It! What Type of Variable?

Hurricane Charley, in August 2004, has been blamed for at least 16 deaths. Listed below is information on other major storms and hurricanes that occurred from 1994 to 2003.

Storm Name               Date     Category   Estimated Damage/Cost*   Deaths
Tropical Storm Alberto   Jul-94   n/a        $1.2 billion             32
Hurricane Marilyn        Sep-95   2          $2.5 billion             13
Hurricane Opal           Oct-95   3          $3.6 billion             27
Hurricane Fran           Sep-96   3          $5.8 billion             37
Hurricane Bonnie         Aug-98   3          $1.1 billion             3
Hurricane Georges        Sep-98   2          $6.5 billion             16
Hurricane Floyd          Sep-99   2          $6.5 billion             77
Tropical Storm Allison   Jun-01   n/a        $5.1 billion             43
Hurricane Isabel         Sep-03   2          $4.0 billion             47

For each variable, determine whether it is qualitative or quantitative. If the variable is quantitative, state whether it is discrete or continuous.

(a) The name of the storm.
(b) The date the storm occurred.
(c) The category of the storm.
(d) The estimated amount of damage or cost of the storm.
(e) The number of deaths that occurred.


DEFINITIONS:

The distribution of a variable provides the possible values that a variable can take on and

how often these possible values occur. The distribution of a variable shows the pattern of

variation of the variable.

Let's Do It!

College Admissions

The following pie chart shows the breakdown of undergraduate enrollment by race at the

University of Michigan for the fall term of 1996. The total number of undergraduates

enrolled for that term was 22,604.

(a) What percentage of undergraduates enrolled were of nonwhite race?

(b) How many undergraduates enrolled had no racial category listed?


Let's Do It!

Allied Van Lines surveyed 1000 respondents in May 2004. The question asked was, “Would

you move if your mate had to relocate overseas because of work?” The results are

summarized in the following pie chart.

[Pie chart: three slices labeled 2%, 30%, and 68%.]

(a) What percentage of the respondents said that they would actually move if their mate

relocated overseas?

(b) What questions would you ask about the sample selection using this information to draw

formal conclusions?

Let's Do It!

Nothing Really Matters

The bar graph shown here displays the percentage of respondents who think a particular

problem is the most important problem facing America for two different years. (SOURCE:

The Economist, March 30-April 5, 1996, pg 33.)

(a) In January 1992, which problem category had the

highest percentage of responses?

Was this the same category which had the highest

percentage of responses in 1996?

(b) In January 1992, what percentage of respondents

reported crime as the most important problem

facing America?

In January 1996, what percentage of respondents reported crime as the most important

problem facing America?

(c) What is the approximate sum of the percentage of respondents across all of the listed

problem categories for January 1992? Is this sum approximately 100%? If not, give a

possible reason why not.

p. 222-223

[Legend for the relocation-survey pie chart above: No, Yes, Not Sure.]


Example: A Misleading Bar Graph

Problem

The bar graph that follows presents the total sales figures for three realtors. When the bars are

replaced with pictures, often related to the topic of the graph, the graph is called a pictogram.

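[Pictogram: total sales drawn as houses of different sizes for Realtor #1, Realtor #2, and Realtor #3. Values shown: $2.05 million, $1.41 million, and $0.9 million. Vertical axis: Total Sales; horizontal axis: Realtor (No. 1, No. 2, No. 3).]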

(a) How does the height of the home for Realtor 1 compare to that for Realtor 3?

(b) How does the area of the home for Realtor 1 compare to that for Realtor 3?

What We’ve Learned: When you see a pictogram, be careful to interpret the results

appropriately, and do not allow the area of the pictures to mislead you.


Let's Do It! Categorical Frequency Distributions

A survey was taken on how much trust people place in the information they read on the

Internet.

A = trust in everything they read,

M = trust in most of what they read,

H = trust in about half of what they read,

S = trust in a small portion of what they read.

Construct a categorical frequency distribution for the data.

M M M A H M S M H M

S M M M M A M M A M

M M H M M M H M H M

A M M M H M M M M M

A frequency distribution is the organization of raw data in table form, using classes and

frequencies.

Each raw data value is placed into a quantitative or qualitative category called a class.

The frequency of a class then is the number of data values contained in a specific class.

Two types of frequency distributions that are most often used are the categorical frequency

distribution and the grouped frequency distribution.

Categorical Frequency Distributions: The categorical frequency distribution is used for

data that can be placed in specific categories, such as nominal- or ordinal-level data.

Grouped Frequency Distributions: When the range of the data is large, the data must be

grouped into classes that are more than one unit in width.
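As a minimal sketch of a categorical frequency distribution, the survey codes listed above can be tallied with Python's standard library. The variable names are illustrative and the percent column is an added convenience, not part of the exercise.

```python
from collections import Counter

# Survey responses copied from the exercise (codes A, M, H, S).
responses = """
M M M A H M S M H M
S M M M M A M M A M
M M H M M M H M H M
A M M M H M M M M M
""".split()

counts = Counter(responses)   # class -> frequency
n = len(responses)            # total number of responses

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for label in ["A", "M", "H", "S"]:
    percent = 100 * counts[label] / n
    print(f"{label:<6}{counts[label]:>10}{percent:>8.1f}%")
```

The frequencies should add up to the number of responses, which is a quick check that no value was missed in the tally.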


Histograms and Pie Charts

Let's Do It!

Distribution of scores

For 108 randomly selected college applicants, the following frequency distribution for

entrance exam scores was obtained.

a. Construct a relative frequency

histogram for the data.

b. Applicants who score above 107 need not enroll in a summer developmental program. In

this group, how many students do not have to enroll in the developmental program?

The Histogram: a graph that displays quantitative data by using contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the classes.

The Pie Graph: a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution. The angle (in degrees) of each wedge is given by:

Angle = relative frequency × 360°.

Class limits   Frequency
90 – 98        6
99 – 107       22
108 – 116      43
117 – 125      28
126 – 134      9
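A relative frequency histogram plots each class frequency divided by the total count. The sketch below only computes those heights for the entrance exam table; drawing the bars is left out, and the variable names are illustrative.

```python
# Class limits and frequencies from the entrance exam table.
classes = ["90-98", "99-107", "108-116", "117-125", "126-134"]
freqs = [6, 22, 43, 28, 9]

n = sum(freqs)                       # total number of applicants
rel_freqs = [f / n for f in freqs]   # bar heights for the histogram

for label, f, r in zip(classes, freqs, rel_freqs):
    print(f"{label:>8}  {f:>3}  {r:.3f}")
```

The relative frequencies should sum to 1, up to rounding.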


Pie Graph

Let’s Explore It! The assets of the richest 1% of Americans are distributed as follows. Make a pie chart for the

percentages.

Asset                       Percent   Angle
Principal residence         7.8%      28.08°
Liquid assets               5.0%      18.0°
Pension accounts            6.9%      24.84°
Stock, funds, and trusts    31.6%     113.76°
Businesses & real estate    46.9%     168.84°
Miscellaneous               1.8%      6.48°
Total                       100%      360°
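The angle column follows from Angle = relative frequency × 360. A short sketch that recomputes it from the percentages, with an illustrative dictionary name:

```python
# Percentages from the table of assets of the richest 1% of Americans.
assets = {
    "Principal residence": 7.8,
    "Liquid assets": 5.0,
    "Pension accounts": 6.9,
    "Stock, funds, and trusts": 31.6,
    "Businesses & real estate": 46.9,
    "Miscellaneous": 1.8,
}

# Angle of each wedge in degrees: relative frequency times 360.
angles = {name: pct / 100 * 360 for name, pct in assets.items()}

for name, angle in angles.items():
    print(f"{name:<26}{angle:>8.2f} degrees")
print(f"{'Total':<26}{sum(angles.values()):>8.2f} degrees")
```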

Let’s do It! The population of federal prisons, according to the most serious offenses, consists of the

following. Make a Pie chart of the population.

Offense                  Percent
Violent offenses         12.6%
Property offenses        8.5%
Drug offenses            60.2%
Public order offenses:
  Weapons                8.2%
  Immigration            4.9%
  Other                  5.6%

[Figure: pie chart of the assets of the richest 1% of Americans. Slices: Businesses & Real Estate 46.9%; Stocks, Funds, and Trusts 31.6%; Principal Residence 7.8%; Pension Accounts 6.9%; Liquid Assets 5.0%; Misc. 1.8%.]


DATA SET 1

2.4 Measures of Central Tendency

Suppose you had to give a single number that would represent the most typical age for the 20 subjects. What number would you choose?

Measures of center are numerical values that tend to report in some sense the middle of a set of data. We will focus on the mean and the median. If the data are a sample, the mean and median would be called statistics. If the data form an entire population, then these measures of center would be called parameters.

Mean

DEFINITION: The mean of a set of n observations is simply the sum of the observations divided by

the number of observations, n.

Mean age of the 20 subjects in the medical study -- add the 20 ages up and divide by 20:

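$$\bar{x} = \frac{45 + 41 + 51 + 46 + 47 + \cdots + 45 + 37}{20} = \frac{867}{20} = 43.35 \text{ years}$$

Special notation:

If $x_1, x_2, \ldots, x_n$ denote a sample of n observations, then the mean of the sample is called "x-bar" and is denoted by:

$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$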

The mean of a population is denoted by the Greek letter μ.

Subject #   Gender   Age
1001        M        45
1002        M        41
1003        F        51
1004        F        46
1005        F        47
1006        F        42
1007        M        43
1008        F        50
1009        M        39
1010        M        32
1011        M        41
1012        F        44
1013        F        47
1014        F        49
1015        M        45
1016        F        42
1017        M        41
1018        F        40
1019        M        45
1020        M        37
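The definition of the mean translates directly into code. A minimal sketch using the 20 ages from Data Set 1 (the list name is illustrative):

```python
# Ages of subjects 1001 to 1020, in the order listed in Data Set 1.
ages = [45, 41, 51, 46, 47, 42, 43, 50, 39, 32,
        41, 44, 47, 49, 45, 42, 41, 40, 45, 37]

n = len(ages)           # number of observations
x_bar = sum(ages) / n   # sum of the observations divided by n

print(n, sum(ages), x_bar)  # 20 867 43.35
```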


Let’s Do it! Mean Number of Children per Household

Suppose that the number of children in a simple random sample of 10 households is as follows:

2, 3, 0, 2, 1, 0, 3, 0, 1, 4

(a) Calculate the sample mean number of children per household.

(b) Suppose that the observation for the last household in the above list was incorrectly recorded as 40 instead of 4. What would happen to the mean?

Note that 9 of the 10 observations are less than the mean. The mean is sensitive to extreme observations. Most graphical displays would have detected this outlying observation.
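The sensitivity of the mean can be checked numerically. This sketch recomputes the mean with the last household mistyped as 40, as described in part (b); the helper name `mean` is illustrative.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

children = [2, 3, 0, 2, 1, 0, 3, 0, 1, 4]
mistyped = children[:-1] + [40]   # last household recorded as 40 instead of 4

print(mean(children))   # 1.6
print(mean(mistyped))   # 5.2

# Count how many observations fall below the inflated mean.
below = sum(1 for x in mistyped if x < mean(mistyped))
print(below)            # 9
```

A single extreme value moves the mean above nine of the ten observations, which is why the note calls the mean sensitive.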

Let's Do it! A Mean Is Not Always Representative

Kim's test scores are 7, 98, 25, 19, and 26. Calculate Kim's mean test score. Explain why the mean does not do a very good job at summarizing Kim's test scores.

The mean = the point of equilibrium, the point where the distribution would balance. If the distribution is symmetric, as in the first picture at the left, the mean would be exactly at the center of the distribution. As the largest observation is moved further to the right, making this observation somewhat extreme, the mean shifts towards the extreme observation.

If a distribution appears to be skewed, we may wish also to report a more resistant measure of center.

[Figure: three balance diagrams with the mean marked at 2, 2.5, and 4; the axis labels are 1 2 3, 1 2 5, and 1 2 11 respectively.]


The Mean of Grouped Data / Frequency Tables

The procedure for finding the mean for grouped data uses the midpoints of the classes. This procedure is shown next.

Example

The data represent the number of miles run during one week for a sample of 20 runners.

Solution

The procedure for finding the mean for grouped data is given here.

Step 1 Make a table as shown.

Step 2 Find the midpoints of each class and enter them in column C.

Step 3 For each class, multiply the frequency by the midpoint, as shown, and place the product in column D: 1 · 8 = 8, 2 · 13 = 26, etc. The completed table is shown here.

Step 4 Find the sum of column D.

Step 5 Divide the sum by n to get the mean.


Let's Do It! Eighty randomly selected light bulbs were tested to determine their lifetime in hours. The frequency table of the results is shown below. Find the average lifetime of a light bulb (use the procedure above).

Life interval in hours   Frequency
53-63                    6
64-74                    12
75-85                    25
86-96                    18
97-107                   14
108-118                  5
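The grouped-mean procedure (midpoint, frequency times midpoint, sum, divide by n) can be sketched as a small function. The function name and the tuple layout are assumptions made for illustration, and the midpoint is taken as the average of the two class limits.

```python
def grouped_mean(classes):
    """Mean of grouped data.

    classes: list of (lower_limit, upper_limit, frequency) tuples.
    """
    n = sum(f for _, _, f in classes)                         # total frequency
    total = sum(f * (lo + hi) / 2 for lo, hi, f in classes)   # sum of f * midpoint
    return total / n

# Light bulb lifetimes (hours) from the table above.
bulbs = [(53, 63, 6), (64, 74, 12), (75, 85, 25),
         (86, 96, 18), (97, 107, 14), (108, 118, 5)]

print(round(grouped_mean(bulbs), 2))
```

Because every value in a class is replaced by the class midpoint, the result approximates the mean of the raw lifetimes rather than reproducing it exactly.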

Let's Do It!

The cost per load (in cents) of 35 laundry detergents tested by a consumer organization is given below. (Use Calculator: 1-Var Stats.)

Class limit   Frequency
13-19         2
20-26         7
27-33         12
34-40         5
41-47         6
48-54         1
55-61         0
62-68         2


A measure of center that is more resistant to extreme values is the median.

Median

DEFINITION: The median of a set of n observations, ordered from smallest to largest, is a value such that half of the observations are less than or equal to that value and half the observations are greater than or equal to that value.

If the number of observations is odd, the median is the middle observation. If the number of observations is even, the median is any number between the two middle observations, including either of the two middle observations. To be consistent, we will define the median as the mean or average of the two middle observations.

Location of the median: (n+1)/2, where n is the number of observations.

The ages of the n = 20 subjects, in order:

32 37 39 40 41 41 41 42 42 43 44 45 45 45 46 47 47 49 50 51

Calculating (n+1)/2 we get (20+1)/2 = 10.5. So the two middle observations are the 10th and 11th observations, namely 43 and 44. The median is the mean of these two middle observations, (43+44)/2 = 43.5 years.

Let's Do It! Median Number of Children per Household

Find the median number of children in a household from this sample of 10 households, that is, find the median of

Number of Children: 2 3 0 1 4 0 3 0 1 2

(a) Median = ______________

(b) What happens to the median if the fifth observation in the first list was incorrectly recorded as 40 instead of 4?

(c) What happens to the median if the third observation in the first list was incorrectly recorded as -20 instead of 0?

Note: The median is resistant; that is, it does not change, or changes very little, in response to extreme observations.
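The location rule (n+1)/2 and the resistance of the median can both be illustrated with a short sketch. The function below is one straightforward way to apply the rule, not the only one; Python's `statistics.median` gives the same results.

```python
def median(values):
    """Median by the (n+1)/2 location rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the two middle ones

ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]
print(median(ages))           # 43.5

children = [2, 3, 0, 1, 4, 0, 3, 0, 1, 2]
fifth_as_40 = children[:4] + [40] + children[5:]     # 4 recorded as 40
third_as_neg20 = children[:2] + [-20] + children[3:]  # 0 recorded as -20

print(median(children), median(fifth_as_40), median(third_as_neg20))
```

All three versions of the children data give the same median, while the mean would change in each case.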


Another Measure—The Mode

DEFINITION: The mode of a set of observations is the most frequently occurring value; it is the

value having the highest frequency among the observations.

The mode of the values { 0, 0, 0, 0, 1, 1, 2, 2, 3, 4 } is 0.

For { 0, 0, 0, 1, 1, 2, 2, 2, 3, 4 } there are two modes, 0 and 2 (bimodal).

What would be the mode for { 0, 1, 2, 4, 5, 8 }?

The mode is not often used as a measure of center for quantitative data. The mode can be computed for qualitative data. The modal race category is "white."
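Python's `statistics.multimode` (available from Python 3.8) returns every value tied for the highest frequency, which makes it a convenient way to explore the three sets above. Treating a set in which every value occurs once as having no useful mode is a common convention and is stated here as an assumption, not something the function decides.

```python
from statistics import multimode

one_mode = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4]
two_modes = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4]
all_distinct = [0, 1, 2, 4, 5, 8]

print(multimode(one_mode))      # [0]
print(multimode(two_modes))     # [0, 2]

# Every value occurs exactly once, so all of them tie for "most frequent".
print(multimode(all_distinct))  # [0, 1, 2, 4, 5, 8]
```

The same function works on category labels, which is why the mode can be reported for qualitative data.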

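[Figure: bar chart of Percent (vertical axis, 0 to 80) by Race (horizontal axis): American Indian, No Category, Hispanic, African-American, Asian, White.]

Which Measure of Center to Use?

[Figure: four distribution shapes with the measures of center marked and 50% areas indicated.
Bell-shaped, symmetric: mean = median = mode.
Bimodal: mean = median, two modes.
Skewed right: mode, median, and mean marked at different positions.
Skewed left: mode, median, and mean marked at different positions.]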


Mean, Median, and Mode

The most common measure of center is the mean, which locates the balancing point of the distribution. The mean equals the sum of the observations, divided by how many there are. The mean is also affected by extreme observations (outliers and values which are far in the tail of a distribution that is skewed). So the mean tends to be a good choice for locating the center of a distribution that is unimodal and roughly symmetric, with no outliers.

The median is a more robust measure of center, that is, it is not influenced by extreme values. The median is the middle observation when the data are ordered from smallest to largest. If you have an odd number of values, the median is the one in the middle. If you have an even number of values, the median is the mean of the two middle values, and falls exactly halfway between them. If you have n observations, then (n+1)/2 tells you the location or position of the median. For skewed distributions or distributions with outliers, the median tends to be the better choice for locating the center.

The mode is the value(s) that occurs most often. For a distribution, the mode is the value associated with the highest peak. The most frequent value can be far from the center of the distribution, so the mode is not really a measure of center. However, the mode is the only measure of the three that can be used for qualitative data.

Tips:

When you see or hear an "average" reported, ask which average was really computed: the mean or the median.

Think about or examine the distribution of values to assess if the measure of center used is appropriate.


2.5-2.5 MEASURING VARIATION OR SPREAD

Both sets of data have the same mean, median and mode but the values obviously

differ in another respect -- the variation or spread of the values.

The values in List 1 are much more tightly clustered around the center value of 60.

The values in List 2 are much more dispersed or spread out.

List 1: 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65

mean = median = mode = 60

[Dot plot of List 1 on a scale from 35 to 85: the points are clustered between 55 and 65.]

List 2: 35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85

mean = median = mode = 60

[Dot plot of List 2 on a scale from 35 to 85: the points are spread evenly from 35 to 85.]

Range

The range is the simplest measure of variability or spread. Range is just the difference between the largest value and the smallest value.
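As a quick illustration, the range separates List 1 from List 2 even though their means agree. The helper name `value_range` is illustrative.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

list1 = [55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65]
list2 = [35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85]

# Both lists have the same mean...
print(sum(list1) / len(list1), sum(list2) / len(list2))  # 60.0 60.0

# ...but very different ranges.
print(value_range(list1), value_range(list2))            # 10 50
```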

Inter-quartile Range

The inter-quartile range measures the spread of the middle 50% of the data. You first find the median (represented by Q2, the value that divides the data into two halves), and then find the median for each half. The three values that divide the data into four parts are called the quartiles, represented by Q1, Q2, and Q3. The difference between the third quartile and the first quartile is called the inter-quartile range, denoted by IQR = Q3 - Q1.


Finding the Quartiles

1. Find the median of all of the observations.
2. First Quartile = Q1 = median of observations that fall below the median.
3. Third Quartile = Q3 = median of observations that fall above the median.

Notes

When the number of observations is odd, the middle observation is the median. This observation is not included in either of the two halves when computing Q1 and Q3.

Although different books, calculators, and computers may use slightly different ways to compute the quartiles, they are all based on the same idea.

In a left-skewed distribution, the first quartile will be farther from the median than the third quartile is. If the distribution is symmetric, the quartiles should be the same distance from the median.

Example Quartiles for Age

The ages of the 20 subjects in the medical study are listed below in order.

32, 37, 39, 40, 41, 41, 41, 42, 42, 43,

44, 45, 45, 45, 46, 47, 47, 49, 50, 51

The histogram of the ages is also provided.

(a) Calculate the median age.
(b) Calculate the first quartile Q1 for this age data.
(c) Calculate the third quartile Q3 for this age data.
(d) Calculate the range for this age data.

32 37 39 40 41 41 41 42 42 43 | 44 45 45 45 46 47 47 49 50 51

Median = 43.5, Q1 = 41, Q3 = 46.5

We see that the distribution of age is approximately symmetric and that the quartiles are about the same distance from the median. The quartiles are actually the 25th, 50th, and 75th percentiles.
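The median-of-halves rule described under "Finding the Quartiles" can be sketched as follows. Library routines such as `statistics.quantiles` or NumPy percentiles use other conventions and may give slightly different quartiles, as the notes point out; this sketch follows the rule in these notes, and the function names are illustrative.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of each half.

    When n is odd, the middle observation is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)              # 41.0 43.5 46.5
print(q3 - q1)                 # IQR = 5.5
print(max(ages) - min(ages))   # range = 19
```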


Five-Number Summary

Five-number summary: Minimum, Q1, Median, Q3, Maximum

[Figure: boxplot with Min, Q1, Q2 = Median, Q3, and Max labeled.]

Boxplot:

To Build a Basic Boxplot

1. List the data values in order from smallest to largest.
2. Find the five-number summary: minimum, Q1, median, Q3, and maximum.
3. Locate the values for Q1, the median, and Q3 on the scale. These values determine the "box" part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.
4. Draw lines (called whiskers) from the midpoints of the ends of the box out to the minimum and maximum.

Using the 1.5 x IQR Rule to Identify Outliers and Build a Modified Boxplot

1. List the data values in order from smallest to largest.
2. Find the five-number summary: minimum, Q1, median, Q3, and maximum.
3. Locate the values for Q1, the median, and Q3 on the scale. These values determine the "box" part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.
4. Find the IQR = Q3 - Q1.
5. Compute the quantity STEP = 1.5 x (IQR).
6. Find the location of the inner fences by taking 1 step out from each of the quartiles:
   lower inner fence = Q1 - STEP;
   upper inner fence = Q3 + STEP.
7. Draw the lines (whiskers) from the midpoints of the ends of the box out to the smallest and largest values WITHIN the inner fences.
8. Observations that fall OUTSIDE the inner fences are considered potential outliers. If there are any outliers, plot them individually along the scale using a solid dot.


Example Any Age Outlier?

Let’s apply the "rule of thumb" to our age data set to assess if there are any outliers.

32, 37, 39, 40, 41, 41, 41, 42, 42, 43, 44, 45, 45, 45, 46, 47, 47, 49, 50, 51

(a) Construct the fences for the modified boxplot based on the 1.5 * IQR rule.
(b) Are there any outliers using the 1.5 * IQR rule?
(c) Construct the modified boxplot.

The five-number summary for the age data is given by: min = 32, Q1 = 41, median = 43.5, Q3 = 46.5, and max = 51.

The distance between the median and the quartiles is roughly the same, supporting the rough symmetry of the distribution as seen previously from the histogram.
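The fence calculation for the age data can be sketched directly from the five-number summary given above (Q1 = 41, Q3 = 46.5). Variable names are illustrative, and the quartiles are taken from the text, not recomputed.

```python
ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

q1, q3 = 41, 46.5          # quartiles from the five-number summary in the text
iqr = q3 - q1              # 5.5
step = 1.5 * iqr           # 8.25

lower_fence = q1 - step    # 32.75
upper_fence = q3 + step    # 54.75

# Potential outliers fall outside the inner fences.
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme values still inside the fences.
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print(lower_fence, upper_fence)  # 32.75 54.75
print(outliers)                  # [32]
print(whiskers)                  # (37, 51)
```

Under this rule the youngest subject (age 32) falls just below the lower fence, so it would be plotted individually as a potential outlier in the modified boxplot.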


Let's Do It! Five-Number Summary and Outliers


Let’s Do It! Cost of Running Shoes

The prices for 12 comparable pairs of running shoes produced the following modified boxplot.

[Figure: modified boxplot of PRICE on a scale from 40 to 120, with one outlier marked by an asterisk.]

(a) What was the approximate range of prices for such running shoes? Range = ______________

(b) Twenty-five percent of the shoes cost more than approximately what amount? $ _____________

Side-by-side boxplots are helpful for comparing two or more distributions with respect to the five-number summary.

Although the median of the first process is closer to the target value of 20.000 cm, the second process produces a less variable distribution.


Let's Do It!

Comparing Ages—Antibiotic Study

Variable = age for 23 children randomly assigned to one of two treatment groups.

(a) Give the five-number summary for each of the two treatment groups. Comment on your results.

Amoxicillin Group (n=11): 8 9 9 10 10 11 11 12 14 14 17

Five-number summary:
Lower fence = ______ , Upper fence = ______ , outliers: ______

Cefadroxil Group (n=12): 7 8 9 9 9 10 10 11 12 13 14 18

Five-number summary:
Lower fence = ______ , Upper fence = ______ , outliers: ______

(b) Make side-by-side modified boxplots for the antibiotic study data in part (a).

p. 321


Standard Deviation

... a measure of the spread of the observations from the mean.

... think of the standard deviation as an "average (or standard) distance of the observations from the mean."

Example Standard Deviation—What Is It?

Observations: 0, 5, 7
Deviations: -4, 1, 3
Squared Deviations: 16, 1, 9

Observation x   Deviation x - x̄   Squared Deviation (x - x̄)²
0               0 - 4 = -4         16
5               5 - 4 = 1          1
7               7 - 4 = 3          9
mean = 4        sum always = 0     sum = 26

$$\text{sample variance} = s^2 = \frac{(-4)^2 + 1^2 + 3^2}{3 - 1} = \frac{16 + 1 + 9}{2} = \frac{26}{2} = 13$$

$$\text{sample standard deviation} = s = \sqrt{13} \approx 3.6$$
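The small example above can be reproduced step by step. This is a sketch of the definitional calculation (deviations, squared deviations, divide by n - 1), checked against Python's `statistics` module, which also divides by n - 1 for sample statistics.

```python
import math
import statistics

data = [0, 5, 7]
n = len(data)
x_bar = sum(data) / n                       # mean = 4.0

deviations = [x - x_bar for x in data]      # [-4.0, 1.0, 3.0]
squared = [d ** 2 for d in deviations]      # [16.0, 1.0, 9.0]

variance = sum(squared) / (n - 1)           # 26 / 2 = 13.0
std_dev = math.sqrt(variance)               # about 3.6

print(sum(deviations))                      # 0.0: deviations always sum to zero
print(variance, round(std_dev, 1))          # 13.0 3.6
print(statistics.variance(data) == variance)
```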

Interpretation of the Standard Deviation

Think of the standard deviation as roughly an average distance of the observations

from their mean. If all of the observations are the same, then the standard deviation

will be 0 (i.e. no spread). Otherwise the standard deviation is positive and the more

spread out the observations are about their mean, the larger the value of the

standard deviation.


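If $x_1, x_2, \ldots, x_n$ denote a sample of n observations, the sample variance is denoted by:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)} = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n-1}$$

Sample standard deviation, denoted by s, is the square root of the variance:

$$s = \sqrt{s^2}$$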

Shortcut formulas for computing the variance and standard deviation

The shortcut formulas are mathematically equivalent to the definition of the sample variance and do not involve using the mean. They save time when repeated subtracting and squaring occur in the original formulas. They are also more accurate when the mean has been rounded.

$$s = \sqrt{\frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n-1}}$$

Remarks:

The variance is measured in squared units. By taking the square root of the variance we bring this measure of spread back into the original units.

Just as the mean is not a resistant measure of center, the standard deviation, which uses the mean in its definition, is not a resistant measure of spread. It is heavily influenced by extreme values.

There are statistical arguments that support why we divide by n - 1 instead of n in the denominator of the sample standard deviation.


Example Consistency of Weight Loss Program

In a recent study of the effect of a certain diet on weight reduction, 11 subjects were put on the diet for two weeks and their weight loss/gain in lbs was measured (positive values indicate weight loss).

1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5, -23.

What is the standard deviation of the weight loss?

Solution

$$\sum x = 1 + 1 + \cdots + 2.5 + (-23) = -4.5, \qquad \sum x^2 = 1^2 + 1^2 + \cdots + 2.5^2 + (-23)^2 = 569.25$$

The standard deviation of this sample is

$$s = \sqrt{\frac{569.25 - (-4.5)^2 / 11}{10}} \approx 7.5327$$
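The shortcut calculation for the weight-loss data can be checked with a few lines. The sketch below uses the shortcut form with the sum and the sum of squares, then compares it with `statistics.stdev`, which applies the definitional formula.

```python
import math
import statistics

# Weight changes in lbs (positive values indicate weight loss).
changes = [1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5, -23]

n = len(changes)                          # 11 subjects
sum_x = sum(changes)                      # -4.5
sum_x2 = sum(x * x for x in changes)      # 569.25

# Shortcut formula: no mean needed.
s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(sum_x, sum_x2)                      # -4.5 569.25
print(round(s, 4))                        # 7.5327
print(round(statistics.stdev(changes), 4))
```

Dropping the single value -23 would shrink the standard deviation sharply, which illustrates the remark that the standard deviation is not resistant to extreme values.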

Let's Do It!

Emergency Room Patients

The following are the ages of a sample of 20 patients seen in the emergency room of a hospital on a Friday night.

35 32 21 43 39 60 36 12 54 45
37 53 45 23 64 10 34 22 36 55

Find the standard deviation of the ages.


IQR and Standard Deviation

The interquartile range, IQR, is the distance between the first and third quartiles (Q3 - Q1), and measures the spread of the middle 50% of the data. When the median is used as a measure of center, the IQR is often used as a measure of spread. For skewed distributions, or distributions with outliers, the IQR tends to be a better measure of spread if your goal is to summarize that distribution. Adding the minimum and maximum values to the median and quartiles results in the five-number summary. A graphical display of the five-number summary is a boxplot, and the length of the box corresponds to the IQR.

The standard deviation is roughly the average distance of the observed values from their mean. The mean and the standard deviation are most useful for approximately symmetric distributions with no outliers. In the next chapter we will discuss an important family of symmetric distributions, called the normal distributions, for which the standard deviation is a very useful summary.

Tip: The numerical summaries presented in this chapter provide information about the center and spread of a distribution, but a graph, such as a histogram or stem-and-leaf plot, provides the best picture of the overall shape of the distribution. Graph your data first!


Variance and Standard Deviation for Grouped Data

The procedure for finding the variance and standard deviation for grouped data is similar to that for finding the mean for grouped data, and it uses the midpoints of each class.

Example

The data represent the number of miles that 20 runners ran during one week. Find the variance and the standard deviation for the frequency distribution of the data.

Solution

Step 1 Make a table as shown, and find the midpoint of each class.

Step 2 Multiply the frequency by the midpoint for each class, and place the products in column D: 1 · 8 = 8, 2 · 13 = 26, . . . , 2 · 38 = 76.

Step 3 Multiply the frequency by the square of the midpoint, and place the products in column E: 1 · 8² = 64, 2 · 13² = 338, . . . , 2 · 38² = 2888.

Step 4 Find the sums of columns B, D, and E. The sum of column B is n, the sum of column D is $\sum f \cdot x_m$, and the sum of column E is $\sum f \cdot x_m^2$. The completed table is shown.

Step 5 Substitute in the formula and solve for s² to get the variance.

Step 6 Take the square root to get the standard deviation.


Let's Do It! Eighty randomly selected light bulbs were tested to determine their lifetime in hours. The frequency table of the results is shown in table. Using the formula, find the standard deviation of lifetime of a light bulb.
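The grouped-data steps can be sketched in code. The formula used here, s² = [n·Σ(f·x_m²) − (Σ f·x_m)²] / [n(n − 1)], is the grouped version of the shortcut formula and is an assumption about "the formula" the steps refer to, since the original table and formula are not reproduced in these notes. The function name and tuple layout are illustrative, and the frequencies are the light bulb table from the earlier exercise.

```python
import math

def grouped_variance(classes):
    """Sample variance of grouped data via the shortcut formula.

    classes: list of (lower_limit, upper_limit, frequency) tuples.
    Each class is represented by its midpoint x_m.
    """
    n = sum(f for _, _, f in classes)
    sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)           # column D total
    sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)   # column E total
    return (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))

# Light bulb lifetimes (hours): class limits and frequencies.
bulbs = [(53, 63, 6), (64, 74, 12), (75, 85, 25),
         (86, 96, 18), (97, 107, 14), (108, 118, 5)]

variance = grouped_variance(bulbs)
std_dev = math.sqrt(variance)
print(round(variance, 2), round(std_dev, 2))
```

A calculator's 1-Var Stats routine, given the midpoints as the data list and the frequencies as the frequency list, should agree with this result up to rounding.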

Let's Do It! The data show distribution of the birth weight ( in oz.) of 100 consecutive deliveries. Use the calculator procedure (1-Var Stat) to calculate the standard deviation and the variance.

Chapter 2 Quiz Next Class.

Interval          Frequency
29.50-69.45       5
69.50-89.45       10
89.50-99.45       11
99.50-109.45      19
109.50-119.45     17
119.50-129.45     20
129.50-139.45     12
139.50-149.45     6


Chapter 2 Objectives:

Identify Types of Variables: (Quantitative / Categorical).

Compute percentages and answer questions using charts and histograms.

Identify misleading Pictograms.

Construct frequency table for categorical data.

Construct histograms of frequency tables.

Construct frequency table from a histogram.

Understand that histograms match the characteristics of the population

Compute Measures of Central Tendency (the three m’s: mean/median/mode) of data sets.

Combining Means: computing the overall mean of two groups using their averages (chapter 2

handout, Kim’s Scores)

Understand the Effect of Extreme Values on the mean (sensitive)/ median (resistant).

Compute the Mean of a Frequency Table

Effect of the Shape of the Distribution on the Mean, Median, Mode

Compute the Range of a data set.

Find the 5-Number Summary (Min, Q1, Median, Q3, Max).

Draw basic box-plot

Find the 5-Number Summary from a Modified Box-Plot.

Identify Potential outliers using the 1.5*IQR Rule.

Draw a Modified Box-Plot.

Remember that the Variance and Standard Deviation of a Sample are Different (in formula) from the Population's.

Compute the Variance and Standard Deviation of Data Sets.

Compute the Variance and Standard Deviation of a Frequency Table.