2.1-2.3: organizing data - faculty website listing and... · the following pie chart shows ... two...


Upload: vonhu

Post on 24-Jul-2018


Page 1: 2.1-2.3: Organizing Data - Faculty Website Listing and... · The following pie chart shows ... Two types of frequency distributions that are most ... such as nominal- or ordinal-level


2.1-2.3: Organizing Data

DEFINITIONS:

Qualitative data are those which classify the units into categories. The categories may or may not have a natural ordering to them. Qualitative variables are also called categorical variables.

Quantitative variables have numerical values that are measurements (length, weight, and so on) or counts (of how many). Arithmetic operations on such numerical values do have meaning. We further distinguish quantitative variables based on whether or not the values fall on a continuum: continuous variables do, discrete variables do not.

Let's Do It! What Type of Variable?

Hurricane Charles, in August 2004, has been blamed for at least 16 deaths. Listed below is information on other major storms and hurricanes that occurred from 1994 to 2003.

Storm Name               Date     Category   Estimated Damage/Cost*   Deaths
Tropical Storm Alberto   Jul-94   n/a        $1.2 billion             32
Hurricane Marilyn        Sep-95   2          $2.5 billion             13
Hurricane Opal           Oct-95   3          $3.6 billion             27
Hurricane Fran           Sep-96   3          $5.8 billion             37
Hurricane Bonnie         Aug-98   3          $1.1 billion             3
Hurricane Georges        Sep-98   2          $6.5 billion             16
Hurricane Floyd          Sep-99   2          $6.5 billion             77
Tropical Storm Allison   Jun-01   n/a        $5.1 billion             43
Hurricane Isabel         Sep-03   2          $4.0 billion             47

For each variable, determine whether it is qualitative or quantitative. If the variable is quantitative, state whether it is discrete or continuous.

(a) The name of the storm.

(b) The date the storm occurred.

(c) The category of the storm.

(d) The estimated amount of damage or cost of the storm.

(e) The number of deaths that occurred.


DEFINITIONS:

The distribution of a variable provides the possible values that a variable can take on and how often these possible values occur. The distribution of a variable shows the pattern of variation of the variable.

Let's Do It! College Admissions

The following pie chart shows the breakdown of undergraduate enrollment by race at the University of Michigan for the fall term of 1996. The total number of undergraduates enrolled for that term was 22,604.

(a) What percentage of undergraduates enrolled were of nonwhite race?

(b) How many undergraduates enrolled had no racial category listed?


Let's Do It!

Allied Van Lines surveyed 1000 respondents in May 2004. The question asked was, “Would you move if your mate had to relocate overseas because of work?” The results are summarized in the following pie chart.

[Pie chart: three slices of 68%, 30%, and 2%]

(a) What percentage of the respondents said that they would actually move if their mate relocated overseas?

(b) What questions would you ask about the sample selection before using this information to draw formal conclusions?

Let's Do It! Nothing Really Matters

The bar graph shown here displays the percentage of respondents who think a particular problem is the most important problem facing America for two different years. (SOURCE: The Economist, March 30-April 5, 1996, pg 33.)

(a) In January 1992, which problem category had the highest percentage of responses?

Was this the same category which had the highest percentage of responses in 1996?

(b) In January 1992, what percentage of respondents reported crime as the most important problem facing America?

In January 1996, what percentage of respondents reported crime as the most important problem facing America?

(c) What is the approximate sum of the percentage of respondents across all of the listed problem categories for January 1992? Is this sum approximately 100%? If not, give a possible reason why not.

[Pie chart legend: No, Yes, Not Sure]


Example: A Misleading Bar Graph

Problem

The bar graph that follows presents the total sales figures for three realtors. When the bars are replaced with pictures, often related to the topic of the graph, the graph is called a pictogram.

[Pictogram: Total Sales by Realtor (No. 1, No. 2, No. 3), drawn as houses; the labeled values are $2.05 million, $1.41 million, and $0.9 million]

(a) How does the height of the home for Realtor 1 compare to that for Realtor 3?

(b) How does the area of the home for Realtor 1 compare to that for Realtor 3?

What We’ve Learned: When you see a pictogram, be careful to interpret the results appropriately, and do not allow the area of the pictures to mislead you.


Let's Do It! Categorical Frequency Distributions

A survey was taken on how much trust people place in the information they read on the Internet.

A = trust in everything they read,
M = trust in most of what they read,
H = trust in about half of what they read,
S = trust in a small portion of what they read.

Construct a categorical frequency distribution for the data.

M M M A H M S M H M
S M M M M A M M A M
M M H M M M H M H M
A M M M H M M M M M

A frequency distribution is the organization of raw data in table form, using classes and frequencies.

Each raw data value is placed into a quantitative or qualitative category called a class.

The frequency of a class is the number of data values contained in that class.

Two types of frequency distributions that are most often used are the categorical frequency distribution and the grouped frequency distribution.

Categorical Frequency Distributions: The categorical frequency distribution is used for data that can be placed in specific categories, such as nominal- or ordinal-level data.

Grouped Frequency Distributions: When the range of the data is large, the data must be grouped into classes that are more than one unit in width.
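As a rough illustration of the categorical case, the trust survey above can be tallied with a few lines of Python. This is only a sketch: the helper name categorical_frequency_table is made up for this example, and the percent column is an extra that the exercise does not ask for.

```python
from collections import Counter

# The 40 survey responses from the exercise above (codes A, M, H, S).
responses = (
    "M M M A H M S M H M "
    "S M M M M A M M A M "
    "M M H M M M H M H M "
    "A M M M H M M M M M"
).split()


def categorical_frequency_table(data):
    """Return {class: (frequency, percent)} for categorical data.

    Hypothetical helper written for this illustration only.
    """
    counts = Counter(data)          # tally each class
    n = len(data)                   # total number of observations
    return {cls: (freq, 100 * freq / n) for cls, freq in counts.items()}


table = categorical_frequency_table(responses)
for cls in "AMHS":
    freq, pct = table[cls]
    print(f"{cls}  {freq:>3}  {pct:5.1f}%")
```

The frequencies should add up to the number of responses, which is a quick check that no value was missed while tallying.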


Histograms and Pie Charts

Let's Do It! Distribution of Scores

For 108 randomly selected college applicants, the following frequency distribution for entrance exam scores was obtained.

a. Construct a relative frequency histogram for the data.

b. Applicants who score above 107 need not enroll in a summer developmental program. In this group, how many students do not have to enroll in the developmental program?

The Histogram: a graph that displays quantitative data by using contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the classes.

The Pie Graph: a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution. The angle (in degrees) of each wedge is given by:

Angle = relative frequency × 360.

Class limits    Frequency
90 – 98           6
99 – 107         22
108 – 116        43
117 – 125        28
126 – 134         9
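The relative frequencies needed for the histogram can be computed directly from the table. The following is a minimal sketch; the dictionary layout and variable names are assumptions made for the illustration, not part of the original notes.

```python
# Entrance exam score classes: (lower limit, upper limit) -> frequency.
score_classes = {
    (90, 98): 6,
    (99, 107): 22,
    (108, 116): 43,
    (117, 125): 28,
    (126, 134): 9,
}

n_applicants = sum(score_classes.values())

# Relative frequency = class frequency / total number of observations.
relative_freq = {limits: f / n_applicants for limits, f in score_classes.items()}

# Count the applicants in classes that lie entirely above 107.
above_107 = sum(f for (lower, _), f in score_classes.items() if lower > 107)

for (lower, upper), rel in relative_freq.items():
    print(f"{lower:>3} - {upper:<3}  {rel:.3f}")
print("Scored above 107:", above_107)
```

The bar heights of a relative frequency histogram are these proportions, so they sum to 1 instead of to the sample size.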


Let's Do It!

Matching Shapes to Characteristics

[Four histograms, labeled Distribution 1 through Distribution 4, each with a blank "Characteristic = ____" to fill in]

Characteristics:

1. Distribution of age for the population of the United States in the year 1980. Describe and explain the shape of the distribution.

2. Distribution of miles of coastline for the 50 United States. Describe and explain the shape of the distribution. Which states do you think would be in the last class furthest to the right?

3. Distribution of the number of miles traveled to work, that is, commuting distance for employed adults in a city. Describe and explain the shape of the distribution.

4. Distribution of age at death for the population of the United States (year 1980). Describe and explain the shape of the distribution.


Pie Graph

Let’s Explore It!

The assets of the richest 1% of Americans are distributed as follows. Make a pie chart for the percentages.

Asset                        Percent    Angle
Principal residence            7.8%      28.08°
Liquid assets                  5.0%      18.0°
Pension accounts               6.9%      24.84°
Stock, funds, and trusts      31.6%     113.76°
Businesses & real estate      46.9%     168.84°
Miscellaneous                  1.8%       6.48°
Total                          100%     360°
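The angle column follows from the rule Angle = relative frequency × 360. A short sketch of that calculation is given here; the function name wedge_angles is invented for this example.

```python
# Percent of assets in each category, from the table above.
asset_percent = {
    "Principal residence": 7.8,
    "Liquid assets": 5.0,
    "Pension accounts": 6.9,
    "Stock, funds, and trusts": 31.6,
    "Businesses & real estate": 46.9,
    "Miscellaneous": 1.8,
}


def wedge_angles(percentages):
    """Angle in degrees of each pie wedge: relative frequency * 360."""
    return {name: (pct / 100) * 360 for name, pct in percentages.items()}


angles = wedge_angles(asset_percent)
for name, angle in angles.items():
    print(f"{name:<26} {angle:7.2f} degrees")
```

Because the percentages add to 100%, the angles add to a full circle of 360 degrees, up to floating-point rounding.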

Let’s Do It!

The population of federal prisons, according to the most serious offenses, consists of the following. Make a pie chart of the population.

Violent offenses 12.6%
Property offenses 8.5%
Drug offenses 60.2%
Public order offenses
Weapons 8.2%
Immigration 4.9%
Other 5.6%

Online homework on 2.1-2.3 posted.

[Pie chart of the asset distribution: Businesses & Real Estate 46.9%, Stocks, Funds, and Trusts 31.6%, Principal Residence 7.8%, Pension Accounts 6.9%, Liquid Assets 5.0%, Misc. 1.8%]


DATA SET 1

2.4 Measures of Central Tendency

Suppose you had to give a single number that would represent the most typical age for the 20 subjects.

What number would you choose?

Measures of center are numerical values that tend to report in some sense the middle of a set of data. We will focus on the mean and the median.

If the data are a sample, the mean and median would be called statistics. If the data form an entire population, then these measures of center would be called parameters.

Mean

DEFINITION: The mean of a set of n observations is simply the sum of the observations divided by the number of observations, n.

Mean age of the 20 subjects in the medical study (add the 20 ages up and divide by 20):

$$\bar{x} = \frac{45 + 41 + 51 + 46 + 47 + \cdots + 45 + 37}{20} = \frac{867}{20} = 43.35 \text{ years}$$

Special notation: If $x_1, x_2, \ldots, x_n$ denote a sample of n observations, then the mean of the sample is called "x-bar" and is denoted by:

$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

The mean of a population is denoted by the Greek letter μ.

Subject #   Gender   Age
1001        M        45
1002        M        41
1003        F        51
1004        F        46
1005        F        47
1006        F        42
1007        M        43
1008        F        50
1009        M        39
1010        M        32
1011        M        41
1012        F        44
1013        F        47
1014        F        49
1015        M        45
1016        F        42
1017        M        41
1018        F        40
1019        M        45
1020        M        37
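As a quick check of the mean age, the definition can be applied to the ages in the data set. This is a small sketch; the function name mean is our own choice, and Python's statistics.mean would give the same value.

```python
# Ages of the 20 subjects, in subject order (1001 through 1020).
ages = [45, 41, 51, 46, 47, 42, 43, 50, 39, 32,
        41, 44, 47, 49, 45, 42, 41, 40, 45, 37]


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


print("sum of ages:", sum(ages))
print("n:", len(ages))
print("mean age:", mean(ages))
```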


Let’s Do It! Mean Number of Children per Household

Suppose that the number of children in a simple random sample of 10 households is as follows: 2, 3, 0, 2, 1, 0, 3, 0, 1, 4

(a) Calculate the sample mean number of children per household.

(b) Suppose that the observation for the last household in the above list was incorrectly recorded as 40 instead of 4. What would happen to the mean?

Note that, with the incorrect value, 9 of the 10 observations are less than the mean. The mean is sensitive to extreme observations. Most graphical displays would have detected this outlying observation.
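The effect of the recording error on the mean can be seen numerically. The sketch below uses the household data from the exercise; the variable names are chosen only for this illustration.

```python
# Number of children in the 10 sampled households.
children = [2, 3, 0, 2, 1, 0, 3, 0, 1, 4]

# Same data with the last household mis-recorded as 40 instead of 4.
children_with_error = children[:-1] + [40]


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


correct_mean = mean(children)
distorted_mean = mean(children_with_error)

# How many observations fall below the distorted mean?
below = sum(1 for x in children_with_error if x < distorted_mean)

print("mean with correct data:", correct_mean)
print("mean with the error:   ", distorted_mean)
print("observations below the distorted mean:", below, "of", len(children))
```

A single extreme value is enough to pull the mean above almost every observation in the sample.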

Let's Do It! A Mean Is Not Always Representative

Kim's test scores are 7, 98, 25, 19, and 26. Calculate Kim's mean test score. Explain why the mean does not do a very good job at summarizing Kim's test scores.

Let's Do It! Combining Means

We have seven students. The mean score for three of these students is 54 and the mean score for the four other students is 76. What is the mean score for all seven students?
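One way to think about combining means is to rebuild each group's total (mean times group size) and divide by the overall number of observations. The following sketch assumes nothing beyond the numbers in the exercise; combined_mean is a hypothetical helper name.

```python
def combined_mean(group_means, group_sizes):
    """Mean of all observations, given each group's mean and size."""
    # Each group's total is its mean times its size.
    total = sum(m * n for m, n in zip(group_means, group_sizes))
    return total / sum(group_sizes)


overall = combined_mean([54, 76], [3, 4])   # (3*54 + 4*76) / 7
simple_average = (54 + 76) / 2              # ignores the group sizes

print("weighted by group size:", round(overall, 2))
print("plain average of the two means:", simple_average)
```

The two results differ because the larger group carries more weight in the combined mean.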

The mean = the point of equilibrium, the point where the distribution would balance.

If the distribution is symmetric, as in the first picture at the left, the mean would be exactly at the center of the distribution.

As the largest observation is moved further to the right, making this observation somewhat extreme, the mean shifts towards the extreme observation.

If a distribution appears to be skewed, we may wish also to report a more resistant measure of center.

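[Three dot plots showing the mean as the balance point: Mean = 2 (axis marks 1, 2, 3), Mean = 2.5 (axis marks 1, 2, 5), Mean = 4 (axis marks 1, 2, 11)]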


The Mean of Grouped Data / Frequency Tables

The procedure for finding the mean for grouped data uses the midpoints of the classes. This procedure is shown next.

Example

The data represent the number of miles run during one week for a sample of 20 runners.

Solution

The procedure for finding the mean for grouped data is given here.

Step 1 Make a table as shown.

Step 2 Find the midpoints of each class and enter them in column C.

Step 3 For each class, multiply the frequency by the midpoint, as shown, and place the product in column D: 1 · 8 = 8, 2 · 13 = 26, etc. The completed table is shown here.

Step 4 Find the sum of column D.

Step 5 Divide the sum by n to get the mean.


Let's Do It!

Eighty randomly selected light bulbs were tested to determine their lifetime in hours. The frequency table of the results is shown below. Find the average lifetime of a light bulb.

Life interval in hours   Frequency
53-63                     6
64-74                    12
75-85                    25
86-96                    18
97-107                   14
108-118                   5
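The five-step procedure for grouped data (midpoints, frequency times midpoint, sum, divide by n) can be sketched in code using the light bulb table. The data layout and the name grouped_mean are assumptions made for this illustration.

```python
# Light bulb lifetimes: ((lower limit, upper limit), frequency).
lifetimes = [
    ((53, 63), 6),
    ((64, 74), 12),
    ((75, 85), 25),
    ((86, 96), 18),
    ((97, 107), 14),
    ((108, 118), 5),
]


def grouped_mean(classes):
    """Approximate mean of grouped data using class midpoints."""
    n = sum(freq for _, freq in classes)                   # total frequency
    weighted_sum = sum(freq * (lower + upper) / 2          # frequency * midpoint
                       for (lower, upper), freq in classes)
    return weighted_sum / n


print("n =", sum(freq for _, freq in lifetimes))
print("approximate mean lifetime (hours):", grouped_mean(lifetimes))
```

The result is an approximation, since every value in a class is treated as if it sat at the class midpoint.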

Let's Do It!

The cost per load (in cents) of 35 laundry detergents tested by a consumer organization is given below.

Class limit   Frequency
13-19          2
20-26          7
27-33         12
34-40          5
41-47          6
48-54          1
55-61          0
62-68          2


A measure of center that is more resistant to extreme values is the median.

Median

DEFINITION: The median of a set of n observations, ordered from smallest to largest, is a value such that half of the observations are less than or equal to that value and half the observations are greater than or equal to that value.

If the number of observations is odd, the median is the middle observation.

If the number of observations is even, the median is any number between the two middle observations, including either of the two middle observations.

To be consistent, we will define the median as the mean or average of the two middle observations.

Location of the median: (n+1)/2, where n is the number of observations.

The ages of the n = 20 subjects...

Calculating (n+1)/2 we get (20+1)/2 = 10.5. So the two middle observations are the 10th and 11th observations, namely 43 and 44. The median is the mean of these two middle observations, (43+44)/2 = 43.5 years.

32 37 39 40 41 41 41 42 42 43 44 45 45 45 46 47 47 49 50 51

10th obs = 43, 11th obs = 44, median = 43.5
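The location rule (n+1)/2 translates directly into code. The sketch below is one possible implementation written for these notes; the function name median_by_location is hypothetical, and statistics.median from the standard library returns the same value.

```python
def median_by_location(values):
    """Median found via the location rule (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                 # 1-based location of the median
    if position.is_integer():              # odd n: a single middle observation
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]     # e.g. the 10th observation when position is 10.5
    upper = ordered[int(position)]         # e.g. the 11th observation
    return (lower + upper) / 2             # mean of the two middle observations


ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

print("location:", (len(ages) + 1) / 2)
print("median age:", median_by_location(ages))
```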

Let's Do It! Median Number of Children per Household

Find the median number of children in a household from this sample of 10 households, that is, find the median of

Number of Children: 2 3 0 1 4 0 3 0 1 2

(a) Median = ______________

(b) What happens to the median if the fifth observation in the first list was incorrectly recorded as 40 instead of 4?

(c) What happens to the median if the third observation in the first list was incorrectly recorded as -20 instead of 0?

Note: The median is resistant; that is, it does not change, or changes very little, in response to extreme observations.
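Resistance can be illustrated by comparing how the mean and the median react to the two recording errors described in the exercise. This is a sketch using Python's standard statistics module; the variable names are ours.

```python
from statistics import mean, median

children = [2, 3, 0, 1, 4, 0, 3, 0, 1, 2]

# Fifth observation recorded as 40 instead of 4.
with_high_error = children.copy()
with_high_error[4] = 40

# Third observation recorded as -20 instead of 0.
with_low_error = children.copy()
with_low_error[2] = -20

for label, data in [("original", children),
                    ("40 instead of 4", with_high_error),
                    ("-20 instead of 0", with_low_error)]:
    print(f"{label:<18} mean = {mean(data):5.1f}   median = {median(data)}")
```

In both cases the erroneous value lands at one end of the ordered list, so the two middle observations, and hence the median, stay where they were, while the mean moves noticeably.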


Another Measure—The Mode

DEFINITION: The mode of a set of observations is the most frequently occurring value; it is the value having the highest frequency among the observations.

The mode of the values: { 0, 0, 0, 0, 1, 1, 2, 2, 3, 4 } is 0

For { 0, 0, 0, 1, 1, 2, 2, 2, 3, 4 } two modes, 0 and 2 (bimodal)

What would be the mode for { 0, 1, 2, 4, 5, 8 } ?

For {0, 0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4, 5 } ?
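The four sets above can be checked with a small helper. This sketch adopts one common convention as an assumption: when every value occurs exactly once, it reports no mode (an empty list) instead of listing every value. The function name modes is made up for this example.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats.

    Treating "all values occur once" as "no mode" is a convention
    assumed here; some texts and libraries handle this case differently.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([0, 0, 0, 0, 1, 1, 2, 2, 3, 4]))            # one mode
print(modes([0, 0, 0, 1, 1, 2, 2, 2, 3, 4]))            # bimodal
print(modes([0, 1, 2, 4, 5, 8]))                        # no repeated value
print(modes([0, 0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4, 5]))   # 0 occurs once more than 4
```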

The mode is not often used as a measure of center for quantitative data.

The mode can be computed for qualitative data.

The modal race category is “white.” If the categories were coded as:

1=White, 2=Asian, 3=African-American, 4=Hispanic, 5=American Indian, 6=No category listed,

then the mode would be the value 1.

[Bar chart: Percent (0 to 80) by Race: American Indian, No Category, Hispanic, African-American, Asian, White]


Let’s Do It! Different Measures Can Give Different Impressions

The famous trio (the mean, the median, and the mode) represent three different methods for finding a so-called “center” value. These three values may be the same but are more likely going to be different. When they are different, they can lead to different interpretations of the data being summarized.

Consider the annual incomes of five families in a neighborhood:

$12,000 $12,000 $30,000 $90,000 $100,000

(a) Calculate the average income.

(b) Calculate the median income.

(c) Calculate the modal income.

(d) If you were trying to promote that this is an affluent neighborhood, which measure might you prefer to present?

(e) If you were trying to argue against a tax increase, which measure might you prefer to present?

(f) If you want to represent these values with the income that is in the middle, which measure might you prefer to present?
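The three measures for the income data can be computed side by side with Python's standard statistics module. This is only a sketch for checking parts (a) to (c); which measure to present in (d) to (f) remains a judgment call.

```python
from statistics import mean, median, mode

# Annual incomes of the five families, in dollars.
incomes = [12_000, 12_000, 30_000, 90_000, 100_000]

average_income = mean(incomes)
median_income = median(incomes)
modal_income = mode(incomes)

print(f"mean:   ${average_income:,.0f}")
print(f"median: ${median_income:,.0f}")
print(f"mode:   ${modal_income:,.0f}")
```

The three values are far apart here because the two large incomes pull the mean upward while the mode sits at the lowest income.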

Which Measure of Center to Use?

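[Figure: four distribution shapes, each with the median marking the point that splits the area 50%/50%. Bell-shaped, symmetric: mean = median = mode. Bimodal: mean = median, two modes. Skewed right: mode, median, and mean marked, with the mean pulled toward the right tail. Skewed left: mode, median, and mean marked, with the mean pulled toward the left tail.]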


Mean, Median, and Mode

The most common measure of center is the mean, which locates the balancing point of the distribution. The mean equals the sum of the observations, divided by how many there are. The mean is affected by extreme observations (outliers and values which are far in the tail of a distribution that is skewed). So the mean tends to be a good choice for locating the center of a distribution that is unimodal and roughly symmetric, with no outliers.

The median is a more robust measure of center, that is, it is not influenced by extreme values. The median is the middle observation when the data are ordered from smallest to largest. If you have an odd number of values, the median is the one in the middle. If you have an even number of values, the median is the mean of the two middle values, and falls exactly halfway between them. If you have n observations, then (n+1)/2 tells you the location or position of the median. For skewed distributions or distributions with outliers, the median tends to be the better choice for locating the center.

The mode is the value(s) that occurs most often. For a distribution, the mode is the value associated with the highest peak. The most frequent value can be far from the center of the distribution, so the mode is not really a measure of center. However, the mode is the only measure of the three that can be used for qualitative data.

Tips: When you see or hear an “average” reported, ask which average was really computed: the mean or the median.

Think about or examine the distribution of values to assess if the measure of center used is appropriate.

Online homework 2.4


2.5-2.5 MEASURING VARIATION OR SPREAD

Both sets of data have the same mean, median and mode but the values obviously differ in another respect: the variation or spread of the values.

The values in List 1 are much more tightly clustered around the center value of 60.

The values in List 2 are much more dispersed or spread out.

List 1: 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65 (mean = median = mode = 60)

[Dot plot of List 1 on a scale from 35 to 85]

List 2: 35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85 (mean = median = mode = 60)

[Dot plot of List 2 on a scale from 35 to 85]

Range

The range is the simplest measure of variability or spread.

Range is just the difference between the largest value and the smallest value. Range can give a distorted picture of the actual pattern of variation.

Two distributions: same range but different patterns of variation. The first distribution has most of its values far from the center, while the second distribution has most of its values closer to the center.

[Two dot plots, each on a scale from 20 to 30]
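Returning to List 1 and List 2 from the start of this section, a short sketch shows that the two lists share the same center while their ranges differ. The helper name data_range is our own.

```python
from statistics import mean, median, mode

list_1 = [55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65]
list_2 = [35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85]


def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


for name, data in [("List 1", list_1), ("List 2", list_2)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}, "
          f"mode = {mode(data)}, range = {data_range(data)}")
```

The range separates these two lists, but since it depends only on the two most extreme values, it cannot distinguish two distributions that share the same minimum and maximum.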


Inter-quartile Range

The inter-quartile range measures the spread of the middle 50% of the data. You first find the median (represented by Q2, the value that divides the data into two halves), and then find the median for each half. The three values that divide the data into four parts are called the quartiles, represented by Q1, Q2, and Q3. The difference between the third quartile and the first quartile is called the inter-quartile range, denoted by IQR = Q3 - Q1.

Example: Quartiles for Age

The ages of the 20 subjects in the medical study are listed below in order.

32, 37, 39, 40, 41, 41, 41, 42, 42, 43,

44, 45, 45, 45, 46, 47, 47, 49, 50, 51

The histogram of the ages is also provided.

(a) Calculate the median age.
(b) Calculate the first quartile Q1 for this age data.
(c) Calculate the third quartile Q3 for this age data.
(d) Calculate the range for this age data.

median = 43.5, Q1 = 41, Q3 = 46.5

We see that the distribution of age is approximately symmetric and that the quartiles are about the same distance from the median.

The quartiles are actually the 25th, 50th, and 75th percentiles.

[Histogram of age: horizontal axis from 30 to 55, vertical axis Count from 2 to 8]

DEFINITION: The pth percentile is the value such that p% of the observations fall at or below that value and (100 - p)% of the observations fall at or above that value.


Finding the Quartiles

1. Find the median of all of the observations.
2. First Quartile = Q1 = median of observations that fall below the median.
3. Third Quartile = Q3 = median of observations that fall above the median.

Notes

When the number of observations is odd, the middle observation is the median. This observation is not included in either of the two halves when computing Q1 and Q3.

Although different books, calculators, and computers may use slightly different ways to compute the quartiles, they are all based on the same idea.

In a left-skewed distribution, the first quartile will be farther from the median than the third quartile is. If the distribution is symmetric, the quartiles should be the same distance from the median.
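The median-of-halves method described above can be sketched as follows. This follows the rule in these notes (the middle observation is left out of both halves when n is odd); as the notes point out, software packages may use slightly different rules, so library functions will not always return the same quartiles. The function name quartiles is hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]             # observations below the median position
    upper_half = ordered[half + n % 2:]     # skips the middle observation when n is odd
    return median(lower_half), median(ordered), median(upper_half)


ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

q1, q2, q3 = quartiles(ages)
iqr = q3 - q1
age_range = max(ages) - min(ages)

print("Q1 =", q1, " median =", q2, " Q3 =", q3)
print("IQR =", iqr, " range =", age_range)
```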

Five-Number Summary

Five-number summary: Minimum, Q1, Median, Q3, Maximum

Min Q1 Q2=Median Q3 Max

Boxplot:

To Build a Basic Boxplot

List the data values in order from smallest to largest.

Find the five number summary: minimum, Q1, median, Q3, and maximum.

Locate the values for Q1, the median and Q3 on the scale. These values determine the “box” part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.

Draw lines (called whiskers) from the midpoints of the ends of the box out to the minimum and maximum.


Problem

Consider the (ordered) ages of the 20 subjects in a medical study:

32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
44, 45, 45, 45, 46, 47, 47, 49, 50, 51

The five-number summary for the age data is given by: min = 32, Q1 = 41, median = 43.5, Q3 = 46.5, and max = 51.

Draw the modified boxplot.

The distance between the median and the quartiles is roughly the same, supporting the rough symmetry of the distribution as seen previously from the histogram.

Using the 1.5 x IQR Rule to Identify Outliers and Build a Modified Boxplot

List the data values in order from smallest to largest.

Find the five-number summary: minimum, Q1, median, Q3, and maximum.

Locate the values for Q1, the median and Q3 on the scale. These values determine the “box” part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.

Find the IQR = Q3 – Q1.

Compute the quantity STEP = 1.5 x (IQR).

Find the location of the inner fences by taking 1 step out from each of the quartiles:

lower inner fence = Q1 – STEP; upper inner fence = Q3 + STEP.

Draw the lines (whiskers) from the midpoints of the ends of the box out to the smallest and largest values WITHIN the inner fences.

Observations that fall OUTSIDE the inner fences are considered potential outliers. If there are any outliers, plot them individually along the scale using a solid dot.
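The 1.5 x IQR steps above can be sketched in code for the age data. The quartiles are computed with the median-of-halves rule used in these notes; the function names are invented for this example, and other software may place the quartiles (and hence the fences) slightly differently.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + n % 2:])     # drops the middle value when n is odd
    return ordered[0], q1, median(ordered), q3, ordered[-1]


def inner_fences(values, multiplier=1.5):
    """Lower and upper inner fences: Q1 - STEP and Q3 + STEP."""
    _, q1, _, q3, _ = five_number_summary(values)
    step = multiplier * (q3 - q1)           # STEP = 1.5 x IQR
    return q1 - step, q3 + step


ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

lower_fence, upper_fence = inner_fences(ages)
outliers = [x for x in ages if x < lower_fence or x > upper_fence]
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whisker_ends = (min(inside), max(inside))   # whiskers stop at values within the fences

print("fences:", lower_fence, upper_fence)
print("potential outliers:", outliers)
print("whiskers extend to:", whisker_ends)
```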


Five-number summary: min = 1, Q1 = 21, median = 32, Q3 = 66, max = 325

Example: Any Age Outlier?

Let’s apply the "rule of thumb" to our age data set to assess if there are any outliers.

(a) Construct the fences for the modified boxplot based on the 1.5 * IQR rule.

(b) Are there any outliers using the 1.5 * IQR rule?

(c) Construct the modified boxplot.

[Figure: modified boxplot with labels for the inner fences, the potential outliers (an outside value and a far outside value), and the farthest observations that are not potential outliers]


Let's Do It! Five-Number Summary and Outliers


Let’s Do It! Cost of Running Shoes

The prices for 12 comparable pairs of running shoes produced the following modified boxplot.

[Modified boxplot of PRICE on a scale from 40 to 120, with one potential outlier marked *]

(a) What was the approximate range of prices for such running shoes?

Range = ______________

(b) Twenty-five percent of the shoes cost more than approximately what amount?

$ _____________

Side-by-side boxplots are helpful for comparing two or more distributions with respect to the five-number summary.

Although the median of the first process is closer to the target value of 20.000 cm, the second process produces a less variable distribution.


Let's Do It! Comparing Ages: Antibiotic Study

Variable = age for 23 children randomly assigned to one of two treatment groups.

(a) Give the five-number summary for each of the two treatment groups. Comment on your results.

Amoxicillin Group (n=11): 8 9 9 10 10 11 11 12 14 14 17

Five-number summary:

Cefadroxil Group (n=12): 7 8 9 9 9 10 10 11 12 13 14 18

Five-number summary:

(b) Make side-by-side modified boxplots for the antibiotic study data in part (a).

Amoxicillin: Lower fence = , Upper fence = , outliers:

Cefadroxil: Lower fence = , Upper fence = , outliers:


Standard Deviation

... a measure of the spread of the observations from the mean.

... think of the standard deviation as an “average (or standard) distance of the observations from the mean.”

Example: Standard Deviation, What Is It?

Deviations: -4, 1, 3

Squared Deviations: 16, 1, 9

Observation $x$     Deviation $x - \bar{x}$     Squared Deviation $(x - \bar{x})^2$
0                   0 - 4 = -4                  16
5                   5 - 4 = 1                   1
7                   7 - 4 = 3                   9
mean = 4            sum always = 0              sum = 26

sample variance: $s^2 = \frac{(-4)^2 + 1^2 + 3^2}{3 - 1} = \frac{16 + 1 + 9}{2} = \frac{26}{2} = 13$

sample standard deviation: $s = \sqrt{13} \approx 3.6$
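The table in this example can be reproduced step by step. The sketch below mirrors the columns (observation, deviation, squared deviation) for the three observations 0, 5, 7; the variable names are chosen only for the illustration.

```python
from math import sqrt

data = [0, 5, 7]

n = len(data)
x_bar = sum(data) / n                        # mean of the observations

deviations = [x - x_bar for x in data]       # each observation minus the mean
squared = [d ** 2 for d in deviations]       # squared deviations

variance = sum(squared) / (n - 1)            # divide by n - 1 for a sample
std_dev = sqrt(variance)

print("mean:", x_bar)
print("deviations:", deviations, "sum:", sum(deviations))
print("squared deviations:", squared, "sum:", sum(squared))
print("sample variance:", variance)
print("sample standard deviation:", round(std_dev, 1))
```

The deviations always sum to zero, which is why they are squared before averaging.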

Interpretation of the Standard Deviation

Think of the standard deviation as roughly an average distance of the observations from their mean. If all of the observations are the same, then the standard deviation will be 0 (i.e. no spread). Otherwise the standard deviation is positive and the more spread out the observations are about their mean, the larger the value of the standard deviation.


If $x_1, x_2, \ldots, x_n$ denote a sample of n observations, the sample variance is denoted by:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Sample standard deviation, denoted by s, is the square root of the variance: $s = \sqrt{s^2}$.

Shortcut formulas for computing the variance and standard deviation

$$s^2 = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1} = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}$$

These are mathematically equivalent to the preceding formulas and do not involve using the mean. They save time when repeated subtracting and squaring occur in the original formulas. They are also more accurate when the mean has been rounded.

Remarks:

The variance is measured in squared units. By taking the square root of the variance we bring this measure of spread back into the original units.

Just as the mean is not a resistant measure of center, the standard deviation is not a resistant measure of spread, since it uses the mean in its definition. It is heavily influenced by extreme values.

There are statistical arguments that support why we divide by n - 1 instead of n in the denominator of the sample standard deviation.


Example
Consistency of Weight Loss Program
In a recent study of the effect of a certain diet on weight reduction, 11 subjects were put on the diet for two weeks and their weight loss/gain in lbs was measured (positive values indicate weight loss).

1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5, -23.

What is the standard deviation of the weight loss?

Solution
$\sum x = 1 + 1 + \cdots + 2.5 + (-23) = -4.5$, $\sum x^2 = 1^2 + 1^2 + \cdots + 2.5^2 + (-23)^2 = 569.25$

The standard deviation of this sample is

$$s = \sqrt{\frac{569.25 - (-4.5)^2/11}{10}} \approx 7.5327$$
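The arithmetic in this example can be checked with a short Python sketch. It computes the standard deviation two ways, once from the definition (squared deviations from the mean) and once from the shortcut formula (sums of x and x squared only). The function names `sd_definition` and `sd_shortcut` are illustrative choices, not part of the original notes.

```python
import math

# Weight loss/gain in lbs for the 11 subjects (positive = weight loss).
weight_loss = [1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5, -23]

def sd_definition(xs):
    """Sample standard deviation from squared deviations about the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_shortcut(xs):
    """Sample standard deviation from the shortcut formula (no mean needed)."""
    n = len(xs)
    sum_x = sum(xs)                  # sum of the observations
    sum_x2 = sum(x * x for x in xs)  # sum of the squared observations
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(round(sd_definition(weight_loss), 4))
print(round(sd_shortcut(weight_loss), 4))
```

Both functions should agree up to floating-point rounding, which is the sense in which the shortcut formula is "mathematically equivalent" to the definition.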

Let's Do It!
Emergency Room Patients
The following are the ages of a sample of 20 patients seen in the emergency room of a hospital on a Friday night.

35 32 21 43 39 60 36 12 54 45
37 53 45 23 64 10 34 22 36 55

Find the standard deviation of the ages.


IQR and Standard Deviation

The interquartile range, IQR, is the distance between the first and third quartiles (Q3 - Q1), and measures the spread of the middle 50% of the data. When the median is used as a measure of center, the IQR is often used as a measure of spread. For skewed distributions, or distributions with outliers, the IQR tends to be a better measure of spread if your goal is to summarize that distribution.

Adding the minimum and maximum values to the median and quartiles results in the five-number summary. A graphical display of the five-number summary is a boxplot, and the length of the box corresponds to the IQR.
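As a rough illustration, the five-number summary and IQR can be computed in Python for the weight loss data from the earlier example. This sketch assumes the "median of each half" rule for quartiles (the median itself is left out of both halves when n is odd); other textbooks and software use slightly different quartile conventions, so results may differ a little. The function name `five_number_summary` is a hypothetical helper.

```python
import statistics

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half]           # values below the median position
    upper = s[half + n % 2:]   # values above the median position
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

weight_loss = [1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5, -23]
low, q1, med, q3, high = five_number_summary(weight_loss)
print(low, q1, med, q3, high)
print("IQR =", q3 - q1)
```

Under this convention the single extreme value of -23 only moves the minimum; the quartiles, and therefore the IQR, stay with the bulk of the data.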

The standard deviation is roughly the average distance of the observed values from their mean. The mean and the standard deviation are most useful for approximately symmetric distributions with no outliers. In the next chapter we will discuss an important family of symmetric distributions, called the normal distributions, for which the standard deviation is a very useful summary.

Tip:
The numerical summaries presented in this chapter provide information about the center and spread of a distribution, but a graph, such as a histogram or stem-and-leaf plot, provides the best picture of the overall shape of the distribution.

Graph your data first!


Variance and Standard Deviation for Grouped Data
The procedure for finding the variance and standard deviation for grouped data is similar to that for finding the mean for grouped data, and it uses the midpoints of each class.

Example
The data represent the number of miles that 20 runners ran during one week. Find the variance and the standard deviation for the frequency distribution of the data.

Solution
Step 1 Make a table as shown, and find the midpoint of each class.

Step 2 Multiply the frequency by the midpoint for each class, and place the products in column D.
1 · 8 = 8, 2 · 13 = 26, ..., 2 · 38 = 76

Step 3 Multiply the frequency by the square of the midpoint, and place the products in column E.
1 · 8² = 64, 2 · 13² = 338, ..., 2 · 38² = 2888

Step 4 Find the sums of columns B, D, and E. The sum of column B is $n$, the sum of column D is $\sum f \cdot X_m$, and the sum of column E is $\sum f \cdot X_m^2$, where $f$ is the class frequency and $X_m$ is the class midpoint. The completed table is shown.

Step 5 Substitute in the formula and solve for $s^2$ to get the variance.

$$s^2 = \frac{n\left(\sum f \cdot X_m^2\right) - \left(\sum f \cdot X_m\right)^2}{n(n-1)}$$

Step 6 Take the square root to get the standard deviation.
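Steps 1 through 6 can be sketched in Python as follows. The frequency table itself is not reproduced in the text, so the `classes` list below is an assumption: only the first two rows and the last row are taken from the products quoted in Steps 2 and 3, and the middle rows are hypothetical values chosen so that the frequencies add up to the 20 runners. The function name `grouped_variance` is likewise illustrative.

```python
import math

# (midpoint, frequency) pairs. Only (8, 1), (13, 2) and (38, 2) come from
# the worked steps; the middle rows are ASSUMED for illustration.
classes = [(8, 1), (13, 2), (18, 3), (23, 5), (28, 4), (33, 3), (38, 2)]

def grouped_variance(rows):
    """Sample variance for grouped data using class midpoints."""
    n = sum(f for _, f in rows)                 # sum of column B
    sum_fx = sum(f * m for m, f in rows)        # sum of column D
    sum_fx2 = sum(f * m ** 2 for m, f in rows)  # sum of column E
    return (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))

variance = grouped_variance(classes)   # Step 5
std_dev = math.sqrt(variance)          # Step 6
print(round(variance, 1), round(std_dev, 1))
```

Because every observation in a class is replaced by the class midpoint, the result is an approximation to the variance of the raw data, not an exact value.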


Let's Do It!
The data show the distribution of the birth weight (in oz.) of 100 consecutive deliveries. Find the variance and the standard deviation.

Interval         Frequency
29.50-69.45      5
69.50-89.45      10
89.50-99.45      11
99.50-109.45     19
109.50-119.45    17
119.50-129.45    20
129.50-139.45    12
139.50-149.45    6

Online homework on 2.5-2.8

Test 1 Over Chapter 2.


TI Quick Steps
Obtaining Summary Measures

Step 1 Clear data.

Step 2 Enter data to be summarized.

Step 3 Obtain the summary measures for the data in L1.
Summary measures are obtained by requesting the 1-Var Stats from under the STAT CALC menu list. The sequence of buttons is as follows:

The 1-Var Stats are now displayed in the window. Notice that both the sample standard deviation s and the population standard deviation σ are provided; which one to use depends on whether the values in L1 are a sample or the entire population. The only mean provided is x̄, but this would be the population mean μ if the data were the entire population. To find more information, in particular the five-number summary, press the down arrow button.
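For readers without a TI calculator, a comparable summary can be sketched with Python's standard `statistics` module. This is only an analogue of the 1-Var Stats display, not a reproduction of it: `statistics.stdev` divides by n - 1 (the sample standard deviation s) and `statistics.pstdev` divides by n (the population standard deviation σ). The data used here are the weight loss values from the earlier example.

```python
import math
import statistics

# Weight loss data from the earlier example, standing in for list L1.
data = [1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5, -23]

n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)       # sample standard deviation (divides by n - 1)
sigma = statistics.pstdev(data)  # population standard deviation (divides by n)

print(f"n = {n}, mean = {mean:.4f}")
print(f"s = {s:.4f}, sigma = {sigma:.4f}")
```

The two standard deviations differ only by the factor sqrt((n - 1)/n), so σ is always slightly smaller than s for the same list of values.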


Producing a Boxplot

Step 1 Clear data and plots

Step 2 Enter data to be plotted

Step 3 Set the STAT PLOT options for a boxplot.
Finally, set the stat plot options for producing a boxplot of the data in L1 as Plot 1.

The sequence of steps is as follows:

Press the ZOOM button and then “9” to have the boxplot displayed. Use the TRACE button and the right and left arrow keys to see values for the five-number summary. Note that the modified boxplot type is the 4th graph icon in the Type list.