chapter 5 exploring data: distributions - seongchun kwon's …skwon.org/statistics.pdf · ·...

Chapter 5 Exploring Data: Distributions

5.1. Displaying Distributions: Histograms

Learning Objective

• How to read statistical data?

• How to visualize statistical data?

How to draw a histogram (bar graph)

A small part of a data set collected from the students in a large statistics class by anonymous responses to a class questionnaire.

Notation

• First Column: Student number 2 to 8

• Column A: Sex- female or male

• Column B: right-handedness or left-handedness

• Column C: height in inches

• Column D: time spent in studying in minutes per weeknight

• Column E: number of coins each student is carrying

How can we read the data in the table?

• What is the height of the student number 2?

• Who spends the least amount of time on studying per weeknight?

• How many students use right hand?

• How many students study at least 2 hours per weeknight?

• What does each row represent?

• What does each column represent?

Terminology

• A variable is any characteristic of an individual.

• Individuals are the objects described by a set of data.

Variables, Individuals

• What are variables in the table?

• How many individuals are we considering?

The distribution of a variable gives information (as a table, graph, or formula) about how often the variable takes certain values or intervals of values.

Examples of Frequency Distributions

• Distribution of Sex

• Distribution of Coins

Value F M

Frequency

relative frequency distribution

The relative frequency distribution of a variable states all observed values of the variable and what fraction (or percentage) of the time each value occurs.

$$\text{Relative frequency} := \frac{\text{Frequency}}{\text{Total number of individuals}}$$

relative frequency distribution

• Conversion of decimal to percent and percent to decimal

• Relative frequency distribution of Sex

• Relative frequency distribution of Coins
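The relative frequency formula can be sketched in a few lines of Python. The seven responses below are hypothetical stand-ins (the class table itself is not reproduced here), and the function name `relative_frequencies` is only illustrative.

```python
from collections import Counter

# Hypothetical responses for the variable Sex (not the actual class data).
sex = ["F", "M", "F", "F", "M", "F", "M"]

def relative_frequencies(values):
    """Map each observed value to frequency / total number of individuals."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

rel = relative_frequencies(sex)
for value, fraction in sorted(rel.items()):
    # Show each relative frequency as a decimal and as a percent.
    print(f"{value}: {fraction:.3f} = {fraction:.1%}")
```

The relative frequencies always add up to 1 (that is, 100%), which is a quick check on the arithmetic.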

Grouped frequency distribution

If there are many individuals, then it is better to analyze the data based on the grouped frequency distribution.

Remark:

The visualization of a grouped frequency distribution is a histogram (bar graph).

Individuals, Variables

• What are individuals?

• How many variables do we have?

Procedure to make a grouped frequency distribution

1. Find the minimum and maximum data values

2. Group neighboring data values into consecutive non-overlapping intervals: You need to decide the interval length which shows the data effectively.

3. Record the relevant frequencies

Method to record the grouped frequency distribution: table or histogram (bar graph)

Class Count

0.0 to 4.9 27

5.0 to 9.9 13

10.0 to 14.9 2

15.0 to 19.9 4

20.0 to 24.9 0

25.0 to 29.9 1

30.0 to 34.9 2

35.0 to 39.9 0

40.0 to 44.9 1

• In the above table, the data values range from 0.7 to 42.1. In this case, it is convenient to group the data values into classes covering 0 to 45, each of length 5.

• Data values are recorded to one decimal place. This affects how the class boundaries are written in the table (0.0 to 4.9, 5.0 to 9.9, and so on), rather than the histogram.

Histogram: What is the interpretation of the bar a, b and c?

Miss Smith's Math class has just taken a test. In order to come up with meaningful grades, Miss Smith will make a histogram to represent the distribution of grades.

Data

Student Grade

Bullwinkle 84

Rocky 91

Bugs 75

Daffy 68

Wylie 98

Mickey 78

Minnie 77

Lucy 86

Linus 94

Asterix 64

Obelix 59

Donald 54

Sam 89

Taz 76

1. What is the highest score?

2. What is the lowest score?

3. What does the horizontal axis in the histogram represent?

4. What does the vertical axis in the histogram represent?

5. Fill in the blanks in the table with five bins and draw the histogram.

Class Count

50 - 59

60 - 69

70 - 79

80 – 89

90 - 99

6. Construct the table with 10 bins and draw the histogram.

7. Which histogram shows students’ overall academic performance in a better way? Why?
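The grouping procedure from this section can be sketched in Python using the grades listed above. The function name `grouped_frequency` is illustrative, and reading "10 bins" as classes of width 5 (50 to 54, 55 to 59, and so on) is an assumption.

```python
# Grades from Miss Smith's class, in the order listed in the data table.
grades = [84, 91, 75, 68, 98, 78, 77, 86, 94, 64, 59, 54, 89, 76]

def grouped_frequency(values, start, width, n_bins):
    """Count values in consecutive non-overlapping classes.

    Class k covers start + k*width up to, but not including,
    start + (k+1)*width.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

five_bins = grouped_frequency(grades, start=50, width=10, n_bins=5)
# Assumption: ten bins means classes of width 5 over the same range.
ten_bins = grouped_frequency(grades, start=50, width=5, n_bins=10)

# A rough text histogram for the five-bin table.
for k, count in enumerate(five_bins):
    low = 50 + 10 * k
    print(f"{low} - {low + 9}: {'*' * count} ({count})")
```

Comparing `five_bins` with `ten_bins` shows how the choice of interval length changes how much detail the histogram reveals.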


5.2. Interpreting Histograms

Global shape of histograms: Determining the global shape of a histogram can be somewhat subjective, although sometimes there is a clear pattern.

Skewed to the left or negatively skewed: the longer tail of the histogram is on the left side

Skewed to the right or positively skewed: the longer tail of the histogram is on the right side

Symmetric: the right and left sides of the histogram are approximately mirror images of each other

Non-specific

Outlier: an individual value (observation) that falls outside the overall pattern. It may reveal a particular aspect of the data, or it may suggest an error in recording.

Important aspects to discuss when interpreting a histogram

outliers, shape (peaks, skewed distribution, symmetric distribution), center (next section), spread (next section)

Example (Percent of Adult Population of Hispanic Origin by State in 2000 Census revisited)

Class Count

0.0 to 4.9 27

5.0 to 9.9 13

10.0 to 14.9 2

15.0 to 19.9 4

20.0 to 24.9 0

25.0 to 29.9 1

30.0 to 34.9 2

35.0 to 39.9 0

40.0 to 44.9 1

How many peaks are in the graph?



What is the pattern of the graph? Skewed to the right? Skewed to the left? Or Symmetric?



What are outliers?

5.4. Describing Center: Mean and Median

Terminology

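• Mean: average

$$\text{mean} = \frac{\text{Sum of all data values}}{\text{Number of data values}}, \qquad \bar{x} = \frac{x_1 + \cdots + x_n}{n}$$

• Median: the middle number in the ordered list

$$\text{Median} = \text{the } \tfrac{1}{2}(n+1)\text{th value in the ordered list}$$

When n is even, $\tfrac{1}{2}(n+1)$ falls halfway between two positions, and the median is the average of the two middle values.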

• Mode: the number that is repeated more often than any other. If no number is repeated, then there is no mode for the list.

To find the mean, median and mode, we need the individual data values, because the calculations use the actual values. Thus, a histogram doesn’t give enough information to compute the mean and median exactly. However, the mean, median and mode do give some information about the shape of the histogram.

Example

Find the mean, median, and mode for the following list of values:

13, 18, 13, 14, 13, 16, 14, 21, 13
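As a way to check hand calculations for this list, here is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(values)      # sum of the values divided by how many there are
median = statistics.median(values)  # middle value of the sorted list
mode = statistics.mode(values)      # value that is repeated most often

print("sorted:", sorted(values))
print("mean:", mean, "median:", median, "mode:", mode)
```

Note that the mean here is not one of the values in the original list.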

Remark

Mean and median don’t have to be a value from the original list.

Remark

Half of the values in the data set lie below the median and half lie above the median.

Remark

The median is the most commonly quoted figure used to measure property prices. The use of the median avoids the problem of the mean property price which is affected by a few expensive properties that are not representative of the general property market.

Example

Find the mean, median, and mode for the following list of values:

8, 9, 10, 10, 10, 11, 11, 11, 12, 13

Group Discussion

The marks of nine students in a physics test that had a maximum possible mark of 50 are given below:

50 35 37 32 38 39 36 34 35

Find the mean, median and mode of this set of data values. Round to the nearest tenth.

Group Discussion

We have a distribution which is skewed to the left.

1. Draw any histogram which is skewed to the left. You don’t have to mark any values.

2. List the median, mean and mode in order from the least to the greatest. Explain why you think so.

5.5. Describing Spread: The Quartiles

Remark:

• The mean can be strongly affected by outliers. For example, several very expensive houses in Barbourville can raise the average value of houses in Barbourville.

• The median is hardly affected by outliers.

• Comparing the mean and the median tells us something about the distribution. To describe the spread more clearly, we consider the range and the quartiles.

Terminology:

• Range:=largest value – smallest value

Quartile

• first quartile (designated Q1) = lower quartile = splits lowest 25% of data = 25th percentile

• second quartile (designated Q2) = median M = cuts data set in half = 50th percentile

• third quartile (designated Q3) = upper quartile = splits highest 25% of data, or lowest 75% = 75th percentile

Remark:

The method of calculation differs slightly depending on whether the data set has an even or odd number of values.

Example 1:(even number of data)

After sorting, the city mileages of the 12 gasoline-powered midsized cars are:

15, 16, 18, 19, 20, 20, 21, 21, 21, 22, 24, 27

1. Find the range.

2. Find the first quartile, median, third quartile.

Example 2 (odd number of data)

After sorting, the city mileages of 13 midsized cars (the 12 gasoline-powered cars above and one additional car) are:

15, 16, 18, 19, 20, 20, 21, 21, 21, 22, 24, 27, 48

1. Find the range.

2. Find the first quartile, median, third quartile.
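The two examples can be checked with a short Python sketch. It assumes one common textbook convention: Q1 is the median of the values below the position of the median and Q3 is the median of the values above it, so that for an odd number of values the median itself is left out of both halves. Other conventions exist and can give slightly different quartiles. The function names are illustrative.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n, skip the median itself when forming the upper half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

even_data = [15, 16, 18, 19, 20, 20, 21, 21, 21, 22, 24, 27]
odd_data = even_data + [48]

for data in (even_data, odd_data):
    q1, m, q3 = quartiles(data)
    print("range:", max(data) - min(data), "Q1:", q1, "M:", m, "Q3:", q3)
```

Adding the single value 48 changes the range a great deal but moves the quartiles only slightly.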

5.6. The Five-Number Summary and Boxplots

The five-number summary

Minimum Q1 M Q3 Maximum

Boxplot( or Box-and-Whisker diagram)

• Visualization of the five-number summary

• Boxplots can be drawn either horizontally or vertically.

• It helps visualize the rough shape of histogram from the information on spread, skewness, and outliers.

Boxplot

Vertical Visualization Horizontal Visualization

Group Activity

Below are the exam scores of 30 students. Make a boxplot of these data.

24 31 38 49 51 55 56 59 62

63 65 66 69 72 72 74 76 81

84 84 86 86 86 88 88 88 91

91 92 99

Answer
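Before drawing the boxplot, it can help to compute the five-number summary of the exam scores. The sketch below assumes the median-of-halves convention for the quartiles (for an odd number of values the median is excluded from both halves); the function name `five_number_summary` is illustrative.

```python
import statistics

scores = [24, 31, 38, 49, 51, 55, 56, 59, 62,
          63, 65, 66, 69, 72, 72, 74, 76, 81,
          84, 84, 86, 86, 86, 88, 88, 88, 91,
          91, 92, 99]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]        # values below the median position
    upper = data[(n + 1) // 2:]   # values above the median position
    return (data[0],
            statistics.median(lower),
            statistics.median(data),
            statistics.median(upper),
            data[-1])

summary = five_number_summary(scores)
print("Minimum, Q1, M, Q3, Maximum:", summary)
```

The box of the boxplot runs from Q1 to Q3 with a line at M, and the whiskers extend to the minimum and maximum.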

Group Activity

Below are the ages of 30 people who died in a city hospital in one month. Make a boxplot of these data.

7 22 25 31 37 38 41 48 49

50 55 58 62 62 64 65 66 66

72 75 76 76 76 85 86 88 88

88 92 94

Answer

5.7. Describing Spread: The Standard Deviation

Word meaning of deviation

Deviate: to turn aside or move away from what is considered a correct or normal course, standard of behavior, way of thinking, etc (deviated, deviating)

Survey on the quality of a restaurant

[Figure: bar chart of the survey responses; vertical axis from 0 to 120; categories Terrible(1), Poor(2), Average(3), Very Good(4), Excellent(5)]

Ratings: Terrible(1), Poor(2), Average(3), Very Good(4), Excellent(5)

Food: 1, 99

Service: 5, 15, 30, 35, 15

Value: 25, 75

Atmosphere: 10, 90

Terminology

(Sample) standard deviation s (of n observations x1, …, xn): it measures the standard, or average, amount of deviation of the observations from their mean $\bar{x}$.

Formula:

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$

Remark

• Standard deviation is also denoted by σ.

• Standard deviation is zero only when there is no spread.

• Standard deviation is more sensitive to outliers than mean. If there are some outliers or a distribution is a strongly skewed distribution, then a standard deviation doesn’t give much information about the spread of a distribution.

In the following figures, the standard deviation of a is bigger than the standard deviation of b. However, the standard deviation of c may be bigger than that of a because of outliers.

Calculation of Standard deviation

• Calculate the mean.

• Write the list of deviations, where deviation := observation value − mean.

• Write the list of the squared deviations.

• Add all values in the list of the squared deviations.

• Divide by n − 1, where n is the number of observations.

• Take the square root of the result.

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$

Example

On six consecutive Sundays, a tow-truck operator received 9, 7, 11, 10, 13, 7 service calls. Calculate s.

• Calculate the mean.

• Use

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
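The calculation for the tow-truck data can be traced step by step in Python. This is a minimal sketch of the sample standard deviation (dividing by n − 1 and then taking the square root); the function name `sample_std` is illustrative.

```python
import math

calls = [9, 7, 11, 10, 13, 7]

def sample_std(values):
    """Sample standard deviation, following the steps listed above."""
    n = len(values)
    mean = sum(values) / n                       # calculate the mean
    deviations = [x - mean for x in values]      # observation value - mean
    squared = [d ** 2 for d in deviations]       # squared deviations
    variance = sum(squared) / (n - 1)            # add them up, divide by n - 1
    return math.sqrt(variance)                   # take the square root

mean_calls = sum(calls) / len(calls)
s = sample_std(calls)
print("mean:", mean_calls)
print("s:", round(s, 3))
```

Here the squared deviations add up to 27.5, so s is the square root of 27.5 / 5 = 5.5.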

Group Activity

Find the mean and then, standard deviation for the following data series:

12, 6, 7, 3, 15, 10, 18, 5

5.8. Normal distribution

Normal distribution(bell-shaped distribution): distribution whose shape is described by a normal curve

Normal curve

• smoothed-out histogram. Normal curve is symmetric with the same mean, median and mode

• The area under the curve is exactly 1.

Normal curve

The area under the curve between two vertical lines = the proportion (%) of all values of the variable that lie in that interval.

Examples of normal distributions

heights of men or women, blood pressure, marks on a standardized test such as SAT

Remark: When you say “That man is tall” or “That man has normal height”, you are speaking in a rough statistical sense. Relate this to the figure above.

Learning Objective of the remaining section

• Understanding the geometric meaning of the standard deviation in a normal curve.

• Use the standard deviation to obtain the first quartile(25%) and the third quartile(75%) in a normal curve

Concave Up and Down

Concave up

Concave down

Change of Curvature

The point where the curve changes its concavity.

Concave down to concave up

Or

Concave up to concave down

Standard deviation and Change of Curvature in normal curve

(Geometric Meaning)

Standard deviation = distance from the center to the change-of-curvature points on either side

Standard Deviation and Quartiles in Normal Distribution

(Quantitative meaning of standard deviation)

• First quartile= mean − (0.67 X standard deviation)

• Third quartile = mean + (0.67 X standard deviation)

Example

The distribution of heights of American women aged 18-24 is approximately normal with mean 64.5 inches and standard deviation 2.5 inches.

a. What is the first quartile? What is the interpretation of this?

b. What is the third quartile?

c. Between what two values do the middle 50% of heights lie?

Example

The scores of students on a standardized test form a normal distribution with a mean score of 500 and a standard deviation of 100. Between what two values do the middle 50% of scores lie?

Example

The distribution of the scores on a standardized exam is approximately normal with mean 250 and standard deviation 20. Between what two values do the middle 50% of scores lie?

Summary of Normal Curve

• Area under the normal curve= 1

• Middle 50% = from the first quartile to the third quartile

• First quartile=mean-(0.67X standard deviation)

• Third quartile= mean+(0.67X standard deviation)
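The quartile rule in this summary can be sketched as a small Python function. The factor 0.67 is the approximation used in this section, and the function name `normal_quartiles` is illustrative. The heights example (mean 64.5 inches, standard deviation 2.5 inches) is used as input.

```python
def normal_quartiles(mean, sd, factor=0.67):
    """Approximate first and third quartiles of a normal distribution.

    Uses Q1 = mean - 0.67*sd and Q3 = mean + 0.67*sd, as in the summary above.
    """
    return mean - factor * sd, mean + factor * sd

q1, q3 = normal_quartiles(64.5, 2.5)
print(f"Q1 = {q1:.3f} inches, Q3 = {q3:.3f} inches")
print(f"The middle 50% of heights lie between {q1:.3f} and {q3:.3f} inches")
```

The same function applies to the test-score examples by changing the mean and standard deviation.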

5.9. The 68-95-99.7 Rule

If you use the 68-95-99.7 rule, then you can interpret the data more effectively. The following figure illustrates the 68-95-99.7 rule when the mean is 0 and the standard deviation is 1.

Normal Distributions 68-95-99.7 Rule

• 68% of the observations fall within 1 standard deviation of the mean.

• 95% of the observations fall within 2 standard deviations of the mean.

• 99.7% of the observations fall within 3 standard deviations of the mean.

Example 1 ( Heights of American Women)

The distribution of heights of American women aged 18-24 is approximately normal with mean 64.5 inches and standard deviation 2.5 inches. Use the 68-95-99.7 rule to interpret the data.

Calculate the intervals of 1, 2 and 3 standard deviations:

• The interval of 1 standard deviation: [64.5 − 2.5, 64.5 + 2.5] = [62, 67]

• The interval of 2 standard deviations: [64.5 − 2 × 2.5, 64.5 + 2 × 2.5] = [59.5, 69.5]

• The interval of 3 standard deviations: [64.5 − 3 × 2.5, 64.5 + 3 × 2.5] = [57, 72]

Apply the 68-95-99.7 rule.

• About 68% of young women are between 62 and 67 inches tall.

• About 95% of young women are between 59.5 and 69.5 inches tall.

• About 99.7% of young women are between 57 and 72 inches tall.

• About 2.5% of young women are taller than 69.5 inches.
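The interval calculations for the heights example can be reproduced with a short Python sketch. The function name `rule_intervals` is illustrative; the percentages are the approximate values from the 68-95-99.7 rule, not exact normal probabilities.

```python
def rule_intervals(mean, sd):
    """Intervals within 1, 2 and 3 standard deviations of the mean,
    keyed by the approximate percentage of observations they contain."""
    return {pct: (mean - k * sd, mean + k * sd)
            for k, pct in ((1, 68), (2, 95), (3, 99.7))}

intervals = rule_intervals(64.5, 2.5)
for pct, (low, high) in intervals.items():
    print(f"About {pct}% of heights are between {low} and {high} inches")

# By symmetry, half of the remaining 5% lies above the 2-standard-deviation interval.
above_two_sd = (100 - 95) / 2
print(f"About {above_two_sd}% of heights are above {intervals[95][1]} inches")
```

The symmetry step is the one used for tail questions: the percentage outside an interval is split equally between the two tails.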

Example

The scores of students on a standardized test form a normal distribution with a mean of 400 and a standard deviation of 30. One thousand students took the test. Find the number of students who score above 460.

Example

The distribution of the scores on a standardized exam is approximately normal with mean 100 and standard deviation 15. What percentage of scores lie between 115 and 130?

Chapter 7 Data for Decisions

Sampling Terminology and Examples

• Population: The population is the entire group from which you are getting information.

Example: The population for a study of childhood cancer in the USA is all childhood cancer patients in the USA.

• Sample: A sample is used when data are collected from only part of the population. This sample must be representative of the population. Valid conclusions are obtained when the sample results represent those of the population.

Example: A sample might be childhood cancer patients in the largest children’s hospital in each state.

• Census: When a census is conducted, data are collected from the entire population.

Potential Problem

Problems may arise if a person does not consider bias, use of language, ethics, cost and time, timing, privacy, and cultural sensitivity.

Potential Problem: what it means and an example

• Bias

What it means: The question influences responses in favor of, or against, the topic of the data collection.

Example: Suppose a person asks: ‘Don’t you think calories of McDonald’s foods are too high?’ This person has a bias against the calories of McDonald’s foods. The bias influences how the survey questions are written.

• Use of Language

What it means: The use of language in a question could lead people to give a particular answer.

Example: ‘Don’t you think calories of McDonald’s foods are too high?’ may lead people to answer yes. A better question would be ‘Do you think calories of McDonald’s foods are too low, low, medium, high, too high?’

• Timing

What it means: When the data are collected could lead to particular results.

Example: A survey is conducted to find opinions on the need for winter tires. The answers may vary depending on whether Barbourville has had a lot of snow or not.

• Privacy

What it means: If the topic of the data collection is personal, a person may not want to participate or may give an untrue answer on purpose. Anonymous surveys may help.

Example: Suppose you are a grade 9 teacher and plan to conduct a survey about smoking in the classes you teach. Students who smoke may be afraid of punishment and may try to avoid participating in the survey.

• Cultural Sensitivity

What it means: Cultural sensitivity means you are aware of other cultures. You must avoid being offensive and asking questions that do not apply to that culture.

Example (of an insensitive question): You go to a Muslim community and survey their favorite method of cooking pork. For example, ‘Circle your favorite method of cooking pork: Fry, Barbeque, Bake.’

• Ethics

What it means: Ethics dictate that the collected data must not be used for purposes other than those told to the participants. Otherwise, your actions are considered unethical.

Example: Suppose you tell your classmates that you want to know their favorite snacks to help you plan your birthday party. If you use that information to sell favorite snacks to your classmates, then it is unethical.

• Cost

What it means: The cost of collecting data must be taken into account.

Example: Printing questionnaires; paying people to collect data.

• Time

What it means: The time needed for collecting data must be considered.

Example: A survey that takes an hour to complete may be too long. This will limit the number of people who are willing to participate.

Inferences

Learning Objective

Statistical Inference: How do we generalize the collected data from samples?

• Parameter vs. Statistic

• Confidence interval: related to 68-95-99.7 rule

• Central limit theorem: related to the sizes of each sample

• Law of large numbers: related to the number of samplings (experiments)

A survey question:

‘I like buying new clothes, but shopping is often frustrating and time-consuming.’

Circle your opinion:

Agree Disagree

Sampling method: nationwide random selection of 2500 adults

Terminology, Meaning, Example

• Statistical inference

Meaning: Methods for drawing conclusions about the entire population on the basis of data from a sample.

Example: Drawing conclusions about an entire population of 230 million American adults.

Terminology, Meaning, Example

• Parameter (notation: p)

Meaning: A fixed (usually unknown) number that describes a population, such as a proportion, mean or standard deviation.

Example: Suppose that 60% of the entire American population agreed. Then 0.6, or 60%, is the parameter.

• Statistic (notation: $\hat{p}$)

Meaning: A number that describes a sample. It is known for the sample we take, varies from sample to sample, and is useful for estimating an unknown parameter.

Example: Suppose that 1650 adults from the random sample of 2500 adults answered that they agree. Then the statistic is the proportion 0.66 (or 66%, from 1650/2500).

To draw a conclusion for the entire American adult population, we take the following steps:

Steps, Terminology, Property, each with an Example

• Simulation: Drawing many samples at random from a population that we specify. Assumption in SRS: all possible samples of n objects are equally likely to occur.

Example: SRS (simple random sampling): Draw 1000 separate samples of size 100 from a population that we suppose has parameter value p = 0.6, using a computer program.


• Step: Take a large number of random samples from the same population.

Example: Draw 1000 separate samples of size 100 from a population that we suppose has parameter value p = 0.6.

• Step: Calculate the sample proportion $\hat{p}$ for each sample.

Example:

$$\hat{p} = \frac{\text{Count of successes in the sample}}{\text{size of sample}} = \frac{\text{\# of Agree}}{100}$$


• Step: Make a histogram of the values of $\hat{p}$.


• Step: Examine the distribution displayed in the histogram for shape, center, and spread, as well as outliers or other deviations.

Example: When we analyze the curve, we get the following information. Center and spread are based on the actual calculation with the specific sampling (1000 separate samples of size 100).

Shape: the sampling distribution of $\hat{p}$ is approximately normal because each sample has size 100.

Center: 0.598

Spread: Standard deviation = 0.051
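The simulation steps above can be sketched in Python with the standard `random` module. This is only an illustration under stated assumptions: each individual in a sample independently agrees with probability p = 0.6 (a reasonable model for an SRS from a very large population), and the seed value and the function name `simulate_sample_proportions` are arbitrary choices. The center and spread it prints will be close to, but not identical to, the 0.598 and 0.051 reported above, because a different set of random samples is drawn.

```python
import random
import statistics

def simulate_sample_proportions(p, n, n_samples, seed=0):
    """Draw n_samples samples of size n and return each sample proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(n_samples):
        # Count the successes ("Agree") in one sample of size n.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

p_hats = simulate_sample_proportions(p=0.6, n=100, n_samples=1000)
center = statistics.mean(p_hats)
spread = statistics.stdev(p_hats)

print("number of samples:", len(p_hats))
print("center (mean of the sample proportions):", round(center, 3))
print("spread (standard deviation):", round(spread, 3))
```

A histogram of `p_hats` would show the approximately normal shape described in the example.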

Theorem: (Sampling Distribution of a Sample Proportion)

Assumption: Choose an SRS of size n from a large population that contains population proportion p of successes.

• Shape: For large sample sizes ($n \ge 30$), the sampling distribution of $\hat{p}$ is approximately normal. (Central limit theorem)

• Center: The mean of the sampling distribution of $\hat{p}$ = the parameter p.

• Spread: Standard deviation of the sampling distribution of $\hat{p}$ = $\sqrt{\dfrac{p(1-p)}{n}}$

Theorem: Sampling Distribution of a Sample Proportion, with Example (Continued)

• Assumption: Choose an SRS of size n from a large population that contains population proportion p of successes.

Example: n = 100. We took a simple random sample from a population with proportion p = 0.6 of successes.

• Shape: For large sample sizes ($n \ge 30$), the sampling distribution of $\hat{p}$ is approximately normal. (Central limit theorem)

Example: The sampling distribution of $\hat{p}$ is approximately normal because each sample has size 100.

• Center: The mean of the sampling distribution of $\hat{p}$ = the parameter p.

Example: Mean of the sampling distribution of $\hat{p}$ = 0.598 ≈ 0.6 = p (the parameter p).

• Spread: Standard deviation of the sampling distribution of $\hat{p}$ = $\sqrt{\dfrac{p(1-p)}{n}}$

Example: Standard deviation of the sampling distribution of $\hat{p}$ = $\sqrt{\dfrac{0.6(1-0.6)}{100}} = \sqrt{0.0024} \approx 0.04899$
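The spread formula from the theorem is easy to evaluate directly. The sketch below uses the values from the example (p = 0.6, n = 100); the function name `sample_proportion_sd` is illustrative, and the second call with n = 400 is an added illustration of how the spread shrinks as the sample size grows.

```python
import math

def sample_proportion_sd(p, n):
    """Standard deviation of the sampling distribution of the sample proportion."""
    return math.sqrt(p * (1 - p) / n)

sd_100 = sample_proportion_sd(0.6, 100)
sd_400 = sample_proportion_sd(0.6, 400)  # four times the sample size

print("n = 100:", round(sd_100, 5))
print("n = 400:", round(sd_400, 5))
```

Because n sits under the square root, quadrupling the sample size only halves the standard deviation.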

Terminology and Example

$$\text{Margin of error} = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Meaning: We are 95% confident that the true p lies within the range

$$\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Why do we use a margin of error? In real life, we don’t know what the true parameter p is. Thus, we want to draw a conclusion about the true parameter p from the chosen simple random sample.

Example: With $\hat{p} = 0.598$ and n = 100 from the sampling we have been considering, the margin of error is

$$2\sqrt{\frac{0.598(1 - 0.598)}{100}} \approx 0.098$$

We are 95% confident that the true p lies within 0.598 ± 0.098, that is, from 0.500 to 0.696.

Remark: We can see that the true parameter p (= 0.6) is in the range from 0.500 to 0.696.
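The margin-of-error formula can also be applied to the clothes-shopping survey from earlier in this chapter, where 1650 of 2500 sampled adults agreed. The following is a minimal sketch; the function names are illustrative, and the factor 2 is the approximation that comes from the 68-95-99.7 rule.

```python
import math

def margin_of_error(p_hat, n):
    """Approximate 95% margin of error: 2 * sqrt(p_hat * (1 - p_hat) / n)."""
    return 2 * math.sqrt(p_hat * (1 - p_hat) / n)

def confidence_interval(p_hat, n):
    """Approximate 95% confidence interval for the true proportion p."""
    m = margin_of_error(p_hat, n)
    return p_hat - m, p_hat + m

p_hat = 1650 / 2500  # sample proportion from the survey example
margin = margin_of_error(p_hat, 2500)
low, high = confidence_interval(p_hat, 2500)

print(f"sample proportion: {p_hat:.2f}")
print(f"margin of error: {margin:.3f}")
print(f"95% confidence interval: {low:.3f} to {high:.3f}")
```

With a sample of 2500 the margin of error is roughly two percentage points, much smaller than for a sample of size 100.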

Law of large numbers( informal )

If a situation is repeated more and more times, the proportion of successes (i.e., the proportion of times the outcome of interest actually happens) tends to come closer and closer to the actual probability of success.

Law of large numbers(formal)

Observe any random phenomenon having numerical outcomes with finite mean 𝜇. As the random phenomenon is repeated a large number of times,

• The proportion of trials on which each outcome occurs gets closer and closer to the probability of that outcome.

• The mean $\bar{x}$ of the observed values gets closer and closer to $\mu$.
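The law of large numbers can be illustrated with a small simulation. The sketch below assumes a success probability of 0.6 (chosen to match the chapter's running example), a fixed seed, and illustrative names; it records the proportion of successes after 10, 100, 1000, 10000 and 100000 trials.

```python
import random

def running_proportion(p, n_trials, seed=0):
    """Repeat a success/failure trial and record the proportion of
    successes at a few checkpoints along the way."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    checkpoints = (10, 100, 1000, 10_000, 100_000)
    successes = 0
    proportions = {}
    for trial in range(1, n_trials + 1):
        if rng.random() < p:
            successes += 1
        if trial in checkpoints:
            proportions[trial] = successes / trial
    return proportions

props = running_proportion(p=0.6, n_trials=100_000)
for trials, proportion in props.items():
    print(f"after {trials:>6} trials: proportion of successes = {proportion:.4f}")
```

The early proportions can be far from 0.6, while the later ones settle close to it, which is the behavior the law describes.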
