chapter 5 exploring data: distributions - seongchun kwon's …skwon.org/statistics.pdf · ·...
Chapter 5 Exploring Data: Distributions
5.1. Displaying Distributions: Histograms
Learning Objective
• How to read statistical data?
• How to visualize statistical data?
How to draw a histogram (bar graph)
A small part of a data set collected from the students in a large statistics class by anonymous responses to a class
questionnaire.
Notation
• First Column: Student number 2 to 8
• Column A: Sex- female or male
• Column B: right-handedness or left-handedness
• Column C: height in inches
• Column D: time spent in studying in minutes per weeknight
• Column E: number of coins each student is carrying
How can we read the data in the table?
• What is the height of the student number 2?
• Who spends the least amount of time on studying per weeknight?
• How many students use right hand?
• How many students study at least 2 hours per weeknight?
• What does each row represent?
• What does each column represent?
Terminology
• A variable is any characteristic of an individual.
• Individuals are the objects described by a set of data.
Variables, Individuals
• What are variables in the table?
• How many individuals are we considering?
The distribution of a variable gives information
( as a table, graph, or formula) about how often the variable takes certain values or intervals of values.
Examples of Frequency Distributions
• Distribution of Sex
• Distribution of Coins
Value F M
Frequency
relative frequency distribution
The relative frequency distribution of a variable states all observed values of the variable and what fraction (or percentage) of the time each value occurs.
Relative frequency := $\frac{\text{Frequency}}{\text{Total number of individuals}}$
relative frequency distribution
• Conversion of decimal to percent and percent to decimal
• Relative frequency distribution of Sex
• Relative frequency distribution of Coins
Grouped frequency distribution
If there are many individuals, then it is better to analyze the data based on the grouped frequency distribution.
Remark:
The visualization of a grouped frequency distribution is a histogram (bar graph).
Individuals, Variables
• What are individuals?
• How many variables do we have?
Procedure to make a grouped frequency distribution
1. Find the minimum and maximum data values
2. Group neighboring data values into consecutive non-overlapping intervals: You need to decide the interval length which shows the data effectively.
3. Record the relevant frequencies
Method to record the grouped frequency distribution: table or histogram (bar graph)
Class Count
0.0 to 4.9 27
5.0 to 9.9 13
10.0 to 14.9 2
15.0 to 19.9 4
20.0 to 24.9 0
25.0 to 29.9 1
30.0 to 34.9 2
35.0 to 39.9 0
40.0 to 44.9 1
• In the above table, data values range between 0.7 and 42.1. In this case, it is convenient to group the data values into classes of length 5 that cover the range from 0 to 45.
• Data values are recorded to one decimal place. This affects how the classes are written in the table (0.0 to 4.9, 5.0 to 9.9, and so on), rather than how the histogram is drawn.
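As a small illustration of the relative frequency formula, the sketch below converts the class counts in the table above into relative frequencies. The variable names are chosen here for illustration only and are not part of the original slides.

```python
# Class labels and counts copied from the grouped frequency table above.
classes = ["0.0 to 4.9", "5.0 to 9.9", "10.0 to 14.9", "15.0 to 19.9",
           "20.0 to 24.9", "25.0 to 29.9", "30.0 to 34.9", "35.0 to 39.9",
           "40.0 to 44.9"]
counts = [27, 13, 2, 4, 0, 1, 2, 0, 1]

total = sum(counts)  # total number of individuals

# Relative frequency = frequency / total number of individuals.
relative = [c / total for c in counts]

for label, c, r in zip(classes, counts, relative):
    print(f"{label:>13}  count={c:2d}  relative frequency={r:.2f} ({r:.0%})")
```

The relative frequencies add up to 1 (100%), which is a quick check that no class was missed.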
Histogram: What is the interpretation of the bar a, b and c?
Miss Smith's Math class has just taken a test. In order to come up with meaningful grades, Miss Smith will make a histogram to represent the
distribution of grades.
Data
Student Grade
Bullwinkle 84
Rocky 91
Bugs 75
Daffy 68
Wylie 98
Mickey 78
Minnie 77
Lucy 86
Linus 94
Asterix 64
Obelix 59
Donald 54
Sam 89
Taz 76
1. What is the highest score?
2. What is the lowest score?
3. What does the horizontal axis in the histogram represent?
4. What does the vertical axis in the histogram represent?
5. Fill the blank about the table with five bins and draw the histogram.
Class Count
50 - 59
60 - 69
70 - 79
80 – 89
90 - 99
6. Construct the table with 10 bins and draw the histogram.
7. Which histogram shows students’ overall academic performance in a better way? Why?
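The grouping procedure can be sketched in a few lines of Python, here applied to Miss Smith's grades. The function name `grouped_counts` is hypothetical, and the choice of ten classes of width 5 starting at 50 is only one possible reading of "10 bins".

```python
# Grades copied from the data table above.
grades = [84, 91, 75, 68, 98, 78, 77, 86, 94, 64, 59, 54, 89, 76]

def grouped_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins consecutive classes."""
    counts = [0] * n_bins
    for v in values:
        index = (v - start) // width  # position of the class containing v
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Five classes of width 10: 50-59, 60-69, ..., 90-99.
five_bins = grouped_counts(grades, start=50, width=10, n_bins=5)
# Ten classes of width 5: 50-54, 55-59, ..., 95-99 (an assumed choice).
ten_bins = grouped_counts(grades, start=50, width=5, n_bins=10)

# A rough text histogram for the five-class table.
for i, c in enumerate(five_bins):
    low = 50 + 10 * i
    print(f"{low} - {low + 9}: {'#' * c} ({c})")
```

Comparing `five_bins` with `ten_bins` shows how the interval length changes the picture: narrower classes give more detail but also more empty or nearly empty bars.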
5.2. Interpreting Histograms
Global shape of histograms: Determining the global shape of a histogram may be somewhat subjective, although sometimes there is a clear pattern.
Skewed to the left or negatively skewed: the longer tail of the histogram is on the left side
Skewed to the right or positively skewed: the longer tail of the histogram is on the right side
Symmetric: the right and left sides of the histogram are approximately mirror images of each other
Non-specific
Outlier: an individual value (observation) that falls outside the overall pattern. It may reveal a particular aspect of the data, or it may suggest that there was an error in recording.
Important aspects to be discussed in interpreting histogram
outliers, shape (peaks, skewed distribution, symmetric distribution), center( next section ), spread( next section )
Example (Percent of Adult Population of Hispanic Origin by State in 2000 Census revisited)
Class Count
0.0 to 4.9 27
5.0 to 9.9 13
10.0 to 14.9 2
15.0 to 19.9 4
20.0 to 24.9 0
25.0 to 29.9 1
30.0 to 34.9 2
35.0 to 39.9 0
40.0 to 44.9 1
How many peaks are in the graph?
What is the pattern of the graph? Skewed to the right? Skewed to the left? Or Symmetric?
What are outliers?
5.4. Describing Center: Mean and Median
Terminology
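• Mean: average
mean = $\frac{\text{Sum of all data values}}{\text{Number of data values}}$, that is, $\bar{x} = \frac{x_1 + \cdots + x_n}{n}$
• Median: middle number in the ordered list
Median = the $\frac{1}{2}(n + 1)$th value in the ordered list (when $n$ is even, this position falls halfway between the two middle values, so take their average)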
• Mode: the number that is repeated more often than any other. If no number is repeated, then there is no mode for the list.
To find the mean, median and mode, we need the individual data values because the calculations use the actual values. Thus, a histogram alone doesn't give enough information to find the mean and median exactly. However, the mean, median and mode give some information about the shape of the histogram.
Example
Find the mean, median, and mode for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13
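The three measures for this list can be checked with Python's standard `statistics` module. This is only a quick verification sketch; the variable names are illustrative.

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(values)      # sum of all values / number of values
median = statistics.median(values)  # middle value of the ordered list
mode = statistics.mode(values)      # value repeated more often than any other

print("ordered list:", sorted(values))
print("mean =", mean, " median =", median, " mode =", mode)
```

With 9 values, the median is the (9 + 1)/2 = 5th value of the ordered list.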
Remark
Mean and median don’t have to be a value from the original list.
Remark
Half of the values in the data set lie below the median and half lie above the median.
Remark
The median is the most commonly quoted figure used to measure property prices. The use of the median avoids the problem of the mean property price which is affected by a few expensive properties that are not representative of the general property market.
Example
Find the mean, median, and mode for the following list of values:
8, 9, 10, 10, 10, 11, 11, 11, 12, 13
Group Discussion
The marks of nine students in a physics test that had a maximum possible mark of 50 are given below:
50 35 37 32 38 39 36 34 35
Find the mean, median and mode of this set of data values. Round to the nearest tenth.
Group Discussion
We have a distribution which is skewed to the left.
1. Draw any histogram which is skewed to the left. You don’t have to mark any values.
2. List the median, mean and mode in order from the least to the greatest. Explain why you think so.
5.5. Describing Spread: The Quartiles
Remark:
• The mean can be strongly affected by outliers. For example, several very expensive houses in Barbourville can affect the average value of houses in Barbourville.
• The median is not affected much by outliers.
• However, we can infer something by comparing the mean and the median. To describe the spread more clearly, we consider the range and the quartiles.
Terminology:
• Range:=largest value – smallest value
Quartile
• first quartile (designated Q1) = lower quartile = splits lowest 25% of data = 25th percentile
• second quartile (designated Q2) = median M = cuts data set in half = 50th percentile
• third quartile (designated Q3) = upper quartile = splits highest 25% of data, or lowest 75% = 75th percentile
Remark:
The method of calculation is slightly different depending on whether the data set has an even or an odd number of values.
Example 1:(even number of data)
After sorting, the city mileages of the 12 gasoline-powered midsized cars are:
15, 16, 18, 19, 20, 20, 21, 21, 21, 22, 24, 27
1. Find the range.
2. Find the first quartile, median, third quartile.
Example 2 (odd number of data)
After sorting, the city mileages of 13 midsized cars (the 12 cars above plus one more car with mileage 48) are:
15, 16, 18, 19, 20, 20, 21, 21, 21, 22, 24, 27, 48
1. Find the range.
2. Find the first quartile, median, third quartile.
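A minimal sketch of the range and quartile calculations for both examples follows. It assumes one common textbook convention: Q1 is the median of the values below the median position and Q3 is the median of the values above it, with the median itself left out when the number of values is odd. Other conventions exist and can give slightly different quartiles. The function names are hypothetical.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]       # values below the median position
    upper = data[n - half:]   # values above the median position
    return median(lower), median(data), median(upper)

even_data = [15, 16, 18, 19, 20, 20, 21, 21, 21, 22, 24, 27]
odd_data = even_data + [48]

for data in (even_data, odd_data):
    q1, m, q3 = quartiles(data)
    print("n =", len(data), "range =", max(data) - min(data),
          "Q1 =", q1, "M =", m, "Q3 =", q3)
```

Note how the single value 48 changes the range a great deal but moves the median and the quartiles only slightly.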
5.6. The Five-Number Summary and Boxplots
The five-number summary
Minimum Q1 M Q3 Maximum
Boxplot( or Box-and-Whisker diagram)
• Visualization of the five-number summary
• Boxplots can be drawn either horizontally or vertically.
• It helps visualize the rough shape of histogram from the information on spread, skewness, and outliers.
Boxplot
Vertical Visualization Horizontal Visualization
Group Activity
Below are the exam scores of 30 students. Make a boxplot of these data.
24 31 38 49 51 55 56 59 62
63 65 66 69 72 72 74 76 81
84 84 86 86 86 88 88 88 91
91 92 99
Answer
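The five numbers needed to draw the boxplot can be computed as sketched below. The sketch assumes the median-of-halves convention for quartiles (the median is excluded from both halves when the count is odd); `five_number_summary` is a hypothetical helper name.

```python
import statistics

# Exam scores copied from the activity above.
scores = [24, 31, 38, 49, 51, 55, 56, 59, 62,
          63, 65, 66, 69, 72, 72, 74, 76, 81,
          84, 84, 86, 86, 86, 88, 88, 88, 91,
          91, 92, 99]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])              # median of lower half
    q3 = statistics.median(data[len(data) - half:])  # median of upper half
    return data[0], q1, statistics.median(data), q3, data[-1]

summary = five_number_summary(scores)
print("Minimum, Q1, M, Q3, Maximum:", summary)
```

The box of the boxplot runs from Q1 to Q3 with a line at M, and the whiskers extend to the minimum and maximum.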
Group Activity
Below are the ages of 30 people who died in a city hospital in one month. Make a boxplot of these data.
7 22 25 31 37 38 41 48 49
50 55 58 62 62 64 65 66 66
72 75 76 76 76 85 86 88 88
88 92 94
Answer
5.7. Describing Spread: The Standard Deviation
Word meaning of deviation
Deviate: to turn aside or move away from what is considered a correct or normal course, standard of behavior, way of thinking, etc (deviated, deviating)
Survey on the quality of a restaurant
[Figure: bar chart of the survey results; vertical axis from 0 to 120, horizontal axis Terrible(1), Poor(2), Average(3), Very Good(4), Excellent(5)]
Rating: Terrible(1) Poor(2) Average(3) Very Good(4) Excellent(5)
Food 1 99
Service 5 15 30 35 15
Value 25 75
Atmosphere 10 90
Terminology
(Sample) standard deviation $s$ (of $n$ observations $x_1, \ldots, x_n$): It measures the standard or average amount of deviation of the observations from their mean.
Formula:
$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$
Remark
• Standard deviation is also denoted by σ, usually for the standard deviation of a whole population.
• Standard deviation is zero only when there is no spread, that is, when all observations have the same value.
• The standard deviation is even more sensitive to outliers than the mean. If there are outliers or the distribution is strongly skewed, the standard deviation doesn't give much information about the spread of the distribution.
In the following figures, the standard deviation of a is bigger than the standard deviation of b. However, the standard deviation of c may be bigger than that of a because of outliers.
Calculation of Standard deviation
• Calculate the mean.
• Write the list of deviations: deviation := observed value − mean.
• Write the list of the squared deviations.
• Add all values in the list of the squared deviations.
• Divide by n − 1, where n is the number of observations.
• Take the square root of the result.
$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$
Example
On six consecutive Sundays, a tow-truck operator received 9, 7, 11, 10, 13, 7 service calls. Calculate s.
• Calculate the mean.
• Use $s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$
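The calculation for the tow-truck example can be followed step by step in Python. The sketch mirrors the procedure for the sample standard deviation (divide by n − 1, then take the square root) and compares the result with `statistics.stdev` from the standard library. Variable names are illustrative.

```python
import math
import statistics

calls = [9, 7, 11, 10, 13, 7]   # service calls on six Sundays
n = len(calls)

mean = sum(calls) / n                      # step 1: the mean
deviations = [x - mean for x in calls]     # step 2: observation - mean
squared = [d ** 2 for d in deviations]     # step 3: squared deviations
total = sum(squared)                       # step 4: add them up
variance = total / (n - 1)                 # step 5: divide by n - 1
s = math.sqrt(variance)                    # step 6: take the square root

print("mean =", mean)
print("sum of squared deviations =", total)
print("s =", round(s, 3))
print("statistics.stdev gives", round(statistics.stdev(calls), 3))
```

The deviations always add up to zero, which is why they are squared before being added.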
Group Activity
Find the mean and then, standard deviation for the following data series:
12, 6, 7, 3, 15, 10, 18, 5
5.8. Normal distribution
Normal distribution(bell-shaped distribution): distribution whose shape is described by a normal curve
Normal curve
• smoothed-out histogram. Normal curve is symmetric with the same mean, median and mode
• The area under the curve is exactly 1.
Normal curve
The area under the curve between two vertical lines = the proportion (%) of all values of the variable that lie in that interval.
Examples of normal distributions
heights of men or women, blood pressure, marks on a standardized test such as SAT
Remark: If you say “That man is tall” or “That man has normal height”, you are speaking in a rough statistical sense. Relate that to the figure above.
Learning Objective of the remaining section
• Understanding the geometric meaning of the standard deviation in a normal curve.
• Use the standard deviation to obtain the first quartile(25%) and the third quartile(75%) in a normal curve
Concave Up and Down
Concave up
Concave down
Change of Curvature
The point where the curve changes its concavity.
Concave down to concave up
Or
Concave up to concave down
Standard deviation and Change of Curvature in normal curve
(Geometric Meaning)
Standard deviation =
distance from the center to
the change-of-curvature points on either side
Standard Deviation and Quartiles in Normal Distribution
(Quantitative meaning of standard deviation)
• First quartile= mean − (0.67 X standard deviation)
• Third quartile = mean + (0.67 X standard deviation)
Example
The distribution of heights of American women aged 18-24 is approximately normal with mean 64.5 inches and standard deviation 2.5 inches.
a. What is the first quartile? What is the interpretation of this?
b. What is the third quartile?
c. Between what two values do the middle 50% of heights lie?
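The quartile rule for a normal curve can be tried on the heights example. The sketch below uses the slide's rounded factor 0.67 and, as a cross-check that is not part of the slides, compares it with the quartiles from `statistics.NormalDist` in the Python standard library.

```python
from statistics import NormalDist

mean, sd = 64.5, 2.5   # heights of American women aged 18-24, in inches

# Rule from the slides: quartiles are 0.67 standard deviations from the mean.
q1 = mean - 0.67 * sd
q3 = mean + 0.67 * sd

# Cross-check with the normal distribution itself.
heights = NormalDist(mu=mean, sigma=sd)
exact_q1 = heights.inv_cdf(0.25)
exact_q3 = heights.inv_cdf(0.75)

print(f"rule:  Q1 = {q1:.3f}, Q3 = {q3:.3f}")
print(f"exact: Q1 = {exact_q1:.3f}, Q3 = {exact_q3:.3f}")
```

The middle 50% of heights lie between Q1 and Q3; the small difference between the two lines of output comes from rounding the factor to 0.67.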
Example
The scores of students on a standardized test form a normal distribution with a mean score of 500 and a standard deviation of 100. Between what two values do the middle 50% of scores lie?
Example
The distribution of the scores on a standardized exam is approximately normal with mean 250 and standard deviation 20. Between what two values do the middle 50% of scores lie?
Summary of Normal Curve
• Area under the normal curve= 1
• Middle 50% = from the first quartile to the third quartile
• First quartile=mean-(0.67X standard deviation)
• Third quartile= mean+(0.67X standard deviation)
5.9. The 68-95-99.7 Rule
If you use the 68-95-99.7 rule, then you can interpret the data more effectively. The following figure illustrates the 68-95-99.7 rule when the mean is 0 and the standard deviation is 1.
Normal Distributions 68-95-99.7 Rule
• 68% of the observations fall within 1 standard deviation of the mean.
• 95% of the observations fall within 2 standard deviations of the mean.
• 99.7% of the observations fall within 3 standard deviations of the mean.
Example 1 ( Heights of American Women)
The distribution of heights of American women aged 18-24 is approximately normal with mean 64.5 inches and standard deviation 2.5 inches. Use the 68-95-99.7 rule to interpret the data.
Calculate the intervals of 1, 2 and 3 standard deviations:
• The interval of 1 standard deviation:
• The interval of 2 standard deviation:
• The interval of 3 standard deviation:
Example 1 ( Heights of American Women)
• The interval of 1 standard deviation: [64.5 − 2.5, 64.5 + 2.5] = [62, 67]
• The interval of 2 standard deviations: [64.5 − 2 × 2.5, 64.5 + 2 × 2.5] = [59.5, 69.5]
• The interval of 3 standard deviations: [64.5 − 3 × 2.5, 64.5 + 3 × 2.5] = [57, 72]
Apply the 68-95-99.7 rule.
• About 68% of young women are between 62 and 67 inches tall.
• About 95% of young women are between 59.5 and 69.5 inches tall.
• About 99.7% of young women are between 57 and 72 inches tall.
• About 2.5 % of young women are taller than 69.5 inches.
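The intervals used in this example follow directly from the mean and the standard deviation, as the short sketch below shows. The function name `rule_intervals` is hypothetical.

```python
def rule_intervals(mean, sd):
    """Intervals of 1, 2 and 3 standard deviations around the mean,
    keyed by the approximate percentage of observations they contain."""
    return {pct: (mean - k * sd, mean + k * sd)
            for k, pct in ((1, 68), (2, 95), (3, 99.7))}

intervals = rule_intervals(mean=64.5, sd=2.5)
for pct, (low, high) in intervals.items():
    print(f"about {pct}% of heights are between {low} and {high} inches")

# The 5% outside the 2-standard-deviation interval splits evenly
# between the two tails because the normal curve is symmetric.
percent_taller_than_69_5 = (100 - 95) / 2
print("percent taller than 69.5 inches:", percent_taller_than_69_5)
```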
Example
The scores of students on a standardized test form a normal distribution with a mean of 400 and a standard deviation of 30. One thousand students took the test. Find the number of students who score above 460.
Example
The distribution of the scores on a standardized exam is approximately normal with mean 100 and standard deviation 15. What percentage of scores lie between 115 and 130?
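Both examples above reduce to arithmetic with the three percentages of the rule and the symmetry of the normal curve. The sketch below is one way to organize that arithmetic; the helper names are hypothetical, and the answers are approximate because the rule itself is approximate.

```python
# Percent of observations within k standard deviations of the mean.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

def percent_above(k):
    """Percent of observations more than k standard deviations above the mean."""
    return (100.0 - WITHIN[k]) / 2

def percent_between(k_low, k_high):
    """Percent between k_low and k_high standard deviations above the mean."""
    return (WITHIN[k_high] - WITHIN[k_low]) / 2

# Mean 400, standard deviation 30: a score of 460 is 2 standard deviations up.
k = (460 - 400) / 30
students_above_460 = 1000 * percent_above(int(k)) / 100

# Mean 100, standard deviation 15: 115 and 130 are 1 and 2 deviations up.
between_115_and_130 = percent_between(1, 2)

print("students above 460:", students_above_460)
print("percent between 115 and 130:", between_115_and_130)
```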
Chapter 7 Data for Decisions
Sampling
Terminology and examples
• Population: The population is the entire group from which you are getting information. Example: The population for a study of childhood cancer in the USA is all childhood cancer patients in the USA.
• Sample: A sample is used when data are collected from only part of the population. This sample must be representative of the population. Valid conclusions are obtained when the sample results represent those of the population. Example: A sample might be the childhood cancer patients in the largest children’s hospital in each state.
• Census: When a census is conducted, data are collected from the entire population.
Potential Problem
Problems may arise if a person does not consider bias, use of language, ethics, cost and time, timing, privacy, cultural sensitivity.
Potential problem: Bias
• What it means: The question influences responses in favor of, or against, the topic of the data collection.
• Example: Suppose a person asks: ‘Don’t you think calories of McDonald’s foods are too high?’ This person has a bias against the calories of McDonald’s foods. The bias influences how the survey questions are written.
Potential problem: Use of language
• What it means: The use of language in a question could lead people to give a particular answer.
• Example: ‘Don’t you think calories of McDonald’s foods are too high?’ may lead people to answer yes. A better question would be ‘Do you think calories of McDonald’s foods are too low, low, medium, high, or too high?’
Potential problem: Timing
• What it means: The time at which the data are collected could lead to particular results.
• Example: A survey is conducted to find opinions on the need for winter tires. The answers may vary depending on whether Barbourville has had a lot of snow or not.
Potential problem: Privacy
• What it means: If the topic of the data collection is personal, a person may not want to participate or may give an untrue answer on purpose. Anonymous surveys may help.
• Example: Suppose you are a grade 9 teacher and plan to conduct a survey about smoking in the classes you are teaching. Students who smoke may be afraid of punishment and may try to avoid participating in the survey.
Potential problem: Cultural sensitivity
• What it means: Cultural sensitivity means you are aware of other cultures. You must avoid being offensive and asking questions that do not apply to that culture.
• Example: You go to a Muslim community and survey their favorite method of cooking pork. For example: ‘Circle your favorite method of cooking pork: Fry, Barbecue, Bake.’
Potential problem: Ethics
• What it means: Ethics dictate that the collected data must not be used for purposes other than those told to the participants. Otherwise, your actions are considered unethical.
• Example: Suppose you tell your classmates that you want to know their favorite snacks to help you plan your birthday party. If you use that information to sell favorite snacks to your classmates, then it is unethical.
Potential problem: Cost
• What it means: The cost of collecting data must be taken into account.
• Example: Printing questionnaires, paying people to collect data.
Potential problem: Time
• What it means: The time needed for collecting data must be considered.
• Example: A survey that takes an hour to complete may be too long. This will limit the number of people who are willing to participate.
Inferences
Learning Objective
Statistical Inference: How do we generalize the collected data from samples?
• Parameter vs. Statistic
• Confidence interval: related to 68-95-99.7 rule
• Central limit theorem: related to the sizes of each sample
• Law of large numbers: related to the number of samplings (experiments)
A survey question:
‘I like buying new clothes, but shopping is often frustrating and time-consuming.’
Circle your opinion:
Agree disagree
Sampling method: nationwide random selection of 2500 adults
Terminology: Statistical inference
• Meaning: Methods for drawing conclusions about the entire population on the basis of data from a sample.
• Example: Drawing conclusions about an entire population of 230 million American adults.
Terminology: Parameter (notation: $p$)
• Meaning: A fixed (usually unknown) number that describes a population, such as a proportion, mean or standard deviation.
• Example: Suppose that 60% of the entire American population agreed. Then 0.6, or 60%, is the parameter.
Terminology: Statistic (notation: $\hat{p}$)
• Meaning: A number that describes a sample. It is known for the sample we take, varies from sample to sample, and is useful for estimating an unknown parameter.
• Example: Suppose that 1650 adults from the random sample of 2500 adults answered that they agree. Then the statistic is the proportion $\hat{p} = \frac{1650}{2500} = 0.66$ (or 66%).
To draw a conclusion for the entire American adult population, we take the following steps:
Steps, Terminology, Property and Example
• Simulation: Drawing many samples at random from a population that we specify. Assumption in SRS (simple random sampling): all possible samples of n objects are equally likely to occur.
Example: Use a computer program to draw 1000 separate simple random samples of size 100 from a population that we suppose has parameter value $p = 0.6$.
• Take a large number of random samples from the same population.
Example: Draw 1000 separate samples of size 100 from a population that we suppose has parameter value $p = 0.6$.
• Calculate the sample proportion $\hat{p}$ for each sample.
Example: $\hat{p} = \frac{\text{Count of successes in the sample}}{\text{Size of sample}} = \frac{\text{\# of Agree}}{100}$
• Make a histogram of the values of $\hat{p}$.
• Examine the distribution displayed in the histogram for shape, center, and spread, as well as outliers or other deviations.
Example: When we analyze the curve, we get the following information. Center and spread are based on the actual calculation with a specific simulation (1000 separate samples of size 100).
Shape: the sampling distribution of $\hat{p}$ is approximately normal because each sample has size 100.
Center: 0.598
Spread: Standard deviation = 0.051
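The simulation described in these steps can be sketched with Python's standard `random` module. The seed, the constant names and the helper `sample_proportion` are assumptions made for this illustration; the center and spread printed will vary slightly with the seed and will not match the slide's values digit for digit.

```python
import random
import statistics

random.seed(0)        # fixed seed so that the run is repeatable

P = 0.6               # parameter value we suppose for the population
N_SAMPLES = 1000      # number of separate samples
SAMPLE_SIZE = 100     # size of each sample

def sample_proportion(p, n):
    """Draw one sample of size n and return its proportion of successes."""
    successes = sum(1 for _ in range(n) if random.random() < p)
    return successes / n

# One value of p-hat for each of the 1000 samples.
p_hats = [sample_proportion(P, SAMPLE_SIZE) for _ in range(N_SAMPLES)]

center = statistics.mean(p_hats)    # center of the sampling distribution
spread = statistics.stdev(p_hats)   # spread of the sampling distribution

print(f"center = {center:.3f}, spread = {spread:.3f}")
```

A histogram of `p_hats` would show the roughly bell-shaped sampling distribution described above.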
Theorem: (Sampling Distribution of a Sample Proportion)
Assumption: Choose an SRS of size $n$ from a large population that contains population proportion $p$ of successes.
• Shape: For large sample sizes ($n \ge 30$), the sampling distribution of $\hat{p}$ is approximately normal. (Central limit theorem)
• Center: The mean of the sampling distribution of $\hat{p}$ = the parameter $p$.
• Spread: Standard deviation of the sampling distribution of $\hat{p}$ = $\sqrt{\frac{p(1-p)}{n}}$
Theorem: Sampling Distribution of a Sample Proportion, with Example (continued)
• Assumption: Choose an SRS of size $n$ from a large population that contains population proportion $p$ of successes.
Example: $n = 100$. We took simple random samples from a population with proportion $p = 0.6$ of successes.
• Shape: For large sample sizes ($n \ge 30$), the sampling distribution of $\hat{p}$ is approximately normal. (Central limit theorem)
Example: The sampling distribution of $\hat{p}$ is approximately normal because each sample has size 100.
• Center: The mean of the sampling distribution of $\hat{p}$ = the parameter $p$.
Example: Mean of the sampling distribution of $\hat{p}$ = 0.598 ≈ 0.6 = $p$ (the parameter).
• Spread: Standard deviation of the sampling distribution of $\hat{p}$ = $\sqrt{\frac{p(1-p)}{n}}$
Example: Standard deviation of the sampling distribution of $\hat{p}$ = $\sqrt{\frac{0.6(1-0.6)}{100}} = \sqrt{0.0024} \approx 0.04899$
Terminology and Example
Margin of error = $2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Meaning: We are 95% confident that the true $p$ is within the range $\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
Why do we use the margin of error? In real life, we don’t know what the true parameter $p$ is. Thus, we want to draw a conclusion about the true parameter $p$ from the chosen simple random sample.
Example: With $\hat{p} = 0.598$ and $n = 100$, the margin of error is $2\sqrt{\frac{0.598(1 - 0.598)}{100}} \approx 0.098$.
We are 95% confident that the true $p$ is within the range $0.598 \pm 0.098$, that is, from 0.500 to 0.696.
Remark: we can see that the true parameter $p$ (= 0.6) is in the range from 0.500 to 0.696.
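As a further illustration of the margin of error formula, the sketch below applies it to the clothes-shopping survey introduced earlier, where 1650 of 2500 randomly selected adults agreed. The function name `margin_of_error` is hypothetical; the factor 2 is the one used in the slides for 95% confidence.

```python
import math

def margin_of_error(p_hat, n):
    """Margin of error for 95% confidence: 2 * sqrt(p_hat * (1 - p_hat) / n)."""
    return 2 * math.sqrt(p_hat * (1 - p_hat) / n)

n = 2500              # sample size of the survey
p_hat = 1650 / n      # sample proportion of adults who agree (0.66)

moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe

print(f"p-hat = {p_hat:.2f}, margin of error = {moe:.3f}")
print(f"95% confidence interval: {low:.3f} to {high:.3f}")
```

Because the sample is large, the margin of error is small: the interval extends only about two percentage points on either side of the sample proportion.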
Law of large numbers( informal )
If a situation is repeated more and more times, the proportion of successes (i.e., the proportion of times the outcome of interest really happens) will tend to come closer and closer to the actual probability of success.
Law of large numbers(formal)
Observe any random phenomenon having numerical outcomes with finite mean 𝜇. As the random phenomenon is repeated a large number of times,
• The proportion of trials on which each outcome occurs gets closer and closer to the probability of that outcome.
• The mean $\bar{x}$ of the observed values gets closer and closer to 𝜇.
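The law of large numbers can be watched in a small simulation. The sketch below repeats a trial with an assumed success probability of 0.6 and records the proportion of successes at a few checkpoints. The seed, the probability and the function name are choices made for this illustration.

```python
import random

random.seed(1)   # fixed seed so that the run is repeatable

P = 0.6          # assumed probability of success on each trial

def running_proportions(p, n_trials, checkpoints):
    """Repeat a trial n_trials times and record the proportion of
    successes each time the trial count reaches a checkpoint."""
    successes = 0
    results = {}
    for trial in range(1, n_trials + 1):
        if random.random() < p:
            successes += 1
        if trial in checkpoints:
            results[trial] = successes / trial
    return results

checkpoints = {10, 100, 1000, 10000, 100000}
proportions = running_proportions(P, 100000, checkpoints)

for trials in sorted(proportions):
    print(f"after {trials:6d} trials: proportion of successes = "
          f"{proportions[trials]:.4f}")
```

Early proportions can be far from 0.6, while the proportions after many trials settle close to it.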