elementary statistics
TRANSCRIPT
Elementary Statistics
Davis Lazarus, Assistant Professor
ISIM, The IIS University
Too few categories
[Figure: histogram, Age of Spring 1998 Stat 250 Students, n=92 students; x-axis: Age (in years); y-axis: Frequency (Count)]
[Figure: histogram, GPAs of Spring 1998 Stat 250 Students, n=92 students; x-axis: GPA; y-axis: Frequency (Count)]
Too many categories
[Figure: scatter plot of Y against X]
• Scatter Plot
• Scatter diagram
• Scattergram
Classes    Class boundaries    Tally Marks             Freq.    Class mark (x)
70 – 78    69.5 – 78.5         /////                   5        74
61 – 69    60.5 – 69.5         /////                   5        65
52 – 60    51.5 – 60.5                                 0        56
43 – 51    42.5 – 51.5         //                      2        47
34 – 42    33.5 – 42.5         ///// //                7        38
25 – 33    24.5 – 33.5         ///// ///// ////        14       29
16 – 24    15.5 – 24.5         ///// ///// ///// //    17       20
A frequency distribution table lists
categories of scores along with their corresponding frequencies.
The frequency for a particular category or class is the number of
original scores that fall into that class.
The classes or categories refer to the groupings of a frequency table
• The range is the difference between the highest value and the lowest value.
R = highest value – lowest value
The class width is the difference between two consecutive lower class
limits or class boundaries.
The class limits are the smallest or the largest
numbers that can actually belong to different classes.
• Lower class limits are the smallest numbers that can actually belong to the different classes.
• Upper class limits are the largest numbers that can actually belong to the different classes.
• The class boundaries are obtained by increasing the upper class limits and decreasing the lower class limits by the same amount so that there are no gaps between consecutive classes. The amount to be added or subtracted is ½ the difference between the upper limit of one class and the lower limit of the following class.
Essential Question :
• How do we construct a frequency distribution table?
Process of Constructing a Frequency Table
• STEP 1. Determine the range.
R = Highest Value – Lowest Value
• STEP 2. Determine the tentative number of classes (k).
k = 1 + 3.322 log N (log to base 10; N is the number of observations)
• Always round off.
• Note: The number of classes should be between 5 and 20. The actual number of classes may be affected by convenience or other subjective factors.
• STEP 3. Find the class width (c) by dividing the range by the number of classes. (Always round off.)
$c = \frac{R}{k}$, that is, class width = Range / number of classes
• STEP 4. Write the classes or categories starting with the lowest score. Stop when the class already includes the highest score.
• Add the class width to the starting point to get the second lower class limit. Add the class width to the second lower class limit to get the third, and so on. List the lower class limits in a vertical column and enter the upper class limits, which can be easily identified at this stage.
• STEP 5. Determine the frequency for each class by referring to the tally columns and present the results in a table.
When constructing frequency tables, the following guidelines should be followed.
• The classes must be mutually exclusive. That is, each score must belong to exactly one class.
• Include all classes, even if the frequency might be zero.
• All classes should have the same width, although it is sometimes impossible to avoid open – ended intervals such as “65 years or older”.
• The number of classes should be between 5 and 20.
Let’s Try!!!
• Time magazine collected information on all 464 people who died from gunfire in the Philippines during one week. Here are the ages of 50 men randomly selected from that population. Construct a frequency distribution table.
19 18 30 40 41 33 73 25
23 25 21 33 65 17 20 76
47 69 20 31 18 24 35 24
17 36 65 70 22 25 65 16
24 29 42 37 26 46 27 63
21 27 23 25 71 37 75 25
27 23
Using Table:
• What is the lower class limit of the highest class? Upper class limit of the lowest class?
• Find the class mark of the class 43 – 51.
• What is the frequency of the class 16 – 24?
Classes    Class boundaries    Tally Marks             Freq.    Class mark (x)
70 – 78    69.5 – 78.5         /////                   5        74
61 – 69    60.5 – 69.5         /////                   5        65
52 – 60    51.5 – 60.5                                 0        56
43 – 51    42.5 – 51.5         //                      2        47
34 – 42    33.5 – 42.5         ///// //                7        38
25 – 33    24.5 – 33.5         ///// ///// ////        14       29
16 – 24    15.5 – 24.5         ///// ///// ///// //    17       20
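As a rough sketch, Steps 1 to 5 can be traced in Python for the 50 ages of the exercise. The function name `frequency_table` is made up for this illustration, and the code assumes whole-number data so that each upper class limit is the next lower limit minus 1.

```python
import math

def frequency_table(data):
    """Return (lower limit, upper limit, frequency) rows for whole-number data."""
    low, high = min(data), max(data)
    value_range = high - low                           # Step 1: R = highest - lowest
    k = round(1 + 3.322 * math.log10(len(data)))       # Step 2: tentative number of classes
    width = max(1, round(value_range / k))             # Step 3: class width, rounded off
    rows = []
    lower = low
    while lower <= high:                               # Step 4: stop once the highest score is covered
        upper = lower + width - 1
        freq = sum(lower <= x <= upper for x in data)  # Step 5: count the scores in the class
        rows.append((lower, upper, freq))
        lower += width
    return rows

ages = [19, 18, 30, 40, 41, 33, 73, 25,
        23, 25, 21, 33, 65, 17, 20, 76,
        47, 69, 20, 31, 18, 24, 35, 24,
        17, 36, 65, 70, 22, 25, 65, 16,
        24, 29, 42, 37, 26, 46, 27, 63,
        21, 27, 23, 25, 71, 37, 75, 25,
        27, 23]

table = frequency_table(ages)
for lower, upper, freq in table:
    print(f"{lower} - {upper}: {freq}")
```

Here R = 76 - 16 = 60, k rounds to 7 and the class width rounds to 9, which gives the classes 16 – 24 up to 70 – 78.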
The manager of Hudson Auto would like to have a better understanding of the cost of parts used in the engine tune-ups performed in the shop. She examines 50 customer invoices for tune-ups. The costs of parts, rounded off to the nearest dollar, are listed below.
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Example 1
CUMULATIVE FREQUENCY DISTRIBUTION
• The less than cumulative frequency distribution (F<) is constructed by adding the frequencies from the lowest to the highest interval while the more than cumulative frequency distribution (F>) is constructed by adding the frequencies from the highest class interval to the lowest class interval.
Tabular Summary: Frequency Distribution of engine tune-ups

Cost ($)    Frequency    Relative Frequency    Cumulative Frequency
                                               less than    more than
50-59       2            0.04                  2            50
60-69       13           0.26                  15           48
70-79       16           0.32                  31           35
80-89       7            0.14                  38           19
90-99       7            0.14                  45           12
100-109     5            0.10                  50           5
Total       50           1.00

Less than, 60-69 row: 2 + 13 = 15
More than, 90-99 row: 5 + 7 = 12
45 tune-ups cost less than $100
12 tune-ups cost more than $89
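The relative and cumulative columns can be derived from the class frequencies alone. The following is a small sketch using only the six frequencies of the tune-up table; the variable names are chosen for this illustration.

```python
from itertools import accumulate

# Class frequencies for 50-59, 60-69, 70-79, 80-89, 90-99, 100-109
frequencies = [2, 13, 16, 7, 7, 5]
total = sum(frequencies)

# Relative frequency: class frequency divided by the total
relative = [round(f / total, 2) for f in frequencies]

# "Less than" cumulative frequency: add from the lowest class upward
less_than = list(accumulate(frequencies))

# "More than" cumulative frequency: add from the highest class downward
# (for the 80-89 class this is 5 + 7 + 7 = 19)
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(relative)
print(less_than)
print(more_than)
```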
Graphical Summary: Histogram
[Figure: histogram of tune-up parts cost; x-axis: Cost ($) in classes 50-59, 60-69, 70-79, 80-89, 90-99, 100-110; y-axis: Frequency, 2 to 18]
Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.
Ogive
[Figure: less than ogive and more than ogive of tune-up cost, with the median marked; x-axis: Tune-up Cost ($), 60 to 110; y-axis: Frequency, 10 to 50]
Stem-and-Leaf Display
 5 | 2 7
 6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3 5 8 9
 9 | 1 3 7 7 7 8 9
10 | 1 4 5 5 9
(the number to the left of the bar is a stem; each digit to the right is a leaf)
A single digit is used to define each leaf
Leaf units may be 100, 10, 1, 0.1, and so on
Where the leaf unit is not shown, it is assumed to equal 1
In the above example, the leaf unit was 1
Leaf Unit = 0.1
Data: 8.6 11.7 9.4 9.1 10.2 11.0 8.8
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7

Leaf Unit = 10
Data: 1806 1717 1974 1791 1682 1910 1838
16 | 8
17 | 1 9
18 | 0 3
19 | 1 7
The 82 in 1682 is rounded down to 80 and is represented as an 8.
Measures of Central Tendency: Arithmetic Mean, Weighted Mean, Geometric Mean, Median, Mode, Partition Values – Quartiles, Deciles and Percentiles
Measures of Dispersion: Range, Mean deviation, Standard deviation, Variance, Co-efficient of variation
Measures of Position: Quartile deviation
Mean: the average obtained by finding the sum of the
numbers and dividing by the number of numbers in the sum.
Median: When the numbers are listed from highest to lowest or lowest to highest, the median is the number found in the middle. If there is an even number of data values, find the average of the middle two numbers.
Mode: The number that occurs the most often.
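As a quick illustration, the three definitions can be checked with Python's standard `statistics` module. The five values are the crate weights (in pounds) from the mean deviation example later in these slides; using them here is a choice made for this sketch.

```python
import statistics

weights = [103, 97, 101, 106, 103]

mean = statistics.mean(weights)      # sum of the values divided by how many there are
median = statistics.median(weights)  # middle value of the ordered list 97, 101, 103, 103, 106
mode = statistics.mode(weights)      # value that occurs most often

print(mean, median, mode)
```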
• What is the “location” or “centre” of the data? (measures of location or central tendency)
• How do the data vary? (measures of variability or dispersion)
Mean is the most widely used measure of location and shows the central value of the data.
• all values are used
• unique
• sum of the deviations from the mean is 0
• affected by unusually large or small data values
Population mean: $\mu = \frac{\sum X_i}{N}$
µ is the population mean
N is the population size
X_i is a particular population value
Σ indicates the operation of adding
Sample mean: $\bar{X} = \frac{\sum x_i}{n}$
$\bar{X}$ is the sample mean
n is the sample size
x_i is a particular sample value
There are as many values above the median as below it
in the data array.
For an even set of values, the median will be the
arithmetic average of the two middle numbers and is
found at the (n+1)/2 ranked observation.
The Median
Median is the midpoint of the values after they have been ordered from the smallest to the largest.
• unique
• not affected by extremely large or small values
• good measure of location when such values occur
Data can have more than one mode.
If it has two modes, it is referred to as bimodal, three
modes, trimodal, and the like.
The Mode
Mode is another measure of location and represents the value of the observation that appears most frequently.
Geometric Mean
Geometric Mean of a set of n numbers is defined as the nth root of the product of the n numbers.
$GM = \sqrt[n]{(X_1)(X_2)(X_3)\cdots(X_n)}$
GM is used to average percents, indexes, and relatives.
Weighted Mean
Weighted Mean of a set of numbers X1, X2, ..., Xn, with corresponding weights w1, w2, ..., wn, is computed as:
$\bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + \dots + w_n X_n}{w_1 + w_2 + \dots + w_n}$
The interest rates on three bonds were 5, 21, and 4 percent.
The arithmetic mean is (5 + 21 + 4) / 3 = 10.0
The geometric mean is $GM = \sqrt[3]{(5)(21)(4)} = 7.49$
The GM gives a more conservative profit figure because
it is not heavily weighted by the rate of 21%
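The two averages for the bond rates can be compared with a short sketch; `math.prod` (Python 3.8 or later) multiplies the values before the cube root is taken.

```python
import math

rates = [5, 21, 4]  # interest rates on the three bonds, in percent

arithmetic_mean = sum(rates) / len(rates)
# nth root of the product of the n values
geometric_mean = math.prod(rates) ** (1 / len(rates))

print(round(arithmetic_mean, 2), round(geometric_mean, 2))
```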
Example 1
Another use of GM is to determine the percent increase in sales, production or other business or economic series from one time period to another.
[Figure: Growth in Sales 1999-2004; x-axis: Year, 1999 to 2004; y-axis: Sales in Millions ($), 0 to 50]
$GM = \sqrt[n]{\frac{\text{Value at end of period}}{\text{Value at beginning of period}}} - 1$
Example 2
The total number of females enrolled in American colleges increased from 755,000 in 1992 to 835,000 in 2000.
$GM = \sqrt[8]{\frac{835{,}000}{755{,}000}} - 1 = 0.0127$
That is, the geometric mean rate of increase is 1.27%.
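The enrollment figure can be reproduced with a small helper. The function name `average_growth_rate` is hypothetical, and the number of periods is taken as 8 (1992 to 2000).

```python
def average_growth_rate(begin_value, end_value, periods):
    """Geometric mean rate of increase per period."""
    return (end_value / begin_value) ** (1 / periods) - 1

rate = average_growth_rate(755_000, 835_000, 8)
print(f"{rate:.4f}")   # as a proportion
print(f"{rate:.2%}")   # as a percentage
```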
Example 3
Measures of Dispersion
• Range
• Mean Deviation
• Quartile Deviation
• Standard Deviation
• Variance
• Co-efficient of Variation
Dispersion: refers to the spread or variability in the data.
Range: Range = Largest value – Smallest value
[Figure: plot of data values with the mean marked]
The following represents the current year’s Return on Equity of the 25 companies in an investor’s portfolio.
-8.1  3.2  5.9   8.1  12.3
-5.1  4.1  6.3   9.2  13.3
-3.1  4.6  7.9   9.5  14.0
-1.4  4.8  7.9   9.7  15.0
 1.2  5.7  8.0  10.3  22.1
Highest value: 22.1 Lowest value: -8.1
Range = Highest value – lowest value = 22.1-(-8.1) = 30.2
Range Example
Mean Deviation: the arithmetic mean of the absolute values of the deviations from the arithmetic mean.
• All values are used in the calculation.
• It is not unduly influenced by large or small values.
• The absolute values are difficult to manipulate.
$MD = \frac{\sum |X - \bar{X}|}{n}$
The weights of a sample of crates containing books for the bookstore (in pounds ) are: 103, 97, 101, 106, 103
$\bar{X} = 102$
$MD = \frac{\sum |X - \bar{X}|}{n} = \frac{|103 - 102| + \dots + |103 - 102|}{5} = \frac{1 + 5 + 1 + 4 + 1}{5} = 2.4$
Example 5
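The crate-weight calculation can be sketched as follows; the helper name `mean_deviation` is chosen for this illustration.

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

weights = [103, 97, 101, 106, 103]  # crate weights in pounds
md = mean_deviation(weights)
print(md)
```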
Standard deviation and Variance
Population Variance: the arithmetic mean of the squared deviations from the mean
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
X is the value of an observation in the population
μ is the arithmetic mean of the population
N is the number of observations in the population
Population Standard Deviation: the square root of the population variance
$\sigma = \sqrt{\sigma^2}$
In Example 4, the variance and standard deviation are:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{(-8.1 - 6.62)^2 + (-5.1 - 6.62)^2 + \dots + (22.1 - 6.62)^2}{25} = 42.227$
$\sigma = \sqrt{42.227} = 6.498$
Example 6
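These figures can be checked against the 25 Return on Equity values of the range example, here treated as a complete population. This sketch uses `statistics.pvariance` and `statistics.pstdev`, which divide by N.

```python
import statistics

roe = [-8.1, 3.2, 5.9, 8.1, 12.3,
       -5.1, 4.1, 6.3, 9.2, 13.3,
       -3.1, 4.6, 7.9, 9.5, 14.0,
       -1.4, 4.8, 7.9, 9.7, 15.0,
        1.2, 5.7, 8.0, 10.3, 22.1]

mu = statistics.mean(roe)                # population mean
variance = statistics.pvariance(roe)     # mean of the squared deviations (divides by N)
std_dev = statistics.pstdev(roe)         # square root of the population variance

print(round(mu, 2), round(variance, 3), round(std_dev, 3))
```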
Sample variance: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Sample standard deviation: $s = \sqrt{s^2}$
The hourly wages earned by a sample of five students are $7, $5, $11, $8, $6.
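$\bar{X} = \frac{\sum X}{n} = \frac{37}{5} = 7.40$
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(7 - 7.4)^2 + \dots + (6 - 7.4)^2}{5 - 1} = \frac{21.2}{5 - 1} = 5.30$
$s = \sqrt{s^2} = \sqrt{5.30} = 2.30$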
Example 7
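For the hourly wages, the sample versions of the formulas divide by n - 1. A minimal sketch with `statistics.variance` and `statistics.stdev`, which use that divisor:

```python
import statistics

wages = [7, 5, 11, 8, 6]  # hourly wages in dollars

sample_mean = statistics.mean(wages)
sample_variance = statistics.variance(wages)  # divides by n - 1
sample_std = statistics.stdev(wages)          # square root of the sample variance

print(sample_mean, round(sample_variance, 2), round(sample_std, 2))
```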
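Example:
Data: X = {6, 10, 5, 4, 9, 8}; N = 6

X            $X - \bar{X}$    $(X - \bar{X})^2$
6            -1               1
10           3                9
5            -2               4
4            -3               9
9            2                4
8            1                1
Total: 42                     Total: 28

Mean: $\bar{X} = \frac{\sum X}{N} = \frac{42}{6} = 7$
Variance: $s^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{28}{6} = 4.67$
Standard Deviation: $s = \sqrt{s^2} = \sqrt{4.67} = 2.16$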
Empirical Rule:
For any symmetrical, bell-shaped distribution:
• About 68% of the observations will lie within 1s of the mean
• About 95% of the observations will lie within 2s of the mean
• Nearly all (99.7%) of the observations will lie within 3s of the mean
Interpretation and Uses of the Standard Deviation
Quartiles: Q1, Q2, Q3 divide ranked data into four equal parts (25% each)
Deciles: D1, D2, D3, D4, D5, D6, D7, D8, D9 divide ranked data into ten equal parts (10% each)
Percentiles: P1, P2, ..., P99 divide ranked data into 100 equal parts
Fractiles
Relative Standing
Percentiles
percentile of value x = ((number of values < x) / total number of values) * 100
(round the result to the nearest whole number)
Suppose that in a class of 25 people we have the following averages (in ascending order):
42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77, 78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98
If you received a 77, what percentile are you?
percentile of 77 = (12/25)*100 = 48
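The same count can be done in code. This is a sketch of the formula above; the function name `percentile_of` is made up for the illustration.

```python
def percentile_of(values, x):
    """Percentage of values strictly below x, rounded to a whole number."""
    below = sum(v < x for v in values)
    return round(below / len(values) * 100)

grades = [42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77,
          78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98]

print(percentile_of(grades, 77))  # 12 of the 25 values are below 77
```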
Relative Standing
Quartiles
Instead of finding the percentile of a single data value as we did on the previous page, it is often useful to group the data into 4, or more, (nearly) equal groups. When grouping the data into four equal groupings, we call these groupings quartiles.
Let n = number of items in the data set
k = percent desired (ex. k= 25)
L = locator: the position of the value separating the first k percent of the data from the rest
L = (k/100) * n
Relative Standing
Let’s separate the 25 class grades into four quartiles.
•Step 1 – order the data in ascending order
42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77, 78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98
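Now find the 3 locators L25, L50, L75. If a locator has a fractional part, round it up to the next integer.
L25 = (25/100) * 25 = 6.25, rounded up to 7, so Q1 is the 7th value: 70
L50 = (50/100) * 25 = 12.5, rounded up to 13, so Q2 is the 13th value: 77
L75 = (75/100) * 25 = 18.75, rounded up to 19, so Q3 is the 19th value: 84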
Relative Standing
Other measures of relative standing include
• Interquartile range (IQR) = Q3 - Q1
• Semi-interquartile range = (Q3 - Q1)/2
• Midquartile = (Q3 + Q1)/2
• 10 – 90 percentile range = P90 - P10
For the data on the previous page we have:
IQR = 84 – 70 = 14
Semi IQR = (84 – 70)/2 = 7
Midquartile = (84 + 70)/2 = 77
Measures of variation: IQR, Semi IQR
Measure of central tendency: Midquartile
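The locator method can be sketched in Python for the 25 class grades, where Q1 = 70 and Q3 = 84 give 84 - 70 = 14 for the interquartile range. The function name `value_at_percent` is hypothetical. Averaging two neighbouring values when the locator is a whole number is an assumption taken from the box diagram calculation Q1 = (73 + 74)/2 elsewhere in these slides.

```python
import math

def value_at_percent(sorted_values, k):
    """Value separating the first k percent of the data (intended for 0 < k < 100)."""
    locator = k / 100 * len(sorted_values)
    if locator != int(locator):
        # Fractional locator: round up to the next position (1-based)
        return sorted_values[math.ceil(locator) - 1]
    # Whole-number locator: average that position and the next one (assumed rule)
    position = int(locator)
    return (sorted_values[position - 1] + sorted_values[position]) / 2

grades = [42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77,
          78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98]

q1 = value_at_percent(grades, 25)
q2 = value_at_percent(grades, 50)
q3 = value_at_percent(grades, 75)

iqr = q3 - q1
semi_iqr = iqr / 2
midquartile = (q3 + q1) / 2
print(q1, q2, q3, iqr, semi_iqr, midquartile)
```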
Box Diagram
65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73,
74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81,
81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92
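[Figure: box diagram on a scale from 65 to 92 (gradations at 69, 73, 77, 81, 85, 89), with Q1, the median (M) and Q3 marked; L25 and L75 indicate the quartile locators in the data list]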
To construct a box diagram to illustrate the extent to which the extreme data values lie beyond the interquartile range, draw a line with the low and high value highlighted at the two ends. Mark the gradations between these two extremes, then locate the quartile boundaries Q1, Med., and Q3 on this line. Construct a box about these values.
Q1 = (73 + 74)/2 = 73.5
Percentile of score a = (number of scores less than a / total number of scores) * 100
Relation between the different fractiles
• Q1 = P25
• Q2 = P50
• Q3 = P75
• D1 = P10
• D2 = P20
• D3 = P30
• ...
• D9 = P90
Interquartile Range: Q3 – Q1
Five pieces of data are needed to construct a box plot:
Minimum Value,
First Quartile, Q1
Median,
Third Quartile, Q3
Maximum Value.
Box plot: a graphical display, based on quartiles, that helps to picture a set of data.
The box represents the interquartile range, which contains the middle 50% of the values.
The whiskers represent the range;
they extend from the box to the
highest and lowest values,
excluding outliers.
A line across the box indicates the
median.
Based on a sample of 20 deliveries, Buddy’s Pizza
determined the following information. The minimum
delivery time was 13 minutes and the maximum 30 minutes.
The first quartile was 15 minutes, the median 18 minutes, and
the third quartile 22 minutes. Develop a box plot for the
delivery times.
Example 8
[Figure: box plot of delivery times on a scale from 12 to 32 minutes, marking Min, Q1, Median, Q3 and Max; an annotation marks a distance of 1.5 times the interquartile range]
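The five-number summary for Buddy's Pizza is enough to check for outliers. The sketch below assumes the common convention that a value is an outlier when it lies more than 1.5 times the interquartile range below Q1 or above Q3; the slide names the 1.5 multiple but does not spell out the rule, so the fence formulas are an assumption.

```python
# Five-number summary from the delivery-time example (minutes)
minimum, q1, median, q3, maximum = 13, 15, 18, 22, 30

iqr = q3 - q1                   # width of the box
lower_fence = q1 - 1.5 * iqr    # assumed outlier limit below the box
upper_fence = q3 + 1.5 * iqr    # assumed outlier limit above the box

# With no value beyond the fences, the whiskers run to the minimum and maximum
has_outliers = minimum < lower_fence or maximum > upper_fence
print(iqr, lower_fence, upper_fence, has_outliers)
```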
Symmetric distribution: A distribution having the same shape on either side of the centre
Skewed distribution: One whose shapes on either side of the center differ; a nonsymmetrical distribution.
Can be positively or negatively skewed, or bimodal
Skewness: a measure of the lack of symmetry of the distribution.
Relative Positions of the Mean, Median, and Mode in a Symmetric Distribution
[Figure: symmetric curve with the Mode, Median and Mean marked at the same point]
Relative Positions of the Mean, Median, and Mode in a Right Skewed or Positively Skewed Distribution
[Figure: right-skewed curve with the Mode, Median and Mean marked]
Mean > Median > Mode
The Relative Positions of the Mean, Median, and Mode in a Left Skewed or Negatively Skewed Distribution
[Figure: left-skewed curve with the Mean, Median and Mode marked]
Mean < Median < Mode
Using the twelve stock prices, we find the mean to be 84.42, standard deviation, 7.18, median, 84.5.
$sk = \frac{3(\bar{X} - \text{Median})}{s} = -0.035$
Example 9
The coefficient of skewness can range from -3.00 up to 3.00
A value of 0 indicates a symmetric distribution.
• derived from the Greek word κυρτός, kyrtos or kurtos,
meaning bulging
• measure of the "peakedness" of the probability
distribution of a real-valued random variable
• higher kurtosis means more of the variance is due to
infrequent extreme deviations, as opposed to frequent
modestly-sized deviations.
Kurtosis
A distribution with positive kurtosis is called leptokurtic, or leptokurtotic.
In terms of shape, a leptokurtic distribution has a more acute
"peak" around the mean (that is, a higher probability than a
normally distributed variable of values near the mean) and
"fat tails" (that is, a higher probability than a normally
distributed variable of extreme values).
A distribution with negative kurtosis is called platykurtic, or platykurtotic.
In terms of shape, a platykurtic distribution has a smaller
"peak" around the mean (that is, a lower probability than a
normally distributed variable of values near the mean) and
"thin tails" (that is, a lower probability than a normally
distributed variable of extreme values).
Normal distribution - Mesokurtic
Other distribution – Leptokurtic
Normal distribution - Mesokurtic
Other distribution – Platykurtic
Comparing Standard Deviations
[Figure: three dot plots on a common scale from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Is used to compare two or more sets of data measured in different units
$CV = \frac{S}{\bar{X}} \times 100\%$
Co-efficient of variation
When the mean value is near zero, the coefficient of variation is sensitive to change in the standard deviation, limiting its usefulness.
Stock A: Average price last year = $50; Standard deviation = $5
CV = (S / X̄) × 100% = ($5 / $50) × 100% = 10%
Stock B: Average price last year = $100; Standard deviation = $5
CV = (S / X̄) × 100% = ($5 / $100) × 100% = 5%
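The stock comparison can be reproduced with a short sketch; the function name `coefficient_of_variation` is chosen for this illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B
print(f"Stock A: {cv_a:.0f}%  Stock B: {cv_b:.0f}%")
```

Both stocks have the same standard deviation, but relative to its mean price Stock A varies twice as much as Stock B.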