qt1 - 03 - measures of central tendency
TRANSCRIPT
Measures of Central Tendency
Measures of Central Tendency and Dispersion
Contents
Summary Statistics
Measures of Central Tendency
Mean
Median
Mode
Measures of Dispersion
Range
Quartiles
Standard Deviation
Frequency Distribution
Relative Frequency Distribution
Frequency of each value can be expressed as a fraction or percentage of the total number of observations
This could help us compare data from samples that are of different sizes
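A minimal sketch of a relative frequency distribution, assuming two hypothetical sets of marks of different sizes (the data and the function name `relative_frequencies` are illustrative, not from the slides):

```python
from collections import Counter

def relative_frequencies(values):
    """Return {value: fraction of all observations} for a list of values."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical marks for two classes of different sizes.
small_class = [5, 6, 6, 7]
large_class = [5, 5, 6, 6, 6, 6, 7, 7]

small_rel = relative_frequencies(small_class)
large_rel = relative_frequencies(large_class)

# The raw counts differ (2 vs 4), but the relative frequency of mark 6 is the same.
print(small_rel[6], large_rel[6])
```

Because each frequency is divided by the sample size, the two samples can be compared directly even though one is twice as large as the other.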
Let us compare two sets of data
Mid-term and end-term marks for the same students
What can we say about the results ?
Did the class in general fare better ?
Did most of the students get similar marks ?
Summary Statistics
Tables and Graphs illustrate trends and patterns in the data
To take hard decisions, we need more exact measures
Single numbers
Calculated Mathematically
4 Principal Measures
Central Tendency
Dispersion
Skewness
Kurtosis
Values which can be calculated mathematically
Measure of Central Tendency
4 situations where the relative frequency distribution over a certain range is as given above
In B, the data clusters towards the left and centres around 15
In D, the data clusters towards the right and centres around 25
Is it really 15 and 25 ?
Measure of Dispersion
4 situations where the relative frequency distribution over a certain range is as given above
In A, the data centres around 15 and is closely dispersed
In B, the data centres around 15 but is more widely dispersed
What is close and what is wide ?
Skewness
Skewness is a measure of the lack of symmetry in the data. Not only is the data concentrated in one part of the range, but even there it is asymmetrical
Different from central tendency
Kurtosis
This is a measure of the peakedness of the data !
Which curve is more peaked than the other
Even though they might have the same central tendency and dispersion
Measures of Central Tendency
Arithmetic Mean
Mean of Grouped Data
Weighted Mean
Geometric Mean
Median
Median of Grouped Data
Mode
The Arithmetic Mean
Simply the average of all the values
calculated by adding the values of all the observations
and then dividing by the number of observations
\[ \bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum x}{n} \]
To calculate this mean, we have to use every piece of data ...
For a population this is actually impossible
For a sample, this can be quite difficult if the number of observations in the sample is quite high
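As a small illustration of the arithmetic mean, the sketch below adds up every observation and divides by the number of observations. The marks are hypothetical, and `arithmetic_mean` is an illustrative name; Python's standard `statistics.mean` is shown alongside for comparison.

```python
from statistics import mean

def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

# Hypothetical marks of five students.
marks = [12, 15, 11, 18, 14]

print(arithmetic_mean(marks))  # (12 + 15 + 11 + 18 + 14) / 5
print(mean(marks))             # the standard library gives the same value
```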
Arithmetic Mean of Grouped Data
\[ \bar{X} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_n x_n}{n} = \frac{\sum (f x)}{n} \]
where
f_i = frequency of the i-th class
x_i = midpoint of the i-th class
n = number of observations
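A sketch of the grouped-data mean, assuming a hypothetical frequency table with four classes of width 10 (0-10, 10-20, 20-30, 30-40). Each class is represented by its midpoint, and the number of observations is taken as the sum of the class frequencies.

```python
def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum of f*x over classes, divided by total observations."""
    n = sum(frequencies)  # number of observations
    return sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Hypothetical frequency table.
midpoints = [5, 15, 25, 35]    # x_i: midpoint of each class
frequencies = [5, 10, 20, 15]  # f_i: observations in each class

# (5*5 + 10*15 + 20*25 + 15*35) / 50 = 1200 / 50
print(grouped_mean(midpoints, frequencies))
```

The result is an approximation of the true mean, since every observation in a class is treated as if it sat at the class midpoint.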
Arithmetic Mean : Spreadsheet
Arithmetic Mean
Advantages
A single number that represents a whole set of data
Intuitive and simple to understand
Easy to calculate
Allows a quick comparison of two sets of data
Disadvantages
Affected by extreme values of the data
Suppose we have one very high number !
Could be difficult to calculate if size of set is high
Largely overcome by computers
In case of open-ended datasets, we cannot compute the number
The weighted mean
Allows us to calculate an average that factors in the significance or importance of each data value
Consider the following
2 Managers
Salary of Rs 1 Lakh each
10 Workers
Salary of Rs 20K each
1 Peon
Salary of Rs 5K
What is the average salary per-employee ?
Is it ?
(1L + 20K + 5K )/ 3
Is it ?
[(2 x 1L ) + (10 x 20K) + (1 x 5K) ] / 13
The second calculation is a more accurate reflection of the average salary
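The salary example can be worked through in a few lines. This is a sketch using the figures from the slide (1 Lakh = 100,000; 20K = 20,000; 5K = 5,000); the variable and function names are illustrative.

```python
# (salary in Rs, number of employees drawing that salary)
staff = [
    (100_000, 2),   # managers
    (20_000, 10),   # workers
    (5_000, 1),     # peon
]

def simple_mean_of_salaries(groups):
    """Averages the three salary levels, ignoring how many people earn each."""
    return sum(salary for salary, _ in groups) / len(groups)

def weighted_mean(groups):
    """Weights each salary by the number of employees who earn it."""
    total_weight = sum(count for _, count in groups)
    return sum(salary * count for salary, count in groups) / total_weight

print(round(simple_mean_of_salaries(staff), 2))  # (1L + 20K + 5K) / 3
print(round(weighted_mean(staff), 2))            # (2 x 1L + 10 x 20K + 1 x 5K) / 13
```

The unweighted figure is pulled upwards by the manager salary, which only two of the thirteen employees receive.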
Weighted Mean Grouped Data
\[ \bar{X} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_n x_n}{n} = \frac{\sum (f x)}{n} \]
where
f_i = frequency of the i-th class
x_i = midpoint of the i-th class
n = number of observations
The value being weighted is logically and mathematically equivalent to the midpoint of the class, and its weight to the class frequency.
Since we assume that all members of the class have the same value,
the lower class boundary, upper class boundary and midpoint are all the same!
Median
Measure of Central Tendency that does NOT represent all the values in the dataset.
It is the value of the most central or middle most item.
Half the values are above this value ( the median value) and the other half are below
To calculate the median
Arrange the data in either ascending or descending order
Choose the value of the data that lies in the middle of the array.
What happens if you have an even number of data values ?
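The steps above can be sketched as follows. For an even number of values there is no single middle item; the usual convention, assumed here, is to take the average of the two middle values. The function name `median_of` and the data are illustrative.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)          # arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: a single middle item exists
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair

print(median_of([7, 1, 5]))        # sorted: 1, 5, 7
print(median_of([7, 1, 5, 3]))     # sorted: 1, 3, 5, 7
print(median_of([7, 1, 5, 1000]))  # one extreme value barely moves the median
```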
Median of Grouped Data
Philosophical approach
Determine the class where the middle data should lie
Use frequency distribution
This is the median class
Median class has
Lower Boundary
Upper Boundary
Interpolate within the median class !
Practical Approach
Use this formula
\[ \text{Median} = \left( \frac{(n+1)/2 - (F+1)}{f_m} \right) w + L_m \]
n = total number of observations
F = cumulative frequency up to the class before the median class
f_m = frequency of the median class
w = class width
L_m = lower limit of the median class
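A sketch of the grouped-data median, reading the formula as ((n+1)/2 - (F+1)) / f_m * w + L_m, with the symbols as defined on the slide. The frequency table is hypothetical, equal class widths are assumed, and `grouped_median` is an illustrative name.

```python
def grouped_median(lower_limits, width, frequencies):
    """Locate the median class, then interpolate inside it."""
    n = sum(frequencies)
    position = (n + 1) / 2            # rank of the middle observation
    cumulative = 0                    # F: cumulative frequency before the current class
    for lower, f_m in zip(lower_limits, frequencies):
        if cumulative + f_m >= position:
            # This is the median class: interpolate between its boundaries.
            return (position - (cumulative + 1)) / f_m * width + lower
        cumulative += f_m
    raise ValueError("frequency table is empty")

# Hypothetical classes 0-10, 10-20, 20-30, 30-40.
lower_limits = [0, 10, 20, 30]
frequencies = [5, 10, 20, 15]

# n = 50, position = 25.5, median class is 20-30 with F = 15 and f_m = 20.
print(grouped_median(lower_limits, 10, frequencies))
```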
Median : Spreadsheet
Median
Advantages
Unaffected by extreme values of the data
Disadvantages
Complex to calculate
Some loss of accuracy when you work with grouped data
If the data is extremely irregular then not much sense can be made from the median
Mode
Mode is that value that is seen, or that occurs, most frequently in the data set
For discrete distributions this is easy to determine
Marks awarded to students in an exam
For continuous distributions it is very unlikely that the same value will appear twice
CO2 emissions from twenty engines
Modal Class
For continuous distributions we define classes and note the class that has the highest frequency
This is the modal class
It is quite possible that two or more modal classes might occur, as in the case of multimodal distributions
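A sketch of both ideas: the mode of discrete data, and the modal class of continuous data after binning. The marks and emission figures are hypothetical, as are the class boundaries (width 5, starting at 100) and the function names. Both functions return a list, so ties (multimodal data) are reported rather than hidden.

```python
from collections import Counter

def modes(values):
    """All values that occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

def modal_classes(values, lower, width):
    """Bin values into classes [lower + k*width, lower + (k+1)*width) and
    return the class or classes with the highest frequency."""
    bins = Counter(int((v - lower) // width) for v in values)
    highest = max(bins.values())
    return sorted(
        (lower + k * width, lower + (k + 1) * width)
        for k, c in bins.items()
        if c == highest
    )

marks = [6, 7, 7, 8, 7, 9, 8]                                # discrete data
emissions = [101.2, 103.7, 106.0, 107.9, 108.3, 108.8, 112.5]  # continuous data

print(modes(marks))                      # 7 occurs three times
print(modes(emissions))                  # every value occurs once: no useful mode
print(modal_classes(emissions, 100, 5))  # the 105-110 class holds four values
```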
Measures of Dispersion
Why is dispersion important ?
It helps us understand the significance or reliability of the central tendency.
Helps us to compare two or more samples
Range
Interfractile Range
Interquartile Range
Average Deviation Measures
Population Variance
Population Standard Deviation
Standard Score
Sample Variance
Sample Standard Deviation
Range
Range : simply the difference between the highest and lowest values in the observations
Easy to understand
Not terribly useful!
Fractile / Quartile / Percentile
Fractile
A value which is higher than a certain percentage of observations
Median
Is 0.5 fractile because 50% of the data lies below this value
1st Quartile
25% of the data lies at or below this value
3rd Quartile
75% of the data lies at or below this value
89th percentile
89% of the data lies at or below this value
An n fractile is a value below which a fraction n of the data lies,
where n is a number between 0 and 1
1st Quartile = 1/4 fractile or 25th percentile
90th Percentile = 9/10 fractile
Calculation of Fractiles
Interquartile Range
Where do the middle half of the values lie ? Between Q1 and Q3, a spread of Q3 - Q1
Q1 : 25% of the data lies below this
Q3 : 75% of the data lies below this
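A sketch of the range and the interquartile range using `statistics.quantiles` from the Python standard library (Python 3.8 or later). The marks are hypothetical. Note that several conventions exist for computing quartiles; the library's default ("exclusive") method is assumed here, and a spreadsheet may give slightly different values.

```python
from statistics import quantiles

# Hypothetical marks of seven students.
marks = [35, 42, 48, 55, 61, 70, 88]

# n=4 splits the data into four equal parts and returns the three cut points.
q1, q2, q3 = quantiles(marks, n=4)

data_range = max(marks) - min(marks)  # highest minus lowest value
iqr = q3 - q1                         # spread of the middle half of the data

print(q1, q2, q3)
print(data_range, iqr)
```

The range depends only on the two most extreme observations, while the interquartile range ignores them.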
Variance & Standard Deviation
Every population has a variance σ² (sigma squared), defined as follows
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2}{N} - \mu^2 \]
Where
σ² is the variance and σ is the standard deviation of the population
x is the item or observation
μ is the population mean
N is the number of observations
Σ is the symbol of summation
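The two forms of the population variance (mean squared deviation, and mean of squares minus squared mean) can be checked against each other numerically. This is a sketch on a small hypothetical data set that is treated as a complete population; the function names are illustrative, and `statistics.pvariance` and `statistics.pstdev` are the standard-library equivalents.

```python
from statistics import pvariance, pstdev

def variance_by_definition(values):
    """Average squared deviation from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def variance_by_shortcut(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2

# Hypothetical population of eight observations (mean 5).
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance_by_definition(data))
print(variance_by_shortcut(data))
print(pvariance(data), pstdev(data))
```

The shortcut form needs only the running totals of x and x squared, but it can lose precision on data with a large mean and a small spread.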
Variance & Standard Deviation
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2}{N} - \mu^2 \]
Here we use the formula provided by the spreadsheet
Significance of σ
The standard deviation σ enables us to determine with a great deal of accuracy where the values of a frequency distribution lie with respect to the mean μ
Chebyshev's Theorem states that for any distribution
At least 75% of all data will lie between μ-2σ and μ+2σ
At least 89% of all data will lie between μ-3σ and μ+3σ
For a smooth, symmetrical, bell-shaped distribution
About 68% of all data will lie between μ-σ and μ+σ
About 95% of all data will lie between μ-2σ and μ+2σ
About 99% of all data will lie between μ-3σ and μ+3σ
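The Chebyshev percentages come from the bound 1 - 1/k², the minimum fraction of data within k standard deviations of the mean. The sketch below evaluates that bound and compares it with the actual fraction in a small hypothetical data set; the function names are illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Actual fraction of observations between mu - k*sigma and mu + k*sigma."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5
    inside = [x for x in values if mu - k * sigma <= x <= mu + k * sigma]
    return len(inside) / n

# Hypothetical population (mean 5, standard deviation 2).
data = [2, 4, 4, 4, 5, 5, 7, 9]

for k in (2, 3):
    print(k, chebyshev_lower_bound(k), fraction_within(data, k))
```

The bound is a guaranteed minimum for any distribution, so the observed fraction can be, and usually is, considerably higher.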
Variance & Standard Deviation
for Grouped Data
Variance σ² is given by
\[ \sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N} = \frac{\sum f_i x_i^2}{N} - \mu^2 \]
Where
σ² is the variance and σ is the standard deviation of the population
x_i is the midpoint of the i-th class
f_i is the frequency in the i-th class
μ is the population mean
N is the number of observations
Σ is the symbol of summation
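A sketch of the grouped-data variance on a hypothetical frequency table (classes 0-10, 10-20, 20-30, 30-40). It assumes the population mean is itself computed from the class midpoints and frequencies; the function name is illustrative.

```python
def grouped_variance(midpoints, frequencies):
    """Population variance of grouped data, using class midpoints as the values."""
    n = sum(frequencies)
    mu = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    return sum(f * (x - mu) ** 2 for f, x in zip(frequencies, midpoints)) / n

# Hypothetical frequency table.
midpoints = [5, 15, 25, 35]    # x_i
frequencies = [5, 10, 20, 15]  # f_i

variance = grouped_variance(midpoints, frequencies)
std_dev = variance ** 0.5

# Shortcut form: sum(f * x^2) / N - mu^2, with mu = 24 for this table.
shortcut = sum(f * x * x for f, x in zip(frequencies, midpoints)) / 50 - 24 ** 2

print(variance, shortcut, round(std_dev, 3))
```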
Variance & Standard Deviation
for Grouped Data
\[ \sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N} = \frac{\sum f_i x_i^2}{N} - \mu^2 \]
Variance : Population and Sample
Population Statistics
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2}{N} - \mu^2 \]
Sample Statistics
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{\sum x^2}{n-1} - \frac{n \bar{x}^2}{n-1} \]
Suspiciously similar but not quite ... why is this ?
Population and Sample
Population refers to the totality of all data that is possible
Impossible to get this
So we will never be able to calculate
Either the mean μ
Or the variance σ²
Sample refers to the data from the population that we can collect
Using this data we can calculate
Sample mean : x̄
Sample variance : s²
Objective of the statistician is to ESTIMATE
the population statistic μ, σ
from the sample statistic x̄, s
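The difference between the population and sample formulae can be seen by applying both to the same numbers. This sketch uses `statistics.pvariance` (divide by N) and `statistics.variance` (divide by n-1) from the Python standard library on a hypothetical data set.

```python
from statistics import pvariance, variance

# Hypothetical observations (mean 5, sum of squared deviations 32).
data = [2, 4, 4, 4, 5, 5, 7, 9]

as_population = pvariance(data)  # treats data as the whole population: 32 / 8
as_sample = variance(data)       # treats data as a sample: 32 / (8 - 1)

print(as_population)
print(round(as_sample, 4))
```

Dividing by n-1 gives a slightly larger value. The sample mean sits closer to the sample points than the unknown population mean does, so dividing by n would tend to underestimate the population variance.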
Population & Sample Formulae are different in a Spreadsheet
sample / population
Last two statistics
Standard Score
A measure of how far an individual piece of data is from the mean
\[ \text{Standard score} = \frac{X - \mu}{\sigma} \]
Question
What would be the mean and standard deviation of the standard score ?
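One way to explore this question is to compute the standard scores of a small data set and then take their mean and standard deviation. The sketch below does so for hypothetical data, treated as a population; `standard_scores` is an illustrative name.

```python
from statistics import mean, pstdev

def standard_scores(values):
    """Convert each observation to (x - mu) / sigma using population statistics."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical population (mean 5, standard deviation 2).
data = [2, 4, 4, 4, 5, 5, 7, 9]

scores = standard_scores(data)
print(scores)
print(mean(scores), pstdev(scores))
```

Trying other data sets should give the same pair of summary values, which suggests the answer does not depend on the original units of measurement.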
Coefficient of Variation of a population
\[ \text{Coefficient of variation} = \frac{\sigma}{\mu} \, (100) \]
A relative measure that gives us a feel of how dispersed the data is when compared to the mean
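A sketch of the coefficient of variation on two hypothetical data sets with the same standard deviation but very different means, showing why a relative measure is useful. The function name and both data sets are illustrative.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    return pstdev(values) / mean(values) * 100

# Same spread (standard deviation 2), different means (5 and 105).
low_mean = [2, 4, 4, 4, 5, 5, 7, 9]
high_mean = [x + 100 for x in low_mean]

print(coefficient_of_variation(low_mean))
print(round(coefficient_of_variation(high_mean), 2))
```

A standard deviation of 2 is a large spread relative to a mean of 5 but a small one relative to a mean of 105.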
prithwis mukerjee
[Chart: Mid Semester Marks]
18/05/2008, 15:43:30