qt1 - 03 - measures of central tendency

Measures of Central Tendency

Measures of Central Tendency and Dispersion

Quantitative Techniques

Contents

Summary Statistics


Mean

Median

Mode

Measures of Dispersion

Range

Quartiles

Standard Deviation

Frequency Distribution

Relative Frequency Distribution

Frequency of each value can be expressed as a fraction or percentage of the total number of observations

This could help us compare data from samples that are of different sizes
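A minimal sketch of the idea, assuming two hypothetical classes of different sizes (the marks are invented for illustration): converting counts to fractions makes the two distributions directly comparable.

```python
from collections import Counter

def relative_frequencies(values):
    """Return {value: fraction of all observations} for a list of values."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical marks for two classes of different sizes (illustrative data only).
class_a = [5, 6, 6, 7]                 # 4 students
class_b = [5, 5, 6, 6, 6, 6, 7, 7]     # 8 students

rel_a = relative_frequencies(class_a)
rel_b = relative_frequencies(class_b)

print(rel_a)
print(rel_b)
```

The raw counts differ, but the relative frequencies of the two classes come out identical, which is what makes the comparison possible.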

Let us compare two sets of data: mid-term and end-term marks for the same students

What can we say about the results?

Did the class in general fare better?

Did most of the students get similar marks?

Summary Statistics

Tables and Graphs illustrate trends and patterns in the data

To make hard decisions, we need more exact measures

Single numbers

Calculated Mathematically

4 Principal Measures

Central Tendency

Dispersion

Skewness

Kurtosis

Values which can be calculated mathematically

Measure of Central Tendency

Four situations where the relative frequency distribution over a certain range is as given above

In B, the data clusters towards the left and centres around 15

In D, the data clusters towards the right and centres around 25

Is it really 15 and 25?

Measure of Dispersion

Four situations where the relative frequency distribution over a certain range is as given above

In A, the data centres around 15 and is closely dispersed

In B, the data also centres around 15 but is more widely dispersed

What is close and what is wide?

Skewness

Skewness is a measure of the lack of symmetry in the data. Not only is the data concentrated in one part of the range, but even there it is asymmetrical

Different from central tendency

Kurtosis

This is a measure of the peakedness of the data!

Which curve is more peaked than the other?

Even though they might have the same central tendency and dispersion


Arithmetic Mean

Mean of Grouped Data

Weighted Mean

Geometric Mean

Median

Median of Grouped Data

Mode

The Arithmetic Mean

Simply the average of all the values

calculated by adding the values of all the observations

and then dividing by the number of observations

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum x}{n}$$

To calculate this mean, we have to use every piece of data ...

For a population this is actually impossible

For a sample, this can be quite difficult if the number of observations in the sample is quite high

Arithmetic Mean of Grouped Data

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_k x_k}{n} = \frac{\sum (f \times x)}{n}$$

where

f_i = frequency of the ith class

x_i = midpoint of the ith class

k = number of classes

n = number of observations
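The grouped-data formula can be sketched as follows. The frequency table is hypothetical, and `grouped_mean` is an illustrative helper name, not something from the slides.

```python
def grouped_mean(classes):
    """Mean of grouped data.

    classes: list of (lower_limit, upper_limit, frequency) tuples.
    Every observation in a class is assumed to sit at the class midpoint.
    """
    n = sum(freq for _, _, freq in classes)
    total = sum(freq * (lower + upper) / 2 for lower, upper, freq in classes)
    return total / n

# Hypothetical frequency table of marks (illustrative data only).
marks = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
mean_marks = grouped_mean(marks)
print(mean_marks)  # (2*5 + 5*15 + 3*25) / 10 = 16.0
```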

Arithmetic Mean : Spreadsheet

Arithmetic Mean

Advantages

A single number that represents a whole set of data

Intuitive and simple to understand

Easy to calculate

Allows a quick comparison of two sets of data

Disadvantages

Affected by extreme values of the data

Suppose we have one very high number!

Could be difficult to calculate if the data set is large

Largely overcome by computers

In the case of open-ended data sets, we cannot compute the mean

The weighted mean

Allows us to calculate an average that factors in the significance or importance of each data value

Consider the following

2 Managers

Salary of Rs 1 Lakh each

10 Workers

Salary of Rs 20K each

1 Peon

Salary of Rs 5K

What is the average salary per employee?

Is it?

(1L + 20K + 5K) / 3

Or is it?

[(2 x 1L) + (10 x 20K) + (1 x 5K)] / 13

The second calculation is a more accurate reflection of the average salary
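The two candidate calculations can be compared directly. This sketch uses the salary figures from the slide (1 Lakh = 100,000); the variable names are illustrative.

```python
# (role, headcount, salary in Rs) taken from the example above.
staff = [
    ("Manager", 2, 100_000),
    ("Worker", 10, 20_000),
    ("Peon", 1, 5_000),
]

# Unweighted: treats the three salary levels as if one person earned each.
simple_mean = sum(salary for _, _, salary in staff) / len(staff)

# Weighted: each salary counts once per employee who earns it.
headcount = sum(count for _, count, _ in staff)
weighted_mean = sum(count * salary for _, count, salary in staff) / headcount

print(round(simple_mean, 2))    # 41666.67
print(round(weighted_mean, 2))  # 31153.85
```

The unweighted figure overstates the typical salary because it gives the two managers the same influence as the ten workers.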

Weighted Mean Grouped Data

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_k x_k}{n} = \frac{\sum (f \times x)}{n}$$

where

f_i = frequency of the ith class

x_i = midpoint of the ith class

k = number of classes

n = number of observations

In a weighted mean, each value plays the same role, logically and mathematically, as the midpoint of a class, and its weight plays the role of the class frequency.

Since we assume that all members of the class have the same value

So lower class boundary, upper class boundary and midpoint are all the same!

Median

Measure of Central Tendency that does NOT represent all the values in the dataset.

It is the value of the most central or middle most item.

Half the values are above this value (the median value) and the other half are below

To calculate the median

Arrange the data in either ascending or descending order

Choose the value of the data that lies in the middle of the array.

What happens if you have an even number of data values?
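A minimal sketch of the procedure, assuming the usual convention that an even-sized data set takes the average of its two middle values. The data values are invented for illustration.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([7, 3, 9, 1, 5])   # sorted: 1 3 5 7 9, middle value is 5
even_case = median([7, 3, 9, 1])     # sorted: 1 3 7 9, (3 + 7) / 2 = 5.0
print(odd_case, even_case)
```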

Median of Grouped Data

Philosophical approach

Determine the class where the middle data should lie

Use frequency distribution

This is the median class

Median class has

Lower Boundary

Upper Boundary

Interpolate!

Practical Approach

Use this formula

$$\text{Median} = \left( \frac{(n+1)/2 - (F+1)}{f_m} \right) w + L_m$$

n = total number of observations

F = cumulative frequency up to the class before the median class

f_m = frequency of the median class

w = class width

L_m = lower limit of the median class
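The formula above can be sketched in code. The frequency table is hypothetical, and `grouped_median` is an illustrative helper; it locates the median class from the cumulative frequencies and then interpolates inside it.

```python
def grouped_median(classes):
    """Median of grouped data by interpolation within the median class.

    classes: list of (lower_limit, width, frequency), in ascending order.
    Implements ((n + 1) / 2 - (F + 1)) / f_m * w + L_m.
    """
    n = sum(freq for _, _, freq in classes)
    position = (n + 1) / 2          # rank of the middle observation
    cumulative = 0                  # F: observations in all earlier classes
    for lower, width, freq in classes:
        if cumulative + freq >= position:
            return (position - (cumulative + 1)) / freq * width + lower
        cumulative += freq
    raise ValueError("frequency table is empty")

# Hypothetical table: classes 0-10, 10-20, 20-30 with frequencies 2, 5, 3.
table = [(0, 10, 2), (10, 10, 5), (20, 10, 3)]
median_marks = grouped_median(table)
print(median_marks)  # (5.5 - 3) / 5 * 10 + 10 = 15.0
```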

Median : Spreadsheet

Median

Advantages

Unaffected by extreme values of the data

Disadvantages

Complex to calculate

Some loss of accuracy when you work with grouped data

If the data is extremely irregular then not much sense can be made from the median

Mode

Mode is that value that is seen, or that occurs, most frequently in the data set

For discrete distributions this is easy to determine

Marks awarded to students in an exam

For continuous distributions it is very unlikely that the same value will appear twice

CO2 emissions from twenty engines

Modal Class

For continuous distributions we define classes and note the class that has the highest frequency

This is the modal class

It is quite possible that two or more modal classes might occur, as in the case of multimodal distributions
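Both cases can be sketched as follows. The marks and emission readings are invented, and the class width of 5 is an arbitrary assumption made for the example.

```python
from collections import Counter

def modes(values):
    """All values that occur most often (more than one if multimodal)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def modal_classes(values, width):
    """Bin values into classes [k*width, (k+1)*width) and return the fullest."""
    bins = Counter(int(v // width) for v in values)
    top = max(bins.values())
    return sorted((k * width, (k + 1) * width) for k, c in bins.items() if c == top)

marks = [4, 7, 7, 8, 5, 7, 8]                            # discrete data
emissions = [101.2, 103.7, 104.1, 108.9, 109.3, 112.4]   # continuous data

print(modes(marks))                 # [7]
print(modes([1, 1, 2, 2, 3]))       # [1, 2], a bimodal case
print(modal_classes(emissions, 5))  # [(100, 105)]
```

No emission reading repeats, so the mode of the raw values is uninformative, while the modal class still shows where the readings concentrate.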

Measures of Dispersion

Why is dispersion important ?

It helps us understand the significance or reliability of the central tendency.

Helps us to compare two or more samples

Range

Interfractile Range

Interquartile Range

Average Deviation Measures

Population Variance

Population Standard Deviation

Standard Score

Sample Variance

Sample Standard Deviation

Range

Range is simply the difference between the highest and lowest values among the observations

Easy to understand

Not terribly useful!

Fractile / Quartile / Percentile

Fractile

A value which is higher than a certain percentage of observations

Median

Is the 0.5 fractile because 50% of the data lies below this value

1st Quartile

25% of the data lies at or below this value

3rd Quartile

75% of the data lies at or below this value

89th percentile

89% of the data lies at or below this value

An n fractile is a value below which a fraction n of the data lies, where n is a number between 0 and 1

1st Quartile = 0.25 fractile or 25th percentile

90th percentile = 9/10 fractile

Calculation of Fractiles

Interquartile Range

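Where does the middle half of the values lie? Between Q1 and Q3, so the interquartile range is Q3 - Q1

Q1: 25% of the data lies below this

Q3: 75% of the data lies below this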
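A hedged sketch of fractiles and the interquartile range. Several conventions exist for interpolating between data points; the one used here (linear interpolation at position p times (n - 1)) is an assumption, and a spreadsheet may use a different rule.

```python
def fractile(values, p):
    """Value below which roughly a fraction p (0 <= p <= 1) of the data lies.

    Uses linear interpolation between neighbouring sorted values,
    which is only one of several common conventions.
    """
    if not values:
        raise ValueError("no data")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # illustrative data only
q1 = fractile(data, 0.25)
q2 = fractile(data, 0.50)            # the median
q3 = fractile(data, 0.75)
iqr = q3 - q1
print(q1, q2, q3, iqr)               # 3.0 5.0 7.0 4.0
```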

Variance & Standard Deviation

Every population has a variance σ² (sigma squared) defined as follows

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2}{N} - \mu^2$$

Where

σ² is the variance and σ is the standard deviation of the population

x is the item or observation

μ is the population mean

N is the number of observations

Σ is the symbol of summation
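The two forms of the population variance can be checked against each other numerically. The data set is invented and is treated as a complete population for the purpose of the sketch.

```python
import math

def population_variance(values):
    """Return the population variance computed two equivalent ways."""
    n = len(values)
    mu = sum(values) / n
    by_definition = sum((x - mu) ** 2 for x in values) / n   # mean squared deviation
    by_shortcut = sum(x * x for x in values) / n - mu ** 2   # mean of squares minus squared mean
    return by_definition, by_shortcut

data = [2, 4, 4, 4, 5, 5, 7, 9]      # illustrative data only, mean is 5
by_definition, by_shortcut = population_variance(data)
sigma = math.sqrt(by_definition)
print(by_definition, by_shortcut, sigma)  # 4.0 4.0 2.0
```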


Here we use the formula provided by the spreadsheet

Significance of σ

The standard deviation σ enables us to determine with a great deal of accuracy where the values of a frequency distribution lie with respect to the mean μ

Chebyshev's Theorem states that for any distribution

at least 75% of all data will lie between μ - 2σ and μ + 2σ


For a smooth, symmetrical (bell-shaped) distribution

about 68% of all data will lie between μ - σ and μ + σ
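Chebyshev's bound can be checked empirically on any data set. The sketch below uses a deliberately lopsided, invented data set treated as a population; `fraction_within` is an illustrative helper.

```python
import statistics

def fraction_within(values, k):
    """Fraction of values lying within k population standard deviations of the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    inside = sum(1 for x in values if mu - k * sigma <= x <= mu + k * sigma)
    return inside / len(values)

# Skewed illustrative data: one value sits far from the rest.
data = [1, 1, 2, 2, 3, 3, 4, 20]
within_two_sigma = fraction_within(data, 2)
print(within_two_sigma)  # 0.875, comfortably above the guaranteed 0.75
```

Even with an extreme value in the data, at least three quarters of the observations fall within two standard deviations of the mean.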



Variance for Grouped Data

Variance σ² is given by

$$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N} = \frac{\sum f_i x_i^2}{N} - \mu^2$$

Where

σ² is the variance and σ is the standard deviation of the population

x_i is the midpoint of the ith class

f_i is the frequency in the ith class

μ is the population mean

N is the number of observations

Σ is the symbol of summation
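A sketch of the grouped-data variance, assuming a hypothetical frequency table given as class midpoints and frequencies. The helper name `grouped_variance` is illustrative.

```python
import math

def grouped_variance(classes):
    """Population variance of grouped data.

    classes: list of (midpoint, frequency) tuples.
    """
    n = sum(freq for _, freq in classes)
    mu = sum(freq * mid for mid, freq in classes) / n
    return sum(freq * (mid - mu) ** 2 for mid, freq in classes) / n

# Hypothetical table: midpoints 5, 15, 25 with frequencies 2, 5, 3 (mean 16).
table = [(5, 2), (15, 5), (25, 3)]
variance = grouped_variance(table)
std_dev = math.sqrt(variance)
print(variance, std_dev)  # (2*121 + 5*1 + 3*81) / 10 = 49.0, square root 7.0
```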


Variance : Population and Sample

Population Statistics

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2}{N} - \mu^2$$

Sample Statistics

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2}{n - 1} - \frac{n \bar{x}^2}{n - 1}$$

Suspiciously similar but not quite ... why is this?
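The only difference between the two formulas is the divisor, which a short sketch makes visible. The data is invented; the usual explanation, stated here as background rather than something from the slides, is that dividing by n - 1 compensates for measuring deviations from the sample mean instead of the unknown population mean.

```python
def variance(values, sample):
    """Variance with divisor n - 1 (sample=True) or N (sample=False)."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / (n - 1 if sample else n)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # illustrative data only
population_var = variance(data, sample=False)   # 32 / 8
sample_var = variance(data, sample=True)        # 32 / 7
print(population_var, sample_var)
```

The sample figure is always the larger of the two, and the gap shrinks as the number of observations grows.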

Population and Sample

Population refers to the totality of all data that is possible

Impossible to get this

So we will never be able to calculate

Either mean μ

Or variance σ²

Sample refers to the data from the population that we can collect

Using this data we can calculate

Sample mean : x̄

Sample variance : s²

Objective of the statistician is to ESTIMATE

the population statistics μ, σ

from the sample statistics x̄, s

Population & Sample Formulae are different in a Spreadsheet

[Spreadsheet figure: sample and population columns]

Last two statistics

Standard Score

A measure of how far an individual piece of data is from the mean

$$\text{Standard Score} = \frac{X - \mu}{\sigma}$$

Question

What would be the mean and standard deviation of the standard score?
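One way to explore the question is to compute the standard scores for a small data set and summarise them. The data is invented and treated as a population; `standard_scores` is an illustrative helper.

```python
import statistics

def standard_scores(values):
    """Distance of each value from the mean, in units of standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]      # illustrative data only: mean 5, sigma 2
scores = standard_scores(data)
score_mean = statistics.fmean(scores)
score_sigma = statistics.pstdev(scores)
print(scores)
print(score_mean, score_sigma)
```

Subtracting the mean centres the scores on zero, and dividing by the standard deviation rescales their spread to one, whatever the original units were.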

Coefficient of Variation of a population

$$\text{Coefficient of Variation} = \frac{\sigma}{\mu} \times 100$$

A relative measure that gives us a feel of how dispersed the data is when compared to the mean
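A sketch comparing two invented data sets that have the same standard deviation but different means, to show why a relative measure is useful. The helper name is illustrative.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sigma / mu * 100

# Same spread (sigma = 2) around very different means (illustrative data only).
small_values = [2, 4, 4, 4, 5, 5, 7, 9]                  # mean 5
large_values = [102, 104, 104, 104, 105, 105, 107, 109]  # mean 105

cv_small = coefficient_of_variation(small_values)
cv_large = coefficient_of_variation(large_values)
print(cv_small, cv_large)
```

A spread of 2 is large relative to a mean of 5 (about 40%) but small relative to a mean of 105 (under 2%).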


prithwis mukerjee

[Chart: Mid Semester Marks]

18/05/2008, 15:43:30
