QT1 - 03 - Measures of Central Tendency


Upload: prithwis-mukerjee

Post on 21-May-2015

Category:

Education


DESCRIPTION

Class notes used in Quantitative Techniques - I course at Praxis Business School, Calcutta

TRANSCRIPT

1. Measures of Central Tendency and Dispersion (Quantitative Techniques)

2. Contents

  • Summary Statistics
  • Measures of Central Tendency
    • Mean
    • Median
    • Mode
  • Measures of Dispersion
    • Range
    • Quartiles
    • Standard Deviation

3. Frequency Distribution

4. Relative Frequency Distribution

  • Frequency of each value can be expressed as a fraction or percentage of the total number of observations
  • This could help us compare data from samples that are of different sizes
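
As a minimal sketch of the idea, the snippet below converts raw counts into fractions of the total so that two samples of different sizes can be compared. The marks in `class_a` and `class_b` are made-up values, and `relative_frequencies` is a hypothetical helper name, not something from the course material.

```python
from collections import Counter

# Hypothetical marks for two classes of different sizes (made-up data).
class_a = [5, 6, 6, 7, 7, 7, 8, 8, 9, 10]
class_b = [6, 7, 7, 8, 8]


def relative_frequencies(values):
    """Return {value: fraction of observations equal to that value}."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in sorted(counts.items())}


rel_a = relative_frequencies(class_a)
rel_b = relative_frequencies(class_b)

# The raw counts of the mark 7 differ (3 vs 2), but the fractions
# can be compared directly even though the classes differ in size.
print(rel_a[7], rel_b[7])
```

Multiplying each fraction by 100 gives the same comparison as percentages.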

5. Let us compare two sets of data: mid-term and end-term marks for the same students

  • What can we say about the results ?
    • Did the class in general fare better ?
    • Did most of the students get similar marks ?

6. Summary Statistics

  • Tables and Graphs illustrate trends and patterns in the data
  • To take hard decisions, we need more exact measures
    • Single numbers
    • Calculated Mathematically
  • 4 Principal Measures
    • Central Tendency
    • Dispersion
    • Skewness
    • Kurtosis
  • Values which can be calculated mathematically

7. Measure of Central Tendency

  • 4 situations where the relative frequency distribution over a certain range is as given in the figure (not reproduced here)
    • In B, the data tends to cluster towards the left and centres around 15
    • In D, the data tends to cluster towards the right and centres around 25
  • Is it really 15 and 25 ?

8. Measure of Dispersion

  • 4 situations where the relative frequency distribution over a certain range is as given in the figure (not reproduced here)
    • In A, the data centres around 15, but is closely dispersed
    • In B, the data centres around 15, but is more widely dispersed
  • What is close and what is wide ?

9. Skewness

  • Skewness is a measure of the lack of symmetry in the data. Not only is the data concentrated in one part of the range, but even there it is asymmetrical
    • Different from central tendency

10. Kurtosis

  • This is a measure of the peakedness of the data !
    • Which curve is more peaked than the other
    • Even though they might have the same central tendency and dispersion

11. Measures of Central Tendency

  • Arithmetic Mean
    • Mean of Grouped Data
  • Weighted Mean
  • Geometric Mean
  • Median
    • Median of Grouped Data
  • Mode

12. The Arithmetic Mean

  • Simply average of all the values
    • calculated by adding the values of all the observations
    • and then dividing by the number of observations
  • $\bar{X} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \dfrac{\sum x}{n}$
  • To calculate this mean, we have to use every piece of data ...
    • For a population this is actually impossible
    • For a sample, this can be quite difficult if the number of observations in the sample is quite high

13. Arithmetic Mean of Grouped Data

    • $\bar{X} = \dfrac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_k x_k}{n} = \dfrac{\sum (f x)}{n}$
    • where
    • $f_i$ = frequency of the i-th class
    • $x_i$ = midpoint of the i-th class
    • $k$ = number of classes
    • $n$ = number of observations ($n = \sum f_i$)
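
The grouped-data mean can be sketched in a few lines. The frequency table here is made up for illustration; only the calculation, sum of frequency times midpoint divided by the number of observations, comes from the slide.

```python
# Hypothetical frequency table (made-up data).
midpoints = [5, 15, 25, 35, 45]      # x_i: midpoint of each class
frequencies = [5, 8, 12, 10, 5]      # f_i: observations in each class

n = sum(frequencies)                  # total number of observations
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

print(n, grouped_mean)
```

Because every observation in a class is treated as if it sat at the midpoint, the result is an approximation of the mean of the raw data.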

14. Arithmetic Mean : Spreadsheet

15. Arithmetic Mean

  • Advantages
    • A single number that represents a whole set of data
    • Intuitive and simple to understand
    • Easy to calculate
    • Allows a quick comparison of two sets of data
  • Disadvantages
    • Affected by extreme values of the data
      • Suppose we have one very high number !
    • Could be difficult to calculate if size of set is high
      • Largely overcome by computers
    • In case of open-ended classes (for example "50 and above"), we cannot compute the mean because the class midpoint is undefined
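
The sensitivity to extreme values listed among the disadvantages can be illustrated with a small sketch. The salary figures are made up; the point is only how far a single very high number pulls the mean.

```python
from statistics import mean

# Hypothetical monthly salaries in Rs thousand (made-up data).
salaries = [18, 20, 21, 22, 24]
with_outlier = salaries + [500]       # add one very high number

mean_before = mean(salaries)
mean_after = mean(with_outlier)

print(mean_before, round(mean_after, 2))
```

Five of the six values lie between 18 and 24, yet the mean of the second list is above 100, so it no longer describes a typical value.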

16. The weighted mean

  • Allows us to calculate an average that factors in the significance or importance of each data item
  • Consider the following
    • 2 Managers
      • Salary of Rs 1 Lakh each
    • 10 Workers
      • Salary of Rs 20K each
    • 1 Peon
      • Salary of Rs 5K
  • What is the average salary per-employee ?
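  • Is it the simple average of the three salary levels ?
    • (1L + 20K + 5K) / 3
  • Or is it the average weighted by the number of employees at each level ?
    • [(2 x 1L) + (10 x 20K) + (1 x 5K)] / 13
  • The second calculation is a more accurate reflection of the average salary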
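
The two candidate calculations can be worked through as a short sketch, using the salary figures from the example (1 Lakh = 100,000 rupees). The variable names are illustrative only.

```python
# (role, number of employees, salary in rupees) from the example.
groups = [
    ("Manager", 2, 100_000),
    ("Worker", 10, 20_000),
    ("Peon", 1, 5_000),
]

# Unweighted: averages the three salary levels, ignoring headcount.
unweighted = sum(salary for _, _, salary in groups) / len(groups)

# Weighted: each salary level counts once per employee who earns it.
headcount = sum(count for _, count, _ in groups)
weighted = sum(count * salary for _, count, salary in groups) / headcount

print(round(unweighted, 2), round(weighted, 2))
```

The unweighted figure is about Rs 41.7K, while the weighted figure is about Rs 31.2K, because ten of the thirteen employees earn Rs 20K.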

17. Weighted Mean Grouped Data

    • $\bar{X} = \dfrac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_k x_k}{n} = \dfrac{\sum (f x)}{n}$
    • where
    • $f_i$ = frequency of the i-th class
    • $x_i$ = midpoint of the i-th class
    • $k$ = number of classes
    • $n$ = number of observations ($n = \sum f_i$)

The weighted mean is logically and mathematically equivalent to the mean of grouped data: the weight plays the role of the class frequency and the value plays the role of the class midpoint. Since we assume that all members of a class have the same value, the lower class boundary, upper class boundary and midpoint are all the same !!

18. Median

  • Measure of Central Tendency that, unlike the mean, is NOT calculated from all the values in the dataset.
  • It is the value of the most central or middle most item.
  • Half the values are above this value ( the median value) and the other half are below
  • To calculate the median
    • Arrange the data in either ascending or descending order
    • Choose the value of the data that lies in the middle of the array.
  • What happens if you have an even number of data values ?
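
One common convention for an even number of values, assumed here since the slide leaves the question open, is to take the average of the two middle values. A minimal sketch with made-up numbers:

```python
def median(values):
    """Middle value of the sorted data; for an even count,
    the mean of the two middle values."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median([7, 3, 9, 1, 5])        # sorted: 1 3 5 7 9
even_case = median([7, 3, 9, 1, 5, 11])   # sorted: 1 3 5 7 9 11

print(odd_case, even_case)
```

In the even case the two middle values are 5 and 7, so the median falls between them and need not be a value present in the data.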

19. Median of Grouped Data

  • Philosophical approach
    • Determine the class where the middle data should lie
      • Use frequency distribution
      • This is the median class
    • Median class has
      • Lower Boundary
      • Upper Boundary
    • Interpolate within the median class !
  • Practical Approach
    • Use this formula
  • Median $= \left( \dfrac{(n+1)/2 - (F+1)}{f_m} \right) w + L_m$
  • $n$ = total number of observations
  • $F$ = cumulative frequency till the previous class
  • $f_m$ = frequency of the median class
  • $w$ = class width
  • $L_m$ = lower limit of the median class
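
The practical approach can be sketched as a small function. This is an illustration under stated assumptions: all classes have the same width, the frequency table is made up, and `grouped_median` is a hypothetical helper name.

```python
def grouped_median(lower_limits, frequencies, width):
    """Median of grouped data: ((n+1)/2 - (F+1)) / f_m * w + L_m.

    Assumes equal class widths and classes listed in ascending order.
    """
    n = sum(frequencies)
    position = (n + 1) / 2            # rank of the middle observation
    cumulative = 0                     # F: cumulative frequency before this class
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= position:
            # This is the median class: interpolate inside it.
            return (position - (cumulative + 1)) / f * width + lower
        cumulative += f
    raise ValueError("frequencies must contain at least one observation")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (made-up data).
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 10, 5]
result = grouped_median(lower_limits, frequencies, width=10)

print(round(result, 2))
```

Here n = 40, so the middle rank is 20.5. The cumulative frequencies are 5, 13 and 25, which places the median in the 20-30 class with F = 13 and f_m = 12.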

20. Median : Spreadsheet

21. Median

  • Advantages
    • Unaffected by extreme values of the data
  • Disadvantages
    • Complex to calculate
    • Some loss of accuracy when you work with grouped data
    • If the data is extremely irregular then not much sense can be made from the median

22. Mode

  • Mode is that value that is seen, or that occurs, most frequently in the data set
    • For discrete distributions this is easy to determine
      • Marks awarded to students in an exam
    • For continuous distributions it is very unlikely that the same value will appear twice
      • CO2 emissions from twenty engines
  • Modal Class
    • For continuous distributions we define classes and note the class that has the highest frequency
    • This is the modal class
    • It is quite possible that two or more modal classes might occur in the case of multimodal distributions
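
Both cases can be sketched briefly: the mode of discrete values, and the modal class of continuous values after binning. The marks, the readings and the class width of 5 are all made up for illustration.

```python
from collections import Counter

# Discrete data: hypothetical exam marks (made-up data).
marks = [6, 7, 7, 8, 8, 8, 9, 10]
mode_value, mode_count = Counter(marks).most_common(1)[0]

# Continuous data: hypothetical readings, binned into classes of width 5.
readings = [101.2, 103.7, 106.1, 107.4, 108.9, 109.3, 112.5, 113.0, 118.6]
width = 5
class_counts = Counter(int(r // width) * width for r in readings)

# More than one class can share the top frequency (multimodal case),
# so collect every class whose count equals the maximum.
top = max(class_counts.values())
modal_classes = sorted(
    lower for lower, count in class_counts.items() if count == top
)

print(mode_value, mode_count, modal_classes)
```

No reading appears twice, so the raw data has no useful mode, but the class starting at 105 holds four of the nine readings and is the modal class.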

23. Measures of Dispersion

  • Why is dispersion important ?
    • It helps us understand the significance or reliability of the central tendency.
    • Helps us to compare two or more samples
  • Range
    • Interfractile Range
    • Interquartile Range
  • Average Deviation Measures
    • Population Variance
      • Population Standard Deviation
    • Standard Score
    • Sample Variance
      • Sample Standard Deviation

24. Range

  • Range is simply the difference between the highest and lowest values among the observations
    • Easy to understand
    • Not terribly useful!

25. Fractile / Quartile / Percentile

  • Fractile
    • A value which is higher than a certain percentage of observations
  • Median
    • Is 0.5 fractile because 50% of the data lies below this value
  • 1st Quartile
    • 25% of the data lies at or below this value
  • 3rd Quartile
    • 75% of the data lies at or below this value
  • 89th percentile
    • 89% of the data lies at or below this value
  • An n fractile is a value below which a fraction n of the data lies
  • and n is a number between 0 and 1
  • 1st Quartile = 1/4 fractile or 25th percentile
  • 90th Percentile = 9/10 fractile
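
Quartiles, and the interquartile range Q3 - Q1 built from them, can be computed with Python's standard library as a rough sketch. Several interpolation conventions exist and spreadsheets may give slightly different values; `method="inclusive"` is an assumption made here, and the data is made up.

```python
from statistics import quantiles

# Hypothetical observations (made-up data).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four equal parts, returning three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                         # interquartile range

print(q1, q2, q3, iqr)
```

The middle cut point `q2` is the 0.5 fractile, that is, the median.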

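26. Calculation of Fractiles

27. Interquartile Range

  • Where do half the values lie ? Between Q1 and Q3: the interquartile range is Q3 - Q1
  • Q1: 25% of the data lies below this value; Q3: 75% of the data lies below this value

28. Variance & Standard Deviation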

  • Every population has a variance $\sigma^2$ (sigma squared), defined as follows
  • $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N} = \dfrac{\sum x^2}{N} - \mu^2$
  • Where
    • $\sigma^2$ is the variance and $\sigma$ is the standard deviation of the population
    • $x$ is the item or observation
    • $\mu$ is the population mean
    • $N$ is the number of observations
    • $\sum$ is the symbol of summation
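
The two forms of the population variance can be checked against each other with a short sketch. The eight values are made up and are treated as a complete population for the purpose of the illustration.

```python
from math import sqrt

# Hypothetical population (made-up data).
data = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(data)
mu = sum(data) / N

# Definition: mean squared deviation from the population mean.
var_definition = sum((x - mu) ** 2 for x in data) / N

# Shortcut: mean of the squares minus the square of the mean.
var_shortcut = sum(x ** 2 for x in data) / N - mu ** 2

sigma = sqrt(var_definition)

print(mu, var_definition, var_shortcut, sigma)
```

Both forms give a variance of 4 and a standard deviation of 2 for this data. The shortcut form needs only the running totals of x and x squared, which is why it suits hand calculation.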

29. Variance & Standard Deviation

  • $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N} = \dfrac{\sum x^2}{N} - \mu^2$

Here we use the formula provided by the spreadsheet

30. Significance of $\sigma$

  • The standard deviation $\sigma$ enables us to determine with a great deal of accuracy where the values of a frequency distribution lie with respect to the mean $\mu$
  • Chebyshev's Theorem states that for any distribution
      • at least 75% of all data will lie between $\mu - 2\sigma$ and $\mu + 2\sigma$
      • at least 89% of all data will lie between $\mu - 3\sigma$ and $\mu + 3\sigma$
  • For a smooth, symmetrical (bell-shaped) distribution
      • about 68% of all data will lie between $\mu - \sigma$ and $\mu + \sigma$
      • about 95% of all data will lie between $\mu - 2\sigma$ and $\mu + 2\sigma$
      • about 99.7% of all data will lie between $\mu - 3\sigma$ and $\mu + 3\sigma$
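
The 75% and 89% figures come from Chebyshev's bound: at least 1 - 1/k² of the data lies within k standard deviations of the mean, for k greater than 1. A minimal sketch, where `chebyshev_lower_bound` is a hypothetical helper name:

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any distribution lying within
    k standard deviations of the mean (informative for k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


bound_2 = chebyshev_lower_bound(2)    # within 2 standard deviations
bound_3 = chebyshev_lower_bound(3)    # within 3 standard deviations

print(bound_2, round(bound_3, 3))
```

These are guaranteed minimums for any distribution, which is why they are lower than the percentages quoted for a symmetrical, bell-shaped distribution.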

31. Variance & Standard Deviation for Grouped Data

  • Variance $\sigma^2$ is given by
  • $\sigma^2 = \dfrac{\sum f_i (x_i - \mu)^2}{N} = \dfrac{\sum f_i x_i^2}{N} - \mu^2$
  • Where
    • $\sigma^2$ is the variance and $\sigma$ is the standard deviation of the population
    • $x_i$ is the midpoint of the i-th class
    • $f_i$ is the frequency in the i-th class
    • $\mu$ is the population mean
    • $N$ is the number of observations
    • $\sum$ is the symbol of summation
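
For grouped data the same two forms apply, with each class midpoint weighted by its frequency. The frequency table in this sketch is made up for illustration.

```python
from math import sqrt

# Hypothetical frequency table (made-up data).
midpoints = [5, 15, 25, 35, 45]      # x_i: midpoint of each class
frequencies = [5, 8, 12, 10, 5]      # f_i: observations in each class

N = sum(frequencies)
mu = sum(f * x for f, x in zip(frequencies, midpoints)) / N

# Definition: frequency-weighted mean squared deviation from the mean.
var_definition = sum(
    f * (x - mu) ** 2 for f, x in zip(frequencies, midpoints)
) / N

# Shortcut: frequency-weighted mean of the squares minus mean squared.
var_shortcut = sum(
    f * x ** 2 for f, x in zip(frequencies, midpoints)
) / N - mu ** 2

sigma = sqrt(var_definition)

print(mu, var_definition, round(sigma, 2))
```

As with the grouped mean, the result is approximate because every observation is treated as if it sat at its class midpoint.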

32. Variance & Standard Deviation for Grouped Data

  • $\sigma^2 = \dfrac{\sum f_i (x_i - \mu)^2}{N} = \dfrac{\sum f_i x_i^2}{N} - \mu^2$

33. Variance : Population and Sample

  • Population Statistics: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N} = \dfrac{\sum x^2}{N} - \mu^2$
  • Sample Statistics: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n-1} = \dfrac{\sum x^2}{n-1} - \dfrac{n \bar{x}^2}{n-1}$

Suspiciously similar but not quite ... why is this ?

34. Population and Sample

  • Population refers to the totality of all data that is possible
    • Impossible to get this
    • So we will never be able to calculate
    • Either the mean $\mu$
    • Or the variance $\sigma^2$
  • Sample refers to the data from the population that we can collect
    • Using this data we can calculate
    • Sample mean : $\bar{x}$
    • Sample variance : $s^2$
  • Objective of the statistician is to ESTIMATE
    • the population statistics $\mu$, $\sigma$
    • from the sample statistics $\bar{x}$, $s$
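
The difference between the two variance formulas is the divisor: N for a population, n - 1 for a sample. Dividing by n - 1 (often called Bessel's correction) offsets the tendency of a sample to understate the spread of the population it came from. As a sketch, Python's standard library exposes both versions; the data is made up.

```python
from statistics import pvariance, variance

# Hypothetical sample drawn from a larger population (made-up data).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

population_style = pvariance(sample)   # divides by n
sample_style = variance(sample)        # divides by n - 1

print(population_style, round(sample_style, 3))
```

The sum of squared deviations is 32 in both cases; dividing by 8 gives 4, while dividing by 7 gives about 4.571. The gap narrows as the sample grows.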

35. Population & Sample Formulae are different in a Spreadsheet

36. Last two statistics

  • Standard Score
  • A measure of how far an individual piece of data is from the mean, in units of standard deviation
  • Standard score $= \dfrac{x - \mu}{\sigma}$
  • Question
    • What would be the mean and standard deviation of the standard score ?
  • Coefficient of Variation of a population
  • Coefficient of variation $= \dfrac{\sigma}{\mu} \times 100$
  • A relative measure that gives us a feel of how dispersed the data is when compared to the mean
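
Both statistics can be sketched on a small made-up population. The snippet also computes the mean and standard deviation of the standard scores themselves, which is one way to explore the question posed above.

```python
from math import sqrt

# Hypothetical population (made-up data).
data = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(data)
mu = sum(data) / N
sigma = sqrt(sum((x - mu) ** 2 for x in data) / N)

# Standard score of each observation: distance from the mean
# expressed in units of standard deviation.
z_scores = [(x - mu) / sigma for x in data]

# Mean and standard deviation of the standard scores themselves.
z_mean = sum(z_scores) / N
z_sigma = sqrt(sum((z - z_mean) ** 2 for z in z_scores) / N)

# Coefficient of variation: standard deviation as a percentage of the mean.
cv = sigma / mu * 100

print(z_mean, z_sigma, cv)
```

For this data the standard scores have mean 0 and standard deviation 1, which holds in general because the scores are centred on the mean and scaled by the standard deviation. The coefficient of variation is 40, meaning the standard deviation is 40% of the mean.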