Describing Data Descriptive Statistics: Central Tendency and Variation

Upload: lester-hampton

Post on 18-Jan-2016

Page 1: Describing Data Descriptive Statistics: Central Tendency and Variation

Describing Data

Descriptive Statistics:

Central Tendency and Variation

Page 2: Describing Data Descriptive Statistics: Central Tendency and Variation

Lecture Objectives

You should be able to:

1. Compute and interpret appropriate measures of centrality and variation.

2. Recognize distributions of data.

3. Apply properties of normally distributed data based on the mean and variance.

4. Compute and interpret covariance and correlation.

Page 3: Describing Data Descriptive Statistics: Central Tendency and Variation

Summary Measures

1. Measures of Central Location: Mean, Median, Mode

2. Measures of Variation: Range, Percentile, Variance, Standard Deviation

3. Measures of Association: Covariance, Correlation

Page 4: Describing Data Descriptive Statistics: Central Tendency and Variation

Measures of Central Location: The Arithmetic Mean

It is the Arithmetic Average of data values.

Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

The Most Common Measure of Central Tendency

Affected by Extreme Values (Outliers)

[Figure: two number lines, one with tick labels 0 to 10 and Mean = 5, the other with tick labels 0 to 10, 12, and 14 and Mean = 6.]

Page 5: Describing Data Descriptive Statistics: Central Tendency and Variation

Median

Important Measure of Central Tendency

In an ordered array, the median is the “middle” number.

If n is odd, the median is the middle number.

If n is even, the median is the average of the 2 middle numbers.

Not Affected by Extreme Values

[Figure: two number lines, one with tick labels 0 to 10 and the other with tick labels 0 to 10, 12, and 14; Median = 5 for both.]
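The contrast between the mean and the median can be checked with a short Python sketch. The two data sets below are hypothetical, since the plotted values are not recoverable from the transcript. They are chosen only so that the results match the slides (mean 5 and 6, median 5 in both cases).

```python
from statistics import mean, median

# Hypothetical data sets (not from the slides), chosen to reproduce
# the summary values shown: mean 5 vs. 6, median 5 in both cases.
without_outlier = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 14]  # largest value replaced by an extreme one

for label, data in [("without outlier", without_outlier),
                    ("with outlier", with_outlier)]:
    # The mean moves with the extreme value; the median does not.
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

Replacing a single value shifts the mean from 5 to 6, while the middle value of the ordered data stays at 5.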

Page 6: Describing Data Descriptive Statistics: Central Tendency and Variation

Mode

A Measure of Central Tendency

Value that Occurs Most Often

Not Affected by Extreme Values

There May Not be a Mode

There May be Several Modes

Used for Either Numerical or Categorical Data

[Figure: two number lines, one with tick labels 0 to 14 and Mode = 9, the other with tick labels 0 to 6 and No Mode.]
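A minimal Python sketch of the three cases (one mode, no mode, several modes) follows. The helper name `modes` and all sample data are hypothetical. The rule "no mode when every value occurs exactly once" is an assumption taken from the No Mode example on the slide.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Assumption: if every value occurs once, there is no mode.
        return []
    return [value for value, count in counts.items() if count == top]

# Hypothetical examples (not the data behind the slide's plots).
print(modes([1, 3, 5, 5, 9, 9, 9, 12]))                # one mode
print(modes([0, 1, 2, 3, 4, 5, 6]))                    # no mode
print(modes(["red", "blue", "red", "blue", "green"]))  # several modes, categorical
```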

Page 7: Describing Data Descriptive Statistics: Central Tendency and Variation

Measures of Variability

Range: the simplest measure

Percentile: used with the Median

Variance/Standard Deviation: used with the Mean

Page 8: Describing Data Descriptive Statistics: Central Tendency and Variation

Range

Difference Between Largest & Smallest Observations: Range = $x_{\text{Largest}} - x_{\text{Smallest}}$

Ignores How Data Are Distributed:

[Figure: two data sets plotted on number lines with tick labels 7 to 12; Range = 12 - 7 = 5 for both.]

Page 9: Describing Data Descriptive Statistics: Central Tendency and Variation

Percentile

Obs  Medals    Obs  Medals    Obs  Medals    Obs  Medals    Obs  Medals
  1     110     12      24     23      10     34       6     45       3
  2     100     13      19     24       9     35       6     46       3
  3      72     14      18     25       8     36       6     47       2
  4      47     15      18     26       8     37       5     48       2
  5      46     16      16     27       7     38       5     49       2
  6      41     17      15     28       7     39       5     50       2
  7      40     18      14     29       7     40       4     51       2
  8      31     19      13     30       6     41       4     52       1
  9      28     20      11     31       6     42       4     53       1
 10      27     21      10     32       6     43       4     54       1
 11      25     22      10     33       6     44       3     55       1

2008 Olympic Medal Tally for top 55 nations. What is the percentile score for a country with 9 medals? What is the 50th percentile?

Page 10: Describing Data Descriptive Statistics: Central Tendency and Variation

Percentile - solutions

Order all data (ascending or descending).

1. Country with 9 medals ranks 24th out of 55. There are 31 nations (56.36%) below it and 23 nations (41.82%) above it. Hence it can be considered a 57th or 58th percentile score.

2. The medal tally that corresponds to a 50th percentile is the one in the middle of the group, or the 28th country, with 7 medals. Hence the 50th percentile (Median) is 7.
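The two solutions above can be reproduced with a short Python sketch. The medal counts are copied from the table; the helper name `share_below_and_above` is hypothetical. It counts observations strictly below and strictly above a value, which is one of several possible percentile conventions.

```python
from statistics import median

# Medal counts for the 55 nations, copied from the table (Obs 1 to Obs 55).
medals = [
    110, 100, 72, 47, 46, 41, 40, 31, 28, 27, 25,
    24, 19, 18, 18, 16, 15, 14, 13, 11, 10, 10,
    10, 9, 8, 8, 7, 7, 7, 6, 6, 6, 6,
    6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3,
    3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1,
]

def share_below_and_above(data, value):
    """Percent of observations strictly below and strictly above value."""
    n = len(data)
    below = sum(1 for x in data if x < value)
    above = sum(1 for x in data if x > value)
    return 100 * below / n, 100 * above / n

pct_below, pct_above = share_below_and_above(medals, 9)
print(f"Below 9 medals: {pct_below:.2f}%, above: {pct_above:.2f}%")
print("50th percentile (median):", median(medals))
```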

Now compute the first and third quartile values.

Page 11: Describing Data Descriptive Statistics: Central Tendency and Variation

Box Plot

The box plot shows 5 points, as follows:

Smallest, Q1, Median, Q3, Largest

Page 12: Describing Data Descriptive Statistics: Central Tendency and Variation

Outliers

Interquartile Range (IQR) = [Q3 – Q1] = 60 - 40 = 20

1 Step = [1.5 * IQR] = 1.5 * 20 = 30

Q1 – 30 = 40 - 30 = 10

Q3 + 30 = 60 + 30 = 90

Any point outside the limits (10, 90) is considered an outlier.

[Figure: box plot with axis labels 20, 40, 50, 60, and 80; the point at 105 is marked as an outlier.]
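The 1.5 * IQR rule can be sketched in Python as follows. The function name `outlier_limits` and the list of points are hypothetical. Only Q1 = 40, Q3 = 60, and the flagged value 105 come from the slide.

```python
def outlier_limits(q1, q3, steps=1.5):
    """Return (lower, upper) limits placed `steps` IQRs beyond the quartiles."""
    iqr = q3 - q1
    step = steps * iqr
    return q1 - step, q3 + step

lower, upper = outlier_limits(40, 60)
print("Limits:", lower, upper)

# Hypothetical observations; only 105 is taken from the slide.
points = [20, 50, 80, 105]
outliers = [p for p in points if p < lower or p > upper]
print("Outliers:", outliers)
```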

Page 13: Describing Data Descriptive Statistics: Central Tendency and Variation

Variance

For the Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

For the Sample: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Variance is in squared units, and can be difficult to interpret. For instance, if data are in dollars, variance is in “squared dollars”.

Page 14: Describing Data Descriptive Statistics: Central Tendency and Variation

Standard Deviation

For the Population: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$

For the Sample: $s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$

Standard deviation is the square root of the variance.

Page 15: Describing Data Descriptive Statistics: Central Tendency and Variation

Computing Standard Deviation

Computing Sample Variance and Standard Deviation

Mean of X = 6

X    Deviation From Mean    Squared
3    -3                     9
4    -2                     4
6     0                     0
8     2                     4
9     3                     9

Sum of Squares (SS) = 26
Variance = SS/(n-1) = 6.50
Stdev = Sqrt(Variance) = 2.55
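The same calculation can be checked in Python. This sketch uses the five X values from the table and the standard library's `statistics` module, where `variance` and `stdev` divide by n-1 (sample) and `pvariance` divides by N (population).

```python
from statistics import mean, pvariance, stdev, variance

x = [3, 4, 6, 8, 9]  # values from the table

x_bar = mean(x)
sum_of_squares = sum((xi - x_bar) ** 2 for xi in x)

print("Mean:", x_bar)
print("Sum of squares:", sum_of_squares)
print("Sample variance (SS/(n-1)):", variance(x))
print("Sample standard deviation:", round(stdev(x), 2))
# For comparison only: the population formula divides by N instead.
print("Population variance (SS/N):", pvariance(x))
```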

Page 16: Describing Data Descriptive Statistics: Central Tendency and Variation

The Normal Distribution

A property of normally distributed data is as follows:

Distance from Mean          Percent of observations included in that range
± 1 standard deviation      Approximately 68%
± 2 standard deviations     Approximately 95%
± 3 standard deviations     Approximately 99.73%
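These percentages follow from the normal distribution itself. As a sketch, the share of a normal population within k standard deviations of the mean equals erf(k / sqrt(2)), which Python's `math` module can evaluate. The function name `share_within` is hypothetical.

```python
import math

def share_within(k):
    """Share of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} standard deviations: {100 * share_within(k):.2f}%")
```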

Page 17: Describing Data Descriptive Statistics: Central Tendency and Variation

Comparing Standard Deviations

[Figure: three dot plots, each on an axis with tick labels 11 to 21.]

Data A: Mean = 15.5, s = 3.338

Data B: Mean = 15.5, s = 0.9258

Data C: Mean = 15.5, s = 4.57

Page 18: Describing Data Descriptive Statistics: Central Tendency and Variation

Outliers

Typically, a number beyond a certain number of standard deviations is considered an outlier.

In many cases, a number more than 3 standard deviations from the mean (about a 0.27% chance of occurring if the data are normally distributed) is considered an outlier.

If identifying an outlier is more critical, one can make the rule more stringent, and consider 2 standard deviations as the limit.

Page 19: Describing Data Descriptive Statistics: Central Tendency and Variation

Coefficient of Variation

$CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%$

Standard deviation relative to the mean.

Helps compare deviations for samples with different means

Page 20: Describing Data Descriptive Statistics: Central Tendency and Variation

Computing CV

Stock A: Average Price last year = $50

Standard Deviation = $5

Stock B: Average Price last year = $100

Standard Deviation = $5

Coefficient of Variation:

Stock A: CV = 10%

Stock B: CV = 5%
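The two CV values can be reproduced with a small Python sketch. The function name `coefficient_of_variation` is hypothetical; the prices and standard deviations are those given for Stock A and Stock B.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean_value

cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B
print(f"Stock A: CV = {cv_a:.0f}%")
print(f"Stock B: CV = {cv_b:.0f}%")
```

Both stocks have the same standard deviation in dollars, but relative to its average price Stock A varies twice as much as Stock B.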

Page 21: Describing Data Descriptive Statistics: Central Tendency and Variation

Standardizing Data

Obs        Age      Income     Z-Age    Z-Income
1          25       25000      -1.05    -1.13
2          28       52000      -0.86    -0.63
3          35       63000      -0.41    -0.43
4          36       74000      -0.34    -0.22
5          39       69000      -0.15    -0.31
6          45       80000       0.23    -0.11
7          48       125000      0.42     0.72
8          75       200000      2.15     2.11

Mean       41.38    86000.00
Std Dev    15.63    53973.54

Which of the two numbers for person 8 is farther from the mean? The age of 75 or the income of 200,000?

Z scores tell us the distance from the mean, measured in standard deviations
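A short Python sketch shows how the Z columns are obtained: subtract the mean and divide by the sample standard deviation. The ages and incomes are copied from the table; the helper name `z_scores` is hypothetical.

```python
from statistics import mean, stdev

ages = [25, 28, 35, 36, 39, 45, 48, 75]
incomes = [25000, 52000, 63000, 74000, 69000, 80000, 125000, 200000]

def z_scores(values):
    """Distance of each value from the mean, in sample standard deviations."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n-1 divisor)
    return [(v - m) / s for v in values]

z_age = z_scores(ages)
z_income = z_scores(incomes)

# Person 8: compare the two standardized distances.
print(f"Z-Age = {z_age[7]:.2f}, Z-Income = {z_income[7]:.2f}")
```

For person 8, the age of 75 is slightly farther from its mean than the income of 200,000 is from its mean, even though the raw numbers are on very different scales.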

Page 22: Describing Data Descriptive Statistics: Central Tendency and Variation

Measures of Association

Covariance and Correlation

         X      Y
Mean     2      9
Stdev    1      3.6

X    Dev    Product    Dev    Y
1    -1     3          -3     6
2     0     0          -1     8
3     1     4           4     13

Sum of products: 7
Covariance: 3.5
Correlation: 0.97

Covariance measures the average product of the deviations of two variables from their means.

Correlation is the standardized form of covariance (divided by the product of their standard deviations).

Correlation is always between -1 and +1.
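The worked example can be reproduced in Python. This is a minimal sketch using the X and Y values from the table; the function names are hypothetical. The covariance divides the sum of products by n-1, which is the divisor that gives the 3.5 shown.

```python
from statistics import mean, stdev

x = [1, 2, 3]
y = [6, 8, 13]

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n-1."""
    mx, my = mean(xs), mean(ys)
    products = [(a - mx) * (b - my) for a, b in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

print("Covariance:", sample_covariance(x, y))
print("Correlation:", round(correlation(x, y), 2))
```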