Descriptive statistics (cont.) - variability


Post on 12-Jan-2016


STAT03 - Descriptive statistics (cont.) - variability

Descriptive statistics (cont.) - variability

Lecturer: Smilen Dimitrov

Applied statistics for testing and evaluation – MED4


Introduction

• We previously discussed measures of central tendency (location) of a data sample (collection) in descriptive statistics – arithmetic mean, median and mode; and also the range as a measure of statistical dispersion (variability)

• Here we continue with other important measures of variability – namely variance and standard deviation

• We will also get acquainted with some parameters leading to their definitions

• We will look at how we perform these operations in R, and a bit more about plotting as well


Variability and deviations

• A measure of variability is perhaps the most important quantity in statistical analysis.
  – The greater the variability in the data, the greater will be our uncertainty in the values of the parameters estimated from the data, and
  – the lower will be our ability to distinguish between competing hypotheses about the data.

• A measure of variability is a single number describing the variability of the data – the ones we are ultimately looking for are variance and standard deviation


Variability and deviations

• Deviations – distances of the individual values in the data sample, from the mean value

• Plotting – using lines in a for loop
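A minimal sketch of such a plot in R, assuming a small illustrative sample (the five numbers used in the degrees-of-freedom example of this lecture). The variable names `x`, `x_mean` and `d` are not from the slides; they are chosen here for illustration.

```r
# Illustrative sample: the five numbers from the degrees-of-freedom example
x <- c(2, 7, 4, 0, 7)
x_mean <- mean(x)        # arithmetic mean (4 for this sample)
d <- x - x_mean          # deviations from the mean

# Plot the data points against their index and mark the mean as a dashed line
plot(seq_along(x), x, pch = 16, xlab = "index", ylab = "value",
     main = "Deviations from the mean")
abline(h = x_mean, lty = 2)

# Draw each deviation as a vertical line from the mean to the data point
for (i in seq_along(x)) {
  lines(c(i, i), c(x_mean, x[i]), col = "red")
}
```

Each red segment has length |d[i]|; segments above the dashed line are positive deviations, segments below it are negative ones.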


Variability and deviations

• The longer the lines – the more variable the data
• Could we use the sum of the deviations as a measure of variability?
• No – because of the definition of arithmetic mean, it is the line positioned such that the sum of the deviations cancels out.
• Quick proof:

$$\sum_{i=1}^{N} d_i = \sum_{i=1}^{N} (x_i - \bar{x}) = \sum_{i=1}^{N} x_i - N\bar{x} = \sum_{i=1}^{N} x_i - N \cdot \frac{1}{N}\sum_{i=1}^{N} x_i = 0$$
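The cancellation can also be checked numerically. The sketch below assumes two small illustrative samples (`x` reuses the five numbers from the degrees-of-freedom example; `y` is made up here to show the floating-point caveat).

```r
# Integer-valued sample: the deviations sum to exactly zero
x <- c(2, 7, 4, 0, 7)
d <- x - mean(x)
sum_d <- sum(d)

# With non-integer data, rounding may leave a tiny non-zero residue,
# so compare with a tolerance instead of testing for exact equality
y <- c(0.1, 0.25, 1.7, 3.3)
sum_is_zero <- isTRUE(all.equal(sum(y - mean(y)), 0))
```

Here `sum_d` is 0 and `sum_is_zero` is `TRUE`, in line with the proof.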


Absolute deviations

• The minus signs of the deviations could be seen as the reason for cancellation of the sum

• We could try using the absolute deviations $D_i = |d_i| = |x_i - \bar{x}|$

• Their sum will obviously be different from 0 (unless all the values are equal).

• However, absolute values are hard to compute with – we need an easier way
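As a small sketch in R, assuming the same illustrative five-number sample as in the degrees-of-freedom example (the name `D` follows the notation above):

```r
# Absolute deviations from the mean for an illustrative sample
x <- c(2, 7, 4, 0, 7)
D <- abs(x - mean(x))    # 2 3 0 4 3
sum_D <- sum(D)          # 12, no longer cancels to zero
```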


Squared deviations and sum of squares

• Squaring the deviations is computationally less intensive: $d_i^2 = (x_i - \bar{x})^2$

• Their sum will, again, be obviously different from 0.

• It is the well-known sum of squares:

$$SS = \sum_{i=1}^{N} (x_i - \bar{x})^2$$

• More properly – it is the sum of squared deviations
• An unscaled, or unadjusted measure of dispersion
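A short sketch of the sum of squares in R, again assuming the illustrative five-number sample from the degrees-of-freedom example; `d2` and `SS` are names chosen here to match the notation above.

```r
# Squared deviations and their sum for an illustrative sample
x <- c(2, 7, 4, 0, 7)
d2 <- (x - mean(x))^2    # 4 9 0 16 9
SS <- sum(d2)            # 38
```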


Scaling the sum of squares – Mean Squared Deviation

• Now, what would happen to the sum of squares if we added an additional data point?
  – It would get bigger (or, at best, stay the same if the new point happens to equal the current mean).

• So usually, the sum of squares will grow with the size of the data collection.
  – That is a manifestation of the fact that it is unscaled.
  – Scaling (also known as normalizing) means adjusting the sum of squares so that it does not grow as the size of the data collection grows.

• We don't want our measure of variability to depend on sample size in this way, so the obvious solution is to divide by the number of data points, N, to get the mean squared deviation (MSD):

$$MSD = s_N^2 = \frac{SS}{N} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$$

• The MSD can be taken to be the wanted variance parameter, but…
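The effect of scaling can be sketched in R. The helpers `ss()` and `msd()` are hypothetical names defined here, not base R functions, and the sample is the illustrative five-number one from the degrees-of-freedom example. Doubling the data set by repeating every value doubles the sum of squares but leaves the MSD unchanged.

```r
# Hypothetical helpers: sum of squares and mean squared deviation
ss  <- function(v) sum((v - mean(v))^2)
msd <- function(v) ss(v) / length(v)

x  <- c(2, 7, 4, 0, 7)   # illustrative sample
x2 <- c(x, x)            # same values, twice as many data points

ss_x   <- ss(x)          # 38
ss_x2  <- ss(x2)         # 76: grows with the size of the collection
msd_x  <- msd(x)         # 7.6
msd_x2 <- msd(x2)        # 7.6: unchanged by the larger sample size
```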


Degrees of freedom

• Suppose we had a sample of five numbers and their average was 4. What was the sum of the five numbers? It must have been 20, otherwise the mean would not have been 4. So now let us think about each of the five numbers in turn:

• We are going to put a number in each of the five boxes.
• If we allow that the numbers could be positive or negative real numbers, we ask how many values could the first number take.

[   ] [   ] [   ] [   ] [   ]


Degrees of freedom

• You will realize it could take any value. Suppose it was a 2.

[ 2 ] [   ] [   ] [   ] [   ]


Degrees of freedom

• How many values could the next number take? It could be anything.

• Say it was a 7.

[ 2 ] [ 7 ] [   ] [   ] [   ]


Degrees of freedom

• And the third number could be anything.

• Suppose it was a 4.

[ 2 ] [ 7 ] [ 4 ] [   ] [   ]


Degrees of freedom

• The fourth number could be anything at all.

• Say it was 0.

[ 2 ] [ 7 ] [ 4 ] [ 0 ] [   ]


Degrees of freedom

• Now, how many values could the last number take?

• Just one: it has to be another 7, because the numbers have to add up to 20 for the mean of the five numbers to be 4 (20 - 2 - 7 - 4 - 0 = 7).

[ 2 ] [ 7 ] [ 4 ] [ 0 ] [ 7 ]


Degrees of freedom

• We have total freedom in selecting the first number - and the second, third and fourth numbers.

• But we have no choice at all in selecting the fifth number.
• We have four degrees of freedom when we have five numbers (and their mean).

• In general we have (n-1) degrees of freedom if we estimated the mean from a sample of size n.

• More generally still, we can propose a formal definition of degrees of freedom: degrees of freedom is the sample size, N, minus the number of parameters, p, estimated from the data.

[ 2 ] [ 7 ] [ 4 ] [ 0 ] [ 7 ]
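The box example can be replayed in R as a small sketch. The variable names (`n`, `target_mean`, `free_values`, `last_value`, `df`) are chosen here for illustration.

```r
# Five numbers with a known mean of 4: four can be chosen freely
n <- 5
target_mean <- 4
free_values <- c(2, 7, 4, 0)

# The fifth number is forced by the requirement that the sum equals n * mean
last_value <- n * target_mean - sum(free_values)   # 7

# Degrees of freedom: sample size minus the number of estimated parameters
p  <- 1            # one parameter (the mean) was estimated from the data
df <- n - p        # 4
```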


Scaling the sum of squares – variance

• The mean is a parameter estimated from the data itself – hence we lose one degree of freedom

• Thus we finally arrive at a definition for variance – sum of squares divided by the degrees of freedom

• The only difference between the MSD and the variance is the divisor: N or N-1, respectively

$$\text{variance} = \frac{\text{sum of squares (SS)}}{\text{degrees of freedom } (N-p)}$$

$$s_{N-1}^2 = \frac{SS}{N-1} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$
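A brief sketch comparing the definition with R's built-in `var()`, which divides by N-1. The sample is the illustrative five-number one from the degrees-of-freedom example, and the names `s2_manual`, `s2_builtin` and `msd` are chosen here.

```r
x  <- c(2, 7, 4, 0, 7)           # illustrative sample
N  <- length(x)
SS <- sum((x - mean(x))^2)       # sum of squares, 38

s2_manual  <- SS / (N - 1)       # variance from the definition, 9.5
s2_builtin <- var(x)             # built-in sample variance (divisor N - 1)
msd        <- SS / N             # mean squared deviation, 7.6 (divisor N)
```

The two variance values agree, while the MSD is smaller by the factor (N-1)/N.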


Standard deviation

• Variance has a unit of measure which is squared (cm²) in relation to the original units (cm)

• Therefore, another measure is used – standard deviation – measured in same units as the data

$$s_{N-1} = \sqrt{s_{N-1}^2} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$$
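In R the same quantity is returned by the built-in `sd()`. A minimal sketch, assuming the illustrative five-number sample from the degrees-of-freedom example (the names `s_manual` and `s_builtin` are chosen here):

```r
x <- c(2, 7, 4, 0, 7)                                      # illustrative sample

s_manual  <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))  # from the definition
s_builtin <- sd(x)                                         # square root of var(x)
```

Both give sqrt(9.5), roughly 3.08, in the same units as the data.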


Sample and population parameters

• Usually you are interested in drawing conclusions about the population from which your (random) sample of data is drawn.

• It is very important to keep in mind the difference between the descriptive statistics that characterise your sample, and the corresponding parameters that characterise the population from which your sample is drawn.

Population (finite, infinite) – "true" parameters
  – mean $\mu$, variance $\sigma^2$, standard deviation $\sigma$
  – Ex. All raisin boxes ever produced by the company/factory

Sample (finite) – estimates of population parameters
  – mean $\bar{x}$, variance $s^2$, standard deviation $s$
  – Ex. The particular data collection for only 17 particular raisin boxes

Needs (probability) distributions


Geometric interpretations - quantity graph

• Standard deviation – same units as the quantity


Geometric interpretations - quantity graph

• Variance – an area, in squared units of the quantity



Geometric interpretation - histogram (frequency count)

• More commonly – geometric interpretation on a histogram.
• Makes it easier to see the spread

• If no deviations – standard deviation is 0 – the whole histogram collapses to a single peak
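A minimal sketch of this interpretation in R, assuming simulated data (the values come from `rnorm()` with a fixed seed and are purely illustrative). The mean is marked with a solid vertical line and the one-standard-deviation range with dashed lines.

```r
set.seed(1)                          # reproducible, purely illustrative data
x <- rnorm(200, mean = 10, sd = 2)
m <- mean(x)
s <- sd(x)

# Density-scaled histogram (bar areas sum to 1)
h <- hist(x, freq = FALSE, xlab = "value",
          main = "Histogram with mean and one-sd range")
abline(v = m, lwd = 2)                 # arithmetic mean
abline(v = c(m - s, m + s), lty = 2)   # mean plus/minus one standard deviation

# A sample with no deviations has standard deviation 0
s_constant <- sd(rep(4, 5))
```

Note that `freq = FALSE` scales the bars as densities. For relative frequencies, the counts in `h$counts` can be divided by `length(x)` instead.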


Review

Descriptive statistics

• Measures of central tendency (location)
  – Arithmetic mean
  – Median
  – Mode

• Measures of statistical variability (dispersion - spread)
  – Range
  – Variance
  – Standard deviation


Exercise for mini-module 3 – STAT03

Exercise

Use the following data: The data in the following table come from three market gardens. The data show the ozone concentrations in parts per hundred million (pphm) on ten consecutive summer days.

• 1. Import the data into R, and for each garden, find the central tendency parameters of the ozone concentrations.

• 2. Using R, for each garden, find dispersion parameters - the sample variance and sample standard deviation.

• 3. Using R, plot the relative frequency histogram for each of the gardens. Mark graphically the arithmetic mean on each graph and the one standard deviation range.
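One possible way to organise steps 1 and 2 is sketched below. The data frame `ozone`, its columns `g1` to `g3` and all values in it are hypothetical placeholders, not the exercise data, and must be replaced with the table from the exercise (for instance after reading it from a file). The helper `stat_mode()` is also defined here, since base R's `mode()` returns the storage mode of an object rather than the most frequent value.

```r
# Hypothetical placeholder values: replace with the ozone table of the exercise
ozone <- data.frame(
  g1 = c(2, 3, 3, 4, 3, 5, 2, 3, 4, 3),
  g2 = c(5, 6, 5, 7, 5, 4, 6, 5, 8, 5),
  g3 = c(1, 9, 2, 8, 2, 10, 2, 9, 1, 2)
)

# Hypothetical helper: the most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[tab == max(tab)])
}

# Step 1: central tendency per garden (one column per garden)
central <- sapply(ozone, function(v) c(mean = mean(v), median = median(v)))
modes   <- lapply(ozone, stat_mode)

# Step 2: dispersion per garden (sample variance and standard deviation)
spread <- sapply(ozone, function(v) c(variance = var(v), sd = sd(v)))
```

`sapply()` returns a matrix with one column per garden, so `central["mean", "g1"]` or `spread["sd", "g3"]` picks out a single statistic.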

Delivery: Deliver the collected data (in tabular format), the found statistics and the requested graphs for the assigned years in an electronic document. You are welcome to include R code as well.