Statistics for Water Science Module 17.1: Descriptive Statistics

Upload: gillian-parks

Post on 28-Dec-2015


Page 1: Statistics for Water Science Module 17.1: Descriptive Statistics

Statistics for Water Science

Module 17.1: Descriptive Statistics


Developed by: Host Updated 2/2004: U5-m17-s2

Module 17: Statistics

Statistics: A branch of mathematics dealing with the collection, analysis, interpretation and presentation of masses of numerical data

• Descriptive Statistics (Lecture 17.1): Basic description of a variable

• Exploratory Data Analysis (Lecture 17.2): Techniques for understanding data

• Hypothesis Testing (Lecture 17.3): Asks the question: is X different from Y?


Descriptive statistics

Describe basic characteristics of a population of numbers:

• Central tendency or “middleness”: means, medians and others

• Variance or “spread” of data: standard deviation

• The range of data: min, max and percentiles

• Simple graphical representations of data


Adapted from Ratti and Garton (1994)

Precision, accuracy and bias

• Precision: Tendency to have values closely clustered around the mean

• Accuracy: Tendency of an estimator to predict the value it was intended to estimate

• Bias: A systematic error in prediction


[Figure: grid of curling targets; panel labels are Unbiased / Biased, Not Precise / Precise, and Accurate / Not Accurate]

• The yellow curling rocks represent means from repeated samples

• Green dots are the mean value

• Spread is analogous to the standard error
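The curling-target picture can be mimicked with a small simulation. The sketch below is only an illustration under assumed values: the "true" value of 18.0, the offsets, the noise levels and the helper names are all hypothetical and do not come from the module. Bias shows up as the distance between the average of the sample means and the true value; precision shows up as the spread of those means.

```python
import random
import statistics

TRUE_VALUE = 18.0  # hypothetical "bullseye", not a value from the module


def repeated_sample_means(offset, noise_sd, n_per_sample=25, n_samples=200, seed=1):
    """Return the means of repeated samples drawn around TRUE_VALUE + offset."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(TRUE_VALUE + offset, noise_sd) for _ in range(n_per_sample)]
        means.append(statistics.mean(sample))
    return means


def bias_and_spread(means):
    """Bias = average miss from the true value; spread = SD of the sample means."""
    bias = statistics.mean(means) - TRUE_VALUE
    spread = statistics.stdev(means)
    return bias, spread


# Four assumed scenarios, one per panel of the target diagram.
scenarios = {
    "unbiased, precise": repeated_sample_means(offset=0.0, noise_sd=0.5),
    "unbiased, not precise": repeated_sample_means(offset=0.0, noise_sd=3.0),
    "biased, precise": repeated_sample_means(offset=2.0, noise_sd=0.5),
    "biased, not precise": repeated_sample_means(offset=2.0, noise_sd=3.0),
}

for label, means in scenarios.items():
    bias, spread = bias_and_spread(means)
    print(f"{label:22s} bias = {bias:+.2f}  spread = {spread:.2f}")
```

In this setup the spread of the sample means should come out near noise_sd divided by the square root of n_per_sample, which is the standard error the slide refers to.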


Finding the middle: The arithmetic mean

• Between 1998 and 2002, the Ice Lake RUSS unit collected 2120 temperature readings at depths of 1-4 m

• What is the average June temperature?

[Figure: histogram of Surface Temperature; x-axis Temperature, y-axis # of Observations]


Finding the middle: The arithmetic mean

Not too hard: add ’em up, divide by n

[Figure: histogram of Surface Temperature; x-axis Temperature, y-axis # of Observations]

Sum of temperatures = 39179.3

Mean = 39179.3 / 2120 = 18.48 C
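The "add ’em up, divide by n" step is short enough to check directly. The sketch below uses the sum and count quoted on the slide; the function name and the short list of readings are illustrative assumptions, not part of the module.

```python
def arithmetic_mean(total, n):
    """Sum of the values divided by the number of values."""
    return total / n


# Figures quoted on the slide: sum of temperatures and number of readings.
mean_temp = arithmetic_mean(39179.3, 2120)
print(round(mean_temp, 2))  # 18.48

# The same idea applied to a raw list (hypothetical readings, in C).
readings = [17.2, 18.9, 19.4]
mean_of_list = arithmetic_mean(sum(readings), len(readings))
```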


Expressing variability: Standard deviation (SD)

• Note that there is ‘scatter’ around the mean

• The standard deviation quantifies how wide or narrow this scatter is

• For this data set, the SD is 2.34 C

• Mean and SD are often combined: 18.48 +/- 2.34 C
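For readers who want to see what goes into an SD, here is a minimal sketch of the sample standard deviation (the n - 1 form, which is also what Python's statistics.stdev computes). The readings are hypothetical, since the 2120 Ice Lake values are not reproduced in the slides, and the function name is an assumption.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation: root of summed squared deviations over n - 1."""
    mean = sum(values) / len(values)
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (len(values) - 1))


readings = [16.1, 18.4, 20.3, 17.9, 21.2, 15.6]  # hypothetical temperatures, C
mean = statistics.mean(readings)
sd = sample_sd(readings)

# Report in the "mean +/- SD" style used on the slide.
print(f"{mean:.2f} +/- {sd:.2f} C")
```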


Comparing data sets

Let’s consider a second data set, shown in blue. This is the mean seasonal temperature in the lower reaches of the lake (8-13 m)

n = 3097


Comparing data sets

Two things to note:

• It’s a lot colder at the bottom of the lake!

• The temperatures are much less variable. Why?


Means and standard deviations for epilimnetic and hypolimnetic temperatures (C)

          Mean    SD
Surface   18.48   2.34
Bottom    5.96    0.85


Standard deviation: Fun facts

• The SD is always in the same units as the mean

• For roughly normally distributed data, about 68% of the values fall within +/- 1 SD of the mean and about 95% within +/- 2 SD

• If the SD is larger than the mean (e.g. 20 +/- 24), your data are pretty flaky

• Definition of flaky: the data are so widely scattered that the mean is, well, meaningless. In this case, use some other measure of middleness, such as the geometric mean or median
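The 68% and 95% figures can be checked by counting how many values fall inside the band. The sketch below does that on simulated bell-shaped data; the simulated readings (generated around the surface mean and SD quoted earlier), the seed and the function name are assumptions made for illustration, not the actual Ice Lake record.

```python
import random
import statistics


def fraction_within(values, k):
    """Fraction of values lying within k sample SDs of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)


rng = random.Random(42)  # fixed seed so the run is repeatable
simulated = [rng.gauss(18.48, 2.34) for _ in range(10_000)]

within_1 = fraction_within(simulated, 1)
within_2 = fraction_within(simulated, 2)
print(f"within 1 SD: {within_1:.1%}, within 2 SD: {within_2:.1%}")
```

On strongly skewed data the same count can land far from 68% and 95%, which is one practical way to spot data that are not well described by a mean and SD.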


Using geometric means: Fecal coliform example

• What about data that are not well behaved?

• Fecal coliform counts are often used by management agencies as an indicator of water quality

• For non-contact water recreation (boating and fishing), Colorado Public Health states that fecal coliform counts shall not exceed 2000 fecal coliforms per 100 mL (based on the geometric mean of representative samples)


The problem

Fecal coliform counts can range over several orders of magnitude.

For such data, the geometric mean is a more appropriate indicator of central tendency.

Boulder Creek Longitudinal Fecal Coliform Profiles for July, 2000

Sample            F. coli. counts
1                 160
2                 700
3                 60
7                 12000
Arithmetic mean   3230


The geometric mean

• Multiply ’em together, take the nth root

• To be honest, this is a pain without a good calculator, but there’s a shortcut…

Geometric mean = (160 * 700 * 60 * 12000)^(1/4)
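The "multiply and take the nth root" definition can be written directly in code. This is a minimal sketch applied to the four Boulder Creek counts shown above; the function name is an assumption.

```python
import math


def geometric_mean_by_root(values):
    """Multiply the values together, then take the nth root of the product."""
    return math.prod(values) ** (1 / len(values))


counts = [160, 700, 60, 12000]  # the four counts from the table
gm = geometric_mean_by_root(counts)
print(round(gm, 1))  # 532.9
```

For long records the product becomes enormous, which is one practical reason to prefer the logarithm route.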

The geometric mean: The easy way

• Take the logarithm of each data point (easy)

• Average the log values (easier)

• Calculate the antilog (sounds hard, is easy)

Sample    F. coli. counts   Log10
1         160               2.20
2         700               2.85
3         60                1.78
7         12000             4.08
Average                     2.727

Antilog = 10^2.727 = 533

The geometric mean is about 533 cells/100 mL

Lower than the state regulatory standard of 2000 cells/100 mL
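The three steps (log, average, antilog) translate almost line for line into code. The sketch below applies them to the four counts listed in the table (160, 700, 60, 12000); for those four values the logs average to about 2.727 and the antilog is about 533. Variable names are assumptions, and the natural-log version is included only to show that the choice of base does not matter.

```python
import math
import statistics

counts = [160, 700, 60, 12000]   # the four counts listed in the table
STANDARD = 2000                   # regulatory limit quoted earlier, per 100 mL

# Step 1: take the logarithm of each data point.
logs = [math.log10(c) for c in counts]

# Step 2: average the log values.
mean_log = statistics.mean(logs)

# Step 3: take the antilog of the average.
gm_base10 = 10 ** mean_log

# Same shortcut with natural logs: exp() is the matching antilog.
gm_natural = math.exp(statistics.mean([math.log(c) for c in counts]))

print([round(x, 2) for x in logs])   # [2.2, 2.85, 1.78, 4.08]
print(round(mean_log, 3))            # 2.727
print(round(gm_base10, 1))           # 532.9
print(gm_base10 < STANDARD)          # True
```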


Fun facts about geometric means

• The geometric mean is never greater than the arithmetic mean, and is less than it unless all the values are identical.

• The ‘shortcut’ calculation works with either natural logs or base 10 logs.

• The geometric mean tends to dampen the effect of very low or very high values, and is useful when values range from 10-10,000 over a given period.

• Excel has a GEOMEAN function. Life is good.

• Use of the geometric mean is a standard for most wastewater discharge and beach monitoring programs: beach standards are typically 200 counts/100 mL.
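Two of these facts are easy to see numerically. The sketch below compares the arithmetic and geometric means of the four Boulder Creek counts, then inflates the largest count tenfold to show how much less the geometric mean moves. The inflated data set and the list of identical values are hypothetical, added only for illustration; statistics.geometric_mean requires Python 3.8 or later.

```python
import math
import statistics

counts = [160, 700, 60, 12000]           # counts from the Boulder Creek table
am = statistics.mean(counts)              # arithmetic mean
gm = statistics.geometric_mean(counts)    # geometric mean
print(am, round(gm, 1))                   # 3230 532.9

# Hypothetical: make the largest count ten times bigger.
inflated = [160, 700, 60, 120000]
am_ratio = statistics.mean(inflated) / am
gm_ratio = statistics.geometric_mean(inflated) / gm
print(round(am_ratio, 2), round(gm_ratio, 2))  # 9.36 1.78

# When every value is the same, the two means coincide.
identical = [200, 200, 200]
same = math.isclose(statistics.geometric_mean(identical), statistics.mean(identical))
```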


Descriptive statistics: Min, Max, and Median

Ice Lake   Mean    SD     Min    Max    Median
Surface    19.59   2.28   12.1   27.1   18.2
Bottom     5.96    0.85   4.3    9.0    5.9


When to use medians: Stream turbidity levels

Background:

• Turbidity in streams makes the water appear cloudy (muddy), mostly from suspended sediments. It’s bad for fish, their eggs and their food (bugs) – particularly cold water species such as brook trout.

• Minnesota Water Pollution Rules set a Chronic Standard of 10 NTU - the highest level to which these organisms can be exposed indefinitely without causing chronic toxicity (see Notes for reference website).

• Tischer Creek is a trout stream in Duluth, MN with a nearly continuous turbidity record in summer/fall 2002. Let’s look at a 30 day period in midsummer and decide what the level of exposure was for the fish.


Medians: the middlemost value

• Prevents being misled by a few very small or very large values

• Consider salaries within a hypothetical company. Which is the more appropriate measure of a typical salary?

Position         Salary
CEO              $350,000
Middle manager   $88,000
Worker 1         $24,000
Worker 2         $22,000
Worker 3         $18,000

Mean             $100,400
Median           $24,000
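The salary example can be recomputed in a few lines. The sketch below uses the five salaries from the table; for these five values the mean works out to $100,400 and the median to $24,000. The dictionary layout is an assumption.

```python
import statistics

# Salaries from the hypothetical company in the table.
salaries = {
    "CEO": 350_000,
    "Middle manager": 88_000,
    "Worker 1": 24_000,
    "Worker 2": 22_000,
    "Worker 3": 18_000,
}

values = list(salaries.values())
mean_salary = statistics.mean(values)      # pulled upward by the CEO
median_salary = statistics.median(values)  # the middle of the sorted list

print(f"Mean   ${mean_salary:,}")    # Mean   $100,400
print(f"Median ${median_salary:,}")  # Median $24,000

# Only one of the five people earns at least the mean.
at_or_above_mean = sum(1 for v in values if v >= mean_salary)
```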


Medians: a real world example
Tischer Creek: July 13 - Aug 12, 2002

~ 30 days straddling the late July storm

Statistic                  Value
Mean                       13.1
Standard Error             0.93
Median                     1.0
Mode                       0.0
Standard Deviation         48.0
Sample Variance            2301.1
Kurtosis                   153.9
Skewness                   9.6
Range                      1017.2
Minimum                    0
Maximum                    1017.2
Sum                        35061.7
Count                      2679
Confidence Level (95.0%)   1.82

[Figure: Tischer Creek turbidity, 13 Jul - 12 Aug 2002, 30 d spanning the late July storm; x-axis Date 2002 (11-Jul to 10-Aug), y-axis Turbidity (NTUs)]

Summary, 30 d (7/13 - 8/12/02): Mean +/- s.e. = 13.1 +/- 0.9; Median = 1; Range = 0.0 - 1017
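Several entries in the Tischer Creek summary are tied together by simple formulas: the standard error is the SD divided by the square root of the count, and the 95% confidence level is roughly 1.96 standard errors. The sketch below checks this from the reported SD and count. Using the normal distribution for the multiplier is an assumption made here (spreadsheet tools may use Student's t instead, which is nearly identical for a sample this large), and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


def ci_half_width(sd, n, level=0.95):
    """Half-width of a confidence interval, using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return z * standard_error(sd, n)


# Values reported in the summary table: SD = 48.0, Count = 2679.
se = standard_error(48.0, 2679)
half = ci_half_width(48.0, 2679)

print(round(se, 2))    # 0.93
print(round(half, 2))  # 1.82
```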


Frequency Distribution: Jul 13 - Aug 12
Tischer Creek, Summer 2002

[Figure: histogram of turbidity readings; x-axis Turbidity (NTUs) in bins from 0 to 957 plus "More", y-axis Frequency (0 to 2500)]

Note that these data are highly skewed, with >80% of the values in the 20-40 NTU range

There is one value of 1017 NTU, but there is no valid reason to delete it.


Tischer Creek –Summer 2002 Storm Period

Stream Data Visualization


Another plot of Tischer from midsummer 2002


Means vs Medians: Which represents the data better?

• The mean of 13 NTU for the 30 day period suggests that the chronic toxicity standard (10 NTU) was violated

• The standard deviation was high (48 NTU) relative to the mean, so the coefficient of variation was a whopping 369%: CV = (48/13)*100

• Although the range was high, from 0 to 1017 NTU, “most of the time” the stream ran clear with values <<10 NTU. The mode (most common value) was in fact 0

• The median value (the 50th percentile) was 1.0 NTU and perhaps best characterizes the state of turbidity in the stream and the level of exposure of the fish

• Determining chronic exposure values for “flashy” data is not trivial
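The coefficient of variation and the comparison against the 10 NTU chronic standard are both one-line calculations. The sketch below reproduces them from the rounded values quoted on the slide; the function and constant names are assumptions.

```python
CHRONIC_STANDARD_NTU = 10  # Minnesota chronic standard quoted in the module


def coefficient_of_variation(sd, mean):
    """SD expressed as a percentage of the mean."""
    return sd / mean * 100


cv = coefficient_of_variation(48, 13)  # rounded SD and mean from the slide
print(round(cv))  # 369

# The two measures of "middleness" lead to opposite conclusions.
mean_exceeds = 13.1 > CHRONIC_STANDARD_NTU    # True
median_exceeds = 1.0 > CHRONIC_STANDARD_NTU   # False
```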


Excel functions for descriptive statistics: Format - @statistic(datarange)

Statistic            Function
Mean                 @average()
Median               @median()
Standard Deviation   @stdev()
Minimum              @min()
Maximum              @max()
Geometric mean       @geomean()
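For readers working outside a spreadsheet, the same six statistics are available in Python's standard library. The sketch below is one possible mapping, not part of the module; the describe helper is hypothetical, statistics.stdev is the sample (n - 1) standard deviation, and statistics.geometric_mean needs Python 3.8 or later.

```python
import statistics


def describe(values):
    """Return the six descriptive statistics listed in the table above."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),        # sample SD (n - 1)
        "min": min(values),
        "max": max(values),
        "geomean": statistics.geometric_mean(values),
    }


# Example: the four fecal coliform counts from the Boulder Creek table.
summary = describe([160, 700, 60, 12000])
for name, value in summary.items():
    print(f"{name:8s} {value:,.1f}")
```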


Upcoming: How can we tell if two populations of numbers are different?