statistics: data presentation & analysis fr clinic i

Post on 21-Dec-2015


Statistics: Data Presentation & Analysis

Fr Clinic I

Overview

• Tables & Graphs
• Populations & Samples
• Mean, Median, & Variance
• Error Bars
  – Standard Deviation, Standard Error & 95% Confidence Interval (CI)
• Comparing Means of Two Populations
• Linear Regression (LR)

Warning

• Statistics is a huge field, I’ve simplified considerably here. For example:
  – Mean, Median, and Standard Deviation
    • There are alternative formulas
  – 95% Confidence Interval
    • There are other ways to calculate CIs (e.g., z statistic instead of t; difference between two means, rather than single mean…)
  – Error Bars
    • Don’t go beyond the interpretations I give here!
  – Comparing Means of Two Data Sets
    • We just cover the t test for two means when the variances are unknown but equal; there are other tests
  – Linear Regression
    • We only look at simple LR and only calculate the intercept, slope and R2. There is much more to LR!

Tables

Table 1: Average Turbidity and Color of Water Treated by Portable Water Filters

Water        Turbidity (NTU)   True Color (Pt-Co)   Apparent Color (Pt-Co)
(1)          (2)               (3)                  (4)
Pond Water   10                13                   30
Sweetwater   4                 5                    12
Hiker        3                 8                    11

• Consistent Format, Title, Units, Big Fonts
• Differentiate Headings, Number Columns

Figures

Figure 1: Turbidity of Pond Water, Treated and Untreated

[Bar chart. x-axis: Filter (Pond Water, Sweetwater, Miniworks, Hiker, Pioneer, Voyager); y-axis: Turbidity (NTU), 0 to 25.]

• Consistent Format, Title, Units
• Good Axis Titles, Big Fonts

Populations and Samples

• Population
  – All possible outcomes of experiment or observation
    • US population
    • Particular type of steel beam
• Sample
  – Finite number of outcomes measured or observations made
    • 1000 US citizens
    • 5 beams
• Use samples to estimate population properties
  – Mean, Variance
    • E.g., Height of 1000 US citizens used to estimate mean height of US population

Central Tendency

• Mean and Median

Data (NTU): 1, 3, 3, 6, 8, 10

Mean = xbar = Sum of values divided by sample size
= (1+3+3+6+8+10)/6 = 5.2 NTU

Median = m = Middle number
Rank:   1 2 3 4 5 6
Number: 1 3 3 6 8 10

For even number of sample points, average middle two
= (3+6)/2 = 4.5 NTU

Excel: Mean – AVERAGE; Median – MEDIAN
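The same calculation can be checked outside Excel. The following is a minimal Python sketch using only the standard library; the variable names are illustrative and are not part of the slides.

```python
import statistics

# Turbidity readings (NTU) from the example above
turbidity = [1, 3, 3, 6, 8, 10]

# Mean: sum of values divided by sample size
mean_ntu = statistics.mean(turbidity)

# Median: middle value; for an even sample size the two middle values are averaged
median_ntu = statistics.median(turbidity)

print(f"mean = {mean_ntu:.1f} NTU, median = {median_ntu:.1f} NTU")
```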

Variability

• Variance, s2
  – Sum of the squares of the deviations about the mean, divided by degrees of freedom
  – $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)$
  – Where $x_i$ = a data point, $\bar{x}$ = xbar (the sample mean), and n = number of data points
• Example (cont.)
  – $s^2 = [(1-5.2)^2 + (3-5.2)^2 + (3-5.2)^2 + (6-5.2)^2 + (8-5.2)^2 + (10-5.2)^2] / (6-1) = 11.8\ \text{NTU}^2$

Excel: Variance – VAR
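As a check on the arithmetic, here is a short Python sketch that computes the sample variance both by the definition (sum of squared deviations over n - 1) and with the standard library's `statistics.variance`. Variable names are illustrative only.

```python
import statistics

turbidity = [1, 3, 3, 6, 8, 10]  # NTU, same data as the example
n = len(turbidity)
xbar = sum(turbidity) / n

# Sum of squared deviations about the mean, divided by degrees of freedom (n - 1)
s2_manual = sum((x - xbar) ** 2 for x in turbidity) / (n - 1)

# Library equivalent (sample variance, also divides by n - 1)
s2_library = statistics.variance(turbidity)

print(f"s2 = {s2_manual:.1f} NTU^2 (library: {s2_library:.1f})")
```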

Error Bars

• Show data variability on plot of mean values
• Types of error bars include:
  – Max/min, ± Standard Deviation, ± Standard Error, ± 95% CI

[Figure: bar chart with error bars. x-axis: Filter Type (Filter 1, Filter 2, Filter 3); y-axis: Turbidity (NTU), 0 to 10.]

Standard Deviation, s

• Square root of variance
• If a phenomenon follows the Normal Distribution (bell curve), 95% of the population lies within 1.96 standard deviations of the mean
• Error bar is s above & below mean

[Figure: Normal Distribution curve. x-axis: Standard Deviations from Mean, -4 to 4; the central 95% lies between -1.96 and 1.96.]

Excel: standard deviation – STDEV
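A brief Python sketch of both points: the standard deviation of the running turbidity example, and a numerical check that about 95% of a normal distribution lies within ±1.96 standard deviations. It assumes Python 3.8 or later for `statistics.NormalDist`.

```python
import statistics
from statistics import NormalDist

turbidity = [1, 3, 3, 6, 8, 10]  # NTU

# Sample standard deviation: square root of the sample variance
s = statistics.stdev(turbidity)

# Fraction of a standard normal distribution within ±1.96 standard deviations
std_normal = NormalDist()  # mean 0, standard deviation 1
fraction_within = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

print(f"s = {s:.2f} NTU")
print(f"fraction within ±1.96 s = {fraction_within:.3f}")
```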

Standard Error of Mean

• Also called St-Err or sxbar
• For sample of size n taken from population with standard deviation estimated as s:
  $s_{\bar{x}} = s / \sqrt{n}$
• As n ↑, sxbar estimate ↓, i.e., estimate of population mean improves
• Error bar is St-Err above & below mean: $\bar{x} \pm s_{\bar{x}}$
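To illustrate how the standard error shrinks as n grows, here is a small Python sketch. The first part uses the six turbidity readings from the earlier example; the second part holds s fixed and varies n, with sample sizes chosen arbitrarily for illustration.

```python
import math
import statistics

turbidity = [1, 3, 3, 6, 8, 10]  # NTU

def standard_error(sample):
    """Standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

se = standard_error(turbidity)
print(f"St-Err = {se:.2f} NTU")

# Same s, larger (hypothetical) sample sizes: the standard error decreases
s = statistics.stdev(turbidity)
se_by_n = {n: s / math.sqrt(n) for n in (6, 24, 96)}
for n, value in se_by_n.items():
    print(f"n = {n:3d}: St-Err = {value:.2f} NTU")
```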

95% Confidence Interval (CI) for Mean

• A 95% Confidence Interval is expected to contain the population mean 95% of the time (i.e., of 95% CIs from 100 samples, about 95 are expected to contain the population mean)
• t95%,n-1 is a statistic for 95% CI from sample of size n
  – t95%,n-1 = TINV(0.05,n-1)
  – If n ≥ 30, t95%,n-1 ≈ 1.96 (Normal Distribution)
• 95% CI: $\bar{x} \pm t_{95\%,n-1}\, s_{\bar{x}}$
• Error bar is $t_{95\%,n-1}\, s_{\bar{x}}$ above & below mean
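The "95% of the time" interpretation can be illustrated by simulation. The sketch below draws many samples of size 5 from a hypothetical normal population (mean 5.0, standard deviation 2.0, both made up for illustration), builds a 95% CI from each, and counts how often the interval contains the true mean. The t value 2.776 is the standard two-tailed table value for 4 degrees of freedom, i.e. what TINV(0.05, 4) should return; it is hard-coded because the Python standard library has no t distribution.

```python
import math
import random
import statistics

T_95_DF4 = 2.776  # two-tailed 95% t value for n - 1 = 4 (from a t table)

def ci_half_width(sample, t_value):
    """Half-width of the CI: t * s / sqrt(n)."""
    return t_value * statistics.stdev(sample) / math.sqrt(len(sample))

random.seed(1)  # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 5.0, 2.0  # hypothetical population parameters
N, TRIALS = 5, 2000

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    xbar = statistics.mean(sample)
    half_width = ci_half_width(sample, T_95_DF4)
    if xbar - half_width <= POP_MEAN <= xbar + half_width:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of the intervals contain the population mean")
```

The printed fraction should land close to 95%, though it varies slightly with the seed and number of trials.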

Using Error Bars to compare data

• Standard Deviation
  – Demonstrates data variability, but no comparison possible
• Standard Error
  – If bars overlap, any difference in means is not statistically significant
  – If bars do not overlap, indicates nothing!
• 95% Confidence Interval
  – If bars overlap, indicates nothing!
  – If bars do not overlap, difference is statistically significant
• We’ll use 95% CI in this class
  – Any time you have 3 or more data points, determine mean, standard deviation, standard error, and t95%,n-1, then plot mean with error bars showing the 95% confidence interval

Adding Error Bars to an Excel Graph

• Create Graph
  – Column, scatter,…
• Select Data Series
• In Layout Tab-Analysis Group, select Error Bars
• Select More Error Bar Options
• Select Custom and Specify Values and select cells containing the values of $t_{95\%,n-1}\, s_{\bar{x}}$

Example 1: 95% CI

Turbidity Data +/- 95% CI

           1     2     3     mean   St Dev   n   St-Err   t95%,2   t95%,2 × St-Err
Filter 1   2.1   2.1   2.2   2.1    0.06     3   0.03     4.30     0.14
Filter 2   3.2   4.4   5     4.2    0.92     3   0.53     4.30     2.28
Filter 3   4.3   4.2   4.5   4.3    0.15     3   0.09     4.30     0.38

(Readings, mean, St Dev, and St-Err in NTU.)

[Figure: bar chart of mean turbidity with 95% CI error bars. x-axis: Portable Water Filter (Filter 1: 2.1, Filter 2: 4.2, Filter 3: 4.3); y-axis: Turbidity (NTU), 0.0 to 7.0.]
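The columns of the Example 1 table can be reproduced step by step. This Python sketch takes the three readings per filter and the t value 4.30 directly from the table (t95%,2, i.e. TINV(0.05, 2)); the dictionary layout and names are illustrative.

```python
import math
import statistics

T_95_DF2 = 4.30  # t95%,2 as listed in the table

readings = {  # turbidity readings in NTU, from the table
    "Filter 1": [2.1, 2.1, 2.2],
    "Filter 2": [3.2, 4.4, 5.0],
    "Filter 3": [4.3, 4.2, 4.5],
}

results = {}
for name, values in readings.items():
    n = len(values)
    mean = statistics.mean(values)
    st_dev = statistics.stdev(values)
    st_err = st_dev / math.sqrt(n)      # standard error of the mean
    half_width = T_95_DF2 * st_err      # length of the 95% CI error bar
    results[name] = (mean, st_dev, st_err, half_width)
    print(f"{name}: mean={mean:.1f}  s={st_dev:.2f}  "
          f"St-Err={st_err:.2f}  95% CI = ±{half_width:.2f} NTU")
```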

What can we do?

• Lift weight multiple times using different solar panel combinations (or hydroturbines, or gear boxes) and plot mean and 95% Confidence Interval error bars.
  – If error bars overlap between two different test conditions, indicates nothing!
  – If error bars do not overlap, difference is statistically significant
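The overlap rule is easy to apply mechanically. As a sketch, the Python below applies it to the rounded means and 95% CI half-widths from the Example 1 table; the function names are made up for illustration.

```python
def ci_bounds(mean, half_width):
    """Lower and upper ends of an error bar centered on the mean."""
    return (mean - half_width, mean + half_width)

def intervals_overlap(a, b):
    """True if intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Mean and 95% CI half-width (NTU) from the Example 1 table
ci = {
    "Filter 1": ci_bounds(2.1, 0.14),
    "Filter 2": ci_bounds(4.2, 2.28),
    "Filter 3": ci_bounds(4.3, 0.38),
}

pairs = [("Filter 1", "Filter 2"), ("Filter 1", "Filter 3"), ("Filter 2", "Filter 3")]
for first, second in pairs:
    if intervals_overlap(ci[first], ci[second]):
        verdict = "bars overlap: indicates nothing"
    else:
        verdict = "bars do not overlap: difference is statistically significant"
    print(f"{first} vs {second}: {verdict}")
```

With these numbers only Filter 1 and Filter 3 have non-overlapping bars; the wide interval for Filter 2 overlaps both of the others, so no conclusion follows for those pairs.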

T Test

• A more sophisticated way to compare means
• Use t test to determine if means of two populations are different
  – E.g., lift times with different solar panel combinations or turbines or…

Comparing Two Data Sets using the t test

• Example - You lift weight with two panels in series and two in parallel.
  – Series: Mean = 2 min, s = 0.5 min, n = 20
  – Parallel: Mean = 3 min, s = 0.6 min, n = 20
• You ask the question - Do the different panel combinations result in different lift times?
  – Different in a statistically significant way

Are the Lift Times Different?

• Use TTEST (Excel)
• TTEST returns the fractional probability of being wrong if you claim the two populations are different
  – We’ll say they are significantly different if probability is ≤ 0.05

Lift times (min):

Series   Parallel
1.5      3
2        2.4
2.2      2.2
1.8      2.6
3        3.4
1.6      3.6
1.2      3.8
2.1      3.5
1.9      2.7
2.2      2.4
2.6      3.5
1.7      3.8
1.8      2.1
1.5      2.5
2.4      3.4
2.5      3.3
2.7      2.4
1.4      3.6
1.5      2.3
2.6      3.7
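For readers without Excel, the sketch below computes the pooled (equal-variance) two-sample t statistic for the lift-time data in Python. Because the standard library has no t distribution, it does not compute TTEST's probability; instead it compares |t| with the two-tailed 5% critical value for 38 degrees of freedom, taken from a standard t table as about 2.024. At the 0.05 level that comparison gives the same decision. The two-tailed, equal-variance choice is an assumption based on the Warning slide.

```python
import math
import statistics

# Lift times in minutes, from the data columns above
series = [1.5, 2.0, 2.2, 1.8, 3.0, 1.6, 1.2, 2.1, 1.9, 2.2,
          2.6, 1.7, 1.8, 1.5, 2.4, 2.5, 2.7, 1.4, 1.5, 2.6]
parallel = [3.0, 2.4, 2.2, 2.6, 3.4, 3.6, 3.8, 3.5, 2.7, 2.4,
            3.5, 3.8, 2.1, 2.5, 3.4, 3.3, 2.4, 3.6, 2.3, 3.7]

def pooled_t(a, b):
    """Two-sample t statistic assuming equal (unknown) variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance: degrees-of-freedom-weighted average of the two sample variances
    sp2 = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / df
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, df

T_CRIT_95_DF38 = 2.024  # two-tailed 5% critical value for 38 df (t-table value)

t_stat, df = pooled_t(series, parallel)
significant = abs(t_stat) > T_CRIT_95_DF38
print(f"t = {t_stat:.2f}, df = {df}, significantly different at 0.05: {significant}")
```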

Marbles

Linear Regression

• Fit the best straight line to a data set

[Figure: scatter plot with linear trendline. x-axis: Height (m), 0 to 12; y-axis: Grade Point Average, 0 to 25. Trendline: y = 1.897x + 0.8667, R2 = 0.9762.]

Right-click on data point and select “trendline”. Select options to show equation and R2.
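What the trendline computes can be sketched in a few lines of Python. This is the ordinary least-squares formula for a straight line; the data points below are invented for illustration and are not the data behind the figure.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = ybar - slope * xbar  # the fitted line passes through (xbar, ybar)
    return slope, intercept

# Hypothetical data, roughly following y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.3f}x + {intercept:.3f}")
```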

R2 - Coefficient of multiple Determination

• $R^2 = \sum_{i} (\hat{y}_i - \bar{y})^2 / \sum_{i} (y_i - \bar{y})^2$
  – $\hat{y}_i$ = Predicted y values, from regression equation
  – $y_i$ = Observed y values
  – $\bar{y}$ = ybar = mean of y
• R2 = fraction of variance explained by regression
  – R2 = 1 if data lies along a straight line
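The R2 definition can be checked numerically. The sketch below fits a least-squares line, then forms the ratio of explained to total sum of squares. Both data sets are hypothetical: one lies exactly on a line (so R2 should be 1), the other has some scatter.

```python
def r_squared(xs, ys):
    """R2 for a simple least-squares line fitted to (xs, ys)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    intercept = ybar - slope * xbar
    y_hat = [slope * x + intercept for x in xs]        # predicted y values
    explained = sum((yh - ybar) ** 2 for yh in y_hat)  # sum of (y_hat_i - ybar)^2
    total = sum((y - ybar) ** 2 for y in ys)           # sum of (y_i - ybar)^2
    return explained / total

# Points exactly on y = 2x + 1: R2 should be 1
r2_line = r_squared([1, 2, 3], [3, 5, 7])

# Hypothetical data with some scatter: R2 is a little below 1
r2_noisy = r_squared([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])

print(f"exact line: R2 = {r2_line:.4f}")
print(f"noisy data: R2 = {r2_noisy:.4f}")
```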
