
Upload: claud-wood

Post on 24-Dec-2015


Page 1: Statistics Introduction 1.)All measurements contain random error  results always have some uncertainty 2.)Uncertainty are used to determine if two or

Statistics Introduction

1.) All measurements contain random error, so results always have some uncertainty

2.) Uncertainties are used to determine if two or more experimental results are equivalent or different

Statistics is used to accomplish this task

Masuzaki, H., et al. Science (2001), 294(5549), 2166

Is the mutant (transgenic) mouse significantly fatter than the normal (wild-type) mouse?

Statistical Methods Provide Unbiased Means to Answer Such Questions.

Page 2:

Statistics Gaussian Curve

1.) For a series of experimental results with only random error:

(i) A large number of experiments done under identical conditions will yield a distribution of results.

(ii) Distribution of results is described by a Gaussian or Normal Error Curve

[Figure: Gaussian curve of Number of Occurrences versus Value. High population about correct value; low population far from correct value.]

Page 3:

Statistics Gaussian Curve

2.) Any set of data (and corresponding Gaussian curve) can be characterized by two parameters:

(i) Mean or Average Value ($\bar{x}$)

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where: n = number of data points
$x_i$ = value of data point number i
$\sum_{i=1}^{n} x_i$ = value_1 + value_2 + value_3 + … + value_n

(ii) Standard Deviation (s)

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The smaller the standard deviation, the more precise the measurement.

Page 4:

Statistics Gaussian Curve

3.) Other Terms Used to Describe a Data Set

(i) Variance: Related to the standard deviation

Used to describe how “wide” or precise a distribution of results is

variance = $s^2$

where: s = standard deviation

(ii) Range: difference between the highest and lowest values in a set of data

Example: measurements of 4 light bulb lifetimes

821, 783, 834, 855 hours

High Value = 855 hours
Low Value = 783 hours

Range = High Value – Low Value = 855 – 783 = 72 hours

Page 5:

Statistics Gaussian Curve

3.) Other Terms Used to Describe a Data Set

(iii) Median: The value in a set of data which has an equal number of data values above it and below it

For an odd number of data points, the median is the middle value

For an even number of data points, the median is the value halfway between the two middle values

Example:

Data Set: 1.19, 1.23, 1.25, 1.45, 1.51
mean ($\bar{x}$) = 1.33
median = 1.25

Data Set: 1.19, 1.23, 1.25, 1.45
mean ($\bar{x}$) = 1.28
median = 1.24

Page 6:

Statistics Gaussian Curve

(iii) Example:

For the following bowling scores 116.0, 97.9, 114.2, 106.8 and 108.3, find the mean, median, range and standard deviation.
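One way to work this example is with Python's standard `statistics` module. The sketch below is only an illustration, not part of the original slides; the variable names are chosen here for readability. `statistics.stdev` uses the (n - 1) denominator, matching the definition of s given earlier.

```python
import statistics

# Bowling scores from the example
scores = [116.0, 97.9, 114.2, 106.8, 108.3]

mean = statistics.mean(scores)            # sum of values / n
median = statistics.median(scores)        # middle value of the sorted data
data_range = max(scores) - min(scores)    # high value - low value
s = statistics.stdev(scores)              # sample standard deviation (n - 1)

print(f"mean   = {mean:.1f}")
print(f"median = {median:.1f}")
print(f"range  = {data_range:.1f}")
print(f"s      = {s:.1f}")
```

Rounded to one decimal place, this gives a mean of 108.6 and a standard deviation of 7.1, the values quoted for this bowler in the later examples.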

Page 7:

Statistics Gaussian Curve

4.) Relating Terms Back to the Gaussian Curve

(i) Formula for a Gaussian curve

$$y = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where e = base of natural logarithm (2.71828…)
$\mu \approx \bar{x}$ (mean)
$\sigma \approx s$ (standard deviation)

[Figure: Gaussian curve centered on the mean, with ± standard deviation marked]

Entire area under curve is normalized to one
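The normalization can be checked numerically. The following is a minimal sketch, assuming illustrative values mu = 10.0 and sigma = 2.0 (hypothetical, not from the slides) and a simple trapezoid-rule integration. It evaluates the Gaussian formula and sums the area under the curve.

```python
import math

def gaussian(x, mu, sigma):
    """Height y of the Gaussian curve at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def area(mu, sigma, lo, hi, steps=20000):
    """Trapezoid-rule estimate of the area under the curve between lo and hi."""
    h = (hi - lo) / steps
    total = 0.5 * (gaussian(lo, mu, sigma) + gaussian(hi, mu, sigma))
    for k in range(1, steps):
        total += gaussian(lo + k * h, mu, sigma)
    return total * h

# Hypothetical mean and standard deviation, chosen only for illustration
mu, sigma = 10.0, 2.0

total_area = area(mu, sigma, mu - 8 * sigma, mu + 8 * sigma)
one_sigma_area = area(mu, sigma, mu - sigma, mu + sigma)

print(f"total area       = {total_area:.4f}")
print(f"area within ±1 σ = {one_sigma_area:.4f}")
```

The total area comes out very close to 1 whatever mean and standard deviation are chosen, which is what "normalized to one" means.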

Page 8:

Statistics Standard Deviation and Probability

1.) By knowing the standard deviation (s) and the mean ($\bar{x}$) of a set of results (and the corresponding Gaussian curve)

(i) The probability of the next result falling in any given range can be calculated by:

$$z = \frac{x - \bar{x}}{s}$$

(ii) The probability of a result falling in that portion of the Gaussian curve is equal to the normalized area of the curve in that portion.

(iii) Example:

68.3% of the area of a Gaussian curve occurs between the values -1s and +1s ($\bar{x}$ ± 1s)

Thus, any new result has a 68.3% chance of falling within this range.

Standard Deviation (s)    Probability
±1s                       68.3%
±2s                       95.5%
±3s                       99.7%
±4s                       99.99%

Probability of measuring a value in a certain range is equal to the area of that range
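The percentages in the table can be reproduced from the error function. For a Gaussian curve, the fraction of the area within ±k standard deviations of the mean is erf(k/√2). The sketch below uses only Python's standard `math.erf`; the helper name `fraction_within` is introduced here for illustration.

```python
import math

def fraction_within(k):
    """Fraction of the area of a Gaussian curve within ±k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(f"±{k}s: {100 * fraction_within(k):.2f}%")
```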

Page 9:

Statistics Standard Deviation and Probability

$$z = \frac{x - \bar{x}}{s}$$

- Area: area under the curve between the mean value and the result.
- The total area of each half of the curve is 0.5.
- The remaining area beyond the result is 0.5 – Area.
- Example:

z = 1.3
area from mean to z = 1.3 is 0.403
area from z = 1.3 to infinity is 0.5 – 0.403 = 0.097

Page 10:

Statistics Standard Deviation and Probability

(iii) Example: A bowler has a mean score of 108.6 and a standard deviation of 7.1.

What fraction of the bowler’s scores will be less than 80.2?
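A possible way to work this example is to convert the score to a z value and then take the area of the Gaussian curve below it. The sketch assumes the scores follow a Gaussian distribution and uses `math.erf` for the cumulative area; the helper `normal_cdf` is a name introduced here, not something from the slides.

```python
import math

def normal_cdf(z):
    """Area under the standard Gaussian curve from minus infinity up to z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Check against the earlier z = 1.3 example: area from the mean to z
area_mean_to_z = normal_cdf(1.3) - 0.5

# Bowler example: mean 108.6, standard deviation 7.1, score of interest 80.2
mean, s = 108.6, 7.1
x = 80.2
z = (x - mean) / s               # number of standard deviations from the mean
fraction_below = normal_cdf(z)   # area of the curve below x

print(f"area from mean to z = 1.3: {area_mean_to_z:.3f}")
print(f"z = {z:.1f}")
print(f"fraction of scores below {x}: {fraction_below:.1e}")
```

The score 80.2 lies 4 standard deviations below the mean, so only about 0.003% of the bowler's scores are expected to fall below it.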

Page 11:

Statistics Standard Deviation and Probability

2.) Knowing the standard deviation (s) of a data set indicates the precision of a measurement

(i) Common intervals used for expressing analytical results are shown below:

Range             Percent of Results Expected in Range
$\bar{x}$ ± 1s    68.3% (31.7% outside)
$\bar{x}$ ± 2s    95.5% (4.5% outside)
$\bar{x}$ ± 3s    99.7% (0.3% outside)

(ii) The precision of many analytical measurements is expressed as:

$$\bar{x} \pm 2s$$

There is only a ~5% chance (1 out of 20) that any given measurement on the sample will be outside of this range

Page 12:

Statistics Standard Deviation and Probability

4.) The precision of a mean (average) result is expressed using a confidence interval

(i) Relationship between the true mean value ($\mu$) and the measured mean ($\bar{x}$) is given by:

$$\mu = \bar{x} \pm \frac{ts}{\sqrt{n}}$$

The term $\pm \frac{ts}{\sqrt{n}}$ is the confidence interval

where: s = standard deviation
n = number of measurements
t = Student’s t value
degrees of freedom = (n - 1)

Note: As n increases, the confidence interval becomes smaller ($\mu$ becomes more precisely known)

Page 13:

Statistics Standard Deviation and Probability

4.) The precision of a mean (average) result is expressed using a confidence interval

(ii) Student’s t

Statistical tool frequently used to express confidence intervals

A probability distribution that addresses the problem of estimating the mean of a normally distributed population when the sample size is small.

Population standard deviation ($\sigma$) is unknown and has to be estimated from the data using s.

Degrees of freedom come from the number of measurements (n - 1)

Page 14:

Statistics Standard Deviation and Probability

4.) The precision of a mean (average) result is expressed using a confidence interval

(iii) The meaning of Confidence Interval

To determine the “true” mean, an infinite number of data points would need to be collected.
- obviously not possible

Confidence interval tells us the probability that the range of numbers contains the “true” mean.

50% confidence interval: range of numbers contains the true mean only 50% of the time
90% confidence interval: range of numbers contains the true mean 90% of the time

[Figure: confidence intervals from repeated data sets compared with the “true” mean. At 50% confidence, 50% of data sets do not contain the true mean.]

Page 15:

Statistics Standard Deviation and Probability

(iii) Example: For the bowling scores 116.0, 97.9, 114.2, 106.8 and 108.3, the bowler has a mean score of 108.6 and a standard deviation of 7.1.

What is the 90% confidence interval for the mean?
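A sketch of this calculation follows. The slides do not list the Student's t value needed here, so the code assumes t = 2.132 (90% confidence, 4 degrees of freedom) taken from a standard t table; that value is hard-coded and labeled as an assumption.

```python
import math
import statistics

scores = [116.0, 97.9, 114.2, 106.8, 108.3]

n = len(scores)
mean = statistics.mean(scores)
s = statistics.stdev(scores)   # sample standard deviation (n - 1)

# Assumption: Student's t for 90% confidence and n - 1 = 4 degrees of freedom,
# taken from a standard t table (not given in the slides)
t_90 = 2.132

half_width = t_90 * s / math.sqrt(n)   # the ± t s / sqrt(n) term

print(f"90% confidence interval: {mean:.1f} ± {half_width:.1f}")
print(f"from {mean - half_width:.1f} to {mean + half_width:.1f}")
```

With these inputs the interval is roughly 108.6 ± 6.8, that is, from about 101.8 to 115.4.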

Page 16:

Statistics Standard Deviation and Probability

5.) Comparison of Two Data Sets

(i) To determine if two results obtained by the same method are statistically the same, use the following formula to determine a calculated t:

$$t_{calculated} = \frac{|\bar{x}_1 - \bar{x}_2|}{s_{pooled}} \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

where: $\bar{x}_1, \bar{x}_2$ = mean results of samples 1 & 2
$n_1, n_2$ = number of measurements of samples 1 & 2
$s_{pooled}$ = “pooled” standard deviation

$$s_{pooled} = \sqrt{\frac{\sum_{\text{Set 1}} (x_i - \bar{x}_1)^2 + \sum_{\text{Set 2}} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}}$$

Requires that the standard deviations of the two data sets be similar.

Page 17:

Statistics Standard Deviation and Probability

5.) Comparison of Two Data Sets

(ii) Compare calculated t to the corresponding value in the Student’s t probability table.

Use the desired % confidence level at the appropriate degrees of freedom

Degrees of Freedom = ($n_1 + n_2 - 2$)

(iii) If calculated t is greater than the value in the Student’s t probability table, then the two results are significantly different at the given % confidence level.

Easier to achieve for lower % confidence level

For the two results to be considered the same, calculated t needs to be less than the table value

Page 18:

Statistics Standard Deviation and Probability

5.) Comparison of Two Data Sets

(iv) Example:

The amount of $^{14}$CO$_2$ in a plant sample is measured to be: 28, 32, 27, 39 & 40 counts/min (mean = 33.2). The amount of radioactivity in a blank is found to be: 28, 21, 28, & 20 counts/min (mean = 24.2). Are the mean values significantly different at a 95% confidence level?

$$s_{pooled} = \sqrt{\frac{\sum_{\text{Set 1}} (x_i - \bar{x}_1)^2 + \sum_{\text{Set 2}} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}}$$

$$s_{pooled} = \sqrt{\frac{(28-33.2)^2 + (32-33.2)^2 + (27-33.2)^2 + (39-33.2)^2 + (40-33.2)^2 + (28-24.2)^2 + (21-24.2)^2 + (28-24.2)^2 + (20-24.2)^2}{5 + 4 - 2}}$$

$$s_{pooled} = 5.4$$

$$t_{calculated} = \frac{|\bar{x}_1 - \bar{x}_2|}{s_{pooled}} \sqrt{\frac{n_1 n_2}{n_1 + n_2}} = \frac{|33.2 - 24.2|}{5.4} \sqrt{\frac{(5)(4)}{5 + 4}} = 2.48$$

Page 19:

Statistics Standard Deviation and Probability

5.) Comparison of Two Data Sets

(iv) Example:

Degrees of Freedom = (5 + 4 – 2) = 7

From Student’s t probability table:

Degrees of Freedom (7), 95% Confidence level: t = 2.365

Calculated t (2.48) > 2.365

The results are significantly different at a 95% confidence level, but not at 98% or higher confidence levels
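The comparison can be reproduced in a few lines. This is a sketch under the same assumption as the slides (similar standard deviations in both sets); the function name `pooled_t` is introduced here for illustration. Because the code carries unrounded intermediate values, it gives t ≈ 2.47 rather than the 2.48 obtained from the rounded means and pooled standard deviation; the conclusion at 95% confidence is the same.

```python
import math

def pooled_t(set1, set2):
    """Return (calculated t, pooled standard deviation, degrees of freedom)."""
    n1, n2 = len(set1), len(set2)
    mean1 = sum(set1) / n1
    mean2 = sum(set2) / n2
    # Sum of squared deviations of each set about its own mean
    ss = sum((x - mean1) ** 2 for x in set1) + sum((x - mean2) ** 2 for x in set2)
    dof = n1 + n2 - 2
    s_pooled = math.sqrt(ss / dof)
    t = abs(mean1 - mean2) / s_pooled * math.sqrt(n1 * n2 / (n1 + n2))
    return t, s_pooled, dof

plant = [28, 32, 27, 39, 40]   # counts/min, plant sample
blank = [28, 21, 28, 20]       # counts/min, blank

t_calc, s_pooled, dof = pooled_t(plant, blank)
t_table = 2.365                # Student's t, 95% confidence, 7 degrees of freedom

significant = t_calc > t_table
print(f"s_pooled = {s_pooled:.2f}, t = {t_calc:.2f}, degrees of freedom = {dof}")
print("significantly different at 95%" if significant else "not significantly different at 95%")
```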

Page 20:

Statistics Standard Deviation and Probability

6.) Comparison of Two Methods

(i) To determine if the results of two methods for the same sample are the same, use the following formula to determine a calculated t:

$$t_{calculated} = \frac{|\bar{d}|}{s_d} \sqrt{n}$$

where: $\bar{d}$ = difference in the mean values of the two methods
n = number of samples analyzed by each method
$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$

(ii) Degrees of Freedom = (n - 1)

(iii) If calculated t is greater than the value in the Student’s t probability table, then the two methods are significantly different at the given % confidence level.

Page 21:

Statistics Standard Deviation and Probability

6.) Comparison of Two Methods

(iv) Example:

Two methods for measuring cholesterol in blood provide the following results:

Cholesterol content (g/L)

Plasma sample    Method A    Method B    Difference ($d_i$)
1                1.46        1.42         0.04
2                2.22        2.38        -0.16
3                2.84        2.67         0.17
4                1.97        1.80         0.17
5                1.13        1.09         0.04
6                2.35        2.25         0.10

$\bar{d}$ = +0.060

Are these methods significantly different at the 95% confidence level?

Page 22:

Statistics Standard Deviation and Probability

6.) Comparison of Two Methods

(iv) Example:

$$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$$

$$s_d = \sqrt{\frac{(0.04-0.060)^2 + (-0.16-0.060)^2 + (0.17-0.060)^2 + (0.17-0.060)^2 + (0.04-0.060)^2 + (0.10-0.060)^2}{6 - 1}}$$

$$s_d = 0.122$$

$$t_{calculated} = \frac{|\bar{d}|}{s_d} \sqrt{n} = \frac{0.060}{0.122} \sqrt{6} = 1.20$$

Degrees of Freedom (6 - 1 = 5), 95% Confidence level: t = 2.571

Calculated t (1.20) ≤ 2.571

The results are not significantly different at a 95% confidence level.
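The cholesterol example can be checked with a short script. This is an illustrative sketch, not part of the original slides: it forms the per-sample differences, takes their mean and standard deviation with the standard `statistics` module, and compares the calculated t with the table value of 2.571 quoted for 5 degrees of freedom at 95% confidence.

```python
import math
import statistics

# Cholesterol content (g/L) for the six plasma samples
method_a = [1.46, 2.22, 2.84, 1.97, 1.13, 2.35]
method_b = [1.42, 2.38, 2.67, 1.80, 1.09, 2.25]

# Difference d_i for each sample (Method A - Method B)
d = [a - b for a, b in zip(method_a, method_b)]

n = len(d)
d_mean = statistics.mean(d)
s_d = statistics.stdev(d)      # uses the (n - 1) denominator
t_calc = abs(d_mean) / s_d * math.sqrt(n)

t_table = 2.571                # Student's t, 95% confidence, n - 1 = 5 degrees of freedom

print(f"mean difference = {d_mean:+.3f}, s_d = {s_d:.3f}, t = {t_calc:.2f}")
print("significantly different at 95%" if t_calc > t_table else "not significantly different at 95%")
```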

Page 23:

Statistics Dealing with Bad Data

1.) Q Test

(i) Method used to decide whether or not to reject a “bad” data point.

(ii) Procedure:

1. Arrange data in order of increasing value.

2. Determine the lowest and highest values and the total range of values.

Example:

12.47  12.48  12.53  12.56  12.67

Questionable point: 12.67
Range = 12.67 – 12.47 = 0.20
Gap = 12.67 – 12.56 = 0.11

3. Determine the difference (gap) between the “bad” data point and the nearest value.
- Calculate the “Q value”

$$Q = \frac{\text{Gap}}{\text{Range}} = \frac{0.11}{0.20} = 0.55$$

Page 24:

Statistics Dealing with Bad Data

1.) Q Test

(ii) Procedure:

4. Compare the calculated Q value to those in tables at the same value of n and the desired % confidence level.
- n: total number of values or observations
- For example, at n = 5 and 90% confidence, the value of Q is 0.64
- Since:

Q (calculated) ≤ Q (table)

0.55 ≤ 0.64

- data point 12.67 cannot be rejected at the 90% confidence level

(iii) Although the Q-test is valuable in eliminating bad data, common sense and repeating experiments with questionable results are usually more helpful.

Values of Q for rejection of data

Number of Observations    Q (90% confidence)
3                         0.94
4                         0.76
5                         0.64
6                         0.56
7                         0.51
8                         0.47
9                         0.44
10                        0.41

Page 25:

Statistics Dealing with Bad Data

1.) Q Test

(ii) Example:

For the following bowling scores 116.0, 97.9, 114.2, 106.8 and 108.3, a bowler has a mean score of 108.6 and a standard deviation of 7.1.

Using the Q test, decide whether the number 97.9 should be discarded.
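The Q test procedure can be sketched as a small function. The function name `q_test` and the dictionary `Q_TABLE_90` are introduced here for illustration; the critical values are the 90% confidence values from the table of Q for rejection of data. The code runs both the 12.67 example from the procedure and the bowling-score question.

```python
# 90% confidence Q values, keyed by number of observations
Q_TABLE_90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56, 7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}

def q_test(values, suspect):
    """Return (Q, reject) for a suspect point that is the lowest or highest value."""
    data = sorted(values)                      # step 1: order the data
    if suspect not in (data[0], data[-1]):
        raise ValueError("the suspect point must be the lowest or highest value")
    data_range = data[-1] - data[0]            # step 2: total range
    if suspect == data[0]:
        gap = data[1] - data[0]                # step 3: gap to the nearest value
    else:
        gap = data[-1] - data[-2]
    q = gap / data_range
    return q, q > Q_TABLE_90[len(data)]        # step 4: compare with the table

# Example from the procedure: is 12.67 rejected?
q1, reject1 = q_test([12.47, 12.48, 12.53, 12.56, 12.67], 12.67)
print(f"Q = {q1:.2f}, reject 12.67: {reject1}")

# Bowling scores: is 97.9 rejected?
q2, reject2 = q_test([116.0, 97.9, 114.2, 106.8, 108.3], 97.9)
print(f"Q = {q2:.2f}, reject 97.9: {reject2}")
```

For the bowling scores the gap is 106.8 – 97.9 = 8.9 and the range is 116.0 – 97.9 = 18.1, so Q is about 0.49. This is below 0.64, so 97.9 should not be discarded at the 90% confidence level.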