Topic 1- Statistical Analysis

Why?

O The scientific method involves making observations and collecting measurable data.

O When measuring data from a sample, the sample must be representative of the entire population from which it is drawn

O Statistics allows us to sample a small part of a population and draw conclusions about the larger population

Why?

O It allows us to measure differences and relationships between sets of data.

O All conclusions drawn from an experiment have a certain level of confidence, but nothing in science is 100% certain.

What is a representative sample?

O A small group whose characteristics accurately reflect those of the larger population from which it is drawn.

O A representative sample is needed in order to make more accurate generalizations about the larger population

O Example: If approximately 15% of the United States’ population is of Hispanic descent, a sample of 100 Americans also ought to include around 15 Hispanic people to be representative.

How do we get a representative sample?

O Avoiding selection bias- sampling fails to be representative as a result of convenience sampling (using just mpsj students), undercoverage (leaving a specific group of the population out of the sample), judgement sampling (targeting individuals you pre-assume to fit a criterion) and non-response (people choose not to complete the experiment)

O Larger sample sizes- make the sample more similar to the original population

O Random sampling- selecting individuals from random areas, at random times or with different methods

O This results in better data collection quality and reduces experimenter bias or the placebo effect
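The random sampling idea above can be sketched in a few lines of Python. This is only an illustration: the population of 500 student ID numbers, the sample size of 50 and the seed are all assumptions made for the example, not values from the slides.

```python
import random

# Hypothetical population: ID numbers for 500 students (assumed for illustration).
population = list(range(1, 501))

# A seeded generator makes the draw repeatable for this example.
rng = random.Random(42)

# Simple random sample: every student has the same chance of being chosen,
# and no student can be chosen twice.
sample = rng.sample(population, k=50)

print(sorted(sample)[:10])
```

Because the selection is left to chance, the experimenter cannot (even unintentionally) favour a convenient or pre-judged group.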

Reliable and valid data

O Reliability is used to describe the overall consistency of a measure. A measure is said to have high reliability if it produces similar results under consistent conditions. For example, measurements of people’s height and weight are often extremely reliable.

O Validity is the extent to which a concept, conclusion or measurement is well-founded and corresponds accurately to the real world. “You are measuring what you’re supposed to measure”

Range

O Measures the spread of data

O The difference between the largest and smallest observed values

O If one data point is unusually large/small, it has a great effect on the range and is called an outlier (outliers can often indicate an error in the experiment and are often eliminated).
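A short sketch of how a single outlier changes the range. The seedling heights below are made-up numbers chosen for illustration, and `data_range` is a helper name introduced here.

```python
# Hypothetical seedling heights in cm (assumed for illustration).
heights = [12, 13, 13, 14, 15]
heights_with_outlier = heights + [41]  # one unusually large value


def data_range(values):
    """Range = largest observed value minus smallest observed value."""
    return max(values) - min(values)


range_clean = data_range(heights)
range_outlier = data_range(heights_with_outlier)

print(range_clean, range_outlier)
```

One extra data point moves the range from 3 cm to 29 cm, which is why the range alone is a fragile measure of spread.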

Averages

O Averages are the central tendencies of the data. There are three types:

O Mean- sum of all the results divided by the number of results

O Median- the middle value when the results are arranged in order

O Mode- the value that appears the greatest number of times

Example

O Find the mean, median and mode of the following data set

O 1, 2, 2, 5, 6, 7, 11, 11, 11, 12

O Mean- (1 + 2 + 2 + 5 + 6 + 7 + 11 + 11 + 11 + 12) / 10 = 68 / 10 = 6.8

O Median- (6 + 7) / 2 = 6.5

O Mode- 11

O When no numbers repeat then you do not have a mode

O If the mean, median and mode are all approximately the same then the data are roughly symmetric, which is consistent with a normal distribution
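The worked example above can be checked with Python's standard `statistics` module. The variable names are chosen here for illustration.

```python
from statistics import mean, median, mode

# Data set from the example above.
data = [1, 2, 2, 5, 6, 7, 11, 11, 11, 12]

data_mean = mean(data)      # sum of all results / number of results
data_median = median(data)  # average of the two middle values (6 and 7)
data_mode = mode(data)      # most frequently occurring value

print(data_mean, data_median, data_mode)
```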

Averages

O Averages do not tell us everything about a sample.

O May not be representative of the entire population

O Two samples of a population could be different from one another. There is bound to be natural variation

Sample 1: 5 round, 1 oval

Sample 2: 2 round, 3 oval

Standard Deviation

O Samples can be very uniform (bunched around the mean) or spread out a long way from the mean

O The statistic that measures this spread is called the standard deviation

Standard deviation

O A measure of how the individual data points are distributed around the mean

O Allows us to compare the means and spread of data between two or more samples

O Tells us how tightly the data points are clustered around the mean and therefore how far typical values stray from it.

O When the data points are clustered, the SD is very small and when spread apart the SD is large

Standard deviation and error bars

O A graphical representation of variability

O Can be used to show range of data or SD

O In design labs, students often use their SD to represent their error bars on their graphs

O A large SD indicates high variability in the data, which can point to large error or unreliable results

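Example

O Calculate the SD of a sample- Four children are aged 5, 6, 8 and 9.

O Step 1: find the mean $\bar{x}$

O $x_1 = 5$, $x_2 = 6$, $x_3 = 8$, $x_4 = 9$ and $N$ (population) $= 4$

O $\bar{x} = \frac{x_1 + x_2 + x_3 + x_4}{N} = \frac{28}{4}$

O $\bar{x} = 7$

O Step 2: find the SD $\sigma$

O $\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$

O $\sigma = \sqrt{\frac{(5-7)^2 + (6-7)^2 + (8-7)^2 + (9-7)^2}{4}}$

O $\sigma = \sqrt{\frac{10}{4}}$

O $\sigma = 1.58$

O Therefore the average age of the children is 7 with an SD of 1.58 years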
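As a check on the calculation above, here is a minimal sketch using Python's `statistics` module. The slide's answer of 1.58 corresponds to the population formula (dividing by N), which is `pstdev`. The sample formula (dividing by N - 1), `stdev`, is shown only for comparison and is not part of the slide.

```python
from statistics import mean, pstdev, stdev

# Ages of the four children from the example above.
ages = [5, 6, 8, 9]

mean_age = mean(ages)
sd_population = pstdev(ages)  # divides the squared deviations by N
sd_sample = stdev(ages)       # divides by N - 1 (comparison only)

print(mean_age, round(sd_population, 2), round(sd_sample, 2))
```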

Distribution

O Consider a population of bean plants with a mean height of 7 cm

O Normal distribution- a spread of data that is distributed symmetrically below and above the mean

O A flat bell curve- data widely spread

O A tall and narrow curve- data is very close to the mean

O Standard normal curve- about 68% of all values lie within +/- 1 SD of the mean and about 95% of all values lie within +/- 2 SD of the mean

O As the shape of a bell curve changes, the SD value changes so that +/- 1 SD still covers about 68% of the data set and +/- 2 SD about 95%.
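The 68% and 95% figures can be verified numerically with `statistics.NormalDist`. The mean of 7 cm comes from the bean plant example. The SD of 1.5 cm is an assumed value used only for illustration, and the same percentages result for any SD.

```python
from statistics import NormalDist

# Mean height from the example; the SD of 1.5 cm is an assumption.
mu, sigma = 7.0, 1.5
heights = NormalDist(mu=mu, sigma=sigma)

# Fraction of values within +/- 1 SD and +/- 2 SD of the mean.
within_1sd = heights.cdf(mu + sigma) - heights.cdf(mu - sigma)
within_2sd = heights.cdf(mu + 2 * sigma) - heights.cdf(mu - 2 * sigma)

print(round(within_1sd, 3), round(within_2sd, 3))
```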


The t-test

O To assess whether the means of two groups are statistically different from each other

O Used when you want to compare the means of two groups

O Ex. Is there a statistical difference in the mean height between a group of boys and girls at the age of 12?

The t-test

• Notice that all three examples below have the same difference between means

• Yet they all tell different stories. They all have different variability.

• The two groups with low variability from their mean are visibly most different from each other and the groups with high variability are most similar to each other

T-test

O We can judge the difference between means relative to their spread or variability using the t-test

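O The formula is a ratio: the difference between the two group means divided by a measure of the variability of the groups

The formula

$t = \frac{M_X - M_Y}{\sqrt{\frac{SD_X^2}{n_X} + \frac{SD_Y^2}{n_Y}}}$

where $M$ is a group mean, $SD^2$ is a group variance and $n$ is a group sample size.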

Example

O Problem: Sam Sleepresearcher hypothesizes that people who are allowed to sleep for only four hours will score significantly lower than people who are allowed to sleep for eight hours on a cognitive skills test. He brings 16 participants into his sleep lab and randomly assigns them to one of two groups of 8. In one group he has participants sleep for eight hours and in the other group he has them sleep for four. The morning after, he administers the SCAT (Sam's Cognitive Ability Test) to the participants. (Scores on the SCAT range from 1-9 with high scores representing better performance).

SCAT scores

8 hours sleep group (X): 5, 7, 5, 3, 5, 3, 3, 9

4 hours sleep group (Y): 8, 1, 4, 6, 6, 4, 1, 2

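Step 1- calculate degrees of freedom

df (paired t-test) = sample size - 1

df (unpaired t-test) = n1 + n2 - 2

Step 2- calculate the t-value

Mean (8 hours) Mx = 5, mean (4 hours) My = 4

Variance (SD²): 8 hours = 4.571, 4 hours = 6.571

n (8 hours) = 8 and n (4 hours) = 8

Step 2a: Find the means for both groups and subtract: 5 - 4 = 1

Step 2b: Calculate the variance (SD²) of each group

Step 2c: Divide each variance by its sample size and add the results: 4.571/8 + 6.571/8 = 1.393

Step 2d: Take the square root of that sum to get the denominator: 1.180

t = 1 / 1.180 = 0.847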
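The calculation steps above can be reproduced with a short Python sketch. It uses the SCAT scores from the example and the unpaired form of the t ratio (difference in means over the square root of the summed variance/n terms). The variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import mean, variance

# SCAT scores from the example above.
scores_8h = [5, 7, 5, 3, 5, 3, 3, 9]
scores_4h = [8, 1, 4, 6, 6, 4, 1, 2]

# Difference between the two group means.
mean_diff = mean(scores_8h) - mean(scores_4h)

# Sample variance (SD squared, dividing by n - 1) of each group.
var_8h = variance(scores_8h)
var_4h = variance(scores_4h)

# Divide each variance by its sample size, add, then take the square root.
std_error = sqrt(var_8h / len(scores_8h) + var_4h / len(scores_4h))

t_value = mean_diff / std_error
print(round(var_8h, 3), round(var_4h, 3), round(t_value, 3))
```

The values 4.571 and 6.571 quoted in the example are therefore the group variances, not the standard deviations themselves.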

Step 3- use t-table

O Once the t-value is calculated you look it up in a table of significance to see whether the ratio (t-value) is large enough to say that the difference between the groups is not likely due to chance

O Statisticians like to be 95% confident that their conclusions are significant, so we use the risk value or p-value of p<0.05. At p<0.05 a difference this large would arise by chance less than 5% of the time, vs. p=0.1 where it would arise by chance up to 10% of the time

O If p>0.05, this indicates the means are not statistically different

O The two groups are unpaired, so df = n1 + n2 - 2 = 14. According to the t significance/probability table (one-tailed, p = 0.05) with df = 14, t must be at least 1.761 to be significant

O Since our t = 0.847 is below this value, p>0.05 (it would fall at a lower confidence level, between .25 and .1), so this difference is not statistically significant

Correlation vs. Causation

O “Correlation does not imply causation”- means that a correlation on its own cannot be used to infer a causal relationship, because the causes underlying the correlation may be indirect or unknown

O Cause: a carefully designed experiment and its evidence can determine that A causes B

O Correlation: observations, without a controlled experiment, can only show that A and B are related

Fallacy Examples

O Ice cream sales correlate with the number of people who drown at sea. Therefore ice cream causes people to drown.

O Children who sleep with a light on are more likely to develop myopia (nearsightedness)

O Does light cause myopia?

O Atmospheric CO2 has been climbing in conjunction with increased crime

O Does CO2 cause crime?

O A mathematical correlation test produces a value r, which signifies the strength and direction of the correlation between two variables

O r = +1 positive correlation (as X increases so does Y)

O r = 0 no correlation

O r = -1 negative correlation (as X increases Y decreases)
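A minimal sketch of how r can be computed, using the Pearson correlation coefficient (one common correlation test; the slides do not say which test is meant). The monthly ice cream and drowning figures are invented for illustration, and `pearson_r` is a helper name introduced here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(ss_x * ss_y)


# Hypothetical monthly figures (assumed for illustration).
ice_cream_sales = [20, 35, 50, 70, 90]
drownings = [1, 2, 3, 5, 6]

r = pearson_r(ice_cream_sales, drownings)
print(round(r, 3))
```

An r close to +1 here only shows that the two series rise together. It says nothing about whether one causes the other, since both could depend on a third factor such as warm weather.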

Accuracy & Precision

O Accuracy: how close a measured value is to the true value

O Precision: how close the measured values are to each other

Errors and Uncertainties

O Examples:

O Human errors- can occur when tools or instruments are used or read incorrectly. (E.g. a thermometer reading must be taken after stirring, with the bulb still in the liquid but not touching the bottom)

O Systematic- the experimenter does not know how to use the equipment or something is wrong with the equipment.

O Random – unknown or unpredictable changes

Systematic

O Note that systematic and random errors refer to problems associated with making measurements. Mistakes made in the calculations or in reading the instrument are not considered in error analysis. It is assumed that the experimenters are careful and competent! (Not acceptable in your design lab)

O Can be reduced if equipment is regularly checked or calibrated to ensure proper function

O Procedural systematic errors are acceptable, i.e. identifying a problem with your procedure/controls.

Random

O Random errors are statistical fluctuations (in either direction) in the measured data due to the precision limitations of the measurement device. Random errors usually result from the experimenter's inability to take the same measurement in exactly the same way to get exactly the same number.

O In biology this can be a result of changes in the materials used or changes in conditions

O Controlled by carefully selecting materials, carefully controlling variables and repeating trials

Uncertainties & Significant Figures

O Uncertainties – used in biology since they are the best choice for quantitative lab work

O Sig Figs- are useful when doing calculations from a textbook and you do not know the accuracy of the measuring device.

O They are mutually exclusive systems…you use one or the other!

Things to Remember

O When adding or subtracting, add uncertainties

O When dividing, convert to percent uncertainty, then add percent uncertainties

O If units are, for example, g/ml, convert back to absolute uncertainty

O If units are percent change, then convert back, then multiply by 100 to get back to % units

O When taking an average, divide your uncertainty by N
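The first three rules above can be sketched as follows. All measurements and uncertainties are hypothetical values chosen for illustration, and `percent_uncertainty` is a helper name introduced here.

```python
def percent_uncertainty(value, absolute_uncertainty):
    """Convert an absolute uncertainty into a percent uncertainty."""
    return absolute_uncertainty / abs(value) * 100


# Hypothetical measurements (assumed for illustration).
initial_mass, final_mass, mass_unc = 10.0, 12.5, 0.1  # g, each +/- 0.1 g
mass, volume, volume_unc = 10.0, 5.0, 0.5             # g and ml, volume +/- 0.5 ml

# Adding or subtracting: add the absolute uncertainties.
mass_change = final_mass - initial_mass
mass_change_unc = mass_unc + mass_unc

# Dividing: convert to percent uncertainties, then add them.
density = mass / volume
density_pct_unc = (percent_uncertainty(mass, mass_unc)
                   + percent_uncertainty(volume, volume_unc))

# Units are g/ml, so convert back to an absolute uncertainty.
density_unc = density * density_pct_unc / 100

print(f"{mass_change} +/- {mass_change_unc} g")
print(f"{density} +/- {density_unc:.2f} g/ml")
```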

The act of measuring

O When a measurement is taken, this can affect the environment of the experiment.

O Ex. When a cold thermometer is used to measure warm water, the thermometer may cool the water

O Ex. The presence of the experimenter influences the behaviour of the animal being observed

Replicates and Samples

O Biological systems, because of their complexity and variability, require replicate observations and multiple samples of material.

O In IB you can choose to do a 5X5 or a 2X10

O 5X5: 5 changes to the independent variable, each measured 5 times

O 2X10: 2 changes to the independent variable, each measured 10 times

Degrees of precision

O If it is digital then use the value of the least known digit (e.g. the mass on the scale says 1.01g, then your uncertainty is +/- 0.01g)

O If it is analog, as in the case of a thermometer, then use the least known digit divided by 2

O Always include your degrees of precision for every measuring device in your lab (especially in your tables)
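The two rules above can be written as a small helper. The function name `instrument_uncertainty` and the thermometer with 1 degree divisions are assumptions made for illustration. The 0.01 g scale is the example from the slide.

```python
def instrument_uncertainty(smallest_division, digital):
    """Uncertainty of a reading, following the rules above.

    Digital instrument: the smallest displayed digit.
    Analog instrument: the smallest division divided by 2.
    """
    if digital:
        return smallest_division
    return smallest_division / 2


# Digital scale reading to 0.01 g (example from the slide).
scale_unc = instrument_uncertainty(0.01, digital=True)

# Analog thermometer with 1 degree divisions (hypothetical instrument).
thermometer_unc = instrument_uncertainty(1.0, digital=False)

print(f"+/- {scale_unc} g, +/- {thermometer_unc} degrees")
```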
