Correlation Forensic Statistics CIS205

Upload: ella-waters

Post on 18-Jan-2016


Page 1: Correlation Forensic Statistics CIS205. Introduction Chi-squared shows the strength of relationship between variables when the data is of count form However,

Correlation

Forensic Statistics CIS205


Introduction

• Chi-squared shows the strength of relationship between variables when the data is of count form

• However, many variables measured in a lab are on a continuous scale, such as concentrations of chemicals, time, and most machine responses

• The term for the strength of the relation between continuous variables is correlation

• Any continuous variables which have some sort of systematic relationship are said to covary, and any variable which covaries with another is said to be a covariate.

• A basic tool for the investigation of correlation is the scatterplot. Usually only two variables are plotted, but three can be accommodated.


Correlation Coefficient

• A statistical measure of correlation is called the correlation coefficient, which can only take on values between -1 and 1.

• Both 1 and -1 mean that the variables are perfectly related.

• 1 means that as one variable increases, so does the other.

• -1 means that as one variable increases, the other decreases.

• 0 means that the variables show no relationship of the form being measured (for a linear coefficient, no linear relationship).

• The strength of relationship is independent of the form of relationship. Most commonly relationships are linear (plotting one variable against another yields a straight line), next most commonly log-linear (a graph of one variable against the logarithm of the other is linear).
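The extreme values of the coefficient can be illustrated with a short Python sketch. The helper name `pearson_r` and the numbers below are made up for illustration and do not come from the slides; the third series is a curved relationship included only to show that a linear coefficient of 0 does not rule out a non-linear relationship.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative values, not data from the slides.
x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]         # y increases in a straight line with x
falling = [10 - 3 * v for v in x]       # y decreases in a straight line with x
symmetric = [(v - 3) ** 2 for v in x]   # curved: falls, then rises

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_symmetric = pearson_r(x, symmetric)

print(f"rising:    r = {r_rising:+.2f}")
print(f"falling:   r = {r_falling:+.2f}")
print(f"symmetric: r = {r_symmetric:+.2f}")
```

The symmetric series is completely determined by x, yet its linear coefficient is zero, which is why the form of the relationship should be checked on a scatterplot before interpreting the coefficient.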


Ageing properties of the dye methyl violet (Grim et al., 2002)

• This example will be used to demonstrate the process involved in the calculation of a linear correlation coefficient

• Laser desorption mass spectrometry was used to examine the ageing properties of the dye methyl violet, a dye used in inks from the 1950s.

• Documents written in methyl violet ink were artificially aged with ultraviolet radiation.

• After various times the average molecular weight for the methyl violet compound was measured.

• The raw data are shown in Table 6.1 and plotted in Figure 6.2.


Table 6.1. Average molecular weight of the dye methyl violet and UV irradiation time from an accelerated ageing experiment.

Time (min)    Weight (Da)
0.0           367.20
15.3          368.97
30.6          367.42
45.3          366.19
60.2          365.91
75.5          365.68
90.6          365.12
105.7         363.59


Correlation coefficient r

• Visual inspection of Fig. 6.2 suggests that there is a negative linear correlation between time and mean molecular weight.

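• A suitable measure of this linear correlation is the coefficient r:

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the means of x (time) and y (weight).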


Time (min)   x – mean x   (x – mean x)²   Weight (Da)   y – mean y   (y – mean y)²   (x – mean x)(y – mean y)
0.0          -52.90       2798.41         367.20         0.94        0.883            -49.72
15.3         -37.61       1414.51         368.97         2.71        7.344           -101.92
30.6         -22.33        498.63         367.42         1.16        1.345            -25.90
45.3          -7.61         57.91         366.19        -0.07        0.005              0.53
60.2           7.33         53.73         365.91        -0.35        0.122             -2.57
75.5          22.61        511.21         365.68        -0.58        0.336            -13.11
90.6          37.67       1419.03         365.12        -1.14        1.300            -42.94
105.7         52.84       2792.06         363.59        -2.67        7.129           -141.08

mean x = 52.89

Σ (x – mean x)² = 9545.50

mean y = 366.26

Σ (y – mean y)² = 18.465

Σ (x – mean x)(y – mean y) = -376.72


Substituting these values into the equation for r we have:

$$ r = \frac{-376.72}{\sqrt{9545.50 \times 18.465}} = -0.8973 $$

• This means that as the irradiation time increases the average molecular weight of methyl violet ions decreases, and as -0.89 is close to -1, the negative linear relationship is quite strong
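The calculation can be reproduced from the raw values in Table 6.1 with a minimal Python sketch. The variable names are illustrative. Because the table prints the times rounded to one decimal place, the sums may differ slightly from those on the slide, but the coefficient should still come out close to the quoted -0.89.

```python
import math

# Values as printed in Table 6.1 (times are rounded to one decimal place).
time_min = [0.0, 15.3, 30.6, 45.3, 60.2, 75.5, 90.6, 105.7]
weight_da = [367.20, 368.97, 367.42, 366.19, 365.91, 365.68, 365.12, 363.59]

n = len(time_min)
mean_x = sum(time_min) / n
mean_y = sum(weight_da) / n

# The three sums that appear in the equation for r.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(time_min, weight_da))
sum_xx = sum((x - mean_x) ** 2 for x in time_min)
sum_yy = sum((y - mean_y) ** 2 for y in weight_da)

r = sum_xy / math.sqrt(sum_xx * sum_yy)

print(f"mean x = {mean_x:.2f}, mean y = {mean_y:.2f}")
print(f"sum_xy = {sum_xy:.2f}, sum_xx = {sum_xx:.2f}, sum_yy = {sum_yy:.3f}")
print(f"r = {r:.4f}")
```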


Significance tests for correlation coefficients

• A linear correlation coefficient of -0.89 sounds quite high, but is it significantly high? Is it possible that such a coefficient would occur by chance in data drawn randomly from a bivariate normal distribution in which the two variables are actually uncorrelated?

• Also, what about the effect of sample size? It makes sense that a high coefficient based on lots of x,y pairs is somehow more significant than an equal correlation based on only a few observations.

• For the null hypothesis that the correlation coefficient is 0, a suitable test statistic is:

• t = r * √df / √ (1 - r²).


Substituting for the methyl violet example

• t = r * √df / √ (1 - r²).

• t is the abscissa (horizontal axis) of the t-distribution

• df is the degrees of freedom, equal to n – 2 (here 6, because we have 8 x,y pairs)

• The linear correlation coefficient was -0.89, so:

• t = -0.89 * √6 / √ (1 - (-0.89)²) = -4.78

• If we look at the values of the t-distribution table for df = 6 we see that 95% of the area is within ± 2.447.

• Our value of -4.78 is beyond -2.447, so we can say that the correlation coefficient is significant at 95% confidence.
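The test can be sketched in a few lines of Python. The critical value 2.447 is copied from the t-table lookup described above rather than computed, and the variable names are illustrative.

```python
import math

r = -0.89            # linear correlation coefficient from the methyl violet example
n_pairs = 8          # number of x,y pairs in Table 6.1
df = n_pairs - 2     # degrees of freedom

# Test statistic for the null hypothesis that the correlation is 0.
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Two-tailed 95% critical value for df = 6, as read from the t-table.
t_critical = 2.447

significant = abs(t) > t_critical
print(f"t = {t:.2f}, critical value = ±{t_critical}, significant: {significant}")
```

Comparing the absolute value of t with the critical value treats the test as two-tailed, which matches the statement that 95% of the area lies within ± 2.447.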


Correlation coefficients for non-linear data

• Andrasko and Ståhling measured three compounds associated with the discharge of firearms (naphthalene, TEAC-2 and nitroglycerin) over a period of time by solid phase microextraction (SPME) of the gaseous residue from the expended cartridge.

• They found that the concentrations of these compounds decrease with time, and that this property could be of use in estimating the time since discharge for this type of cartridge.

• Table 6.3 lists the peak area for nitroglycerin and the time elapsed since discharge for a Winchester SKEET 100 cartridge stored at 7°C; the data are shown as scatterplots in Figure 6.3


Time since discharge (days)    Nitroglycerin (peak height)
1.21                           218.34
2.42                           216.16
3.62                           100.00
4.69                            75.55
7.49                            56.52
9.42                            50.62
11.60                           31.00
14.69                           41.44
21.50                           15.53
25.70                           14.63
29.86                           10.41
37.20                            5.16
42.42                            7.26


Log-linear relationships

• A common model for loss in chemistry (e.g. radioactive decay) is called inverse exponential decay, which entails a log-linear relationship between the two variables

• The right hand scatterplot of Figure 6.3 shows the log to the base e (or natural logarithm) of the nitroglycerine peak height against time. Here we can see that the data looks much more linear.

• The linear correlation coefficient is -0.95, which is quite high, and suggests that this may be a reasonable transformation of the variables

• The calculations for the log-linear correlation coefficient are exactly the same kind as in Table 6.2, except that the log to the base e of the y variable is used rather than the untransformed y.
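As a rough check, the same calculation can be run on the nitroglycerin data from Table 6.3, once on the untransformed peak values and once on their natural logarithms. This is a sketch only; the helper name `pearson_r` is an illustrative choice, not something defined in the slides.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Table 6.3: time since discharge (days) and nitroglycerin peak values.
days = [1.21, 2.42, 3.62, 4.69, 7.49, 9.42, 11.60,
        14.69, 21.50, 25.70, 29.86, 37.20, 42.42]
peak = [218.34, 216.16, 100.00, 75.55, 56.52, 50.62, 31.00,
        41.44, 15.53, 14.63, 10.41, 5.16, 7.26]

# Natural logarithm (log to the base e) of the y variable.
log_peak = [math.log(p) for p in peak]

r_raw = pearson_r(days, peak)       # untransformed y
r_log = pearson_r(days, log_peak)   # log-transformed y

print(f"r (raw peak)    = {r_raw:.2f}")
print(f"r (ln of peak)  = {r_log:.2f}")
```

The coefficient for the log-transformed values should be the one that matches the -0.95 quoted above, and it is larger in magnitude than the coefficient for the untransformed values.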


The coefficient of determination

• The coefficient of determination is a direct measure of how much of the variance in one of the covariates can be attributed to the other.

• We can imagine that the total variance in the nitroglycerin peak is made up of two parts, that which is attributable to the relationship with x (time), and that which can be seen as random noise.

• The coefficient of determination describes what proportion of the variance is attributable to relationship with time.

• The coefficient of determination is simply the square of the correlation coefficient.

• If r = -0.95, r² = 0.90.

• Often the coefficient of determination is described as a percentage, which in the example above would mean that 90% of the variance in the (log-transformed) nitroglycerin peak area is attributable to time.
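The "two parts" view of the variance can be checked numerically. The sketch below fits a least-squares straight line to the log-transformed Table 6.3 data and compares the fraction of variance the line accounts for with r². The least-squares line is an assumption introduced here for illustration; the slides do not describe a fitting step.

```python
import math

# Table 6.3: time since discharge (days) and nitroglycerin peak values.
days = [1.21, 2.42, 3.62, 4.69, 7.49, 9.42, 11.60,
        14.69, 21.50, 25.70, 29.86, 37.20, 42.42]
peak = [218.34, 216.16, 100.00, 75.55, 56.52, 50.62, 31.00,
        41.44, 15.53, 14.63, 10.41, 5.16, 7.26]
log_peak = [math.log(p) for p in peak]

n = len(days)
mean_x = sum(days) / n
mean_y = sum(log_peak) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, log_peak))
sxx = sum((x - mean_x) ** 2 for x in days)
syy = sum((y - mean_y) ** 2 for y in log_peak)   # total variation in y

r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Least-squares straight line (illustrative assumption, not from the slides).
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variation left over after the line is removed: the "random noise" part.
ss_residual = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(days, log_peak))
explained_fraction = 1 - ss_residual / syy

print(f"r^2                = {r_squared:.2f}")
print(f"explained fraction = {explained_fraction:.2f}")
```

For a straight-line fit the two quantities agree, which is the sense in which r² is the proportion of variance attributable to the relationship with time.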