finding relations between variables - unibo.it · 2015. 10. 22. · the term is often attributed to...


Page 1: FINDING RELATIONS BETWEEN VARIABLES - unibo.it · 2015. 10. 22. · The term is often attributed to Rudolf Erich Raspe's story The Surprising Adventures of Baron Munchausen, where

FINDING RELATIONS BETWEEN VARIABLES

Correlation


Relation between paired variables

Which pairs of variables are related?


Correlated variables


Uncorrelated variables


Definition: The covariance of two random variables X and Y is

$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Theorem: For any two random variables X and Y,

$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}(X, Y)$$

Variance and Moments of a Random Variable


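Independent variables: Cov = 0

Expanding the definition of the covariance:

$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] - E[Y]E[X] + E[X]E[Y] = E[XY] - E[X]E[Y]$$

For discrete variables (for continuous variables, use an integral instead of a sum):

$$E[XY] = \sum_{i,j} x_i y_j P(x_i, y_j)$$

For independent variables, $P(x_i, y_j) = P(x_i)P(y_j)$, so

$$E[XY] = \sum_{i,j} x_i y_j P(x_i) P(y_j) = \sum_i x_i P(x_i) \sum_j y_j P(y_j) = E[X]E[Y]$$

X, Y independent ⇒ Cov(X, Y) = 0

The converse is not always true.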
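A small counterexample may help show why zero covariance does not imply independence. The sketch below is not from the slides: it assumes a hypothetical X that is uniform on {-1, 0, 1} and sets Y = X². The covariance is exactly zero, yet Y is completely determined by X.

```python
from fractions import Fraction

# Hypothetical example (not from the slides): X uniform on {-1, 0, 1}, Y = X**2.
xs = [-1, 0, 1]
p = Fraction(1, 3)  # probability of each value of X


def expect(f):
    """Expectation of f(X) for X uniform on xs."""
    return sum(p * f(x) for x in xs)


e_x = expect(lambda x: x)            # E[X]
e_y = expect(lambda x: x * x)        # E[Y] = E[X^2]
e_xy = expect(lambda x: x * x * x)   # E[XY] = E[X^3]
cov_xy = e_xy - e_x * e_y            # Cov(X, Y) = E[XY] - E[X]E[Y]

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = p        # X = 0 forces Y = 0, so P(X=0, Y=0) = 1/3
p_product = p * p  # P(X=0) * P(Y=0) = 1/3 * 1/3, since Y = 0 only when X = 0

print("Cov(X, Y) =", cov_xy)
print("joint equals product of marginals:", p_joint == p_product)
```

Exact fractions are used so that the covariance comes out as exactly zero rather than as a small rounding error.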


Covariance and Pearson’s Correlation index

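| | Variable 1 | Variable 2 |
|---|---|---|
| Item 1 | x11 | x21 |
| Item 2 | x12 | x22 |
| Item i | x1i | x2i |
| Item m | x1m | x2m |
| Mean | M1 | M2 |

Means (n is the number of items, so n = m):

$$M_1 = \frac{1}{n}\sum_{i=1}^{m} x_{1i}, \qquad M_2 = \frac{1}{n}\sum_{i=1}^{m} x_{2i}$$

Covariance:

$$\mathrm{cov}(x_1, x_2) = \frac{1}{n}\sum_{i=1}^{m} (x_{1i} - M_1)(x_{2i} - M_2)$$

Pearson's correlation index, where $\sigma_1$ and $\sigma_2$ are the standard deviations of the two variables:

$$\mathrm{corr}(x_1, x_2) = \frac{1}{n}\sum_{i=1}^{m} \frac{(x_{1i} - M_1)(x_{2i} - M_2)}{\sigma_1 \sigma_2}$$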
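The definitions above can be sketched directly in code. This is a minimal illustration that uses the 1/n normalisation of the slide; the function names and the sample values are made up for the example.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x1, x2):
    """Covariance with 1/n normalisation, as in the formula above."""
    m1, m2 = mean(x1), mean(x2)
    return sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / len(x1)


def pearson(x1, x2):
    """Pearson's correlation index: covariance divided by both standard deviations."""
    s1 = math.sqrt(covariance(x1, x1))  # the variance is the covariance of a variable with itself
    s2 = math.sqrt(covariance(x2, x2))
    return covariance(x1, x2) / (s1 * s2)


# Hypothetical paired measurements, roughly linear in each other.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8]

r_up = pearson(x1, x2)                    # close to +1
r_down = pearson(x1, [-v for v in x2])    # same magnitude, opposite sign
print(round(r_up, 4), round(r_down, 4))
```

Flipping the sign of one variable flips the sign of the correlation and leaves its magnitude unchanged, which is a quick sanity check for any implementation.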


Correlation

$$\mathrm{corr}(x_1, x_2) \in [-1, 1]$$


When is a correlation significant?

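Given a correlation index:

$$r(x_1, x_2) = \frac{1}{n}\sum_{i=1}^{m} \frac{(x_{1i} - M_1)(x_{2i} - M_2)}{\sigma_1 \sigma_2}$$

A test variable can be computed under the null hypothesis that r=0:

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

t follows a Student's t distribution with n-2 degrees of freedom. The test assumes that x is normally distributed.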
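As a rough numerical illustration, the usual test statistic for a Pearson correlation, t = r·sqrt((n-2)/(1-r²)), can be evaluated in a couple of lines. The inputs r = 0.978 and n = 24 are the rounded values quoted in the misconduct example later in these slides; the function name is made up for the sketch.

```python
import math


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), to be compared with Student's t on n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


t = t_statistic(0.978, 24)
print(round(t, 2))
```

With the rounded r the result is about 22, close to the t=22.05 reported in the example; the small gap is presumably due to the rounding of r.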


Graph showing the minimum value of Pearson's correlation coefficient that is significantly different from zero at the 0.05 level, for a given sample size.
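A curve of this kind can be approximated by inverting the t statistic: if t_crit is the critical value of Student's t with n-2 degrees of freedom, the smallest significant correlation is r_crit = t_crit / sqrt(n - 2 + t_crit²). The sketch below assumes a two-sided test at the 0.05 level (the graph does not say which variant it uses) and hard-codes three critical values copied from standard t tables.

```python
import math

# Two-sided 0.05 critical values of Student's t from standard tables
# (assumption: the graph uses a two-sided test). Key: degrees of freedom, n - 2.
T_CRIT_05 = {8: 2.306, 22: 2.074, 98: 1.984}


def critical_r(n):
    """Smallest |r| that is significantly different from zero for sample size n."""
    t_crit = T_CRIT_05[n - 2]
    return t_crit / math.sqrt(n - 2 + t_crit * t_crit)


for n in (10, 24, 100):
    print(n, round(critical_r(n), 3))
```

The threshold drops quickly with sample size: a correlation that is unremarkable with 10 points can be clearly significant with 100.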


Example: discovery of a misconduct

Repeatability test: two different experimenters were each asked to take the same solution and perform 24 independent ELISA assays on a 6x4 plate.

They submitted the following spectrophotometer readings to the assessor, ordered by well position:


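| | S1 | S2 |
|---|---|---|
| P1 | 0.481 | 0.496 |
| P2 | 0.485 | 0.501 |
| P3 | 0.479 | 0.495 |
| P4 | 0.506 | 0.522 |
| P5 | 0.467 | 0.48 |
| P6 | 0.474 | 0.491 |
| P7 | 0.469 | 0.48 |
| P8 | 0.475 | 0.489 |
| P9 | 0.514 | 0.52 |
| P10 | 0.52 | 0.524 |
| P11 | 0.526 | 0.531 |
| P12 | 0.494 | 0.509 |
| P13 | 0.535 | 0.54 |
| P14 | 0.524 | 0.526 |
| P15 | 0.481 | 0.492 |
| P16 | 0.502 | 0.509 |
| P17 | 0.479 | 0.484 |
| P18 | 0.491 | 0.495 |
| P19 | 0.503 | 0.515 |
| P20 | 0.472 | 0.481 |
| P21 | 0.481 | 0.486 |
| P22 | 0.503 | 0.512 |
| P23 | 0.448 | 0.454 |
| P24 | 0.519 | 0.526 |

The assessor suspected that the experimenters had submitted two reads of the same plate.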

How to prove it?


Example: discovery of a misconduct

[Figure: scatter plot of S1 and S2, both axes spanning roughly 0.44 to 0.55]

R=0.978


Example: discovery of a misconduct

R=0.978 n=24 t=22.05

Objection: the test is valid only when data are normally distributed and we cannot prove that.

Any other idea?


Use the data themselves to generate random experiments

| | S1 | S2 | Random(S2) |
|---|---|---|---|
| P1 | 0.481 | 0.496 | 0.495 |
| P2 | 0.485 | 0.501 | 0.522 |
| P3 | 0.479 | 0.495 | 0.48 |
| P4 | 0.506 | 0.522 | 0.491 |
| P5 | 0.467 | 0.48 | 0.48 |
| P6 | 0.474 | 0.491 | 0.489 |
| P7 | 0.469 | 0.48 | 0.52 |
| P8 | 0.475 | 0.489 | 0.524 |
| P9 | 0.514 | 0.52 | 0.531 |
| P10 | 0.52 | 0.524 | 0.509 |
| P11 | 0.526 | 0.531 | 0.54 |
| P12 | 0.494 | 0.509 | 0.526 |
| P13 | 0.535 | 0.54 | 0.492 |
| P14 | 0.524 | 0.526 | 0.509 |
| P15 | 0.481 | 0.492 | 0.484 |
| P16 | 0.502 | 0.509 | 0.495 |
| P17 | 0.479 | 0.484 | 0.515 |
| P18 | 0.491 | 0.495 | 0.481 |
| P19 | 0.503 | 0.515 | 0.486 |
| P20 | 0.472 | 0.481 | 0.512 |
| P21 | 0.481 | 0.486 | 0.454 |
| P22 | 0.503 | 0.512 | 0.526 |
| P23 | 0.448 | 0.454 | 0.496 |
| P24 | 0.519 | 0.526 | 0.501 |

[Figure: scatter plot of S1 and Random(S2)]

R=0.25


Building a distribution by random resampling

Iterate the shuffling and the computation of r many times (say, 1000).

Compute a cumulative histogram that counts the resamplings whose correlation is ≥ r.

[Figure: cumulative histogram; horizontal axis: correlation r from 0.00 to 0.90; vertical axis: number of resamplings from 0 to 1000]


Building a distribution by random resampling

The plot gives the probability (per thousand) of obtaining a given correlation with random pairings of the original data.

The resulting P-value does not depend on assumptions about the data distribution.

[Figure: cumulative histogram; horizontal axis: correlation r from 0.00 to 0.90; vertical axis: number of resamplings from 0 to 1000]

This is only an example plot: compute for yourself the plot corresponding to the data available in the misconduct.xls file.
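The resampling procedure can be sketched as follows. This is only an illustration under stated assumptions: the readings are typed in from the S1/S2 table in these slides rather than read from misconduct.xls, the seed and the function names are arbitrary, and shuffling S2 while keeping S1 fixed is one reasonable way to produce random pairings.

```python
import math
import random

# Readings copied from the S1/S2 table in the slides (P1..P24).
S1 = [0.481, 0.485, 0.479, 0.506, 0.467, 0.474, 0.469, 0.475,
      0.514, 0.520, 0.526, 0.494, 0.535, 0.524, 0.481, 0.502,
      0.479, 0.491, 0.503, 0.472, 0.481, 0.503, 0.448, 0.519]
S2 = [0.496, 0.501, 0.495, 0.522, 0.480, 0.491, 0.480, 0.489,
      0.520, 0.524, 0.531, 0.509, 0.540, 0.526, 0.492, 0.509,
      0.484, 0.495, 0.515, 0.481, 0.486, 0.512, 0.454, 0.526]


def pearson(x, y):
    """Pearson's correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shuffled_correlations(x, y, n_iter=1000, seed=0):
    """Correlations obtained by pairing x with random shuffles of y."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    shuffled = list(y)
    results = []
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        results.append(pearson(x, shuffled))
    return results


r_obs = pearson(S1, S2)
rs = shuffled_correlations(S1, S2)

# Number of random pairings that score at least as high as the observed correlation.
n_extreme = sum(1 for r in rs if r >= r_obs)
print("observed r:", round(r_obs, 3))
print("random pairings with r >= observed:", n_extreme, "out of", len(rs))

# Cumulative counts at a few thresholds, as in the histogram described above.
for threshold in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(threshold, sum(1 for r in rs if r >= threshold))
```

The fraction n_extreme / 1000 plays the role of the P-value read off the cumulative histogram, and no normality assumption enters the calculation.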


Bootstrapping

The term is often attributed to Rudolf Erich Raspe's story The Surprising Adventures of Baron Munchausen, where the main character pulls himself out of a swamp by his hair (specifically, his pigtail), but the Baron does not, in fact, pull himself out by his bootstraps


Correlation index assumes linear dependence

R=0.816


Non parametric correlation: Spearman

Given a set of paired values (xi, yi), sort the two variables separately to obtain their ranks.

Spearman's correlation is Pearson's correlation of the ranked variables (Rxi, Ryi).
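A minimal sketch of this recipe is shown below. Giving tied values the average of their rank positions is an assumption of the sketch (the slide does not say how ties are handled), and the sample data are made up to show a monotone but non-linear relation.

```python
import math


def ranks(values):
    """Ranks 1..n; tied values receive the average of their positions (assumed convention)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y grows monotonically with x, but not linearly.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Because only the ordering matters, the rank-based index reaches 1 for any strictly increasing relation, while Pearson's index stays below 1 when the relation is not linear.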


Under the null hypothesis (r=0),

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

is distributed as Student's t with n-2 degrees of freedom.


Categorical data: Matthews correlation index

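| | Secreted | Non Secreted | Total |
|---|---|---|---|
| With Signal Peptide | a | b | a + b |
| Without Signal Peptide | c | d | c + d |
| Total | a + c | b + d | n |

$$\mathrm{MCC} = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$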
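The index is straightforward to compute from the four cells of the table. The sketch below uses the standard form MCC = (ad - bc) / sqrt((a+b)(a+c)(b+d)(c+d)); the counts are hypothetical, and returning 0 when a row or column total is zero is an assumed convention, not something stated in the slides.

```python
import math


def mcc(a, b, c, d):
    """Matthews correlation for a 2x2 table with cells a, b (first row) and c, d (second row)."""
    denominator = math.sqrt((a + b) * (a + c) * (b + d) * (c + d))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column total is zero
    return (a * d - b * c) / denominator


# Hypothetical counts: 40 secreted and 10 non-secreted proteins with a signal peptide,
# 5 secreted and 45 non-secreted proteins without one.
print(round(mcc(40, 10, 5, 45), 3))
```

Like Pearson's index, the value ranges from -1 (perfect disagreement) through 0 (no association) to +1 (perfect agreement).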


REGRESSION


Regression

Regression analysis: any technique for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables.

[Figure: plot of Height vs Weight; Weight axis from 100 to 260, Height axis from 4.6 to 7]


Linear regression: which line best fits the data?


Linear regression

Interpolating line:

$$y = ax + b$$

Residuals or errors:

$$\varepsilon_i = y_i - a x_i - b$$

Data points:

| x | y |
|---|---|
| 1 | 6 |
| 2 | 1 |
| 3 | 9 |
| 4 | 5 |
| 5 | 17 |
| 6 | 12 |


Least squares line

$$y = ax + b$$

Choose the line (a, b) that minimizes

$$E = \sum_{i=1}^{m} \varepsilon_i^2 = \sum_{i=1}^{m} (y_i - a x_i - b)^2$$


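Minimizing

$$E = \sum_{i=1}^{m} \varepsilon_i^2 = \sum_{i=1}^{m} (y_i - a x_i - b)^2$$

Setting the derivative with respect to b to zero, with $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $\bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i$:

$$\frac{\partial E}{\partial b} = 0 \;\Rightarrow\; -2\sum_{i=1}^{m} (y_i - a x_i - b) = 0 \;\Rightarrow\; \frac{1}{m}\sum_{i=1}^{m} y_i - a\,\frac{1}{m}\sum_{i=1}^{m} x_i - b = 0 \;\Rightarrow\; b = \bar{y} - a\bar{x}$$

Setting the derivative with respect to a to zero and substituting b:

$$\frac{\partial E}{\partial a} = 0 \;\Rightarrow\; -2\sum_{i=1}^{m} x_i (y_i - a x_i - b) = 0 \;\Rightarrow\; \sum_{i=1}^{m} x_i y_i - a\sum_{i=1}^{m} x_i^2 - (\bar{y} - a\bar{x})\sum_{i=1}^{m} x_i = 0$$

$$a = \frac{\sum_{i=1}^{m} x_i y_i - \bar{y}\sum_{i=1}^{m} x_i}{\sum_{i=1}^{m} x_i^2 - \bar{x}\sum_{i=1}^{m} x_i} = \frac{\sum_{i=1}^{m} x_i y_i - m\bar{x}\bar{y}}{\sum_{i=1}^{m} x_i^2 - m\bar{x}^2} = \frac{m\sum_{i=1}^{m} x_i y_i - \sum_{i=1}^{m} x_i \sum_{i=1}^{m} y_i}{m\sum_{i=1}^{m} x_i^2 - \left(\sum_{i=1}^{m} x_i\right)^2} = \frac{\mathrm{cov}(x, y)}{\sigma_x^2}$$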


Data points:

| x | y |
|---|---|
| 1 | 6 |
| 2 | 1 |
| 3 | 9 |
| 4 | 5 |
| 5 | 17 |
| 6 | 12 |
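For these six data points, the least squares line can be computed with the standard closed-form solution, slope a = cov(x, y) / var(x) and intercept b = mean(y) - a·mean(x). The function name is made up for the sketch.

```python
def fit_line(xs, ys):
    """Least squares line y = a*x + b: a = cov(x, y) / var(x), b = mean(y) - a * mean(x)."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / m
    var_x = sum((x - x_bar) ** 2 for x in xs) / m
    a = cov_xy / var_x
    b = y_bar - a * x_bar
    return a, b


xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))  # 2.114 0.933

# Residuals of the fitted line; with an intercept in the model they sum to zero.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
print(round(sum(residuals), 9))
```

The exact values are a = 74/35 and b = 14/15, so the fitted line is approximately y = 2.114x + 0.933.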


Polynomial interpolation

The same technique can be applied by imposing a polynomial regression model:

$$y = P(x) = \sum_{k=1}^{p} a_k x^k + b$$

p is the degree of the polynomial.

ak and b are the trainable coefficients.


Polynomials can perfectly interpolate a set of points

A set of data consisting of m points (with distinct x values) can be perfectly interpolated with a polynomial of degree m-1: 2 points define a unique line, 3 points define a unique parabola (or a line, if aligned), and so on.

Increasing the degree of the polynomial decreases the error on the fitted points.

Bishop C, Pattern Recognition and Machine Learning, Springer
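The claim about perfect interpolation can be checked on the six data points used in the linear regression slides. The sketch below evaluates the degree m-1 polynomial with the Lagrange formula, which is a swapped-in technique chosen because it needs no linear algebra; with m points and degree m-1 it yields the same polynomial that a least squares fit would.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs) - 1 through the points (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)  # basis factor: 1 at xi, 0 at every other xj
        total += term
    return total


xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]

# The degree 5 polynomial reproduces all six points: zero error on the data.
fitted = [lagrange_interpolate(xs, ys, x) for x in xs]
print([round(v, 6) for v in fitted])

# Just outside the data range the same polynomial gives a very different value.
outside = lagrange_interpolate(xs, ys, 7)
print(round(outside, 6))
```

The y values of the data lie between 1 and 17, yet the polynomial evaluates to -218 at x = 7, one step beyond the last point: zero error on the data says little about predictions elsewhere.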


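Values for parameters

| | M=0 | M=1 | M=3 | M=9 |
|---|---|---|---|---|
| b | 0.19 | 0.82 | 0.31 | 0.35 |
| a1 | | -1.27 | 7.99 | 232.37 |
| a2 | | | -25.43 | -5321 |
| a3 | | | 17.37 | 48568 |
| a4 | | | | -231639 |
| a5 | | | | 640042 |
| a6 | | | | -1061800 |
| a7 | | | | 1042400 |
| a8 | | | | -557682 |
| a9 | | | | 125201 |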


But…

The interpolation is useless: it does not give a good predictive model for extrapolation.

OVERFITTING

We cannot use the model for predicting Y in this region.


Overfitting


A low number of points increases the risk of overfitting

Bishop C Pattern Recognition and Machine Learning


High values for parameters

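| | M=0 | M=1 | M=3 | M=9 |
|---|---|---|---|---|
| b | 0.19 | 0.82 | 0.31 | 0.35 |
| a1 | | -1.27 | 7.99 | 232.37 |
| a2 | | | -25.43 | -5321 |
| a3 | | | 17.37 | 48568 |
| a4 | | | | -231639 |
| a5 | | | | 640042 |
| a6 | | | | -1061800 |
| a7 | | | | 1042400 |
| a8 | | | | -557682 |
| a9 | | | | 125201 |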

Bishop C Pattern Recognition and Machine Learning


Regularized Error function

Discouraging high values for coefficients:

$$E = \sum_{i=1}^{m} (y_i - P(x_i))^2 + \lambda \sum_{k=1}^{p} a_k^2$$

λ: parameter weighting the strength of regularization
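To see the effect of the penalty, the sketch below fits a polynomial by minimizing the squared error plus λ times the sum of the squared coefficients a_k (the intercept b is left unpenalized). It is only an illustration under stated assumptions: it solves the normal equations with a small hand-written Gaussian elimination, reuses the six data points of the linear regression slides with x rescaled to [0, 1] to keep the system well conditioned, and picks λ = 0.001 arbitrarily.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - partial) / aug[r][r]
    return sol


def fit_regularized(xs, ys, degree, lam):
    """Return [b, a1, ..., ap] minimizing sum (y - P(x))^2 + lam * sum a_k^2."""
    size = degree + 1
    powers = [[x ** k for k in range(size)] for x in xs]
    gram = [[sum(row[i] * row[j] for row in powers) for j in range(size)]
            for i in range(size)]
    for k in range(1, size):
        gram[k][k] += lam  # penalize a1..ap; index 0 (the intercept b) is not penalized
    rhs = [sum(row[i] * y for row, y in zip(powers, ys)) for i in range(size)]
    return solve(gram, rhs)


def squared_error(coeffs, xs, ys):
    return sum((y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))


def penalty(coeffs):
    return sum(a * a for a in coeffs[1:])


xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # x = 1..6 rescaled to [0, 1] (assumption of this sketch)
ys = [6, 1, 9, 5, 17, 12]

w_free = fit_regularized(xs, ys, degree=5, lam=0.0)    # plain interpolation
w_reg = fit_regularized(xs, ys, degree=5, lam=1e-3)    # regularized fit

print("lambda=0:    ", [round(c, 2) for c in w_free])
print("lambda=0.001:", [round(c, 2) for c in w_reg])
```

Without the penalty the degree 5 polynomial passes through all six points at the cost of very large coefficients; with the penalty the coefficients shrink and the curve no longer interpolates the data exactly.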


Values for parameters for M=9

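| | ln λ=0 | ln λ=-18 | ln λ=-∞ |
|---|---|---|---|
| w0 | 0.13 | 0.35 | 0.35 |
| w1 | -0.05 | 4.74 | 232.37 |
| w2 | -0.06 | -0.77 | -5321 |
| w3 | -0.05 | -31.97 | 48568 |
| w4 | -0.03 | -3.89 | -231639 |
| w5 | -0.02 | 55.28 | 640042 |
| w6 | -0.01 | 41.32 | -1061800 |
| w7 | 0.00 | -45.95 | 1042400 |
| w8 | 0.00 | -91.53 | -557682 |
| w9 | 0.01 | 72.68 | 125201 |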

Bishop C Pattern Recognition and Machine Learning


λ=0

Bishop C Pattern Recognition and Machine Learning


Regularized regression avoids overfitting

Bishop C Pattern Recognition and Machine Learning