Variance and covariance
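$$\mathbf{U} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}, \qquad \mathbf{U}^T\mathbf{U} = \begin{pmatrix} a_1 & a_2 & \dots & a_n \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} = \sum_{i=1}^{n} a_i^2$$

$$\text{Variance} = \frac{1}{n-1}\sum_{i=1}^{n} (a_i - \mu)^2$$

$$\mathbf{V} = \mathbf{U} - \mathbf{M} = \begin{pmatrix} a_1 - \mu \\ a_2 - \mu \\ \vdots \\ a_n - \mu \end{pmatrix}$$

$$\text{Variance} = \frac{1}{n-1}\mathbf{V}^T\mathbf{V} = \frac{1}{n-1}(\mathbf{U}-\mathbf{M})^T(\mathbf{U}-\mathbf{M})$$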
M contains the mean
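Covariance

$$\mathbf{A} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}; \qquad \mathbf{B} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}$$

$$\sigma_{AB} = \sigma_{BA} = \frac{1}{n-1}\sum_{i=1}^{n} (a_i - \mu_A)(b_i - \mu_B)$$

$$\text{Covariance} = \frac{1}{n-1}(\mathbf{A} - \bar{\mathbf{X}}_A)^T(\mathbf{B} - \bar{\mathbf{X}}_B)$$

where $\bar{\mathbf{X}}_A$ and $\bar{\mathbf{X}}_B$ contain the means of A and B.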
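The vector forms of variance and covariance can be checked numerically. The following is a minimal sketch in Python with NumPy; the function names `variance` and `covariance` and the toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def variance(u):
    """Sample variance as V'V / (n - 1), with V = U - M."""
    u = np.asarray(u, dtype=float)
    v = u - u.mean()                      # centred vector V = U - M
    return float(v @ v) / (u.size - 1)

def covariance(a, b):
    """Sample covariance as (A - mean_A)'(B - mean_B) / (n - 1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((a - a.mean()) @ (b - b.mean())) / (a.size - 1)

# Toy data (hypothetical values, chosen so the result is easy to check by hand)
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(variance(a))        # 2.5
print(covariance(a, b))   # 5.0
```

Both results agree with NumPy's own estimators `np.var(a, ddof=1)` and `np.cov(a, b)[0, 1]`, which use the same divisor n - 1.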
Sums of squares
General additive models
The coefficient of correlation
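$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

$$\operatorname{cov}(X,Y) = \sigma_{xy} = \frac{1}{n-1}(\mathbf{X}-\mathbf{M}_X)'(\mathbf{Y}-\mathbf{M}_Y)$$

$$\operatorname{var}(X) = \sigma_x^2 = \frac{1}{n-1}(\mathbf{X}-\mathbf{M}_X)'(\mathbf{X}-\mathbf{M}_X); \qquad \operatorname{var}(Y) = \sigma_y^2 = \frac{1}{n-1}(\mathbf{Y}-\mathbf{M}_Y)'(\mathbf{Y}-\mathbf{M}_Y)$$

$$r = \frac{(\mathbf{X}-\mathbf{M}_X)'(\mathbf{Y}-\mathbf{M}_Y)}{\sqrt{(\mathbf{X}-\mathbf{M}_X)'(\mathbf{X}-\mathbf{M}_X)\,(\mathbf{Y}-\mathbf{M}_Y)'(\mathbf{Y}-\mathbf{M}_Y)}}$$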
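For a matrix X that contains several variables, the following holds:

$$\mathbf{R} = \boldsymbol{\Sigma}_X^{-1}\mathbf{D}\,\boldsymbol{\Sigma}_X^{-1} = \frac{1}{n-1}\boldsymbol{\Sigma}_X^{-1}(\mathbf{X}-\mathbf{M})'(\mathbf{X}-\mathbf{M})\boldsymbol{\Sigma}_X^{-1}$$

The matrix R is a symmetric matrix that contains all pairwise correlations between the variables.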
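The diagonal matrix $\boldsymbol{\Sigma}_X$ contains the standard deviations as entries:

$$\boldsymbol{\Sigma}_X = \begin{pmatrix} \sigma_{X_1} & 0 & \dots & 0 \\ 0 & \sigma_{X_2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_{X_n} \end{pmatrix}$$

X - M is called the centred matrix.
We deal with samples, so the covariance matrix D uses the divisor n - 1:

$$\mathbf{D} = \frac{1}{n-1}(\mathbf{X}-\mathbf{M})'(\mathbf{X}-\mathbf{M})$$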
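Pre- and postmultiplication

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \dots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \dots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \dots & \sigma_{nn} \end{pmatrix}$$

With the vector $\mathbf{X}^T = (1/\sigma_1, \dots, 1/\sigma_n)$, premultiplication by $\mathbf{X}^T$ and postmultiplication by $\mathbf{X}$ collapse the matrix to a scalar:

$$\mathbf{X}^T\boldsymbol{\Sigma}\,\mathbf{X} = \sum_{i=1}^{n}\sum_{j=1}^{n} \frac{\sigma_{ij}}{\sigma_i\sigma_j} = \sum_{i=1}^{n}\sum_{j=1}^{n} r_{ij}; \qquad [1 \times n][n \times n][n \times 1] \rightarrow \text{scalar}$$

For diagonal matrices the following holds: premultiplication by $\boldsymbol{\Sigma}_X^{-1}$ divides row i of $\boldsymbol{\Sigma}$ by $\sigma_i$, and postmultiplication divides column j by $\sigma_j$.

$$\boldsymbol{\Sigma}_X^{-1} = \begin{pmatrix} 1/\sigma_1 & 0 & \dots & 0 \\ 0 & 1/\sigma_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1/\sigma_n \end{pmatrix}; \qquad \mathbf{R} = \boldsymbol{\Sigma}_X^{-1}\boldsymbol{\Sigma}\,\boldsymbol{\Sigma}_X^{-1}; \qquad r_{ij} = \frac{\sigma_{ij}}{\sigma_i\sigma_j}$$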
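A short sketch of how a correlation matrix follows from the covariance matrix by pre- and postmultiplying with a diagonal matrix of inverse standard deviations. It is written in Python with NumPy; the function name `correlation_matrix` and the random test data are assumptions made for illustration only.

```python
import numpy as np

def correlation_matrix(X):
    """R = S^-1 D S^-1, with D the covariance matrix and S = diag(std devs)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    C = X - X.mean(axis=0)                       # centred matrix X - M
    D = C.T @ C / (n - 1)                        # covariance matrix
    S_inv = np.diag(1.0 / np.sqrt(np.diag(D)))   # inverse standard deviations
    # Premultiplication scales the rows of D, postmultiplication its columns
    return S_inv @ D @ S_inv

# Hypothetical data: 50 cases, 3 variables, the third correlated with the first
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[:, 2] += X[:, 0]

R = correlation_matrix(X)
print(np.round(R, 3))
```

The result can be compared with `np.corrcoef(X, rowvar=False)`, which should give the same matrix up to floating-point error.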
Linear regression
European bat species and environmental correlates
ln(Area)    ln(Number of species)
10.26632    3.258097
6.148468    0
11.33704    3.218876
7.696213    0.693147
8.519989    2.70805
12.24361    2.890372
10.3264     2.995732
10.84344    3.178054
12.40519    2.890372
11.61702    3.496508
8.891512    2.197225
5.703782    1.609438
9.068777    3.044522
9.019059    2.833213
10.94366    3.526361
7.824046    1.098612
9.132379    2.890372
11.27551    3.178054
10.67112    2.639057
7.887209    2.639057
10.71945    2.397895
7.243513    0
12.73123    2.397895
13.20664    3.465736
12.78555    3.218876
1.871802    1.609438
11.7905     3.496508
11.44094    3.332205
11.54248    0
11.16014    2.397895
12.6162     3.433987
9.615805    2.564949
11.07637    2.772589
5.075174    1.791759
11.08702    2.639057
7.858641    2.890372
10.1401     3.178054
6.670766    1.609438
5.755742    2.079442
10.42552    3.044522
0.667829    0
8.265136    2.197225
9.557046    2.079442
12.68838    2.397895
12.65321    3.178054
11.42796    3.178054
12.37772    3.401197
15.25979    3.401197
4.110874    0
10.07799    2.995732
11.53468    3.367296
10.14353    3.091042
10.80058    3.332205
9.917045    3.295837
13.13427    3.465736
11.03568    0
13.01692    2.890372
10.62825    3.367296
10.63432    2.833213
10.07593    3.258097
13.31114    3.258097
-0.82098    0
N=62
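$$\ln(S) = a_0 + a_1 \ln(A)$$

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} a_0 + a_1 x_1 \\ a_0 + a_1 x_2 \\ \vdots \\ a_0 + a_1 x_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \end{pmatrix}$$

$$\mathbf{Y} = \mathbf{X}\mathbf{A}$$

Matrix approach to linear regression

X is not a square matrix, hence $\mathbf{X}^{-1}$ does not exist. Premultiplying both sides by $\mathbf{X}'$ gives the square matrix $\mathbf{X}'\mathbf{X}$, which can be inverted:

$$\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X}\mathbf{A}$$

$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\mathbf{A} = \mathbf{I}\mathbf{A} = \mathbf{A}$$

$$\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$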
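The normal equations A = (X'X)^-1 X'Y can be sketched in a few lines of Python with NumPy. The helper name `ols` and the small toy data set are assumptions for illustration; the two summary matrices X'X and X'Y are the values reported in this document for the 62 bat records. Solving the linear system with `np.linalg.solve` is used here instead of forming the inverse explicitly, which is a design choice of this sketch rather than the slides' method.

```python
import numpy as np

def ols(X, y):
    """Solve the normal equations (X'X) A = X'Y for the coefficient vector A."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy example (hypothetical data): y = 1 + 2x is recovered exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # constant column plus predictor
y = 1.0 + 2.0 * x
coef = ols(X, y)
print(coef)

# Summary matrices for the bat data as given in the document (N = 62)
XtX = np.array([[62.0, 607.1316],
                [607.1316, 6518.161]])
XtY = np.array([154.2937, 1647.908])
a0, a1 = np.linalg.solve(XtX, XtY)
print(round(a0, 4), round(a1, 4))
```

The second result should reproduce the intercept and slope of the species-area regression, ln(S) = 0.1468 + 0.2391 ln(A), up to the rounding of the tabulated sums.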
The species – area relationship of European bats
Y = ln(Number of species); X consists of a constant column of ones and the column ln(Area), one row per record.

X'X
62          607.1316
607.1316    6518.161

(X'X)^-1
 0.183521   -0.01709
-0.01709     0.001746

X'Y
154.2937
1647.908

(X'X)^-1 (X'Y)
a0    0.146808
a1    0.239144
[Figure: ln(# species) plotted against ln(Area) with the fitted line y = 0.2391x + 0.1468, R² = 0.4614]
What about the part of variance explained by our model?
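$$\mathbf{R} = \frac{1}{n-1}\boldsymbol{\Sigma}_X^{-1}(\mathbf{X}-\mathbf{M})'(\mathbf{X}-\mathbf{M})\boldsymbol{\Sigma}_X^{-1}$$

$$\ln S = 0.15 + 0.24 \ln A$$

$$S = e^{0.15} A^{0.24} = 1.16\,A^{0.24}$$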
1.16: Average number of species per unit area (species density)
0.24: spatial species turnover
Step-by-step computation of $\mathbf{R} = \frac{1}{n-1}\boldsymbol{\Sigma}_X^{-1}(\mathbf{X}-\mathbf{M})'(\mathbf{X}-\mathbf{M})\boldsymbol{\Sigma}_X^{-1}$ for the bat data (first variable ln(Number of species), second variable ln(Area)):

(X-M)'(X-M)
71.0087     136.9954
136.9954    572.8582

(X-M)'(X-M) / (n-1)
1.164077    2.245826
2.245826    9.391119

Σx
1.078924    0
0           3.064493

Σx^-1
0.926849    0
0           0.326318

Σx^-1 (X-M)'(X-M) / (n-1)
1.078924    2.081542
0.732854    3.064493

Σx^-1 (X-M)'(X-M) / (n-1) Σx^-1
1           0.679245
0.679245    1

(Σx^-1 (X-M)'(X-M) / (n-1) Σx^-1)², element by element
1           0.461374
0.461374    1
$$\sigma^2_{Y;M} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$
How to interpret the coefficient of determination
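$$R^2 = 1 - \frac{\text{Residual variance}}{\text{Total variance}} = 1 - \frac{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - Y(X_i))^2}{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n}(Y_i - Y(X_i))^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$

Statistical testing is done by an F or a t-test.

$$F = \frac{R^2}{1-R^2}\,df; \qquad t = \sqrt{F} = \sqrt{\frac{R^2}{1-R^2}\,df}$$

The total variance is the sum of the unexplained and the explained variance:

$$\sigma^2_{Y;M} = \sigma^2_{Y;Y(X)} + \sigma^2_{Y(X);M}$$

Total variance: $\sigma^2_{Y;M}$
Rest (unexplained, residual) variance: $\sigma^2_{Y;Y(X)} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - Y(X_i))^2$
Explained variance: $\sigma^2_{Y(X);M} = \frac{1}{n-1}\sum_{i=1}^{n}(Y(X_i) - \bar{Y})^2$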
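The variance partition behind the coefficient of determination can be illustrated with a small sketch in Python with NumPy. The six data points are hypothetical, and `f_and_t` is an illustrative helper name. For the bat example the sketch assumes df = n - 2 (one predictor plus intercept), which the slides do not state explicitly.

```python
import numpy as np

# Hypothetical data for a bivariate regression
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

a1, a0 = np.polyfit(x, y, 1)        # least-squares slope and intercept
y_hat = a0 + a1 * x                 # predicted values Y(X_i)
n = y.size

total = np.sum((y - y.mean()) ** 2) / (n - 1)          # total variance
residual = np.sum((y - y_hat) ** 2) / (n - 1)          # unexplained variance
explained = np.sum((y_hat - y.mean()) ** 2) / (n - 1)  # explained variance
r2 = 1.0 - residual / total

def f_and_t(r2, df):
    """F = R^2 / (1 - R^2) * df and t = sqrt(F)."""
    f = r2 / (1.0 - r2) * df
    return f, f ** 0.5

# Species-area regression: R^2 = 0.4614, n = 62, df assumed to be n - 2
f_bats, t_bats = f_and_t(0.4614, 62 - 2)
print(round(r2, 4), round(f_bats, 1), round(t_bats, 2))
```

For a least-squares fit with an intercept, `total` equals `explained + residual`, so `r2` is also the ratio of explained to total variance.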
$$\ln(S) = a_0 + a_1 \ln(A) + a_2 T + a_3 N_{T<0} + a_4 L$$
The general linear model
$$Y = a_0 + a_1X_1 + a_2X_2 + a_3X_3 + \dots + a_nX_n = a_0 + \sum_{i=1}^{n} a_iX_i$$
A model that assumes that a dependent variable Y can be expressed by a linear combination of predictor variables X is called a linear model.
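$$\mathbf{Y} = \mathbf{X}\mathbf{A}$$

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & \dots & x_{1,m} \\ 1 & x_{2,1} & \dots & x_{2,m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \dots & x_{n,m} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{pmatrix}$$

$$\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X}\mathbf{A}$$

$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\mathbf{A} = \mathbf{I}\mathbf{A} = \mathbf{A}$$

$$\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$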
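$$\mathbf{Y} = \mathbf{X}\mathbf{A} + \mathbf{E}$$

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & \dots & x_{1,m} \\ 1 & x_{2,1} & \dots & x_{2,m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \dots & x_{n,m} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

The vector E contains the error term of each regression equation. The aim is to minimize E.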
If the errors of the predictor variables are Gaussian, the error term ε should also be Gaussian, and means and variances are additive:

$$\mu(Y) = \mu\!\left(a_0 + \sum_{i=1}^{n} a_iX_i\right) + \mu(\varepsilon)$$

$$\sigma^2(Y) = \sigma^2\!\left(a_0 + \sum_{i=1}^{n} a_iX_i\right) + \sigma^2(\varepsilon)$$

Total variance = Explained variance + Unexplained (rest) variance

$$R^2 = \frac{\sigma^2\!\left(a_0 + \sum_{i=1}^{n} a_iX_i\right)}{\sigma^2(Y)} = \frac{\sigma^2(Y) - \sigma^2(\varepsilon)}{\sigma^2(Y)}$$
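$$\ln(S) = a_0 + a_1 \ln(A) + a_3 N_{T<0} + a_4 L$$

1. Model formulation
2. Estimation of model parameters
3. Estimation of statistical significance

$$\mathbf{Y} = \mathbf{X}\mathbf{A}; \qquad \mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$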
Y is the column ln(Number of species); X consists of the four columns Constant, ln(Area), Days below zero and Latitude of capitals (decimal degrees).

Country/Island            ln(Number of species)  Constant  ln(Area)   Days below zero  Latitude of capitals
Albania                   3.258097               1         10.26632   34               41.33
Andorra                   0                      1         6.148468   60               42.5
Austria                   3.218876               1         11.33704   92               48.12
Azores                    0.693147               1         7.696213   1                37.73
Baleary Islands           2.70805                1         8.519989   18               39.55
Belarus                   2.890372               1         12.24361   144              53.87
Belgium                   2.995732               1         10.3264    50               50.9
Bosnia and Herzegovina    3.178054               1         10.84344   114              43.82
British islands           2.890372               1         12.40519   64               51.15
Bulgaria                  3.496508               1         11.61702   102              42.65
Canary Islands            2.197225               1         8.891512   1                27.93
Channel Is.               1.609438               1         5.703782   12               49.22
Corsica                   3.044522               1         9.068777   11               41.92
Crete                     2.833213               1         9.019059   1                35.33
Croatia                   3.526361               1         10.94366   114              45.82
Cyclades Is.              1.098612               1         7.824046   1                37.1
Cyprus                    2.890372               1         9.132379   2                35.15
Czech Republic            3.178054               1         11.27551   119              50.1
Denmark                   2.639057               1         10.67112   85               55.63
Dodecanese Is.            2.639057               1         7.887209   2                36.4
Estonia                   2.397895               1         10.71945   143              59.35
Faroe Is.                 0                      1         7.243513   35               62
Finland                   2.397895               1         12.73123   169              60.32
France                    3.465736               1         13.20664   50               48.73
Germany                   3.218876               1         12.78555   97               52.38
Gibraltar                 1.609438               1         1.871802   0                36.1
Greece                    3.496508               1         11.7905    2                37.9
Hungary                   3.332205               1         11.44094   100              47.43
Iceland                   0                      1         11.54248   133              64.13
Multiple regression
X' (first eight columns)
1          1          1          1          1          1          1          1
10.26632   6.148468   11.33704   7.696213   8.519989   12.24361   10.3264    10.84344
34         60         92         1          18         144        50         114
41.33      42.5       48.12      37.73      39.55      53.87      50.9       43.82

X'X
62          607.1316    4328        2906.4
607.1316    6518.161    48545.59    29086.57
4328        48545.59    534136      228951.7
2906.4      29086.57    228951.7    141148.1

(X'X)^-1
 1.019166   -0.02275     0.00261    -0.02053
-0.02275     0.002458   -7.5E-05     8.3E-05
 0.00261    -7.5E-05     1.3E-05    -5.9E-05
-0.02053     8.3E-05    -5.9E-05     0.000509

(X'X)^-1 X' (first eight columns)
 0.025783    0.163309    0.013407    0.07203     0.060295    0.010457   -0.13031     0.170347
 0.003376   -0.00859     0.002243   -0.00078     0.00013     0.001069    0.003124   -0.00097
-0.00017     0.000405    9.87E-05   -0.00019    -0.00014     0.000364   -0.00054     0.000676
-0.00066    -0.00195    -0.00056    -0.00074    -0.00076    -0.00064     0.003269   -0.00409

      X'Y         (X'X)^-1 (X'Y)
a0    154.2937     2.679757
a1    1647.908     0.290121
a2    11289.32     0.002155
a3    7137.716    -0.06789
Multiple R and R2
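$$R^2 = 1 - \frac{\text{Residual variance}}{\text{Total variance}} = 1 - \frac{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - Y(X_i))^2}{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n}(Y_i - Y(X_i))^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$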
The coefficient of determination
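$$\mathbf{R} = \begin{pmatrix} 1 & r_{y1} & r_{y2} & \dots & r_{ym} \\ r_{1y} & 1 & r_{12} & \dots & r_{1m} \\ r_{2y} & r_{21} & 1 & \dots & r_{2m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{my} & r_{m1} & r_{m2} & \dots & 1 \end{pmatrix} = \begin{pmatrix} 1 & \mathbf{R}_{YX} \\ \mathbf{R}_{XY} & \mathbf{R}_{XX} \end{pmatrix}$$

Rows and columns are ordered y, x1, x2, ..., xm.
The correlation matrix can be divided into four compartments.

$$R^2 = \mathbf{R}_{YX}\mathbf{R}_{XX}^{-1}\mathbf{R}_{XY} = \mathbf{R}_{XY}^{T}\mathbf{R}_{XX}^{-1}\mathbf{R}_{XY}$$

$$R^2 = \frac{\det(\mathbf{R}_{XX}) - \det(\mathbf{R})}{\det(\mathbf{R}_{XX})} = 1 - \frac{\det(\mathbf{R})}{\det(\mathbf{R}_{XX})}$$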
ln(Number of species)
ln(Area)Days below zero
Latitude of capitals (decimal degrees)
X-M X-M X-M X-M (X-M)'
3.2580965 10.26632 34 41.33 0.769488 0.473878 -35.8065 -5.54742 0.769488 -2.48861 0.730267 -1.79546 0.219442 0.401763 0.507124 0.689445 0.401763 1.007899 -0.291380 6.148468 60 42.5 -2.48861 -3.64398 -9.80645 -4.37742 0.473878 -3.64398 1.54459 -2.09623 -1.27246 2.451164 0.533954 1.050991 2.612741 1.824579 -0.90093
3.2188758 11.33704 92 48.12 0.730267 1.54459 22.19355 1.242581 -35.8065 -9.80645 22.19355 -68.8065 -51.8065 74.19355 -19.8065 44.19355 -5.80645 32.19355 -68.80650.6931472 7.696213 1 37.73 -1.79546 -2.09623 -68.8065 -9.14742 -5.54742 -4.37742 1.242581 -9.14742 -7.32742 6.992581 4.022581 -3.05742 4.272581 -4.22742 -18.94742.7080502 8.519989 18 39.55 0.219442 -1.27246 -51.8065 -7.327422.8903718 12.24361 144 53.87 0.401763 2.451164 74.19355 6.992581 (X-M)'(X-M) (X-M)'(X-M)/(n-1)2.9957323 10.3264 50 50.9 0.507124 0.533954 -19.8065 4.022581 71.0087 136.9954 518.6241 -95.1758 1.164077 2.245826 8.502034 -1.560263.1780538 10.84344 114 43.82 0.689445 1.050991 44.19355 -3.05742 136.9954 572.8582 6163.884 625.8081 2.245826 9.391119 101.0473 10.259152.8903718 12.40519 64 51.15 0.401763 2.612741 -5.80645 4.272581 518.6241 6163.884 232013.7 26066.26 8.502034 101.0473 3803.503 427.31573.4965076 11.61702 102 42.65 1.007899 1.824579 32.19355 -4.22742 -95.1758 625.8081 26066.26 4903.6 -1.56026 10.25915 427.3157 80.386892.1972246 8.891512 1 27.93 -0.29138 -0.90093 -68.8065 -18.94741.6094379 5.703782 12 49.22 -0.87917 -4.08866 -57.8065 2.342581 3.0445224 9.068777 11 41.92 0.555914 -0.72367 -58.8065 -4.95742 1.078924 0 0 0
2.8332133 9.019059 1 35.33 0.344605 -0.77339 -68.8065 -11.5474 0 3.064493 0 03.5263605 10.94366 114 45.82 1.037752 1.151213 44.19355 -1.05742 0 0 61.67255 01.0986123 7.824046 1 37.1 -1.39 -1.9684 -68.8065 -9.77742 0 0 0 8.9658742.8903718 9.132379 2 35.15 0.401763 -0.66007 -67.8065 -11.7274
3.1780538 11.27551 119 50.1 0.689445 1.48306 49.19355 3.222581 1
2.6390573 10.67112 85 55.63 0.150449 0.878671 15.19355 8.7525812.6390573 7.887209 2 36.4 0.150449 -1.90524 -67.8065 -10.4774 0.926849 0 0 0
2.3978953 10.71945 143 59.35 -0.09071 0.927004 73.19355 12.47258 0 0.326318 0 00 7.243513 35 62 -2.48861 -2.54893 -34.8065 15.12258 0 0 0.016215 0
2.3978953 12.73123 169 60.32 -0.09071 2.938785 99.19355 13.44258 0 0 0 0.1115343.4657359 13.20664 50 48.73 0.977127 3.414195 -19.8065 1.852581
3.2188758 12.78555 97 52.38 0.730267 2.993105 27.19355 5.502581 1D1.6094379 1.871802 0 36.1 -0.87917 -7.92064 -69.8065 -10.7774 1.078924 2.081542 7.880104 -1.446133.4965076 11.7905 2 37.9 1.007899 1.998051 -67.8065 -8.97742 0.732854 3.064493 32.97357 3.347747
3.3322045 11.44094 100 47.43 0.843596 1.64849 30.19355 0.552581 0.137858 1.638448 61.67255 6.9287840 11.54248 133 64.13 -2.48861 1.750039 63.19355 17.25258 -0.17402 1.144244 47.66024 8.965874
2.3978953 11.16014 23 53.43 -0.09071 1.367698 -46.8065 6.552581
3.4339872 12.6162 18 41.8 0.945379 2.823752 -51.8065 -5.07742 1D-1
2.5649494 9.615805 110 52.7 0.076341 -0.17664 40.19355 5.822581 1 0.679245 0.127773 -0.16129 Det RXX R2
2.7725887 11.07637 124 56.96 0.28398 1.283927 54.19355 10.08258 0.679245 1 0.534656 0.373388 0.286065 0.6664621.7917595 5.075174 90 47.67 -0.69685 -4.71727 20.19355 0.792581 0.127773 0.534656 1 0.772795 Det R2.6390573 11.08702 130 54.62 0.150449 1.294578 60.19355 7.742581 -0.16129 0.373388 0.772795 1 0.0954132.8903718 7.858641 93 49.62 0.401763 -1.9338 23.19355 2.742581
R = Σ^-1 D Σ^-1 (order: ln(Number of species), ln(Area), Days below zero, Latitude)
 1           0.679245    0.127773   -0.16129
 0.679245    1           0.534656    0.373388
 0.127773    0.534656    1           0.772795
-0.16129     0.373388    0.772795    1

Det RXX = 0.286065
Det R = 0.095413
R2 = 0.666462

RXX^-1
 1.408029   -0.86031     0.139099
-0.86031     3.008345   -2.00361
 0.139099   -2.00361     2.496439

RXX^-1 RXY
 0.824037
 0.123194
-0.56418

RYX RXX^-1 RXY = 0.666462
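The two routes to the multiple R², via the partitioned correlation matrix and via determinants, can be compared in a short Python sketch with NumPy. The 4 x 4 correlation matrix is the one tabulated in this document for the bat data; the variable names are assumptions of the sketch.

```python
import numpy as np

# Correlation matrix from the document; order: ln(S), ln(Area),
# days below zero, latitude
R = np.array([
    [ 1.0,      0.679245, 0.127773, -0.16129 ],
    [ 0.679245, 1.0,      0.534656,  0.373388],
    [ 0.127773, 0.534656, 1.0,       0.772795],
    [-0.16129,  0.373388, 0.772795,  1.0     ],
])

R_xy = R[1:, 0]      # correlations of each predictor with Y
R_xx = R[1:, 1:]     # correlations among the predictors

# R^2 = R_YX R_XX^-1 R_XY (solve instead of explicit inversion)
beta = np.linalg.solve(R_xx, R_xy)
r2_product = float(R_xy @ beta)

# R^2 = 1 - det(R) / det(R_XX)
r2_det = 1.0 - np.linalg.det(R) / np.linalg.det(R_xx)

print(np.round(beta, 4), round(r2_product, 4), round(r2_det, 4))
```

Both expressions should agree with each other and with the tabulated value R² = 0.666 up to rounding of the input correlations.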
Adjusted R2
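$$F = \frac{R^2}{1-R^2}\,\frac{df_2}{df_1} = \frac{R^2}{1-R^2}\,\frac{n-k-1}{k} = \frac{0.66646}{1-0.66646}\cdot\frac{62-3-1}{3} = 38.6307$$

R: correlation matrix
n: number of cases
k: number of independent variables in the model

$$t = \frac{\text{parameter}}{SE(\text{parameter})}$$

$$R^2_{adj} = 1 - (1-R^2)\,\frac{n-1}{n-k-1}$$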
D<0 is statistically not significant and should
be eliminated from the model.
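$$SE(a_i) = \sqrt{\frac{c_{ii}\,(1-R^2)}{n-k-1}}$$

where $c_{ii}$ is the i-th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$.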
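A minimal sketch of the F statistic and the adjusted R² in plain Python. The function names `f_statistic` and `adjusted_r2` are illustrative; the input values R² = 0.666462, n = 62 and k = 3 are the ones used in this document for the three-predictor bat model.

```python
def f_statistic(r2, n, k):
    """F = R^2 / (1 - R^2) * (n - k - 1) / k."""
    return r2 / (1.0 - r2) * (n - k - 1) / k

def adjusted_r2(r2, n, k):
    """R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

f = f_statistic(0.666462, 62, 3)
adj = adjusted_r2(0.666462, 62, 3)
print(round(f, 2), round(adj, 4))
```

The adjusted value is always at most the raw R², and the gap widens as k grows relative to n.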
Y
Country/Islandln(Number
of species)
Constant ln(Area)Days below zero
Latitude of capitals (decimal degrees)
Latitude2 X'
Albania 3.258097 1 10.26632 34 41.33 1708.169 1 1 1 1 1 1 1 1Andorra 0 1 6.148468 60 42.5 1806.25 10.26632 6.148468 11.33704 7.696213 8.519989 12.24361 10.3264 10.84344Austria 3.218876 1 11.33704 92 48.12 2315.534 34 60 92 1 18 144 50 114Azores 0.693147 1 7.696213 1 37.73 1423.553 41.33 42.5 48.12 37.73 39.55 53.87 50.9 43.82Baleary Islands 2.70805 1 8.519989 18 39.55 1564.203 1708.169 1806.25 2315.534 1423.553 1564.203 2901.977 2590.81 1920.192Belarus 2.890372 1 12.24361 144 53.87 2901.977Belgium 2.995732 1 10.3264 50 50.9 2590.81 X'XBosnia and Herzegovina 3.178054 1 10.84344 114 43.82 1920.192 62 607.1316 4328 2906.4 141148.1British islands 2.890372 1 12.40519 64 51.15 2616.323 607.1316 6518.161 48545.59 29086.57 1441737Bulgaria 3.496508 1 11.61702 102 42.65 1819.023 4328 48545.59 534136 228951.7 12488619Canary Islands 2.197225 1 8.891512 1 27.93 780.0849 2906.4 29086.57 228951.7 141148.1 7106497Channel Is. 1.609438 1 5.703782 12 49.22 2422.608 141148.1 1441737 12488619 7106497 3.71E+08Corsica 3.044522 1 9.068777 11 41.92 1757.286Crete 2.833213 1 9.019059 1 35.33 1248.209 (X'X)-1
Croatia 3.526361 1 10.94366 114 45.82 2099.472 6.45421 0.000497 0.001087 -0.25606 0.002409Cyclades Is. 1.098612 1 7.824046 1 37.1 1376.41 0.000497 0.002557 -8.1E-05 -0.00092 1.03E-05Cyprus 2.890372 1 9.132379 2 35.15 1235.523 0.001087 -8.1E-05 1.34E-05 6.63E-06 -6.8E-07Czech Republic 3.178054 1 11.27551 119 50.1 2510.01 -0.25606 -0.00092 6.63E-06 0.010716 -0.0001Denmark 2.639057 1 10.67112 85 55.63 3094.697 0.002409 1.03E-05 -6.8E-07 -0.0001 1.07E-06Dodecanese Is. 2.639057 1 7.887209 2 36.4 1324.96Estonia 2.397895 1 10.71945 143 59.35 3522.423 (X'X)-1X'Faroe Is. 0 1 7.243513 35 62 3844 0.028519 -0.00857 -0.18332 0.227512 0.119213 -0.18587 -0.27812 -0.01106Finland 2.397895 1 12.73123 169 60.32 3638.502 0.003388 -0.00932 0.001402 -0.00011 0.000382 0.000229 0.002492 -0.00174France 3.465736 1 13.20664 50 48.73 2374.613 -0.00017 0.000453 0.000154 -0.00024 -0.00016 0.000419 -0.00049 0.000727Germany 3.218876 1 12.78555 97 52.38 2743.664 -0.00078 0.0055 0.007968 -0.00748 -0.00331 0.007864 0.009674 0.003767Gibraltar 1.609438 1 1.871802 0 36.1 1303.21 1.21E-06 -7.6E-05 -8.7E-05 6.89E-05 2.61E-05 -8.7E-05 -6.6E-05 -8E-05Greece 3.496508 1 11.7905 2 37.9 1436.41Hungary 3.332205 1 11.44094 100 47.43 2249.605 (X'X)-1X'YIceland 0 1 11.54248 133 64.13 4112.657 a0 -3.40816Ireland 2.397895 1 11.16014 23 53.43 2854.765 a1 0.264082Italy 3.433987 1 12.6162 18 41.8 1747.24 a2 0.003862Kaliningrad Region 2.564949 1 9.615805 110 52.7 2777.29 a3 0.195932Latvia 2.772589 1 11.07637 124 56.96 3244.442 a4 -0.0027
X
A mixed model

$$\ln S = a_0 + a_1 \ln A + a_2 D_{T<0} + a_3 L + a_4 L^2$$

$$\ln S = -3.41 + 0.26 \ln A + 0.004\,D_{T<0} + 0.196\,L - 0.0027\,L^2$$
The final model
Is this model realistic?
Negative species density
Realistic increase of species richness with area
Increase of species richness with winter length
Increase of species richness at higher latitudes
A peak of species richness at intermediate latitudes
The model makes a series of unrealistic predictions. Our initial assumptions are wrong despite the high degree of variance explanation.
Our problem arises in part from the intercorrelation between the predictor variables (multicollinearity).
We solve the problem by a step-wise approach, eliminating the variables that are either not significant or give unreasonable parameter values.
The variance explanation of this final model is higher than that of the previous one.
[Figure: ln(# species predicted) plotted against ln(# species observed) with the fitted line y = 0.6966x + 0.7481, R² = 0.6973]
$$Y = a_0 + a_{11}X_1 + a_{12}X_1^2 + a_{13}X_1^3 + \dots + a_{21}X_2 + a_{22}X_2^2 + a_{23}X_2^3 + \dots + a_{n1}X_n + a_{n2}X_n^2 + a_{n3}X_n^3 + \dots$$
Multiple regression solves systems of intrinsically linear algebraic equations
$$\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$

• The matrix X'X must not be singular. That is, the variables have to be independent. Otherwise we speak of multicollinearity. Correlations between predictors of r < 0.7 are tolerable in most cases.
• To be applied safely, multiple regression needs at least 10 times as many cases as variables in the model.
• Statistical inference assumes that errors have a normal distribution around the mean.
• The model assumes linear (or algebraic) dependencies. Check first for non-linearities.
• Check the distribution of residuals Yexp-Yobs. This distribution should be random.
• Check whether the parameters have realistic values.
Multiple regression is a hypothesis-testing and not a hypothesis-generating technique!
Polynomial regression General additive model
Standardized coefficients of correlation
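$$Z = \frac{X - \bar{X}}{s_X}$$

Z-transformed distributions have a mean of 0 and a standard deviation of 1.

$$\mathbf{B} = (\mathbf{Z}_X'\mathbf{Z}_X)^{-1}\mathbf{Z}_X'\mathbf{Z}_Y$$

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(X_i - \bar{X})(Y_i - \bar{Y})}{s_X s_Y} = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(X_i - \bar{X})}{s_X}\frac{(Y_i - \bar{Y})}{s_Y} = \frac{1}{n-1}\sum_{i=1}^{n} Z_{X_i}Z_{Y_i}$$

$$\mathbf{R} = \begin{pmatrix} 1 & \dots & r_{1n} \\ \vdots & \ddots & \vdots \\ r_{n1} & \dots & 1 \end{pmatrix} = \frac{1}{n-1}\mathbf{Z}'\mathbf{Z}$$

Because $\mathbf{Z}_X'\mathbf{Z}_X = (n-1)\mathbf{R}_{XX}$ and $\mathbf{Z}_X'\mathbf{Z}_Y = (n-1)\mathbf{R}_{XY}$, it follows that

$$\mathbf{B} = \mathbf{R}_{XX}^{-1}\mathbf{R}_{XY}$$

In the case of bivariate regression Y = aX + b, RXX = 1. Hence B = RXY.
Hence the use of Z-transformed values results in standardized correlation coefficients, termed beta-values.

$$\mathbf{R} = \frac{1}{n-1}\boldsymbol{\Sigma}_X^{-1}(\mathbf{X}-\mathbf{M})'(\mathbf{X}-\mathbf{M})\boldsymbol{\Sigma}_X^{-1} = \frac{1}{n-1}\mathbf{Z}'\mathbf{Z}; \qquad \mathbf{R}_{XY} = \mathbf{R}_{XX}\mathbf{B}$$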
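The equivalence between a regression on Z-transformed variables and the expression B = RXX^-1 RXY can be checked numerically. This Python sketch with NumPy uses simulated data; the sample size, the coefficients used to generate the data and the helper name `zscore` are all assumptions made for illustration.

```python
import numpy as np

def zscore(a):
    """Z-transform: subtract the mean, divide by the sample standard deviation."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

# Simulated data (hypothetical): three predictors, two of them correlated
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
X[:, 1] += 0.5 * X[:, 0]
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

# Route 1: ordinary least squares on Z-transformed variables
Zx, Zy = zscore(X), zscore(y)
beta_ols = np.linalg.solve(Zx.T @ Zx, Zx.T @ Zy)

# Route 2: beta values from the partitioned correlation matrix
R = np.corrcoef(np.column_stack([y, X]), rowvar=False)
beta_corr = np.linalg.solve(R[1:, 1:], R[1:, 0])

print(np.round(beta_ols, 4))
print(np.round(beta_corr, 4))
```

Both routes should give the same beta values up to floating-point error, because Z'Z / (n - 1) is the correlation matrix.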
How to interpret beta-values
If then Beta values are generalisations of simple coefficients of correlation. However, there is an important difference. The higher the correlation between two or more predictor variables (multicollinearity) is, the less r will depend on the correlation between X and Y. Hence other variables might have more and more influence on r and b. For high levels of multicollinearity it might therefore become more and more difficult to interpret beta-values in terms of correlations. Because beta-values are standardized b-values, they should allow comparisons to be made about the relative influence of predictor variables. High levels of multicollinearity might lead to misinterpretations. Beta values above one are always a sign of too high multicollinearity.
Hence high levels of multicollinearity might
· reduce the exactness of beta-weight estimates
· change the probabilities of making type I and type II errors
· make it more difficult to interpret beta-values.
We might apply an additional parameter, the so-called coefficient of structure. The coefficient of structure ci is defined as shown in the formula for ci,
where riY denotes the simple correlation between predictor variable i and the dependent variable Y, and R2 the coefficient of determination of the multiple regression. Coefficients of structure therefore measure the fraction of total variability a given predictor variable explains. Again, the interpretation of ci is not always unequivocal at high levels of multicollinearity.
BXY Y
XiXiXi B
$$c_i = \frac{r_{iY}}{R^2}$$
Partial correlations
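[Diagram: three variables X, Y and Z linked by the pairwise correlations rxy, rzy and rzx; labels X, X(Y), X(Z) and Y, Y(X), Y(Z)]

$$r_{XY/Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$$

Semipartial correlation

$$r_{(X|Y)Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{1 - r_{YZ}^2}}$$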
A semipartial correlation correlates a variable with one residual only.
[Figure: two panels. X plotted against Z with the regression line X = 1.02Z + 0.41, and Y plotted against Z with the regression line Y = 1.70Z + 0.60]
The partial correlation rxy/z is the correlation of the residuals ΔX and ΔY.
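The statement that the partial correlation equals the correlation of the residuals can be verified with a small simulation. This is a sketch in Python with NumPy; the simulated data, the noise level and the helper names `partial_corr`, `semipartial_corr` and `residuals` are assumptions for illustration (the slopes and intercepts merely echo the two regression lines quoted for the figure).

```python
import numpy as np

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of X and Y after removing Z from both."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def semipartial_corr(r_xy, r_xz, r_yz):
    """Correlation of X with the part of Y that is not explained by Z."""
    return (r_xy - r_xz * r_yz) / np.sqrt(1 - r_yz**2)

def residuals(v, z):
    """Residuals of a least-squares regression of v on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Simulated data (hypothetical): X and Y both depend on Z
rng = np.random.default_rng(2)
z = rng.normal(size=500)
x = 1.02 * z + 0.41 + rng.normal(size=500)
y = 1.70 * z + 0.60 + rng.normal(size=500)

r = np.corrcoef([x, y, z])          # rows are the variables x, y, z
r_formula = partial_corr(r[0, 1], r[0, 2], r[1, 2])
r_resid = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
sr_formula = semipartial_corr(r[0, 1], r[0, 2], r[1, 2])
sr_resid = np.corrcoef(x, residuals(y, z))[0, 1]

print(round(r_formula, 4), round(r_resid, 4))
```

The semipartial version uses only one residual (here that of Y on Z), which is why its denominator contains a single square-root term.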
Path analysis and linear structure models
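[Diagram "Multiple regression": variables X1, X2, X3, X4 and Y with one error term e]
[Diagram: variables X1, X2, X3, X4 and Y with an error term e on each variable]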
Path analysis tries to do something that is logically impossible, to derive causal relationships from sets of observations.
Path analysis defines a whole model and tries to separate correlations into direct and indirect effects
$$Y = a_0 + a_1X_1 + a_2X_2 + a_3X_3 + a_4X_4 + e$$

The error term e contains the part of the variance in Y that is not explained by the model. These errors are called residuals.
Regression analysis does not study the relationships between the predictor variables.
[Path diagram: variables W, X, Y and Z connected by paths with the coefficients pXW, pZX, pZY and pXY; error terms e]
Path analysis is largely based on the computation of partial coefficients of correlation.
Path coefficients
Path analysis is a model confirmatory tool. It should not be used to generate models or even to seek for models that fit the data set.
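We start from regression functions:

$$W = p_{xw}X + e$$
$$X = p_{xy}Y + e$$
$$Z = p_{zx}X + p_{zy}Y + e$$

or, rearranged,

$$p_{xw}X - W + e = 0$$
$$X - p_{xy}Y - e = 0$$
$$p_{zx}X + p_{zy}Y - Z + e = 0$$

[Path diagram: variables X, W, Z and Y with the path coefficients pXW, pYX, pXZ and pYZ]

From Z-transformed values we get

$$Z_W = p_{xw}Z_X + e$$
$$Z_X = p_{xy}Z_Y + e$$
$$Z_Z = p_{zx}Z_X + p_{zy}Z_Y + e$$

Multiplying each equation by a further Z-transformed variable gives

$$Z_WZ_Y = p_{xw}Z_XZ_Y + eZ_Y$$
$$Z_XZ_W = p_{xy}Z_YZ_W + eZ_W$$
$$Z_ZZ_W = p_{zx}Z_XZ_W + p_{zy}Z_YZ_W + eZ_W$$
$$Z_XZ_Z = p_{xy}Z_YZ_Z + eZ_Z$$
$$Z_XZ_Y = p_{xy}Z_YZ_Y + eZ_Y$$
$$Z_ZZ_Y = p_{zx}Z_XZ_Y + p_{zy}Z_YZ_Y + eZ_Y$$

Averaged over all cases, the products satisfy eZY = 0, ZYZY = 1 and ZXZY = rXY. This yields

$$r_{WY} = p_{xw}r_{XY}$$
$$r_{XW} = p_{xy}r_{YW}$$
$$r_{ZW} = p_{zx}r_{XW} + p_{zy}r_{YW}$$
$$r_{XZ} = p_{xy}r_{YZ}$$
$$r_{XY} = p_{xy}$$
$$r_{ZY} = p_{zx}r_{XY} + p_{zy}$$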
Path analysis is a nice tool to generate hypotheses. It fails at low coefficients of correlation and circular model structures.
Target symptom
X A B C D E Expected values X'1 0 1 1 0 1 0.848615 A 0 0 1 0 11 0 1 1 0 1 0.848615 B 1 1 0 1 00 1 0 0 0 0 -0.2092 C 1 1 0 1 01 0 1 1 1 1 1.108631 D 0 0 0 1 00 1 0 0 0 1 0.106749 E 1 1 0 1 11 0 1 1 1 1 1.1086310 1 0 0 0 0 -0.2092 X'X1 0 1 1 1 1 1.108631 A B C D E1 1 1 1 1 1 0.899435 A 8 5 1 2 40 1 1 0 0 1 0.19602 B 5 11 6 6 91 0 0 1 1 1 1.01936 C 1 6 10 8 101 0 0 1 1 1 1.01936 D 2 6 8 11 111 0 0 0 1 1 0.575961 E 4 9 10 11 151 0 1 0 1 1 0.665233
0 1 1 0 0 0 -0.11992 (X'X)-1
0 1 1 0 0 0 -0.11992 0.205969 -0.09304 0.098145 0.0242 -0.082281 0 0 1 1 1 1.01936 -0.09304 0.224792 -0.05216 0.028233 -0.095991 0 0 1 1 1 1.01936 0.098145 -0.05216 0.361387 -0.06158 -0.190640 1 1 0 1 1 0.456037 0.0242 0.028233 -0.06158 0.368379 -0.25249
-0.08228 -0.09599 -0.19064 -0.25249 0.458457Sum 8 11 10 11 15
+L$25*B23+L$26*C23+L$27*D23+L$28*E23+L$29*F23
X'Y (X'X)-1X'Y1 -0.20927 0.089271
10 0.44339910 0.26001612 0.315945
[Figure "Symptoms": predicted value plotted against observed occurrences (0 or 1)]
Non-metric multiple regression
R2      0.828365
1-R2    0.171635
N       19
df      13

(X'X)^-1
 0.205969   -0.09304     0.098145    0.0242     -0.08228
-0.09304     0.224792   -0.05216     0.028233   -0.09599
 0.098145   -0.05216     0.361387   -0.06158    -0.19064
 0.0242      0.028233   -0.06158     0.368379   -0.25249
-0.08228    -0.09599    -0.19064    -0.25249     0.458457

     B           diag((X'X)^-1) x (1-R2)   SE(B)       t           P
A    -0.2092     0.035351                  0.052147    -4.01163    0.001479
B    0.089271    0.038582                  0.054478    1.638668    0.125246
C    0.443399    0.062027                  0.069074    6.419143    2.27E-05
D    0.260016    0.063227                  0.069739    3.728397    0.002529
E    0.315945    0.078687                  0.0778      4.060988    0.001348
Statistical inference
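$$SE(B_i) = \sqrt{\frac{c_{ii}\,(1-R^2)}{n-k-1}}$$

where $c_{ii}$ is the i-th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$.

$$R^2 = 1 - \frac{\text{Residual variance}}{\text{Total variance}} = 1 - \frac{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - Y(X_i))^2}{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$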
Target symptom
Predicted values
Predicted values
Total variance
Explained variance
Unexplained
varianceX A B C D E1 0 1 1 0 1 0.848615 1 0.135734 0.047105 0.0229171 0 1 1 0 1 0.848615 1 0.135734 0.047105 0.0229170 1 0 0 0 0 -0.2092 0 0.398892 0.706903 0.0437631 0 1 1 1 1 1.108631 1 0.135734 0.227579 0.0118010 1 0 0 0 1 0.106749 0 0.398892 0.275446 0.0113951 0 1 1 1 1 1.108631 1 0.135734 0.227579 0.0118010 1 0 0 0 0 -0.2092 0 0.398892 0.706903 0.0437631 0 1 1 1 1 1.108631 1 0.135734 0.227579 0.0118011 1 1 1 1 1 0.899435 1 0.135734 0.071747 0.0101130 1 1 0 0 1 0.19602 0 0.398892 0.189711 0.0384241 0 0 1 1 1 1.01936 1 0.135734 0.150374 0.0003751 0 0 1 1 1 1.01936 1 0.135734 0.150374 0.0003751 0 0 0 1 1 0.575961 1 0.135734 0.003093 0.1798091 0 1 0 1 1 0.665233 1 0.135734 0.001133 0.1120690 1 1 0 0 0 -0.11992 0 0.398892 0.564758 0.0143820 1 1 0 0 0 -0.11992 0 0.398892 0.564758 0.0143821 0 0 1 1 1 1.01936 1 0.135734 0.150374 0.0003751 0 0 1 1 1 1.01936 1 0.135734 0.150374 0.0003750 1 1 0 1 1 0.456037 0 0.398892 0.030815 0.207969
Mean0.631579 0.421053 0.578947 0.526316 0.578947 0.789474 0.245614 0.249651 0.042156
True R2 1
Approximated R2 0.828365
Symptoms
Rounding errors due to different precisions cause the residual variance to be larger
than the total variance.
Logistic and other regression techniques
Sex       A        B        C
Male      5.998    0.838    2.253
Male      3.916    0.992    1.964
Male      4.511    0.904    1.930
Male      5.940    0.795    1.171
Male      6.532    0.574    1.390
Male      6.513    1.036    0.571
Male      3.052    0.584    2.179
Male      3.512    1.126    1.843
Male      6.676    0.992    2.288
Male      6.976    0.502    1.062
Female    5.649    0.913    2.231
Female    5.712    0.474    2.237
Female    5.112    0.277    1.009
Female    3.681    0.329    2.420
Female    5.239    0.922    1.592
Female    5.180    0.546    2.418
Female    2.133    0.300    3.087
Female    5.361    0.472    2.175
Female    6.460    0.321    1.007
Female    6.839    0.426    3.179
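$$Z = \frac{e^{y}}{1 + e^{y}} = \frac{1}{1 + e^{-y}}$$

$$Y = a_0 + \sum_{i=1}^{n} a_i x_i$$

$$Z = \frac{e^{a_0 + \sum_{i=1}^{n} a_i x_i}}{1 + e^{a_0 + \sum_{i=1}^{n} a_i x_i}}$$

[Figure: Z plotted against Y as a logistic curve with a threshold, an indecisive region around it, and the regions "Surely males" and "Surely females"]

We use odds

$$\ln\frac{p}{1-p} = a_0 + \sum_{i=1}^{n} a_i x_i; \qquad \frac{p}{1-p} = e^{a_0 + \sum_{i=1}^{n} a_i x_i}; \qquad p = \frac{e^{a_0 + \sum_{i=1}^{n} a_i x_i}}{1 + e^{a_0 + \sum_{i=1}^{n} a_i x_i}}$$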
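The link between the logistic function and the log odds can be sketched in Python with NumPy. The coefficient values below are hypothetical placeholders, not the fitted values of the example, and the function names are illustrative.

```python
import numpy as np

def logistic(y):
    """Z = e^y / (1 + e^y), written in the equivalent form 1 / (1 + e^-y)."""
    return 1.0 / (1.0 + np.exp(-y))

def predict_probability(x, a0, a):
    """Probability p from the linear predictor a0 + sum(a_i * x_i)."""
    return logistic(a0 + np.dot(x, a))

def log_odds(p):
    """ln(p / (1 - p)), the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients and one hypothetical case
a0, a = 0.5, np.array([1.0, -2.0])
x = np.array([0.3, 0.1])

p = predict_probability(x, a0, a)
print(round(float(p), 4), round(float(log_odds(p)), 4))
```

Taking the log odds of the predicted probability returns the linear predictor, here 0.5 + 0.3 - 0.2 = 0.6, which is why the model is linear on the log-odds scale.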
The logistic regression model
[Figure: Z for each individual, grouped by sex (Male, Female); Z ranges from 0 to 1]
0.19 0.2 6.36 1.77
0.19 0.2 6.36 1.771
A B C
A B C
eZ
e
0 i i
0
a a x
1
bY
1 b e
Generalized non-linear regression models
[Figure: two example curves of Y (0 to 1) against x (0 to 10), one with b1 = 3, b2 = 4 and one with b1 = 1, b2 = 0.5]
A special regression model that is used in pharmacology
$$Y = b_0 - \frac{b_0}{1 + \left(\dfrac{X}{b_1}\right)^{b_2}}$$

b0 is the maximum response at dose saturation.
b1 is the concentration that produces a half maximum response.
b2 determines the slope of the function, that is, it measures how fast the response increases with increasing drug dose.
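The roles of the three parameters can be checked with a short Python sketch, assuming the dose-response form Y = b0 - b0 / (1 + (X / b1)^b2). The function name `dose_response` and the parameter values (b0 = 1, b1 = 3, b2 = 4) are assumptions chosen for illustration.

```python
import numpy as np

def dose_response(x, b0, b1, b2):
    """Y = b0 - b0 / (1 + (x / b1)**b2) for doses x >= 0."""
    x = np.asarray(x, dtype=float)
    return b0 - b0 / (1.0 + (x / b1) ** b2)

b0, b1, b2 = 1.0, 3.0, 4.0                # hypothetical parameter values
doses = np.array([0.0, 1.0, 3.0, 10.0, 1000.0])
response = dose_response(doses, b0, b1, b2)
print(np.round(response, 4))
```

At a dose of zero the response is zero, at X = b1 it is exactly half of b0, and for very large doses it approaches b0; a larger b2 makes the transition between these regimes steeper.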