Statistics, Visualization and More Using ”R” (298.916)

Block I+II: Distributions & simulations, loops and functions, linear regression

Ass.-Prof. Dr. Wolfgang Trutschnig

Research group for Stochastics/Statistics, Department for Mathematics

University Salzburg

www.trutschnig.net

Salzburg, March 2017

Standard probability distributions Loops, if/ifelse, R-functions Correlation Regression in general Linear regression

Plan for today:

- Some standard probability distributions (uniform, exponential, normal distribution) which will be used throughout the seminar
- Generating samples from these distributions
- Loops and if/ifelse
- Writing own R-functions
- Pearson vs. Spearman (rank) correlation
- First steps in linear regression
- Exercises

Uniform distribution U(a, b)

- Range: interval [a, b] (with a < b).
- The density f is given by

  $$f(x) = \frac{1}{b-a}\,\mathbf{1}_{[a,b]}(x).$$

- For X ∼ U(a, b) we have

  $$E(X) = \frac{a+b}{2}, \qquad V(X) = \frac{(b-a)^2}{12}.$$

- Where does this distribution naturally appear?
- Generate a sample of size n = 10,000.

    n <- 10000
    x <- runif(n, min = -1, max = 1)
    hist(x, probability = TRUE)

Figure: Histogram of a sample of size 10,000 from X ∼ U(−1, 1) (y-axis: Density)

Exponential distribution E(λ) (λ > 0)

- Range: interval [0, ∞).
- The density f is given by

  $$f(x) = \lambda e^{-\lambda x}\,\mathbf{1}_{[0,\infty)}(x).$$

- For X ∼ E(λ) we have

  $$E(X) = \frac{1}{\lambda}, \qquad V(X) = \frac{1}{\lambda^2}.$$

- Where does this distribution naturally appear?
- Generate a sample of size n = 10,000.

    n <- 10000
    lambda <- 3
    x <- rexp(n, rate = lambda)
    hist(x, probability = TRUE)

Figure: Histogram of a sample of size 10,000 from X ∼ E(3) (y-axis: Density)

Normal distribution N(µ, σ²)

- Range: R
- Density of N(µ, σ²):

  $$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

- For X ∼ N(µ, σ²) we have

  $$E(X) = \mu, \qquad V(X) = \sigma^2.$$

- The most important case is X ∼ N(0, 1), for which we have E(X) = 0, V(X) = 1.
- Generate a sample of size n = 10,000.

    n <- 10000
    mu <- 0; sigma <- 1
    x <- rnorm(n, mean = mu, sd = sigma)
    hist(x, probability = TRUE)

Figure: Histogram of a sample of size 10,000 from X ∼ N(0, 1) (y-axis: Density)
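The expectation and variance formulas stated for U(a, b), E(λ) and N(µ, σ²) can be checked against simulated samples. The following is a minimal sketch under assumed settings (the seed, the sample size and the data frame name `moments` are illustrative choices, not part of the course script):

```r
# Minimal sketch: compare empirical moments with the theoretical values.
set.seed(1)   # assumption: fixed seed, only for reproducibility
n <- 100000   # assumption: large sample so that the estimates are stable

x_unif <- runif(n, min = -1, max = 1)  # U(-1, 1): E = 0,   V = (1 - (-1))^2 / 12 = 1/3
x_exp  <- rexp(n, rate = 3)            # E(3):     E = 1/3, V = 1/9
x_norm <- rnorm(n, mean = 0, sd = 1)   # N(0, 1):  E = 0,   V = 1

moments <- data.frame(
  distribution = c("U(-1,1)", "E(3)", "N(0,1)"),
  mean_emp  = c(mean(x_unif), mean(x_exp), mean(x_norm)),
  mean_theo = c(0, 1 / 3, 0),
  var_emp   = c(var(x_unif), var(x_exp), var(x_norm)),
  var_theo  = c(1 / 3, 1 / 9, 1)
)
print(moments)
```

The empirical and theoretical columns should agree up to simulation error, which shrinks as n grows.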

Normal distribution

- Why 'Normal' distribution?

Example (Coin tossing)

- A fair coin is tossed n times.
- x1, x2, ..., xn denote the results (0 or 1).
- Calculate the standardized value

  $$z := 2\sqrt{n}\,(\bar{x}_n - 0.5) = \sqrt{n}\,\frac{\bar{x}_n - 0.5}{\sqrt{0.5 \cdot 0.5}}.$$

    n <- 100
    x <- sample(c(0, 1), n, replace = TRUE)
    z <- sqrt(n) * (mean(x) - 0.5) / sqrt(0.5 * 0.5)

- Repeat R = 20,000 times and plot the histogram of z1, ..., zR.

Figure: Histogram of z1, ..., zR (four panels: n=30, n=100, n=500, n=5000; x-axis: x, y-axis: frequency)

Normal distribution

Example (Rolling a die)

- A die is rolled n times.
- x1, x2, ..., xn denote the results.
- Calculate the standardized value

  $$z := \sqrt{n}\,\frac{\bar{x}_n - 3.5}{\sqrt{35/12}}.$$

    n <- 100
    x <- sample(1:6, n, replace = TRUE)
    z <- sqrt(n) * (mean(x) - 3.5) / sqrt(35 / 12)

- Repeat R = 20,000 times and plot the histogram of z1, ..., zR.

Figure: Histogram of z1, ..., zR (four panels: n=30, n=100, n=500, n=5000; x-axis: x, y-axis: frequency)

Normal distribution

Example (Uniform distribution)

- Suppose that X ∼ U(0, 1) and that x1, x2, ..., xn denote a sample of X.
- Calculate the standardized value

  $$z := \sqrt{n}\,\frac{\bar{x}_n - 1/2}{\sqrt{1/12}}.$$

    n <- 100
    x <- runif(n, 0, 1)
    z <- sqrt(n) * (mean(x) - 0.5) / sqrt(1 / 12)

- Repeat R = 20,000 times and plot the histogram of z1, ..., zR.

Figure: Histogram of z1, ..., zR (four panels: n=30, n=100, n=500, n=5000; x-axis: x, y-axis: frequency)

Normal distribution

- In each of the considered cases the standardized mean approximately had an N(0, 1)-distribution.
- ...we observed examples of the central limit theorem (CLT).
- The general result is as follows:

Theorem (CLT)

Suppose that (Xn)n∈N is an i.i.d. sequence of random variables with finite variance V(X1) = σ² > 0. Set µ := E(X1) and define Zn as

$$Z_n := \frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n}\,\sigma} = \sqrt{n}\,\frac{\overline{X}_n - \mu}{\sigma}$$

for every n ∈ N. Then $F_{Z_n}(x) \to \Phi(x)$ for n → ∞ and arbitrary x ∈ R (Φ denotes the distribution function of N(0, 1)).
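The statement of the theorem can be illustrated for a distribution that is far from symmetric. The sketch below uses E(3) (so µ = σ = 1/λ) and compares the empirical distribution function of Zn with Φ at a few points; the values of R, n, the seed and the evaluation grid are assumptions chosen for illustration only.

```r
# Minimal sketch: CLT check for the exponential distribution E(lambda).
set.seed(1)            # assumption: fixed seed for reproducibility
R <- 5000              # number of repetitions (illustrative)
n <- 200               # sample size per repetition (illustrative)
lambda <- 3
mu <- 1 / lambda       # E(X) for X ~ E(lambda)
sigma <- 1 / lambda    # standard deviation of X ~ E(lambda)

# One standardized mean per repetition.
z <- replicate(R, sqrt(n) * (mean(rexp(n, rate = lambda)) - mu) / sigma)

# Empirical distribution function of z versus Phi on a small grid.
grid <- c(-2, -1, 0, 1, 2)
F_emp <- sapply(grid, function(t) mean(z <= t))
round(cbind(x = grid, F_emp = F_emp, Phi = pnorm(grid)), 3)
```

The two columns F_emp and Phi should be close; the agreement typically improves as n is increased.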

Loops

Learning by doing - let's have a look at an example

    1  R <- 10000
    2  erg <- rep(0, R)
    3  n <- 500
    4  for (i in 1:R) {
    5    x <- runif(n, 0, 1)
    6    erg[i] <- sqrt(n) * (mean(x) - 0.5) / sqrt(1 / 12)
    7  }
    8  hist(erg, col = "lightblue", main = "", xlab = "x", ylab = "frequency",
            probability = TRUE, xlim = c(-4, 4), breaks = 35)

- @line 2: construct a vector named 'erg' of length R containing only zeros.
- @line 4: repeat the same procedure R times; save the result of the first run in the 1st coordinate of 'erg', the result of the second run in the 2nd coordinate, the result of the third run in the 3rd coordinate, and so on until run number R.
- @line 8: plot a histogram of the resulting values.

Solve exercises 01 - 03 in the R-script R-Codes-R-SVm01.R.
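For comparison, the same simulation can be written without an explicit loop. The sketch below is one possible alternative using `replicate`; the helper name `one.run` and the seed are assumptions and do not appear in the course script.

```r
# Minimal sketch: the loop from above expressed with replicate().
set.seed(1)   # assumption: fixed seed for reproducibility
R <- 10000
n <- 500

# Hypothetical helper: one standardized mean of a U(0, 1) sample of size n.
one.run <- function(n) {
  x <- runif(n, 0, 1)
  sqrt(n) * (mean(x) - 0.5) / sqrt(1 / 12)
}

# replicate() calls one.run(n) R times and collects the results in a vector.
erg <- replicate(R, one.run(n))

hist(erg, col = "lightblue", main = "", xlab = "x", ylab = "frequency",
     probability = TRUE, xlim = c(-4, 4), breaks = 35)
```

Here the pre-allocation of 'erg' and the index bookkeeping are handled by `replicate`; the explicit loop remains useful when each run depends on the previous one.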

If and ifelse statements

Learning by doing - let's have a look at two examples of if-statements

    1  x <- rnorm(1)
    2  if (x < 0) { print("Negative value") }

- @line 1: sample of size one from X ∼ N(0, 1).
- @line 2: if the value is negative, print 'Negative value'.

    n <- 1000
    x <- rnorm(n)
    z <- rep(0, n)
    for (i in 1:n) {
      if (x[i] >= 0) { z[i] <- 1 }
    }
    mean(z)

- What is the code doing?

If and ifelse statements

- Loops containing if-statements can be avoided by using 'ifelse' instead.
- The subsequent code is vectorized and typically faster than the previous snippet:

    n <- 1000
    x <- rnorm(n)
    z <- ifelse(x >= 0, 1, 0)
    mean(z)

Solve exercises 04 - 07 in the R-script R-Codes-R-SVm01.R.
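The speed claim can be examined directly with `system.time`. The following sketch checks that both variants give the same result and reports the elapsed times; the sample size, the seed and the variable names `z_loop`, `z_vec`, `t_loop`, `t_vec` are assumptions, and the measured times depend on the machine and the R version.

```r
# Minimal sketch: compare the loop/if variant with the ifelse variant.
set.seed(1)   # assumption: fixed seed for reproducibility
n <- 100000   # assumption: larger n than above so that timings are measurable
x <- rnorm(n)

# Variant 1: explicit loop with an if-statement.
t_loop <- system.time({
  z_loop <- rep(0, n)
  for (i in 1:n) {
    if (x[i] >= 0) { z_loop[i] <- 1 }
  }
})

# Variant 2: vectorized ifelse.
t_vec <- system.time(z_vec <- ifelse(x >= 0, 1, 0))

identical(z_loop, z_vec)   # both variants produce the same vector
c(loop = t_loop[["elapsed"]], ifelse = t_vec[["elapsed"]])
```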

R-functions

Learning by doing - let's have a look at the following simple function

    #@ functions:
    my.fun <- function(n) {
      x <- runif(n, -1, 1)   # sample of size n from U(-1, 1)
      a <- min(x)
      b <- max(x)
      res <- c(a, b)
      return(res)
    }

    my.fun(100)

- What does the function do?
- How can the function be applied?
- NB: A function takes some arguments/data as input, does some calculations and then returns a result as output.
- Any structure (vector, data.frame, list, etc.) can serve as input and as output.
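To illustrate that the output need not be a vector, the sketch below returns a list. The function name `sample.summary` and its components are hypothetical and chosen only for this illustration.

```r
# Minimal sketch: a function with a vector as input and a list as output.
# 'sample.summary' is a hypothetical helper, not part of the course script.
sample.summary <- function(x) {
  list(
    n     = length(x),          # sample size
    range = c(min(x), max(x)),  # smallest and largest value
    mean  = mean(x),
    sd    = sd(x)
  )
}

res <- sample.summary(runif(100, -1, 1))
res$range   # individual components are accessed with $
str(res)    # compact overview of the whole list
```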

R-functions

- Let's extend the previous function 'my.fun' in such a way that the user can also choose the parameters of the uniform distribution and that the function automatically produces a histogram:

    my.fun2 <- function(n, a = 0, b = 1) {
      x <- runif(n, min = a, max = b)   # sample of size n from U(a, b)
      hist(x, probability = TRUE, col = "lightblue")
      # return the sample minimum and maximum (without overwriting a and b)
      res <- c(min(x), max(x))
      return(res)
    }

    my.fun2(1000, a = 3, b = 5)

Solve exercises 08 - 09 in the R-script R-Codes-R-SVm01.R.

Exercise 10: A small simulation study concerning GPS bias

- Ten students of geoinformatics want to test GPS-based distance measurements.
- They (consecutively) record the GPS-coordinates of (the outer track of) the 100m starting line in a nearby athletics stadium, then (consecutively) walk along the outer track to the finishing line, and again record the GPS-coordinates.
- Each of them repeats this procedure 50 times.
- For each of the 500 pairs they calculate the distance in meters.
- Given the sample size of n = 500 they expect the mean distance to be pretty close to 100m (why?).
- All the bigger the surprise when the mean distance turns out to be roughly 102m.
- What went wrong - just bad luck?

Exercise 10: A small simulation study concerning GPS bias

Figure: Scatter plot (x-axis: x, from 0 to 150; y-axis: y, from −40 to 40)

Exercise 10: A small simulation study concerning GPS bias

- What went wrong - just bad luck?
- We answer the question by means of simulations.
- Assume that the starting point S and the end point Z have the following exact coordinates: S = (0, 0), Z = (100, 0).
- S′, Z′ will denote the measured coordinates; F = (X1, Y1) denotes the measurement error in S, G = (X2, Y2) the measurement error in Z.
- In other words

  $$S' = S + (X_1, Y_1) = (X_1, Y_1)$$

  $$Z' = Z + (X_2, Y_2) = (100 + X_2, Y_2)$$

- The measured distance d is therefore given by

  $$d = \sqrt{(100 + X_2 - X_1)^2 + (Y_2 - Y_1)^2}$$

- To simplify matters we assume that the errors follow a normal distribution, i.e. X1, X2, Y1, Y2 ∼ N(0, σ²).
- Consider the case σ = 15.
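The model above translates almost line by line into R. The following is only a starting sketch for the simulation: it additionally assumes that the four errors are independent (the text only states their marginal distribution), and the seed is an illustrative choice. The boxplot, the function and the interpretation are left to Exercise 10.

```r
# Minimal sketch of the GPS error model (assumption: independent errors).
set.seed(1)    # assumption: fixed seed for reproducibility
n <- 100000
sigma <- 15

# Measurement errors in S (X1, Y1) and in Z (X2, Y2).
X1 <- rnorm(n, mean = 0, sd = sigma)
Y1 <- rnorm(n, mean = 0, sd = sigma)
X2 <- rnorm(n, mean = 0, sd = sigma)
Y2 <- rnorm(n, mean = 0, sd = sigma)

# Measured distances between S' = (X1, Y1) and Z' = (100 + X2, Y2).
d <- sqrt((100 + X2 - X1)^2 + (Y2 - Y1)^2)

mean(d)   # compare with the true distance of 100m
```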

Exercise 10: A small simulation study concerning GPS bias

Exercise 10:

- Simulate n = 100,000 (or more) distance measurements.
- Calculate the corresponding distances d1, ..., dn.
- Calculate the mean distance $\bar{d}_n$ - is it greater or smaller than 100m?
- Produce a boxplot of the calculated distances.
- Analyze what happens if σ² is increased or reduced.
- Find a possible explanation of the observation made.
- Write a function with sample size n as input parameter which produces a boxplot of the distances and returns the mean $\bar{d}_n$.

Quick Reminder: Pearson correlation coefficient ρ

Figure: What is the correlation coefficient of the drawn sample? (scatter plot; x-axis: x, from −2 to 4; y-axis: y, from −5 to 10)

- The graphic depicts a sample (x1, y1), ..., (xn, yn).
- Give a rough estimate of the correlation coefficient ρ of the sample.
- How can ρ be calculated?
- Let sx (resp. sy) denote the standard deviation of the x-coordinates (resp. y-coordinates) of the sample, i.e.

  $$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}, \qquad s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y}_n)^2}$$

Quick Reminder: Pearson correlation coefficient ρ

Figure: What is the correlation coefficient of the drawn sample? (same scatter plot as before)

- Let sxy denote the (empirical) covariance of the sample, i.e.

  $$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)$$

- The (Pearson) correlation coefficient ρxy is defined as

  $$\rho_{xy} = \frac{s_{xy}}{s_x\, s_y}$$

  if sx, sy > 0.
- In our case we get ρxy = 0.97464.
- How can this value be interpreted?
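The definitions of sx, sy, sxy and ρxy can be followed step by step in R and compared with the built-in functions `sd` and `cor`. The sample below is simulated purely for illustration (it is not the sample shown in the figure), and the seed and sample size are assumptions.

```r
# Minimal sketch: Pearson correlation computed from its definition.
set.seed(1)   # assumption: fixed seed; simulated data, not the sample from the figure
n <- 50
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)

# Empirical standard deviations of the x- and y-coordinates.
sx <- sqrt(sum((x - mean(x))^2) / (n - 1))
sy <- sqrt(sum((y - mean(y))^2) / (n - 1))

# Empirical covariance.
sxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Pearson correlation coefficient.
rho <- sxy / (sx * sy)

c(by_hand = rho, built_in = cor(x, y))
```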

Quick Reminder: Pearson correlation coefficient ρ

Properties of ρ

- Whenever ρxy exists (i.e. whenever sx, sy > 0) we have −1 ≤ ρxy ≤ 1.
- We have ρxy = ρyx. As a consequence we will simply write ρ in the sequel.
- ρ = 1 if and only if (x1, y1), ..., (xn, yn) lie on a straight line with positive slope.
- ρ = −1 if and only if (x1, y1), ..., (xn, yn) lie on a straight line with negative slope.
- In case of ρ = 0 we call the sample (x1, y1), ..., (xn, yn) uncorrelated.
- ρ is not a general measure of dependence - it only measures linear dependence.
- ρ = 0 means that there is no linear dependence.
- If instead of (x1, y1), ..., (xn, yn) we consider (2x1, 3y1), ..., (2xn, 3yn), what happens to ρ?
- If instead of (x1, y1), ..., (xn, yn) we consider (−2x1, −3y1), ..., (−2xn, −3yn), what happens to ρ?

Quick Reminder: Pearson correlation coefficient ρ

- If instead of (x1, y1), ..., (xn, yn) we consider (−2x1, −3y1), ..., (−2xn, −3yn), what happens to ρ?

    file <- url("http://www.trutschnig.net/geo_reg1.RData")
    load(file)
    A <- geo_reg1
    head(geo_reg1)

    cor(A$x, A$y)
    cor(2 * A$x, 3 * A$y)
    cor(-2 * A$x, -3 * A$y)
    cor(-2 * A$x, 3 * A$y)

- ρ does not change when x and y are rescaled by factors of the same sign; if the factors have opposite signs, ρ changes its sign.
- ρ changes, however, under non-linear transformations:
- If instead of (x1, y1), ..., (xn, yn) we consider (x1³, y1³), ..., (xn³, yn³) then we get ρ = 0.9.
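The behaviour of ρ under rescaling and under a non-linear transformation can also be reproduced without downloading the data set. The sketch below uses simulated data as a stand-in (an assumption; the resulting numbers therefore differ from those obtained for the sample A).

```r
# Minimal sketch with simulated data (assumption: stand-in for the sample A).
set.seed(1)
x <- rnorm(200)
y <- x + rnorm(200, sd = 0.3)

rho <- cor(x, y)

c(original       = rho,
  same_sign_pos  = cor(2 * x, 3 * y),     # equal to rho
  same_sign_neg  = cor(-2 * x, -3 * y),   # equal to rho
  opposite_signs = cor(-2 * x, 3 * y),    # equal to -rho
  cubed          = cor(x^3, y^3))         # in general different from rho
```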

Spearman rank correlation ρS

- Assume we want to have a measure quantifying whether there is a monotonic relationship between the x- and the y-coordinates of a sample (x1, y1), ..., (xn, yn).
- 'Monotonic relationship' (or concordance) in the sense that if the x-coordinates increase then so do the y-coordinates (they grow or fall together).
- There is no need for the relationship to be linear.
- One natural idea is to work with ranks - best explained by some simple examples:

    x1 <- c(3, 1, 4, 15, 13)
    r1 <- rank(x1)
    x1
    # [1]  3  1  4 15 13
    r1
    # [1] 2 1 3 5 4

Spearman rank correlation ρS

    x1 <- c(3, 1, 3, 15, 13)
    r1 <- rank(x1)
    x1
    # [1]  3  1  3 15 13
    r1
    # [1] 2.5 1.0 2.5 5.0 4.0

- The values are sorted - the rank rk(xi) of observation xi is its position after sorting.
- In case of ties the averages of the ranks are used (the function 'rank' offers other choices via its argument ties.method).
- From (x1, y1), ..., (xn, yn) we get the sample ranks (rkx(x1), rky(y1)), ..., (rkx(xn), rky(yn)).
- rkx(xi) is the rank of observation xi among x1, ..., xn.
- rky(yi) is the rank of observation yi among y1, ..., yn.
- The Spearman rank correlation is defined as the Pearson correlation of these ranks.
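The treatment of ties can be made concrete with the vector from the example above. The sketch below shows the default (average ranks) next to two of the alternatives offered by the `ties.method` argument of `rank`; the variable names `r_avg`, `r_min` and `r_first` are illustrative.

```r
# Minimal sketch: different ways of ranking tied observations.
x1 <- c(3, 1, 3, 15, 13)

r_avg   <- rank(x1)                         # default: ties.method = "average"
r_min   <- rank(x1, ties.method = "min")    # tied values share the smallest rank
r_first <- rank(x1, ties.method = "first")  # ties are broken by order of appearance

r_avg
# [1] 2.5 1.0 2.5 5.0 4.0
r_min
# [1] 2 1 2 5 4
r_first
# [1] 2 1 3 5 4
```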

Spearman rank correlation ρS

Example

- Consider the following sample of size n = 5:

        x        y
     3.05    10.21
     1.38     2.19
     4.32    19.31
    15.51   241.08
     7.08    50.81

- Adding the ranks yields:

        x        y   rk.x   rk.y
     3.05    10.21      2      2
     1.38     2.19      1      1
     4.32    19.31      3      3
    15.51   241.08      5      5
     7.08    50.81      4      4

- What can be seen?
- We get ρS = 1:

    cor(rank(E$x), rank(E$y))
    cor(E$x, E$y, method = "spearman")
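The two calls above rely on a data frame E that holds the sample. A self-contained sketch that builds E from the tabulated values and reproduces the rank columns could look as follows (the construction of E is an assumption; the slides do not show how it was created).

```r
# Minimal sketch: build the sample from the table and compute rho_S in two ways.
E <- data.frame(
  x = c(3.05, 1.38, 4.32, 15.51, 7.08),
  y = c(10.21, 2.19, 19.31, 241.08, 50.81)
)

# Rank columns as in the second table.
E$rk.x <- rank(E$x)
E$rk.y <- rank(E$y)
print(E)

rho_S_ranks   <- cor(E$rk.x, E$rk.y)                 # Pearson correlation of the ranks
rho_S_builtin <- cor(E$x, E$y, method = "spearman")  # built-in Spearman correlation
rho_pearson   <- cor(E$x, E$y)                       # Pearson correlation of the raw values

c(rho_S_ranks = rho_S_ranks, rho_S_builtin = rho_S_builtin, rho = rho_pearson)
```

Both ways of computing ρS agree, whereas the Pearson correlation of the raw values is smaller than 1 because the relationship is monotonic but not linear.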

Spearman rank correlation ρS

Properties of ρS :

- Whenever ρS exists we have −1 ≤ ρS ≤ 1.
- ρS is symmetric too.
- ρS = 1 if and only if: for each pair (xi, yi), (xj, yj) we have xi ≤ xj if and only if yi ≤ yj.
- ρS = −1 if and only if: for each pair (xi, yi), (xj, yj) we have xi ≤ xj if and only if yi ≥ yj.
- ρS is not a general measure of dependence - it only measures monotonic dependence (aka concordance).
- ρS = 0 means that there is no monotonic dependence.
- If instead of (x1, y1), ..., (xn, yn) we consider (2x1, 3y1), ..., (2xn, 3yn), what happens to ρS?
- If instead of (x1, y1), ..., (xn, yn) we consider (−2x1, −3y1), ..., (−2xn, −3yn), what happens to ρS?
- If instead of (x1, y1), ..., (xn, yn) we consider (x1³, y1³), ..., (xn³, yn³), what happens to ρS?

Spearman rank correlation ρS

    file <- url("http://www.trutschnig.net/geo_reg1.RData")
    load(file)
    A <- geo_reg1
    head(geo_reg1)

    cor(A$x, A$y, method = "spearman")
    cor(2 * A$x, 3 * A$y, method = "spearman")
    cor(-2 * A$x, -3 * A$y, method = "spearman")
    cor(A$x^3, A$y^3, method = "spearman")

- For all four cases we get ρS = 0.9633945.
- Easy to verify: ρS is invariant under strictly monotonic transformations (both increasing or both decreasing).
- Let's add two outliers to A and see how ρ and ρS change.
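The invariance of ρS can be checked on simulated data as well. The sketch below uses a non-linear but increasing relationship as a stand-in for the sample A (an assumption) and applies several strictly monotonic transformations; the Pearson correlation is shown alongside for contrast.

```r
# Minimal sketch with simulated data (assumption: stand-in for the sample A).
set.seed(1)
x <- runif(200, 0.5, 3)
y <- x^2 + rnorm(200, sd = 0.5)

rho_S <- cor(x, y, method = "spearman")

spearman <- c(
  original     = rho_S,
  scaled       = cor(2 * x, 3 * y, method = "spearman"),
  both_negated = cor(-2 * x, -3 * y, method = "spearman"),
  cubed        = cor(x^3, y^3, method = "spearman"),
  log_and_exp  = cor(log(x), exp(y), method = "spearman")
)
spearman   # all entries coincide

# The Pearson correlation, in contrast, changes under the cubic transformation.
c(pearson = cor(x, y), pearson_cubed = cor(x^3, y^3))
```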

Spearman rank correlation ρS

Figure: Scatter plot (x-axis: x, from −2 to 10; y-axis: y, from −5 to 10)

    # Dazu: the two additional points
    Dazu <- data.frame(x = c(10, 10.3), y = c(2, 2.4))
    A1 <- rbind(A, Dazu)
    plot(A1)
    cor(A1$x, A1$y)
    cor(A1$x, A1$y, method = "spearman")

- Which is more influenced by the two new points?
- We get ρ = 0.8187617 (before ρ = 0.97464)
- Moreover ρS = 0.9349794 (before ρS = 0.9633945)
- ρ is less robust against outliers than ρS
- Rank-based quantities are generally robust
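The effect of the two added points can be reproduced with simulated data. In the sketch below, `A_sim` is a hypothetical stand-in for the sample A (its parameters and the seed are assumptions), while the two added points are the ones used above; the numbers therefore differ from those reported for A.

```r
# Minimal sketch: influence of two outliers on Pearson and Spearman correlation.
set.seed(1)   # assumption: simulated stand-in for the sample A
x <- rnorm(100, mean = 1, sd = 1.5)
y <- 2.5 * x + rnorm(100, sd = 0.8)
A_sim <- data.frame(x = x, y = y)

# The same two additional points as above.
extra  <- data.frame(x = c(10, 10.3), y = c(2, 2.4))
A1_sim <- rbind(A_sim, extra)

# Decrease of each coefficient caused by the two points.
drop_pearson  <- cor(A_sim$x, A_sim$y) - cor(A1_sim$x, A1_sim$y)
drop_spearman <- cor(A_sim$x, A_sim$y, method = "spearman") -
                 cor(A1_sim$x, A1_sim$y, method = "spearman")

c(drop_pearson = drop_pearson, drop_spearman = drop_spearman)
```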

Some examples and exercises

Figure: Two scatter plots. Left panel: x from −4 to 4, y from −3 to 3. Right panel: x and y from 0 to 1000.

Some examples and exercises

Figure: Two scatter plots. Left panel (title: rho=0.0477): x from −4 to 4, y from −3 to 3. Right panel (title: rho_S=0.0469): x and y from 0 to 1000.

Some examples and exercises

Figure: Two scatter plots. Left panel: x from −4 to 4, y from −4 to 4. Right panel: x and y from 0 to 1000.

Some examples and exercises

Figure: Two scatter plots. Left panel (title: rho=0.7266): x from −4 to 4, y from −4 to 4. Right panel (title: rho_S=0.7013): x and y from 0 to 1000.

Some examples and exercises

Figure: Two scatter plots. Left panel: x from −4 to 4, y from −20 to 20. Right panel: x and y from 0 to 1000.

Some examples and exercises

Figure: Two scatter plots. Left panel (title: rho=−0.9175): x from −4 to 4, y from −20 to 20. Right panel (title: rho_S=−0.9652): x and y from 0 to 1000.

Some examples and exercises

Figure: Two scatter plots. Left panel: x and y from 0 to 1. Right panel: x and y from 0 to 1000.

Some examples and exercises

Figure: Two scatter plots. Left panel (title: rho=0.0143): x and y from 0 to 1. Right panel (title: rho_S=0.0251): x and y from 0 to 1000.

Some examples and exercises

Figure: Two scatter plots. Left panel: x from −3 to 3, y from 0 to 8. Right panel: x and y from 0 to 1000.

Some examples and exercises

Figure: Two scatter plots. Left panel (title: rho=0.8576): x from −3 to 3, y from 0 to 8. Right panel (title: rho_S=0.9422): x and y from 0 to 1000.

Wolfgang Trutschnig

Statistics, Visualization and More Using ”R” (298.916)

Standard probability distributions Loops, if/ifelse, R-functions Correlation Regression in general Linear regression

Some examples and exercises

Solve Exercise 11 and Exercise 12 in the R-script R-Codes-R-SVm01.R.

Exercise 13: Can you find a sample (x1, y1), . . . , (xn, yn) for which the Pearson correlation ρ and the Spearman correlation ρS have different signs?

Hint: Running simulations is never a bad idea; simulate five x-coordinates and five y-coordinates from U(0, 1) and calculate ρ and ρS; repeat several times.
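The hint can be turned into a small search loop. The following is only a sketch of one possible approach; the function name `find_opposite_signs` and its arguments `n` and `max_tries` are illustrative choices and do not come from the course script.

```r
# Sketch: search for a small sample whose Pearson and Spearman
# correlations have opposite signs (all names are illustrative).
find_opposite_signs <- function(n = 5, max_tries = 10000) {
  for (i in seq_len(max_tries)) {
    x <- runif(n)                            # n x-coordinates from U(0, 1)
    y <- runif(n)                            # n y-coordinates from U(0, 1)
    rho   <- cor(x, y, method = "pearson")
    rho_s <- cor(x, y, method = "spearman")
    if (rho * rho_s < 0) {                   # the two signs differ
      return(list(x = x, y = y, rho = rho, rho_s = rho_s, tries = i))
    }
  }
  NULL                                       # nothing found within max_tries
}

set.seed(1)                                  # arbitrary seed, for reproducibility
res <- find_opposite_signs()
print(res)
```

Plotting the returned `x` and `y` usually shows why the two coefficients disagree: with only five points, a single far-off point can dominate ρ while barely changing the ranks.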

Solve Exercise 14 in the R-script R-Codes-R-SVm01.R.


What is regression all about - a general perspective

Known:

I We know that there is a relationship between quantities X and Y of the following form:

Y = r(X) + ε (1)

I r is an unknown function and ε is a random error fulfilling E(ε) = 0.

I Usually we also assume that ε is not influenced by X (this might be too restrictive a condition in various situations).

I We call X the predictor and Y the response.

Wanted:

I Based on observations (x1, y1), (x2, y2), . . . , (xn, yn) from (1) we want to determine/estimate the function r (why?).

I If we have a good estimator $\hat{r}$ of r, then we can predict Y for arbitrary values of X by considering $\hat{r}(X)$.


A real-life example

Example (Offer optimization in supermarkets)

I A supermarket chain wants to optimize their offers.

I If the price is only reduced by 5% then the sales numbers will only go up a bit.

I If the price is reduced by 50% then the sales numbers will go up a lot but the company might earn less because the margin is too small.

I Objective: Determine the optimal price reduction in the sense that the supermarket’s profit is maximal.

I X ... price reduction (absolute or percentage) of a certain product.

I Y ... net earnings (based on this product).

I Y = r(X) + ε.

I What do you think: Is a model solely based on price reduction as predictor good?

I Which other predictors would you choose?


Linear regression

Figure: scatterplot of the sample (x from −2 to 4, y from −5 to 10). Prediction at the point x = 1.5?

I The graphic depicts measurements (x1, y1), . . . , (xn, yn).

I It is known that the data comes from the following linear model

$Y = \underbrace{aX + b}_{r(X)} + \varepsilon.$

I In other words: yi = axi + b + εi for i ∈ {1, . . . , n}.

I εi ... samples of the random error ε fulfilling E(ε) = 0 that do not influence each other and are not influenced by xi.

I Wanted: Forecast the y-value at the point x = 1.5.


I How would you predict the value at the point x = 1.5?

I Problem: We do not know the parameters a and b.

I Choose a and b in such a way that the straight line y = ax + b fits the data in the best possible way.

I Denote the optimal values by $\hat{a}$ and $\hat{b}$.

I Given $\hat{a}$ and $\hat{b}$, predict $\hat{y} = \hat{a} \cdot 1.5 + \hat{b}$.

I Which of the following straight lines fits best?


Linear regression

Figure: scatterplot of the sample (x from −2 to 4, y from −5 to 10).

Linear regression

I Choose those values for a and b that minimize the prediction errors at the points in the sample.

I Choosing a and b as parameters, we would forecast axi + b for xi.

I The error ri we make is ri = yi − (axi + b) = yi − axi − b.

I The sum of all squared errors is given by

$F(a, b) := \sum_{i=1}^{n} (y_i - a x_i - b)^2$ (2)

I Choose a and b in such a way that F(a, b) is minimal.

I Analytic calculation yields the following optimal values

$\hat{a} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} = \frac{s_{xy}}{s_x^2}$ (3)

$\hat{b} = \bar{y}_n - \hat{a}\,\bar{x}_n.$ (4)

I For our given sample we get $\hat{a} = 2.010$ and $\hat{b} = 0.897$.

I The forecast at the point x = 1.5 therefore is $\hat{y} = 2.01 \cdot 1.5 + 0.897 = 3.912$.
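Equations (3) and (4) can be checked numerically against `lm()`. The sketch below uses simulated data, since the sample behind the slides is not reproduced here; the true values a = 2 and b = 1, the seed, and the helper name `sse` are assumptions made only for illustration.

```r
# Sketch: closed-form least-squares estimates versus lm() on simulated data.
set.seed(42)                        # arbitrary seed
n <- 50
x <- runif(n, -3, 4)
y <- 2 * x + 1 + rnorm(n)           # assumed true values a = 2, b = 1

# Equations (3) and (4)
a_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b_hat <- mean(y) - a_hat * mean(x)

# Sum of squared errors F(a, b) from equation (2); 'sse' is an illustrative name
sse <- function(a, b) sum((y - a * x - b)^2)

fit <- lm(y ~ x)
print(c(a_hat = a_hat, b_hat = b_hat))
print(coef(fit))                    # (Intercept) corresponds to b_hat, x to a_hat

# Moving away from (a_hat, b_hat) can only increase the sum of squares
print(sse(a_hat, b_hat) < sse(a_hat + 0.1, b_hat))
```

Note that `coef(fit)` lists the intercept first and the slope second, which is the reverse of the (a, b) order used on the slides.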


Linear regression

Figure: scatterplot of the sample illustrating the errors ri (x from −3 to 4, y from −5 to 10).

Linear regression

I Before fitting linear models in R, some additional observations:

I The estimated slope $\hat{a} = \frac{s_{xy}}{s_x^2}$ looks a bit like the Pearson correlation $\rho = \frac{s_{xy}}{s_x s_y}$.

I Using both expressions we get

$\hat{a} = \rho \, \frac{s_y}{s_x}$

I Increasing x by one standard deviation $s_x$ increases $\hat{y}$ by ρ standard deviations $s_y$; in fact

$\hat{r}(x + s_x) = \hat{a}(x + s_x) + \hat{b} = \underbrace{\hat{a}x + \hat{b}}_{\hat{y}} + \hat{a}s_x = \hat{y} + \rho \frac{s_y}{s_x} s_x = \hat{y} + \rho s_y.$

I How do we quantify whether our optimal model offers a good explanation of the data?
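The relation between the estimated slope and the Pearson correlation derived above can be verified on any sample. The following sketch uses simulated data; the model y = 3x + error, the seed and the evaluation point `x0` are arbitrary assumptions.

```r
# Sketch: slope = rho * s_y / s_x, checked on simulated data.
set.seed(7)                                   # arbitrary seed
x <- rnorm(200)
y <- 3 * x + rnorm(200, sd = 2)               # arbitrary linear model

rho <- cor(x, y)                              # Pearson correlation
a_from_rho <- rho * sd(y) / sd(x)             # slope via the correlation
a_from_cov <- cov(x, y) / var(x)              # slope via s_xy / s_x^2
fit <- lm(y ~ x)
a_from_lm <- unname(coef(fit)[2])             # slope reported by lm()
print(c(a_from_rho, a_from_cov, a_from_lm))

# Moving x by one standard deviation moves the forecast by rho * s_y
x0 <- 0.5                                     # arbitrary starting point
step <- predict(fit, newdata = data.frame(x = x0 + sd(x))) -
  predict(fit, newdata = data.frame(x = x0))
print(c(step = unname(step), rho_sy = rho * sd(y)))
```

`sd()`, `var()` and `cov()` all divide by n − 1, so the normalisation cancels in both ratios and the three slopes agree up to rounding.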


Linear regression

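I A natural idea is the coefficient of determination R^2.

I Easy to show:

$\sum_{i=1}^{n} (y_i - \bar{y}_n)^2 = \sum_{i=1}^{n} \underbrace{(y_i - \hat{y}_i)^2}_{r_i^2} + \sum_{i=1}^{n} (\hat{y}_i - \bar{y}_n)^2$

I The variance of y1, . . . , yn equals the variance of the residuals plus the variance of the forecasts $\hat{y}_1, \ldots, \hat{y}_n$.

I Calculate

$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y}_n)^2} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y}_n)^2}{\sum_{i=1}^{n} (y_i - \bar{y}_n)^2}$ (5)

I R^2 is the portion of the y-variance explained by the model.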
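The decomposition and both forms of equation (5) can be reproduced by hand. This is a sketch on simulated data (model parameters and seed are assumptions); the variable names `ss_total`, `ss_residual` and `ss_model` are illustrative.

```r
# Sketch: variance decomposition and R^2 computed by hand.
set.seed(3)                                  # arbitrary seed
n <- 100
x <- runif(n, -3, 4)
y <- 2 * x + 1 + rnorm(n)                    # assumed model, a = 2, b = 1
fit <- lm(y ~ x)
y_hat <- fitted(fit)                         # forecasts at the sample points

ss_total    <- sum((y - mean(y))^2)          # left-hand side of the decomposition
ss_residual <- sum((y - y_hat)^2)            # sum of squared residuals r_i^2
ss_model    <- sum((y_hat - mean(y))^2)      # variation of the forecasts

r2_a <- 1 - ss_residual / ss_total           # first form of equation (5)
r2_b <- ss_model / ss_total                  # second form of equation (5)
print(c(r2_a = r2_a, r2_b = r2_b, lm = summary(fit)$r.squared))
```

The value stored in `summary(fit)$r.squared` is the number printed as "Multiple R-squared" in the output of `summary()`.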


Linear regression

Figure: two scatterplots with axes x and y (x from −2 to 4). First panel: R^2 = 0.9499. Second panel: R^2 = 0.1044.

Linear regression

Properties of R^2:

I We have 0 ≤ R^2 ≤ 1.

I The higher R^2, the higher the percentage of variance explained by the model.

I If R^2 is close to 1 then the model explains the data very well.

I If R^2 is close to 0 the model does not help much to explain the data.

I There should be a strong interrelation between R^2 and the correlation ρ of the original sample (x1, y1), . . . , (xn, yn)...

I Calculations in R will make this clear.
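One such calculation, as a sketch: for a linear model with one predictor and an intercept, R^2 coincides with the squared Pearson correlation ρ^2. The simulated model, the seed and the two error standard deviations below are assumptions chosen for illustration.

```r
# Sketch: R^2 of the fitted line versus the squared Pearson correlation.
set.seed(11)                                  # arbitrary seed
n <- 100
x <- runif(n, -3, 4)
for (sigma in c(1, 4)) {                      # small and large error spread
  y <- 2 * x + 1 + rnorm(n, sd = sigma)
  r2 <- summary(lm(y ~ x))$r.squared
  cat(sprintf("sigma = %g: R^2 = %.4f, rho^2 = %.4f\n",
              sigma, r2, cor(x, y)^2))
}
```

In each line of the printed output the two numbers agree, and both drop when the error spread grows.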


Linear regression

    file <- url("http://www.trutschnig.net/geo_reg1.RData")
    load(file)
    head(geo_reg1)
    A <- geo_reg1

    model <- lm(data = A, y ~ x)   # use whatever name you want instead of model
    summary(model)

I yields

    Call:
    lm(formula = y ~ x, data = A)

    Residuals:
         Min       1Q   Median       3Q      Max
    -3.07477 -0.63681 -0.03544  0.70030  1.95308

I and


Linear regression

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.89704    0.11406   7.865 1.13e-11 ***
    x            2.00965    0.05035  39.917  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.03 on 84 degrees of freedom
    Multiple R-squared:  0.9499,    Adjusted R-squared:  0.9493
    F-statistic:  1593 on 1 and 84 DF,  p-value: < 2.2e-16

I Calculate the prediction for x = 1.5

    ND <- data.frame(x = c(1.5))        # new data must use the predictor name x
    p <- predict(model, newdata = ND)
    p
    # yields 3.91152


Exercises

Exercise 15:

I Load the dataset geo_reg1.RData (see R-Code, end of part 01 in linear regression).

I Produce a scatterplot of the data including the regression line.

I Add the values of the estimated parameters $\hat{a}$ and $\hat{b}$ to the title of the plot.

I Produce a boxplot of the residuals r1, . . . , rn.

I Calculate ρ and ρS of the data.

I Forecast r(x) for x ∈ {0, 0.1, 0.2, . . . , 0.9, 1}.


Exercises

Exercise 16:

I The dataset ’brainhead.txt’ (see R-Code) contains brain weight (grams) and head size (cm^3) for 237 adults.

I Fit a linear regression with ’weight’ as response and ’cm3’ as explanatory variable.

I Plot the data together with the regression line.

I Calculate the corresponding R^2.

I Calculate the ten biggest residuals (’biggest’ in the sense of absolute value). How many men and how many women are in the ’top ten’?
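The step "ten biggest residuals in absolute value" can be sketched as follows. Because brainhead.txt is not reproduced here, the data frame `B` below is a made-up stand-in: its numbers are invented, and the column name `sex` with codes "m" and "f" is hypothetical (only ’weight’ and ’cm3’ are named in the exercise).

```r
# Sketch with invented stand-in data; replace B by the real brainhead data.
set.seed(5)                                       # arbitrary seed
n <- 237
B <- data.frame(
  cm3 = runif(n, 2700, 4700),                     # invented head sizes
  sex = sample(c("m", "f"), n, replace = TRUE)    # hypothetical column
)
B$weight <- 300 + 0.27 * B$cm3 + rnorm(n, sd = 70)  # invented brain weights

model <- lm(weight ~ cm3, data = B)
res <- residuals(model)

# Indices of the ten residuals with the largest absolute value
top10 <- order(abs(res), decreasing = TRUE)[1:10]
print(cbind(B[top10, ], residual = res[top10]))
print(table(B$sex[top10]))                        # counts per group in the top ten
print(summary(model)$r.squared)                   # R^2 of the fit
```

With the real data, only the construction of `B` changes; the remaining lines stay the same once the actual column names are inserted.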


@Performance

Summary @univariate linear regression

I (x1, y1), . . . , (xn, yn) are observations from the model Y = aX + b + ε.

I Here ε is a random error fulfilling E(ε) = 0; set σ^2 = V(ε).

I In other words: yi = axi + b + εi for every i ∈ {1, . . . , n}.

I Using least squares we got the following estimators $\hat{a}$ of a and $\hat{b}$ of b

$\hat{a} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} = \frac{s_{xy}}{s_x^2}$ (6)

$\hat{b} = \bar{y}_n - \hat{a}\,\bar{x}_n.$ (7)

I We hope to get $\hat{a} \approx a$ and $\hat{b} \approx b$, i.e. we hope that the estimates are close to the true values.

I Will this always be the case?

I When can we expect to get good estimates?


@Performance

    # one simulation
    a <- 2; b <- 1
    n <- 100
    x <- runif(n, -3, 4)          # generate random x values
    error <- rnorm(n, 0, 1)       # error from normal distribution N(0, 1)
    y <- a * x + b + error
    A <- data.frame(x = x, y = y)
    plot(A)
    model <- lm(data = A, y ~ x)
    abline(model)
    summary(model)

I yields

    Call:
    lm(formula = y ~ x, data = A)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.20647 -0.66814 -0.09888  0.77627  1.95348


@Performance

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.76146    0.10149   7.503 2.87e-11 ***
    x            2.09902    0.04686  44.790  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.9631 on 98 degrees of freedom
    Multiple R-squared:  0.9534,    Adjusted R-squared:  0.9529
    F-statistic:  2006 on 1 and 98 DF,  p-value: < 2.2e-16

    sum(model$residuals^2) / (n - 2)

I yields

    [1] 0.9275251
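The quantity sum(residuals^2)/(n − 2) is the usual estimate of the error variance σ^2, and it equals the square of the "Residual standard error" reported by `summary()`. This is consistent with the output above, where 0.9631^2 ≈ 0.9276. The sketch below repeats the simulation with an arbitrary seed, so its numbers will differ from those on the slide.

```r
# Sketch: estimate of the error variance and its link to summary().
set.seed(2017)                            # arbitrary seed
a <- 2; b <- 1; n <- 100
x <- runif(n, -3, 4)
y <- a * x + b + rnorm(n, 0, 1)           # true error variance is 1
model <- lm(y ~ x)

sigma2_hat <- sum(residuals(model)^2) / (n - 2)   # n - 2: two parameters estimated
rse <- summary(model)$sigma                       # "Residual standard error"
print(c(sigma2_hat = sigma2_hat, rse_squared = rse^2))
print(df.residual(model))                         # degrees of freedom, n - 2
```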


@Performance

    # several runs
    R <- 1000
    E <- data.frame(a = rep(0, R), b = rep(0, R))

    a <- 2; b <- 1
    n <- 100
    for (i in 1:R) {
      x <- runif(n, -3, 4)        # generate random x values
      error <- rnorm(n, 0, 1)
      y <- a * x + b + error
      A <- data.frame(x = x, y = y)
      model <- lm(data = A, y ~ x)
      # coefficients() returns (intercept, slope); [2:1] stores them as (a, b)
      E[i, ] <- as.numeric(coefficients(model))[2:1]
    }

I yields

             a         b
    1 1.994841 0.8079434
    2 1.987354 1.0237531
    3 1.951075 0.9251133
    4 1.999110 1.0721703
    5 1.996653 0.8200005
    6 1.968700 1.0383586


@Performance

           a               b
     Min.   :1.858   Min.   :0.6841
     1st Qu.:1.966   1st Qu.:0.9280
     Median :2.001   Median :0.9994
     Mean   :2.002   Mean   :0.9989
     3rd Qu.:2.035   3rd Qu.:1.0722
     Max.   :2.162   Max.   :1.3031

I What does the table tell us?

I A graphical overview also helps to interpret the results.
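A table of this kind and the corresponding graphical overview can be produced with `summary()` and `plot()`. The sketch below reruns the simulation with an arbitrary seed, so its numbers will not match the table exactly; the dashed reference lines at the true values are an addition for illustration.

```r
# Sketch: summarise and plot the estimates from repeated simulations.
set.seed(123)                                  # arbitrary seed
R <- 1000; n <- 100
a <- 2; b <- 1
E <- data.frame(a = numeric(R), b = numeric(R))
for (i in seq_len(R)) {
  x <- runif(n, -3, 4)
  y <- a * x + b + rnorm(n, 0, 1)
  E[i, ] <- rev(unname(coef(lm(y ~ x))))       # coef() gives (intercept, slope)
}

print(head(E))                                 # first six pairs of estimates
print(summary(E))                              # quartiles, mean, min and max
plot(E, main = paste("sample size n =", n))    # scatterplot of the estimates
abline(v = a, h = b, lty = 2)                  # true values a = 2 and b = 1
```

The estimates scatter around the true values, and the spread visible in the plot corresponds to the distance between minimum and maximum in the table.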


@Performance

Figure: scatterplot of the estimates (a from 1.9 to 2.1 on the horizontal axis, b from 0.8 to 1.2 on the vertical axis), sample size n = 100.

Influence of the parameters at stake

Natural related questions:

I What happens if the sample size n is increased?

I The more info the better the estimates should (on average) be!

I What other parameter in the simulation could have an influence on the quality of the estimates?

I Answer: The variance σ^2 of ε is important.

I The higher the variance, the poorer the estimates.

I Repeat the simulation (several runs) for higher and lower sample size and vary the variance of the error.
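The repeated simulation can be wrapped in a function so that sample size and error spread become arguments. This is a sketch only: the function name `simulate_estimates`, the reduced number of runs R = 200 and the chosen settings are assumptions, and the error standard deviation `sigma` is passed to `rnorm()`, so σ^2 = 16 corresponds to `sigma = 4`.

```r
# Sketch: spread of the estimates for different n and error standard deviations.
simulate_estimates <- function(n, sigma, R = 200, a = 2, b = 1) {
  est <- t(replicate(R, {
    x <- runif(n, -3, 4)
    y <- a * x + b + rnorm(n, 0, sigma)       # sigma is a standard deviation
    rev(unname(coef(lm(y ~ x))))              # reorder to (slope, intercept)
  }))
  colnames(est) <- c("a", "b")
  as.data.frame(est)
}

set.seed(99)                                  # arbitrary seed
settings <- expand.grid(n = c(100, 1000), sigma = c(1, 4))
settings$sd_a <- NA
settings$sd_b <- NA
for (k in seq_len(nrow(settings))) {
  E <- simulate_estimates(settings$n[k], settings$sigma[k])
  settings$sd_a[k] <- sd(E$a)                 # spread of the slope estimates
  settings$sd_b[k] <- sd(E$b)                 # spread of the intercept estimates
}
print(settings)
```

In the printed table the spread shrinks when n grows and increases when the error standard deviation grows, which matches the two statements in the list above.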


Influence of the parameters at stake

Figure: scatterplot of the estimates (a from 1.9 to 2.1, b from 0.8 to 1.2), sample size n = 100.

Influence of the parameters at stake

Figure: scatterplot of the estimates (a from 1.96 to 2.08, b from 0.85 to 1.10), sample size n = 500.

Influence of the parameters at stake

Figure: scatterplot of the estimates (a from 1.950 to 2.050, b from 0.90 to 1.10), sample size n = 1000.

Influence of the parameters at stake

Figure: scatterplot of the estimates (a from 1.98 to 2.01, b from 0.96 to 1.02), sample size n = 10000.

Influence of the parameters at stake

Figure: scatterplot of the estimates (a from 1.9 to 2.1, b from 0.8 to 1.2), sample size n = 100.

Influence of the parameters at stake

Figure: scatterplot of the estimates (a from 1.8 to 2.4, b from 0.4 to 1.6), sample size n = 100, sigma^2 = 4.

Influence of the parameters at stake

Figure: scatterplot of the estimates (a from 1.6 to 2.4, b from 0 to 2), sample size n = 100, sigma^2 = 16.

Exercises

Exercise 17: Modify the last 30 lines of the R-Code R-Codes-R-SVm01.R to do the following:

I Simulate a sample of size n = 100 from the model Y = 0.5X − 1 + ε with ε ∼ N(0, 0.5).

I Produce a scatterplot of the data including the regression line; include the estimated parameters $\hat{a}$ and $\hat{b}$ in the title of the scatterplot.

I Produce a boxplot of the residuals r1, . . . , rn.

I Calculate ρ and ρS of the data.

I Forecast r(x) for x ∈ {0, 0.1, 0.2, . . . , 0.9, 1}.
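A possible outline of these steps is sketched below; it is not the referenced course script. Two assumptions are made and marked in the comments: N(0, 0.5) is read as mean 0 and variance 0.5 (so `rnorm()` receives `sd = sqrt(0.5)`), and the x-values are drawn from U(−3, 4) as in the earlier simulation, since the exercise does not fix them.

```r
# Sketch of the steps in Exercise 17 (assumptions marked below).
set.seed(17)                                     # arbitrary seed
n <- 100
x <- runif(n, -3, 4)                             # assumption: same x-range as before
error <- rnorm(n, mean = 0, sd = sqrt(0.5))      # assumption: 0.5 is the variance
y <- 0.5 * x - 1 + error
A <- data.frame(x = x, y = y)

model <- lm(y ~ x, data = A)
est <- coef(model)                               # named: "(Intercept)" and "x"

plot(A, main = sprintf("a = %.3f, b = %.3f", est["x"], est["(Intercept)"]))
abline(model)                                    # add the regression line
boxplot(residuals(model), main = "Residuals")

rho   <- cor(A$x, A$y)                           # Pearson correlation
rho_s <- cor(A$x, A$y, method = "spearman")      # Spearman correlation
print(c(rho = rho, rho_s = rho_s))

grid <- data.frame(x = seq(0, 1, by = 0.1))      # 0, 0.1, ..., 1
forecast <- predict(model, newdata = grid)
print(cbind(grid, forecast))
```

If 0.5 is instead meant as the standard deviation, only the `sd` argument of `rnorm()` changes.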

http://www.trutschnig.net/R-Codes-R-SVm01.R

Exercises

Exercise 18: Modify the last 30 lines of the R-Code R-Codes-R-SVm01.R to do the following:

I Simulate a sample of size n = 100 from the model Y = 0.5X − 1 + ε with ε ∼ N(0, 0.5).

I Save the estimated parameters $\hat{a}$ and $\hat{b}$ in a data.frame A.

I Repeat the previous two steps R = 1000 times.

I Produce a boxplot of the estimates $\hat{a}_1, \ldots, \hat{a}_R$ and a boxplot of the estimates $\hat{b}_1, \ldots, \hat{b}_R$.

I Calculate the biggest, the smallest and the median value of $\hat{a}_1, \ldots, \hat{a}_R$.

I Calculate the biggest, the smallest and the median value of $\hat{b}_1, \ldots, \hat{b}_R$.

I Repeat the previous steps for bigger sample size and/or for bigger variance of the errors.


Exercises

Exercise 19:

I In the literature and in bad courses one frequently reads the claim that regression only works if the errors are normally distributed.

I Consider U(−1, 1)-distributed errors using the command error=runif(n,-1,1) and repeat the tasks in Exercise 17 and Exercise 18 for this situation.

I Do we also get good results in this setting?
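A short simulation gives a first impression. The sketch below repeats the fit with U(−1, 1) errors; the number of runs R = 500, the seed, the x-range U(−3, 4) and the name `est_unif` are assumptions made for illustration.

```r
# Sketch: repeated least-squares fits with uniformly distributed errors.
set.seed(19)                                   # arbitrary seed
R <- 500; n <- 100
est_unif <- t(replicate(R, {
  x <- runif(n, -3, 4)                         # assumption: same x-range as before
  error <- runif(n, -1, 1)                     # U(-1, 1) errors, E(error) = 0
  y <- 0.5 * x - 1 + error
  rev(unname(coef(lm(y ~ x))))                 # reorder to (slope, intercept)
}))
colnames(est_unif) <- c("a", "b")

print(summary(est_unif))                       # centred near a = 0.5 and b = -1
boxplot(est_unif, main = "Estimates with U(-1, 1) errors")
```

The least-squares formulas never use the shape of the error distribution, only the data, which is why the estimates still centre on the true values here.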
