linear correlation and linear regression + summary of tests dr. omar al jadaan assistant professor...

Linear correlation and linear regression + summary of tests

Dr. Omar Al JadaanAssistant Professor – Computer Science &

Mathematics

Recall: Covariance

1

))((),(cov 1

n

YyXxyx

n

iii

cov(X,Y) > 0 X and Y are positively correlated

cov(X,Y) < 0 X and Y are inversely correlated

cov(X,Y) = 0 X and Y are independent

Interpreting Covariance

Correlation coefficient

Pearson’s Correlation Coefficient is standardized covariance (unitless):

yx

yxariancer

varvar

),(cov

Correlation Measures the relative strength of the linear

relationship between two variables Unit-less Ranges between –1 and 1 The closer to –1, the stronger the negative linear

relationship The closer to 1, the stronger the positive linear

relationship The closer to 0, the weaker any positive linear

relationship

Scatter Plots of Data with Various Correlation Coefficients

Y

X

Y

X

Y

X

Y

X

Y

X

r = -1 r = -.6 r = 0

r = +.3r = +1

Y

Xr = 0

Y

X

Y

X

Y

Y

X

X

Linear relationships Curvilinear relationships

Linear Correlation

Y

X

Y

X

Y

Y

X

X

Strong relationships Weak relationships

Linear Correlation

Linear Correlation

Y

X

Y

X

No relationship

Some calculation formulas…

yx

xy

n

ii

n

ii

n

iii

n

ii

n

ii

n

iii

SSSS

SS

yyxx

yyxx

n

yy

n

xx

n

yyxx

r

1

2

1

2

1

1

2

1

2

1

)()(

))((

1

)(

1

)(

1

))((

ˆ

yx

xy

SSSS

SSr ˆ

Note: Easier computation formulas:

22

22

ynySS

xnxSS

yxnyxSS

iy

ix

iixy

Sampling distribution of correlation coefficient:

*note, like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itselfsubstitute in estimated r

2

1)ˆ(

2

n

rrSE

The sample correlation coefficient follows a T-distribution with n-2 degrees of freedom (since you have to estimate the standard error).

What is “Linear”?

Remember this: Y=mX+B?

B

m

What’s Slope?

A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

Simple linear regression

The linear regression model:

Love of Math = 5 + .01*math SAT score

intercept

slope

P=.22; not significant

PredictionIf you know something about X, this knowledge helps you

predict something about Y. (Sound familiar?…sound like conditional probabilities?)

EXAMPLE The distribution of baby weights at

Stanford ~ N(3400, 360000)

Your “Best guess” at a random baby’s weight, given no information about the baby, is what?

3400 grams

But, what if you have relevant information? Can you make a better guess?

Predictor variable X=gestation time

Assume that babies that gestate for longer are born heavier, all other things being equal.

Pretend (at least for the purposes of this example) that this relationship is linear.

Example: suppose a one-week increase in gestation, on average, leads to a 100-gram increase in birth-weight

Y depends on X

Y=birth- weight

(g)

X=gestation time (weeks)

Best fit line is chosen such that the sum of the squared (why squared?) distances of the points (Yi’s) from the line is minimized:

Or mathematically… (remember max and mins from calculus)…

Derivative[(Yi-(mx+b))2]=0

Prediction

A new baby is born that had gestated for just 30 weeks. What’s your best guess at the birth-weight?

Are you still best off guessing 3400? NO!

Y=birth- weight

(g)


At 30 weeks…

3000

30

Y=birth weight

(g)


At 30 weeks…

(x,y)=

(30,3000)

3000

30

At 30 weeks…

The babies that gestate for 30 weeks appear to center around a weight of 3000 grams.

In Math-Speak… E(Y/X=30 weeks)=3000 grams

Note the conditional expectation

But…Note that not every Y-value (Yi) sits on the line. There’s variability.

Yi=3000 + random errori

In fact, babies that gestate for 30 weeks have birth-weights that center at 3000 grams, but vary around 3000 with some variance 2

Approximately what distribution do birth-weights follow? Normal. Y/X=30 weeks ~ N(3000, 2)

Y=birth- weight

(g)


And, if X=20, 30, or 40…

20 30 40

Y=baby weights

(g)

X=gestation times (weeks)

If X=20, 30, or 40…

20 30 40

Y/X=40 weeks ~ N(4000, 2)

Y/X=30 weeks ~ N(3000, 2)

Y/X=20 weeks ~ N(2000, 2)

Mean values fall on the line

E(Y/X=40 weeks)=4000 E(Y/X=30 weeks)=3000 E(Y/X=20 weeks)=2000

E(Y/X)= Y/X = 100 grams/week*X weeks

Linear Regression Model

Y’s are modeled…

Yi= 100*X + random errori

Follows a normal distribution

Fixed – exactly on the line

linear correlation and linear regression + summary of tests dr. omar al jadaan assistant professor...

Documents

weeks x

significant slide

birthweight slide

birth weight g x

birth weight g x

predictor variable x

points y i s

positive linear relationship