Introduction to Biostatistics and Bioinformatics
Regression and Correlation
Learning Objectives
Regression – estimation of the relationship between variables
• Linear regression
• Assessing the assumptions
• Non-linear regression
Correlation
• Correlation coefficient quantifies the association strength
• Sensitivity to the distribution
Relationships
[Figure panels: Relationship; No Relationship]
[Figure panels: Linear Relationship; Non-Linear Relationship]
[Figure panels: Linear, Strong; Linear, Weak]
Linear Regression
[Figure panels: Linear, Strong; Linear, Weak; Non-Linear]
Linear Regression - Residuals
[Figure panels: residuals for Linear, Strong; Linear, Weak; Non-Linear]
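Linear Regression Model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
Dependent variable: $Y_i$
Independent variable: $X_i$
Intercept: $\beta_0$
Slope: $\beta_1$
Linear component: $\beta_0 + \beta_1 X_i$
Random error component: $\varepsilon_i$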
Linear Regression Assumptions
The relationship between the variables is linear.
Errors are independent, normally distributed with mean zero and constant variance.
[Figure panels: residuals for Linear vs. Non-Linear data]
[Figure panels: residuals with Constant Variance vs. Variable Variance]
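The assumptions above can be checked numerically on the residuals. The following is a minimal sketch on synthetic data (the data, the seed, and the choice of the Shapiro-Wilk test and a half-split spread comparison are assumptions for illustration, not part of the lecture):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (assumed): a straight line plus constant-variance normal noise
x = np.linspace(-1, 1, 200)
y = x + 0.1 * rng.normal(size=x.size)

fit = stats.linregress(x, y)
residuals = y - (fit.slope * x + fit.intercept)

# Mean of the residuals: zero up to rounding for a least-squares fit with intercept
residual_mean = residuals.mean()

# Normality: Shapiro-Wilk test; a small p-value suggests non-normal residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Constant variance: compare the residual spread in the lower and upper half of x
sd_low = residuals[x < 0].std(ddof=1)
sd_high = residuals[x >= 0].std(ddof=1)
spread_ratio = sd_high / sd_low  # close to 1 when the variance is constant

print(f"mean={residual_mean:.2e}, Shapiro p={shapiro_p:.3f}, spread ratio={spread_ratio:.2f}")
```

A residual plot remains the most direct check; these numbers only summarize what the plot would show.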
Linear Regression – Estimating the Line
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
Estimated value: $\hat{Y}_i$
Estimated intercept: $\hat{\beta}_0$
Estimated slope: $\hat{\beta}_1$
Independent variable: $X_i$
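Least Squares Method
Find the slope and intercept, given measurements $(X_i, Y_i)$, $i = 1..N$, that minimize the sum of the squares of the residuals:
$$S = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$$
At the minimum, both partial derivatives vanish:
$$\frac{\partial S}{\partial \hat{\beta}_0} = 0, \qquad \frac{\partial S}{\partial \hat{\beta}_1} = 0$$
The first condition gives the intercept in terms of the slope:
$$\frac{\partial S}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \quad\Rightarrow\quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
The second condition, with this intercept substituted, gives the slope:
$$\begin{aligned}
\frac{\partial S}{\partial \hat{\beta}_1}
&= \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 \\
&= \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{N} \left(Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)\right)^2 \\
&= \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{N} \left(Y_i^2 - 2 Y_i (\hat{\beta}_0 + \hat{\beta}_1 X_i) + \hat{\beta}_0^2 + 2 \hat{\beta}_0 \hat{\beta}_1 X_i + \hat{\beta}_1^2 X_i^2\right) \\
&= \sum_{i=1}^{N} \left(-2 X_i Y_i + 2 \hat{\beta}_0 X_i + 2 \hat{\beta}_1 X_i^2\right) \\
&= \sum_{i=1}^{N} \left(-2 X_i Y_i + 2 (\bar{Y} - \hat{\beta}_1 \bar{X}) X_i + 2 \hat{\beta}_1 X_i^2\right) \\
&= -2 \sum_{i=1}^{N} X_i Y_i + \frac{2}{N} \sum_{i=1}^{N} X_i \sum_{i=1}^{N} Y_i - \frac{2 \hat{\beta}_1}{N} \left(\sum_{i=1}^{N} X_i\right)^2 + 2 \hat{\beta}_1 \sum_{i=1}^{N} X_i^2 = 0
\end{aligned}$$
Solving for the slope and intercept:
$$\hat{\beta}_1 = \frac{N \sum X_i Y_i - \sum X_i \sum Y_i}{N \sum X_i^2 - \left(\sum X_i\right)^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$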
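The closed-form least-squares solution can be checked against `scipy.stats.linregress`. This is a minimal sketch; the helper name `least_squares_line` and the synthetic data are assumptions made for illustration:

```python
import numpy as np
from scipy import stats

def least_squares_line(x, y):
    """Return (intercept, slope) from the closed-form least-squares solution."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # slope = (N*sum(XY) - sum(X)*sum(Y)) / (N*sum(X^2) - sum(X)^2)
    slope = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    # intercept = mean(Y) - slope * mean(X)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Synthetic data (assumed): true intercept 0.5, true slope 2.0, small noise
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
y = 0.5 + 2.0 * x + 0.1 * rng.normal(size=x.size)

b0, b1 = least_squares_line(x, y)
ref = stats.linregress(x, y)
print(f"closed form: intercept={b0:.4f}, slope={b1:.4f}")
print(f"linregress:  intercept={ref.intercept:.4f}, slope={ref.slope:.4f}")
```

Both approaches should agree to numerical precision, since `linregress` solves the same least-squares problem.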
Linear Regression in Python
import scipy.stats as stats
slope,intercept,r_value,p_value,std_err = stats.linregress(x,y)
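Linear Regression Example
[Figure panels: Linear, Strong; Residuals]
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

points = 50  # assumed sample size; the slide does not define `points`

# Strong linear relationship: small noise around y = x
x = np.linspace(-1, 1, points)
y = x + 0.1 * np.random.normal(size=points)
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
y_line = slope * x + intercept

# Scatter plot of the data with the fitted line
fig, ax1 = plt.subplots(1, figsize=(4, 4))
ax1.scatter(x, y, color='#4D0132', lw=0, s=60)
ax1.set_xlim([-1.5, 1.5])
ax1.set_ylim([-1.5, 1.5])
ax1.plot(x, y_line, color='red', lw=2)
fig.savefig('linear.png')

# Residuals: observed values minus fitted values
fig, ax1 = plt.subplots(1, figsize=(4, 4))
ax1.scatter(x, y - y_line, color='#963725', lw=0, s=60)
ax1.set_xlim([-1.5, 1.5])
ax1.set_ylim([-1.5, 1.5])
fig.savefig('linear-residuals.png')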
Linear Regression Example
[Figure panels: Linear, Weak; Residuals]
# Weak linear relationship: same setup as the strong example, noise scale 0.4 instead of 0.1
x = np.linspace(-1, 1, points)
y = x + 0.4 * np.random.normal(size=points)
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
y_line = slope * x + intercept

# Scatter plot of the data with the fitted line
fig, ax1 = plt.subplots(1, figsize=(4, 4))
ax1.scatter(x, y, color='#4D0132', lw=0, s=60)
ax1.set_xlim([-1.5, 1.5])
ax1.set_ylim([-1.5, 1.5])
ax1.plot(x, y_line, color='red', lw=2)
fig.savefig('linear-weak.png')

# Residuals: observed values minus fitted values
fig, ax1 = plt.subplots(1, figsize=(4, 4))
ax1.scatter(x, y - y_line, color='#963725', lw=0, s=60)
ax1.set_xlim([-1.5, 1.5])
ax1.set_ylim([-1.5, 1.5])
fig.savefig('linear-weak-residuals.png')
Linear Regression Example
[Figure: linear regression on data containing an outlier]
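The effect of a single outlier on the fitted line can be sketched as follows. The data and the position of the outlier are hypothetical and chosen only to make the effect visible:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Clean data (assumed): strong linear relationship around y = x
x = np.linspace(-1, 1, 30)
y = x + 0.1 * rng.normal(size=x.size)
clean = stats.linregress(x, y)

# Same data plus one hypothetical outlier far below the trend
x_out = np.append(x, 1.0)
y_out = np.append(y, -3.0)
with_outlier = stats.linregress(x_out, y_out)

print(f"without outlier: slope={clean.slope:.2f}, r={clean.rvalue:.2f}")
print(f"with outlier:    slope={with_outlier.slope:.2f}, r={with_outlier.rvalue:.2f}")
```

Because least squares penalizes squared residuals, one distant point can pull the line noticeably toward itself.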
Regression – Non-linear data
Solution 1: Transformation
Solution 2: Non-linear Regression
$$\hat{Y}_i = f(X_i, \hat{\beta}_0, \hat{\beta}_1, \ldots)$$
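Both solutions can be sketched on the same data. The exponential model, its parameters, and the noise level below are assumptions for illustration; the lecture does not specify a particular non-linear function:

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

rng = np.random.default_rng(3)

# Synthetic data (assumed): y = 2 * exp(1.5 * x) with multiplicative noise
x = np.linspace(0, 2, 40)
y = 2.0 * np.exp(1.5 * x) * np.exp(0.05 * rng.normal(size=x.size))

# Solution 1: transform Y so the relationship becomes linear, then fit a line
# log(y) = log(a) + b * x
log_fit = stats.linregress(x, np.log(y))
a_transform = np.exp(log_fit.intercept)
b_transform = log_fit.slope

# Solution 2: fit the non-linear function f(x, a, b) directly
def model(x, a, b):
    return a * np.exp(b * x)

(a_nonlinear, b_nonlinear), _ = curve_fit(model, x, y, p0=(1.0, 1.0))

print(f"transformation: a={a_transform:.2f}, b={b_transform:.2f}")
print(f"non-linear fit: a={a_nonlinear:.2f}, b={b_nonlinear:.2f}")
```

The two estimates differ slightly because the transformation changes how the errors are weighted: the log fit minimizes squared errors in log(y), the direct fit in y.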
Correlation Coefficient
• A measure of the correlation between the two variables
• Quantifies the association strength
Pearson correlation coefficient:
$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}}$$
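The Pearson formula translates directly into a few lines of NumPy. This sketch uses a hypothetical helper `pearson_r` and synthetic data, and compares the result with `scipy.stats.pearsonr`:

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of X
    dy = y - y.mean()  # deviations from the mean of Y
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Synthetic data (assumed): weak linear relationship
rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 50)
y = x + 0.4 * rng.normal(size=x.size)

r_manual = pearson_r(x, y)
r_scipy, p_value = stats.pearsonr(x, y)
print(f"manual r={r_manual:.4f}, scipy r={r_scipy:.4f}")
```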
[Figures: correlation coefficient examples. Source: Wikipedia]
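Coefficient of Variation
Sample: $x_1, x_2, \ldots, x_n$
Mean:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Variance:
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$
Coefficient of Variation (CV):
$$\mathrm{CV} = \frac{\sigma}{\mu}$$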
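A short sketch of the CV computation, assuming the variance divides by n (pass `ddof=1` to NumPy's `std` for the n-1 form instead). The helper name and the sample values are hypothetical:

```python
import numpy as np

def coefficient_of_variation(x):
    """CV = standard deviation / mean, with the variance dividing by n."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    variance = np.mean((x - mean) ** 2)  # divides by n (assumption)
    return np.sqrt(variance) / mean

# Hypothetical sample: mean 5, variance 4, standard deviation 2
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
cv = coefficient_of_variation(sample)
print(f"CV = {cv:.2f}")  # prints CV = 0.40
```

The CV is dimensionless, which makes it convenient for comparing the spread of measurements on different scales; it is only meaningful when the mean is well away from zero.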
Correlation Coefficient and CV
[Figure panels: Uniform distribution; Normal distribution; Lognormal distribution]
Correlation Coefficient - Outliers
[Figure: correlation on data containing an outlier]
Correlation Coefficient – Non-linear
Solutions:
• Transformation
• Rank correlation (Spearman, r=0.93)
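Both solutions can be tried on a monotonic but non-linear relationship. The data below are synthetic and assumed for illustration, so the values will not reproduce the r=0.93 quoted on the slide:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Synthetic data (assumed): exponential growth with multiplicative noise
x = np.linspace(0, 5, 50)
y = np.exp(x) * np.exp(0.3 * rng.normal(size=x.size))

# Pearson on the raw data underestimates the strength of the association
pearson_raw, _ = stats.pearsonr(x, y)

# Solution 1: transformation (log of y makes the relationship linear)
pearson_log, _ = stats.pearsonr(x, np.log(y))

# Solution 2: rank correlation (Spearman uses ranks, so only monotonicity matters)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson raw={pearson_raw:.2f}, Pearson log={pearson_log:.2f}, Spearman={spearman_rho:.2f}")
```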
Correlation Coefficient and p-value
Hypothesis: Is there a correlation?
[Figure panels: three scatter plots, each annotated with its r and p]
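The slides do not state which test produces the p-value. A common choice, assumed here, is the t-test of the null hypothesis of no correlation, with t = r·sqrt((N−2)/(1−r²)) and N−2 degrees of freedom. The sketch compares it with the p-value returned by `scipy.stats.pearsonr` on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Synthetic data (assumed): noisy linear relationship with few points
n = 20
x = np.linspace(-1, 1, n)
y = x + 1.0 * rng.normal(size=n)

r, p_scipy = stats.pearsonr(x, y)

# t statistic for H0: no correlation, with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value

print(f"r={r:.3f}, p (scipy)={p_scipy:.4f}, p (t-test)={p_manual:.4f}")
```

The p-value depends on both r and the number of points, so the same r can be significant for a large sample and not for a small one.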
Application: Analytical Measurements
[Figure: Measured Concentration vs. Theoretical Concentration]
A Few Characteristics of Analytical Measurements
Accuracy: Closeness of agreement between a test result and an accepted reference value.
Precision: Closeness of agreement between independent test results.
Robustness: Test precision given small, deliberate changes in test conditions (preanalytic delays, variations in storage temperature).
Lower limit of detection: The lowest amount of analyte that is statistically distinguishable from background or a negative control.
Limits of quantification: The lowest and highest concentrations of analyte that can be quantitatively determined with suitable precision and accuracy.
Linearity: The ability of the test to return values that are directly proportional to the concentration of the analyte in the sample.
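Some of these characteristics can be estimated from replicate measurements of known concentrations. The sketch below uses hypothetical calibration data and one possible summary for each of accuracy (relative bias), precision (CV of replicates), and linearity (regression of measured on theoretical concentration); actual acceptance criteria depend on the assay and are not given here:

```python
import numpy as np
from scipy import stats

# Hypothetical calibration data: 3 replicates at each theoretical concentration
theoretical = np.array([1.0, 2.0, 5.0, 10.0, 20.0])
measured = np.array([
    [1.1, 0.9, 1.0],
    [2.1, 2.0, 1.9],
    [5.2, 4.9, 5.0],
    [10.3, 9.8, 10.2],
    [19.5, 20.4, 20.1],
])

means = measured.mean(axis=1)

# Accuracy: relative bias of the mean result against the reference value
bias_percent = 100 * (means - theoretical) / theoretical

# Precision: coefficient of variation of the replicates at each level
cv_percent = 100 * measured.std(axis=1, ddof=1) / means

# Linearity: regression of mean measured value on theoretical concentration
fit = stats.linregress(theoretical, means)
r_squared = fit.rvalue ** 2

print("bias (%):", np.round(bias_percent, 2))
print("CV (%):  ", np.round(cv_percent, 2))
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}, R^2={r_squared:.4f}")
```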
Limit of Detection and Linearity
[Figure: Measured Concentration vs. Theoretical Concentration]
Precision and Accuracy
[Figure panels (two): Measured Concentration vs. Theoretical Concentration]
Summary - Regression
Source: http://xkcdsw.com/content/img/2274.png
Summary - Correlation
Next Lecture: Experimental Design & Analysis
Experimental Design by Christine Ambrosino, www.hawaii.edu/fishlab/Nearside.htm