TRANSCRIPT
7/28/16

Analysis of Dependent Variables: Correlation and Simple Regression
Zacariah Labby, PhD, DABR
Asst. Prof. (CHS), Dept. of Human Oncology, University of Wisconsin – Madison
Conflicts of Interest
None to disclose
Aug 1, 2016 Labby - AAPM 2016 2
Purpose
•Review basic statistics and identify appropriate use of statistics related to analyzing simple relationships between two variables:
•Correlation statistics
•Linear regression and model fitting
STATISTICS OF CORRELATION
Correlation: Review of Terminology
•Dependent vs. Independent Variables
•Standard plot: X is independent, Y is dependent
•Linear vs. Monotonic
•Linear: increase in X leads to a proportional increase in Y
•Monotonic: increase in X leads to some increase in Y
Correlation: Review of Terminology
•Variable Type
•Continuous (example: ionization chamber charge collected vs. dose delivered)
•Discrete (example: number of patients seen vs. calendar year)
•Ordinal (example: severity of normal tissue toxicity vs. prescription level)
•Categorical (example: RECIST response classification vs. radiologist observer)
Correlation: Metrics of Interest
•Four big categories of data:
•Continuous
•Discrete
•Ordinal
•Categorical
•Three major correlation metrics:
•Pearson’s r
•Spearman’s ρ
•Fleiss’ κ
Correlation: Pearson’s r
•“Linear” or “Product-Moment” correlation
•Applies only to continuous data
•Parametric correlation
•Tendency of the dependent variable to increase linearly with the independent variable
•Key Point: there is an assumed form to the relationship (linear, and therefore also monotonic)
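Pearson’s r is easy to sanity-check numerically. A minimal sketch (not from the talk; shown in Python with SciPy rather than the R used later in the deck, with arbitrary synthetic data):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 2.0 * x + rng.normal(0, 0.1, 50)  # linear relationship plus a little noise

r, p_value = pearsonr(x, y)
print(f"Pearson's r = {r:.3f}")  # close to 1 for a strong linear trend
```

Because the underlying trend is linear with small noise, r lands near 1; weakening the trend or adding curvature pulls r toward 0.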
Correlation: Pearson’s r
[Figure: six scatterplots of y vs. x with increasing scatter and curvature; Pearson’s r = 1.00, 0.97, and 0.76 for the top row and r = 0.96, 0.96, and 0.75 for the bottom row]
Correlation: Spearman’s ρ
•“Rank” correlation
•Applies to continuous, discrete, or ordinal data
•Non-parametric correlation
•Tendency of the dependent variable to increase with the independent variable
•Key Point: there is no assumed form to the relationship, only monotonicity
•Math: Pearson’s r of rank-transformed data
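The “Pearson’s r of rank-transformed data” definition can be verified directly. A minimal sketch (not from the talk; Python with SciPy, arbitrary synthetic data):

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, rankdata

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)
y = x**2 + rng.normal(0, 0.01, 30)  # monotonic but nonlinear

rho, _ = spearmanr(x, y)
# Spearman's rho is exactly Pearson's r applied to the ranks:
r_of_ranks, _ = pearsonr(rankdata(x), rankdata(y))
print(rho, r_of_ranks)  # the two values agree
```

Because y = x² is monotonic, ρ stays near 1 even though the relationship is not linear.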
Correlation: Spearman’s ρ
[Figure: scatterplot of a monotonic but nonlinear set of (X,Y) pairs. Each pair is rank-transformed: raw (0,0) becomes rank (1,1), raw (0.05,0.0025) becomes rank (2,2), and so on up to raw (1,1) becoming rank (20,20). Pearson’s r of the rank-transformed data: 1.00]
Correlation: Spearman’s ρ
[Figure: the six scatterplots again, now annotated with both metrics. Top row: r = 1.00 with ρ = 1.00; r = 0.97 with ρ = 0.97; r = 0.76 with ρ = 0.90. Bottom row: r = 0.96 with ρ = 1.00; r = 0.96 with ρ = 0.99; r = 0.75 with ρ = 0.89]
Correlation: Which Metric?
•Continuous variables; “When one goes up, does the other (reliably) go down?”
[Figure: relative change from baseline (−1.0 to 1.0) in Disease Volume and Lung Volume vs. months after baseline (0 to 7); Z.E. Labby et al, J Thorac Oncol 8 (2013)]
•Answer: Spearman’s ρ
Correlation: Fleiss’ κ
•Categorical correlation
•Applies only to categorical data (categorical data could be inherently ordinal)
•Non-parametric correlation
•How well do independent categories sort dependent categories?
•Math: number of dependent-independent pairs in agreement over the number expected by chance alone
Correlation: Fleiss’ κ
•Example: 5 radiologists contour tumors in 31 patients
•Response classification from baseline to post-chemo CT scans: Progressive Disease, Stable Disease, Partial Response, Complete Response

              Obs. 1  Obs. 2  Obs. 3  Obs. 4  Obs. 5
  Progression      6      11       7      11      14
  Stable          17      10      19      15       9
  Partial          7      10       5       4       8
  Complete         1       0       0       1       0

κ = 0.64
Landis and Koch, Biometrics 33, 159–174 (1977)
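Fleiss’ κ is computed from a subjects-by-categories table of counts, where each entry is the number of raters assigning that subject to that category (note this is per-subject data, not the per-observer totals tabulated above). A minimal pure-Python sketch with a toy table, not the study data:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a list of rows (one per subject), each row a
    list of per-category rater counts. Every row must sum to the same
    number of raters n."""
    N = len(table)        # number of subjects
    n = sum(table[0])     # raters per subject
    k = len(table[0])     # number of categories

    # Observed agreement: mean of the per-subject agreement P_i
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in table
    ) / N

    # Chance agreement: from the marginal category proportions p_j
    p = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement: 4 raters, every subject classified unanimously
print(fleiss_kappa([[4, 0], [0, 4], [4, 0]]))  # 1.0
```

With unanimous ratings κ is exactly 1; with agreement at exactly the chance level (e.g., rows [[3,1],[1,3]]) it is 0.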
Correlation vs. Agreement
•Quick tangent…
•Important question: do you already know that the two variables will be correlated?
•Example: tumor volumes as assessed by Physician vs. Algorithm
Correlation vs. Agreement
•Especially with implicit independent variables (i.e., the true value remains unknown), correlation isn’t as meaningful
•Correlation is only the strength of a relationship between two variables
•Agreement is the actual 1:1 accuracy
Bland and Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 327, 307 (1986).
Correlation vs. Agreement
[Figure: Bland–Altman plot: difference (Physician − Algorithm) vs. average of Physician and Algorithm, with horizontal lines at the mean difference and at the mean ± 2SD]
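The Bland–Altman quantities are simple to compute. A minimal sketch (not from the talk; the paired measurements below are hypothetical, stand-ins for physician vs. algorithm volumes):

```python
import numpy as np

# Hypothetical paired measurements (physician vs. algorithm volumes)
physician = np.array([10.2, 15.1, 8.7, 22.3, 13.0, 18.4])
algorithm = np.array([10.8, 14.6, 9.5, 21.1, 13.9, 19.0])

diff = physician - algorithm
avg = (physician + algorithm) / 2.0   # x-axis of the Bland-Altman plot

mean_diff = diff.mean()               # bias between the two methods
sd_diff = diff.std(ddof=1)
upper = mean_diff + 2 * sd_diff       # limits of agreement
lower = mean_diff - 2 * sd_diff
print(f"bias = {mean_diff:.2f}, limits = [{lower:.2f}, {upper:.2f}]")
```

Plotting `diff` against `avg` with horizontal lines at `mean_diff`, `upper`, and `lower` reproduces the plot described above.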
Correlation vs. Agreement
•Absolute agreement vs. Relative agreement
•Absolute: plot raw differences
•Relative: plot log differences, since ln(x/y) = ln(x) − ln(y)
•Get the mean and SD of the log-transformed data, then apply the exponential to get relative agreement bounds
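The relative-agreement recipe above can be sketched as follows (not from the talk; same hypothetical paired data as before):

```python
import numpy as np

physician = np.array([10.2, 15.1, 8.7, 22.3, 13.0, 18.4])
algorithm = np.array([10.8, 14.6, 9.5, 21.1, 13.9, 19.0])

# ln(x/y) = ln(x) - ln(y): work with log differences
log_diff = np.log(physician) - np.log(algorithm)
m, s = log_diff.mean(), log_diff.std(ddof=1)

# Exponentiate to turn additive log bounds into multiplicative
# (relative) agreement bounds
ratio_bias = np.exp(m)
ratio_lower, ratio_upper = np.exp(m - 2 * s), np.exp(m + 2 * s)
print(ratio_bias, ratio_lower, ratio_upper)
```

A `ratio_bias` near 1 means no systematic relative offset; the bounds say how far apart, as a ratio, a typical pair of measurements may fall.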
SIMPLE MODELING
Correlation vs. Agreement vs. Modeling
•Correlation: strength of the relationship
•Agreement: accuracy of the 1:1 match
•Modeling: quantifying the relationship
•Rules of Modeling:
1. Prefer a model with n−1 parameters to one with n
2. Prefer a model with k−1 independent variables to one with k
3. Prefer a linear model to a curved model
Crawley, Statistics: An Introduction using R, Wiley (2005)
Simple Linear Regression
•Linear regression is linear in the coefficients, not necessarily in the independent variable
•Linear: y = α + βx, y = α + βx²
•Not linear: y = α + e^(βx)
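The “linear in the coefficients” point means a model like y = α + βx² is still fit by ordinary least squares, just with x² as the regressor. A minimal sketch (not from the talk; Python with NumPy, synthetic data with known coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = 1.5 + 3.0 * x**2 + rng.normal(0, 0.05, 100)

# y = alpha + beta * x^2 is linear in (alpha, beta), so ordinary
# least squares applies after transforming the regressor to x^2.
X = np.column_stack([np.ones_like(x), x**2])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
print(alpha_hat, beta_hat)  # close to the true 1.5 and 3.0
```

The same trick does not work for y = α + e^(βx): no transformation of x makes that model linear in β, so it needs nonlinear fitting.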
Simple Linear Regression
•I was going to put the math here, but…
Simple Linear Regression
•Sources of variance in the data
•Your model: y = α + βx
•Reality: yᵢ = α + βxᵢ + εᵢ
•Variance in y can be explained by:
•Variance in x
•Residual uncertainty (called ε): random (residual) error from the fit for each xᵢ
Simple Linear Regression
•Sources of variance in the data:
•Explained sum of squared errors (ESS)
•Residual sum of squared errors (RSS)
•Total sum of squared errors (TSS)
•Σᵢ (f(xᵢ) − ȳ)² + Σᵢ (yᵢ − f(xᵢ))² = Σᵢ (yᵢ − ȳ)², i.e., ESS + RSS = TSS
•Coefficient of Determination: ESS/TSS, the proportion of total variation in y explained by the model
Simple Linear Regression
•Coefficient of Determination: R² = ESS/TSS
•Proportion of total variation in y explained by the model
•Has another name… R-squared!
•Pearson’s correlation coefficient: r = √R²
•Drive home: correlation quantifies the strength of the relationship, not the relationship itself
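The ESS + RSS = TSS decomposition and the R² = r² identity can both be checked numerically. A minimal sketch (not from the talk; Python with SciPy, arbitrary synthetic data):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 40)
y = 0.5 + 2.0 * x + rng.normal(0, 0.2, 40)

fit = linregress(x, y)
y_hat = fit.intercept + fit.slope * x   # fitted values f(x_i)

ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares

print(np.isclose(ess + rss, tss))            # the decomposition holds
print(np.isclose(ess / tss, fit.rvalue**2))  # R^2 equals r squared
```

Both identities are exact for ordinary least squares, up to floating-point rounding.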
Simple Linear Regression
•Making predictions
•From the analysis, derive best-fit values α̂, β̂, etc.
•Predict new values according to ŷ_new = α̂ + β̂·x_new
•However, models have uncertainty!
•Variance estimates can be provided for α̂, β̂, etc. (e.g., σ̂_α)
Simple Linear Regression
•Confidence Bands
•Variance associated with the mean predicted response: Var(α̂ + β̂·x_new)
•Prediction Bands
•Variance associated with a single new prediction: Var(α̂ + β̂·x_new) + σ̂_ε²
•Takes into account the residual errors in the linear model
•(In some ways, this is like the difference between standard deviation and standard error)
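The two bands can be computed from the standard OLS interval formulas. A minimal sketch (not from the talk; Python with SciPy, synthetic data, evaluated at one new point x0):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 30)
y = 0.5 + 2.0 * x + rng.normal(0, 0.2, 30)

n = len(x)
slope, intercept, *_ = stats.linregress(x, y)
resid = y - (intercept + slope * x)
s2 = np.sum(resid**2) / (n - 2)          # residual variance estimate
sxx = np.sum((x - x.mean()) ** 2)
t = stats.t.ppf(0.975, n - 2)            # 95% two-sided t quantile

x0 = 0.5
y0 = intercept + slope * x0
# Confidence band: variance of the *mean* response at x0
se_conf = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / sxx))
# Prediction band: adds the residual variance for a *single* new point
se_pred = np.sqrt(s2 * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))

print(y0 - t * se_conf, y0 + t * se_conf)  # narrower interval
print(y0 - t * se_pred, y0 + t * se_pred)  # wider interval
```

The prediction interval is always wider than the confidence interval at the same x0, since it carries the extra σ̂_ε² term.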
Simple Linear Regression
[Figure: scatterplot of data points with the fitted linear model, 95% confidence bands, and 95% prediction bands]
Example
•Task: a department administrator asks you to figure out the relationship between patient census and required RadTech hours
•Question 1: what kind of relationship would we expect?
•Probably linear, with some residual uncertainty
•Question 2: which correlation metric would you use?
•Pearson’s r
•Question 3: how would you quantify the relationship?
•Simple linear regression
Example
•Regression model: StaffingRequirements = α̂ + β̂·PatientWorkload
•α̂ = fixed staff overhead; β̂ = scalable coefficient
•Predict tomorrow’s RadTech staffing level if you know the patient workload
•Could staff at the upper 95% prediction band?
A plug for “R”…
•R is a free software package for data analysis and is very common in the statistics community
•Good text for learning R and basic stats: Statistics: An Introduction using R by Michael J. Crawley, published 2005 by John Wiley & Sons, Ltd
A plug for “R”…

# Fit the linear model, then compute 95% confidence and prediction bands
model <- lm(y ~ x)
conf.bands <- predict(model, interval = "confidence")
pred.bands <- predict(model, interval = "prediction")

[Figure: the scatterplot again, produced in R, with the fitted line and 95% confidence and prediction bands]
Use your Biostatisticians
•Many large centers have at least one biostatistician on staff
•In many centers, free consultations for:
•Experimental design
•Simple clinical trials
•Data analysis questions
•Prevent headaches and the costs of rework and rejected papers