lecture 2
TRANSCRIPT
Advanced statistical methods II
Learning objectives
• Implement an analytic strategy using mediation
• Implement an analytic strategy using path analysis
• Appreciate the role of structural equation modelling
Stata usage
Menu or command driven
• Use the menus to find out how to write a command, then save it as a program
• Use a program in a ‘do’ file for analysis and save it
• Replication
How to use a do-file
– Click “do-file editor” on the toolbar
– Type your command and click the “run” button
– Click “file” -> “save as” to save your do-file. You can choose to save it under your account (e.g. “stata01”)
– To see what files are under your account
• ls
– Use “do-file editor” -> “open” to open a saved do-file
Recap - Stata
Key commands (use class examples as templates)
• browse
– to look at your data
• tabu edup ov7 if sex==1
– to cross tab your data with a selection
• bysort sex: summarize bw
– to get means for different groups
• xi: regress bmi7z i.edup bw i.sex
– to do regression
• xi: logistic ov7 i.edup bw i.sex
– to do logistic regression
Interpreting Stata output (1)
• browse, tabu, summarize – self-explanatory
• regress shows you
– number of obs
– R-squared
– For each exposure
• Coef. May think of it as beta (β)
• Std. err.
• P>|t| - may think of it as p-value
• 95% CI
Interpreting Stata output (2)
• logistic shows you
– number of obs
– LR chi2(df)
– For each exposure
• Odds ratio
• Std. err.
• P>|z| - may think of it as p-value
• 95% CI
Stata – things to be aware of
Stata does not include in the analysis observations with missing values (.)
In regression (any sort) Stata by default uses the lowest category as the reference group for a categorical variable
There is a fantastic website with annotated Stata analyses
http://www.ats.ucla.edu/stat/stata
New statistics tools
• Mediation
– examining the ‘active’ ingredient
• Path analysis
– examining several ‘active’ ingredients
• Structural equation modeling
– examining several ‘active’ ingredients including unmeasured concepts
Mediation
• The ‘active ingredient’ of an exposure
• The mechanism by which an exposure works
• Increases biological plausibility of theory
• Reduces likelihood that the exposure-disease relationship is caused by a confounder
[Diagram: exposure E → disease D, with mediator M on the pathway E → M → D]
Direct effect: not through hypothesized mediator (E-D)
Indirect effect: through hypothesized mediator (E-M-D)
Mediation as causal explanation
• Crude OR for BMI and breast cancer = 2.0
• Adjusted for estradiol, aOR = 1.0
• Full mediation – all of the association goes through this pathway
• If adjusted was 1.5 – partial mediation (another pathway exists)
• Identifies more proximal cause of disease for potential intervention/prevention
[Diagram: BMI → Breast cancer, with Estradiol as mediator on the pathway BMI → Estradiol → Breast cancer]
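The full versus partial reading of the two odds ratios can be written out as a small check. This is only a sketch: the helper name `describe_mediation` and the "share of excess odds explained" summary, (crude − adjusted) / (crude − 1), are assumptions added for illustration and are not part of the lecture. The share is a descriptive heuristic, not a formal test of mediation.

```python
def describe_mediation(crude_or, adjusted_or, tol=1e-9):
    """Classify mediation from a crude and a mediator-adjusted odds ratio.

    Hypothetical helper for illustration only.
    Returns (label, share), where share is the fraction of the excess
    odds (crude_or - 1) that disappears after adjusting for the mediator.
    """
    if abs(crude_or - 1.0) < tol:
        raise ValueError("no crude association to explain")
    share = (crude_or - adjusted_or) / (crude_or - 1.0)
    if abs(adjusted_or - 1.0) < tol:
        label = "full mediation"          # adjusted OR is null
    elif share > tol:
        label = "partial mediation"       # association attenuated, not removed
    else:
        label = "no evidence of mediation"
    return label, share


# Lecture example: BMI and breast cancer, adjusting for estradiol
print(describe_mediation(2.0, 1.0))   # crude OR 2.0, adjusted OR 1.0
print(describe_mediation(2.0, 1.5))   # crude OR 2.0, adjusted OR 1.5
```

With the lecture's numbers, the first call falls in the full mediation branch and the second in the partial mediation branch, with half of the excess odds explained.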
Baron & Kenny 4-step approach
1. Exposure (BMI) should be associated with outcome (BrCa)
2. Exposure (BMI) should be associated with mediator (Estradiol)
3. Mediator (Estradiol) should be associated with outcome (BrCa)
4. Association of exposure (BMI) with outcome (BrCa) should be reduced by adjusting for mediator (Estradiol)
[Diagram: BMI → Breast cancer (steps 1 and 4: crude OR > adjusted OR); BMI → Estradiol (step 2: OR > 1?); Estradiol → Breast cancer (step 3: OR > 1?)]
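Sobel and Goodman tests
• Null hypothesis: indirect effect is 0
• Test statistic (normally distributed)
$$ z = \frac{\alpha \beta}{\sigma_{\alpha\beta}} $$
– $\sigma_{\alpha\beta}$ approximated by
$$ \sigma_{\alpha\beta} \approx \sqrt{\alpha^2 \sigma_\beta^2 + \beta^2 \sigma_\alpha^2} \quad \text{(Sobel)} $$
$$ \sigma_{\alpha\beta} \approx \sqrt{\alpha^2 \sigma_\beta^2 + \beta^2 \sigma_\alpha^2 - \sigma_\alpha^2 \sigma_\beta^2} \quad \text{(Goodman)} $$
[Diagram: exposure E → disease D, with mediator M; path E → M labelled α and path M → D labelled β]
Direct effect: not through hypothesized mediator (E-D)
Indirect effect: through hypothesized mediator (E-M-D)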
Example of mediation
Question: Does childhood growth mediate the association between infant growth and adolescent systolic blood pressure?
In the dataset you have
• Systolic blood pressure - bpsys
• Infant growth as change in weight z-score from birth to 3 months - w0to3mz
• Height z-score at ~7 years - height7z
Theoretical model
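[Path diagram: Infant growth → Systolic blood pressure (path c); Infant growth → Childhood growth (path a); Childhood growth → Systolic blood pressure (path b)]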
Read data
• use /home/asm2/s2/mediation, clear
See what kind of variables you have
• describe
Install “sgmediation” package
• findit sgmediation
Sobel-Goodman mediation tests
• sgmediation bpsys, mv(height7z) iv(w0to3mz)
http://www.ats.ucla.edu/stat/stata/faq/sgmediation.htm
. sgmediation bpsys, mv(height7z) iv(w0to3mz)

Model with dv regressed on iv (path c)

      Source |         SS     df          MS      Number of obs =    3159
                                                  F(1, 3157)    =    8.61
       Model | 1026.14489      1  1026.14489      Prob > F      =  0.0034
    Residual | 376186.198   3157  119.159391      R-squared     =  0.0027
                                                  Adj R-squared =  0.0024
       Total | 377212.343   3158  119.446594      Root MSE      =  10.916

       bpsys |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
     w0to3mz |  .6655623   .2268029    2.93   0.003    .2208664   1.110258
       _cons |  102.1715    .202861  503.65   0.000    101.7738   102.5693
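To see how the columns of the path c output relate to each other, the sketch below recomputes the main summary statistics from the sums of squares and the coefficient. The numbers are copied from the output. The variable names are mine, and the 1.96 critical value is an approximation (Stata uses the t distribution, which makes almost no difference with 3157 degrees of freedom).

```python
import math

# Values copied from the path c regression output (bpsys on w0to3mz)
coef, se = 0.6655623, 0.2268029
ss_model, ss_resid, ss_total = 1026.14489, 376186.198, 377212.343
df_model, df_resid = 1, 3157

t_stat = coef / se                                   # t = Coef. / Std. Err.
r_squared = ss_model / ss_total                      # share of variance explained
f_stat = (ss_model / df_model) / (ss_resid / df_resid)  # MS model / MS residual
root_mse = math.sqrt(ss_resid / df_resid)            # residual standard deviation

# Approximate 95% confidence interval using the normal critical value
ci_low, ci_high = coef - 1.96 * se, coef + 1.96 * se

print(f"t = {t_stat:.2f}, R-squared = {r_squared:.4f}")
print(f"F = {f_stat:.2f}, Root MSE = {root_mse:.3f}")
print(f"approx. 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The printed values should agree with the Stata output to the displayed precision, apart from small differences in the confidence limits caused by the normal approximation.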
Direct Association
[Path diagram: Infant growth → Systolic blood pressure, β = 0.67; Childhood growth also shown]
Sobel-Goodman mediation tests
Model with mediator regressed on iv (path a)

      Source |         SS     df          MS      Number of obs =    3159
                                                  F(1, 3157)    =   87.90
       Model | 72.2828983      1  72.2828983      Prob > F      =  0.0000
    Residual | 2596.17089   3157  .822353782      R-squared     =  0.0271
                                                  Adj R-squared =  0.0268
       Total | 2668.45379   3158    .8449822      Root MSE      =  .90684

    height7z |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
     w0to3mz |  .1766453   .0188414    9.38   0.000    .1397027    .213588
       _cons |    -.1604   .0168525   -9.52   0.000   -.1934429  -.1273571
Association of exposure with mediator
[Path diagram: Infant growth → Systolic blood pressure (path c), β = 0.67; Infant growth → Childhood growth (path a), β = 0.17]
Sobel-Goodman mediation tests
Model with dv regressed on mediator and iv (paths b and c')

      Source |         SS     df          MS      Number of obs =    3159
                                                  F(2, 3156)    =  165.83
       Model | 35872.0721      2   17936.036      Prob > F      =  0.0000
    Residual |  341340.27   3156  108.155979      R-squared     =  0.0951
                                                  Adj R-squared =  0.0945
       Total | 377212.343   3158  119.446594      Root MSE      =    10.4

       bpsys |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
    height7z |  3.663611   .2041073   17.95   0.000    3.263415   4.063808
     w0to3mz |  .0184025   .2190649    0.08   0.933   -.4111216   .4479266
       _cons |  102.7592   .1960212  524.23   0.000    102.3749   103.1435
Association of mediator with outcome
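[Path diagram: Infant growth → Systolic blood pressure (path c), initial β = 0.67, with mediator β = 0.02; Infant growth → Childhood growth (path a), β = 0.17; Childhood growth → Systolic blood pressure (path b), β = 3.66]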
Sobel-Goodman mediation tests
Sobel-Goodman Mediation Tests

                 Coef        Std Err      Z      P>|Z|
Sobel          .64715983   .07787652    8.31     0
Goodman-1      .64715983   .07797141    8.3      0
Goodman-2      .64715983   .07778151    8.32     0

Indirect effect = .64715983
Direct effect   = .01840251
Total effect    = .66556235

Proportion of total effect that is mediated: .97235043
Ratio of indirect to direct effect: 35.166928
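The test output can be checked by hand from the path a and path b regressions. The sketch below plugs the coefficients and standard errors from those outputs into the Sobel and Goodman formulas. The variable names are mine. Treating Goodman-1 as the variant that adds the product term σα²σβ² (and Goodman-2 as the one that subtracts it) is inferred from the output, where the Goodman-1 standard error is larger than Sobel's and the Goodman-2 standard error is smaller.

```python
import math

# Coefficients and standard errors copied from the regression output
a, se_a = 0.1766453, 0.0188414    # path a: w0to3mz -> height7z
b, se_b = 3.663611, 0.2041073     # path b: height7z -> bpsys, adjusted for w0to3mz
c_prime = 0.0184025               # direct effect: w0to3mz -> bpsys, adjusted for height7z

indirect = a * b                  # indirect effect is the product of paths a and b
total = indirect + c_prime        # total effect = indirect + direct

base = a**2 * se_b**2 + b**2 * se_a**2
se_sobel = math.sqrt(base)
se_goodman1 = math.sqrt(base + se_a**2 * se_b**2)   # adds the product term
se_goodman2 = math.sqrt(base - se_a**2 * se_b**2)   # subtracts the product term

for name, se in [("Sobel", se_sobel),
                 ("Goodman-1", se_goodman1),
                 ("Goodman-2", se_goodman2)]:
    print(f"{name:10s} se = {se:.5f}  z = {indirect / se:.2f}")

print(f"proportion mediated = {indirect / total:.4f}")
print(f"indirect / direct   = {indirect / c_prime:.2f}")
```

The results should match the sgmediation table up to rounding of the inputs. Note that the total effect recovered here as a × b + c' is the same as the path c coefficient from the first regression.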
Issues
• Other mediators may exist on the ‘direct’ pathway, just not the mediators we are evaluating
• Need to consider confounding when estimating direct or indirect effect
• Not all the conditions may be necessary
• May need other methods to estimate Sobel test statistics when the variables are not continuous (i.e. the models are not linear regressions)
[Diagram: BMI → Breast cancer, with mediator Estradiol and a confounder C]
Why path analysis?
[Diagram: two panels, “Multiple regression model” and “Path analysis”, each relating exposures E1, E2, E3 and E4 to outcome D]
Do not know how the exposures relate to each other
Any mediation here?
How to do path analysis
1. Draw out ‘a priori’ path diagram
2. Compute the coefficients for each path
– based on multiple regression
– Use Stata “pathreg”
3. Draw the final path diagram with only the paths with significant coefficients
Path analysis question
• How does late infant growth affect systolic blood pressure in adolescence, and does BMI z-score at 7 years play a role?
‘a priori’ path diagram
[Path diagram nodes: Weight z-score change from 3 to 9 months; Height z-score change from 3 to 9 months; BMI z-score at 7 years; Blood pressure at 11 years]
Note – no ‘causal’ loops
Path analysis
• findit pathreg
• corr (w3to9mz h3to9mz)
. corr (w3to9mz h3to9mz)
(obs=3159)

             |  w3to9mz  h3to9mz
     w3to9mz |   1.0000
     h3to9mz |   0.2782   1.0000
http://www.ats.ucla.edu/stat/stata/faq/pathreg.htm
Path analysis
• pathreg (bmi7z w3to9mz h3to9mz)(bpsys bmi7z w3to9mz h3to9mz)
. pathreg (bmi7z w3to9mz h3to9mz)(bpsys bmi7z w3to9mz h3to9mz)

       bmi7z |     Coef.   Std. Err.      t    P>|t|       Beta
     w3to9mz |  .1243758    .036085    3.45   0.001    .0636919
     h3to9mz |   .049892   .0345088    1.45   0.148    .0267163
       _cons |  .1752513   .0224686    7.80   0.000           .
n = 3159  R2 = 0.0057  sqrt(1 - R2) = 0.9971

       bpsys |     Coef.   Std. Err.      t    P>|t|       Beta
       bmi7z |  3.104725   .1542642   20.13   0.000    .3371197
     w3to9mz |  .0197162   .3133117    0.06   0.950    .0010963
     h3to9mz |  1.109793    .299163    3.71   0.000     .064528
       _cons |  101.9994   .1965874  518.85   0.000           .
n = 3159  R2 = 0.1198  sqrt(1 - R2) = 0.9382
Path analysis
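[Final path diagram, significant paths only: Weight z-score change from 3 to 9 months → BMI z-score at 7 years, 0.06; Height z-score change from 3 to 9 months → Blood pressure at 11 years, 0.06; BMI z-score at 7 years → Blood pressure at 11 years, 0.34; correlation between weight and height z-score changes, 0.28]
Standardized coefficients
That fraction of the standard deviation of the dependent variable for which the designated factor is directly responsible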
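Because the path coefficients are standardized, they can be combined along a route through the diagram. The sketch below uses the Beta column from the pathreg output to compute the residual path of each equation and the indirect effect of infant weight gain on blood pressure through BMI at 7 years. Multiplying coefficients along a path is the usual path-tracing rule, but it is not spelled out in the lecture, so treat it as an assumption here. The variable names are mine.

```python
import math

# Standardized (Beta) coefficients copied from the pathreg output
w_to_bmi = 0.0636919    # w3to9mz -> bmi7z
h_to_bmi = 0.0267163    # h3to9mz -> bmi7z (not significant, p = 0.148)
bmi_to_bp = 0.3371197   # bmi7z   -> bpsys
w_to_bp = 0.0010963     # w3to9mz -> bpsys (not significant, p = 0.950)
h_to_bp = 0.064528      # h3to9mz -> bpsys

# Residual (unexplained) path for each equation is sqrt(1 - R2)
resid_bmi = math.sqrt(1 - 0.0057)
resid_bp = math.sqrt(1 - 0.1198)

# Indirect effect of infant weight gain on blood pressure via BMI at 7 years
w_indirect = w_to_bmi * bmi_to_bp
w_total = w_to_bp + w_indirect

print(f"residual paths: bmi7z {resid_bmi:.4f}, bpsys {resid_bp:.4f}")
print(f"w3to9mz -> bpsys: direct {w_to_bp:.4f}, via bmi7z {w_indirect:.4f}, "
      f"total {w_total:.4f}")
```

On these numbers, nearly all of the (small) standardized association between infant weight gain and blood pressure runs through BMI at 7 years, which is consistent with the direct path being dropped from the final diagram.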
How do coefficients in path analysis relate to that in multiple regression?
• regress bpsys w3to9mz h3to9mz bmi7z, beta
. regress bpsys w3to9mz h3to9mz bmi7z, beta

      Source |         SS     df          MS      Number of obs =    3159
                                                  F(3, 3155)    =  143.19
       Model |  45205.083      3   15068.361      Prob > F      =  0.0000
    Residual |  332007.26   3155  105.232095      R-squared     =  0.1198
                                                  Adj R-squared =  0.1190
       Total | 377212.343   3158  119.446594      Root MSE      =  10.258

       bpsys |     Coef.   Std. Err.      t    P>|t|       Beta
     w3to9mz |  .0197162   .3133117    0.06   0.950    .0010963
     h3to9mz |  1.109793    .299163    3.71   0.000     .064528
       bmi7z |  3.104725   .1542642   20.13   0.000    .3371197
       _cons |  101.9994   .1965874  518.85   0.000           .
Same beta co-efficients as from the second step of the path analysis
What did path analysis do?
• 2 regressions
• Standardized the coefficients
• We manually calculated the correlation between the two exogenous variables (infant weight growth and infant height growth)
Pros & Cons of path analysis
• Path analysis may help distinguish plausibility of different hypotheses (could compare models)
• Path analysis may help present complicated relations
• Path diagram does NOT imply causal associations
– Cannot establish direction of causality
• Path analysis only works with continuous variables
• Path analysis only uses observed variables
Structural Equation Modeling
Combination of
• path analysis
• factor analysis
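[Diagram: a path model relating measured variables E1, E2, E3 and E4 to outcome D, shown next to a model with measured variables E0 to E12 (as labelled) and latent variables L1 and L2]
L represents some underlying or latent construct, e.g., growth potential, ability, etc.; also called hypothetical or unobserved
E is measured; also called observed or manifest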
Structural Equation Modeling
[Diagram: model with outcome D, measured variables E0 to E12 (as labelled) and latent variables L1 and L2]
Also called covariance structure analysis, covariance structure modeling, and analysis of covariance structures.
One-way arrows stand for regression weights
Two-way arrows stand for correlation among the predictors
Uses of SEM
• Confirm a model
• Compare models
• Create a new model
POST HOC ANALYSIS
– Needs cross-validation
based on some theoretical model
using external knowledge
Key steps in SEM
• Create theoretical model
• Common factor analysis to establish the number of latents
• Confirmatory factor analysis to confirm the measurement model. As a further refinement, factor loadings can be constrained to 0 for any measured variable's cross-loadings on other latent variables, so every measured variable loads only on its own latent.
• Test nested models to get the most parsimonious one. Alternatively, test other research studies' findings or theory by constraining parameters as they suggest should be the case. Consider tightening the alpha significance level from .05 to .01 for a more stringent test.
• Relate back to theory
Reminder on factor analysis
• ASM 1
• Groups measured variables according to the correlations between them
• Enables measured variables to be grouped into distinct latent factors representing the same concept
Test the model
• Absolute fit index, e.g.
– RMSEA: how well the model, with unknown but optimally chosen parameter estimates, would fit the population's covariance matrix. Magic numbers <0.07 to 0.10
• Incremental fit indices, e.g.
– CFI: this statistic assumes that all latent variables are uncorrelated (null/independence model) and compares the sample covariance matrix with this null model. Magic numbers >0.9
• Parsimony fit indices, e.g.
– AIC: can be used to compare models
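For orientation, the three kinds of index can be computed from quantities that SEM software reports. The sketch below uses commonly quoted textbook forms of RMSEA, CFI and AIC. The function names and all input numbers are hypothetical, chosen only for illustration, and packages differ in details (for example, some use N rather than N − 1 in RMSEA), so treat this as a sketch rather than what any particular program does.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation (common textbook form)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative fit index: fitted model versus the null/independence model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, 0.0)
    denom = max(d_model, d_null)
    return 1.0 if denom == 0 else 1.0 - d_model / denom


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values are preferred."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical fit results, not taken from the lecture data
print(f"RMSEA = {rmsea(chi2=85.0, df=40, n=3159):.4f}")
print(f"CFI   = {cfi(85.0, 40, 1500.0, 55):.4f}")
print(f"AIC model A = {aic(-1000.0, 10):.1f}, model B = {aic(-1005.0, 8):.1f}")
```

In this made-up example the RMSEA falls below the 0.07 cut-off and the CFI above 0.9, so the model would pass both rules of thumb, and model A would be preferred over model B on AIC.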
Pros and Cons of SEM
• SEM may help distinguish plausibility of different hypotheses (if you compare models)
• SEM may help present complicated relations
• SEM works best with normally distributed continuous variables
• Needs specialised software, e.g., LISREL, AMOS, Mplus
• Model construction can be quite subjective
• Does NOT imply causal associations
• GIGO (garbage in, garbage out)
Take home messages