lecture 2

Advanced statistical methods II

Learning objectives

• Implement an analytic strategy using mediation

• Implement an analytic strategy using path analysis

• Appreciate the role of structural equation modelling

Stata usage

Menu or command driven

• Use the menus to find out how to write a command, then save as a program

Use a program in a ‘do’ file for analysis and save it

• Replication

How to use a do-file

– Click “do-file editor” on the toolbar

– Type your command and click the “run” button

– Click “file” -> “save as” to save your do-file. You can choose to save it under your account (e.g. “stata01”)

– To see what files are under your account

• ls

– Use “do-file editor” -> “open” to open a saved do-file
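As a rough sketch, a saved do-file might look like the one below. The file names example.do and example.log are made up for illustration; the data path and variable names are the ones used in the mediation example later in this lecture.

```stata
* example.do (hypothetical file name)
* Close any log left open by an earlier run
capture log close
* Keep a plain-text record of the output
log using example.log, replace text
* Data set used later in this lecture
use /home/asm2/s2/mediation, clear
* List the variables, then basic descriptive statistics
describe
summarize bpsys w0to3mz height7z
log close
```

Typing `do example` in the command window would then rerun the whole analysis, which is what makes replication possible.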

Recap - Stata

Key commands (use class examples as templates)

• browse

– to look at your data

• tabu edup ov7 if sex==1

– to cross tab your data with a selection

• bysort sex:summarize bw

– to get means for different groups

• xi:regress bmi7z i.edup bw i.sex

– to do regression

• xi:logistic ov7 i.edup bw i.sex

– to do logistic regression

Interpreting stata output (1)

• browse, tabu, summarize – self-explanatory

• regress shows you

– number of obs

– R-squared

– For each exposure

• Coef. May think of it as beta (β)

• Std. err.

• P>|t| - may think of it as p-value

• 95% CI

Interpreting stata output (2)

• logistic shows you

– number of obs

– LR chi2(df)

– For each exposure

• Odds ratio

• Std. err.

• P>|z| - may think of it as p-value

• 95% CI
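The odds ratios reported by logistic are the exponentiated coefficients of the same model on the log-odds scale. A minimal sketch, reusing the class example variables from the recap:

```stata
* Coefficients on the log-odds scale
xi: logit ov7 i.edup bw i.sex
* Same model reported as odds ratios (the same numbers xi: logistic shows)
xi: logit ov7 i.edup bw i.sex, or
```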

Stata – things to be aware of

Stata drops from an analysis any observation with a missing value (.) in one of the variables used
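A quick way to see how many observations are affected, sketched with the class example variables from the recap (misstable assumes Stata 11 or later):

```stata
* How many values are missing in each variable?
misstable summarize bmi7z edup bw sex
* Number of observations with at least one of these missing
count if missing(bmi7z, edup, bw, sex)
* After fitting, e(N) holds the number of observations actually used
xi: regress bmi7z i.edup bw i.sex
display e(N)
```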

In regression (any sort) Stata by default uses the lowest-valued group as the reference group for a categorical variable
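The reference group can be changed if the lowest group is not the natural comparison. A sketch using the class example variables; that edup has a group coded 2 is an assumption made for illustration, and the ib2. notation needs Stata 11 or later.

```stata
* Default: the lowest value of edup is the reference group
regress bmi7z i.edup bw i.sex
* Use the group coded 2 as the reference instead (assumes edup takes the value 2)
regress bmi7z ib2.edup bw i.sex
* With the older xi: prefix, set the omit characteristic first
char edup[omit] 2
xi: regress bmi7z i.edup bw i.sex
```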

There is a fantastic website with annotated stata analysis

http://www.ats.ucla.edu/stat/stata

New statistics tools

• Mediation

– examining the ‘active’ ingredient

• Path analysis

– examining several ‘active’ ingredients

• Structural equation modeling

– examining several ‘active’ ingredients including unmeasured concepts

Mediation

• The ‘active ingredient’ of an exposure

• The mechanism by which an exposure works

• Increases biological plausibility of theory

• Reduces the likelihood that the exposure-disease relationship is caused by a confounder

[Diagram: exposure E -> disease D, and E -> mediator M -> D]

Direct effect: not through the hypothesized mediator (E-D)

Indirect effect: through the hypothesized mediator (E-M-D)

Mediation as causal explanation

• Crude OR for BMI and breast cancer = 2.0

• Adjusted for estradiol, aOR = 1.0

• Full mediation – all of the association goes through this pathway

• If adjusted was 1.5 – partial mediation (another pathway exists)

• Identifies more proximal cause of disease for potential intervention/prevention

[Diagram: BMI -> Breast cancer, and BMI -> Estradiol -> Breast cancer]

Baron & Kenny 4-step approach

1. Exposure (BMI) should be associated with outcome (BrCa)

2. Exposure (BMI) should be associated with mediator (Estradiol)

3. Mediator (Estradiol) should be associated with outcome (BrCa)

4. Association of exposure (BMI) with outcome (BrCa) should be reduced by adjusting for mediator (Estradiol)

[Diagram: BMI -> Estradiol (step 2, OR > 1?); Estradiol -> Breast cancer (step 3, OR > 1?); BMI -> Breast cancer (steps 1 and 4, crude OR > adjusted OR)]
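The four steps can be sketched as ordinary regressions. The example below uses the continuous variables from the worked example later in this lecture (bpsys, w0to3mz, height7z) rather than the BMI and breast cancer illustration, so the comparisons are of regression coefficients, not odds ratios.

```stata
* Step 1: exposure associated with outcome
regress bpsys w0to3mz
* Step 2: exposure associated with mediator
regress height7z w0to3mz
* Step 3: mediator associated with outcome (here adjusted for the exposure)
* Step 4: compare the w0to3mz coefficient in this model with the one from step 1
regress bpsys height7z w0to3mz
```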

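Sobel and Goodman tests

• Null hypothesis: indirect effect is 0

• Test statistic (normally distributed)

$$ z = \frac{\alpha\beta}{\sigma_{\alpha\beta}} $$

– $\sigma_{\alpha\beta}$ approximated by

$$ \sigma_{\alpha\beta} \approx \sqrt{\alpha^2\sigma_\beta^2 + \beta^2\sigma_\alpha^2} \quad \text{(Sobel)} $$

$$ \sigma_{\alpha\beta} \approx \sqrt{\alpha^2\sigma_\beta^2 + \beta^2\sigma_\alpha^2 - \sigma_\alpha^2\sigma_\beta^2} \quad \text{(Goodman)} $$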

[Diagram: E -> M labelled α; M -> D labelled β; E -> D is the direct effect (not through the hypothesized mediator); E -> M -> D is the indirect effect (through the hypothesized mediator)]

Example of mediation

Question: Does childhood growth mediate the association between infant growth and adolescent systolic blood pressure?

In the dataset you have

• Systolic blood pressure - bpsys

• Infant growth as change in weight z-score from birth to 3 months - w0to3mz

• Height z-score at ~7 years - height7z

Theoretical model

[Diagram: Infant growth -> Systolic blood pressure (path c); Infant growth -> Childhood growth (path a); Childhood growth -> Systolic blood pressure (path b)]

Read data

• use /home/asm2/s2/mediation,clear

See what variables you have

• describe

Install “sgmediation” package

• findit sgmediation

Sobel-Goodman mediation tests

• sgmediation bpsys, mv(height7z) iv(w0to3mz)

http://www.ats.ucla.edu/stat/stata/faq/sgmediation.htm

. sgmediation bpsys, mv(height7z) iv(w0to3mz)

Model with dv regressed on iv (path c)

      Source |       SS       df       MS              Number of obs =    3159
-------------+------------------------------           F(  1,  3157) =    8.61
       Model |  1026.14489     1  1026.14489           Prob > F      =  0.0034
    Residual |  376186.198  3157  119.159391           R-squared     =  0.0027
-------------+------------------------------           Adj R-squared =  0.0024
       Total |  377212.343  3158  119.446594           Root MSE      =  10.916

       bpsys |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     w0to3mz |   .6655623   .2268029     2.93   0.003     .2208664    1.110258
       _cons |   102.1715    .202861   503.65   0.000     101.7738    102.5693

Direct Association

[Diagram: Infant growth -> Systolic blood pressure (path c), β = 0.67; Childhood growth not yet connected]
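As a check on how the columns of the path c output fit together, the t statistic and confidence limits can be recomputed from the coefficient and its standard error. The numbers are copied from the output; the results should match it up to rounding (t of about 2.93, interval of about .22 to 1.11).

```stata
* t statistic = coefficient / standard error
display .6655623/.2268029
* 95% CI = coefficient -/+ critical t value (3157 residual df) x standard error
display .6655623 - invttail(3157, 0.025)*.2268029
display .6655623 + invttail(3157, 0.025)*.2268029
```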


Model with mediator regressed on iv (path a)

      Source |       SS       df       MS              Number of obs =    3159
-------------+------------------------------           F(  1,  3157) =   87.90
       Model |  72.2828983     1  72.2828983           Prob > F      =  0.0000
    Residual |  2596.17089  3157  .822353782           R-squared     =  0.0271
-------------+------------------------------           Adj R-squared =  0.0268
       Total |  2668.45379  3158    .8449822           Root MSE      =  .90684

    height7z |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     w0to3mz |   .1766453   .0188414     9.38   0.000     .1397027     .213588
       _cons |     -.1604   .0168525    -9.52   0.000    -.1934429   -.1273571

Association of exposure with mediator

[Diagram: Infant growth -> Systolic blood pressure (path c), β = 0.67; Infant growth -> Childhood growth (path a), β = 0.17]


Model with dv regressed on mediator and iv (paths b and c')

       bpsys |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    height7z |   3.663611   .2041073    17.95   0.000     3.263415    4.063808
     w0to3mz |   .0184025   .2190649     0.08   0.933    -.4111216    .4479266
       _cons |   102.7592   .1960212   524.23   0.000     102.3749    103.1435

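Association of mediator with outcome

[Diagram: Infant growth -> Systolic blood pressure (path c), initial β = 0.67, with mediator β = 0.02; Infant growth -> Childhood growth (path a), β = 0.17; Childhood growth -> Systolic blood pressure (path b), β = 3.66]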


Sobel-Goodman Mediation Tests

                   Coef     Std Err       Z    P>|Z|
Sobel         .64715983   .07787652    8.31        0
Goodman-1     .64715983   .07797141     8.3        0
Goodman-2     .64715983   .07778151    8.32        0

Indirect effect = .64715983
Direct effect   = .01840251
Total effect    = .66556235

Proportion of total effect that is mediated: .97235043
Ratio of indirect to direct effect: 35.166928
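The Sobel line can be reproduced by hand from the path a and path b regressions. This is only a check on the arithmetic, with the coefficients and standard errors copied from the output; the results should agree with the table up to rounding (indirect effect about 0.647, standard error about 0.078, z about 8.31, proportion mediated about 0.97).

```stata
* Path a (w0to3mz -> height7z) and path b (height7z -> bpsys), with standard errors
local a  = .1766453
local sa = .0188414
local b  = 3.663611
local sb = .2041073
* Indirect effect, Sobel standard error and z statistic
display "Indirect effect a*b = " `a'*`b'
display "Sobel SE            = " sqrt(`a'^2*`sb'^2 + `b'^2*`sa'^2)
display "Sobel z             = " `a'*`b'/sqrt(`a'^2*`sb'^2 + `b'^2*`sa'^2)
* Proportion mediated = indirect effect / total effect
display "Proportion mediated = " .64715983/.66556235
```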

Issues

• Other mediators may exist on the ‘direct’ pathway, just not the mediators we are evaluating

• Need to consider confounding when estimating the direct or indirect effect

• Not all of the Baron & Kenny conditions may be necessary

• May need other methods to estimate Sobel test statistics when the variables are not continuous

[Diagram: BMI -> Breast cancer and BMI -> Estradiol -> Breast cancer, with a confounder C]
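Two of these issues can be sketched in Stata. First, sgmediation accepts covariates through its cv() option; here sex is assumed to be available in the mediation data set as a potential confounder, which the slides do not confirm. Second, as an addition to the slides, a bootstrap of the product a x b is one commonly used alternative that does not rely on the normal approximation behind the Sobel test; the program name indirect_ab, the number of replications and the seed are all arbitrary choices.

```stata
* Adjust all three regressions for a confounder (assumes the variable sex exists)
sgmediation bpsys, mv(height7z) iv(w0to3mz) cv(sex)

* Bootstrap the indirect effect a*b
capture program drop indirect_ab
program define indirect_ab, rclass
    * path a: mediator regressed on exposure
    regress height7z w0to3mz
    local a = _b[w0to3mz]
    * path b: outcome regressed on mediator, adjusted for exposure
    regress bpsys height7z w0to3mz
    return scalar ab = `a'*_b[height7z]
end
bootstrap ab = r(ab), reps(1000) seed(12345): indirect_ab
* Percentile confidence interval for a*b
estat bootstrap, percentile
```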

Why path analysis?

[Figure, two panels: “Multiple regression model” and “Path analysis”. Each panel shows exposures E1, E2, E3 and E4 and outcome D.]

Do not know how the exposures relate to each other

Any mediation here?

How to do path analysis

1. Draw out ‘a priori’ path diagram

2. Compute the co-efficients for each path

– based on multiple regression

– Use Stata “pathreg”

3. Draw the final path diagram with only the paths with significant co-efficients

Path analysis question

• How does late infant growth affect systolic blood pressure in adolescence, and does BMI z-score at 7 years play a role?

‘a priori’ path diagram

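[Path diagram: weight z-score change from 3 to 9 months and height z-score change from 3 to 9 months each point to BMI z-score at 7 years and to blood pressure at 11 years; BMI z-score at 7 years also points to blood pressure at 11 years]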

Note – no ‘causal’ loops

Path analysis

• findit pathreg

• corr (w3to9mz h3to9mz)

. corr (w3to9mz h3to9mz)
(obs=3159)

             |  w3to9mz  h3to9mz
-------------+------------------
     w3to9mz |   1.0000
     h3to9mz |   0.2782   1.0000

http://www.ats.ucla.edu/stat/stata/faq/pathreg.htm

Path analysis

• pathreg (bmi7z w3to9mz h3to9mz)(bpsys bmi7z w3to9mz h3to9mz)

. pathreg (bmi7z w3to9mz h3to9mz)(bpsys bmi7z w3to9mz h3to9mz)

       bmi7z |      Coef.   Std. Err.      t    P>|t|        Beta
-------------+---------------------------------------------------
     w3to9mz |   .1243758    .036085     3.45   0.001    .0636919
     h3to9mz |    .049892   .0345088     1.45   0.148    .0267163
       _cons |   .1752513   .0224686     7.80   0.000           .
n = 3159  R2 = 0.0057  sqrt(1 - R2) = 0.9971

       bpsys |      Coef.   Std. Err.      t    P>|t|        Beta
-------------+---------------------------------------------------
       bmi7z |   3.104725   .1542642    20.13   0.000    .3371197
     w3to9mz |   .0197162   .3133117     0.06   0.950    .0010963
     h3to9mz |   1.109793    .299163     3.71   0.000     .064528
       _cons |   101.9994   .1965874   518.85   0.000           .
n = 3159  R2 = 0.1198  sqrt(1 - R2) = 0.9382

Path analysis

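[Path diagram with standardized co-efficients: weight z-score change from 3 to 9 months -> BMI z-score at 7 years, 0.06; height z-score change from 3 to 9 months -> blood pressure at 11 years, 0.06; BMI z-score at 7 years -> blood pressure at 11 years, 0.34; correlation between weight and height z-score change from 3 to 9 months, 0.28]

Standardized co-efficients: that fraction of the standard deviation of the dependent variable for which the designated factor is directly responsible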
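Under the usual path-tracing rule (an addition here, not stated on the slide), the standardized indirect effect along a chain of arrows is the product of the standardized co-efficients on that chain. Using the Beta values from the pathreg output for weight gain -> BMI -> blood pressure, and adding the (non-significant) direct Beta for weight gain -> blood pressure:

```latex
\begin{align*}
\text{indirect} &= 0.0637 \times 0.3371 \approx 0.0215 \\
\text{direct}   &= 0.0011 \\
\text{total}    &= 0.0011 + 0.0215 \approx 0.0226
\end{align*}
```

On this reading almost all of the small association between infant weight gain and blood pressure runs through BMI at 7 years.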

How do coefficients in path analysis relate to that in multiple regression?

• regress bpsys w3to9mz h3to9mz bmi7z, beta

. regress bpsys w3to9mz h3to9mz bmi7z,beta

       bpsys |      Coef.   Std. Err.      t    P>|t|        Beta
-------------+---------------------------------------------------
     w3to9mz |   .0197162   .3133117     0.06   0.950    .0010963
     h3to9mz |   1.109793    .299163     3.71   0.000     .064528
       bmi7z |   3.104725   .1542642    20.13   0.000    .3371197
       _cons |   101.9994   .1965874   518.85   0.000           .

Same beta co-efficients as from the second step of the path analysis

What did path analysis do?

• 2 regressions

• Standardized the coefficients

• We manually calculated the correlation between the two exogenous variables (infant weight growth and infant height growth)
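Put together, the same numbers can be obtained without pathreg. A sketch using only built-in commands and the variables from this example:

```stata
* Regression 1: BMI at 7 years on the two infant growth measures
regress bmi7z w3to9mz h3to9mz, beta
* Regression 2: blood pressure on BMI and the two infant growth measures
regress bpsys bmi7z w3to9mz h3to9mz, beta
* Correlation between the two exogenous variables
correlate w3to9mz h3to9mz
```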

Pros & Cons of path analysis

• Path analysis may help distinguish plausibility of different hypotheses (could compare models)

• Path analysis may help present complicated relations

• Path diagram does NOT imply causal associations

– Cannot establish direction of causality

• Path analysis only works with continuous variables

• Path analysis only uses observed variables

Structural Equation Modeling

Combination of

• path analysis

• factor analysis

[Figure: path diagram combining measured variables (labelled E, plus the outcome D) with latent variables (L1 and L2)]

L represents some underlying or latent construct, e.g., growth potential or ability; also called hypothetical or unobserved

E represents a measured variable; also called observed or manifest

Structural Equation Modeling

[Figure: the SEM path diagram again, with measured variables (E and D), latent variables (L1 and L2), and one-way and two-way arrows]

Also called covariance structure analysis, covariance structure modeling, and analysis of covariance structures.

One-way arrows stand for regression weights

Two-way arrows stand for correlation among the predictors

Uses of SEM

• Confirm a model

• Compare models

– both based on some theoretical model, using external knowledge

• Create a new model

– POST HOC ANALYSIS

– Needs cross-validation

Key steps in SEM

• Create theoretical model

• Common factor analysis to establish the number of latents

• Confirmatory factor analysis to confirm the measurement model. As a further refinement, factor loadings can be constrained to 0 for any measured variable's crossloadings on other latent variables, so every measured variable loads only on its latent.

• Test nested models to get the most parsimonious one. Alternatively, test other research studies' findings or theory by constraining parameters as they suggest should be the case. Consider tightening the alpha significance level from .05 to .01 for a more stringent test.

• Relate back to theory
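A sketch of the measurement and structural steps, assuming Stata 12 or later, which includes a built-in sem command. The variable names are hypothetical: e1 to e6 stand for measured variables, d for the outcome, and L1 and L2 for latent variables (sem treats names starting with a capital letter as latent by default).

```stata
* Measurement model only (confirmatory factor analysis):
* each measured variable loads on one latent, with no crossloadings
sem (L1 -> e1 e2 e3) (L2 -> e4 e5 e6)

* Full model: add the structural part, outcome d regressed on the two latents
sem (L1 -> e1 e2 e3) (L2 -> e4 e5 e6) (d <- L1 L2)
```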

Reminder on factor analysis

• ASM 1

• Groups measured variables according to the correlations between them

• Enables measured variables to be grouped into distinct latent factors representing the same concept

Test the model

• Absolute fit index, e.g.

– RMSEA: how well the model, with unknown but optimally chosen parameter estimates, would fit the population's covariance matrix. Magic numbers: below 0.07 to 0.10, depending on the source (lower is better)

• Incremental fit indices, e.g.

– CFI: this statistic assumes that all latent variables are uncorrelated (null/independence model) and compares the sample covariance matrix with this null model. Magic numbers: >0.9 (higher is better)

• Parsimony fit indices, e.g.

– AIC: can be used to compare models
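These indices can be requested after fitting a model with sem (again assuming Stata 12 or later). The sketch refits the path model from the earlier example without the non-significant path from w3to9mz to bpsys; with every path included the model is saturated and the fit indices carry no information.

```stata
* Path model with the w3to9mz -> bpsys path dropped (1 degree of freedom)
sem (bmi7z <- w3to9mz h3to9mz) (bpsys <- bmi7z h3to9mz)
* Chi-squared, RMSEA, AIC/BIC, CFI/TLI and related statistics
estat gof, stats(all)
```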

Pros and Cons of SEM

• SEM may help distinguish plausibility of different hypotheses (if you compare models)

• SEM may help present complicated relations

• SEM works best with normally distributed continuous variables

• Needs software with SEM support, e.g., LISREL, AMOS, Mplus (Stata 12 and later also has a built-in sem command)

• Model construction can be quite subjective

• Does NOT imply causal associations

• GIGO (garbage in, garbage out)

Take home messages
