Statistics for Medical Researchers Hongshik Ahn Professor Department of Applied Math and Statistics Stony Brook University Biostatistician, Stony Brook GCRC

Post on 12-Jan-2016

Statistics for Medical Researchers

Hongshik Ahn

Professor

Department of Applied Math and Statistics

Stony Brook University

Biostatistician, Stony Brook GCRC

2

Contents

1. Experimental Design
2. Descriptive Statistics and Distributions
3. Comparison of Means
4. Comparison of Proportions
5. Power Analysis/Sample Size Calculation
6. Correlation and Regression

3

1. Experimental Design

Experiment
- Treatment: something that researchers administer to experimental units
- Factor: controlled independent variable whose levels are set by the experimenter

Experimental design
- Control
- Treatment
- Placebo effect
- Blind: single blind, double blind, triple blind

4

1. Experimental Design

Randomization
- Completely randomized design
- Randomized block design: if there are specific differences among groups of subjects
- Permuted block randomization: used for small studies to maintain reasonably good balance among groups
- Stratified block randomization: matching

5

1. Experimental Design

Completely randomized design
- The computer generated sequence: 4,8,3,2,7,2,6,6,3,4,2,1,6,2,0,…
- Two groups (criterion: even-odd): AABABAAABAABAAA…
- Three groups (criterion: {1,2,3}~A, {4,5,6}~B, {7,8,9}~C; ignore 0’s): BCAACABBABAABA…
- Two groups, different randomization ratios (e.g., 2:3) (criterion: {0,1,2,3}~A, {4,5,6,7,8,9}~B): BBAABABBABAABAA…
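The three allocation rules on this slide can be reproduced with a small sketch. The function name `allocate` and the dictionary names are illustrative choices, not part of the original slides; the digit sequence and the criteria are taken from the slide.

```python
def allocate(digits, mapping):
    """Map each random digit to a group label; digits missing from the mapping are skipped."""
    return "".join(mapping[d] for d in digits if d in mapping)

# Computer generated sequence from the slide
digits = [4, 8, 3, 2, 7, 2, 6, 6, 3, 4, 2, 1, 6, 2, 0]

# Two groups, even digits -> A, odd digits -> B
even_odd = {d: ("A" if d % 2 == 0 else "B") for d in range(10)}
# Three groups, 0 is not in the mapping and is therefore ignored
three_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B", 7: "C", 8: "C", 9: "C"}
# Two groups with a 2:3 ratio: four digits go to A, six digits go to B
ratio_2_3 = {d: ("A" if d <= 3 else "B") for d in range(10)}

two = allocate(digits, even_odd)
three = allocate(digits, three_groups)
unequal = allocate(digits, ratio_2_3)
print(two, three, unequal)
```

In a real trial the digits would come from a random number generator rather than a fixed list.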

6

1. Experimental Design

Permuted block randomization
- With a block size of 4 for two groups (A, B), there are 6 possible permutations and they can be coded as: 1=AABB, 2=ABAB, 3=ABBA, 4=BAAB, 5=BABA, 6=BBAA
- Each number in the random number sequence in turn selects the next block, determining the next four participant allocations (ignoring numbers 0, 7, 8 and 9).
- e.g., the sequence 67126814… will produce BBAA AABB ABAB BBAA AABB BAAB.
- In practice, a block size of four is too small since researchers may crack the code and risk selection bias.
- Mixing block sizes of 4 and 6 is better, with the size kept unknown to the investigator.
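As a minimal sketch of the block-selection rule described above (the names `BLOCKS` and `permuted_blocks` are illustrative), each usable digit picks one of the six coded blocks and all other digits are skipped:

```python
# Block codes as listed on the slide
BLOCKS = {1: "AABB", 2: "ABAB", 3: "ABBA", 4: "BAAB", 5: "BABA", 6: "BBAA"}

def permuted_blocks(digits):
    """Return the blocks selected by the digits 1-6; digits 0, 7, 8, 9 are ignored."""
    return [BLOCKS[d] for d in digits if d in BLOCKS]

# Sequence 67126814 from the slide
schedule = permuted_blocks([6, 7, 1, 2, 6, 8, 1, 4])
print(" ".join(schedule))
```

Every block holds two A's and two B's, so the groups are balanced after each completed block.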

7

1. Experimental Design

Methods of Sampling
- Random sampling
- Systematic sampling
- Convenience sampling
- Stratified sampling

8

1. Experimental Design

Random Sampling
- Selection so that each individual member has an equal chance of being selected

Systematic Sampling
- Select some starting point and then select every k-th element in the population

9

1. Experimental Design

Convenience Sampling
- Use results that are easy to get

10

1. Experimental Design

Stratified Sampling
- Draw a sample from each stratum
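The random, systematic and stratified schemes from the sampling slides can be sketched as follows. The population of subject IDs, the two strata and all function names are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has the same chance of being selected."""
    return rng.sample(population, n)

def systematic_sample(population, k, start=0):
    """Start at index `start`, then take every k-th element."""
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of the same size from each stratum."""
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}

rng = random.Random(0)             # fixed seed so the sketch is repeatable
population = list(range(1, 101))   # hypothetical subject IDs 1..100

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, k=10, start=3)
strata = {"under_50": list(range(1, 61)), "50_and_over": list(range(61, 101))}  # hypothetical strata
stratified = stratified_sample(strata, 5, rng)
print(systematic)
```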

11

2. Descriptive Statistics & Distributions

- Parameter: population quantity
- Statistic: summary of the sample
- Inference for parameters: use sample
- Central Tendency
  - Mean (average)
  - Median (middle value)
- Variability
  - Variance: measure of variation
  - Standard deviation (sd): square root of variance
  - Standard error (se): sd of the estimate
  - Median, quartiles, min., max, range, boxplot
- Proportion
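The summary statistics listed above can be computed with Python's standard `statistics` module. The data values below are hypothetical, and the standard error shown is the standard error of the sample mean, sd/√n.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

n = len(data)
mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.variance(data)   # sample variance (divides by n - 1)
sd = statistics.stdev(data)            # square root of the sample variance
se = sd / math.sqrt(n)                 # standard error of the sample mean
value_range = max(data) - min(data)
quartiles = statistics.quantiles(data, n=4)   # three cut points; the rule is library specific
proportion_above_4 = sum(x > 4 for x in data) / n

print(mean, median, variance, sd, se, value_range, quartiles, proportion_above_4)
```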

12

2. Descriptive Statistics & Distributions

Normal distribution

13

2. Descriptive Statistics & Distributions

Standard normal distribution: Mean 0, variance 1

14

2. Descriptive Statistics & Distributions

- Z-test for means
- T-test for means if sd is unknown

15

3. Inference for Means

Two-sample t-test
- Two independent groups: control and treatment
- Continuous variables
- Assumption: populations are normally distributed
- Checking normality
  - Histogram
  - Normal probability plot (Q-Q plot): straight?
  - Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test
- If the normality assumption is violated
  - T-test is not appropriate.
  - Possible transformation
  - Use non-parametric alternative: Mann-Whitney U-test (Wilcoxon rank-sum test)

16

3. Inference for Means

A clinical trial on effectiveness of drug A in preventing premature birth
- 30 pregnant women are randomly assigned to control and treatment groups of size 15 each
- Primary endpoint: weight of the babies at birth

|      | Treatment | Control |
|------|-----------|---------|
| n    | 15        | 15      |
| mean | 7.08      | 6.26    |
| sd   | 0.90      | 0.96    |

17

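3. Inference for Means

Hypothesis: The group means are different
- Null hypothesis (H0): $\mu_1 = \mu_2$
- Alternative hypothesis (H1): $\mu_1 \ne \mu_2$
- Significance level: $\alpha = 0.05$
- Assumption: Equal variance
- Degrees of freedom (df): $n_1 + n_2 - 2$
- Calculate the T-value (test statistic):

$$T = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}}$$

- P-value: probability, when H0 is true, of a test statistic at least as extreme as the one observed. The significance level $\alpha$ is the Type I error rate (false positive rate).
- Reject H0 if p-value < $\alpha$
- Do not reject H0 if p-value > $\alpha$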

18

3. Inference for Means

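Previous example: Test at $\alpha = 0.05$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{14(.90)^2 + 14(.96)^2}{15 + 15 - 2} = 0.866$$

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}} = \frac{7.08 - 6.26}{\sqrt{0.866\,[(1/15) + (1/15)]}} = 2.413$$

P-value: 0.026 < 0.05

Reject the null hypothesis that there is no drug effect.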
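The pooled-variance arithmetic for this example can be checked with a short sketch. The function name `pooled_t` is illustrative. The critical value 2.048 is the tabulated $t_{0.025,28}$, hard-coded here because the Python standard library has no t distribution; a statistics package would normally supply the p-value.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t statistic computed from summary statistics."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df   # pooled variance
    t = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df, sp2

# Treatment vs. control summary statistics from the birth-weight example
t, df, sp2 = pooled_t(7.08, 0.90, 15, 6.26, 0.96, 15)

T_CRIT = 2.048   # tabulated t_{0.025, 28}, used for a two-sided test at alpha = 0.05
reject_h0 = abs(t) > T_CRIT
print(round(sp2, 3), round(t, 3), df, reject_h0)
```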

19

3. Inference for Means

Confidence interval (CI):
- An interval of values used to estimate the true value of a population parameter.
- Confidence level: the probability $1 - \alpha$, which is the proportion of times that the CI actually contains the population parameter, assuming that the estimation process is repeated a large number of times.
- Common choices: 90% CI ($\alpha$ = 10%), 95% CI ($\alpha$ = 5%), 99% CI ($\alpha$ = 1%)

3. Inference for Means

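CI for a comparison of two means:

$$(\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E$$

where

$$E = t_{\alpha/2,\; n_1 + n_2 - 2}\; s_p \sqrt{(1/n_1) + (1/n_2)}$$

A 95% CI for the previous example:

$$E = t_{.025,\,28}\; s_p \sqrt{(1/15) + (1/15)} = (2.048)\sqrt{.866\,[(1/15) + (1/15)]} = .70$$

$$(7.08 - 6.26) \pm .70 = (.12,\ 1.52)$$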
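A minimal sketch of the interval calculation follows, assuming the critical value is looked up in a t table and passed in by hand. The function name `pooled_ci` is illustrative.

```python
import math

def pooled_ci(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """CI for mu1 - mu2 under the equal-variance assumption."""
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    margin = t_crit * math.sqrt(sp2 * (1 / n1 + 1 / n2))   # E in the formula above
    diff = mean1 - mean2
    return diff - margin, diff + margin, margin

# t_crit = 2.048 is the tabulated t_{0.025, 28}
low, high, margin = pooled_ci(7.08, 0.90, 15, 6.26, 0.96, 15, t_crit=2.048)
print(round(margin, 2), round(low, 2), round(high, 2))
```

The interval excludes 0, which agrees with rejecting the null hypothesis at the 5% level.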

21

3. Inference for Means

SAS programming for Two-Sample T-test

Data steps:
1. Click ‘File’
2. Click ‘Import Data’
3. Select a data source
4. Click ‘Browse’ and find the path of the data file
5. Click ‘Next’
6. Fill the blank of ‘Member’ with the name of the SAS data set
7. Click ‘Finish’

Procedure steps:
1. Click ‘Solutions’
2. Click ‘Analysis’
3. Click ‘Analyst’
4. Click ‘File’
5. Click ‘Open By SAS Name’
6. Select the SAS data set and click ‘OK’
7. Click ‘Statistics’
8. Click ‘Hypothesis Tests’
9. Click ‘Two-Sample T-test for Means’
10. Select the independent variable as ‘Group’ and the dependent variable as ‘Dependent’
11. Choose the hypothesis of interest and click ‘OK’

22

3. Inference for Means

- Click ‘Statistics’ to select the statistical procedure.
- Click ‘File’ to open the SAS data set.
- Click ‘File’ to import data and create the SAS data set.
- Click ‘Solutions’ to create a project to run the statistical test.

23

3. Inference for Means

Mann-Whitney U-Test (Wilcoxon Rank-Sum Test)
- Nonparametric alternative to two-sample t-test
- The populations don’t need to be normal
- H0: The two samples come from populations with equal medians
- H1: The two samples come from populations with different medians

24

3. Inference for Means

Mann-Whitney U-Test Procedure
1. Temporarily combine the two samples into one big sample, then replace each sample value with its rank
2. Find the sum of the ranks for either one of the two samples
3. Calculate the value of the z test statistic

25

3. Inference for Means

Mann-Whitney U-Test, Example
- Numbers in parentheses are their ranks, beginning with a rank of 1 assigned to the lowest value of 17.7.
- R1 and R2: sum of ranks

26

3. Inference for Means

Hypothesis: The group medians are different
- H0: Men and women have same median BMI’s
- H1: Men and women have different median BMI’s

$$\mu_R = \frac{n_1(n_1 + n_2 + 1)}{2} = \frac{13(13 + 12 + 1)}{2} = 169$$

$$\sigma_R = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{(13)(12)(13 + 12 + 1)}{12}} = 18.385$$

$$z = \frac{R - \mu_R}{\sigma_R} = \frac{187 - 169}{18.385} = 0.98$$

- p-value > 0.05, thus we do not reject H0 at $\alpha$ = 0.05.
- There is no significant difference in BMI between men and women.
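The z statistic for this example can be reproduced with a short sketch. The function name `rank_sum_z` is illustrative, and the two-sided p-value from the normal approximation (with no tie correction) is an added step that is not printed on the slide.

```python
import math
from statistics import NormalDist

def rank_sum_z(r1, n1, n2):
    """Normal approximation for the rank sum R of the first sample (sizes n1 and n2)."""
    mu_r = n1 * (n1 + n2 + 1) / 2
    sigma_r = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (r1 - mu_r) / sigma_r
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return mu_r, sigma_r, z, p_value

# R = 187, n1 = 13, n2 = 12 as in the BMI example
mu_r, sigma_r, z, p_value = rank_sum_z(187, 13, 12)
print(mu_r, round(sigma_r, 3), round(z, 2), p_value > 0.05)
```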

27

3. Inference for Means

SAS Programming for Mann-Whitney U-Test Procedure

Data steps: The same as slide 21.

Procedure steps:
1. Click ‘Solutions’
2. Click ‘Analysis’
3. Click ‘Analyst’
4. Click ‘File’
5. Click ‘Open By SAS Name’
6. Select the SAS data set and click ‘OK’
7. Click ‘Statistics’
8. Click ‘ANOVA’
9. Click ‘Nonparametric One-Way ANOVA’
10. Select the ‘Dependent’ and ‘Independent’ variables respectively and choose the test of interest
11. Click ‘OK’

28

3. Inference for Means

- Click ‘Statistics’ to select the statistical procedure.
- Click ‘File’ to open the SAS data set.
- Select the dependent and independent variables.

29

3. Inference for Means

Paired t-test
- Mean difference of matched pairs
- Test for changes (e.g., before & after)
- The measures in each pair are correlated.
- Assumption: population is normally distributed
- Take the difference in each pair and perform one-sample t-test.
- Check normality
- If the normality assumption is violated
  - T-test is not appropriate.
  - Use non-parametric alternative: Wilcoxon signed rank test

30

3. Inference for Means

Notation for paired t-test
- $d$ = individual difference between the two values of a single matched pair
- $\mu_d$ = mean value of the differences $d$ for the population of paired data
- $\bar{d}$ = mean value of the differences $d$ for the paired sample data
- $s_d$ = standard deviation of the differences $d$ for the paired sample data
- $n$ = number of pairs

31

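3. Inference for Means

Example: Systolic Blood Pressure

OC: Oral contraceptive

| ID | Without OC’s | With OC’s | Difference |
|----|--------------|-----------|------------|
| 1  | 115          | 128       | 13         |
| 2  | 112          | 115       | 3          |
| 3  | 107          | 106       | -1         |
| 4  | 119          | 128       | 9          |
| 5  | 115          | 122       | 7          |
| 6  | 138          | 145       | 7          |
| 7  | 126          | 132       | 6          |
| 8  | 105          | 109       | 4          |
| 9  | 104          | 102       | -2         |
| 10 | 115          | 117       | 2          |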

32

3. Inference for Means

Hypothesis: The group means are different
- H0: $\mu_d = 0$ vs. H1: $\mu_d \ne 0$
- Significance level: $\alpha$ = 0.05
- Degrees of freedom (df): $n - 1 = 9$
- Test statistic:

$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} = \frac{4.8}{4.57 / \sqrt{10}} = 3.32$$

- P-value: 0.009, thus reject H0 at $\alpha$ = 0.05
- The data support the claim that oral contraceptives affect the systolic bp.

33

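3. Inference for Means

Confidence interval for matched pairs

100(1-$\alpha$)% CI:

$$\left( \bar{d} - t_{\alpha/2,\,n-1} \frac{s_d}{\sqrt{n}},\ \ \bar{d} + t_{\alpha/2,\,n-1} \frac{s_d}{\sqrt{n}} \right)$$

95% CI for the mean difference of the systolic bp:

$$\bar{d} \pm t_{0.025,\,9} \frac{s_d}{\sqrt{10}} = 4.8 \pm 2.26 \cdot \frac{4.57}{\sqrt{10}} = 4.8 \pm 3.27$$

(1.53, 8.07)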
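The paired analysis of the systolic blood pressure data can be reproduced with the sketch below. Variable names are illustrative, and the critical value 2.262 is the tabulated $t_{0.025,9}$ entered by hand, since the Python standard library has no t distribution.

```python
import math
import statistics

# Systolic blood pressure for the 10 subjects in the example
without_oc = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
with_oc = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]

diffs = [w - wo for w, wo in zip(with_oc, without_oc)]
n = len(diffs)
d_bar = statistics.mean(diffs)    # mean of the differences
s_d = statistics.stdev(diffs)     # sample sd of the differences
se = s_d / math.sqrt(n)
t = d_bar / se                    # one-sample t statistic under H0: mu_d = 0

T_CRIT = 2.262                    # tabulated t_{0.025, 9}
ci = (d_bar - T_CRIT * se, d_bar + T_CRIT * se)
print(d_bar, round(s_d, 2), round(t, 2), tuple(round(x, 2) for x in ci))
```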

34

3. Inference for Means

SAS Programming for Paired T-test

Data steps: The same as slide 21.

Procedure steps:
1. Click ‘Solutions’
2. Click ‘Analysis’
3. Click ‘Analyst’
4. Click ‘File’
5. Click ‘Open By SAS Name’
6. Select the SAS data set and click ‘OK’
7. Click ‘Statistics’
8. Click ‘Hypothesis tests’
9. Click ‘Two-Sample Paired T-test for means’
10. Select the ‘Group1’ and ‘Group2’ variables respectively
11. Click ‘OK’

(Note: You can also calculate the difference, and use it as the dependent variable to run the one-sample t-test)

35

3. Inference for Means

- Click ‘Statistics’ to select the statistical procedure.
- Click ‘File’ to open the SAS data set.
- Put the two group variables into ‘Group 1’ and ‘Group 2’.

36

3. Inference for Means

Comparison of more than two means: ANOVA (Analysis of Variance)
- One-way ANOVA: one factor, e.g., control, drug 1, drug 2
- Two-way ANOVA: two factors, e.g., drugs, age groups
- Repeated measures: if there are repeated measures within subject, such as time points

37

3. Inference for Means

Example: Pulmonary disease
- Endpoint: Mid-expiratory flow (FEF) in L/s
- 6 groups: nonsmokers (NS), passive smokers (PS), noninhaling smokers (NI), light smokers (LS), moderate smokers (MS) and heavy smokers (HS)

| Group name | Mean FEF | SD FEF | n   |
|------------|----------|--------|-----|
| NS         | 3.78     | 0.79   | 200 |
| PS         | 3.30     | 0.77   | 200 |
| NI         | 3.32     | 0.86   | 50  |
| LS         | 3.23     | 0.78   | 200 |
| MS         | 2.73     | 0.81   | 200 |
| HS         | 2.59     | 0.82   | 200 |

38

3. Inference for Means

Example: Pulmonary disease
- H0: group means are the same
- H1: not all the group means are the same

|         | SS     | df   | MS     | F statistic | P-value |
|---------|--------|------|--------|-------------|---------|
| Between | 184.38 | 5    | 36.875 | 58.0        | <0.001  |
| Within  | 663.87 | 1044 | 0.636  |             |         |
| Total   | 848.25 |      |        |             |         |

- P-value < 0.001: there is a significant difference in the mean FEF among the groups.
- Comparison of specific groups: linear contrast
- Multiple comparison: Bonferroni adjustment ($\alpha$/n)
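The ANOVA table can be rebuilt from the group summary statistics alone (mean, SD and n for the six smoking groups). The sketch below is an illustration of that arithmetic; the function name `anova_from_summary` is an assumption, and the p-value is omitted because the Python standard library has no F distribution.

```python
# (mean FEF, SD FEF, n) for each group in the pulmonary disease example
groups = {
    "NS": (3.78, 0.79, 200),
    "PS": (3.30, 0.77, 200),
    "NI": (3.32, 0.86, 50),
    "LS": (3.23, 0.78, 200),
    "MS": (2.73, 0.81, 200),
    "HS": (2.59, 0.82, 200),
}

def anova_from_summary(groups):
    """One-way ANOVA sums of squares and F statistic from group means, SDs and sizes."""
    n_total = sum(n for _, _, n in groups.values())
    grand_mean = sum(m * n for m, _, n in groups.values()) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups.values())
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups.values())
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

ss_between, ss_within, df_between, df_within, f_stat = anova_from_summary(groups)
print(round(ss_between, 2), round(ss_within, 2), df_between, df_within, round(f_stat, 1))
```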

39

3. Inference for Means

SAS Programming for One-Way ANOVA

Data steps: The same as slide 21.

Procedure steps:
1. Click ‘Solutions’
2. Click ‘Analysis’
3. Click ‘Analyst’
4. Click ‘File’
5. Click ‘Open By SAS Name’
6. Select the SAS data set and click ‘OK’
7. Click ‘Statistics’
8. Click ‘ANOVA’
9. Click ‘One-Way ANOVA’
10. Select the ‘Independent’ and ‘Dependent’ variables respectively
11. Click ‘OK’

40

3. Inference for Means

- Click ‘Solutions’ to select the statistical procedure.
- Click ‘File’ to open the SAS data set.
- Select the dependent and independent variables.

41

4. Inference for Proportions

Chi-square test
- Testing difference of two proportions
- n: #successes, p: success rate
- Requirement: $np \ge 5$ & $n(1 - p) \ge 5$
- H0: $p_1 = p_2$
- H1: $p_1 \ne p_2$ (for two-sided test)
- If the requirement is not satisfied, use Fisher’s exact test.
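A minimal sketch of the chi-square test for two proportions follows. The counts (30 of 100 vs. 45 of 100) are hypothetical, the function name is illustrative, and the requirement is checked as "every expected cell count is at least 5", which is one common reading of the rule above. No continuity correction is applied. For 1 degree of freedom the upper-tail chi-square probability equals `erfc(sqrt(chi2 / 2))`.

```python
import math

def two_proportion_chi_square(x1, n1, x2, n2):
    """Pearson chi-square test (1 df) for H0: p1 = p2, from success counts and sample sizes."""
    p_pooled = (x1 + x2) / (n1 + n2)
    observed = [x1, n1 - x1, x2, n2 - x2]
    expected = [n1 * p_pooled, n1 * (1 - p_pooled), n2 * p_pooled, n2 * (1 - p_pooled)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square with 1 df
    expected_ok = min(expected) >= 5           # otherwise consider Fisher's exact test
    return chi2, p_value, expected_ok

# Hypothetical data: 30/100 successes in group 1, 45/100 in group 2
chi2, p_value, expected_ok = two_proportion_chi_square(30, 100, 45, 100)
print(round(chi2, 2), round(p_value, 4), expected_ok)
```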

42

5. Power/Sample Size Calculation

- Decide significance level (e.g., 0.05)
- Decide desired power (e.g., 80%)
- One-sided or two-sided test
- Comparison of means: two-sample t-test
  - Need to know sample means in each group
  - Need to know sample sd’s in each group
  - Calculation: use software (Nquery, power, etc.)
- Comparison of proportions: Chi-square test
  - Need to know sample proportions in each group
  - Continuity correction
  - Small sample size: Fisher’s exact test
  - Calculation: use software
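To show how the inputs listed above feed into a sample size, here is a rough sketch using the standard normal-approximation formula for comparing two means, $n = 2\,(z_{\alpha/2} + z_{\beta})^2 \sigma^2 / \delta^2$ per group. This is an approximation and not the method of any particular software package; dedicated tools based on the t distribution typically give a slightly larger n. The function name and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per group to detect a mean difference `delta` with common sd `sd`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical planning values: detect a difference of 0.5 when the sd is 1.0
n_two_sided = n_per_group(delta=0.5, sd=1.0)
n_one_sided = n_per_group(delta=0.5, sd=1.0, two_sided=False)
print(n_two_sided, n_one_sided)
```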

43

6. Correlation and Regression

Correlation
- Pearson correlation for continuous variables
- Spearman correlation for ranked variables
- Chi-square test for categorical variables

Pearson correlation
- Correlation coefficient (r): $-1 \le r \le 1$
- Test for coefficient: t-test
- A larger sample is more significant for the same value of the correlation coefficient.
- Thus it is not meaningful to judge by the magnitude of the correlation coefficient.
- Judge the significance of the correlation by p-value.
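The point about sample size can be illustrated with the usual t statistic for a correlation coefficient, $t = r\sqrt{(n-2)/(1-r^2)}$ with $n - 2$ degrees of freedom. The sketch below uses illustrative function names and hypothetical sample sizes.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_to_t(r, n):
    """t statistic (n - 2 df) for testing that the population correlation is zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Same r = 0.5, two hypothetical sample sizes
t_small = r_to_t(0.5, 27)
t_large = r_to_t(0.5, 102)
print(round(t_small, 3), round(t_large, 3))
```

The same r gives a t statistic that is twice as large with n = 102 as with n = 27, so the p-value shrinks as the sample grows.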

44

6. Correlation and Regression

Regression
- Objective
  - Find out whether a significant linear relationship exists between the response and independent variables
  - Use it to predict a future value
- Notation
  - X: independent (predictor) variable
  - Y: dependent (response) variable
- Multiple linear regression model

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$$

  where $\varepsilon$ is the random error
- Checking the model (assumption)
  - Normality: q-q plot, histogram, Shapiro-Wilk test
  - Equal variance: predicted y vs. error is a band shape
  - Linear relationship: predicted y vs. each x

45

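6. Correlation and Regression

| Weight (x1) in LB | Age (x2) | Blood pressure (y) |
|-------------------|----------|--------------------|
| 152               | 50       | 120                |
| 183               | 20       | 141                |
| 171               | 20       | 124                |
| 165               | 30       | 126                |
| 158               | 30       | 117                |
| 161               | 50       | 129                |
| 149               | 60       | 123                |
| 158               | 50       | 125                |
| 170               | 40       | 132                |
| 153               | 55       | 123                |
| 164               | 40       | 132                |
| 190               | 40       | 155                |
| 185               | 20       | 147                |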

46

6. Correlation and Regression

The regression equation is

$$\hat{y} = -65.1 + 1.08 x_1 + 0.425 x_2$$

| Predictor | Coefficient | se    | T-ratio | P-value |
|-----------|-------------|-------|---------|---------|
| Constant  | -65.10      | 14.94 | -4.36   | 0.001   |
| x1        | 1.077       | 0.077 | 13.98   | 0.000   |
| x2        | 0.425       | 0.073 | 5.82    | 0.000   |

- The mean blood pressure increases by 1.08 if weight (x1) increases by one pound and age (x2) remains fixed. Similarly, a 1-year increase in age with the weight held fixed will increase the mean blood pressure by 0.425.
- s = 2.509, R² = 95.8%
- Error sd is estimated as 2.509 with df = 13 - 3 = 10
- 95.8% of the variation in y can be explained by the regression.
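The fitted coefficients can be checked against the 13 observations of weight, age and blood pressure with a small least-squares sketch. It uses the closed-form solution for exactly two predictors (centred sums of squares and cross-products) so that only the standard library is needed; a statistics package would normally be used instead. The function name is illustrative.

```python
import math

# Weight (lb), age and blood pressure for the 13 subjects in the example
weight = [152, 183, 171, 165, 158, 161, 149, 158, 170, 153, 164, 190, 185]
age = [50, 20, 20, 30, 30, 50, 60, 50, 40, 55, 40, 40, 20]
bp = [120, 141, 124, 126, 117, 129, 123, 125, 132, 123, 132, 155, 147]

def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2, s, r2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    syy = sum((c - my) ** 2 for c in y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    sse = syy - b1 * s1y - b2 * s2y        # residual sum of squares
    s = math.sqrt(sse / (n - 3))           # error sd with df = n - 3
    r2 = 1 - sse / syy
    return b0, b1, b2, s, r2

b0, b1, b2, s, r2 = fit_two_predictors(weight, age, bp)
print(round(b0, 2), round(b1, 3), round(b2, 3), round(s, 3), round(r2, 3))
```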

47

6. Correlation and Regression

SAS Programming for Linear Regression

Data steps: The same as slide 21.

Procedure steps:
1. Click ‘Solutions’
2. Click ‘Analysis’
3. Click ‘Analyst’
4. Click ‘File’
5. Click ‘Open By SAS Name’
6. Select the SAS data set and click ‘OK’
7. Click ‘Statistics’
8. Click ‘Regression’
9. Click ‘Linear’
10. Select the ‘Dependent’ (Response) variable and the ‘Explanatory’ (Predictor) variable respectively
11. Click ‘OK’

48

6. Correlation and Regression

- Click ‘Solutions’ to select the statistical procedure.
- Click ‘File’ to open the SAS data set.
- Select the dependent and explanatory variables.

49

6. Correlation and Regression

Other regression models
- Polynomial regression
- Transformation
- Logistic regression