Statistics for Medical Researchers
Hongshik Ahn
Professor
Department of Applied Math and Statistics
Stony Brook University
Biostatistician, Stony Brook GCRC
2
Contents
1. Experimental Design
2. Descriptive Statistics and Distributions
3. Comparison of Means
4. Comparison of Proportions
5. Power Analysis/Sample Size Calculation
6. Correlation and Regression
3
1. Experimental Design
Experiment
  Treatment: something that researchers administer to experimental units
  Factor: controlled independent variable whose levels are set by the experimenter
Experimental design
  Control
  Treatment
  Placebo effect
  Blind: single blind, double blind, triple blind
4
1. Experimental Design
Randomization
  Completely randomized design
  Randomized block design: if there are specific differences among groups of subjects
  Permuted block randomization: used for small studies to maintain reasonably good balance among groups
  Stratified block randomization: matching
5
1. Experimental Design
Completely randomized design
  The computer generated sequence: 4,8,3,2,7,2,6,6,3,4,2,1,6,2,0,...
  Two groups (criterion: even~A, odd~B): AABABAAABAABAAA...
  Three groups (criterion: {1,2,3}~A, {4,5,6}~B, {7,8,9}~C; ignore 0's): BCAACABBABAABA...
  Two groups with a different randomization ratio (e.g., 2:3) (criterion: {0,1,2,3}~A, {4,5,6,7,8,9}~B): BBAABABBABAABAA...
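The three allocation rules can be checked with a short sketch. It assumes the even digits (including 0) map to A in the even-odd rule, which is what the slide's output implies; the function names are illustrative only.

```python
# Digits from the slide's computer-generated sequence.
SEQUENCE = [4, 8, 3, 2, 7, 2, 6, 6, 3, 4, 2, 1, 6, 2, 0]

def allocate(digits, rule):
    """Map each digit to a group label; digits mapped to None are skipped."""
    labels = (rule(d) for d in digits)
    return "".join(label for label in labels if label is not None)

def even_odd(d):
    # Two groups: even digits -> A, odd digits -> B.
    return "A" if d % 2 == 0 else "B"

def three_groups(d):
    # Three groups: {1,2,3} -> A, {4,5,6} -> B, {7,8,9} -> C; 0 is ignored.
    if d == 0:
        return None
    return "ABC"[(d - 1) // 3]

def ratio_2_to_3(d):
    # Unequal allocation: 4 digits go to A and 6 digits go to B (2:3).
    return "A" if d <= 3 else "B"

two = allocate(SEQUENCE, even_odd)
three = allocate(SEQUENCE, three_groups)
unequal = allocate(SEQUENCE, ratio_2_to_3)
print(two, three, unequal)
```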
6
1. Experimental Design
Permuted block randomization
  With a block size of 4 for two groups (A, B), there are 6 possible permutations, which can be coded as: 1=AABB, 2=ABAB, 3=ABBA, 4=BAAB, 5=BABA, 6=BBAA
  Each number in the random number sequence in turn selects the next block, determining the next four participant allocations (ignoring numbers 0, 7, 8 and 9).
  e.g., the sequence 67126814... will produce BBAA AABB ABAB BBAA AABB BAAB.
  In practice, a block size of four is too small, since researchers may crack the code and risk selection bias.
  Mixing block sizes of 4 and 6 is better, with the size kept unknown to the investigator.
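A minimal sketch of the block lookup, assuming the blocks are numbered in alphabetical order (which matches the slide's coding 1=AABB ... 6=BBAA). The function name is illustrative only.

```python
from itertools import permutations

# All distinct orderings of two A's and two B's, numbered 1..6 alphabetically.
BLOCKS = dict(enumerate(sorted({"".join(p) for p in permutations("AABB")}), start=1))

def permuted_blocks(digits):
    """Each digit 1..6 selects a block; other digits (0, 7, 8, 9) are ignored."""
    return [BLOCKS[d] for d in digits if d in BLOCKS]

blocks = permuted_blocks(int(ch) for ch in "67126814")
print(" ".join(blocks))
```

Every block contains two A's and two B's, so the groups are balanced after each completed block.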
7
1. Experimental Design
Methods of Sampling
  Random sampling
  Systematic sampling
  Convenience sampling
  Stratified sampling
8
1. Experimental Design
Random Sampling
  Selection so that each individual member has an equal chance of being selected
Systematic Sampling
  Select some starting point and then select every kth element in the population
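The two schemes might be sketched as follows. The subject IDs, sample size, k and starting point are hypothetical values chosen for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every member is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, k, start=0):
    """Take the element at index `start`, then every kth element after it."""
    return population[start::k]

ids = list(range(1, 21))                       # hypothetical subject IDs 1..20
srs = simple_random_sample(ids, 5, seed=1)
every_fourth = systematic_sample(ids, k=4, start=2)
print(srs, every_fourth)
```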
9
1. Experimental Design
Convenience Sampling
  Use results that are easy to get
10
1. Experimental Design
Stratified Sampling
  Draw a sample from each stratum
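A small sketch of drawing a simple random sample within each stratum. The strata names, their members and the equal per-stratum sample size are hypothetical choices, not taken from the slide.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw the same number of members, without replacement, from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical strata: subject IDs split by age group.
strata = {"under_50": list(range(1, 11)), "50_and_over": list(range(11, 21))}
sample = stratified_sample(strata, 3, seed=1)
print(sample)
```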
11
2. Descriptive Statistics & Distributions
Parameter: population quantity
Statistic: summary of the sample
Inference for parameters: use sample
Central Tendency
  Mean (average)
  Median (middle value)
Variability
  Variance: measure of variation
  Standard deviation (sd): square root of variance
  Standard error (se): sd of the estimate
  Median, quartiles, min., max, range, boxplot
Proportion
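These summaries can be computed with Python's standard `statistics` module. The data values below are hypothetical, and the standard error shown is that of the sample mean.

```python
import math
import statistics

values = [4, 8, 6, 5, 3, 7, 9, 6]          # hypothetical measurements

mean = statistics.mean(values)
median = statistics.median(values)
variance = statistics.variance(values)      # sample variance (n - 1 denominator)
sd = statistics.stdev(values)               # square root of the sample variance
se = sd / math.sqrt(len(values))            # standard error of the sample mean
value_range = max(values) - min(values)
print(mean, median, variance, sd, se, value_range)
```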
12
2. Descriptive Statistics & Distributions
Normal distribution
13
2. Descriptive Statistics & Distributions
Standard normal distribution: Mean 0, variance 1
14
2. Descriptive Statistics & Distributions
Z-test for means
T-test for means if sd is unknown
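For reference, the usual one-sample forms of the two statistics are given below. The symbols are notation assumed here rather than taken from the slide: $\mu_0$ is the hypothesized mean, $\sigma$ the known population sd, $s$ the sample sd and $n$ the sample size.

```latex
\[
z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}},
\qquad
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \quad (\text{df} = n - 1)
\]
```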
15
3. Inference for Means
Two-sample t-test
  Two independent groups: Control and treatment
  Continuous variables
  Assumption: populations are normally distributed
  Checking normality
    Histogram
    Normal probability plot (Q-Q plot): straight?
    Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test
  If the normality assumption is violated
    T-test is not appropriate.
    Possible transformation
    Use non-parametric alternative: Mann-Whitney U-test (Wilcoxon rank-sum test)
16
3. Inference for Means
A clinical trial on effectiveness of drug A in preventing premature birth
30 pregnant women are randomly assigned to control and treatment groups of size 15 each
Primary endpoint: weight of the babies at birth

        Treatment   Control
n       15          15
mean    7.08        6.26
sd      0.90        0.96
17
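3. Inference for Means
Hypothesis: The group means are different
  Null hypothesis (Ho): $\mu_1 = \mu_2$
  Alternative hypothesis (H1): $\mu_1 \neq \mu_2$
Significance level: $\alpha = 0.05$
Assumption: Equal variance
Degrees of freedom (df): $n_1 + n_2 - 2$
Calculate the T-value (test statistic):
\[ T = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}} \]
P-value: probability, assuming Ho is true, of a test statistic at least as extreme as the one observed; the significance level $\alpha$ is the Type I error rate (false positive rate)
  Reject Ho if p-value < $\alpha$
  Do not reject Ho if p-value > $\alpha$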
18
3. Inference for Means
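Previous example: Test at $\alpha = 0.05$
\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{14(.90)^2 + 14(.96)^2}{15 + 15 - 2} = 0.866 \]
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}} = \frac{7.08 - 6.26}{\sqrt{0.866}\,\sqrt{(1/15) + (1/15)}} = 2.413 \]
P-value: 0.026 < 0.05
Reject the null hypothesis that there is no drug effect.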
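The arithmetic of this example can be reproduced from the summary statistics alone. The sketch below assumes equal variances, as the slide does; the function name is illustrative, and the p-value is not computed because the standard library has no t distribution.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t statistic from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, pooled_var, df

# Treatment vs. control birth weights from the slide.
t_stat, pooled_var, df = pooled_t(7.08, 0.90, 15, 6.26, 0.96, 15)
print(round(pooled_var, 3), round(t_stat, 3), df)
```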
19
3. Inference for Means
Confidence interval (CI):
  An interval of values used to estimate the true value of a population parameter.
  The confidence level $1-\alpha$ is the proportion of times that the CI actually contains the population parameter, assuming that the estimation process is repeated a large number of times.
  Common choices: 90% CI ($\alpha$ = 10%), 95% CI ($\alpha$ = 5%), 99% CI ($\alpha$ = 1%)
3. Inference for Means
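CI for a comparison of two means:
\[ (\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E \]
where
\[ E = t_{\alpha/2,\,n_1+n_2-2}\; s_p \sqrt{(1/n_1) + (1/n_2)} \]
A 95% CI for the previous example:
\[ E = t_{.025,28}\; s_p \sqrt{(1/15) + (1/15)} = (2.048)\sqrt{.866\,[(1/15) + (1/15)]} = .70 \]
\[ (7.08 - 6.26) \pm .70 = (.12, 1.52) \]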
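A sketch of the same interval in code. The critical value 2.048 is taken from the slide rather than computed, and the function name is illustrative.

```python
import math

def mean_difference_ci(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Pooled-variance CI for mu1 - mu2, given a t critical value."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    margin = t_crit * math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - margin, diff + margin, margin

# t_crit = t_{.025,28} = 2.048, as given on the slide.
low, high, margin = mean_difference_ci(7.08, 0.90, 15, 6.26, 0.96, 15, t_crit=2.048)
print(round(low, 2), round(high, 2))
```

The interval excludes 0, which agrees with rejecting the null hypothesis at the 5% level.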
21
3. Inference for Means
SAS programming for Two-Sample T-test
Data steps:
  Click ‘File’
  Click ‘Import Data’
  Select a data source
  Click ‘Browse’ and find the path of the data file
  Click ‘Next’
  Fill the blank of ‘Member’ with the name of the SAS data set
  Click ‘Finish’
Procedure steps:
  Click ‘Solutions’
  Click ‘Analysis’
  Click ‘Analyst’
  Click ‘File’
  Click ‘Open By SAS Name’
  Select the SAS data set and Click ‘OK’
  Click ‘Statistics’
  Click ‘Hypothesis Tests’
  Click ‘Two-Sample T-test for Means’
  Select the independent variable as ‘Group’ and the dependent variable as ‘Dependent’
  Choose the hypothesis of interest and Click ‘OK’
22
3. Inference for Means
[Screenshot of the SAS menus. Callouts: click ‘File’ to import data and create the SAS data set; click ‘Solutions’ to create a project to run the statistical test; click ‘File’ to open the SAS data set; click ‘Statistics’ to select the statistical procedure.]
23
3. Inference for Means
Mann-Whitney U-Test (Wilcoxon Rank-Sum Test)
  Nonparametric alternative to two-sample t-test
  The populations don’t need to be normal
  H0: The two samples come from populations with equal medians
  H1: The two samples come from populations with different medians
24
3. Inference for Means
Mann-Whitney U-Test Procedure
  Temporarily combine the two samples into one big sample, then replace each sample value with its rank
  Find the sum of the ranks for either one of the two samples
  Calculate the value of the z test statistic
25
3. Inference for Means
Mann-Whitney U-Test, Example
  Numbers in parentheses are their ranks, beginning with a rank of 1 assigned to the lowest value of 17.7.
  R1 and R2: sum of ranks
26
3. Inference for Means
Hypothesis: The group medians are different
  Ho: Men and women have the same median BMI
  H1: Men and women have different median BMIs
\[ \mu_R = \frac{n_1(n_1 + n_2 + 1)}{2} = \frac{13(13 + 12 + 1)}{2} = 169 \]
\[ \sigma_R = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{(13)(12)(13 + 12 + 1)}{12}} = 18.385 \]
\[ z = \frac{R - \mu_R}{\sigma_R} = \frac{187 - 169}{18.385} = 0.98 \]
p-value > 0.05, thus we do not reject H0 at $\alpha$ = 0.05.
There is no significant difference in BMI between men and women.
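The rank-sum calculation can be sketched as below. The values R = 187, n1 = 13 and n2 = 12 come from the slide; the raw BMI data are not reproduced in this transcript, so the `rank_sum` helper is shown only as an illustration of the ranking step, and using average ranks for ties is an assumption. The two-sided p-value uses the normal approximation.

```python
import math
from statistics import NormalDist

def rank_sum(sample1, sample2):
    """Sum of the ranks of sample1 within the combined sample (ties share the average rank)."""
    combined = sorted(sample1 + sample2)

    def average_rank(value):
        first = combined.index(value) + 1           # rank of the first occurrence
        last = first + combined.count(value) - 1    # rank of the last occurrence
        return (first + last) / 2

    return sum(average_rank(v) for v in sample1)

def rank_sum_z(r, n1, n2):
    """z statistic for the rank sum r of the sample of size n1."""
    mu_r = n1 * (n1 + n2 + 1) / 2
    sigma_r = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (r - mu_r) / sigma_r, mu_r, sigma_r

z, mu_r, sigma_r = rank_sum_z(187, 13, 12)          # R, n1, n2 from the slide
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(mu_r, round(sigma_r, 3), round(z, 2), p_two_sided > 0.05)
```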
27
3. Inference for Means
SAS Programming for Mann-Whitney U-Test Procedure
Data steps: The same as slide 21.
Procedure steps:
  Click ‘Solutions’
  Click ‘Analysis’
  Click ‘Analyst’
  Click ‘File’
  Click ‘Open By SAS Name’
  Select the SAS data set and Click ‘OK’
  Click ‘Statistics’
  Click ‘ANOVA’
  Click ‘Nonparametric One-Way ANOVA’
  Select the ‘Dependent’ and ‘Independent’ variables respectively and choose the test of interest
  Click ‘OK’
28
3. Inference for Means
[Screenshot of the SAS menus. Callouts: click ‘File’ to open the SAS data set; click ‘Statistics’ to select the statistical procedure; select the dependent and independent variables.]
29
3. Inference for Means
Paired t-test
  Mean difference of matched pairs
  Test for changes (e.g., before & after)
  The measures in each pair are correlated.
  Assumption: population is normally distributed
  Take the difference in each pair and perform one-sample t-test.
  Check normality
  If the normality assumption is violated
    T-test is not appropriate.
    Use non-parametric alternative: Wilcoxon signed rank test
30
3. Inference for Means
Notation for paired t-test
  d = individual difference between the two values of a single matched pair
  $\mu_d$ = mean value of the differences d for the population of paired data
  $\bar{d}$ = mean value of the differences d for the paired sample data
  $s_d$ = standard deviation of the differences d for the paired sample data
  n = number of pairs
31
3. Inference for Means
Example: Systolic Blood Pressure
OC: Oral contraceptive
ID   Without OC’s   With OC’s   Difference
1 115 128 13
2 112 115 3
3 107 106 -1
4 119 128 9
5 115 122 7
6 138 145 7
7 126 132 6
8 105 109 4
9 104 102 -2
10 115 117 2
32
3. Inference for Means
Hypothesis: The group means are different
  Ho: $\mu_d = 0$ vs. H1: $\mu_d \neq 0$
Significance level: $\alpha = 0.05$
Degrees of freedom (df): $n - 1 = 9$
Test statistic:
\[ t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}} = \frac{4.8}{4.57/\sqrt{10}} = 3.32 \]
P-value: 0.009, thus reject Ho at $\alpha$ = 0.05
The data support the claim that oral contraceptives affect the systolic bp.
33
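3. Inference for Means
Confidence interval for matched pairs
100(1-$\alpha$)% CI:
\[ \left( \bar{d} - t_{\alpha/2,\,n-1}\frac{s_d}{\sqrt{n}},\;\; \bar{d} + t_{\alpha/2,\,n-1}\frac{s_d}{\sqrt{n}} \right) \]
95% CI for the mean difference of the systolic bp:
\[ \bar{d} \pm t_{0.025,9}\frac{s_d}{\sqrt{10}} = 4.8 \pm 2.26\,\frac{4.57}{\sqrt{10}} = 4.8 \pm 3.27 \]
(1.53, 8.07)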
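The paired analysis of the systolic blood pressure data can be reproduced as a one-sample analysis of the differences. In this sketch the critical value 2.262 is the tabulated $t_{0.025,9}$ (the slide shows it rounded to 2.26), and the p-value is not computed.

```python
import math
import statistics

# Systolic blood pressure for the 10 subjects on the slide.
without_oc = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
with_oc = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]

diffs = [w - wo for wo, w in zip(without_oc, with_oc)]
n = len(diffs)
d_bar = statistics.mean(diffs)       # mean difference
s_d = statistics.stdev(diffs)        # sample sd of the differences
se = s_d / math.sqrt(n)
t_stat = d_bar / se                  # tests Ho: mu_d = 0

T_CRIT = 2.262                       # t_{0.025, 9} from a t table
ci = (d_bar - T_CRIT * se, d_bar + T_CRIT * se)
print(round(t_stat, 2), tuple(round(x, 2) for x in ci))
```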
34
3. Inference for Means
SAS Programming for Paired T-test
Data steps: The same as slide 21.
Procedure steps:
  Click ‘Solutions’
  Click ‘Analysis’
  Click ‘Analyst’
  Click ‘File’
  Click ‘Open By SAS Name’
  Select the SAS data set and Click ‘OK’
  Click ‘Statistics’
  Click ‘Hypothesis Tests’
  Click ‘Two-Sample Paired T-test for Means’
  Select the ‘Group1’ and ‘Group2’ variables respectively
  Click ‘OK’
(Note: You can also calculate the difference and use it as the dependent variable to run the one-sample t-test.)
35
3. Inference for Means
[Screenshot of the SAS menus. Callouts: click ‘File’ to open the SAS data set; click ‘Statistics’ to select the statistical procedure; put the two group variables into ‘Group 1’ and ‘Group 2’.]
36
3. Inference for Means
Comparison of more than two means: ANOVA (Analysis of Variance)
One-way ANOVA: One factor, e.g., control, drug 1, drug 2
Two-way ANOVA: Two factors, e.g., drugs, age groups
Repeated measures: if there are repeated measures within subjects, such as time points
37
3. Inference for means
Example: Pulmonary disease
  Endpoint: Mid-expiratory flow (FEF) in L/s
  6 groups: nonsmokers (NS), passive smokers (PS), noninhaling smokers (NI), light smokers (LS), moderate smokers (MS) and heavy smokers (HS)

Group name   Mean FEF   SD FEF   n
NS           3.78       0.79     200
PS           3.30       0.77     200
NI           3.32       0.86     50
LS           3.23       0.78     200
MS           2.73       0.81     200
HS           2.59       0.82     200
38
3. Inference for means
Example: Pulmonary disease
  Ho: group means are the same
  H1: not all the group means are the same
  P-value < 0.001: there is a significant difference in the mean FEF among the groups.
  Comparison of specific groups: linear contrast
  Multiple comparison: Bonferroni adjustment ($\alpha/n$)

          SS       df     MS       F statistic   P-value
Between   184.38   5      36.875   58.0          <0.001
Within    663.87   1044   0.636
Total     848.25
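The ANOVA table can be rebuilt from the group means, sd's and sizes on the previous slide. This is a sketch using summary statistics only; the p-value is not computed because the standard library has no F distribution.

```python
# name: (mean FEF, sd FEF, n), from the pulmonary disease slide.
groups = {
    "NS": (3.78, 0.79, 200),
    "PS": (3.30, 0.77, 200),
    "NI": (3.32, 0.86, 50),
    "LS": (3.23, 0.78, 200),
    "MS": (2.73, 0.81, 200),
    "HS": (2.59, 0.82, 200),
}

total_n = sum(n for _, _, n in groups.values())
grand_mean = sum(m * n for m, _, n in groups.values()) / total_n

# Between-group and within-group sums of squares.
ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups.values())
ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups.values())

df_between = len(groups) - 1
df_within = total_n - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(ss_between, 2), round(ss_within, 2), round(f_stat, 1))
```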
39
3. Inference for Means
SAS Programming for One-Way ANOVA
Data steps: The same as slide 21.
Procedure steps:
  Click ‘Solutions’
  Click ‘Analysis’
  Click ‘Analyst’
  Click ‘File’
  Click ‘Open By SAS Name’
  Select the SAS data set and Click ‘OK’
  Click ‘Statistics’
  Click ‘ANOVA’
  Click ‘One-Way ANOVA’
  Select the ‘Independent’ and ‘Dependent’ variables respectively
  Click ‘OK’
40
3. Inference for Means
[Screenshot of the SAS menus. Callouts: click ‘File’ to open the SAS data set; click ‘Solutions’ to select the statistical procedure; select the dependent and independent variables.]
41
4. Inference for Proportions
Chi-square test
  Testing difference of two proportions
  n: sample size, p: success rate
  Requirement: $np \geq 5$ & $n(1-p) \geq 5$
  H0: $p_1 = p_2$
  H1: $p_1 \neq p_2$ (for two-sided test)
  If the requirement is not satisfied, use Fisher’s exact test.
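A minimal sketch of the chi-square test for two proportions, without continuity correction. The counts are hypothetical. With 1 degree of freedom the chi-square statistic is the square of a standard normal z, so the p-value can be taken from the normal distribution.

```python
import math
from statistics import NormalDist

def two_proportion_chi_square(success1, n1, success2, n2):
    """Pearson chi-square (no continuity correction) for a 2x2 table; df = 1."""
    a, b = success1, n1 - success1
    c, d = success2, n2 - success2
    total = n1 + n2
    chi2 = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With df = 1, P(chi-square > x) equals the two-sided normal tail at sqrt(x).
    p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))
    return chi2, p_value

def requirement_met(n, p):
    """The slide's requirement: np >= 5 and n(1 - p) >= 5."""
    return n * p >= 5 and n * (1 - p) >= 5

# Hypothetical counts: 30/50 successes in group 1 and 20/50 in group 2.
chi2, p_value = two_proportion_chi_square(30, 50, 20, 50)
print(chi2, round(p_value, 4))
```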
42
5. Power/Sample Size Calculation
Decide significance level (e.g., 0.05)
Decide desired power (e.g., 80%)
One-sided or two-sided test
Comparison of means: two-sample t-test
  Need to know sample means in each group
  Need to know sample sd’s in each group
  Calculation: use software (Nquery, power, etc.)
Comparison of proportions: Chi-square test
  Need to know sample proportions in each group
  Continuity correction
  Small sample size: Fisher’s exact test
  Calculation: use software
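As a rough illustration of what such software computes for a two-sided comparison of two means, the sketch below uses the normal approximation $n = 2\,(z_{\alpha/2} + z_{\beta})^2 \sigma^2 / \delta^2$ per group. This formula is an assumption brought in for illustration, not taken from the slides; dedicated software based on the t distribution will typically give a slightly larger n. The inputs are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample comparison of means.

    delta: difference in means to detect; sd: common sd assumed in both groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical inputs: detect a difference of one sd, or of half an sd.
print(n_per_group(delta=1.0, sd=1.0), n_per_group(delta=0.5, sd=1.0))
```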
43
6. Correlation and Regression
Correlation
  Pearson correlation for continuous variables
  Spearman correlation for ranked variables
  Chi-square test for categorical variables
Pearson correlation
  Correlation coefficient (r): $-1 \leq r \leq 1$
  Test for coefficient: t-test
  A larger sample gives a more significant result for the same value of the correlation coefficient.
  Thus it is not meaningful to judge significance by the magnitude of the correlation coefficient alone.
  Judge the significance of the correlation by p-value
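The dependence on sample size can be seen from the usual t statistic for a correlation, $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom. This formula is standard but not shown on the slide, and the values of r and n below are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(r, n):
    """t statistic (df = n - 2) for testing that the correlation is zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# The same r = 0.5 gives a much larger t statistic with n = 100 than with n = 10.
t_small = correlation_t(0.5, 10)
t_large = correlation_t(0.5, 100)
print(round(t_small, 2), round(t_large, 2))
```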
44
6. Correlation and Regression
Regression
  Objective
    Find out whether a significant linear relationship exists between the response and independent variables
    Use it to predict a future value
  Notation
    X: independent (predictor) variable
    Y: dependent (response) variable
  Multiple linear regression model
    \[ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon \]
    where $\varepsilon$ is the random error
  Checking the model (assumption)
    Normality: q-q plot, histogram, Shapiro-Wilk test
    Equal variance: predicted y vs. error is a band shape
    Linear relationship: predicted y vs. each x
45
6. Correlation and Regression
Weight (x1) in LB   Age (x2)   Blood pressure (y)
152 50 120
183 20 141
171 20 124
165 30 126
158 30 117
161 50 129
149 60 123
158 50 125
170 40 132
153 55 123
164 40 132
190 40 155
185 20 147
46
6. Correlation and Regression
The regression equation is
\[ \hat{y} = -65.1 + 1.08 x_1 + 0.425 x_2 \]
The mean blood pressure increases by 1.08 if weight (x1) increases by one pound and age (x2) remains fixed. Similarly, a 1-year increase in age with the weight held fixed will increase the mean blood pressure by 0.425.
s = 2.509, R2 = 95.8%
  Error sd is estimated as 2.509 with df = 13-3 = 10
  95.8% of the variation in y can be explained by the regression.

Predictor   Coefficient   se      T-ratio   P-value
Constant    -65.10        14.94   -4.36     0.001
x1          1.077         0.077   13.98     0.000
x2          0.425         0.073   5.82      0.000
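The fitted coefficients can be checked against the 13 observations in the data table. The sketch below solves the least-squares normal equations in closed form for the two-predictor case only; it is an illustration, not the procedure SAS uses internally, and it does not compute the standard errors or p-values.

```python
import math

# Weight (lb), age and blood pressure for the 13 subjects in the data table.
weight = [152, 183, 171, 165, 158, 161, 149, 158, 170, 153, 164, 190, 185]
age = [50, 20, 20, 30, 30, 50, 60, 50, 40, 55, 40, 40, 20]
bp = [120, 141, 124, 126, 117, 129, 123, 125, 132, 123, 132, 155, 147]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of cross products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

s11, s22, s12 = cross(weight, weight), cross(age, age), cross(weight, age)
s1y, s2y, syy = cross(weight, bp), cross(age, bp), cross(bp, bp)

# Closed-form least-squares solution for two predictors.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det
b0 = mean(bp) - b1 * mean(weight) - b2 * mean(age)

ss_regression = b1 * s1y + b2 * s2y
r_squared = ss_regression / syy
error_sd = math.sqrt((syy - ss_regression) / (len(bp) - 3))   # df = 13 - 3 = 10
print(round(b0, 2), round(b1, 3), round(b2, 3), round(r_squared, 3), round(error_sd, 3))
```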
47
6. Correlation and Regression
SAS Programming for Linear Regression
Data steps: The same as slide 21.
Procedure steps:
  Click ‘Solutions’
  Click ‘Analysis’
  Click ‘Analyst’
  Click ‘File’
  Click ‘Open By SAS Name’
  Select the SAS data set and Click ‘OK’
  Click ‘Statistics’
  Click ‘Regression’
  Click ‘Linear’
  Select the ‘Dependent’ (Response) variable and the ‘Explanatory’ (Predictor) variable respectively
  Click ‘OK’
48
6. Correlation and Regression
[Screenshot of the SAS menus. Callouts: click ‘File’ to open the SAS data set; click ‘Solutions’ to select the statistical procedure; select the dependent and explanatory variables.]
49
6. Correlation and Regression
Other regression models
  Polynomial regression
  Transformation
  Logistic regression