methodological foundations of biomedical informatics...
TRANSCRIPT
Introduction Hypothesis Design Data collection Data analysis Result report Discussion
Methodological Foundations of Biomedical Informatics
Lecture 4: Introduction to Biostatistics
Yixin Fang
Division of Biostatistics, Department of Population Health
New York University School of Medicine
September 22, 2015
Outline
1 Introduction
2 Hypothesis
3 Design
4 Data collection
5 Data analysis
6 Result report
7 Discussion
Definition and history
Statistics: Science of data analysis
Biostatistics: A branch of applied statistics
Mathematics, Statistics, Biostatistics, Bioinformatics
Two purposes
Two purposes of statistical analysis:
Prediction: for example, in the technology industry
Explanation: for example, in a medical school
Two relationships
If your goal is explanation, there are two subgoals:
Association
Causation
Democritus: "I would rather discover a single causal explanation than become king of the Persians".
An example: HRT
We want to know whether hormone replacement therapy (HRT) can reduce the risk of heart attack in women
Hypothesis
Design
Data collection
Data analysis
Result reporting
Hypotheses
Null hypothesis (test against) vs. alternative hypothesis (test for)
Hypotheses should depend only on the population; they cannot depend on the data
Good example: $H_0: \mu_1 = \mu_0$ vs. $H_1: \mu_1 \neq \mu_0$
Good example: $H_0: p_1 = p_0$ vs. $H_1: p_1 \neq p_0$
Bad example: $H_0: \bar{x}_1 = \bar{x}_0$ vs. $H_1: \bar{x}_1 \neq \bar{x}_0$
Bad example: $H_0: \hat{p}_1 = \hat{p}_0$ vs. $H_1: \hat{p}_1 \neq \hat{p}_0$
Two errors
Type-I error (probability $\alpha$): the null hypothesis is true, but the alternative hypothesis is concluded
Type-II error (probability $\beta$): the alternative hypothesis is true, but the null hypothesis is concluded
Power ($1 - \beta$): the probability of concluding the alternative hypothesis, given that the alternative hypothesis is true
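The two error rates can be illustrated by simulation. The sketch below is only an illustration, not part of the lecture: it assumes a two-arm trial with `n = 200` subjects per arm, hypothetical event probabilities, and a two-sided z-test for a difference in proportions at the 5% level. The rejection rate under the null estimates $\alpha$; the rejection rate under the alternative estimates the power $1 - \beta$.

```python
import math
import random


def reject_h0(p1, p0, n, rng, z_crit=1.96):
    """Simulate one two-arm trial and test H0: p1 = p0 (two-sided, 5% level)."""
    x1 = sum(rng.random() < p1 for _ in range(n))  # events in arm 1
    x0 = sum(rng.random() < p0 for _ in range(n))  # events in arm 0
    p1_hat, p0_hat = x1 / n, x0 / n
    var = p1_hat * (1 - p1_hat) / n + p0_hat * (1 - p0_hat) / n
    if var == 0:  # degenerate sample: no information to reject
        return False
    z = (p1_hat - p0_hat) / math.sqrt(var)
    return abs(z) >= z_crit


def rejection_rate(p1, p0, n, n_sim=2000, seed=0):
    """Fraction of simulated trials in which H0 is rejected."""
    rng = random.Random(seed)
    return sum(reject_h0(p1, p0, n, rng) for _ in range(n_sim)) / n_sim


# Hypothetical scenarios (the probabilities are assumptions for illustration).
type1_rate = rejection_rate(0.30, 0.30, n=200)  # H0 true: estimates alpha
power = rejection_rate(0.15, 0.30, n=200)       # H1 true: estimates 1 - beta
print(f"estimated type-I error rate: {type1_rate:.3f}")
print(f"estimated power:             {power:.3f}")
```

Increasing `n` or the gap between the two probabilities raises the estimated power, which is the idea behind a sample size calculation.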
Multiple comparisons
Simple comparison: $H_0: \mu_1 = \mu_0$ vs. $H_1: \mu_1 \neq \mu_0$
Multiple comparisons:
$H_0: \mu_1^{(m)} = \mu_0^{(m)}$, $m = 1, \dots, M$ vs. $H_1$: at least one $\mu_1^{(m)} \neq \mu_0^{(m)}$
HRT example continued
A population of women aged 40 and older
If everyone in the population took HRT, let $p_1$ be the proportion having a heart attack within 20 years
If no one in the population took HRT, let $p_0$ be the proportion having a heart attack within 20 years
$H_0: p_1 = p_0$ vs. $H_1: p_1 \neq p_0$
HRT example continued
In the population of women aged 40 and older who took HRT, $\mu_1^{(1)}$ is the mean age, $\mu_1^{(2)}$ is the mean household income, and $\mu_1^{(3)}$ is the mean years of education
In the population of women aged 40 and older who did not take HRT, $\mu_0^{(1)}$ is the mean age, $\mu_0^{(2)}$ is the mean household income, and $\mu_0^{(3)}$ is the mean years of education
$H_0: \mu_1^{(m)} = \mu_0^{(m)}$, $m = 1, \dots, 3$ vs. $H_1$: at least one $\mu_1^{(m)} \neq \mu_0^{(m)}$
Observational studies vs. Experiments
Observational study: An observational study observes individuals and measures variables of interest but does not attempt to influence the response. For example, case-control studies, cohort studies, and cross-sectional studies
Experiment: An experiment deliberately imposes some treatment on individuals in order to observe their response. The purpose of an experiment is to study whether the treatment causes a change in the response. For example, randomized controlled clinical trials (RCTs) and cluster RCTs
HRT example continued: rise and fall of HRT
We want to know whether hormone replacement therapy (HRT) can reduce the risk of heart attack
In 1992, some observational studies compared women who were taking HRT with others who were not. The conclusion was: HRT can reduce the risk of heart attack
In 2002, an experiment assigned women to either HRT or dummy pills that look and taste the same as the hormone pills, with the assignment done by coin toss. The conclusion was: HRT does not reduce the risk of heart attack.
Association or causation?
In this case, we trust the RCT
In the observational studies, women who chose to take HRT were very different from women who did not: they were richer and better educated and saw doctors more often. These women do many things to maintain their health. It is not surprising that they have fewer heart attacks
Therefore, in the observational studies, the effects of actually taking HRT are confounded with the characteristics of women who choose to take HRT. Confounders: characteristics (age, income, education, etc.) of women who choose HRT
Plan a design
Goal: association or causation
Design: observational study or experiment
Power analysis and sample size calculation (PASS)
Response variable vs. explanatory variables
Response variable; outcome variable; dependent variable; etc
Explanatory variable; predictors; independent variable; etc
(Main explanatory variable of interest along with covariates)
Quantitative variable vs. Categorical variable
Quantitative variable; numerical variable
include: count variable, continuous variable, etc.
Categorical variable
include: binary variable, multi-categorical variable
include: ordinal variable, nominal variable
Missing data
During data collection, missing data are a notorious problem
During data recording, make sure variable types are consistent
HRT example continued
Response variable: whether or not a heart attack occurred (0-1 or Yes-No)
Main predictor of interest: whether or not HRT was taken (0-1 or Yes-No)
Covariates: age, income, education (observed confounders, latent confounders)
HRT dataset 1
We want to know whether hormone replacement therapy (HRT) can reduce the risk of heart attack
An observational study compares women who were taking HRT with others who were not
Dataset 1
ID     Heart attack   HRT   Age   Income    Education
1      0              1     50    100,000   16
2      0              0     55    120,000   16
...    ...            ...   ...   ...       ...
N      1              0     49    90,000    18
HRT dataset 2
We want to know whether hormone replacement therapy (HRT) can reduce the risk of heart attack
An RCT assigns women to either HRT or dummy pills randomly
Dataset 2
ID     Heart attack   HRT   Age   Income    Education
1      0              0     50    100,000   16
...    ...            ...   ...   ...       ...
n      1              0     55    120,000   16
n + 1  1              1     58    200,000   12
...    ...            ...   ...   ...       ...
2n     0              1     49    90,000    18
Three steps
Descriptive analysis
Univariate analysis
Multivariate analysis
Step 1: Descriptive analysis
Numerical summary for categorical variable: frequency table
Graphical display for categorical variable: bar graph and pie chart
Numerical summary for quantitative variable: mean and standard deviation; median and interquartile range
Graphical display for quantitative variable: histogram and box-plot
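The numerical summaries above can be sketched with Python's standard library. The values below are made-up illustrations (not data from the lecture): `hrt` stands in for a categorical variable and `age` for a quantitative one.

```python
from collections import Counter
import statistics

# Hypothetical toy data, for illustration only.
hrt = [1, 0, 0, 1, 1, 0, 0, 0]           # categorical (binary) variable
age = [50, 55, 49, 58, 61, 47, 52, 66]   # quantitative variable

# Categorical variable: frequency table.
freq = Counter(hrt)

# Quantitative variable: mean and standard deviation.
mean_age = statistics.mean(age)
sd_age = statistics.stdev(age)  # sample standard deviation (n - 1 denominator)

# Quantitative variable: median and interquartile range.
median_age = statistics.median(age)
q1, _, q3 = statistics.quantiles(age, n=4)  # quartile cut points
iqr_age = q3 - q1

print("frequency table:", dict(freq))
print(f"mean = {mean_age}, sd = {sd_age:.2f}")
print(f"median = {median_age}, IQR = {iqr_age}")
```

The mean and standard deviation suit roughly symmetric data; the median and interquartile range are less sensitive to outliers and skewness.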
Step 2: Univariate analysis
Univariate analysis considers the relationship between the response variable and each single explanatory variable
Between quantitative response and quantitative explanatory variable
Between quantitative response and categorical explanatory variable
Between categorical response and quantitative explanatory variable
Between categorical response and categorical explanatory variable
Univariate analysis (1)
Relationship between quantitative response and quantitative explanatory variable
Graphical display: Scatter-plot
Pearson’s correlation r
Spearman’s correlation ρ
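As a rough sketch of the two correlation coefficients, the code below computes Pearson's $r$ directly from its definition and Spearman's $\rho$ as Pearson's $r$ applied to ranks (ties receive average ranks). The helper names and the toy data are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson's correlation r between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's correlation: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r = pearson(x, y)
rho = spearman(x, y)
print(f"Pearson r = {r:.4f}, Spearman rho = {rho:.4f}")
```

In this example the relationship is monotone but curved, so Spearman's $\rho$ equals 1 while Pearson's $r$ is slightly below 1.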
Univariate analysis (2)
Relationship between quantitative response and categorical explanatory variable
Relationship between categorical response and quantitative explanatory variable
Graphical display: Box-plot side by side; Bar graph with error bar
T-test; Analysis of Variance (ANOVA)
Wilcoxon rank sum test; Kruskal-Wallis test
Univariate analysis (3)
Relationship between categorical response and categorical explanatory variable
Graphical display: Two-by-two table; Contingency table
Chi-square test
Fisher exact test
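A minimal sketch of the chi-square test for a two-by-two table follows. It assumes Pearson's statistic without continuity correction and uses the identity $\mathrm{Prob}(\chi^2_1 \geq t) = \mathrm{erfc}(\sqrt{t/2})$ for the p-value; the counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and its p-value under a chi-square distribution
    with one degree of freedom.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # For one degree of freedom, P(chi2 >= t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = HRT / no HRT, columns = heart attack yes / no.
stat, p_value = chi_square_2x2(30, 70, 45, 55)
print(f"chi-square = {stat:.3f}, p-value = {p_value:.4f}")
```

When expected cell counts are small, the chi-square approximation is unreliable, which is the usual reason to prefer the Fisher exact test.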
Univariate analysis continued
Univariate analysis also considers the relationships among explanatory variables
Between quantitative predictor and quantitative predictor
Between quantitative predictor and categorical predictor
Between categorical predictor and quantitative predictor
Between categorical predictor and categorical predictor
HRT example continued: RCT data
Let $p_1$ be the population proportion of heart attack under HRT
Let $p_0$ be the population proportion of heart attack under no HRT
$H_0: p_1 = p_0$ vs. $H_1: p_1 \neq p_0$
Let $\hat{p}_1$ be the proportion of heart attack in the intervention arm
Let $\hat{p}_0$ be the proportion of heart attack in the control arm
Consider the test statistic
$$T = Z^2 = \frac{(\hat{p}_1 - \hat{p}_0)^2}{\hat{p}_1(1 - \hat{p}_1)/n + \hat{p}_0(1 - \hat{p}_0)/n}$$
P-value
Under the null hypothesis, $T$ approximately follows the chi-square distribution with one degree of freedom, $\chi^2_1$
Based on the observed data, $T = t_{obs}$
The p-value is the probability of obtaining a test statistic larger than or equal to $t_{obs}$ (that is, a value at least as extreme as $t_{obs}$) if the test statistic were calculated from an independent dataset generated under the null hypothesis
In mathematical terms, p-value $= \mathrm{Prob}\{T \geq t_{obs} \mid H_0\}$, which can be calculated easily because $T \sim \chi^2_1$ under $H_0$
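The calculation can be sketched as follows, assuming hypothetical counts of 30 and 45 heart attacks in arms of $n = 100$ each (these numbers are not from the lecture). Because $T = Z^2$ with $Z$ approximately standard normal under $H_0$, the tail probability of $\chi^2_1$ can be written as $\mathrm{erfc}(\sqrt{t_{obs}/2})$.

```python
import math


def two_proportion_test(x1, x0, n):
    """Return (t_obs, p_value) for H0: p1 = p0 with n subjects per arm.

    x1 and x0 are the numbers of events in the intervention and control arms.
    """
    p1_hat, p0_hat = x1 / n, x0 / n
    variance = p1_hat * (1 - p1_hat) / n + p0_hat * (1 - p0_hat) / n
    t_obs = (p1_hat - p0_hat) ** 2 / variance
    # P(chi-square with 1 df >= t) = P(|Z| >= sqrt(t)) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(t_obs / 2))
    return t_obs, p_value


# Hypothetical trial results, for illustration only.
t_obs, p_value = two_proportion_test(x1=30, x0=45, n=100)
print(f"t_obs = {t_obs:.3f}, p-value = {p_value:.4f}")
```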
Permutation
How can we find the p-value if we do not know the distribution of $T$ under $H_0$ explicitly, or if the distribution is known but very complicated?
We can use permutation to approximate the p-value
Based on the observed data, we have $t_{obs}$
If we randomly permute the arm assignment once (that is, randomly assign $n$ out of the $2n$ subjects to arm HRT = 1 and the others to arm HRT = 0) while leaving the other variables unchanged, we obtain $T^{(1)}$
If we repeat the above permutation process $M$ times (say, $M = 10{,}000$), then we obtain $T^{(m)}$, $m = 1, \dots, M$
Finally, the p-value can be approximated by
$$\frac{\#\{m : T^{(m)} \geq t_{obs}\}}{M}$$
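The permutation procedure can be sketched as below. The data are hypothetical (100 subjects per arm, 30 and 45 heart attacks), and the function names are illustrative. Shuffling the arm labels keeps the arm sizes fixed, which matches randomly assigning $n$ of the $2n$ subjects to HRT = 1.

```python
import random


def wald_statistic(outcome, arm):
    """T = Z^2 for the difference in event proportions between the two arms."""
    n1 = sum(arm)
    n0 = len(arm) - n1
    p1 = sum(y for y, a in zip(outcome, arm) if a == 1) / n1
    p0 = sum(y for y, a in zip(outcome, arm) if a == 0) / n0
    variance = p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0
    return (p1 - p0) ** 2 / variance if variance > 0 else 0.0


def permutation_p_value(outcome, arm, n_perm=10_000, seed=0):
    """Approximate the p-value by the share of permuted T values >= t_obs."""
    rng = random.Random(seed)
    t_obs = wald_statistic(outcome, arm)
    shuffled = list(arm)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # re-assign arm labels; outcomes stay in place
        if wald_statistic(outcome, shuffled) >= t_obs:
            count += 1
    return count / n_perm


# Hypothetical data: first 100 subjects in the HRT arm, last 100 in control.
outcome = [1] * 30 + [0] * 70 + [1] * 45 + [0] * 55
arm = [1] * 100 + [0] * 100
p_perm = permutation_p_value(outcome, arm)
print(f"permutation p-value = {p_perm:.4f}")
```

The result depends on the random seed and on $M$; a larger $M$ reduces the Monte Carlo error of the approximation.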
Magic 0.05
Fisher proposed: if the p-value < 0.05, the result is statistically significant; otherwise, the result is not statistically significant
In other words, the type-I error rate $\alpha$ of the above test is 0.05, because even if $H_0$ is true, there is still a 0.05 chance of obtaining a $t_{obs}$ whose p-value is less than 0.05, which rejects $H_0$
Confidence interval
The difference in proportions $\hat{d} = \hat{p}_1 - \hat{p}_0$ is a point estimate of the population difference in proportions $d = p_1 - p_0$
The standard deviation of $\hat{d}$ is $\sqrt{p_1(1 - p_1)/n + p_0(1 - p_0)/n}$, which can be estimated by the standard error of $\hat{d}$,
$$SE = \sqrt{\hat{p}_1(1 - \hat{p}_1)/n + \hat{p}_0(1 - \hat{p}_0)/n}$$
Then a 95% confidence interval estimate of $d$ is
$$\hat{d} \pm 1.96 \times SE = (\hat{d} - 1.96 \times SE,\ \hat{d} + 1.96 \times SE)$$
Confidence interval estimates can be used for hypothesis testing: if zero is not in the above 95% CI, then reject $H_0$; otherwise do not reject $H_0$. The type-I error rate of this test is 0.05
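A short sketch of the interval calculation, again with hypothetical counts (30 and 45 heart attacks among $n = 100$ per arm); the function name is illustrative.

```python
import math


def difference_ci(x1, x0, n, z=1.96):
    """Point estimate, standard error, and 95% CI for d = p1 - p0."""
    p1_hat, p0_hat = x1 / n, x0 / n
    d_hat = p1_hat - p0_hat
    se = math.sqrt(p1_hat * (1 - p1_hat) / n + p0_hat * (1 - p0_hat) / n)
    return d_hat, se, (d_hat - z * se, d_hat + z * se)


# Hypothetical trial results, for illustration only.
d_hat, se, (lower, upper) = difference_ci(x1=30, x0=45, n=100)
reject_h0 = not (lower <= 0 <= upper)  # reject H0 when 0 lies outside the CI
print(f"d_hat = {d_hat:.3f}, SE = {se:.4f}")
print(f"95% CI = ({lower:.4f}, {upper:.4f}), reject H0: {reject_h0}")
```

With these counts the interval lies entirely below zero, so the test based on the interval rejects $H_0$.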
Bootstrap
How can we find the standard error (SE) if we do not have an explicit formula or if the formula is too complicated?
We can use the bootstrap to find an approximate SE
Based on the observed data, we have $\hat{d}$
The dataset is a matrix of $2n$ rows. We randomly select a row, say $i_1$, put it back, and then randomly select another row, say $i_2$. We keep doing this until we have $2n$ rows $i_1, \dots, i_{2n}$. This gives a bootstrap sample of the same size as the original sample. Based on this bootstrap sample, we calculate $\hat{d}^{(1)}$
Repeat the above bootstrap process $M$ times (say $M = 500$) to obtain $\hat{d}^{(m)}$, $m = 1, \dots, M$
Finally, we can approximate the standard error by
$$SE_{boot} = \sqrt{\sum_{m=1}^{M} \left(\hat{d}^{(m)} - \bar{\hat{d}}\right)^2 / (M - 1)}$$
where $\bar{\hat{d}}$ is the mean of the $M$ bootstrap estimates
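The bootstrap steps can be sketched as follows. Each row is reduced to a hypothetical `(heart_attack, hrt)` pair, and the data are invented for illustration. `random.choices` samples rows with replacement, and `statistics.stdev` uses the $M - 1$ denominator from the formula.

```python
import random
import statistics


def diff_in_proportions(rows):
    """d_hat = event proportion in the HRT arm minus that in the control arm."""
    treated = [y for y, a in rows if a == 1]
    control = [y for y, a in rows if a == 0]
    # Assumes both arms appear in the sample, which is practically certain
    # for arms of this size.
    return sum(treated) / len(treated) - sum(control) / len(control)


def bootstrap_se(rows, n_boot=500, seed=0):
    """Bootstrap standard error of d_hat from n_boot resampled datasets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(rows, k=len(rows))  # 2n rows, with replacement
        estimates.append(diff_in_proportions(sample))
    return statistics.stdev(estimates)  # uses the (M - 1) denominator


# Hypothetical dataset of 2n = 200 rows: (heart_attack, hrt).
rows = [(1, 1)] * 30 + [(0, 1)] * 70 + [(1, 0)] * 45 + [(0, 0)] * 55
d_hat = diff_in_proportions(rows)
se_boot = bootstrap_se(rows)
print(f"d_hat = {d_hat:.3f}, bootstrap SE = {se_boot:.4f}")
```

For this simple estimator the bootstrap SE should land close to the formula-based SE; the method becomes useful when no such formula is available.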
HRT example continued: Multiple comparisons
We want to know whether hormone replacement therapy (HRT) can reduce the risk of heart attack, stroke, hypertension, diabetes, or lung cancer. Call them Disease 1 to 5
Let $p_{j1}$ be the population proportion of disease $j$ under HRT
Let $p_{j0}$ be the population proportion of disease $j$ under no HRT
$H_0: p_{j1} = p_{j0}$ for all $j = 1, \dots, 5$ vs. $H_1: p_{j1} \neq p_{j0}$ for some $j \in \{1, \dots, 5\}$
Let $\hat{p}_{j1}$ be the proportion of disease $j$ in the intervention arm
Let $\hat{p}_{j0}$ be the proportion of disease $j$ in the control arm
Let the test statistics be
$$T_j = Z_j^2 = \frac{(\hat{p}_{j1} - \hat{p}_{j0})^2}{\hat{p}_{j1}(1 - \hat{p}_{j1})/n + \hat{p}_{j0}(1 - \hat{p}_{j0})/n}$$
Family-wise error rate (FWER)
For each test statistic $T_j$, we have the observed value $t_j^{obs}$ and hence a p-value $p_j$
Now, as usual, we apply the magic 0.05 criterion for significance. That is, if the p-value $p_j$ is less than 0.05, we conclude that HRT is significantly associated with disease $j$, $j = 1, \dots, 5$
Then, what is the type-I error rate? That is, what is the probability that we will reject $H_0$ if $H_0$ is true?
Assuming the five tests are independent, the family-wise error rate is FWER $= \mathrm{Prob}(\text{reject } H_0 \mid H_0) = 1 - \mathrm{Prob}(\text{accept } H_0 \mid H_0) = 1 - \mathrm{Prob}(\text{accept } H_{j0},\ j = 1, \dots, 5 \mid H_0) = 1 - (1 - 0.05)^5 \approx 0.23$
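The arithmetic can be checked with a few lines of code. The formula assumes the tests are independent, and the function name is illustrative.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false rejection among independent tests."""
    return 1 - (1 - alpha) ** n_tests


fwer = family_wise_error_rate(alpha=0.05, n_tests=5)
print(f"FWER for 5 tests at 0.05 each: {fwer:.4f}")  # 0.2262

# The FWER grows quickly with the number of comparisons.
for n_tests in (1, 5, 10, 20):
    print(n_tests, round(family_wise_error_rate(0.05, n_tests), 4))
```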
Bonferroni correction
Assume that there are $J$ (say $J = 5$) comparisons
Use $\alpha = 0.05/J$ (for $J = 5$, $\alpha = 0.01$) as the criterion to claim significance
For the above example, FWER $= \mathrm{Prob}(\text{reject } H_0 \mid H_0) = 1 - \mathrm{Prob}(\text{accept } H_0 \mid H_0) = 1 - \mathrm{Prob}(\text{accept } H_{j0},\ j = 1, \dots, 5 \mid H_0) = 1 - (1 - 0.01)^5 \approx 0.049$, which is just under 0.05
The Bonferroni method is conservative. Other methods include the Sidak method, the Scheffe method, and the more recent false discovery rate (FDR) approach
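As a sketch, the Bonferroni rule compares each p-value with $0.05/J$. The p-values below are invented for illustration, and the Sidak threshold $1 - (1 - 0.05)^{1/J}$ is shown only for comparison, under the assumption of independent tests.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis j when p_j < alpha / J, where J = len(p_values)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical p-values for Diseases 1 to 5.
p_values = [0.004, 0.03, 0.2, 0.011, 0.009]
decisions = bonferroni_reject(p_values)
print("Bonferroni decisions:", decisions)

# FWER of the Bonferroni rule for J = 5 independent tests.
J = 5
fwer_bonferroni = 1 - (1 - 0.05 / J) ** J
print(f"FWER with per-test level 0.01: {fwer_bonferroni:.4f}")  # 0.0490

# Sidak per-test level: makes the FWER exactly 0.05 under independence.
sidak_level = 1 - (1 - 0.05) ** (1 / J)
print(f"Sidak per-test level: {sidak_level:.5f}")
```

The Sidak level is slightly larger than 0.01, which is one way to see that Bonferroni is conservative.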
Step 3: Multivariate analysis
There are one or more response variables
There is more than one explanatory variable
Regression analysis is the most popular multivariate analysis
Regression analysis
The two most common regression analyses:
Response variable is continuous: linear regression
Response variable is binary: logistic regression
More regression analysis
Generalized linear regression
Mixed-effect regression
Nonlinear regression
Classification and regression tree (CART)
Etc.
HRT example continued: Observational data
Outcome: whether or not a heart attack occurred
Variable of interest: whether or not HRT was taken
Covariates: age, income, and education
Logistic regression
Logistic regression model:
$$\text{log-odds of heart attack} = \alpha + \beta_1 \times \text{HRT indicator} + \beta_2 \times \text{age} + \beta_3 \times \text{income} + \beta_4 \times \text{education}$$
Fit the model to the data and obtain the point estimate $\hat{\beta}_j$ for $\beta_j$, along with the 95% confidence interval estimate $\hat{\beta}_j \pm 1.96 \times SE_j$ and the p-value for $H_{j0}: \beta_j = 0$ vs. $H_{j1}: \beta_j \neq 0$
Odds ratio
For example, suppose the estimated coefficient of the HRT effect is $\hat{\beta}_1 = -1$
First, because it is negative, the association between heart attack and HRT is negative. That is, taking HRT is associated with a lower chance of heart attack
Then note that $\exp(-1) \approx 0.37$. The odds of heart attack for a woman taking HRT are 0.37 times the odds for a woman not taking HRT, with the covariates held fixed. Equivalently, the odds of heart attack for a woman not taking HRT are $\exp(1) \approx 2.72$ times the odds for a woman taking HRT
Caution: association from an observational study does not imply causation
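The conversion from a log-odds coefficient to an odds ratio is a one-line calculation. In the sketch below, the coefficient $-1$ comes from the example, while the standard error of 0.4 is a made-up value used only to show how a 95% confidence interval for the odds ratio follows from exponentiating $\hat{\beta}_1 \pm 1.96 \times SE_1$.

```python
import math

beta_hrt = -1.0   # estimated log-odds coefficient from the example
se_hrt = 0.4      # hypothetical standard error, for illustration only

odds_ratio = math.exp(beta_hrt)   # odds with HRT relative to odds without HRT
inverse_ratio = 1 / odds_ratio    # odds without HRT relative to odds with HRT

# Exponentiate the interval endpoints to get a CI on the odds-ratio scale.
ci_lower = math.exp(beta_hrt - 1.96 * se_hrt)
ci_upper = math.exp(beta_hrt + 1.96 * se_hrt)

print(f"odds ratio = {odds_ratio:.2f}")        # 0.37
print(f"inverse odds ratio = {inverse_ratio:.2f}")  # 2.72
print(f"95% CI for the odds ratio = ({ci_lower:.2f}, {ci_upper:.2f})")
```

An interval for the odds ratio that excludes 1 corresponds to an interval for $\beta_1$ that excludes 0.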
Reporting statistical findings
INTEGRITY!!!
Make sure you know whether you are doing data mining or statistical inference
State honestly which of the two you are doing
Data mining: You are given some data; You "squeeze" the data very hard; You get some interesting finding; Then you propose a hypothesis
Statistical inference: You specify a hypothesis; You design a study and collect some data; You get some interesting finding; Then you confirm or discard the hypothesis
Frequentist vs. Bayesian
There are two schools of statistical inference, although in this lecture we focus on the frequentist school
Frequentist: From the distribution of the data given the parameter, we derive the likelihood function of the parameter given the data. Based on the likelihood function, we obtain a point estimate of the parameter and make inference
Bayesian: From the prior distribution of the parameter and the distribution of the data given the parameter, we derive the posterior distribution of the parameter given the data. Based on the posterior distribution, which includes all the updated information about the parameter, we examine the properties of the parameter
Data mining
Are they doing the same thing? (Are you stealing our jobs?)
Data mining
Machine learning
Statistical learning
Two types of learning
Supervised learning: classification; regression; etc
Unsupervised learning: cluster analysis; pattern recognition; etc
47 / 50
Methodological Foundations of Biomedical Informatics , Lecture 4: Introduction to Biostatistics
Introduction Hypothesis Design Data collection Data analysis Result report Discussion
Growing our toolbox
“If all you have is a hammer, everything looks like a nail”
Our toolbox should include traditional methods such as the t-test and linear regression
Our toolbox should also include new tools such as mixed-effect models, CART, multiple imputation, and newly developed methods (boosting and random forest)
History of data
We conclude this lecture with a brief history of data. Let n be the number of subjects and p be the number of variables
Data where n is large and p is small (Before R. A. Fisher)
Data where n is small and p is small (R. A. Fisher)
Data where n is small and p is large (High-dimensional data)
Data where n is large and p is large (Big data)
Questions?