

Methodological Foundations of Biomedical Informatics
Lecture 4: Introduction to Biostatistics

Yixin Fang

Division of Biostatistics, Department of Population Health
New York University School of Medicine

September 22, 2015



Outline

1 Introduction

2 Hypothesis

3 Design

4 Data collection

5 Data analysis

6 Result report

7 Discussion




Definition and history

Statistics: Science of data analysis

Biostatistics: A branch of applied statistics

Mathematics, Statistics, Biostatistics, Bioinformatics




Two purposes

Two purposes of statistical analysis:

Prediction: for example, in the technology industry

Explanation: for example, in medical schools




Two relationships

If your goal is explanation, there are two subgoals:

Association

Causation

Democritus: “I would rather discover a single causal explanation than become king of the Persians”.




An example: HRT

Question: can hormone replacement therapy (HRT) reduce the risk of heart attack in women?

Hypothesis

Design

Data collection

Data analysis

Result reporting




Hypotheses

Null hypothesis (test against) vs. Alternative hypothesis (test for)

Hypotheses should depend only on the population; they cannot depend on the data

Good example: $H_0: \mu_1 = \mu_0$ vs. $H_1: \mu_1 \neq \mu_0$

Good example: $H_0: p_1 = p_0$ vs. $H_1: p_1 \neq p_0$

Bad example: $H_0: \bar{x}_1 = \bar{x}_0$ vs. $H_1: \bar{x}_1 \neq \bar{x}_0$

Bad example: $H_0: \hat{p}_1 = \hat{p}_0$ vs. $H_1: \hat{p}_1 \neq \hat{p}_0$




Two errors

Type-I error (probability $\alpha$): the null hypothesis is true, but the alternative hypothesis is concluded

Type-II error (probability $\beta$): the alternative hypothesis is true, but the null hypothesis is concluded

Power ($1 - \beta$): the probability of concluding the alternative hypothesis, given that the alternative hypothesis is true




Multiple comparisons

Simple comparison: $H_0: \mu_1 = \mu_0$ vs. $H_1: \mu_1 \neq \mu_0$

Multiple comparisons:

$H_0: \mu_1^{(m)} = \mu_0^{(m)},\ m = 1, \dots, M$ vs. $H_1$: at least one $\mu_1^{(m)} \neq \mu_0^{(m)}$




HRT example continued

A population of women aged 40 and older

If everyone in the population took HRT, let $p_1$ be the proportion who would have a heart attack within 20 years

If no one in the population took HRT, let $p_0$ be the proportion who would have a heart attack within 20 years

$H_0: p_1 = p_0$ vs. $H_1: p_1 \neq p_0$





In the population of women aged 40 and older who took HRT, $\mu_1^{(1)}$ is the mean age, $\mu_1^{(2)}$ is the mean household income, and $\mu_1^{(3)}$ is the mean years of education

In the population of women aged 40 and older who did not take HRT, $\mu_0^{(1)}$ is the mean age, $\mu_0^{(2)}$ is the mean household income, and $\mu_0^{(3)}$ is the mean years of education

$H_0: \mu_1^{(m)} = \mu_0^{(m)},\ m = 1, 2, 3$ vs. $H_1$: at least one $\mu_1^{(m)} \neq \mu_0^{(m)}$




Observational studies vs. Experiments

Observational study: An observational study observes individuals and measures variables of interest but does not attempt to influence the response. Examples: case-control studies, cohort studies, and cross-sectional studies

Experiment: An experiment deliberately imposes some treatment on individuals in order to observe their response. The purpose of an experiment is to study whether the treatment causes a change in the response. Examples: randomized controlled trials (RCTs) and cluster RCTs




HRT example continued: rise and fall of HRT

Question: can hormone replacement therapy (HRT) reduce the risk of heart attack?

In 1992, some observational studies compared women who were taking HRT with others who were not. The conclusion: HRT can reduce the risk of heart attack

In 2002, an experiment assigned women to either HRT or dummy pills that looked and tasted the same as the hormone pills, with the assignment decided by coin toss. The conclusion: HRT does not reduce the risk of heart attack.




Association or causation?

In this case, we trust the RCT

In an observational study, women who choose to take HRT are very different from women who do not: they are richer, better educated, and see doctors more often. These women do many things to maintain their health. It is not surprising that they have fewer heart attacks

Therefore, in the observational studies, the effects of actually taking HRT are confounded with the characteristics of women who choose to take HRT. Confounders: characteristics (age, income, education, etc.) of women who choose HRT




Plan a design

Goal: association or causation

Design: observational study or experiment

Power analysis and sample size calculation (PASS)
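
A sample size calculation can be sketched as follows for comparing two proportions. This is only an illustration under stated assumptions: the normal-approximation formula with unpooled variances is a common textbook choice that the slides do not spell out, the function name `sample_size_per_arm` is hypothetical, and the planning values (3% versus 5%) are invented.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p1, p0, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided test of H0: p1 = p0.

    Assumes the normal approximation with unpooled variances and
    two arms of equal size.
    """
    if p1 == p0:
        raise ValueError("p1 and p0 must differ under the alternative")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p0 * (1 - p0)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p0) ** 2)


# Hypothetical planning values: 3% risk with HRT versus 5% without.
n_per_arm = sample_size_per_arm(p1=0.03, p0=0.05)
print(n_per_arm)
```

Smaller differences between the two proportions, or a higher target power, require more subjects per arm.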




Response variable vs. explanatory variables

Response variable; outcome variable; dependent variable; etc

Explanatory variable; predictors; independent variable; etc

(Main explanatory variable of interest along with covariates)




Quantitative variable vs. Categorical variable

Quantitative variable; numerical variable
include: count variable, continuous variable, etc.

Categorical variable
include: binary variable, multi-categorical variable
    include: ordinal variable, nominal variable




Missing data

During data collection, missing data are a notorious problem

During data recording, make sure variable types are consistent





Response variable: whether or not a heart attack occurred (0-1 or Yes-No)

Main predictor of interest: whether or not HRT was taken (0-1 or Yes-No)

Covariates: age, income, education (observed confounders; there may also be latent confounders)




HRT dataset 1


An observational study compares women who were taking HRT with others who were not

Dataset 1

ID     Heart attack   HRT   Age   Income    Education
1      0              1     50    100,000   16
2      0              0     55    120,000   16
...    ...            ...   ...   ...       ...
N      1              0     49    90,000    18




HRT dataset 2


An RCT assigns women to either HRT or to dummy pills randomly

Dataset 2

ID     Heart attack   HRT   Age   Income    Education
1      0              0     50    100,000   16
...    ...            ...   ...   ...       ...
n      1              0     55    120,000   16
n + 1  1              1     58    200,000   12
...    ...            ...   ...   ...       ...
2n     0              1     49    90,000    18




Three steps

Descriptive analysis

Univariate analysis

Multivariate analysis




Step 1: Descriptive analysis

Numerical summary for categorical variable: frequency table

Graphical display for categorical variable: bar graph and pie chart

Numerical summary for quantitative variable: mean and standard deviation; median and interquartile range

Graphical display for quantitative variable: histogram and box-plot
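
The numerical summaries above can be computed with the Python standard library. This is a minimal sketch; the toy values for `hrt` and `age` are invented for illustration and do not come from the HRT studies.

```python
import statistics
from collections import Counter

# Hypothetical toy data, invented for illustration only.
hrt = [1, 0, 0, 1, 1, 0, 1, 0]          # categorical (binary) variable
age = [50, 55, 49, 58, 61, 47, 52, 66]  # quantitative variable

# Numerical summary for a categorical variable: frequency table.
freq = Counter(hrt)

# Numerical summaries for a quantitative variable.
mean_age = statistics.mean(age)
sd_age = statistics.stdev(age)              # sample standard deviation
median_age = statistics.median(age)
q1, _, q3 = statistics.quantiles(age, n=4)  # quartile cut points
iqr_age = q3 - q1                           # interquartile range

print(dict(freq))
print(mean_age, sd_age, median_age, iqr_age)
```

The mean and standard deviation are sensitive to outliers, while the median and interquartile range are more robust, which is why both pairs are usually reported.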




Step 2: Univariate analysis

Univariate analysis considers the relationship between the response variable and each single explanatory variable

Between quantitative response and quantitative explanatory variable

Between quantitative response and categorical explanatory variable

Between categorical response and quantitative explanatory variable

Between categorical response and categorical explanatory variable




Univariate analysis (1)

Relationship between quantitative response and quantitative explanatory variable

Graphical display: Scatter-plot

Pearson’s correlation r

Spearman’s correlation ρ
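
The two correlation coefficients can be sketched in plain Python as follows. The helper names `pearson`, `ranks`, and `spearman` are hypothetical, and the data are invented; Spearman's correlation is computed here as Pearson's correlation of the ranks, with tied values given their average rank.

```python
import math


def pearson(x, y):
    """Pearson's correlation r between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's correlation: Pearson's r applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y))
```

In this toy example the relationship is monotone but curved, so Spearman's correlation equals 1 while Pearson's correlation is slightly below 1.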





Relationship between quantitative response and categorical explanatory variable

Relationship between categorical response and quantitative explanatory variable

Graphical display: Box-plot side by side; Bar graph with error bar

T-test; Analysis of Variance (ANOVA)

Wilcoxon rank sum test; Kruskal-Wallis test





Relationship between categorical response and categorical explanatory variable

Graphical display: Two-by-two table; Contingency table

Chi-square test

Fisher exact test
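
A chi-square test on a two-by-two table can be sketched as below. The function name `chi_square_2x2` and the counts are hypothetical. The sketch uses Pearson's statistic without continuity correction and relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so the p-value can be obtained from `math.erfc`.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction. Assumes every row and column total is positive.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # P(chi2_1 >= stat) = P(|Z| >= sqrt(stat)) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows are HRT yes/no, columns are heart attack yes/no.
stat, p_value = chi_square_2x2(10, 90, 20, 80)
print(stat, p_value)
```

When expected cell counts are small, the chi-square approximation is unreliable, which is the usual reason for preferring the Fisher exact test.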




Univariate analysis continued

Univariate analysis also considers relationships among explanatory variables

Between quantitative predictor and quantitative predictor

Between quantitative predictor and categorical predictor

Between categorical predictor and quantitative predictor

Between categorical predictor and categorical predictor




HRT example continued: RCT data

Let $p_1$ be the population proportion of heart attack if HRT

Let $p_0$ be the population proportion of heart attack if no HRT

$H_0: p_1 = p_0$ vs. $H_1: p_1 \neq p_0$

Let $\hat{p}_1$ be the proportion of heart attack in the intervention arm

Let $\hat{p}_0$ be the proportion of heart attack in the control arm

Consider the test statistic

$$T = Z^2 = \frac{(\hat{p}_1 - \hat{p}_0)^2}{\hat{p}_1(1 - \hat{p}_1)/n + \hat{p}_0(1 - \hat{p}_0)/n}$$




P-value

Under the null hypothesis, $T$ follows (approximately, for large $n$) the chi-square distribution $\chi^2_1$

Based on the observed data, $T = t_{\text{obs}}$

The p-value is the probability of obtaining a test statistic larger than or equal to $t_{\text{obs}}$ (that is, a value at least as extreme as $t_{\text{obs}}$) if the test statistic were calculated from an independent dataset generated under the null hypothesis

In mathematical terms, p-value $= \Pr\{T \geq t_{\text{obs}} \mid H_0\}$, which can be calculated easily because $T \sim \chi^2_1$ under $H_0$
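
The test statistic and its chi-square p-value can be computed as in the following sketch. The function name `two_proportion_test` and the event counts are hypothetical. The tail probability of a chi-square variable with 1 degree of freedom is obtained through `math.erfc`, using the fact that such a variable is the square of a standard normal.

```python
import math


def two_proportion_test(x1, x0, n):
    """T = Z^2 for two arms of size n with x1 and x0 events.

    Returns (t_obs, p_value), where p_value = P(chi2_1 >= t_obs).
    """
    p1_hat, p0_hat = x1 / n, x0 / n
    variance = p1_hat * (1 - p1_hat) / n + p0_hat * (1 - p0_hat) / n
    if variance == 0:
        raise ValueError("statistic undefined when both arms are all 0 or all 1")
    t_obs = (p1_hat - p0_hat) ** 2 / variance
    # P(chi2_1 >= t) = P(|Z| >= sqrt(t)) = erfc(sqrt(t / 2))
    p_value = math.erfc(math.sqrt(t_obs / 2))
    return t_obs, p_value


# Hypothetical counts: 30 heart attacks among 1000 women on HRT,
# 50 among 1000 women on placebo.
t_obs, p_value = two_proportion_test(x1=30, x0=50, n=1000)
print(t_obs, p_value)
```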




Permutation

How can we find the p-value if we don’t know the distribution of $T$ under $H_0$ explicitly, or if the distribution is known but very complicated?

We can use permutation to approximate the p-value

Based on the observed data, we have $t_{\text{obs}}$

If we randomly permute the arm assignment once (that is, randomly assign $n$ out of the $2n$ subjects to arm HRT = 1 and the others to arm HRT = 0), while the other variables are unchanged, we obtain $T^{(1)}$

If we repeat the above permutation process $M$ times (say, $M = 10{,}000$), then we obtain $T^{(m)}, m = 1, \dots, M$

Finally, the p-value can be approximated by

$$\frac{\#\{m : T^{(m)} \geq t_{\text{obs}}\}}{M}$$
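
The permutation procedure above can be sketched as follows. The function names `t_statistic` and `permutation_p_value`, the fixed random seed, and the toy data are assumptions made for illustration. Shuffling the arm labels keeps $n$ subjects in each arm while leaving the outcomes unchanged.

```python
import math
import random


def t_statistic(outcome, arm):
    """T = Z^2 comparing event proportions between arm 1 and arm 0."""
    n1 = sum(arm)
    n0 = len(arm) - n1
    x1 = sum(y for y, a in zip(outcome, arm) if a == 1)
    x0 = sum(outcome) - x1
    p1_hat, p0_hat = x1 / n1, x0 / n0
    variance = p1_hat * (1 - p1_hat) / n1 + p0_hat * (1 - p0_hat) / n0
    if variance == 0:
        return 0.0 if p1_hat == p0_hat else math.inf
    return (p1_hat - p0_hat) ** 2 / variance


def permutation_p_value(outcome, arm, M=10_000, seed=0):
    """Approximate p-value: share of permuted statistics >= t_obs."""
    rng = random.Random(seed)
    t_obs = t_statistic(outcome, arm)
    shuffled = list(arm)
    count = 0
    for _ in range(M):
        rng.shuffle(shuffled)  # random reassignment of the arm labels
        if t_statistic(outcome, shuffled) >= t_obs:
            count += 1
    return count / M


# Hypothetical trial: 20/50 events in arm 1 versus 5/50 in arm 0.
outcome = [1] * 20 + [0] * 30 + [1] * 5 + [0] * 45
arm = [1] * 50 + [0] * 50
print(permutation_p_value(outcome, arm, M=2000))
```

A larger $M$ gives a more precise approximation at the cost of more computation.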




Magic 0.05

Fisher proposed: if the p-value < 0.05, the result is statistically significant; otherwise, the result is not statistically significant

In other words, the type-I error rate $\alpha$ of the above test is 0.05, because even if $H_0$ is true, there is still a 0.05 chance of obtaining a $t_{\text{obs}}$ whose p-value is less than 0.05, which rejects $H_0$




Confidence interval

The difference in proportions $\hat{d} = \hat{p}_1 - \hat{p}_0$ is a point estimate of the population difference in proportions $d = p_1 - p_0$

The standard deviation of $\hat{d}$ is $\sqrt{p_1(1 - p_1)/n + p_0(1 - p_0)/n}$, which can be estimated by the standard error of $\hat{d}$,

$$SE = \sqrt{\hat{p}_1(1 - \hat{p}_1)/n + \hat{p}_0(1 - \hat{p}_0)/n}$$

Then a 95% confidence interval estimate of $d$ is

$$\hat{d} \pm 1.96 \times SE = (\hat{d} - 1.96 \times SE,\ \hat{d} + 1.96 \times SE)$$

The confidence interval estimate can be used for hypothesis testing: if zero is not in the above 95% CI, then reject $H_0$; otherwise accept $H_0$. The type-I error rate of this test is 0.05
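
The interval can be computed directly from the two arm counts, as in this sketch. The function name `difference_ci` and the counts are hypothetical.

```python
import math


def difference_ci(x1, x0, n, z=1.96):
    """Point estimate, SE, and CI for d = p1 - p0 with n subjects per arm."""
    p1_hat, p0_hat = x1 / n, x0 / n
    d_hat = p1_hat - p0_hat
    se = math.sqrt(p1_hat * (1 - p1_hat) / n + p0_hat * (1 - p0_hat) / n)
    return d_hat, se, (d_hat - z * se, d_hat + z * se)


# Hypothetical counts: 30 events among 1000 on HRT, 50 among 1000 on placebo.
d_hat, se, (lower, upper) = difference_ci(x1=30, x0=50, n=1000)
print(d_hat, se, lower, upper)

# Zero outside the interval corresponds to rejecting H0 at the 0.05 level.
reject_h0 = not (lower <= 0 <= upper)
print(reject_h0)
```

Because $T = (\hat{d}/SE)^2$, checking whether zero lies in the 95% interval is equivalent to comparing $T$ with $1.96^2$.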




Bootstrap

How can we find the standard error (SE) if we don’t have an explicitformula or if the formula is too complicated?

We can use Bootstrap to find an approximate SE

Based on the observed data, we have $\hat{d}$

The dataset is a matrix of $2n$ rows. Randomly select a row, say $i_1$, put it back, and then randomly select another row, say $i_2$. Keep doing this until we have $2n$ rows $i_1, \dots, i_{2n}$. This gives a bootstrap sample of the same size as the original sample. Based on this bootstrap sample, we calculate $\hat{d}^{(1)}$

Repeat the above bootstrap process $M$ times (say $M = 500$) to obtain $\hat{d}^{(m)}, m = 1, \dots, M$

Finally, we can approximate the standard error by

$$SE_{\text{boot}} = \sqrt{\sum_{m=1}^{M} \left(\hat{d}^{(m)} - \bar{\hat{d}}\right)^2 / (M - 1)}$$
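
The bootstrap steps above can be sketched as follows. The names `diff_in_proportions` and `bootstrap_se`, the seed, and the toy data are assumptions. Rows are resampled with replacement from the whole dataset, as described; a resample in which one arm happens to be empty is skipped, which is an added safeguard not mentioned in the slides.

```python
import random
import statistics


def diff_in_proportions(rows):
    """d_hat = p1_hat - p0_hat, where rows are (heart_attack, hrt) pairs."""
    treated = [y for y, a in rows if a == 1]
    control = [y for y, a in rows if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def bootstrap_se(rows, M=500, seed=0):
    """Bootstrap standard error of d_hat from M resamples of the rows."""
    rng = random.Random(seed)
    estimates = []
    while len(estimates) < M:
        sample = rng.choices(rows, k=len(rows))  # sampling with replacement
        if len({a for _, a in sample}) < 2:
            continue  # skip the rare resample where one arm is empty
        estimates.append(diff_in_proportions(sample))
    # statistics.stdev divides by (M - 1), matching the formula above.
    return statistics.stdev(estimates)


# Hypothetical data: 40/200 events in arm 0 and 80/200 events in arm 1.
rows = ([(1, 0)] * 40 + [(0, 0)] * 160 +
        [(1, 1)] * 80 + [(0, 1)] * 120)
print(diff_in_proportions(rows), bootstrap_se(rows))
```

For this toy dataset the explicit formula gives $SE = \sqrt{0.4 \times 0.6/200 + 0.2 \times 0.8/200} \approx 0.045$, so the bootstrap value should land close to that.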




HRT example continued: Multiple comparisons

Question: can hormone replacement therapy (HRT) reduce the risk of heart attack, stroke, hypertension, diabetes, or lung cancer? Just call them Disease 1 to 5

Let $p_{j1}$ be the population proportion of disease $j$ if HRT

Let $p_{j0}$ be the population proportion of disease $j$ if no HRT

$H_0: p_{j1} = p_{j0},\ j = 1, \dots, 5$ vs. $H_1: p_{j1} \neq p_{j0}$ for some $j \in \{1, \dots, 5\}$

Let $\hat{p}_{j1}$ be the proportion of disease $j$ in the intervention arm

Let $\hat{p}_{j0}$ be the proportion of disease $j$ in the control arm

Let the test statistics be

$$T_j = Z_j^2 = \frac{(\hat{p}_{j1} - \hat{p}_{j0})^2}{\hat{p}_{j1}(1 - \hat{p}_{j1})/n + \hat{p}_{j0}(1 - \hat{p}_{j0})/n}$$




Family-wise error rate (FWER)

For each test statistic $T_j$, we have $t_{\text{obs},j}$, and then we have the p-value $p_j$

Now, as usual, we apply the magic 0.05 criterion for significance. That is, if the p-value $p_j$ is less than 0.05, we conclude that HRT is significantly associated with disease $j$, $j = 1, \dots, 5$

Then, what is the type-I error rate? That is, what is the probability that we will reject $H_0$ if $H_0$ is true?

Family-wise error rate, assuming the five tests are independent:

$$\text{FWER} = \Pr(\text{reject } H_0 \mid H_0) = 1 - \Pr(\text{accept } H_0 \mid H_0) = 1 - \Pr(\text{accept } H_{j0},\ j = 1, \dots, 5 \mid H_0) = 1 - (1 - 0.05)^5 \approx 0.23$$
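
The arithmetic behind the family-wise error rate can be checked with a few lines of Python. The function name `fwer` is hypothetical, and the formula assumes the $J$ tests are independent.

```python
def fwer(alpha, J):
    """Family-wise error rate for J independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** J


# The lecture's example: five tests, each at the 0.05 level.
print(round(fwer(0.05, 5), 4))  # 0.2262

# The rate grows quickly with the number of comparisons.
for J in (1, 5, 10, 20):
    print(J, round(fwer(0.05, J), 3))
```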




Bonferroni correction

Assume that there are J (say J = 5) comparisons

Consider $\alpha = 0.05/J$ (for $J = 5$, $\alpha = 0.01$) as the criterion to claim significance

For the above example, again assuming independent tests,

$$\text{FWER} = \Pr(\text{reject } H_0 \mid H_0) = 1 - \Pr(\text{accept } H_0 \mid H_0) = 1 - \Pr(\text{accept } H_{j0},\ j = 1, \dots, 5 \mid H_0) = 1 - (1 - 0.01)^5 \approx 0.05$$

The Bonferroni method is conservative. Other methods include the Šidák method, the Scheffé method, and the more recent false discovery rate (FDR)
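
Applying the Bonferroni criterion to a set of p-values can be sketched as below. The five p-values are invented for illustration, and the function names are hypothetical. The Šidák threshold shown for comparison, $1 - (1 - \alpha)^{1/J}$, is the per-test level that gives a family-wise error rate of exactly $\alpha$ when the tests are independent.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis j when p_j < alpha / J."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def sidak_threshold(alpha, J):
    """Per-test level giving FWER = alpha for J independent tests."""
    return 1 - (1 - alpha) ** (1 / J)


# Hypothetical p-values for Disease 1 to 5.
p_values = [0.003, 0.02, 0.04, 0.30, 0.60]

unadjusted = [p < 0.05 for p in p_values]
adjusted = bonferroni_reject(p_values)

print(unadjusted)  # three findings at the unadjusted 0.05 level
print(adjusted)    # one finding after Bonferroni correction
print(sidak_threshold(0.05, 5))
```

The Šidák threshold is slightly larger than the Bonferroni threshold of 0.01, which illustrates why Bonferroni is described as conservative.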




Step 3: Multivariate analysis

There are one or more response variables

There is more than one explanatory variable

Regression analysis is the most popular multivariate analysis




Regression analysis

The two most common regression analyses:

Response variable is continuous: linear regression

Response variable is binary variable: logistic regression




More regression analysis

Generalized linear regression

Mixed-effect regression

Nonlinear regression

Classification and regression tree (CART)

Etc.




HRT example continued: Observational data

Outcome is whether or not heart attack

Variable of interest is whether or not HRT

Covariates such as age, income, and education




Logistic regression

Logistic regression model:

log-odds of (probability of heart attack) $= \alpha + \beta_1 \times$ indicator of HRT $+ \beta_2 \times$ age $+ \beta_3 \times$ income $+ \beta_4 \times$ education

Fit the model to the data to obtain the point estimate $\hat{\beta}_j$ for $\beta_j$, along with the 95% confidence interval estimate $\hat{\beta}_j \pm 1.96 \times SE_j$ and the p-value for $H_{j0}: \beta_j = 0$ vs. $H_{j1}: \beta_j \neq 0$




Odds ratio

For example, suppose the estimated coefficient of the HRT effect is $\hat{\beta}_1 = -1$

First of all, because it is negative, the association between heart attack and HRT is negative. That is, taking HRT is associated with a lower chance of heart attack

Then note that $\exp(-1) \approx 0.37$. Holding the other covariates fixed, the odds of heart attack for a woman taking HRT are 0.37 times the odds for a woman not taking HRT. Equivalently, the odds of heart attack for a woman not taking HRT are $\exp(1) \approx 2.72$ times the odds for a woman taking HRT

Caution: association from an observational study doesn’t imply causation
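
The odds-ratio arithmetic can be checked with a short sketch. The coefficient $-1$ is the example value from the text; the 10% baseline probability and the helper names are assumptions introduced only to show what an odds ratio of 0.37 means on the probability scale.

```python
import math

beta_hrt = -1.0  # example coefficient of HRT from the text

odds_ratio = math.exp(beta_hrt)           # odds with HRT relative to no HRT
inverse_odds_ratio = math.exp(-beta_hrt)  # odds without HRT relative to HRT


def odds(probability):
    return probability / (1 - probability)


def probability_from_odds(o):
    return o / (1 + o)


# Hypothetical baseline: 10% heart attack probability without HRT,
# with the other covariates held fixed.
p_no_hrt = 0.10
p_hrt = probability_from_odds(odds(p_no_hrt) * odds_ratio)

print(round(odds_ratio, 2), round(inverse_odds_ratio, 2))
print(round(p_hrt, 4))
```

Note that the odds ratio multiplies the odds, not the probability, so a ratio of 0.37 does not mean the probability itself is multiplied by 0.37.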




Reporting statistical findings

INTEGRITY!!!

Make sure you know whether you are doing data mining or statistical inference

State honestly whether you are doing data mining or statistical inference

Data mining: You are given some data; you “squeeze” the data very hard; you get some interesting finding; then you propose a hypothesis

Statistical inference: You specify a hypothesis; you design a study and collect some data; you get some interesting finding; then you confirm or discard the hypothesis




Frequentist vs. Bayesian

There are two schools of statistical inference, although in this lecture we focus on the frequentist school

Frequentist: From the distribution of the data given the parameter, we derive the likelihood function of the parameter given the data. Based on the likelihood function, we obtain a point estimate of the parameter and make inference

Bayesian: From the prior distribution of the parameter and the distribution of the data given the parameter, we derive the posterior distribution of the parameter given the data. Based on the posterior distribution, which includes all the updated information about the parameter, we examine the properties of the parameter
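
The contrast between the two schools can be illustrated with a single proportion. This sketch is an assumption-laden toy: it uses a binomial model, a Beta prior (a standard conjugate choice that the lecture does not specify), invented counts, and hypothetical function names.

```python
def frequentist_estimate(x, n):
    """Maximum likelihood estimate of a proportion: x events in n subjects."""
    return x / n


def bayesian_posterior(x, n, prior_a=1.0, prior_b=1.0):
    """Beta(prior_a, prior_b) prior with binomial data gives a Beta posterior."""
    return prior_a + x, prior_b + (n - x)


def beta_mean(a, b):
    return a / (a + b)


# Hypothetical data: 30 heart attacks among 1000 women.
x, n = 30, 1000

mle = frequentist_estimate(x, n)
post_a, post_b = bayesian_posterior(x, n)  # uniform Beta(1, 1) prior
posterior_mean = beta_mean(post_a, post_b)

print(mle, posterior_mean)
```

With this much data and a flat prior the two estimates nearly coincide; they differ more when the sample is small or the prior is informative.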




Data mining

Are they doing the same thing? (Are you stealing our jobs?)

Data mining

Machine learning

Statistical learning




Two types of learning

Supervised learning: classification; regression; etc

Unsupervised learning: cluster analysis; pattern recognition; etc




Growing our toolbox

“If all you have is a hammer, everything looks like a nail”

Our toolbox should include traditional methods such as the t-test and linear regression

Our toolbox should also include new tools such as mixed-effect models, CART, multiple imputation, and newly developed methods (boosting and random forest)




History of data

We conclude this lecture with a brief history of data. Let n be the number of subjects and p be the number of variables

Data where n is large and p is small (Before R. A. Fisher)

Data where n is small and p is small (R. A. Fisher)

Data where n is small and p is large (High-dimensional data)

Data where n is large and p is large (Big data)




Questions?
