statisticalmodeling - uliege.be


Statistical modeling PhD course - April 2020

Course for PhD students

Statistical modeling: hypothesis testing

• A (too) brief reminder:

1. Hypothesis testing involves explicitly stating a hypothesis to be tested (!)

• Example: the new diet decreases the adult dog weight, on average.

• Formally, it is most often easier to test the absence of effect (= null hypothesis). Here:

• H0: µ New diet = µ Old diet

• H1: µ New diet < µ Old diet (unilateral test)


2. Collect data that allow the hypothesis to be tested

• Example: create 2 « balanced » samples, feed one sample with the old diet and the other with the new diet, and collect adult weights.

3. Obtain the probability p (= « p value ») of observations at least as extreme as the collected ones if the null hypothesis is true

• Example: use the weights from the 2 samples to compute a Student’s t statistic, and compute the probability of a value of t at least as extreme as the one obtained.

4. Do not reject (« accept ») the null hypothesis if p > α; reject it (and accept H1) if p < α

• Typically, α = 0.05, 0.01 or 0.001
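The four steps above can be sketched in R, the software used in this course. This is only an illustration under stated assumptions: the weights are simulated (they are not course data), and the names `old_diet`, `new_diet` and `alpha`, the sample sizes, the means and the standard deviation are all hypothetical.

```r
# Steps 1-4 on simulated data (hypothetical values, for illustration only)
set.seed(2020)                              # make the simulation reproducible
old_diet <- rnorm(20, mean = 30, sd = 3)    # adult weights (kg), old diet
new_diet <- rnorm(20, mean = 28, sd = 3)    # adult weights (kg), new diet

# Step 1. H0: mu_new = mu_old ; H1: mu_new < mu_old (unilateral test)
# Step 3. Student's t statistic and its p value
res <- t.test(new_diet, old_diet, alternative = "less", var.equal = TRUE)
res$statistic                               # Student's t statistic
res$p.value                                 # P(t at least this small | H0 true)

# Step 4. Compare p to the chosen alpha
alpha <- 0.05
decision <- if (res$p.value < alpha) "reject H0" else "do not reject H0"
decision
```

`var.equal = TRUE` requests the classical Student's t test; leaving it out makes R use Welch's variant, which does not assume equal variances.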


• From collected data to p values:

• Simple tests, chosen according to the type of the dependent variable (Y) and of the independent variable (X):

               X discrete                      X continuous
Y discrete     χ², Fisher, ...                 Logistic regression
Y continuous   t-tests, ANOVA, Wilcoxon, ...   Regressions

• From collected data to p values: examples of simple tests

« Does treatment based on hydroxychloroquine reduce covid-19 occurrence? »

• Dependent variable Y: covid-19 occurrence (Yes or No)

• Independent variable X: hydroxychloroquine treatment (Yes or No)

• Experiment (prospective study): compare the number of cases among N (randomly selected) treated patients to the number of cases among M (randomly selected) untreated patients using a χ² test.

« Does a new diet alter adult dogs’ weight? »

• Dependent variable Y: adult dogs’ weight (continuous)

• Independent variable X: treatment (Old or New)

• Experiment: feed the new diet to a sample of young dogs, and feed the ‘old’ diet to another balanced (i.e. same breed, body condition, sex, ...) sample of young dogs, and collect the weights at the same adult age. Perform a Student’s t test on the two averages obtained.

« Does cholesterol blood concentration impact the occurrence of myocardial infarction? »

• Dependent variable Y: myocardial infarction (Yes or No)

• Independent variable X: cholesterol blood concentration (continuous)

• Experiment: randomly sample N cases (patients with a confirmed myocardial infarction record) and M controls (people with no myocardial infarction record), obtain the cholesterol blood concentration for these two samples, and perform a logistic regression to check whether an increase in cholesterol concentration significantly impacts the probability of a myocardial infarction.
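A logistic regression of this kind could look as follows in R. This is a minimal sketch on simulated data: the sample size, the cholesterol scale (mg/dL), the coefficients used to generate the outcome and all variable names are assumptions made for the illustration, and the simulation does not reproduce the case/control sampling described above.

```r
# Simulated data (hypothetical values, not real patients)
set.seed(1)
n <- 200
cholesterol <- rnorm(n, mean = 200, sd = 30)        # X, assumed to be in mg/dL

# Assumed "true" relationship, used only to simulate the outcome
p_true <- plogis(-8 + 0.04 * cholesterol)
infarction <- rbinom(n, size = 1, prob = p_true)    # Y: 1 = Yes, 0 = No

# Logistic regression of Y (discrete) on X (continuous)
fit <- glm(infarction ~ cholesterol, family = binomial)
coefs <- summary(fit)$coefficients                  # estimate, SE, z value, p value
coefs
exp(coef(fit)["cholesterol"])                       # odds ratio per unit of cholesterol
```

The p value on the `cholesterol` row tests the null hypothesis that the concentration has no effect on the probability of an infarction.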


« Can we predict an adult bovine weight based on the neck perimeter? »

• Dependent variable Y: adult bovine weight (continuous)

• Independent variable X: neck perimeter (continuous)

• Experiment: obtain the weights and the neck perimeters of a set of adult bovines from a targeted breed, build a regression of the weight on the neck perimeter, and check whether the relationship is significant (and, additionally, how much of the weight variation is explained by the neck perimeter).
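In R, such a regression and the share of explained variation can be obtained with `lm`. The sketch below uses simulated data: the range of neck perimeters, the assumed linear relation and the noise level are hypothetical and serve only to make the example runnable.

```r
# Simulated data (hypothetical values, not real animals)
set.seed(1)
neck   <- runif(50, min = 90, max = 130)             # neck perimeter (cm), assumed range
weight <- -300 + 8 * neck + rnorm(50, sd = 40)       # adult weight (kg), assumed relation

# Regression of the weight (Y) on the neck perimeter (X)
fit <- lm(weight ~ neck)
summary(fit)$coefficients    # the "neck" row holds the slope, its SE, t value and p value
r2 <- summary(fit)$r.squared # share of the weight variation explained by the neck perimeter
r2
```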


• Errors?

            H0 accepted          H0 rejected
H0 true     OK!                  Type I error (α)
H0 false    Type II error (β)    OK!

• Power = 1 - β
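For a t test, the power 1 - β can be computed in R with `power.t.test` from the `stats` package. The numbers below (20 animals per group, a true difference of 2 kg, a standard deviation of 3 kg) are hypothetical and only illustrate how α, β and the sample size are linked.

```r
# Power of a unilateral two-sample t test (all numbers are hypothetical)
pw <- power.t.test(n = 20, delta = 2, sd = 3, sig.level = 0.05,
                   type = "two.sample", alternative = "one.sided")
pw$power          # power = 1 - beta
1 - pw$power      # beta = probability of a type II error

# Conversely: sample size per group needed to reach a power of 80%
n_needed <- power.t.test(delta = 2, sd = 3, sig.level = 0.05, power = 0.80,
                         type = "two.sample", alternative = "one.sided")$n
ceiling(n_needed)
```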


• It seems that all situations have been introduced in these simple tests.

• Unfortunately, real life is in most cases more complex than these simple examples, and more complex statistics are needed to derive useful results for your research...

• We next show two examples to demonstrate the need for a larger class of models


A first example dataset

• For this short introduction and as a vehicle for some of the ideas, we’ll use an example with 700 cases suffering from kidney stones. The data set contains 4 fields:

• The first field is the patient number (from 1 to 700)

• The second field is the treatment used (either US - for ultra-sound therapy -, or Open - for open surgery)

• The third field is the stone size (either Small - meaning the diameter is < 1 cm - or Large)

• The fourth field is the treatment result (either Success - no more stones in the 2 years following treatment - or Failure)

(Available on the course website)

Number  Type  Size   S_F
1       US    Small  Success
2       US    Small  Success
3       Open  Small  Success
...
698     Open  Large  Success
699     US    Small  Success
700     Open  Small  Success

• A question of interest is certainly: « Which treatment is most successful? »

• This question might be answered using hypothesis testing:

• $H_0: \pi_{US} = \pi_{OS}$

i.e. the success rate π is the same for ultra-sound (US) and open surgery (OS)

• We might build a table with the collected data using software (for example cross-tables in Excel, or the table function in R - the code is given in appendix 1)

All stones  Open  US   Total
Failure     77    61   138
Success     273   289  562
Total       350   350  700
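The hypothesis above can be tested directly from the counts of this table, without the data file. The sketch below types the four counts into a matrix (the name `ct` follows appendix 1; `Result` and `Type` are labels chosen here) and runs the same uncorrected χ² test as in appendix 1.

```r
# Counts taken from the "All stones" table (filled column by column)
ct <- matrix(c(77, 273, 61, 289), nrow = 2,
             dimnames = list(Result = c("Failure", "Success"),
                             Type   = c("Open", "US")))
addmargins(ct)                          # adds the row and column totals

# H0: same success rate for US and OS; chi-square test without Yates' correction
test <- chisq.test(ct, correct = FALSE)
test$statistic                          # X-squared = 2.3106, as reported in appendix 1
test$p.value                            # 0.1285: not significant at the 5% level
```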

• Using these results, it seems that US is slightly better than OS (61/350 failures for US vs 77/350 failures for OS), although the difference is not statistically significant (see appendix 1 for a χ² test)

• So, in the light of these results, and because US is less invasive than OS, we might ask our physician to use US if needed.

• Then the physician tells us: « I’d rather use OS for large stones, and US for small stones »

• So, let’s look at our data again and distinguish between small and large stones. This leads to 2 distinct tables as follows (appendix 1):

Large stones  Open  US   Total
Failure       71    25   96
Success       192   55   247
Total         263   80   343

Small stones  Open  US   Total
Failure       6     36   42
Success       81    234  315
Total         87    270  357

• If we summarize the results obtained on the same data set, either taking the whole set (« marginal set ») or splitting it into two subsets (« conditional (on the stone size) subsets »):

P(Success)  Marginal  Small  Large
Open        0.78      0.93   0.73
US          0.83      0.87   0.69

• This very disturbing result shows that the conclusions on both conditional subsets differ from the marginal result: OS now seems better than US! This strange phenomenon is known as Simpson’s paradox.
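The success rates of this summary can be recomputed from the counts of the two conditional tables. In the sketch below the object names (`success`, `failure`, `rates`, `share_small`) are chosen for the illustration; the last lines show how unevenly the two treatments are distributed over stone sizes, which is what makes the reversal possible.

```r
# Counts from the "Small stones" and "Large stones" tables (filled column by column)
dn <- list(Type = c("Open", "US"), Size = c("Small", "Large"))
success <- matrix(c(81, 234, 192, 55), nrow = 2, dimnames = dn)
failure <- matrix(c(6, 36, 71, 25), nrow = 2, dimnames = dn)

conditional <- success / (success + failure)                 # P(Success | Type, Size)
marginal    <- rowSums(success) / rowSums(success + failure) # P(Success | Type)
rates <- round(cbind(Marginal = marginal, conditional), 2)
rates
# Open: 0.78 (marginal), 0.93 (small), 0.73 (large)
# US:   0.83 (marginal), 0.87 (small), 0.69 (large)

# Share of small stones among the patients of each treatment
share_small <- (success + failure)[, "Small"] / rowSums(success + failure)
round(share_small, 2)   # Open was mostly used on large stones, US on small ones
```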


A second example dataset

• Assume the following problem:

• A virus is causing skin lesions on domestic animals.

• The question we are interested in is: « Is there a relationship between the size of the lesions and the time of exposure? »

• Again, we can use hypothesis testing:

• $H_0: \beta = 0$

i.e. using linear regression, the size of the lesion does not change with the time of exposure

• To test this (null) hypothesis, we can start collecting pairs of observations on affected cats, where each pair of observations is:

x = exposure time ; y = lesion size

• The data are presented on the next slide (file « lesions.txt »)

• The first field is the time of exposure in days (a continuous variable)

• The second field is the lesion diameter in mm (a continuous variable)

• The third field is a breed code (either 1, 2, 3, 4 or 5)

• It can be observed that:

• The size seems to decrease with time

• This can be confirmed using a regression of size on time (we will review how to do this...):

• Regression coefficient b = -0.25

• P(b ≤ -0.25 | β = 0) = 0.15

• not significant at the 5% level, although a trend is visible

• But counterintuitive for the vet in charge of the experiment!

• Idea: take the breed into account


• It can be observed that:

• The size seems to increase with time within each breed

• Breed responses differ widely

• This can be confirmed using a linear model (we have to learn how to do that...):

• Regression coefficient b = +0.64

• P(b ≥ 0.64 | β = 0) < 0.001 => highly significant (far below the 5% level)

• (Nuisance) Breed effect also highly significant (p = 0.0002)
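The change of sign of the regression coefficient can be reproduced on simulated data. The sketch below does not use the « lesions.txt » file: it assumes, purely for illustration, five breeds whose baseline lesion sizes decrease while their exposure times increase, with a common within-breed slope of +0.6. Whether this mechanism explains the course data is an assumption.

```r
# Simulated data (hypothetical mechanism and values, not the course file)
set.seed(42)
breed <- rep(1:5, each = 10)                                    # breed code
time  <- runif(50, min = 10 * breed, max = 10 * breed + 10)     # days of exposure
size  <- 60 - 15 * breed + 0.6 * time + rnorm(50, sd = 1)       # lesion size (mm)
d <- data.frame(time, size, breed = factor(breed))              # breed as a discrete factor

pooled   <- lm(size ~ time, data = d)           # simple regression, breed ignored
adjusted <- lm(size ~ time + breed, data = d)   # linear model with a breed effect

coef(pooled)["time"]     # negative: size seems to decrease with time
coef(adjusted)["time"]   # positive, close to the simulated within-breed slope
```

Declaring `breed` as a factor matters: left as a number, R would treat the breed code as a continuous covariate.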


Modeling...

• Conclusions of these 2 examples:

• Simple tests might be misleading...

• A solution could be to use models that allow several factors to be tested at the same time

• In our first example:

• Treatment effect (open surgery or ultra-sound)

• Kidney stone size effect (large or small)

• « Interaction » between these two effects (does the potential difference between treatments remain the same for small and large stones?)

• In our second example:

• Time effect (continuous variable)?

• Breed effect (differences between breeds)?

• « Interaction » between these two effects (does the difference between breeds remain stable with time?)

Models

• A general (matrix) form for models making explicit the relationship between the dependent variables Y and the independent ones X is:

Y = f(X)

where:

• Y is a table of K measures taken on N individuals (« dependent variables »)

• Yij is the jth measure on individual i

• X is another table of M measures taken on the N individuals (« independent variables »)

• f(.) is a function connecting X and Y.

• The exact form of f(.) is part of « the model »

Models: a subset...

• In this course:

• K = 1 (univariate models)

• Unless otherwise stated:

Y = f(X) = X * β + ε (« linear model »)

where the different components Y, X, β and ε will be explained later...

• A (maybe more explicit) way of writing this relationship is:

$y_i = x_{i1} \beta_1 + x_{i2} \beta_2 + \dots + x_{iM} \beta_M + \varepsilon_i, \quad i = 1, \dots, N$

where $x_{i1}, \dots, x_{iM}$ are the values of the independent variables measured on individual i, $y_i$ is the corresponding value of the dependent variable, $\beta_1, \dots, \beta_M$ are the M parameters of the model, and $\varepsilon_i$ is a quantity that is not explained by the model (called the residual).
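To make the matrix notation concrete, the sketch below builds a small X (N = 30, M = 3), simulates y = Xβ + ε and recovers β. Everything here is hypothetical: the values of β, the noise level, and the choice of least squares as the estimation method, which the course has not introduced at this point.

```r
# Simulated linear model with N = 30 individuals and M = 3 parameters
set.seed(1)
N <- 30
X <- cbind(x1 = 1, x2 = rnorm(N), x3 = rnorm(N))  # x1 = 1 for everyone (acts as an intercept)
beta <- c(2, 0.5, -1)                             # assumed "true" parameters
eps  <- rnorm(N, sd = 0.3)                        # residuals
y <- drop(X %*% beta + eps)                       # y_i = x_i1*b1 + x_i2*b2 + x_i3*b3 + eps_i

# Least-squares estimate written with matrices: solve (X'X) b = X'y
b_hat <- solve(crossprod(X), crossprod(X, y))

# The same model fitted with lm(); "0 +" tells R not to add its own intercept
fit <- lm(y ~ 0 + X)
cbind(matrix_form = drop(b_hat), lm = coef(fit))
```

Both columns agree, which shows that `lm` is doing nothing more than this matrix computation (in a numerically more careful way).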

• So, we will restrict ourselves to models where the relationship between the dependent variables (Y) and the independent variables (X) takes a linear form, which is very useful in many practical situations...

Back to the dermatology problem

• A model can be fit using the linear form given above:

yi = α*ti + βi + εi

• yi = size of the lesion on animal i

• ti = time of exposure to the virus

  • a continuous variable, sometimes referred to as a covariate

• α = (linear) regression coefficient

• βi = effect on y of the breed of animal i

  • a discrete variable (breed is "a", "b", "c", "d" or "e")

• εi = residual for animal i

• In this model:

yi = α*ti + βi + εi

• The (unknown) parameters are:

• α = (linear) regression coefficient

• βa, βb, βc, βd, βe= effects of each of the breeds

• These parameters will need to be estimated

• Hypotheses about these parameters will be tested:

• H01: α = 0 (=> no (linear) effect of time on y)

• H02: βa = βb = βc = βd = βe (=> no difference between breeds)

• H03: ...
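In R, hypotheses such as H01 and H02 can be tested by fitting the model and dropping each term in turn. The sketch below runs on simulated data (the breed effects, the slope and the noise level are hypothetical) and uses `drop1` with an F test, which is one possible way to obtain these tests rather than necessarily the one used in the course.

```r
# Simulated data (hypothetical values): 5 breeds, 8 animals per breed
set.seed(7)
breed <- factor(rep(c("a", "b", "c", "d", "e"), each = 8))
time  <- runif(40, min = 5, max = 30)                       # days of exposure
breed_effect <- c(a = 10, b = 14, c = 18, d = 22, e = 26)   # assumed beta_a ... beta_e
size  <- 0.6 * time + breed_effect[as.character(breed)] + rnorm(40, sd = 2)

# y_i = alpha * t_i + beta_breed(i) + eps_i
# (R writes the breed effects as an intercept plus 4 contrasts; the fit is equivalent)
fit <- lm(size ~ time + breed)

tests <- drop1(fit, test = "F")
tests    # row "time": test of H01 (alpha = 0); row "breed": test of H02 (equal breed effects)
```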


Back to the stones problem

• A model can be fit using the general form given above

Y = f(X)

• This is an example of a non linear relationship:

f(Xi) ≠ Xi * β + ei...

• Actually, the probability of a success for individual i is modelled as:

$p_i = \dfrac{e^{X_i \beta}}{1 + e^{X_i \beta}}$

• Some details of the resolution are provided in Appendix 2

What next?

• Before going into more details on linear models, we will need some useful tools:

• A math tool: matrices

• A software tool: we will use R in this course

• These two tools will be the subject of the next lectures.

Appendix 1

This appendix reports some R code to work on the « stones » problem

• Lines in green correspond to comments explaining the code (they start with #)

• Lines in blue correspond to R commands to be typed in the « console » of the software

• Lines in red correspond to text returned by the software after typing a command.

The R code to obtain the needed table for the stones problem:

# Read the file (in the d:\courses\ directory on my PC; inside an R string
# the separator must be written "/" or "\\")
# "file=" gives the name of the file
# "header=TRUE" tells R that the file contains a header
# (a first line with the column names)
# sep="\t" means that fields on a line are separated using tabs.
f <- read.table(file="d:/courses/pierres.txt", header=TRUE, sep="\t")
# The file is now completely in f! f$Type contains the 700 types
# and f$S_F contains the 700 results (the names of these variables
# are extracted from the header line)
# Building the table is now very simple:
table(f$S_F, f$Type)

          Open  US
Failure     77  61
Success    273 289

Appendix 1 (cont’d)

The R code to obtain a χ² test for the stones problem:

# We could put the table in an R variable:
ct <- table(f$S_F, f$Type)
# Next, we could instruct R to perform a chi-square test on the table
chisq.test(ct, correct=FALSE)

Pearson's Chi-squared test

data: ct
X-squared = 2.3106, df = 1, p-value = 0.1285

# No significant difference (p=0.1285). correct=FALSE instructs R not to
# perform Yates' correction (this correction is useful when numbers
# are small)

The R code to obtain distinct tables for small and large stones:

# Select only the records where the Size variable is "Small":
table(f$S_F[f$Size=="Small"], f$Type[f$Size=="Small"])

          Open  US
Failure      6  36
Success     81 234

# Select only the records where the Size variable is "Large":
table(f$S_F[f$Size=="Large"], f$Type[f$Size=="Large"])

          Open  US
Failure     71  25
Success    192  55

Appendix 2: solving the stones problem

Since this type of situation is not the target of the course, only a list of steps will be provided here, with few details.

1. Show that E(y) = success probability:
   E(y) = Σ y*p(y) = 1*p + 0*(1-p) = p

2. Provide a link function
   1. Common choice: the logit function, logit(p) = ln[ p / (1-p) ]
      1. p → 0 => logit(p) → -∞
      2. p → 1 => logit(p) → +∞
      => the range ]0; 1[ of p is mapped to ]-∞; +∞[

3. Obtain the « probability » function
   logit(p) = Xβ => p = exp(Xβ)/[1+exp(Xβ)]
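Steps 2 and 3 can be checked numerically. The sketch below defines the logit and its inverse as two small functions (the names `logit` and `inv_logit` are chosen here) and compares them with R's built-in `qlogis` and `plogis`.

```r
# Link function and its inverse, written exactly as in steps 2 and 3
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))

p <- c(0.001, 0.25, 0.5, 0.75, 0.999)
x <- logit(p)
x               # large negative values near p = 0, large positive values near p = 1
inv_logit(x)    # gives back the probabilities p

# R already provides both functions for the logistic distribution
all.equal(logit(p), qlogis(p))
all.equal(inv_logit(x), plogis(x))
```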


4. The « best » model must first be obtained using « model selection » procedures. The final model is:

logit(pi) = µ + αi + βi

where:
pi is the probability of success for individual i
αi is the effect of the stone size (L or S) for individual i
βi is the effect of the treatment (O or US) for individual i.

5. Compute the likelihood of the data
   1. Assuming independence:

$L = \prod_i p_i^{\delta_i} (1 - p_i)^{1 - \delta_i}$

   where δi = 1 if the treatment of individual i was a success and δi = 0 otherwise

6. Obtain the estimators of β that maximize the (log)likelihood
   1. Not easy (no « closed-form solution ») in general!
   2. Optimization software needed
   3. See the analytical development and an Excel solution for our problem in the attached documents.
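Steps 5 and 6 can be sketched in R with a general-purpose optimizer. The code below writes the log-likelihood of the model of step 4 for the stones counts, using the sum-to-zero coding that appears in the estimates of step 7 (Large = -Small, US = -Open), and maximizes it with `optim`. The function and object names are chosen for the illustration, and BFGS is only one possible optimization method.

```r
# Counts of successes and failures in the four Size x Type cells
cells <- data.frame(size    = c("Small", "Small", "Large", "Large"),
                    type    = c("Open",  "US",    "Open",  "US"),
                    success = c(81, 234, 192, 55),
                    failure = c(6,  36,  71,  25))

# log L = sum_i [ delta_i * log(p_i) + (1 - delta_i) * log(1 - p_i) ],
# grouped by cell because all patients of a cell share the same p_i
loglik <- function(mu, a_small, b_open) {
  eta <- mu +
    ifelse(cells$size == "Small", a_small, -a_small) +   # alpha: size effect
    ifelse(cells$type == "Open",  b_open,  -b_open)      # beta: treatment effect
  p <- plogis(eta)                                       # p = exp(eta) / (1 + exp(eta))
  sum(cells$success * log(p) + cells$failure * log(1 - p))
}

# optim() minimizes, so the log-likelihood is negated
opt <- optim(par = c(0, 0, 0),
             fn = function(th) -loglik(th[1], th[2], th[3]),
             method = "BFGS")
opt$par      # estimates of mu, alpha_Small, beta_Open (compare with step 7)
-opt$value   # maximized ln(L)
```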


7. Parameter estimation through maximum likelihood (ML) leads to the following solutions:

m = 1.4849
aSMALL = 0.6303
aLARGE = -0.6303
bOPEN = 0.1786
bUS = -0.1786

ln(L) = -331.4333
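These estimates can also be obtained with R's `glm` function instead of a hand-written likelihood. The sketch below rebuilds one row per patient from the counts of the four tables cells (the data file itself is not needed) and asks for sum-to-zero contrasts so that the coefficients match the parameterization above; the object names are chosen for the illustration.

```r
# Rebuild the 700 individual records from the cell counts
cells <- data.frame(Size    = c("Small", "Small", "Large", "Large"),
                    Type    = c("Open",  "US",    "Open",  "US"),
                    Success = c(81, 234, 192, 55),
                    Failure = c(6,  36,  71,  25))
d <- rbind(
  data.frame(Size = rep(cells$Size, cells$Success),
             Type = rep(cells$Type, cells$Success), y = 1),   # successes
  data.frame(Size = rep(cells$Size, cells$Failure),
             Type = rep(cells$Type, cells$Failure), y = 0))   # failures
d$Size <- factor(d$Size, levels = c("Small", "Large"))
d$Type <- factor(d$Type, levels = c("Open", "US"))

# logit(p_i) = mu + alpha_i + beta_i, with effects summing to zero within each factor
fit <- glm(y ~ Size + Type, family = binomial, data = d,
           contrasts = list(Size = "contr.sum", Type = "contr.sum"))
coef(fit)     # (Intercept) = m, Size1 = aSMALL, Type1 = bOPEN
logLik(fit)   # ln(L)
```

With R's default (treatment) contrasts the fitted probabilities would be identical, but the coefficients would be expressed as differences from a reference level instead.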


8. Risk assessment can be obtained through odds ratios (OR) with their 95% confidence intervals (CI):

OR(L vs S) = exp(-0.6303 - 0.6303) = 0.283
CI(95%) = [0.177; 0.453]

=> success probability significantly decreased if stone is large

OR(O vs US) = exp(0.1786 + 0.1786) = 1.429
CI(95%) = [0.912; 2.239]

=> success probability not significantly affected by treatment type
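The two odds ratios follow directly from the estimates of step 7, and the conclusions follow from whether each interval contains 1. A short sketch, in which the variable names and the helper `excludes_one` are chosen for the illustration and the interval bounds are copied from the text (computing them would require the standard errors, which are not given here):

```r
# Estimates from step 7
a_small <- 0.6303; a_large <- -0.6303
b_open  <- 0.1786; b_us    <- -0.1786

# An odds ratio is the exponential of a difference between two effects
or_size <- exp(a_large - a_small)    # Large vs Small
or_type <- exp(b_open - b_us)        # Open vs US
round(c(or_size, or_type), 3)        # 0.283 and 1.429

# An OR differs significantly from 1 when its 95% CI does not contain 1
excludes_one <- function(ci) ci[1] > 1 || ci[2] < 1
excludes_one(c(0.177, 0.453))        # TRUE: stone size has a significant effect
excludes_one(c(0.912, 2.239))        # FALSE: treatment type has no significant effect
```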
