statisticalmodeling - uliege.be

Statistical modeling: PhD course - April 2020

Course for PhD students 1

Statistical modeling: hypothesis testing

• A (too) brief reminder:

1. Hypothesis testing requires stating explicitly the hypothesis to be tested (!)

• Example: the new diet decreases adult dog weight, on average.

• Formally, it is most often easier to test the absence of effect (= null hypothesis). Here:

• H0: µ New diet = µ Old diet

• H1: µ New diet < µ Old diet (one-sided test)


2. Collect data that allow the hypothesis to be tested

• Example: create 2 « balanced » samples, feed one sample with the old diet and the other with the new diet, and record the adult weights.

3. Obtain the probability p (= « p value ») of observing data at least as extreme as those collected if the null hypothesis is true

• Example: use the weights from the 2 samples to compute a Student’s t statistic, then compute the probability of a value of t at least as extreme as the one observed.

4. Accept (more precisely: do not reject) the null hypothesis if p > α; reject it (and accept H1) if p < α

• Typically, α = 0.05, 0.01 or 0.001
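Steps 1 to 4 can be sketched in R as follows. This is only an illustration: the weights are simulated, and the sample size, means and standard deviation are hypothetical values, not data from the course.

```r
# Hypothetical adult weights (kg) for two balanced samples of 20 dogs.
# All numbers below are assumptions made for this illustration.
set.seed(1)
old_diet <- rnorm(20, mean = 30, sd = 3)
new_diet <- rnorm(20, mean = 28, sd = 3)

# H0: mu_new = mu_old ; H1: mu_new < mu_old (one-sided test)
res <- t.test(new_diet, old_diet, alternative = "less", var.equal = TRUE)

# The p value is the lower tail of the t distribution at the observed t
t_obs <- unname(res$statistic)
df    <- unname(res$parameter)
p     <- pt(t_obs, df = df)

# Decision rule of step 4
alpha <- 0.05
decision <- if (p < alpha) "reject H0" else "do not reject H0"
cat("t =", t_obs, " p =", p, " ->", decision, "\n")
```

Computing `p` by hand with `pt()` is redundant with `res$p.value`; it is shown only to make step 3 explicit.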



• From collected data to p values:

• Simple tests, chosen according to the type of the dependent variable (Y) and of the independent variable (X):

                 X discrete                      X continuous
Y discrete       χ², Fisher...                   Logistic regression
Y continuous     t-tests, ANOVA, Wilcoxon,...    Regressions


• From collected data to p values:

• Examples of simple tests:

« Does treatment based on hydroxychloroquine reduce covid-19 occurrence ? »

• Dependent variable Y: covid-19 occurrence (Yes or No)

• Independent variable X: hydroxychloroquine treatment (Yes or No)





• Experiment: compare the number of cases among N (randomly selected) treated patients with the number of cases among M (randomly selected) untreated patients using a χ² test (prospective study).




« Does a new diet alter adult dogs’ weight ? »

• Dependent variable Y: adult dogs’ weight (continuous)

• Independent variable X: treatment (Old or New)






• Experiment: feed the new diet to a sample of young dogs, feed the ‘old’ diet to another balanced (i.e. same breed, body condition, sex, ...) sample of young dogs, and record the weights at the same adult age. Perform a Student’s t test on the two averages obtained.




« Does blood cholesterol concentration impact the occurrence of myocardial infarction ? »

• Dependent variable Y: myocardial infarction (Yes or No)

• Independent variable X: cholesterol blood concentration (Continuous)






• Experiment: randomly sample N cases (patients with a confirmed myocardial infarction record) and M controls (people with no myocardial infarction record), obtain the blood cholesterol concentration for these two samples, and perform a logistic regression to check whether an increase in cholesterol concentration significantly impacts the probability of a myocardial infarction.
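A minimal sketch of such a logistic regression in R is given below. The cholesterol values are simulated; the group sizes, means and standard deviation are hypothetical and serve only to make the example runnable.

```r
# Hypothetical case-control sample: cholesterol (mmol/L) is simulated,
# with assumed means of 6.2 for cases and 5.5 for controls.
set.seed(4)
n_cases    <- 100
n_controls <- 100
chol <- c(rnorm(n_cases, mean = 6.2, sd = 1),
          rnorm(n_controls, mean = 5.5, sd = 1))
infarction <- c(rep(1, n_cases), rep(0, n_controls))  # 1 = case, 0 = control

# Logistic regression of the binary outcome on the continuous predictor
fit   <- glm(infarction ~ chol, family = binomial)
coefs <- summary(fit)$coefficients

slope      <- coefs["chol", "Estimate"]   # change in log-odds per mmol/L
p_value    <- coefs["chol", "Pr(>|z|)"]   # test of H0: slope = 0
odds_ratio <- exp(slope)                  # odds ratio per mmol/L
```

In a case-control design the numbers of cases and controls are fixed by the sampling, so the intercept does not estimate a population risk; the slope (and the odds ratio) remains interpretable.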




« Can we predict an adult bovine’s weight from its neck perimeter ? »

• Dependent variable Y: adult bovine weight (Continuous)

• Independent variable X: neck perimeter (Continuous)






• Experiment: obtain the weights and the neck perimeters of a set of adult bovines from a targeted breed, build a regression of the weight on the neck perimeter, and check whether the relationship is significant (and, additionally, how much of the weight variation is explained by the neck perimeter).
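The regression described above could look as follows in R. The data are simulated: the number of animals, the range of neck perimeters and the assumed relationship (5 kg per cm) are hypothetical values chosen for illustration.

```r
# Hypothetical data: 40 adult bovines, neck perimeter in cm, weight in kg.
# The "true" slope of 5 kg/cm and the noise level are assumptions.
set.seed(2)
neck   <- runif(40, min = 90, max = 130)
weight <- 5 * neck + rnorm(40, mean = 0, sd = 25)

# Regression of the weight on the neck perimeter
fit <- lm(weight ~ neck)
s   <- summary(fit)

slope     <- s$coefficients["neck", "Estimate"]   # estimated kg per cm
p_value   <- s$coefficients["neck", "Pr(>|t|)"]   # test of H0: slope = 0
r_squared <- s$r.squared                          # share of weight variation explained
```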



• Errors ?

               H0 accepted          H0 rejected
H0 true        OK !                 Type I error (α)
H0 false       Type II error (β)    OK !

• Power = 1 - β
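The link between α, β and the sample size can be explored with the `power.t.test` function of R. The sketch below assumes a two-sample, one-sided t test; the effect size (`delta`) and standard deviation (`sd`) are hypothetical values.

```r
# Hypothetical design: a true difference of 2 kg and an SD of 3 kg are assumed.
pw20 <- power.t.test(n = 20, delta = 2, sd = 3, sig.level = 0.05,
                     type = "two.sample", alternative = "one.sided")
pw50 <- power.t.test(n = 50, delta = 2, sd = 3, sig.level = 0.05,
                     type = "two.sample", alternative = "one.sided")

# Type II error risk with 20 individuals per group: beta = 1 - power
beta20 <- 1 - pw20$power

# Sample size per group needed to reach a power of 0.80
n80 <- power.t.test(power = 0.80, delta = 2, sd = 3, sig.level = 0.05,
                    type = "two.sample", alternative = "one.sided")$n
```

For a fixed α, increasing the sample size reduces β and therefore increases the power.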



• It seems that all situations have been covered by these simple tests.

• Unfortunately, real life is in most cases more complex than these simple examples, and more complex statistics are needed to derive useful results for your research...

• We next show two examples to demonstrate the need for a larger class of models



A first example dataset

• For this short introduction, and as a vehicle for some of the ideas, we’ll use an example with 700 patients suffering from kidney stones. The data set contains 4 fields:

• The first field is the patient number (from 1 to 700)

• The second field is the treatment used (either US, for ultra-sound therapy, or Open, for open surgery)

• The third field is the stone size (either Small, meaning that the diameter is < 1 cm, or Large)

• The fourth field is the treatment result (either Success, meaning no more stones in the 2 years following treatment, or Failure)


• An excerpt of the data set (available on the course website):

Number  Type  Size   S_F
1       US    Small  Success
2       US    Small  Success
3       Open  Small  Success
…
698     Open  Large  Success
699     US    Small  Success
700     Open  Small  Success




• A question of interest is certainly: « Which treatment is most successful ? »

• This question might be answered using hypothesis testing:

• $H_0: \pi_{US} = \pi_{OS}$

i.e. the success rate π is the same for ultra-sound (US) and open surgery (OS)

• We might build a table with the collected data using software (for example cross-tables in Excel, or the table function in R; the code is given in Appendix 1)

All stones Open US Total

Failure 77 61 138

Success 273 289 562

Total 350 350 700



• Using these results, it seems that US is slightly better than OS (61/350 failures for US vs 77/350 failures for OS), although the difference is not statistically significant (see Appendix 1 for a χ² test)

• So, in the light of these results, and because US is less invasive than OS, we might ask our physician to use US if needed.

• Then the physician tells you: « I’d rather use OS for large stones, and US for small stones »
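The marginal comparison can be checked without the data file by entering the counts of the « All stones » table directly. This is a minimal sketch; the object names are arbitrary.

```r
# Counts taken from the "All stones" table (columns: Open, US)
ct <- matrix(c(77, 273, 61, 289), nrow = 2,
             dimnames = list(S_F  = c("Failure", "Success"),
                             Type = c("Open", "US")))

# Failure rate per treatment: 77/350 for Open, 61/350 for US
failure_rate <- ct["Failure", ] / colSums(ct)

# Pearson chi-square test without Yates's continuity correction
test <- chisq.test(ct, correct = FALSE)
test$statistic   # about 2.31
test$p.value     # about 0.13, not significant at the 5% level
```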



• So, let’s look at our data again and let’s distinguish between small and large stones. This leads to 2 distinct tables as follows (appendix 1):

Large stones Open US Total

Failure 71 25 96

Success 192 55 247

Total 263 80 343

Small stones Open US Total

Failure 6 36 42

Success 81 234 315

Total 87 270 357



• If we summarize the results obtained on the same data set, either taking the whole set (« marginal set ») or splitting it into two subsets (« conditional (on the stone size) subsets »):

P(Success)   Marginal   Small   Large
Open         0.78       0.93    0.73
US           0.83       0.87    0.69

• This very disturbing result shows that the conclusions on both conditional subsets differ from the marginal result: OS now seems better than US !!! This strange phenomenon is known as Simpson’s paradox.
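The success rates can be recomputed from the counts of the two conditional tables, which also shows where the reversal comes from. A minimal sketch (object names are arbitrary):

```r
# Successes and totals taken from the small-stones and large-stones tables
success <- matrix(c(81, 192, 234, 55), nrow = 2,
                  dimnames = list(Size = c("Small", "Large"),
                                  Type = c("Open", "US")))
total   <- matrix(c(87, 263, 270, 80), nrow = 2,
                  dimnames = dimnames(success))

conditional <- success / total                     # P(Success | Size, Type)
marginal    <- colSums(success) / colSums(total)   # P(Success | Type)
round(rbind(Marginal = marginal, conditional), 2)

# Share of large stones within each treatment group
share_large <- total["Large", ] / colSums(total)
round(share_large, 2)
```

Open surgery was mostly applied to large stones (263 of 350 patients), which have a lower success rate whatever the treatment, whereas US was mostly applied to small stones (270 of 350). The marginal comparison therefore mixes the effect of the treatment with the effect of the stone size.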


A second example dataset

• Assume the following problem:

• A virus is causing skin lesions on domestic animals.

• The question we are interested in is: « Is there a relationship between the size of the lesions and the time of exposure ? »


• Again, we can use hypothesis testing:

• $H_0: \beta = 0$

i.e., in a linear regression of lesion size on time of exposure, the slope β is zero: the size of the lesion does not change with the time of exposure

• To test this (null) hypothesis, we can start collecting pairs of observations on affected cats, where each pair of observations is:

$x$ = time of exposure ; $y$ = size of the lesion

• The data are presented on the next slide (file « lesions.txt »)




• The first field is the time of exposure in days (a continuous variable)

• The second field is the lesion diameter in mm (a continuous variable)

• The third field is a breed code (either 1, 2, 3, 4 or 5)


• It can be observed that:

• The size seems to decrease with time

• This can be confirmed using a regression of size on time (we will review how to do this...):

• Regression coefficient b = -0.25

• P(b ≤ -0.25 | β = 0) = 0.15

• not significant at the 5% level, although a trend is visible

• But counterintuitive for the vet in charge of the experiment !



• Idea: take the breed into account



• It can be observed that:

• The size seems to increase with time within each breed

• Breed responses differ largely

• This can be confirmed using a linear model (we have to learn how to do that...):

• Regression coefficient b = +0.64

• P(b ≥ 0.64 | β = 0) < 0.001 => highly significant (well below the 5% level)

• (Nuisance) Breed effect also highly significant (p = 0.0002)
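The reversal of the sign of the regression coefficient can be reproduced on simulated data. The sketch below does not use the course file: the breed levels, the within-breed slope (0.6 mm/day) and the observation times are hypothetical, chosen so that breeds with large lesions are observed early.

```r
# Simulated data (assumption): 5 breeds, 10 animals each.
set.seed(3)
breed    <- factor(rep(c("a", "b", "c", "d", "e"), each = 10))
baseline <- rep(c(40, 32, 24, 16, 8), each = 10)                  # breed level (mm)
time     <- rep(c(0, 5, 10, 15, 20), each = 10) + runif(50, 0, 8) # days
size     <- baseline + 0.6 * time + rnorm(50, sd = 1.5)

marginal <- lm(size ~ time)          # breed ignored
adjusted <- lm(size ~ time + breed)  # breed taken into account

coef(marginal)["time"]   # negative: size seems to decrease with time
coef(adjusted)["time"]   # positive: size increases with time within each breed
anova(marginal, adjusted)  # F test of the breed effect
```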



Modeling...

• Conclusions of these 2 examples:

• Simple tests might be misleading...

• A solution could be to use models that allow several factors to be tested at the same time

• In our first example:

• Treatment effect (open surgery or ultra-sounds)

• Kidney stone size effect (large or small)

• « Interaction » between these two effects (does the potential difference between treatments remain the same for small and large stones ?)

• In our second example:

• Time effect (continuous variable) ?

• Breed effect (differences between breeds) ?

• « Interaction » between these two effects (does the difference between breeds remain stable with time ?)

Models

• A general (matrix) form for models that make explicit the relationship between the dependent variables Y and the independent ones X is:

Y = f(X)

where:

• Y is a table of K measures taken on N individuals (« dependent variables »)

• Yij is the jth measure on individual i

• X is another table of M measures taken on the N individuals (« independent variables »)

• f(.) is a function connecting X and Y.

• The exact form of f(.) is part of « the model »

Models: a subset...

• In this course:

• K = 1 (univariate models)

• Unless otherwise stated:

Y = f(X) = X * β + ε (« linear model »)

where the different components Y, X, β and ε will be explained later...

• A (maybe more explicit) way of writing this relationship is:

$y_i = x_{i1} \beta_1 + x_{i2} \beta_2 + \dots + x_{iM} \beta_M + \varepsilon_i, \quad i = 1, \dots, N$

where $x_{i1}, \dots, x_{iM}$ are the values of the independent variables measured on individual i, $y_i$ is the corresponding value of the dependent variable, $\beta_1, \dots, \beta_M$ are the M parameters of the model, and $\varepsilon_i$ is a quantity that is not explained by the model (called the residual).
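For reference, stacking the N individual equations gives the matrix form Y = X * β + ε. The layout below is a sketch using the same symbols, with the dimensions of each component indicated under the braces:

```latex
\[
\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}}_{Y\ (N \times 1)}
=
\underbrace{\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1M} \\
x_{21} & x_{22} & \cdots & x_{2M} \\
\vdots & \vdots &        & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{NM}
\end{pmatrix}}_{X\ (N \times M)}
\underbrace{\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_M \end{pmatrix}}_{\beta\ (M \times 1)}
+
\underbrace{\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{pmatrix}}_{\varepsilon\ (N \times 1)}
\]
```

Row i of this product is exactly the equation written for individual i.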


• So, we will restrict ourselves to models where the relationship between the dependent variable (Y) and the independent variables (X) takes a linear form, which is very useful in many practical situations...


Back to the dermatology problem

• A model can be fit using the linear form given above:

yi = α*ti + βi + εi

• yi = size of the lesion on animal i

• ti = time of exposure to the virus

• a continuous variable, sometimes referred to as a covariate

• α = (linear) regression coefficient

• βi = effect on y of the breed of animal i

• a discrete variable (breed is "a","b","c","d" or "e")

• εi = residual for animal i


• In this model:

yi = α*ti + βi + εi

• The (unknown) parameters are:

• α = (linear) regression coefficient

• βa, βb, βc, βd, βe = effects of each of the breeds

• These parameters will need to be estimated

• Hypotheses about these parameters will be tested:

• H01: α = 0 (=> no (linear) effect of time on y)

• H02: βa = βb = βc = βd = βe (=> no difference between breeds)

• H03: ...
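As a preview of how such a model and its hypotheses can be handled in R, here is a minimal sketch on simulated data. The sample size, the value of α (0.5) and the breed effects are hypothetical; the course data file is not used.

```r
# Simulated data (assumption): 5 breeds, 8 animals each.
set.seed(5)
breed <- factor(rep(c("a", "b", "c", "d", "e"), each = 8))
time  <- runif(40, 1, 30)                            # days of exposure
beta  <- c(a = 10, b = 12, c = 14, d = 16, e = 18)   # assumed breed effects
size  <- unname(0.5 * time + beta[as.character(breed)] + rnorm(40, sd = 2))

# "0 +" removes the intercept, so that R estimates alpha and one beta per breed
fit <- lm(size ~ 0 + time + breed)
coef(fit)   # time (alpha), breeda, ..., breede

# H01: alpha = 0, tested with the t test on the time coefficient
p_H01 <- summary(fit)$coefficients["time", "Pr(>|t|)"]

# H02: equal breed effects, tested by comparing nested models with an F test
reduced <- lm(size ~ time)   # a single common level for all breeds
p_H02   <- anova(reduced, fit)[2, "Pr(>F)"]
```

By default (without `0 +`), R would instead estimate an intercept for the first breed and the differences of the other breeds from it; both parametrizations describe the same model.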


Back to the stones problem

• A model can be fit using the general form given above

Y = f(X)

• This is an example of a non-linear relationship:

f(Xi) ≠ Xi * β + ei...

• Actually, the probability of a success for individual i is modelled as:

$p_i = \dfrac{e^{X_i \beta}}{1 + e^{X_i \beta}}$

• Some details of the resolution are provided in Appendix 2
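The logistic function used above and its inverse (the logit) can be written in a few lines of R. The helper names `logit` and `inv_logit` are introduced here for illustration; `plogis` is R's built-in logistic function.

```r
# Hypothetical helper names; only plogis() is a built-in R function
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))

x <- c(-5, -1, 0, 1, 5)    # values of the linear predictor X_i * beta
p <- inv_logit(x)          # probabilities, always strictly between 0 and 1

all.equal(p, plogis(x))    # same result as the built-in function
all.equal(logit(p), x)     # the logit maps the probabilities back to x
```

Whatever the value of the linear predictor, the resulting probability stays between 0 and 1, which a linear model Xi * β + ei cannot guarantee.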


What next ?

• Before going into more detail on linear models, we will need some useful tools:

• A math tool: matrices

• A software tool: we will use R in this course

• These two tools will be the subject of the next lectures.


Appendix 1

This appendix reports some R code to work on the « stones » problem

• Lines in green correspond to comments explaining the code

• Lines in blue correspond to R commands to be typed in the « console » of the software

• Lines in red correspond to text returned by the software after typing a command.


Appendix 1

The R code to obtain the needed table for the stones problem:

# Read the file (in the "d:\courses\" directory on my PC)
# "file=" provides the name of the file (forward slashes are used because
# a backslash is an escape character inside an R string)
# "header=TRUE" tells R that the file contains a header
# (a first line with the column names)
# "sep="\t"" means that fields on a line are separated by tabs
f <- read.table(file = "d:/courses/pierres.txt", header = TRUE, sep = "\t")
# The file is now completely in f ! f$Type contains the 700 types
# and f$S_F contains the 700 results (the names of these variables
# are extracted from the header line)
# Building the table is now very simple:
table(f$S_F, f$Type)

          Open  US
Failure     77  61
Success    273 289

Appendix 1 (cont’d)

The R code to obtain a χ² test for the stones problem:

# We could put the table in an R variable:
ct <- table(f$S_F, f$Type)
# Next, we could instruct R to perform a chi-square test on the table
chisq.test(ct, correct = FALSE)

Pearson's Chi-squared test

data: ct

X-squared = 2.3106, df = 1, p-value = 0.1285

# No significant difference (p=0.1285). correct=FALSE instructs R not to
# perform Yates's correction (this correction is useful when numbers
# are small)


Appendix 1 (cont’d)

The R code to obtain distinct tables for small and large stones:

# Select only the records where the Size variable is "Small":
table(f$S_F[f$Size == "Small"], f$Type[f$Size == "Small"])

          Open  US
Failure      6  36
Success     81 234

# Select only the records where the Size variable is "Large":
table(f$S_F[f$Size == "Large"], f$Type[f$Size == "Large"])

          Open  US
Failure     71  25
Success    192  55


Appendix 2: solving the stones problem

Since this type of situation is not the target of the course, only a list of steps will be provided here, with few details.

1. Show that E(y) = success probability:

E(y) = Σ y*p(y) = 1*p + 0*(1-p) = p

2. Provide a link function

• Common choice: logit function

logit(p) = ln[ p / (1-p) ]

• p → 0 => logit(p) → - ∞

• p → 1 => logit(p) → + ∞

=> the range of p, ]0; 1[, is mapped to ] - ∞; + ∞[

3. Obtain the « probability » function

logit(p) = Xβ => p = exp(Xβ)/[1+exp(Xβ)]


4. The « best » model must first be obtained using « model selection » procedures. The final model is:

logit(pi) = µ + αi + βi

where:

• pi is the probability of success for i

• αi is the effect of the size (L or S) for i

• βi is the effect of the treatment (O or US) for i.


5. Compute the likelihood of the data

• Assuming independence:

$L = \prod_i p_i^{\delta_i} (1 - p_i)^{1 - \delta_i}$

where $\delta_i$ = 1 if the treatment of i is a success and 0 otherwise

6. Obtain the estimators of β that maximize the (log)likelihood

• Not easy (no « closed-form solution ») in general !

• Optimization software needed

• See the analytical development and an Excel solution for our problem in the attached documents.


7. Parameter estimation through ML leads to the following solutions:

m = 1.4849
aSMALL = 0.6303
aLARGE = -0.6303
bOPEN = 0.1786
bUS = -0.1786

ln(L) = -331.4333



8. Risk assessment can be obtained through odds ratios (OR) with their 95% confidence intervals (CI):

OR(L vs S) = exp(-0.6303-0.6303) = 0.283

CI(95%) = [0.177;0.453]

=> success probability significantly decreased if stone is large

OR(O vs US) = exp(0.1786+0.1786) = 1.429

CI(95%) = [0.912;2.239]

=> success probability not significantly affected by treatment type
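Steps 4 to 8 can also be carried out with the `glm` function of R, which maximizes the likelihood numerically. The sketch below enters the counts of the small-stones and large-stones tables directly. It assumes that the estimates above correspond to sum-to-zero contrasts (suggested by aSMALL = -aLARGE and bOPEN = -bUS) and that the intervals are Wald intervals; the object names are arbitrary.

```r
# Counts taken from the small-stones and large-stones tables
stones <- data.frame(
  Size    = factor(c("Small", "Small", "Large", "Large")),
  Type    = factor(c("Open", "US", "Open", "US")),
  Success = c(81, 234, 192, 55),
  Failure = c(6, 36, 71, 25)
)

# Sum-to-zero contrasts (assumption), so that aSMALL = -aLARGE and bOPEN = -bUS
fit <- glm(cbind(Success, Failure) ~ Size + Type, family = binomial,
           data = stones,
           contrasts = list(Size = "contr.sum", Type = "contr.sum"))
est <- coef(fit)   # (Intercept) = m, Size1 = aLARGE, Type1 = bOPEN

# Log-likelihood of the 700 individual (success/failure) outcomes
p_hat   <- fitted(fit)
log_lik <- sum(stones$Success * log(p_hat) + stones$Failure * log(1 - p_hat))

# Odds ratios with Wald 95% confidence intervals: the log-OR between the
# two levels of a factor is twice the sum-to-zero coefficient
se   <- sqrt(diag(vcov(fit)))
wald <- function(b, s) {
  exp(2 * b + c(OR = 0, lower = -1, upper = 1) * qnorm(0.975) * 2 * s)
}
or_size <- wald(est[["Size1"]], se[["Size1"]])   # Large vs Small
or_type <- wald(est[["Type1"]], se[["Type1"]])   # Open vs US
```

The log-likelihood is computed by hand because `logLik(fit)` on grouped counts also includes binomial coefficients, which do not depend on the parameters.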


