statistical decision-making -...
TRANSCRIPT
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Statistical Decision-Making
Hong Li
May. 3, 2016
1
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Objectives
Explain steps in hypothesis testing
Describe null and alternative hypothesis
Define p-value (both one-sided and two-sided)
Define Type-I and Type-II errors
Define power
Understand how to interpret confidence interval (CI) inthe context of hypothesis testing
2
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Example: Avastin in Gastric Cancer (AVAGAST)
Paper
Bevacizumab in Combination with Chemotherapy AsFirst-Line Therapy in Advanced Gastric Cancer: ARandomized Double-Blind Placebo-Controlled Phase IIIStudy (Journal of Clincial Oncology Volume 29, Number30, Oct. 20, 2011, pp3968-3976)
Purpose
Evaluate the efficacy of adding bevacizumab tocapecitabine-cisplatin in the first-line treatment ofadvanced gastric cancer
Study design
Double-blind, placebo-controlled Phase III clinical trial
3
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Example: Avastin in Gastric Cancer (AVAGAST),
cont.
Treatment: in addition to cisplatin and capecitabine
Treatment arm: BevacizumabControl arm: placebo
Sample size
774 in total with 387 in each arm
Primary endpoint
Overall survivalAssessed every 3 months
4
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Example: Avastin in Gastric Cancer (AVAGAST),
cont.
Overall response rate (ORR)
Overall response rate was improved significantly with theaddition of bevacizumab (46% vs. 37.4% in the placebogroup; p-value=0.0315)
5
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Hypothesis Testing
Statistical hypothesis is an assumption about apopulation parameter
Assumption may or may not be true
Hypothesis testing refers to the formal procedures used bystatisticians to accept or reject statistical hypotheses
6
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Hypothesis Testing,cont.
There are two types of statistical hypotheses:
Null hypothesis (H0): usually the hypothesis that sampleobservations result purely from chanceAlternative hypothesis (HA): the hypothesis that sampleobservations are influenced by some non-random cause
7
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Hypothesis Testing, cont.
State the hypotheses. This involves stating the null andalternative hypotheses. The hypotheses are stated in sucha way that they are mutually exclusive. That is, if one istrue, the other must be false
Formulate an analysis plan. The analysis plan describeshow to use sample data to evaluate the null hypothesis.The evaluation often focuses around a single test statistic
Analyze sample data. Find the value of the test statistic(mean score, proportion, t statistic, z-score, etc.)described in the analysis plan
Reject the null or fail to reject the null, depending on thep-value’s extremeness
8
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
What Is Population Parameter?
Statistician differentiate between that which is beingestimated and the estimate itself
That which is being estimated is called a populationparameter and we (typically) assume it has a true butunknown value
In this example, there are two population parameters
True (but unknown) ORR for the population of patientsrepresented by those in the study receivingbevacizumab-pbevTrue (but unknown) ORR for the population of patientsrepresented by those in the study receiving placebo-pplac
9
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Hypothesis Testing: Assume a Null Condition
Statistical testing starts with a statement of the nullhypothesis
H0 : pbev = pplac
This can also be written
H0 : pbev − pplac = 0
10
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Hypothesis Testing: State Alternative Hypothesis
Alternative hypothesis is the state of nature we will acceptif the evidence in the data suggests the null is implausible
HA : pbev > pplac
This can also be written
HA : pbev − pplac > 0
This is called a one-tailed or one-sided test
Note that we do not actually specify a value forpbev − pplac under the alternative. We simply state thedirection of the effect (> 0, < 0, or 6= 0)
11
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Hypothesis Testing: Formulate Analysis Plan
Intutively, it makes sense that any test for a comparisonof ORRs should be constructed from their estimates,namely p̂bev and p̂plac
From the results stated in the paper
p̂bev = 46% and p̂plac = 37.4%
or equivalently
p̂bev − p̂plac = 8.6%
Question: are these estimates meaningfully different, or isthe magnitude of their difference a chance occurrence-theluck of the draw?
12
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Hypothesis Testing: Formulate Analysis Plan
(cont.)
The test is constructed based on the data (p̂bev − p̂plac)
Goal of test: determine how unlikely it is to observe adifference in the estimated ORRs at least as large as theone we’ve observed, if in fact there really is no differencein ORRs
Allows us to make a probabilistic statement about thelikelihood of the observed difference in ORRs beingattributable to chance and not to the intervention
Requires understanding the sampling distribution ofp̂bev − p̂plac
13
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Sampling Distribution
Sampling distribution of a statistic is the distribution ofthat statistic when derived from a random sample of sizen
Example: the test is constructed under the nullhypothesis
Sampling distirbution as a histogram of all possiblevalues of p̂bev − p̂plac you could observed if in fact thetrue ORRs, pbev and pplac are equal
14
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Example based on ORRs
pbev = 46% and pplac = 37.4%
H0: pbev = pplac
Under H0, the best estimate of the true underlying ORR(for either arm) is a pooled estimate
Results from the paper
p̂bev = 143/311 = 0.46
p̂plac = 111/297 = 0.374
The pooled estimate
p̂pooled = (143 + 111)/(311 + 297) = 0.418
15
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Repeated Sampling under H0
Simulate 10000 data sets with sample sizes of 311 and 297,and assume pbev = pplac = ppooled = 0.418. For each sample,compute pbev and pplac and construct their difference.
16
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Histogram of Resulting Differences
17
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
What Is P-value?
Question: How can we determine ”likely” or ”unlikely” bydetermining the probability-assuming the null hypothesiswere true- of observing a more extreme test statistic inthe direction of the alternative hypothesis than the oneobserved
Anwser: we estimate the probability of observing adifference in ORRs as or more extreme than the one weobserved in our study. This probability is called a p-value
The p-value is defined as the probability, under theassumption of null hypothesis, of obtaining a result equalto or more extreme than what was actually observed
18
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
P-value for Our Example
19
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
P-value for Our Example (cont.)
Approximate the theoretical distribution of the differences(blue line overlaing the histogram) using a normaldistribution
P-value is the red-shaded area in the plot
Example: the p-value is approximately 0.016
This probability represents the likelihood of obtaining asample proportion that is at least as extreme as oursample proportion in right tail of the distribution bychance if the null hypothesis is trueThe likelihood of observing a difference of 8.6% (or oneeven larger) by chance alone is less than 2%
20
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Other Alternatives
Another one-tailed test would be
HA : pbev < pplac
P-value would be approximated by the relative area to theleft of the observed difference.
21
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Other Alternatives (cont.)
22
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
P-value in the Paper
A two-tailed or two-sided test is given by the alternativehypothesis
HA : pbev 6= pplac
P-value is approximated by the sum of the relative areasto the left of -0.086 and to the right of 0.086. Theauthors performed a two-sided test and report a p-valueof 0.0315.
23
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
P-value in the Paper (cont.)
24
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Definition of P-value
The probability of observing a result at least as extremeas the one you observed if the null is true
P value is not the probability of making a mistake byrejecting a true null hypothesis
25
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Make Decision
Make a judgement about the null hypothesis based onp-value
Small p-value: observing the outcome or one moreextreme due to chance alone is highly improbable
Reject the null hypothesis in favor of the alternative (callfinding statistically significant)Common practice: p-value to be less than 0.05
P-value is not small: observing the outcome or one moreextreme due to chance alone is probable
Fail to reject the nullUnable to rule out random chance as a plausibleexplanation for any observed difference
26
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Actions and Errors
Two types of errors one can commit in conducting astatistical test
27
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Actions and Errors (cont.)
Type I error
The probability of rejecting the null given that the null istrue, denoted by α
Type II error
The probability of failing to reject the null given that thealternative is true, denoted by β
28
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Power
It is the probability of correctly rejecting the null in favorof the alternative. A well-designed study will strike abalance between acceptable levels of Type I error (usually0.05) and power (often set at a minimum of 0.8).
29
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Power (cont.)
The power of a hypothesis test is affected by three factors
Sample size
Other things being equal, the greater the sample size,the greater the power of the test
Significance level (α)
The higher the significance level, the higher the power ofthe testIncrease the significance level, reduce the region ofacceptance. Less likely to accept the null hypothesiswhen it is false; i.e., less likely to make a Type II error.Hence, the power of the test is increased
30
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Power (cont.)
The power of a hypothesis test is affected by three factors
The ”true” value of the parameter being tested
The greater the difference between the ”true” value of aparameter and the value specified in the null hypothesis,the greater the power of the test
31
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Common Endpoints to Compare Groups
Difference in means
Compares a continuous variable between two groupsNull value is 0two-sample t-test or Wilcoxon rank sum test(independent groups)paired t-test or Wilcoxon signed rank test (pairedgroups)
Difference in proportions
Compares a categorical variable between two groupsNull value is 0Chi-square test or Fisher’s exact test (independentgroups)McNemar’s test (paired groups)
32
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Common Endpoints to Compare Groups (cont.)
Odds Ratio (OR)
OR = odds of event E for subjects in group A divided byodds of event E for subjects in group BCompares a binary variable between groups A and BOR > 0Null value is 1In our example, OR = 0.46/(1−0.46)
0.374/(1−0.374)
.= 1.43
There is a 43% increase in the odds of responsecomparing patients randomized to bevacizumab topatients randomized to placebo
33
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Common Endpoints to Compare Groups (cont.)
Hazard Ratio (HR)
HR = hazard of death in group A divided by hazard ofdeath in group B (‘hazard’ = ‘risk’)Compares a time-to-event variable between groups Aand BHR > 0Null value is 1Log rank test or Cox proportional hazards regression
34
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Putting Some of This together
”On the basis of a systematic literature review, it was assumedthat median OS in the placebo group would be 10 months.The study was powered to test the hypothesis that theaddition of bevacizumab would improve median OS to 12.8months, equivalent to a hazard ratio (HR) of 0.78, 509 deathswere necessary to ensure 80% power for a two-sided log-ranktest at a significance level of 0.05”
35
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Confidence Interval (CI)
CI is a type of interval estimate of a population parameter
Estimated range of values that provides a measure ofuncertainty associated with an estimated parameter
The wider the interval the greater the uncertainty in theestimate
95% CI means
If you were able to replicate the exact same study aninfinite number of times, 95% of the resulting CIs wouldcontain the true parameter of interest
36
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
CI and Hypothesis Testing
”PFS was prolonged significantly in the bevacizumabgroup compared with the placebo group (HR, 0.8; 95%CI, 0.67 to 0.93; p-value=0.037)”
The best estimate of the true HR is 0.80, but our dataare consistent with a HR ranging from 0.67 to 0.93CI does not contain 1 which means we have even greaterfaith that the benefit observed is real and not just achance occurrence
37
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
CI and Hypothesis Testing (cont.)
S be the statistic used to perform a test that comparestwo groups. For example, S could be a difference insample means, a difference in sample proportions, anestimated OR, or an estimated HR. If the test issignificant at α-level 0.05, then the 95% CI onstructedwith S will not contain the relevant null value.
38
Outline AVAGAST Hypothesis Testing P-value Errors Confidence Interval
Subgroup Analysis of Overall Survival in the
Intent-to-Treat Population
39