INTERVAL ESTIMATION FOR THE MEAN OF THE SELECTED POPULATIONS
By
CLAUDIO FUENTES
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2011
© 2011 Claudio Fuentes
To my parents, who have been there in every step
ACKNOWLEDGMENTS
I would like to gratefully and sincerely thank Dr. George Casella for his guidance,
understanding and patience during my graduate studies at the University of Florida.
Working with him, as a research assistant and as a student, has been one of the most
rewarding experiences of my life. His wealth of knowledge and experience has shaped
the way I understand statistics today.
I would also like to thank my graduate committee members, Dr. Michael Daniels,
Dr. Malay Ghosh and Dr. Gary Peter, for their understanding and support throughout
the whole process. Their sharp comments and suggestions have greatly improved the
quality of this work.
I am deeply grateful to all my teachers and professors, in particular those at
the University of Florida and the Pontificia Universidad Catolica de Chile. It is not an
exaggeration to say that almost everything I know today is the product of their dedication
and excellence in teaching. Without any doubt, they taught me more than I could
learn. Thank you, Dr. Alvaro Cofre; I would not be here writing these lines if it were not for
your constant support and inspiration.
Finally, I would like to thank my parents, Jorge Fuentes and Edith Melendez. It is
because of their unconditional love and support that I have been able to come this far.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Two Formulations of the Problem
  1.2 Inference on the Selected Mean

2 INTERVAL ESTIMATION FOLLOWING THE SELECTION OF ONE POPULATION
  2.1 The Known Variance Case
  2.2 The Unknown Variance Case
  2.3 Numerical Studies
  2.4 Tables and Figures

3 CONFIDENCE INTERVALS FOLLOWING THE SELECTION OF k ≥ 1 POPULATIONS
  3.1 An Alternative Approach
  3.2 Numerical Studies
  3.3 Tables and Figures

4 INTERVAL ESTIMATION FOLLOWING THE SELECTION OF A RANDOM NUMBER OF POPULATIONS
  4.1 Connection to FDR
  4.2 Tables and Figures

5 APPLICATION EXAMPLE
  5.1 Fixed Selection
  5.2 Random Selection
  5.3 Tables and Figures

6 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

2-1 Configuration of the new parameterization for the coverage probability
2-2 Configuration of the new parameterization for the case p = 3
2-3 Representation of the parameters ∆i,j when p = k + 1
2-4 Coverage probability of 95% CI for the selected mean when p = 4
3-1 Structure of the ∆'s for the case p = 4, k = 2
3-2 Coverage probabilities for the number of population means vs the number of selected populations
3-3 Observed confidence coefficient for 95% CI when p = 6
3-4 Cutoff points for 95% CI using the new method
5-1 Confidence intervals for fixed top log-score differences
5-2 Confidence intervals for random top log-score differences
LIST OF FIGURES

2-1 Coverage probability as a function of ∆21 and ∆32 when p = 3
2-2 Plot of ∂h/∂∆21 when p = 3
2-3 Plots of the first two terms of ∂h/∂∆21
2-4 Confidence coefficient vs the number of populations for the iid case and α = 0.05
2-5 Cutoff point versus number of populations for the iid case and α = 0.05
3-1 Coverage probabilities as a function of ∆ when p = 6
4-1 Individual components of the coverage probability for random K
4-2 Lower bound for random K varying the selection probability
4-3 Coverage probabilities for random K for different values of p
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
INTERVAL ESTIMATION FOR THE MEAN OF THE SELECTED POPULATIONS
By
Claudio Fuentes
August 2011
Chair: Dr. George Casella
Major: Statistics
Consider an experiment in which p independent populations πi, with corresponding
unknown means θi, are available, and suppose that for every 1 ≤ i ≤ p we can obtain
a sample Xi1, ..., Xin from πi. In this context, researchers are sometimes interested
in selecting the populations that give the largest sample means as a result of the
experiment, and in estimating the corresponding population means θi. In this dissertation,
we present a frequentist approach to the problem, based on the minimization of the
coverage probability, and discuss how to construct confidence intervals for the mean
of k ≥ 1 selected populations, assuming the populations πi are normal and have a
common variance σ2. Finally, we extend the results for the case when the value of k
is randomly chosen and discuss the potential connection of the procedure with false
discovery rate analysis. We include numerical studies and a real application example
that corroborate that this new approach produces confidence intervals maintaining the
nominal coverage probability while taking into account the selection procedure.
CHAPTER 1
INTRODUCTION
Given a set of p available technologies (treatments, machines, etc.), researchers
must often determine which one is the best, or simply rank them according to a
pre-specified criterion. For instance, researchers may be interested in determining which
treatment is more efficient in fighting a certain disease, or they could be interested in
ranking a class of vehicles according to a safety standard. Problems of this type are
known as ranking and selection problems, and specific solutions and procedures have
been proposed in the literature since the second half of the 20th century, with a start that
is usually traced back to Bechhofer (1954) and Gupta and Sobel (1957). In his paper,
Bechhofer presents a single sample multiple decision procedure for ranking means of
normal populations. Assuming the variances of the populations are known, he is able
to obtain closed form expressions for the probabilities of a correct ranking in different
scenarios. This approach is more concerned with selection of the population with the
largest mean rather than estimation of that mean. Gupta and co-authors have pioneered
the subset selection approach, in which a subset of populations is selected so that it
contains the population with the largest mean with a guaranteed minimum probability
P∗ (see Gupta and Panchapakesan (2002)); Bechhofer, in contrast, uses an indifference
zone: there is a minimum guaranteed probability of selecting the population with the
largest mean, as long as that mean is separated from the second largest by a specified
distance δ (see Bechhofer et al. (1995)).
1.1 Two Formulations of the Problem
Here we are concerned with estimation, and describe two formulations of this
problem, with subtle differences between them. Suppose that we have p populations,
with unknown means θi (1 ≤ i ≤ p). Assuming that for every 1 ≤ i ≤ p we can obtain a
sample Xi1, ... ,Xini from the population πi , we can either:
1. Select the population that has the largest parameter, max{θ1, ..., θp}, and estimate its value.
2. Select the population with the largest sample mean, and estimate the corresponding θi.
The first of these problems has been widely discussed in the literature. For
example, Blumenthal and Cohen (1968) consider estimating the larger mean from
two normal populations and compare different estimators, but they do not discuss how
to make the selection. In this direction, Guttman and Tiao (1964) propose a Bayesian
procedure consisting of the maximization of the expected posterior utility for a certain
utility function U(θi). In the same direction, but from a frequentist perspective, Saxena
and Tong (1969), Saxena (1976), and Chen and Dudewicz (1976) consider point and
interval estimation of the largest mean.
1.2 Inference on the Selected Mean
Surprisingly, the second problem has received less attention. In this context, a
common and widely used estimator is δ(X) = ∑_{i=1}^{p} Xi I(Xi = X(p)), where X(p)
denotes the largest sample mean. This estimator has been discussed in the literature
and is known to be biased (Putter and Rubinstein (1968)). This issue becomes clear if
we consider all the populations to be identically distributed, for then we would be
estimating the common population mean by an extreme value.
Dahiya (1974) addresses this problem for the case of two normal populations and
proposes estimators that perform better in terms of the MSE. Progress was made by
Cohen and Sackrowitz (1982), Cohen and Sackrowitz (1986) and Gupta and Miescke
(1990), where Bayes and generalized Bayes rules were obtained and studied. However,
performance theorems are scarce. One exception is Hwang (1993), who proposes an
empirical Bayes estimator and shows that it performs better in terms of the Bayes risk
with respect to any normal prior. Another exception is Sackrowitz and Samuel-Cahn
(1984) who, in the case of the negative exponential distribution, find UMVUE and
minimax estimators of the mean of the selected population.
The problem of improving the intuitive estimator is technically difficult. In addition,
despite the obvious bias problem, it has been difficult to establish its optimality
properties. Standard investigations in admissibility and minimaxity, following ideas
such as Berger (1976), Brown (1979) and Lele (1993) are not straightforward. In
this direction, Stein (1964) established the minimaxity and admissibility of the naive
estimator for k = 2. Minimaxity for the general case was established later by Sackrowitz
and Samuel-Cahn (1986), where they discussed the normal case for k ≥ 3. Admissibility
for the general case appears to still be open.
Interval estimation is equally challenging, and again little can be found in the
literature. Typically, confidence intervals are constructed in the usual way,
using the standard normal distribution as a reference to attain the desired coverage
probability. However, these intervals do not maintain the nominal coverage probability as
the number of populations increases.
Qiu and Hwang (2007) propose an empirical Bayes approach to construct
simultaneous confidence intervals for K selected means; we are not aware of
any other attempts to solve this problem. In their paper, Qiu and Hwang consider a
normal-normal model for the mean of the selected population, which assumes that
each population mean θi follows a normal distribution. Under these assumptions they
are able to construct simultaneous confidence intervals that maintain the nominal
coverage probability and are substantially shorter than the intervals constructed
using Bonferroni bounds. However, the confidence intervals they propose are only
asymptotically optimal, and since their coverage probabilities are obtained by averaging
over both the sample space and the prior, they do not give a valid frequentist interval.
Recently, a modern variation of this problem has become very popular, with a
major reason being the explosion of genomic data, calling for the development of
new methodologies. For instance, in genomic studies, looking either for differential
expression or genome wide association, thousands of genes are screened, but only
a smaller number are selected for further study. Consequently, the assessment of
significance, through testing or interval estimation, must take this selection mechanism
into account. If the usual confidence intervals are used (not accounting for selection), the
actual confidence coefficient is smaller than the nominal level and approaches zero as
the number of genes (populations) increases.
In this dissertation, we address the problem of interval estimation and present
a frequentist approach to construct confidence intervals for the means of the selected
populations, where the selection mechanisms are properly described in the corresponding
chapters. In Chapter 2 we focus on the problem of selecting one population. In Chapter
3 we introduce a novel methodology to produce confidence intervals when selecting
k > 1 populations, where k is a fixed and known number. Later, in Chapter 4 we extend
the results to the case when k is a random quantity. In Chapter 5 we illustrate the
methods with a real data example, and finally, in Chapter 6 we discuss the main
conclusions and possible extensions of the results presented in this dissertation.
CHAPTER 2
INTERVAL ESTIMATION FOLLOWING THE SELECTION OF ONE POPULATION
For 1 ≤ i ≤ p, let Xi1, ..., Xin be a random sample from a population πi with unknown
mean θi and variance σ². Assume the populations πi are independent and normally
distributed, so that the sample mean Xi = n^{-1} ∑_{j=1}^{n} Xij ∼ N(θi, σ²/n) for i = 1, ..., p, and
define the order statistics X(1), ..., X(p) as the sample means placed in descending order.
In other words, the order statistics satisfy X(1) ≥ ... ≥ X(p). In this context, we want
to construct confidence intervals for the mean of the population that gives the largest
sample mean as a result of the experiment.

Formally, if we define θ(1) = ∑_{i=1}^{p} θi I(Xi = X(1)), our aim is to produce confidence
intervals for θ(1), based on X(1), such that the confidence coefficient is at least 1 − α, for
any 0 < α < 1 specified prior to the experiment.
It is not difficult to realize that the standard confidence intervals do not maintain
the nominal coverage probability. For instance, if all the populations πi are normally
distributed with mean θ and variance 1, then, for samples of size n = 1, we have
X1, ..., Xp ∼ iid N(θ, 1). It follows that P(X(1) ≤ x) = Φ^p(x − θ), where Φ(·) denotes
the cdf of the standard normal distribution. Moreover, the mean of the selected population
is θ(1) = θ, and hence

P(θ(1) ∈ X(1) ± c) = Φ^p(c) − Φ^p(−c),

for any value of c > 0.

In particular, when p = 3, we obtain

P(θ(1) ∈ X(1) ± c) = Φ^3(c) − Φ^3(−c)
  = (Φ(c) − Φ(−c)) (Φ^2(c) + Φ(c)Φ(−c) + Φ^2(−c))
  = (2Φ(c) − 1) (1 − Φ(c) + Φ^2(c)).
Since 1 − Φ(c) + Φ^2(c) < 1, the coverage of the standard confidence interval falls below
the nominal level 2Φ(c) − 1. In fact, it is easy to show that the coverage probability
maintains the nominal level only for p = 1 and 2, and then decreases towards zero as p
goes to infinity.
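The rate of this decay is easy to check numerically. The following sketch (an illustration we add here, not part of the original text) evaluates Φ^p(c) − Φ^p(−c) at the standard cutoff c = 1.96:

```python
from math import erf, sqrt

def Phi(x):
    # cdf of the standard normal distribution, via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def naive_coverage(p, c=1.96):
    # coverage Phi^p(c) - Phi^p(-c) of the standard interval X(1) +/- c
    # when all p population means are equal (the iid case)
    return Phi(c) ** p - Phi(-c) ** p

for p in (1, 2, 5, 10, 30):
    print(p, round(naive_coverage(p), 4))
```

For p = 1 and p = 2 the coverage is essentially the nominal 0.95, and it then drops quickly, falling below 0.50 by p = 30, in line with the discussion above.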
The problem is that the traditional intervals do not take into account the selection
mechanism. Thus, in order to construct confidence intervals that maintain the nominal
level we must take into account the selection procedure. To this end, we first consider
the partition of the sample space induced by the order statistics and write
P(θ(1) ∈ X(1) ± c) = ∑_{i=1}^{p} P(θi ∈ Xi ± c, Xi = X(1)).   (2–1)
Observe that each term in the sum (2–1) can be explicitly determined using the joint
distribution of (X1, ... ,Xp). For example, when i = 1 (the first term of the sum), we have
P(θ1 ∈ X1 ± c ,X1 = X(1)) = P(θ1 ∈ X1 ± c ,X1 ≥ X2, ... ,X1 ≥ Xp). (2–2)
In the next section we derive a closed form expression for the coverage probability in
(2–1), assuming the population variance σ2 is known, and present a new approach to
obtain the desired confidence intervals.
2.1 The Known Variance Case
Suppose the population variance σ² is known and define Zj = √n(Xj − θj)/σ for
j = 1, ..., p. It follows that Z1, ..., Zp ∼ iid N(0, 1) and

X1 ≥ Xj ⇔ √n(X1 − θ1)/σ ≥ √n(Xj − θj + θj − θ1)/σ
        ⇔ Z1 ≥ Zj + ∆j1
        ⇔ Z1 − Zj ≥ ∆j1,

where ∆j1 = √n(θj − θ1)/σ for j = 1, ..., p.
At this point, to simplify the notation, we take n = σ² = 1. Then, if we consider the
transformation

T : z = z1, ω2 = z1 − z2, ..., ωp = z1 − zp,

we can rewrite (2–2) in terms of ∆21, ..., ∆p1 and obtain

P(θ1 ∈ X1 ± c, X1 ≥ X2, ..., X1 ≥ Xp) = P(|z| ≤ c, ω2 ≥ ∆21, ..., ωp ≥ ∆p1)
  = (1/(2π)^{p/2}) ∫_{−c}^{c} { ∏_{j=2}^{p} ∫_{∆j1}^{∞} e^{−(ωj−z)²/2} dωj } e^{−z²/2} dz.
Notice that for fixed z, the integrals within the curly brackets { } are essentially the
tail probability of a normal distribution centered at z. Therefore, we can write

P(|z| ≤ c, ω2 ≥ ∆21, ..., ωp ≥ ∆p1) = ∫_{−c}^{c} { ∏_{j=2}^{p} Φ(z − ∆j1) } φ(z) dz,
where φ(·) denotes the pdf of the standard normal distribution.
Of course, the same argument is valid for the remaining terms of the sum in (2–1).
It follows that we can fully describe the probability P(θ(1) ∈ X(1) ± c) in terms of a new
set of parameters ∆ij ’s, where ∆ij = θi − θj for 1 ≤ i , j ≤ p. Under this representation,
for every c > 0, the value of the coverage probability P(θ(1) ∈ X(1) ± c) is determined
by the relative distances between the population means θi, i = 1, ..., p. In other
words, the coverage probability defines a function hc(∆) = P(θ(1) ∈ X(1) ± c), where
∆ = (∆11, ∆12, ..., ∆pp) is the vector of possible configurations of the relative distances
∆ij's.
In this context, we can obtain confidence intervals for θ(1) that have (at least) the
right nominal level by first minimizing the function hc. Specifically, given 0 < α < 1, we
can determine the value of c > 0 that satisfies

P(θ(1) ∈ X(1) ± c) ≥ min_∆ hc(∆) = 1 − α.   (2–3)
In order to minimize the function hc , we first notice the following properties of the
parameters ∆ij ’s:
1. ∆jj = 0, for every j.
2. ∆ij = −∆ji, for every i, j.
3. For j > k, ∆jk = ∆j,j−1 + ∆j−1,j−2 + ... + ∆k+1,k.
These properties reveal a certain underlying symmetry in the structure of the
problem. This symmetry is portrayed in Table 2-1 where every entry ∆ij corresponds to
the difference between the values of θi and θj located in row i and column j respectively.
In addition, Property 3 indicates that we only need to consider p − 1 parameters in
order to determine the value of P(θ(1) ∈ X(1) ± c). In fact, for any given ordering of the
parameters θi ’s, we can always choose a representation of the probability in (2–1) based
on p − 1 parameters ∆ij . As a result, we have that the true ordering of the population
means θi ’s is not particularly relevant in this approach, and hence, we will assume
(without any loss of generality) that θ1 ≥ θ2 ≥ ... ≥ θp.
Although the introduction of the new parameterization seems to reduce (in a sense)
the complexity of the problem, the minimization of hc is still difficult: first, because of
the delicate balance existing between the ∆ij's in the full expression (see Table 2-1), and
second, because the formula for the coverage probability is somewhat involved.
To illustrate these problems, let us discuss the case p = 2. We have
P(θ(1) ∈ X(1) ± c) = ∫_{−c}^{c} Φ(z − ∆12) φ(z) dz + ∫_{−c}^{c} Φ(z + ∆12) φ(z) dz
  = ∫_{−c}^{c} [Φ(z − ∆12) + Φ(z + ∆12)] φ(z) dz,

where ∆12 > 0.
Since only the quantity in brackets [ ] depends on ∆12 and φ(z) > 0, it seems
reasonable to think that hc(∆12) = P(θ(1) ∈ X(1) ± c) is minimized at the same point
where gz(∆12) = Φ(z − ∆12) + Φ(z + ∆12) attains its minimum. However, differentiating gz
with respect to ∆12 we obtain

dgz/d∆12 = φ(z + ∆12) − φ(z − ∆12),

which is ≥ 0 when z ≤ 0 and < 0 when z > 0. We observe that the value of the derivative
depends on ∆12 and z, and consequently, the minimum of hc cannot be determined by
simple examination of the behavior of gz.
From the analysis of g′z , we conclude that gz(∆12) is minimized at ∆12 = 0, when
z ≤ 0 and (asymptotically) at ∆12 = +∞, when z > 0. Then, we can establish the
inequality

P(θ(1) ∈ X(1) ± c) ≥ ∫_{−c}^{0} 2Φ(z) φ(z) dz + ∫_{0}^{c} φ(z) dz;
however, this lower bound is not obtained by direct minimization of the coverage
probability and is less appealing. The problem is that a strategy based on this type
of lower bounds may be too conservative and lead to extremely wide intervals when
applied to higher dimensions (p > 2).
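A direct Monte Carlo check makes the situation concrete. The sketch below (an illustration we add; the simulation size, seed, and cutoff c = 1.96 are arbitrary choices) estimates P(θ(1) ∈ X(1) ± c) for a given configuration of means, with n = σ = 1:

```python
import numpy as np

def coverage_mc(theta, c=1.96, reps=200_000, seed=0):
    # Monte Carlo estimate of P(theta_(1) in X_(1) +/- c), taking n = sigma = 1
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    x = rng.normal(theta, 1.0, size=(reps, theta.size))
    sel = np.argmax(x, axis=1)                       # index of the selected population
    covered = np.abs(x[np.arange(reps), sel] - theta[sel]) <= c
    return covered.mean()

# equal means (the least favorable case) vs well-separated means
print(coverage_mc([0.0] * 5))
print(coverage_mc([0.0, 5.0, 10.0, 15.0, 20.0]))
```

With five equal means the estimated coverage is near 0.88, well below the nominal 0.95; with well-separated means the selection is essentially always correct and the coverage is close to 0.95, which corroborates that the equal-means configuration is the troublesome one.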
In order to find a formal solution to the minimization problem, we start with the case
p = 3. For this case, we can fully describe the probability of interest in terms of the two
parameters ∆12 and ∆23, as is shown in Table 2-2. We obtain

P(θ(1) ∈ X(1) ± c) = (1/√(2π)) ∫_{−c}^{c} Φ(z − ∆12) Φ(z − ∆23 − ∆12) e^{−z²/2} dz   (2–4)
  + (1/√(2π)) ∫_{−c}^{c} Φ(z + ∆12) Φ(z − ∆23) e^{−z²/2} dz
  + (1/√(2π)) ∫_{−c}^{c} Φ(z + ∆23) Φ(z + ∆23 + ∆12) e^{−z²/2} dz,

where ∆12, ∆23 ≥ 0 and Φ(·) denotes the cdf of the standard normal distribution.
Preliminary studies suggest that the global minimum of hc(∆12, ∆23) = P(θ(1) ∈ X(1) ± c)
is located at the origin (see Figure 2-1), but a formal proof is required. To this
end, it is sufficient to show that ∂hc/∂∆23 > 0 and ∂hc/∂∆12 > 0.
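As numerical corroboration, the surface in Figure 2-1 can be reproduced by evaluating (2–4) directly. The sketch below (our own illustration; the midpoint rule and the grid of ∆ values are arbitrary implementation choices) computes hc(∆12, ∆23) and checks that the value at the origin, Φ^3(c) − Φ^3(−c), is the smallest over the grid:

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):
    # standard normal pdf
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def h(c, d12, d23, m=4000):
    # midpoint-rule evaluation of the three integrals in (2-4)
    step = 2.0 * c / m
    total = 0.0
    for i in range(m):
        z = -c + (i + 0.5) * step
        total += (
            Phi(z - d12) * Phi(z - d23 - d12)
            + Phi(z + d12) * Phi(z - d23)
            + Phi(z + d23) * Phi(z + d23 + d12)
        ) * phi(z)
    return step * total

h0 = h(1.96, 0.0, 0.0)   # all three means equal
print(round(h0, 4))
```

At the origin the value agrees with Φ^3(c) − Φ^3(−c), and every other grid point gives a larger coverage, consistent with the minimum being at ∆12 = ∆23 = 0.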
Taking partial derivatives with respect to ∆12, we obtain

∂hc/∂∆12 = (1/2π) ∫_{−c}^{c} Φ(z + ∆23) e^{−(∆23+∆12+z)²/2 − z²/2} dz   (2–5)
  − (1/2π) ∫_{−c}^{c} Φ(z − ∆12) e^{−(∆23+∆12−z)²/2 − z²/2} dz
  + (1/2π) ∫_{−c}^{c} Φ(z − ∆23) e^{−(∆12+z)²/2 − z²/2} dz
  − (1/2π) ∫_{−c}^{c} Φ(z − ∆23 − ∆12) e^{−(∆12−z)²/2 − z²/2} dz.
Since the partial derivative depends on both ∆12 and ∆23, the behavior of its sign
is not obvious, but different numerical studies support the idea that the derivative is
non-negative. Figure 2-2 shows the plot of the integrand of ∂hc/∂∆12 for fixed values of
∆12 and ∆23.
Notice that if we group the first two terms and the last two terms of (2–5), we can
look at the partial derivative as the sum of two differences. In Figure 2-3 we observe (in
separate plots) the integrands of the first two terms of the partial derivative ∂hc/∂∆12, for
fixed values of ∆12 and ∆23. The plots suggest that the integrands differ only by a location
parameter. In fact, changing variables, we can rewrite the expression in (2–5) as
∂hc/∂∆12 = D1 + D2,   (2–6)

where

D1 = (1/2π) { ∫_{∆23+∆12−c}^{∆23+∆12+c} − ∫_{−c}^{c} } Φ(z − ∆12) e^{−(∆23+∆12−z)²/2 − z²/2} dz,

D2 = (1/2π) { ∫_{∆12−c}^{∆12+c} − ∫_{−c}^{c} } Φ(z − ∆23 − ∆12) e^{−(∆12−z)²/2 − z²/2} dz.
Recall that ∆12 > 0. Then, looking at D2, we have two possibilities for the intervals of
integration:
1. −c < ∆12 − c < c < ∆12 + c.
2. −c < c < ∆12 − c < ∆12 + c.
In other words, the intervals may or may not overlap. Denoting by R1 and R2 the
non-common regions of integration, that is,
• R1 = (−c, ∆12 − c) and R2 = (c, ∆12 + c) for case (1),
• R1 = (−c, c) and R2 = (∆12 − c, ∆12 + c) for case (2),
we have that D2 is guaranteed to be positive as long as the integral over R2 is greater
than the integral over R1, regardless of the case.
We first notice that R1 and R2 are intervals of the same length. In fact, ℓ(R1) =
ℓ(R2) = ∆12 for case (1), and ℓ(R1) = ℓ(R2) = 2c for case (2). Then, we only need to
show that for any two points z1 ∈ R1 and z2 ∈ R2, located at a certain distance ε > 0 from
the extremes of the corresponding intervals, the integrand evaluated at z2 is greater than
the integrand evaluated at z1.
Observe that for any z1 < z2,

[Φ(z2 − ∆23 − ∆12) e^{z2∆12 − z2²}] / [Φ(z1 − ∆23 − ∆12) e^{z1∆12 − z1²}]
  = q × exp{(z2 − z1)[∆12 − (z2 + z1)]},   (2–7)

where q = Φ(z2 − ∆23 − ∆12)/Φ(z1 − ∆23 − ∆12) > 1.

Then, for any 0 < ε < min{∆12, 2c}, take z1 = ∆12 − c − ε and z2 = c + ε whenever
min{∆12, 2c} = ∆12 (i.e., case 1), and z1 = c − ε and z2 = ∆12 − c + ε whenever
min{∆12, 2c} = 2c (i.e., case 2). Replacing these values in (2–7), we obtain that the ratio
is greater than 1 (regardless of the case), which allows us to conclude that D2 > 0.
Notice that the argument still holds if we replace the cdf Φ(·) by any non-decreasing
function, or if we replace the interval (−c, c) with (−c1, c2), where c1, c2 > 0. In this way,
we obtain the following more general result:

Proposition 2.1. Let ∆1, ∆2, c1, c2 > 0 and let the function f(z, λ) be non-decreasing in
z, where λ is an arbitrary set of parameters. Then,

{ ∫_{∆1−c1}^{∆1+c2} − ∫_{−c1}^{c2} } f(z, λ) e^{−(∆1−z)²/2 − z²/2} dz ≥ 0,

where the inequality is strict whenever the function f is monotonically increasing in z.
An immediate consequence of Proposition 2.1 is that D1 > 0. As a result, we obtain
that ∂hc/∂∆12 > 0. A similar argument shows that ∂hc/∂∆23 > 0, completing the proof. It
follows that the coverage probability P(θ(1) ∈ X(1) ± c) is minimized at ∆12 = ∆23 = 0, that
is, whenever θ1 = θ2 = θ3.
Observe that Proposition 2.1 gives a straightforward proof for the case p = 2. In
effect, for hc(∆12) = P(θ(1) ∈ X(1) ± c), we have

dhc/d∆12 = ∫_{∆12−c}^{∆12+c} φ(z − ∆12) φ(z) dz − ∫_{−c}^{c} φ(z − ∆12) φ(z) dz.

Then, applying Proposition 2.1 with f = 1/2π, we obtain that h′c(∆12) ≥ 0. It
immediately follows that the coverage probability is minimized at ∆12 = 0, or equivalently,
when θ1 = θ2.
For the general case (p > 3), we observe that when moving from the case p = k
to the case p = k + 1, we only need to include the extra parameter ∆k+1,k in order to
describe the problem (see Table 2-3). Then, using Proposition 2.1 and mathematical
induction we obtain the following result:
Lemma 1. Let c1, c2 > 0 and for p ≥ 2, let X1, ..., Xp be independent random variables
with Xi ∼ N(θi, 1). Then,

min_{θ1,...,θp} P(θ(1) ∈ (X(1) − c1, X(1) + c2)) = p ∫_{−c1}^{c2} Φ^{p−1}(z) φ(z) dz
  = Φ^p(c2) − Φ^p(−c1),

where Φ(·) and φ(·) are respectively the cdf and pdf of the standard normal distribution.
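The identity in Lemma 1 is easy to verify numerically, since pΦ^{p−1}(z)φ(z) is the derivative of Φ^p(z). The following sketch (our own check, added for illustration, using a simple midpoint rule) compares the integral with the closed form:

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def min_coverage_integral(p, c1, c2, m=4000):
    # p * integral_{-c1}^{c2} Phi^(p-1)(z) phi(z) dz by the midpoint rule
    step = (c1 + c2) / m
    total = 0.0
    for i in range(m):
        z = -c1 + (i + 0.5) * step
        total += p * Phi(z) ** (p - 1) * phi(z)
    return step * total

def min_coverage_closed(p, c1, c2):
    # closed form Phi^p(c2) - Phi^p(-c1)
    return Phi(c2) ** p - Phi(-c1) ** p

for p in (2, 5, 10):
    print(p, round(min_coverage_integral(p, 1.96, 1.96), 6),
          round(min_coverage_closed(p, 1.96, 1.96), 6))
```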
Using this lemma, we can easily obtain the following theorem, which summarizes the
main results of this section. The proof is straightforward.
Theorem 2.1. Let 0 < α < 1 and for i = 1, ..., p, suppose that Xi1, ..., Xin is a random
sample from a N(θi, σ²), where θi is unknown, but σ² is known. Then, a confidence
interval for θ(1) = ∑_{i=1}^{p} θi I(Xi = X(1)) with a confidence coefficient of (at least) 1 − α is
given by

X(1) ± (σ/√n) c,

where the value of c satisfies

Φ^p(c) − Φ^p(−c) = 1 − α.
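Under the theorem, the cutoff c solves a one-dimensional equation whose left-hand side is increasing in c, so a simple bisection suffices. The sketch below (an illustration we add) computes c for several values of p at α = 0.05:

```python
from math import erf, sqrt

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def cutoff(p, alpha=0.05, lo=0.0, hi=10.0, tol=1e-10):
    # solve Phi^p(c) - Phi^p(-c) = 1 - alpha by bisection;
    # the left-hand side is increasing in c
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Phi(mid) ** p - Phi(-mid) ** p < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for p in (1, 2, 10, 100, 10_000):
    print(p, round(cutoff(p), 3))
```

For p = 1 (and p = 2) this recovers the usual 1.96, while for p = 10000 the cutoff is only about 4.41, the value mentioned in the numerical studies of this chapter; the slow growth of c with p is already visible.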
2.2 The Unknown Variance Case
If the variance σ² is unknown, we need to estimate its value. We assume that
we have an estimate s² of σ², independent of the Xi's, such that s/σ has a pdf ϕ. In a
regular experiment, where we observe a sample of size n from each population, s² can
be taken as the pooled variance estimate, so that νs²/σ² ∼ χ²_ν, a chi-square distribution
with ν = p(n − 1) degrees of freedom.

Suppose first that p = 3 and for simplicity take n = 1. Then, the coverage probability
can be written as

P(θ(1) ∈ X(1) ± sc) = P(|Z1| ≤ cs/σ, Z1 ≥ Z2 + ∆21, Z1 ≥ Z3 + ∆31)
  + P(Z2 ≥ Z1 + ∆12, |Z2| ≤ cs/σ, Z2 ≥ Z3 + ∆32)
  + P(Z3 ≥ Z1 + ∆13, Z3 ≥ Z2 + ∆23, |Z3| ≤ cs/σ),   (2–8)

where Zi = (Xi − θi)/σ and ∆ij = (θi − θj)/σ for 1 ≤ i, j ≤ 3.

Notice that taking t = s/σ, we can rewrite each term in the sum (2–8) as a mixture.
We obtain

P(θ(1) ∈ X(1) ± sc) = ∫_{0}^{∞} P(|Z1| ≤ ct, Z1 ≥ Z2 + ∆21, Z1 ≥ Z3 + ∆31 | t) ϕ(t) dt
  + ∫_{0}^{∞} P(Z2 ≥ Z1 + ∆12, |Z2| ≤ ct, Z2 ≥ Z3 + ∆32 | t) ϕ(t) dt
  + ∫_{0}^{∞} P(Z3 ≥ Z1 + ∆13, Z3 ≥ Z2 + ∆23, |Z3| ≤ ct | t) ϕ(t) dt,
where ϕ(·) denotes the pdf of t. It follows that

P(θ(1) ∈ X(1) ± sc) = ∫_{0}^{∞} P(θ(1) ∈ X(1) ± tc | t) ϕ(t) dt,
where we know (from Section 2.1) that the probability P(θ(1) ∈ X(1) ± tc |t) in the integral
is minimized at θ1 = θ2 = θ3.
The generalization of this result follows from a direct application of Lemma 1.
Lemma 2. Let c1, c2 > 0 and for p ≥ 2, let X1, ..., Xp be independent random variables
with Xi ∼ N(θi, σ²), where both θi and σ² are unknown. If s² is an estimate of σ²
independent of X1, ..., Xp, then

min_{θ1,...,θp} P(θ(1) ∈ (X(1) − sc1, X(1) + sc2)) = ∫_{0}^{∞} (Φ^p(c2 t) − Φ^p(−c1 t)) ϕ(t) dt,

where ϕ(·) is the pdf of s/σ and Φ(·) is the cdf of the standard normal distribution.
We end this section with the following theorem. The proof follows directly from
Lemma 2.
Theorem 2.2. Let 0 < α < 1 and for i = 1, ..., p, suppose that Xi1, ..., Xin is a random
sample from a N(θi, σ²), where θi and σ² are unknown. Then, a confidence interval for
θ(1) = ∑_{i=1}^{p} θi I(Xi = X(1)) with a confidence coefficient of (at least) 1 − α is given by

X(1) ± (s/√n) c,

where s² = p^{-1} ∑_{i=1}^{p} s_i², with s_i² = (n − 1)^{-1} ∑_{j=1}^{n} (Xij − Xi)² for i = 1, ..., p, and c
satisfies

∫_{0}^{∞} (Φ^p(ct) − Φ^p(−ct)) ϕ(t) dt = 1 − α.
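The cutoff in Theorem 2.2 can be computed the same way once the pdf ϕ of t = s/σ is written down; with the pooled estimate, νs²/σ² ∼ χ²_ν gives an explicit density. The sketch below (our own illustration; the integration grid and the truncation point t_max are implementation choices, adequate for moderate ν) solves for c and, as a check, recovers the usual t cutoff when p = 1:

```python
from math import erf, exp, gamma, log, sqrt

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def pdf_s_over_sigma(t, nu):
    # pdf of t = s/sigma when nu * s^2 / sigma^2 ~ chi-square(nu)
    if t <= 0.0:
        return 0.0
    v = nu * t * t
    log_chi2 = (0.5 * nu - 1.0) * log(v) - 0.5 * v \
        - 0.5 * nu * log(2.0) - log(gamma(0.5 * nu))
    return 2.0 * nu * t * exp(log_chi2)

def coverage(c, p, nu, m=2000, t_max=4.0):
    # integral_0^inf (Phi^p(ct) - Phi^p(-ct)) pdf(t) dt, midpoint rule
    h = t_max / m
    total = 0.0
    for i in range(m):
        t = (i + 0.5) * h
        total += (Phi(c * t) ** p - Phi(-c * t) ** p) * pdf_s_over_sigma(t, nu)
    return h * total

def cutoff_unknown_var(p, nu, alpha=0.05):
    # bisection: the coverage is increasing in c
    lo, hi = 0.0, 20.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if coverage(mid, p, nu) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(cutoff_unknown_var(1, 12), 3))
print(round(cutoff_unknown_var(4, 12), 3))
```

For p = 1 and ν = 12 the computed cutoff is approximately 2.18, the t_{0.975,12} quantile, which corroborates the mixture representation; for p > 1 the cutoff is larger, as the selection must be accounted for.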
2.3 Numerical Studies
In this chapter, we have proposed a method to construct confidence intervals for the
mean of the selected population that takes into account the selection procedure. In this
section we present some numerical results that compare the performance of the new
and the traditional intervals.
First, we study the behavior of the confidence coefficient as a function of the
number of populations. Results show that the confidence coefficient of the traditional
intervals decreases rapidly as the number of populations increases. This effect is
particularly extreme when all the populations have the same mean. Figure 2-4 shows
the result of simulations considering up to 30 populations with the same mean and
setting α = 0.05. The solid blue line represents the confidence coefficient obtained using
our proposed confidence intervals and the dashed red line depicts the behavior of the
confidence coefficient obtained using the standard confidence intervals. Observe that
the solid line is constant at the nominal level 95%.

Intuitively, in order to keep the coverage probability constant, the confidence
intervals need to get wider. However, this increment is not dramatic and slows down as
the number of populations increases. For instance, if we consider 10000 populations, the
value of the cutoff point is only about 4.41. In fact, from Theorem 2.1 it can be
determined that the cutoff value grows like c ≈ √log(p).
An indirect way to obtain confidence intervals for θ(1), that attain (at least) the
nominal level, would be to construct simultaneous confidence intervals for the means of
all the populations considered in the experiment using, for instance, Bonferroni intervals.
The natural question is whether such a procedure produces better intervals in terms of
length. The answer is no. In fact, the size of the Bonferroni intervals increases at a
faster rate compared to the intervals we propose. Figure 2-5 shows the behavior of the
cutoff point c as the number of populations increases for the case α = 0.05. The solid
line corresponds to the value of the standard cutoff point for a 95% confidence interval
(zα/2 = 1.96), the dashed/dotted line represents the value of c for the new confidence
intervals, and the dashed line corresponds to the cutoff values for the Bonferroni intervals.
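The comparison in Figure 2-5 can be reproduced by computing both cutoffs directly. In the sketch below (our own illustration), the Bonferroni cutoff z_{α/(2p)} is obtained by bisection on the normal cdf and compared with the new cutoff:

```python
from math import erf, sqrt

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def inv_Phi(q):
    # quantile of the standard normal by bisection
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def new_cutoff(p, alpha=0.05):
    # solve Phi^p(c) - Phi^p(-c) = 1 - alpha by bisection
    lo, hi = 0.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) ** p - Phi(-mid) ** p < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bonferroni_cutoff(p, alpha=0.05):
    # z_{alpha/(2p)}: simultaneous intervals for all p means
    return inv_Phi(1.0 - alpha / (2.0 * p))

for p in (2, 10, 100, 1000):
    print(p, round(new_cutoff(p), 3), round(bonferroni_cutoff(p), 3))
```

The new cutoff is uniformly smaller than the Bonferroni cutoff for p ≥ 2, consistent with the shorter intervals described above.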
In an applied situation, the population means θi (1 ≤ i ≤ p) will rarely be identical.
Hence, we need to compare the performance of the confidence intervals when the
population means are different. Table 2-4 summarizes some results obtained by
simulation for the case p = 4. The first column shows the true values of the population
means (all of them with variance equal to 1), while the second and third columns show the
observed coverage probability for the traditional and new intervals at a confidence level
of 95%. The reported values correspond to the average of the coverage probabilities
over ten replications, and the numbers in parentheses are the corresponding standard
errors.

We observe that our proposed intervals outperform the traditional ones, even when
the population means are far apart. It is interesting to notice that even in situations
where one of the populations should be somehow distinguishable (see row four in Table
2-4), the traditional intervals may perform poorly.
2.4 Tables and Figures
Table 2-1. Configuration of the new parameterization for the probability P(θ(1) ∈ X(1) ± c). In the table, ∆ij = θi − θj.

        θ1      θ2     · · ·    θp
θ1       0     −∆21    · · ·   −∆p1
θ2     ∆21       0     · · ·   −∆p2
⋮        ⋮       ⋮      ⋱       ⋮
θp     ∆p1     ∆p2     · · ·     0
Table 2-2. Configuration of the new parameterization for the case p = 3, when ∆12 and ∆23 are the free parameters. In the table, ∆ij = θi − θj.

            θ1          θ2           θ3
θ1           0         −∆12     −(∆23 + ∆12)
θ2         ∆12           0         −∆23
θ3     ∆23 + ∆12       ∆23           0
Table 2-3. Representation of the parameters ∆i,j for the case p = k + 1.

              θ1                       θ2                 ...         θk                  θk+1
θ1             0                      ∆21                 ...  ∆k,k−1 + ... + ∆21   ∆k+1,k + ... + ∆21
θ2           −∆21                      0                  ...  ∆k,k−1 + ... + ∆32   ∆k+1,k + ... + ∆32
⋮              ⋮                       ⋮                  ⋱          ⋮                    ⋮
θk     −(∆k,k−1 + ... + ∆21)   −(∆k,k−1 + ... + ∆32)      ...         0                 ∆k+1,k
θk+1   −(∆k+1,k + ... + ∆21)   −(∆k+1,k + ... + ∆32)      ...      −∆k+1,k                0
Table 2-4. Observed coverage probability of 95% CI for the mean of the selected population out of four populations, using the traditional and the new method. The reported values correspond to the average over ten replications and the numbers in parentheses are the corresponding standard errors.

(θ1, θ2, θ3, θ4)       Trad CI            New CI
(0, 0, 0, 0)        0.904 (0.0016)    0.952 (0.0012)
(0, 0.25, 0.5, 1)   0.907 (0.0020)    0.952 (0.0011)
(0, 5, 10, 15)      0.950 (0.0014)    0.974 (0.0009)
(0, 0, 0, 2)        0.928 (0.0042)    0.9584 (0.0027)
(0, 0, 0, 5)        0.952 (0.0031)    0.973 (0.0028)
Figure 2-1. Coverage probability as a function of ∆21 and ∆32 when p = 3.
Figure 2-2. Plot of ∂h/∂∆21 for predetermined values of ∆21 and ∆32.
Figure 2-3. Plots of the first two terms of ∂h/∂∆21 for predetermined values of ∆21 and∆32.
Figure 2-4. Confidence coefficient versus number of populations for the case of identicalpopulation means and α = 0.05. The solid blue line corresponds to theconfidence coefficient for the new confidence intervals, and the dashed redline corresponds to the confidence coefficient for the traditional confidenceintervals.
Figure 2-5. Cutoff point versus number of populations for the case of identical population means and α = 0.05. The solid line corresponds to the cutoff value for the traditional confidence interval, zα/2 = 1.96. The dashed/dotted line corresponds to the cutoff value for the new intervals and the dashed line corresponds to the cutoff value for the Bonferroni intervals.
CHAPTER 3
CONFIDENCE INTERVALS FOLLOWING THE SELECTION OF K ≥ 1 POPULATIONS
Using the same framework as in Chapter 2, we assume that for i = 1, ..., p, we
have independent random variables Xi ∼ N(θi, σ2/n). Also, we define the order statistics
X(1), ..., X(p) according to the inequalities X(1) ≥ ... ≥ X(p) and, for simplicity, we start by
considering σ2 = n = 1. Then, we observe that the mean of the population from which
the j-th biggest observation, X(j), is sampled can be written as

θ(j) = ∑_{i=1}^{p} θi I(Xi = X(j)).
In this context, we want to find the value of c > 0 such that

P(θ(1) ∈ X(1) ± c, ..., θ(k) ∈ X(k) ± c) ≥ 1 − α     (3–1)

for any 0 < α < 1 and 1 ≤ k ≤ p.
Following the same approach we used in Chapter 2, we can write the probability in
(3–1) as

∑_{j1 ≠ ... ≠ jk} P(θ(1) ∈ X(1) ± c, ..., θ(k) ∈ X(k) ± c, X(1) = Xj1, ..., X(k) = Xjk),

where the sum has p!/(p − k)! terms, one for each ordered selection of k indices.
Let us consider first the case p = 4 and k = 2. Then, the probability of interest is

P(θ(1) ∈ X(1) ± c, θ(2) ∈ X(2) ± c) = ∑_{i≠j} P(θi ∈ Xi ± c, θj ∈ Xj ± c, X(1) = Xi, X(2) = Xj),     (3–2)

where 1 ≤ i, j ≤ 4.
In order to obtain closed form expressions for each term in the sum, observe that
for X(1) = X1 and X(2) = X2, we have (X(1) = X1, X(2) = X2) = (X1 ≥ X2, X2 ≥ X3, X2 ≥ X4). In other words, the relative order between X3 and X4 is irrelevant.
It follows that we only need to pay attention to the possible configurations of the random
variables at the top. In this case the possible configurations are

(X1 ≥ X2, X2 ≥ X3, X2 ≥ X4)    (X3 ≥ X1, X1 ≥ X2, X1 ≥ X4)
(X1 ≥ X3, X3 ≥ X2, X3 ≥ X4)    (X3 ≥ X2, X2 ≥ X1, X2 ≥ X4)
(X1 ≥ X4, X4 ≥ X2, X4 ≥ X3)    (X3 ≥ X4, X4 ≥ X1, X4 ≥ X2)
(X2 ≥ X1, X1 ≥ X3, X1 ≥ X4)    (X4 ≥ X1, X1 ≥ X2, X1 ≥ X3)
(X2 ≥ X3, X3 ≥ X1, X3 ≥ X4)    (X4 ≥ X2, X2 ≥ X1, X2 ≥ X3)
(X2 ≥ X4, X4 ≥ X1, X4 ≥ X3)    (X4 ≥ X3, X3 ≥ X1, X3 ≥ X2)
If we define Zj = Xj − θj (1 ≤ j ≤ 4) and ∆ij = θi − θj (1 ≤ i, j ≤ 4), we observe

X1 ≥ X2 ⇔ Z1 ≥ Z2 + ∆21
X2 ≥ X3 ⇔ Z2 ≥ Z3 + ∆32
X2 ≥ X4 ⇔ Z2 ≥ Z4 + ∆42,

where Z1, ..., Z4 are iid N(0, 1).
Then, the first term of the sum in (3–2) can be written as

P(θ1 ∈ X1 ± c, θ2 ∈ X2 ± c, X1 ≥ X2, X2 ≥ X3, X2 ≥ X4)
= P(|Z1| ≤ c, |Z2| ≤ c, Z2 ≤ Z1 + ∆12, Z3 ≤ Z2 + ∆23, Z4 ≤ Z2 + ∆24)

and, making use of the normality assumptions, we can explicitly write

P(θ1 ∈ X1 ± c, θ2 ∈ X2 ± c, X1 ≥ X2, X2 ≥ X3, X2 ≥ X4)
= ∫_{−c}^{c} ∫_{−c}^{min(c, z1−∆21)} Φ(z2 − ∆32) Φ(z2 − ∆42) φ(z1)φ(z2) dz2 dz1.
Of course, the same argument is valid for the other terms in the sum. This way,
considering all 12 possible configurations for the order of the random variables X1,
X2, X3 and X4, we can write the sum in (3–2) in closed form:
P(θ(1) ∈ X(1) ± c, θ(2) ∈ X(2) ± c)
= ∫_{−c}^{c} ∫_{−c}^{min(c, z1−∆21)} Φ(z2 − ∆32) Φ(z2 − ∆42) φ(z1)φ(z2) dz2 dz1
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z2−∆12)} Φ(z1 − ∆31) Φ(z1 − ∆41) φ(z1)φ(z2) dz1 dz2
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z1−∆31)} Φ(z3 − ∆23) Φ(z3 − ∆43) φ(z1)φ(z3) dz3 dz1
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z3−∆13)} Φ(z1 − ∆21) Φ(z1 − ∆41) φ(z1)φ(z3) dz1 dz3
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z1−∆41)} Φ(z4 − ∆24) Φ(z4 − ∆34) φ(z1)φ(z4) dz4 dz1
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z4−∆14)} Φ(z1 − ∆21) Φ(z1 − ∆31) φ(z1)φ(z4) dz1 dz4
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z2−∆32)} Φ(z3 − ∆13) Φ(z3 − ∆43) φ(z3)φ(z2) dz3 dz2
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z3−∆23)} Φ(z2 − ∆12) Φ(z2 − ∆42) φ(z3)φ(z2) dz2 dz3
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z2−∆42)} Φ(z4 − ∆14) Φ(z4 − ∆34) φ(z4)φ(z2) dz4 dz2
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z4−∆24)} Φ(z2 − ∆12) Φ(z2 − ∆32) φ(z4)φ(z2) dz2 dz4
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z3−∆43)} Φ(z4 − ∆14) Φ(z4 − ∆24) φ(z3)φ(z4) dz4 dz3
+ ∫_{−c}^{c} ∫_{−c}^{min(c, z4−∆34)} Φ(z3 − ∆13) Φ(z3 − ∆23) φ(z3)φ(z4) dz3 dz4
In order to minimize this expression, we need to address two equally challenging
difficulties:
• First, the construction of any lower bound needs to take into account the delicate balance between the ∆ij's in the expression.
• Second, special attention needs to be paid to the limits of integration. The "corners" of the form min(c, z − ∆ij) make any procedure based on differentiation nearly impossible.
To overcome the difficulty due to the "corners", we notice that the events
(Z2 ≤ Z1 + ∆12, Z3 ≤ Z2 + ∆23, Z4 ≤ Z2 + ∆24) and (Z2 ≥ Z1 + ∆12, Z3 ≤ Z1 + ∆13, Z4 ≤ Z1 + ∆14)
are disjoint. Hence, we can express the sum of the probabilities of these two events
as the probability of their union. Consequently, instead of writing down 12 terms for the
sum (one term per configuration), we can express the probability of interest using only 6
terms, each of them describing the two random variables positioned at the top.
Working out the details, we obtain:
• X1 and X2 at the top.
P(|Z1| ≤ c, |Z2| ≤ c, Z3 ≤ min{Z1 + ∆13, Z2 + ∆23}, Z4 ≤ min{Z1 + ∆14, Z2 + ∆24})
= ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + ∆13, z2 + ∆23}) Φ(min{z1 + ∆14, z2 + ∆24}) φ(z1)φ(z2) dz1 dz2
• X1 and X3 at the top.
P(|Z1| ≤ c, |Z3| ≤ c, Z2 ≤ min{Z1 + ∆12, Z3 − ∆23}, Z4 ≤ min{Z1 + ∆14, Z3 + ∆34})
= ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + ∆12, z3 − ∆23}) Φ(min{z1 + ∆14, z3 + ∆34}) φ(z1)φ(z3) dz1 dz3
• X1 and X4 at the top.
P(|Z1| ≤ c, |Z4| ≤ c, Z2 ≤ min{Z1 + ∆12, Z4 − ∆24}, Z3 ≤ min{Z1 + ∆13, Z4 − ∆34})
= ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + ∆12, z4 − ∆24}) Φ(min{z1 + ∆13, z4 − ∆34}) φ(z1)φ(z4) dz1 dz4
• X2 and X3 at the top.
P(|Z2| ≤ c, |Z3| ≤ c, Z1 ≤ min{Z2 − ∆12, Z3 − ∆13}, Z4 ≤ min{Z2 + ∆24, Z3 + ∆34})
= ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z2 − ∆12, z3 − ∆13}) Φ(min{z2 + ∆24, z3 + ∆34}) φ(z2)φ(z3) dz2 dz3
• X2 and X4 at the top.
P(|Z2| ≤ c, |Z4| ≤ c, Z1 ≤ min{Z2 − ∆12, Z4 − ∆14}, Z3 ≤ min{Z2 + ∆23, Z4 − ∆34})
= ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z2 − ∆12, z4 − ∆14}) Φ(min{z2 + ∆23, z4 − ∆34}) φ(z2)φ(z4) dz2 dz4
• X3 and X4 at the top.
P(|Z3| ≤ c, |Z4| ≤ c, Z1 ≤ min{Z3 − ∆13, Z4 − ∆14}, Z2 ≤ min{Z3 − ∆23, Z4 − ∆24})
= ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z3 − ∆13, z4 − ∆14}) Φ(min{z3 − ∆23, z4 − ∆24}) φ(z3)φ(z4) dz3 dz4
(In each case the min arises because both non-selected variables must fall below the smaller of the two selected ones.)
This way, an alternative representation for the probability of interest is

P(θ(1) ∈ X(1) ± c, θ(2) ∈ X(2) ± c)     (3–3)
= ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + ∆13, z2 + ∆23}) Φ(min{z1 + ∆14, z2 + ∆24}) φ(z1)φ(z2) dz1 dz2
+ ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + ∆12, z3 − ∆23}) Φ(min{z1 + ∆14, z3 + ∆34}) φ(z1)φ(z3) dz1 dz3
+ ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + ∆12, z4 − ∆24}) Φ(min{z1 + ∆13, z4 − ∆34}) φ(z1)φ(z4) dz1 dz4
+ ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z2 − ∆12, z3 − ∆13}) Φ(min{z2 + ∆24, z3 + ∆34}) φ(z2)φ(z3) dz2 dz3
+ ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z2 − ∆12, z4 − ∆14}) Φ(min{z2 + ∆23, z4 − ∆34}) φ(z2)φ(z4) dz2 dz4
+ ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z3 − ∆13, z4 − ∆14}) Φ(min{z3 − ∆23, z4 − ∆24}) φ(z3)φ(z4) dz3 dz4
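As a sanity check (an illustration, not part of the original text), the sketch below compares a Monte Carlo estimate of P(θ(1) ∈ X(1) ± c, θ(2) ∈ X(2) ± c) for p = 4, k = 2 at θ1 = ... = θ4 = 0 against direct quadrature of the representation above. At the origin all six terms coincide, and the integrand uses the min of the two selected z's, since both non-selected variables must fall below the smaller selected one:

```python
import random
from math import erf, sqrt, exp, pi

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

c = 1.96  # cutoff under test

# Quadrature (midpoint rule): at the origin the coverage equals
# 6 * integral of Phi(min{z1, z2})^2 phi(z1) phi(z2) over [-c, c]^2.
n = 400
h = 2.0 * c / n
quad = 0.0
for i in range(n):
    z1 = -c + (i + 0.5) * h
    for j in range(n):
        z2 = -c + (j + 0.5) * h
        quad += Phi(min(z1, z2)) ** 2 * phi(z1) * phi(z2)
quad *= 6.0 * h * h

# Monte Carlo: with p = 4, k = 2 and all means 0, the joint coverage event
# is {|X(1)| <= c and |X(2)| <= c} for the two largest of four N(0,1) draws.
random.seed(1)
N = 200_000
hits = 0
for _ in range(N):
    xs = sorted(random.gauss(0.0, 1.0) for _ in range(4))
    if abs(xs[-1]) <= c and abs(xs[-2]) <= c:
        hits += 1
mc = hits / N
print(round(quad, 3), round(mc, 3))  # the two estimates agree closely
```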
Observe that this new representation does not completely solve the problem of the
"corners", but rather removes them from the limits of integration and puts them inside
the integrand. Now, we find expressions of the form min{zi + ∆, zj + ∆'} in the argument of
the normal cdfs Φ(·), which still makes any minimization approach based on
differentiation difficult.
However, this new representation reveals more clearly the symmetry in the structure
of the ∆'s, as portrayed in Table 3-1. This pattern is particularly important, since it
suggests how to generalize the expression to arbitrary values of p and k.
In order to determine the configuration of ∆'s that minimizes the expression in (3–3),
we assume (without loss of generality) that θ1 ≥ θ2 ≥ θ3 ≥ θ4, so that ∆ij ≥ 0 for any
i ≤ j. Also, we consider ∆12, ∆23 and ∆34 as free parameters.
Based on our previous results, it is reasonable to believe that the minimum of (3–3)
is reached at the origin. In order to prove this claim, we have studied the behavior of the
coverage probability (CP) for different configurations of the ∆ij's, with special attention to
the behavior at the boundary. Among others, we considered the following cases:
• ∆12 = ∆23 = ∆34 = 0:

CP = 6 ∫_{−c}^{c} ∫_{−c}^{c} Φ²(min{z1, z2}) φ(z1)φ(z2) dz1 dz2

• ∆12 > 0, ∆23 = ∆34 = 0:

CP = 3 ∫_{−c}^{c} ∫_{−c}^{c} Φ²(min{z1 + ∆12, z2}) φ(z1)φ(z2) dz1 dz2
+ 3 ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z2 − ∆12, z3 − ∆12}) Φ(min{z2, z3}) φ(z2)φ(z3) dz2 dz3
→ 3 ∫_{−c}^{c} ∫_{−c}^{c} Φ²(z2) φ(z1)φ(z2) dz1 dz2, as ∆12 ↑ +∞

• ∆12, ∆23 > 0 and ∆34 = 0:

CP = ∫_{−c}^{c} ∫_{−c}^{c} Φ²(min{z1 + ∆13, z2 + ∆23}) φ(z1)φ(z2) dz1 dz2
+ 2 ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + ∆12, z3 − ∆23}) Φ(min{z1 + ∆13, z3}) φ(z1)φ(z3) dz1 dz3
+ 2 ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z2 − ∆12, z3 − ∆13}) Φ(min{z2 + ∆23, z3}) φ(z2)φ(z3) dz2 dz3
+ ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z3 − ∆13, z4 − ∆13}) Φ(min{z3 − ∆23, z4 − ∆23}) φ(z3)φ(z4) dz3 dz4
→ ∫_{−c}^{c} ∫_{−c}^{c} φ(z1)φ(z2) dz1 dz2, as ∆12, ∆23 ↑ +∞
However, none of the cases we considered provided conclusive (analytical) evidence
that the minimum is at the origin. On the contrary, various numerical studies have
suggested that the minimum is not located at the origin (see Figure 3-1), but the
current formulation of the problem makes it difficult even to establish that it is not located
in the interior of the region determined by ∆12, ∆23 and ∆34.
These difficulties call for a different approach, which we discuss in the following
section.
3.1 An Alternative Approach
So far, we have approached the problem considering partitions of the coverage
probability based on the possible configurations of the vector (X(1), X(2), ..., X(k)). Notice
that such an approach, by construction, takes into account the relative orderings between
the variables that are selected (the top k).
Instead, we can consider an alternative that does not take explicit consideration of the
ordering between the variables that have been selected. Notice there are C(p, k) = p!/(k!(p − k)!) different
ways to select k out of p populations, without considering the order. Suppose that j
indexes one such arrangement, and denote by Xj1, ..., Xjk the top k variables and by
Xj(k+1), ..., Xjp the bottom p − k. Then, we can separate the sample space according to
min{Xj1, ..., Xjk} ≥ max{Xj(k+1), ..., Xjp} for j = 1, ..., C(p, k).
This way, the coverage probability can be written as

P(θ(1) ∈ X(1) ± c, ..., θ(k) ∈ X(k) ± c)
= ∑_{j=1}^{C(p,k)} P(θj1 ∈ Xj1 ± c, ..., θjk ∈ Xjk ± c, min{Xj1, ..., Xjk} ≥ max{Xj(k+1), ..., Xjp})
Let us first consider the term where (X1, X2, ..., Xk) are at the top. For this case, the
relevant piece of the probability is

P(θ1 ∈ X1 ± c, ..., θk ∈ Xk ± c, min{X1, ..., Xk} ≥ max{Xk+1, ..., Xp})
= ∫_{θ1−c}^{θ1+c} ··· ∫_{θk−c}^{θk+c} ∏_{j=k+1}^{p} P_{θj}(Xj ≤ min{x1, ..., xk}) f(x1, ..., xk) dx1 ··· dxk,

where f(x1, ..., xk) is the joint density of (X1, ..., Xk).
Hence, making use of the normality assumptions, we have

P(θ1 ∈ X1 ± c, ..., θk ∈ Xk ± c, min{X1, ..., Xk} ≥ max{Xk+1, ..., Xp})
= ∫_{−c}^{c} ··· ∫_{−c}^{c} ∏_{j=k+1}^{p} Φ(min{z1 + θ1, ..., zk + θk} − θj) ∏_{i=1}^{k} φ(zi) dzi,

where zi = xi − θi for i = 1, ..., k.
From here, it is not difficult to obtain the following expression for the coverage
probability:

P(θ(1) ∈ X(1) ± c, ..., θ(k) ∈ X(k) ± c)
= ∑_{j=1}^{C(p,k)} ∫_{−c}^{c} ··· ∫_{−c}^{c} ∏_{m∈Ij^c} Φ( min_{ℓ∈Ij} {zℓ + θℓ} − θm ) ∏_{ℓ∈Ij} φ(zℓ) dzℓ,     (3–4)

where Ij = {j1, ..., jk} is the set of indices for the top k variables in the j-th arrangement
and Ij^c = {j(k+1), ..., jp} is the set of indices for the bottom p − k variables in the j-th
arrangement.
Notice that if k = 1 we are back in the case discussed in Chapter 2, and the case
k = p corresponds to simultaneous confidence intervals.
Let us take a closer look at this formula and consider first the case p = 6 and k = 3.
In this case, the sum in (3–4) will have C(6, 3) = 20 terms, determined by the configurations

123|456   234|156   345|126   456|123
124|356   235|146   346|125
125|346   236|145   356|124
126|345   245|136
134|256   246|135
135|246   256|134
136|245
145|236
146|235
156|234

where the numbers to the left of the vertical line are the indices of the set Ij (the
populations being selected) and the numbers to the right are the indices of the set Ij^c
(the populations not being selected). Observe that all the indices appear on the left side
(and on the right side) the same number of times (10), revealing some symmetry in the
problem.
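This symmetry is easy to verify by brute force; the short sketch below (illustrative only) enumerates the C(6, 3) = 20 configurations and counts how often each index appears on the selected side:

```python
from itertools import combinations

p, k = 6, 3
configs = list(combinations(range(1, p + 1), k))
print(len(configs))  # 20 configurations

# Count how often each index appears on the selected (left) side.
left_counts = {i: sum(1 for cfg in configs if i in cfg) for i in range(1, p + 1)}
print(left_counts)  # every index appears C(5, 2) = 10 times
```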
Using this symmetry, suppose that θ1 ≤ θ2 ≤ θ3 ≤ θ4 ≤ θ5 ≤ θ6 and let
θ6 ↑ ∞. Then, for the 10 groups for which 6 is on the right side, the corresponding
term goes to zero. For the remaining groups (for which 6 appears on the left), the value
of Φ(min_{ℓ}{zjℓ + θjℓ} − θjm) is not affected by θ6, and the coverage probability is
determined by the following configurations:

12|345   23|145   34|125   45|123
13|245   24|135   35|124
14|235   25|134
15|234

which correspond to the possible ways of choosing 2 out of 5 populations.
Repeating the argument, but letting θ5 ↑ ∞, we obtain the configurations

1|234   3|124
2|134   4|123
which are the possible ways to choose 1 out of 4 populations. For this case, we know
(from Chapter 2) that the minimum is reached at θ1 = θ2 = θ3 = θ4. This example
suggests that the coverage probability is minimized when the biggest k − 1 population
means are sent to +∞ and the remaining p − k + 1 are set to be equal. However, a formal
argument is required.
For the general case (1 ≤ k < p), the number of possible configurations is

C(p, k) = C(p − 1, k) + C(p − 1, p − k) = C(p − 1, k) + C(p − 1, k − 1),

where C(p − 1, k) is the number of configurations that have any given index j on the right side
(population j is not selected) and C(p − 1, k − 1) is the number of configurations that have index j
on the left side (population j is selected).
Suppose (without any loss of generality) that θ1 ≤ ... ≤ θp and define

Ij(θp) = I( min_{ℓ∈Ij−{p}} {zℓ + θℓ} ≥ zp + θp )
Ij^c(θp) = I( min_{ℓ∈Ij−{p}} {zℓ + θℓ} < zp + θp ),

where I(·) is the indicator function.
From the definition, it immediately follows that

min_{ℓ∈Ij} {zℓ + θℓ} = (zp + θp) Ij(θp) + min_{ℓ∈Ij−{p}} {zℓ + θℓ} Ij^c(θp)     (3–5)
and therefore, the coverage probability can be written as

P(θ(1) ∈ X(1) ± c, ..., θ(k) ∈ X(k) ± c)
= ∑_{j=1}^{C(p,k)} ∫_{−c}^{c} ··· ∫_{−c}^{c} ∏_{m∈Ij^c} Φ((zp + θp) − θm) Ij(θp) ∏_{ℓ∈Ij} φ(zℓ) dzℓ
+ ∑_{j=1}^{C(p,k)} ∫_{−c}^{c} ··· ∫_{−c}^{c} ∏_{m∈Ij^c} Φ( min_{ℓ∈Ij−{p}} {zℓ + θℓ} − θm ) Ij^c(θp) ∏_{ℓ∈Ij} φ(zℓ) dzℓ
Now, observe that as θp ↑ ∞,

min_{ℓ∈Ij} {zℓ + θℓ} = (zp + θp) Ij(θp) + min_{ℓ∈Ij−{p}} {zℓ + θℓ} Ij^c(θp) → min_{ℓ∈Ij−{p}} {zℓ + θℓ}

and hence

∏_{m∈Ij^c} Φ( min_{ℓ∈Ij} {zℓ + θℓ} − θm ) → ∏_{m∈Ij^c} Φ( min_{ℓ∈Ij−{p}} {zℓ + θℓ} − θm )

for all the terms for which θp is on the left side.
At the same time, for the terms where θp is on the right side, we have

∏_{m∈Ij^c} Φ( min_{ℓ∈Ij} {zℓ + θℓ} − θm ) → 0,
and therefore, as θp ↑ ∞, the coverage probability converges to

∑_{j=1}^{C(p−1,k−1)} ∫_{−c}^{c} ··· ∫_{−c}^{c} ∏_{m∈Ij^c} Φ( min_{ℓ∈Ij−{p}} {zℓ + θℓ} − θm ) ∏_{ℓ∈Ij} φ(zℓ) dzℓ.
Before we move forward, let us consider the example p = 3, k = 2. Then, the
coverage probability is

P(θ(1) ∈ X(1) ± c, θ(2) ∈ X(2) ± c)
= ∑_{j=1}^{C(3,2)} ∫_{−c}^{c} ∫_{−c}^{c} ∏_{m∈Ij^c} Φ( min_{ℓ∈Ij} {zℓ + θℓ} − θm ) ∏_{ℓ∈Ij} φ(zℓ) dzℓ
= ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + θ1, z2 + θ2} − θ3) φ(z1)φ(z2) dz1 dz2
+ ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + θ1, z3 + θ3} − θ2) φ(z1)φ(z3) dz1 dz3     (3–6)
+ ∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z2 + θ2, z3 + θ3} − θ1) φ(z2)φ(z3) dz2 dz3,

and, as θ3 ↑ ∞, we obtain

M = ∫_{−c}^{c} ∫_{−c}^{c} Φ(z1 + θ1 − θ2) φ(z1)φ(z3) dz1 dz3     (3–7)
+ ∫_{−c}^{c} ∫_{−c}^{c} Φ(z2 + θ2 − θ1) φ(z2)φ(z3) dz2 dz3.
Suppose now that, for a fixed θ3, min{z1 + θ1, z3 + θ3} = z3 + θ3. Since we are
assuming that θ1 ≤ θ2 ≤ θ3, this can only happen for certain values of z1 and z3. Let
R1 = {(z1, z3) : min{z1 + θ1, z3 + θ3} = z1 + θ1} and R2 = {(z1, z3) : min{z1 + θ1, z3 + θ3} = z3 + θ3}. Then, the integral in (3–6) can be written as
∫∫_{R1} Φ(z1 + θ1 − θ2) φ(z1)φ(z3) dz1 dz3 + ∫∫_{R2} Φ(z3 + θ3 − θ2) φ(z1)φ(z3) dz1 dz3.

Similarly, the integral in (3–7) can be written as

∫∫_{R1} Φ(z1 + θ1 − θ2) φ(z1)φ(z3) dz1 dz3 + ∫∫_{R2} Φ(z1 + θ1 − θ2) φ(z1)φ(z3) dz1 dz3
and, since θ3 − θ2 ≥ θ1 − θ2, we obtain

∫_{−c}^{c} ∫_{−c}^{c} Φ(min{z1 + θ1, z3 + θ3} − θ2) φ(z1)φ(z3) dz1 dz3 ≥ ∫_{−c}^{c} ∫_{−c}^{c} Φ(z1 + θ1 − θ2) φ(z1)φ(z3) dz1 dz3.

Using a similar argument with the third integral in the coverage probability, we
conclude that P(θ(1) ∈ X(1) ± c, θ(2) ∈ X(2) ± c) ≥ M.
For the general case, suppose that θp (fixed) is such that Ij(θp) = 1 for some j.
That is, min_{ℓ∈Ij} {zℓ + θℓ} = zp + θp. Under the assumption θ1 ≤ ... ≤ θp, we have
θp − θm ≥ θℓ − θm for any 1 ≤ m, ℓ ≤ p and, therefore, Ij(θp) can be equal to 1 only in
a certain region of the hyper-cube (−c, c)^k. Then, partitioning the integrals accordingly,
we obtain

P(θ(1) ∈ X(1) ± c, ..., θ(k) ∈ X(k) ± c)
= ∑_{j=1}^{C(p,k)} ∫_{−c}^{c} ··· ∫_{−c}^{c} ∏_{m∈Ij^c} Φ( min_{ℓ∈Ij} {zℓ + θℓ} − θm ) ∏_{ℓ∈Ij} φ(zℓ) dzℓ
≥ ∑_{j=1}^{C(p−1,k−1)} ∫_{−c}^{c} ··· ∫_{−c}^{c} ∏_{m∈Ij^c} Φ( min_{ℓ∈Ij−{p}} {zℓ + θℓ} − θm ) ∏_{ℓ∈Ij} φ(zℓ) dzℓ,     (3–8)

where the equality is attained asymptotically as θp approaches infinity.
Integrating (3–8) with respect to zp, we obtain

(Φ(c) − Φ(−c)) [ ∑_{j=1}^{C(p−1,k−1)} ∫_{−c}^{c} ··· ∫_{−c}^{c} ∏_{m∈Ij^c} Φ( min_{ℓ∈Ij−{p}} {zℓ + θℓ} − θm ) ∏_{ℓ∈Ij−{p}} φ(zℓ) dzℓ ],

where the quantity in brackets [ ] is exactly the coverage probability for selecting k − 1
out of p − 1.
Repeating the argument, but now letting θp−1 ↑ ∞, we obtain the lower bound

(Φ(c) − Φ(−c))² [ ∑_{j=1}^{C(p−2,k−2)} ∫_{−c}^{c} ··· ∫_{−c}^{c} ∏_{m∈Ij^c} Φ( min_{ℓ∈Ij−{p,p−1}} {zℓ + θℓ} − θm ) ∏_{ℓ∈Ij−{p,p−1}} φ(zℓ) dzℓ ].
This way, continuing the procedure until there is only 1 population on the left side
(selected) and p − k on the right side (not selected), the resulting lower bound for the
coverage probability is

(Φ(c) − Φ(−c))^{k−1} [ ∑_{j=1}^{p−k+1} ∫_{−c}^{c} ∏_{m∈Ij^c} Φ(z + θj − θm) φ(z) dz ].

Again, notice that the expression in brackets [ ] corresponds to the coverage probability
for selecting 1 out of p − k + 1 populations, which we already know is minimized at
θ1 = ... = θp.
Observe that nothing changes in the argument if we replace the intervals (−c, c)
by intervals of the form (−c1, c2), with c1, c2 > 0. This observation leads to the following
lemma:
Lemma 3. Let c1, c2 > 0 and for p ≥ 2, let X1, ..., Xp be independent random variables
with Xi ∼ N(θi, 1). Then,

min_{θ1,...,θp} P(θ(1) ∈ (X(1) − c1, X(1) + c2), ..., θ(k) ∈ (X(k) − c1, X(k) + c2))
= (Φ(c2) − Φ(−c1))^{k−1} [Φ^{p−k+1}(c2) − Φ^{p−k+1}(−c1)],

where Φ(·) is the cdf of the standard normal distribution.
If the variance σ2 is unknown, we can follow the same strategy used in Chapter 2
and extend this result by writing the coverage probability as a mixture. We obtain:
Lemma 4. Let c1, c2 > 0 and for p ≥ 2, let X1, ..., Xp be independent random vari-
ables with Xi ∼ N(θi, σ2), where both θi and σ2 are unknown. If s2 is an estimate of σ2
independent of X1, ..., Xp, then

min_{θ1,...,θp} P(θ(1) ∈ (X(1) − c1, X(1) + c2), ..., θ(k) ∈ (X(k) − c1, X(k) + c2))
= ∫_0^∞ (Φ(c2 t) − Φ(−c1 t))^{k−1} [Φ^{p−k+1}(c2 t) − Φ^{p−k+1}(−c1 t)] ϕ(t) dt,

where ϕ(·) is the pdf of s/σ and Φ(·) is the cdf of the standard normal distribution.
The following theorem summarizes the main results of this chapter:
Theorem 3.1. Let 0 < α < 1 and for i = 1, ..., p, suppose that Xi1, ..., Xin is a random
sample from a N(θi, σ2) distribution, where θi is unknown.
Case 1: If the variance σ2 is known, then confidence intervals for θ(1), ..., θ(k),
with a simultaneous confidence coefficient of (at least) 1 − α, are given by

X(j) ± (σ/√n) c,   j = 1, ..., k,

where the value of c satisfies

(Φ(c) − Φ(−c))^{k−1} [Φ^{p−k+1}(c) − Φ^{p−k+1}(−c)] = 1 − α.

Case 2: If the variance σ2 is unknown, then confidence intervals for θ(1), ..., θ(k),
with a simultaneous confidence coefficient of (at least) 1 − α, are given by

X(j) ± (s/√n) c,   j = 1, ..., k,

where s = [p^{−1} ∑_{i=1}^{p} s_i²]^{1/2}, with s_i² = (n − 1)^{−1} ∑_{j=1}^{n} (Xij − X̄i)² for 1 ≤ i ≤ p, c satisfies

∫_0^∞ (Φ(ct) − Φ(−ct))^{k−1} [Φ^{p−k+1}(ct) − Φ^{p−k+1}(−ct)] ϕ(t) dt = 1 − α,

and ϕ(·) is the pdf of s/σ.
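The equation in Case 1 is easy to solve numerically. The sketch below (illustrative, standard library only) finds c by bisection and reproduces entries of Table 3-4:

```python
from math import erf, sqrt

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def coverage(c, p, k):
    # Minimum simultaneous coverage from Theorem 3.1, Case 1.
    return (Phi(c) - Phi(-c)) ** (k - 1) * (
        Phi(c) ** (p - k + 1) - Phi(-c) ** (p - k + 1))

def cutoff(p, k, alpha=0.05):
    # Bisection: coverage(c, p, k) is increasing in c.
    lo, hi = 0.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if coverage(mid, p, k) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(cutoff(2, 1), 3))  # close to 1.96
print(round(cutoff(5, 1), 3))  # close to 2.319 (Table 3-4)
print(round(cutoff(5, 2), 3))  # close to 2.387 (Table 3-4)
```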
3.2 Numerical Studies
The results obtained in Section 3.1 suggest that the minimum coverage probability
is attained when k − 1 population means go to infinity and the remaining p − k + 1
populations have the same mean. To confirm this behavior, we performed several
simulation studies in which we consider the empirical coverage probability of the
confidence intervals, setting very large values for the components of θ that diverge to
infinity and setting the remaining ones equal to zero. Table 3-2 shows the result of a
simulation study in which we considered six populations and varied the number of
selected ones. In the first column we can see the number of population means set equal
to zero (the rest were set equal to 100 to represent infinity), and we observe that for every
1 ≤ k ≤ 6 the minimum coverage probability is obtained when 6 − k + 1 populations have equal
means.
A different concern is whether the new intervals maintain the nominal level. Table
3-3 summarizes the observed coverage probabilities obtained in a numerical study
considering 6 populations. The nominal level is 95%. In the table, the
first column shows different configurations of the population means and the first row
indicates the number of selected populations. We observe that for every configuration
the observed coverage probability is never below the nominal level. These results remain
valid for every other configuration we have considered (including changing the number of
populations), which validates the reliability of the procedure.
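A check of this kind is easy to script. The sketch below (an illustration, not the author's code) solves for the cutoff of Theorem 3.1 with p = 6, k = 2, simulates the equal-means configuration θ = 0, and verifies that the empirical coverage is at or above the nominal 95% level:

```python
import random
from math import erf, sqrt

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def cutoff(p, k, alpha=0.05):
    # Bisection on the minimum-coverage equation of Theorem 3.1, Case 1.
    lo, hi = 0.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        cov = (Phi(mid) - Phi(-mid)) ** (k - 1) * (
            Phi(mid) ** (p - k + 1) - Phi(-mid) ** (p - k + 1))
        if cov < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p, k = 6, 2
c = cutoff(p, k)
random.seed(7)
N = 50_000
hits = 0
for _ in range(N):
    xs = sorted((random.gauss(0.0, 1.0) for _ in range(p)), reverse=True)
    # All means are 0, so the intervals X(j) +/- c must contain 0 for j <= k.
    if all(abs(xs[j]) <= c for j in range(k)):
        hits += 1
print(round(hits / N, 3))  # at or above the nominal 0.95
```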
Finally, we studied the behavior of the length of the intervals. In Chapter 2
we observed that the confidence intervals increase in length as the number of
populations increases. This behavior is also expected when we are selecting k > 1
populations; however, it is important to determine how the value of k affects the
length of the intervals. Table 3-4 shows the results of a numerical study in which we
considered different values of p (total number of populations) and k (number of selected
populations). In the table, the first column shows the number of populations, and the
first row the number of selected populations. In the body we observe the values of the
cutoff points for 95% confidence intervals for the corresponding configuration, and the
last column shows the cutoff values for 95% simultaneous confidence intervals using
Bonferroni. We notice that the proposed intervals are always shorter than Bonferroni, even
when we select all the available populations (p = k). This difference increases as the
number of populations increases.
3.3 Tables and Figures
Table 3-1. Structure of the ∆'s for the case p = 4, k = 2 (see 3–3). Each row represents a term in the sum.

Top          ∆'s
(X1, X2)   +∆13   +∆23   +∆14   +∆24
(X1, X3)   +∆12   −∆23   +∆14   +∆34
(X1, X4)   +∆12   −∆24   +∆13   −∆34
(X2, X3)   −∆12   −∆13   +∆24   +∆34
(X2, X4)   −∆12   −∆14   +∆23   −∆34
(X3, X4)   −∆13   −∆14   −∆23   −∆24
Table 3-2. Coverage probabilities for the number of population means equal to 0 (first column) vs the number of selected populations (first row).

# of θi = 0   k = 1   k = 2   k = 3   k = 4   k = 5   k = 6
6             0.740   0.740   0.739   0.738   0.714   0.531
5             0.898   0.698   0.698   0.697   0.682   0.531
4             0.904   0.813   0.662   0.662   0.654   0.531
3             0.861   0.853   0.730   0.626   0.623   0.531
2             0.819   0.818   0.805   0.658   0.592   0.531
1             0.777   0.777   0.776   0.757   0.590   0.531
0             0.740   0.740   0.739   0.738   0.714   0.531
Table 3-3. Observed coverage probability of 95% CI for the mean of the selected populations when p = 6, using the new method.

(θ1, θ2, θ3, θ4, θ5, θ6)   k = 1   k = 2   k = 3   k = 4   k = 5   k = 6
(0, 0, 0, 0, 0, 0)         0.955   0.954   0.960   0.968   0.969   0.953
(0, 1, 2, 3, 4, 5)         0.977   0.966   0.959   0.959   0.957   0.957
(0, 3, 6, 9, 12, 15)       0.982   0.972   0.965   0.953   0.951   0.953
(0, 0, 0, 0, 3, 3)         0.978   0.968   0.954   0.957   0.961   0.955
Table 3-4. Cutoff points for 95% CI for different values of p and k using the new method.

Num Pop   k = 1   k = 2   k = 3   k = 4   k = 5   Bonf
1         1.960                                   1.960
2         1.960   2.236                           2.241
3         2.121   2.236   2.388                   2.394
4         2.234   2.319   2.388   2.491           2.498
5         2.319   2.387   2.443   2.491   2.569   2.576
[Figure 3-1 shows six panels, Selecting 1 out of 6 through Selecting 6 out of 6, each plotting the coverage probability against the norm of ∆.]
Figure 3-1. Coverage probabilities as a function of ∆ when p = 6. The plots suggest theminimum is not reached at the origin.
CHAPTER 4
INTERVAL ESTIMATION FOLLOWING THE SELECTION OF A RANDOM NUMBER OF POPULATIONS
From an application perspective, an interesting variation of the selection problem
occurs when the number of populations to be selected is random and depends on the
outcome of the experiment. For instance, in a standard multiple testing scheme, a
common approach is to run all the tests independently (without any corrections, such as
Tukey or Bonferroni) and then declare significant only a subset of the significant tests,
using procedures such as the false discovery rate (FDR).
In addition to the notation introduced in the previous chapters, we assume that
we observe a sequence of numbers d1 > ... > dp obtained as a result of the
experiment, such that di ∈ (−∞, ∞) for 1 ≤ i ≤ p. In this context, for any 0 < α < 1, we
want to determine the value of c > 0 such that

P(θ(1) ∈ X(1) ± c, ..., θ(K) ∈ X(K) ± c) ≥ 1 − α,

where K ∈ {0, ..., p} is a random quantity.
In order to obtain an expression for the coverage probability, we first write it as the sum

P(θ(1) ∈ X(1) ± c, ..., θ(K) ∈ X(K) ± c)
= ∑_{j=1}^{p} P(θ(1) ∈ X(1) ± c, ..., θ(K) ∈ X(K) ± c | K = j) P(K = j)
= ∑_{j=1}^{p} P(θ(1) ∈ X(1) ± c, ..., θ(j) ∈ X(j) ± c) P(K = j).     (4–1)
From our previous results, we notice that for every term in the sum

P(θ(1) ∈ X(1) ± c, ..., θ(j) ∈ X(j) ± c) ≥ (Φ(c) − Φ(−c))^{j−1} [Φ^{p−j+1}(c) − Φ^{p−j+1}(−c)],

and therefore, we can re-write (4–1) and obtain

P(θ(1) ∈ X(1) ± c, ..., θ(K) ∈ X(K) ± c)
≥ ∑_{j=1}^{p} (Φ(c) − Φ(−c))^{j−1} [Φ^{p−j+1}(c) − Φ^{p−j+1}(−c)] P(K = j),     (4–2)

where Φ(·) is the cdf of the standard normal distribution.
Since the inequality above is not obtained by direct minimization of the coverage
probability in (4–1), any solution based on (4–2) is likely to be too conservative.
Therefore, it is important to assess the performance of the proposed bound in terms
of its proximity to the coverage probability. The first thing to determine is the behavior
of the lower bounds at the component level (K = j). Figure 4-1 shows the results of
a numerical study considering the components K = 1, ..., K = 6 of the coverage
probability when p = 6. The dashed blue line shows the behavior of the respective
component as the norm of θ = (θ1, ..., θ6) increases and the red solid line shows the
corresponding lower bound. We observe that the lower bound (for the individual terms) is not
extremely conservative.
On the other hand, the probability that K = j is given by

P(K = j) = ∑_{i=1}^{C(p,j)} ∏_{ℓ∈Ii} P(Xℓ ≥ dj) ∏_{ℓ∈Ii^c} P(Xℓ ≤ dj)
= ∑_{i=1}^{C(p,j)} ∏_{ℓ∈Ii} [1 − Φ(dj − θℓ)] ∏_{ℓ∈Ii^c} Φ(dj − θℓ),     (4–3)
where P(Xℓ ≥ dj) = 1 − Φ(dj − θℓ) is the probability of selection for population ℓ.
Notice that the expression in (4–3) resembles a binomial distribution. In fact, taking
θ1 = ... = θp = θ, we have

∑_{i=1}^{C(p,j)} ∏_{ℓ∈Ii} [1 − Φ(dj − θ)] ∏_{ℓ∈Ii^c} Φ(dj − θ) = C(p, j) [1 − Φ(dj − θ)]^j [Φ(dj − θ)]^{p−j},
the binomial probability with success probability 1 − Φ(dj − θ). This observation
suggests we can use the quantities dj − θ as tuning parameters in order to improve the
performance of the lower bound. Figure 4-2 shows the results of a numerical study in
which we take d1 = ... = dp = d and use the quantity d − θ as a tuning parameter. We
see that by changing the value of the probability of selection we can move the position
of the lower bound (red solid line) and produce some improvement in the approximation of
the coverage probability.
Based on the previous observations, we can obtain an approximate solution to the
problem and determine c > 0 using the equation

1 − α = ∑_{j=1}^{p} C(p, j) (Φ(c) − Φ(−c))^{j−1} [Φ^{p−j+1}(c) − Φ^{p−j+1}(−c)] [1 − Φ(dj − θ)]^j [Φ(dj − θ)]^{p−j},

for any 0 < α < 1.
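A sketch of such a computation (illustrative only; it assumes a common threshold d, so that every population has the same selection probability sel = 1 − Φ(d − θ), as in Figure 4-2):

```python
from math import erf, sqrt, comb

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def rhs(c, p, sel):
    # Right-hand side of the approximate equation with a common
    # selection probability sel = 1 - Phi(d - theta) for every population.
    total = 0.0
    for j in range(1, p + 1):
        cover = (Phi(c) - Phi(-c)) ** (j - 1) * (
            Phi(c) ** (p - j + 1) - Phi(-c) ** (p - j + 1))
        weight = comb(p, j) * sel ** j * (1 - sel) ** (p - j)
        total += cover * weight
    return total

def cutoff(p, sel, alpha=0.05):
    # Bisection: rhs is increasing in c.
    lo, hi = 0.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if rhs(mid, p, sel) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = cutoff(6, 0.5)
print(round(c, 3))  # between z_{0.025} = 1.96 and the Bonferroni-type cutoff
```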
Numerical studies suggest that the results based on the expression above are not
extremely conservative. In addition, the results suggest that the performance of the method
greatly improves as the number of populations increases (see Figure 4-3).
4.1 Connection to FDR
The false discovery rate (FDR) procedure was introduced by Benjamini and
Hochberg (1995) and is a technique commonly used by practitioners in the context
of multiple testing. The main idea is to control the proportion of errors committed by
falsely rejecting null hypotheses. In simple terms, the procedure works in the following
way: suppose that we need to test m hypotheses H1, ..., Hm and we are not willing to
accept a proportion of false discoveries greater than q. We first rank the P-values (and
corresponding hypotheses) resulting from all the tests from smallest to largest and define
the sequence q1, q2, ..., qm according to qi = (i/m)q for i = 1, ..., m. Then, we define k
to be the largest i such that P-value(i) < qi. If we reject all the hypotheses corresponding
to the first k ordered P-values, the procedure guarantees that the FDR is
no greater than q.
In the context of our problem, we observe that the FDR procedure can be easily
connected with the random selection idea. Suppose that we have m = p hypotheses
of the form H0 : θi ≤ 0 vs. H1 : θi > 0. In other words, we are interested in performing
p one-sided tests for the population means. Then, extreme observations will have
small P-values, and therefore, the selection criterion P-value(i) < qi can be expressed
as X(i) > di . It follows that for the sequence q1 < ... < qp ∈ (0, 1) we can construct
a corresponding sequence d1 > ... > dp ∈ (−∞,∞), and hence, we can produce
confidence intervals for the top K selected populations, where the value of K is
determined by the FDR.
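Under an assumed one-sided z-test with known variance, the translation from P-value thresholds q_i to cutoffs d_i for the sample means can be made explicit; for H0: θ_i ≤ 0, the P-value is 1 − Φ(√n X̄_i/σ), so P-value_i < q_i is equivalent to X̄_i > (σ/√n) z_{1−q_i}. The normal model with known σ and the function name here are illustrative assumptions, not the dissertation's general setting:

```python
from scipy.stats import norm

def fdr_cutoffs(qs, sigma=1.0, n=1):
    """Translate FDR thresholds q_1 < ... < q_p into selection cutoffs
    d_1 > ... > d_p for the sample means, assuming a one-sided z-test of
    H0: theta_i <= 0 with known variance sigma^2 and sample size n."""
    return [sigma / n ** 0.5 * norm.ppf(1 - q) for q in qs]
```

Since the q_i are increasing, the resulting d_i are decreasing, matching the ordering d1 > ... > dp in the text.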
4.2 Tables and Figures
Figure 4-1. Individual components and corresponding bounds for the terms of the coverage probability for random K when p = 6. The blue dashed line corresponds to the coverage probability and the red solid line is the lower bound. (Panels: components K = 1 through K = 6; x-axis: Norm of Delta; y-axis: Coverage Probability.)
Figure 4-2. Behavior of the lower bound for random K when p = 6 as the probability of selection varies. (Panels: probability of selection 0.3, 0.4, 0.5, 0.6, 0.7, 0.8; x-axis: Norm of Delta; y-axis: Coverage Probability.)
Figure 4-3. Behavior of the coverage probabilities and respective lower bounds for random K as the population size p varies. (Panels: random out of p = 5, 10, 15, 20, 40, 60; x-axis: Norm of Delta; y-axis: Coverage Probability.)
CHAPTER 5
APPLICATION EXAMPLE
In this chapter we show a potential application for the procedures we have
introduced in this dissertation. We consider data from a genetic experiment that
compares gene expressions between two different tissue types, infection cushions
(IC) and vegetative hyphae (VH), through competitive hybridization. Using a single
probe, a total of five hybridizations (independent biological replications) were run for
7494 genes. The data consists of the processed signal intensities as a measurement of
the fluorescence reaction of every gene to the probe.
In the context of the experiment, one question of interest is to determine what
genes are differentially expressed between the two tissue types. Since all the genes are
exposed to the same probe, differences that cannot be explained by chance indicate
a variation associated with the tissue type in the corresponding genes. A related
question is what the fold increase or decrease in gene expression is between the
treatments; in other words, what is the mean signal ratio for each gene between treatments. Here, we
consider the problem of determining confidence intervals for the mean signal ratio in
those genes that give the top largest increase between the two treatments.
First, we implement the procedure for the top k genes, where k is fixed and
pre-specified. We end the chapter by showing how the procedure works when K is
chosen at random, as determined by the FDR.
5.1 Fixed Selection
Suppose first that the number of populations to be selected is determined prior
to the experiment. Specifically, suppose that k = 100. For every gene we take the
difference of the log-scores for each of the 5 replications. Then, we have a total of
p = 7494 populations, from which we take independent samples of size n = 5. Although
the number of replications is not large enough to invoke the central limit theorem (CLT),
the data do not show clear deviations from normality. To correct for heterogeneity, we
use the log-scores for the analysis.
Then, we rank the averages of the differences in descending order and select the
genes corresponding to the top 100 values of the sample means. Using the results
presented in Chapter 3, the cutoff value for 95% confidence intervals when p = 7494
and k = 100 is c = 4.35. Table 5-1 shows the mean, standard deviation and confidence
intervals for the top 5 and bottom 5 selected genes. The table shows that, although the
value of the cutoff point c is seemingly large, the actual confidence intervals are narrow
enough to draw practical conclusions.
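As an illustration, the fixed-selection intervals of the form mean ± c s/√n can be computed as follows. This is a sketch with hypothetical function names; the half-width formula is consistent with Table 5-1 (e.g. 4.35 × 0.262/√5 ≈ 0.51 for the top gene, giving roughly (4.25, 5.27)):

```python
import numpy as np

def topk_intervals(samples, k, c):
    """For each population, compute the sample mean of the log-score
    differences, select the top k, and form intervals mean +/- c * s / sqrt(n).
    `samples` is a (p, n) array; `c` is the selection-adjusted cutoff
    (e.g. c = 4.35 for p = 7494, k = 100 at the 95% level)."""
    p, n = samples.shape
    means = samples.mean(axis=1)
    sds = samples.std(axis=1, ddof=1)
    top = np.argsort(means)[::-1][:k]   # indices of the k largest sample means
    half = c * sds[top] / np.sqrt(n)
    return top, np.column_stack((means[top] - half, means[top] + half))
```

The only difference from the traditional intervals is the cutoff c, which replaces the usual normal or t quantile to account for the selection step.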
5.2 Random Selection
An alternative approach consists of performing one-sided t-tests for the mean
log-score difference for every gene. Then, all the P-values are ranked in ascending
order and we declare significance while controlling the FDR. Since the number of genes
that will be selected depends on the outcome of the experiment, we use the results
presented in Chapter 4.
Controlling for a false discovery rate of 5%, we select K = 25 populations. Using
the results from Chapter 4, we obtain that for p = 7494 populations the cutoff point for
95% confidence intervals is c = 4.44 (slightly bigger than the one obtained in the previous
section). Table 5-2 shows the mean, standard deviation, P-value and confidence
intervals for the mean difference of the 25 populations selected using the FDR criterion.
Again, we observe the intervals are narrow enough to carry out meaningful inference. In
fact, the results of all the intervals agree with the conclusions of the tests.
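The per-gene one-sided tests described above can be sketched as follows (the vectorized helper and its name are our own; newer SciPy versions also accept an `alternative='greater'` argument to `ttest_1samp` directly):

```python
import numpy as np
from scipy import stats

def one_sided_pvalues(diffs):
    """One-sided t-test P-values for H0: theta_i <= 0 vs H1: theta_i > 0,
    applied to each row of a (p, n) array of log-score differences.
    The two-sided P-value is halved on the correct side of zero."""
    t, p_two = stats.ttest_1samp(diffs, 0.0, axis=1)
    return np.where(t > 0, p_two / 2, 1 - p_two / 2)
```

These P-values would then be fed to an FDR procedure to determine K, and the selected genes receive intervals with the cutoff c from Chapter 4 in place of the usual t quantile.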
5.3 Tables and Figures
Table 5-1. Confidence intervals based on the selection of the top 100 log-score differences

Ranking  Mean  St Dev  95% CI
1        4.76  0.262   (4.247, 5.268)
2        4.38  0.303   (3.790, 4.969)
3        3.93  0.203   (3.534, 4.325)
4        3.79  0.519   (2.782, 4.804)
5        3.52  0.600   (2.351, 4.685)
96       1.35  0.930   (-0.457, 3.163)
97       1.35  0.680   (0.029, 2.675)
98       1.35  1.459   (-1.488, 4.189)
99       1.35  0.911   (-0.428, 3.118)
100      1.34  0.915   (-0.445, 3.117)
Table 5-2. Confidence intervals based on the selection of the top log-score differences, randomly chosen using FDR.

Mean  St Dev  P-value   95% CI
4.38  0.301   2.67e-08  (3.778, 4.981)
3.93  0.203   3.93e-08  (3.526, 4.333)
4.76  0.262   2.13e-07  (4.237, 5.279)
2.85  0.662   1.60e-06  (1.532, 4.161)
0.95  0.236   2.00e-05  (0.483, 1.420)
3.03  0.588   2.82e-05  (1.859, 4.194)
0.99  0.338   4.84e-05  (0.320, 1.662)
0.48  0.086   6.44e-05  (0.311, 0.652)
1.83  0.351   6.50e-05  (1.130, 2.526)
1.24  0.449   6.64e-05  (0.345, 2.129)
0.88  0.232   6.95e-05  (0.417, 1.337)
1.25  0.457   7.87e-05  (0.344, 2.159)
2.61  1.177   8.81e-05  (0.271, 4.943)
1.32  0.483   9.02e-05  (0.357, 2.277)
0.98  0.173   9.16e-05  (0.638, 1.324)
3.52  0.600   9.57e-05  (2.328, 4.709)
2.74  0.739   1.04e-04  (1.277, 4.212)
1.42  0.450   1.15e-04  (0.530, 2.319)
1.00  0.431   1.19e-04  (0.139, 1.851)
1.12  0.510   1.25e-04  (0.103, 2.127)
3.50  0.582   1.29e-04  (2.343, 4.654)
CHAPTER 6
CONCLUSIONS
We have proposed a method to construct confidence intervals for population means
following the selection of k ≥ 1 populations, where a population is selected if the
corresponding sample mean is among the top k sampled values. Unlike the traditional
intervals, our method takes into account the selection procedure and therefore
maintains the nominal coverage probability. Numerical studies show that the
new intervals perform better than the traditional intervals for any configuration of the
population means and they are consistently narrower than the Bonferroni intervals.
The methodology we have proposed to construct the intervals is based on the
minimization of the coverage probability. In Chapter 2 we proved that for k = 1 the
configuration of the population means (θ1, ... , θp) that minimizes the coverage probability
is the iid case, that is, whenever θ1 = θ2 = ... = θp = θ, for any value of θ. Moreover,
when this is the case, the coverage probability of the confidence intervals is determined
by the cumulative distribution function of the first order statistic, X(1) = max{X1, ... , Xp}.

For k > 1, we proved in Chapter 3 that the optimal configuration is reached
asymptotically when the top k − 1 population means go to +∞ and the remaining
p − k + 1 are equal. The approach we considered leads to an explicit formula for the
minimum of the coverage probability that includes k = 1 as a particular case.
In Chapter 4 we extended our results to the case where the number of selected
populations, K , is a random quantity depending on the outcome of the experiment.
Although we did not present a solution based on the direct minimization of the coverage
probability, we proposed a conservative approach, introducing a lower bound for the
coverage probability based on the results obtained in Chapter 3.
Intuitively, in order to construct confidence intervals that maintain the nominal level
in the context of selection, we need to take into account the variability coming from the
selection mechanism itself, and as a result, the confidence intervals are expected to be
longer. In addition, the conservative solutions tend to increase the length of the intervals.
However, the solutions we presented here have been shown to perform well in diverse
numerical studies and real applications. Although longer than the traditional intervals,
the proposed confidence intervals are not only shorter than the Bonferroni intervals, but
also grow at a very slow rate. In addition, all the main results presented remain valid if we
consider intervals of the form (c1(x), c2(x)). This opens the possibility of reducing the
length of the intervals by constructing non-symmetric confidence intervals, where the
interval limits can be shrunk using, for instance, empirical Bayes estimators.

Finally, observe that the approach discussed in Chapter 4 encourages the use of
confidence intervals as a way to determine significance. Such an approach could be used
in combination or in competition with FDR, but further investigation is required.
LIST OF REFERENCES

Bechhofer, R. (1954). A single-sample multiple decision procedure for ranking means of normal populations with known variances. The Annals of Mathematical Statistics 25(1), 16–39.

Bechhofer, R., T. Santner, and D. Goldsman (1995). Design and analysis of experiments for statistical selection, screening, and multiple comparisons. Wiley.

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 289–300.

Berger, J. (1976). Inadmissibility results for generalized Bayes estimators of coordinates of a location vector. The Annals of Statistics, 302–333.

Blumenthal, S. and A. Cohen (1968). Estimation of the larger of two normal means. Journal of the American Statistical Association 63(323), 861–876.

Brown, L. (1979). A heuristic method for determining admissibility of estimators–with applications. The Annals of Statistics, 960–994.

Chen, H. and E. Dudewicz (1976). Procedures for fixed-width interval estimation of the largest normal mean. Journal of the American Statistical Association 71(355), 752–756.

Cohen, A. and H. Sackrowitz (1982). Estimating the Mean of the Selected Population. Third Purdue Symposium on Statistical Decision Theory and Related Topics.

Cohen, A. and H. Sackrowitz (1986). A Decision Theoretic Formulation for Population Selection Followed by Estimating the Mean of the Selected Population. Fourth Purdue Symposium on Statistical Decision Theory and Related Topics, 243–270.

Dahiya, R. (1974). Estimation of the mean of the selected population. Journal of the American Statistical Association 69(345), 226–230.

Gupta, S. and K. Miescke (1990). On finding the largest normal mean and estimating the selected mean. Sankhya: The Indian Journal of Statistics, Series B 52(2), 144–157.

Gupta, S. and S. Panchapakesan (2002). Multiple decision procedures: theory and methodology of selecting and ranking populations. Society for Industrial Mathematics.

Gupta, S. and M. Sobel (1957). On a statistic which arises in selection and ranking problems. The Annals of Mathematical Statistics 28(4), 957–967.

Guttman, I. and G. Tiao (1964). A Bayesian approach to some best population problems. The Annals of Mathematical Statistics 35(2), 825–835.

Hwang, J. (1993). Empirical Bayes Estimation for the Means of the Selected Populations. Sankhya: The Indian Journal of Statistics, Series A 55(2), 285–304.

Lele, C. (1993). Admissibility results in loss estimation. The Annals of Statistics 21(1), 378–390.

Putter, J. and D. Rubinstein (1968). On estimating the mean of a selected population. Technical Report 165.

Qiu, J. and J. Hwang (2007). Sharp simultaneous intervals for the means of selected populations with application to microarray data analysis. Biometrics 63, 767–776.

Sackrowitz, H. and E. Samuel-Cahn (1984). Estimation of the mean of a selected negative exponential population. Journal of the Royal Statistical Society, Series B (Methodological) 46(2), 242–249.

Sackrowitz, H. and E. Samuel-Cahn (1986). Evaluating the chosen population: a Bayes and minimax approach. Lecture Notes–Monograph Series, 386–399.

Saxena, K. (1976). A single-sample procedure for the estimation of the largest mean. Journal of the American Statistical Association, 147–148.

Saxena, K. and Y. Tong (1969). Interval estimation of the largest mean of k normal populations with known variances. Journal of the American Statistical Association, 296–299.

Stein, C. (1964). Contribution to the discussion of Bayesian and non-Bayesian decision theory. Handout, Institute of Mathematical Statistics Meeting.
BIOGRAPHICAL SKETCH
Claudio Fuentes was born in Chile in 1977. Upon graduation from high school, he
enrolled as a student at the Pontificia Universidad Catolica de Chile, where he received
a degree of Bachelor of Science in mathematics in 2001. During his undergraduate
studies he was appointed as a teaching assistant for several courses. It was then that he
developed a deep appreciation for teaching and decided to pursue an academic career.
In December 2003, he received a master's degree in statistics from the same institution.
In August 2005, he entered the graduate program in the Department of Statistics
at the University of Florida. During his education there, he had the opportunity to work
as a research assistant for Distinguished Professor Dr. George Casella, who became
his advisor. In August 2008 he earned the degree of Master of Science in statistics with
a thesis in cluster analysis, and in August 2011 he earned his Ph.D. in statistics with a
dissertation in interval estimation following selection. After graduation, he joined the
Department of Statistics at Oregon State University as an assistant professor.