
ELSEVIER Astroparticle Physics 4 (1996) 285-291

Astroparticle Physics

Hypothesis ranking and the context of probabilities in an open-ended search

S.D. Biller Department of Physics, University of Leeds, Leeds LS2 9JT, UK

Received 22 February 1995; revised 3 August 1995

Abstract

In observational sciences such as astronomy, there are often no clear guidelines regarding potential source types or emission characteristics. As a consequence, data analyses often tend to be somewhat open-ended as they attempt to address a broad range of questions. In GeV and TeV gamma-ray astronomy, where analyses often deal with marginal significance levels, this may lead to biases in the assessment of an overall significance for a given finding unless an appropriate statistical penalty is paid to account for the total number of independent tests performed. Furthermore, many tests may not be amenable to immediate repetition with independent data, owing either to experimental sensitivity, limited exposure or source duty cycle. The assignment of a statistical penalty is especially difficult in open-ended analyses, where one must allow for arbitrary exploration of the data. Ideally, one would like to structure this penalty so as to give more weight to particular hypotheses that are a priori deemed more likely. This paper will propose methods to rank hypotheses and place chance probability calculations within the context of an exhaustive, open-ended search.

1. Introduction

In recent years, numerous advances have been realized in the fields of GeV and TeV gamma-ray astronomy. Objects such as pulsars, active galactic nuclei (AGN) and gamma-ray bursts (GRBs) have produced an unexpected wealth of information. In the GeV-TeV regime the Crab nebula/pulsar is seen to undergo a transition from pulsed to unpulsed emission; AGN have been found with remarkably flat spectral slopes which extend (at least in one case) to TeV energies and show a high degree of variability over timescales of days to months; and GRBs, whose origin still remains a total mystery, have been seen to produce energies at least as high as tens of GeV. Nevertheless, the science is still in its infancy. Due to the generally low flux levels, current satellite experiments are starved

for statistics at energies above several GeV. On the other hand, due to the large, continuous, isotropic flux of hadronic primaries whose trajectories possess little or no directional information related to their origin, background rejection is still a key issue for ground-based TeV astronomy. Compounding the problem is the fact that astronomy is an observational science where there are no clear guidelines regarding potential source types or emission characteristics. As a consequence, data analyses often tend to be somewhat open-ended as they attempt to address a broad range of questions. This naturally leads to concerns when dealing with modest detection levels that an objective calculation of the significance of a given observation may be compromised by the application of a posteriori logic. Even data bases with clearly detected signals can be subject to such biases. In a field where 5

0927-6505/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved. SSDI 0927-6505(95)00036-4


standard deviations above noise is considered to be a reasonable detection level and 3 standard deviations is considered to be "of interest", one must therefore be cautious when searching either for relatively weaker sources in the data or when dividing up a fairly strong signal to look for variability over shorter time scales or for energy-dependent effects. This is of particular concern when dealing with effects that may not be easily confirmed with independent data due to either limited experimental sensitivity/exposure or the inherent duty cycle of the source. In such cases it is important that each probability calculation related to a particular question is placed in context with all other questions that have been asked during the global analysis of the data. This context then determines the statistical penalty that must be applied to a given probability calculation in order to account for all other independent tests that have been performed. This context can also provide a means of giving more a priori statistical weight to particular hypotheses that are deemed more likely based either on theoretical models or previous observations. Finally, the formalism which produces this context should also provide a clear means of incorporating the results of future analyses.

Numerous authors (for example Sturrock 1973, Loredo and Lamb 1989) have applied Bayesian statistical approaches to the problem of deciding among several, mutually exclusive astrophysical hypotheses based on a given data sample, where Σ_i P(D|H_i)P(H_i) = 1. Here P(D|H_i) is the probability for the observed data given hypothesis H_i, and P(H_i) is the prior probability for the hypothesis. In the present case, one is concerned with accounting for the degrees of freedom when choosing the most significant result from a number of independent tests involving different data samples, where Σ_i P(D|H_i)P(H_i) could be anything, since the hypotheses need not be mutually exclusive. The following paper will address this issue in detail. It will start with a more formal description of "trial factors" as it applies to selecting the single most significant result from a series of tests. A method of extending this to place multiple tests of the same hypothesis in context, so as to find the significance of recurrent episodes of a given type, will then be discussed. Next, a procedure will be proposed for establishing a contextual blueprint for separate hypothesis tests within an exhaustive, open-ended data analysis which fulfills

the requirements previously stated. Finally, a more general method will be presented for assigning a pri- ori weights to those hypotheses deemed more likely to prove fruitful.

2. Testing a distribution of probabilities

When choosing the most "interesting" result from a series of studies, a question of considerable relevance concerns the final significance of that result given the number of other independent hypotheses that have been tested. In other words, one is concerned with the odds that the results from this or any other of the tests performed would yield an individual chance probability as small or smaller than that under consideration. In its simplest form, this relates to the binomial probability function:

P_n = [N!/(n!(N-n)!)] p^n (1-p)^(N-n). (1)

This equation gives the probability, P_n, to observe n successes in N attempts, where the probability for each success is p per trial. Setting n = 1 thus provides the chance probability for observing a single success with individual probability p (in this case the smallest individual chance probability observed in a series of tests) given N attempts. However, one must also account for ensembles of similar experimental measurements that would have yielded 2 or more successes (several results with chance probabilities at least as small as p). It is therefore necessary to calculate the cumulative binomial probability:

P_≥1 = Σ_{n=1}^{N} [N!/(n!(N-n)!)] p^n (1-p)^(N-n)
     = 1 - P_0 = 1 - (1-p)^N. (2)

P_≥1 is then the chance probability for observing 1 or more results with individual chance probabilities of p or smaller given N attempts. This is approximately equal to N × p for cases when this product is much less than 1. In the discussions that follow, N will be referred to as the trials or trial factor that is needed to account for the number of independent hypotheses tested. In some instances the independence of several hypotheses may not be clear. For these cases, techniques such as Monte Carlo calculation may be used


to estimate an effective trial factor so that the formula above will produce the correct probability distribution under the H_0 assumption of zero signal. The value of p is typically referred to as the pre-trial chance probability, and P_≥1 is thus referred to as the post-trial chance probability.
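The pre-trial to post-trial conversion of Eq. (2) is simple enough to sketch directly (a minimal illustration; the function name is mine, not from the paper):

```python
def post_trial(p: float, n_trials: int) -> float:
    """Post-trial chance probability of Eq. (2): the probability that
    1 or more of n_trials independent tests yields an individual
    chance probability of p or smaller."""
    return 1.0 - (1.0 - p) ** n_trials

# For N*p much less than 1 the result is close to N*p:
print(post_trial(1e-4, 10))
```

Note that for p = 10^-4 and N = 10 this is very nearly N × p = 10^-3, illustrating the approximation mentioned above.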

The post-trial chance probability as defined above simply considers the single most significant episode relative to the total number of entries in the probability distribution (e.g. the probability for the largest observed event excess from a source region compared with expectation for a given day). However, this method is not sensitive to lower-level emission that may be exhibited by a number of episodes, thus distorting the tail of the probability distribution. Simply applying a χ² test to the distribution, while sensitive to a distortion in the overall shape, is relatively insensitive to cases where the emission is only manifest in a small fraction of the episodes. The subject of distribution "outliers" has been detailed by several authors (see for example Hawkins 1980). However, these works often tend to deal with the step-wise identification and elimination of individual outliers from a given distribution and do not concentrate on assessing a cumulative probability corresponding to the total number of eliminations performed. In other cases, they are concerned with assessing the probability for the existence of a given number of outliers, but not with the probability of "n or more" outliers, nor with accounting for the degrees of freedom incurred in testing for several possible numbers and choosing the smallest resulting chance probability.

One alternative approach is as follows: Define P_1 to be the smallest chance probability of the distribution, P_2 to be the second smallest, and so on. Given the total number of episodes in the distribution, compute the binomial probabilities for 1 or more episodes to attain a chance probability of P_1 or smaller, 2 or more to attain P_2 or smaller, 3 or more to attain P_3 or smaller, etc. Then choose the most significant of these probabilities and use a Monte Carlo calculation to account for the effective trial factor associated with this choice. Effective trial factors derived by this method are shown in Fig. 1 (solid lines). For N = 1000, and P = 10^-4, the calculated factor is 60.
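The scan itself can be sketched as follows (a rough illustration, not the author's code; converting the winning binomial tail into a post-trial probability would still require the Monte-Carlo-derived effective trial factors of Fig. 1):

```python
import math

def binom_tail(N: int, k: int, p: float) -> float:
    """Probability of k or more successes in N trials at per-trial probability p."""
    return sum(math.comb(N, m) * p**m * (1.0 - p)**(N - m) for m in range(k, N + 1))

def distribution_search(probs):
    """For each k, compute the binomial probability that k or more episodes
    attain a chance probability of P_k or smaller; return the smallest such
    tail probability together with the k at which it occurs."""
    ordered = sorted(probs)
    N = len(ordered)
    return min((binom_tail(N, k, ordered[k - 1]), k) for k in range(1, N + 1))
```

For large N the inner sum should be truncated or replaced with a survival function, since this direct double loop scales as O(N²).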

This factor, which can obviously become quite large, may be further reduced in certain cases. For example, there may be instances where it would not

[Fig. 1 appears here: "Trial Factors for Distribution-Search". Solid curves: search through full distribution; dashed curves: search through upper 10% of distribution. Horizontal axis: number of episodes in data base, 10 to 10000; vertical axis: trial factor, 1 to 1000.]

Fig. 1. Trial factors for searching a distribution of episodes for the multiple occurrence of small chance probabilities. Solid lines refer to a search of the entire distribution, whereas the dashed lines refer to a search through the upper 10% tail of the distribution. Trial factors are given for pre-trial chance probabilities of 10^-6 (upper lines), 10^-5, 10^-4, 10^-3, 10^-2 and 10^-1 (lower lines), respectively.

be physically meaningful to search through the entire distribution, or where deviations associated with a significant portion of the distribution have already been ruled out by other tests (such as a test for continuous emission). In such cases, the prescription would be to only search through a given fraction of the entire distribution (chosen a priori). As an example, Fig. 1 (dashed lines) also shows the trial factors associated with searching through the upper 10% of a probability distribution (i.e. the 10% comprising the smallest chance probabilities in the distribution). For N = 1000, and P = 10^-4, this factor is 19. This factor still lowers the sensitivity compared to the approach of only testing the most significant episode if, indeed, the hypothesis of a single "hot" episode is correct. Therefore, one may wish to perform both tests, choose the most significant result, and pay an additional statistical penalty of ~2 for the choice.

A similar approach that has been employed by the Whipple collaboration to test distributions of Rayleigh powers (used to test for periodic signals) (Lewis 1990) differs in the specific choice of the test statistic. The statistic used in the Whipple analysis was that of the Fisher test (Fisher 1958). Application of Kolmogorov-Smirnov, Cramer-von Mises and related tests (Stephens 1970) were also investigated in that paper and found to be relatively less sensitive. However, even the Fisher test tends to be less sensitive


than the binomial approach described here for emission scenarios that would result in a cluster of similar burst probabilities. As an example, consider the case where 500 episodes are examined, with two episodes attaining Rayleigh powers of 9 (R_1) and 10 (R_2), respectively. Assuming that the rest of the distribution behaves as expected, the Fisher test approach would result in a pre-trial chance probability of 2.3% for the most significant deviation in the distribution of episodes. The binomial approach would yield a pre-trial probability of 0.18%. The trial factors for searching the distribution of burst probabilities in both tests are comparable. Under this scenario, the Fisher test would be more sensitive only when R_1 is greater than 10 and R_2 - R_1 > 5, at which point the gain is irrelevant since H_0 can easily be rejected by either test. Therefore, the binomial test appears to be a better general approach.
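Taking the standard single-episode chance probability for a Rayleigh power, P = e^(-R) (an assumption not spelled out in the text), the quoted 0.18% binomial figure can be reproduced:

```python
import math

def binom_tail(N: int, k: int, p: float) -> float:
    """Probability of k or more successes in N trials at per-trial probability p."""
    return sum(math.comb(N, m) * p**m * (1.0 - p)**(N - m) for m in range(k, N + 1))

# 500 episodes, two of which attain Rayleigh powers of 9 and 10;
# their individual chance probabilities are e^-9 and e^-10.
p2 = math.exp(-9)                    # the larger (second-smallest) probability
pre_trial = binom_tail(500, 2, p2)   # 2 or more episodes at P_2 or smaller
print(f"{pre_trial:.2%}")
```

This evaluates to about 0.18%, in agreement with the figure quoted in the text.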

3. Context of probabilities in an exhaustive search for emission

In the absence of more specific theoretical or observational guidance to indicate which sources are most likely to be emitters of GeV/TeV radiation, or to predict the nature of these emission characteristics, many different hypotheses must be examined. Consequently, given the potentially large number of trials associated with such a search, the relevance of any single probability calculation cannot be judged unless it is placed in context with the other questions that have been asked, thus indicating all appropriate degrees of freedom. One method of establishing this context is to form a "blueprint" for an exhaustive search of the data that both accounts for all analyses that have been performed, and provides a means of incorporating future analyses. Such a blueprint can be formed based on theoretical considerations and/or previous observational data which are independent of the data sample currently under investigation. It need not necessarily be formed prior to any data analysis but, if not, the blueprint should be clearcut and defensible as a reasonable scheme which is not based on the current set of data. As an example, a possible analysis scenario is shown in Fig. 2, where one specific analysis branch has been followed.

Each box in the diagram states the question being

addressed, and lines connecting the boxes indicate the relationship between these questions. Since the significance of a given analysis depends on an assessment of the related questions on lower levels, an accounting must be made for the number of independent hypotheses that have been tested at these levels. Appropriate trial factors must therefore be assessed at each branch point. Note that under this scenario, additional analyses that may be performed related to AGN, or the collective grouping of "other sources", will not affect the trial factor associated with studies of plerions, as they involve separate branches. Furthermore, this diagrammatic approach leads to an obvious method of establishing an a priori ranking of hypotheses: those hypotheses tested at higher levels, where there are fewer branches, will be more heavily weighted, as they will invoke smaller trial factors than those further down in the diagram.

Dashed boxes represent analyses that have not yet been completed but may be incorporated in the future, and dashed lines indicate other analysis branches that have not been followed in this illustration. Thus, as further analyses are performed, their results may be attached as additional branches at the appropriate level in the overall structure. Such attachments will affect the chance probability estimation only for those questions associated at the same level or higher.

In testing the more specific source hypotheses that occur further down in this tree, one must also be able to reject the null hypothesis at all higher levels (i.e. after accounting for trials due to tests performed on other branches). In other words, if, after fully accounting for all tests, there is no strong evidence of an overall signal from a specific source, it makes little sense to quote the chance probability for periodic emission from that source at a given frequency, since it is clear that this result does not survive after the appropriate trial factors have been assessed. Likewise, if there is no significant deviation in the overall probability distribution of sources, it makes little sense to quote the chance probability associated with any specific source. Limited applications of this approach have been used by the CYGNUS collaboration (Alexandreas et al. 1993, Biller et al. 1994).
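The bookkeeping implied by such a blueprint can be pictured as a small tree in which each node's probability is formed from the most significant of its branches, penalized by the number of branches at that point (an illustrative sketch with my own naming; the source names and probabilities below are made up, not from the paper):

```python
class Question:
    """A box in the analysis blueprint. Leaves carry measured pre-trial
    chance probabilities; internal nodes combine their branches."""
    def __init__(self, name, p=None, branches=()):
        self.name, self.p, self.branches = name, p, list(branches)

    def post_trial(self):
        if not self.branches:                  # leaf: measured probability
            return self.p
        best = min(b.post_trial() for b in self.branches)
        trials = len(self.branches)            # trial factor at this branch point
        return 1.0 - (1.0 - best) ** trials

# Hypothetical fragment: an extra AGN branch would change only levels at
# or above its own attachment point, never the parallel plerion branch.
root = Question("emission from any source?", branches=[
    Question("emission from AGN?", p=0.20),
    Question("emission from plerions?", branches=[
        Question("the Crab?", p=0.01),
        Question("another plerion?", p=0.30),
    ]),
])
print(root.post_trial())
```

Here the plerion branch yields 1 − (1 − 0.01)² = 0.0199, and the root then pays a further trial factor of two for choosing between its two branches.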


[Fig. 2 appears here as a logic diagram, largely illegible in this copy. Legible box labels include "Is There Emission From Any Source?", "Previously Observed Frequency?" and "Other Frequency?".]

Fig. 2. Sample logic diagram defining the context of probabilities within an exhaustive search for emission: Each box states a particular question being asked, dashed lines represent results of other tests not shown, and dashed boxes represent analyses not yet performed. To answer a particular question requires that all subordinate questions connected via branches (solid lines) be addressed. These branches, therefore, represent the trials that must be accounted for at each step.

4. Generalized a priori hypothesis weighting

When choosing the most significant result from a number of hypothesis tests, a more general procedure can be employed to account for the assignment of a priori weights of arbitrary value to each hypothesis.

Consider the set of chance probabilities:

P_1, P_2, ..., P_n (3)

corresponding to the tests of n hypotheses. Assign to these hypotheses the relative rankings

α_1, α_2, ..., α_n (4)

where the rankings are expressed as a fraction of the total number of hypotheses, so that Σ_i α_i = 1. Larger values of α correspond to more probable hypotheses. We now wish to redistribute the total number of trials (1 for each hypothesis) according to the assigned

ranks. Specifically, the number of trials assigned to a given hypothesis should be inversely proportional to its rank, so that fewer trials are assessed for the more likely hypotheses. This is accomplished by defining a set of new quantities:

Q_1, Q_2, ..., Q_n (5)

where

Q_i = 1 - (1 - P_i)^(1/α_i). (6)

The final chance probability, P_f, is then given by the smallest value of Q (corresponding to the most significant rank-corrected result). Note that the individual Q_i values are not distributed as proper probabilities under the null hypothesis (i.e. uniformly between 0 and 1); however, the ensemble of P_f values is.

In practice, there is rarely enough detailed information to warrant the degree of ranking that this method


is capable of providing. The simpler, diagrammatic approach is therefore recommended for most applications. However, the technique presented here may be useful for analyses such as correlation studies, where the results from one set of data are used to rank tests to be performed in another, independent set of data.
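Eq. (6) transcribes directly into code (a sketch with my own names; the ranks must sum to one):

```python
def rank_corrected(probs, ranks):
    """Apply Eq. (6): Q_i = 1 - (1 - P_i)^(1/alpha_i). Returns the final
    chance probability P_f (the smallest Q) and the full list of Q values."""
    assert abs(sum(ranks) - 1.0) < 1e-12, "ranks must sum to 1"
    qs = [1.0 - (1.0 - p) ** (1.0 / a) for p, a in zip(probs, ranks)]
    return min(qs), qs

# With equal ranks alpha_i = 1/n, Eq. (6) reduces to the ordinary
# trial factor of n applied to each probability:
pf, _ = rank_corrected([0.01, 0.5, 0.5, 0.5], [0.25] * 4)
```

In the equal-rank case above, pf = 1 − (1 − 0.01)⁴, exactly the post-trial probability of Eq. (2) with N = 4.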

5. Example

As a simple example of the hypothesis ranking previously described, let us begin with the hypothesis, H, that there is no detected emission from any source present in the data set, only background fluctuations. Then consider the case where this hypothesis is comprised of three other hypotheses: there is no emission from source number 1 (H_1), there is no emission from source number 2 (H_2) and there is no emission from source number 3 (H_3). Tests of these hypotheses then result in pre-trial chance probabilities P_1, P_2 and P_3, respectively. Assume now that, a priori, H_1 is thought to have a greater chance for success than either H_2 or H_3 individually. One might therefore construct the diagram shown in Fig. 3, where H_1 is put on an equal footing with the collective treatment of H_2 and H_3. In other words, at the highest level only two hypotheses are considered: H_1 and the hypothesis that there is no

[Fig. 3 appears here: a tree diagram with H, P at the top; beneath it (trials = 2) the nodes H_1, P_1 and H_23, P_23; beneath H_23 (trials = 2) the nodes H_2, P_2 and H_3, P_3.]

Fig. 3. Logic diagram for the example described in the text which defines the context of hypotheses H_1, H_2 and H_3 and their corresponding probabilities. H_23 arises from grouping H_2 and H_3 into a single hypothesis, which is then put on an equal footing with H_1. H (with its corresponding probability P) represents the hypothesis that there is no observed emission from any source and that the data set as a whole is therefore consistent with background fluctuations. Trial factors used to assess P_23 and P are also indicated.

[Fig. 4 appears here: a histogram titled "Distribution of Q_i and P_f Values". Solid line: Q_1; dashed line: Q_2; solid band: P_f. Horizontal axis: Q_i or P_f value, 0.0 to 1.0; vertical axis spanning 1 to 100.]

Fig. 4. Numerically computed distributions of Q_1, Q_2 and P for the example described in the text.

emission from either source 2 or 3. This latter hypothesis, which will be designated H_23, is tested by choosing the smaller chance probability of P_2 and P_3, and applying a trial factor of two for this choice to yield P_23. The hypothesis H is then tested (which is the final goal) by choosing the smaller chance probability of P_1 and P_23, and applying another trial factor of two to yield the final probability, P, that there is no detected emission from any source present in the data set.

In the language of the previous section, this procedure is equivalent to choosing α_1 = 0.5, α_2 = 0.25 and α_3 = 0.25 to compute the values Q_1, Q_2 and Q_3. The smallest value of Q is identically P. Under the null hypothesis, the tendency with which the result of any given test will be selected by this procedure as being the most significant will, by design, be directly reflected by the α values. Various distributions related to this scenario have been numerically computed based on 10^5 randomly generated ensembles. The probabilities P_1, P_2 and P_3 were taken as random variables uniformly distributed between 0 and 1 (i.e. the null hypothesis was assumed). Fig. 4 shows the distributions of Q_1 and Q_2 values (solid and dashed lines, respectively), and the distribution of P (equal to the minimum Q value in each ensemble). Note that, as previously stated, while the Q values yield a skewed distribution owing to the ranking which has been introduced, the final probability P is distributed uniformly and, thus, behaves in the statistically appropriate manner under the null hypothesis.
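The simulation is easy to re-create (a sketch, not the author's code; the seed and ensemble size are arbitrary choices of mine):

```python
import random

random.seed(1)
ranks = (0.5, 0.25, 0.25)     # alpha_1, alpha_2, alpha_3 from the example
n_mc = 100_000
picks = [0, 0, 0]             # how often each hypothesis is most significant
final_ps = []
for _ in range(n_mc):
    # Null hypothesis: pre-trial probabilities uniform on (0, 1).
    qs = [1.0 - (1.0 - random.random()) ** (1.0 / a) for a in ranks]
    i = min(range(3), key=qs.__getitem__)
    picks[i] += 1
    final_ps.append(qs[i])    # P = smallest Q in this ensemble

mean_p = sum(final_ps) / n_mc   # uniform on (0, 1) -> mean near 0.5
frac_h1 = picks[0] / n_mc       # H1 selected about half the time, by design
```

The simulation confirms both stated properties: P is uniform under the null hypothesis, and each hypothesis is selected as most significant with a frequency equal to its α value.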


Acknowledgement

I wish to thank the members of the CYGNUS collaboration for many useful discussions. This work has been supported by the National Science Foundation and the Particle Physics and Astronomy Research Council (UK).

References

Alexandreas, D.E. et al. (1993), Ap. J. 405, 353.

Biller, S.D. et al. (1994), Ap. J. 423, 714.

Fisher, R.A. (1958), Statistical Methods for Research Workers (Oliver and Boyd: Edinburgh and London).

Hawkins, D.M. (1980), Identification of Outliers (Chapman and Hall: London and New York).

Lewis, D.A. (1990), Arkansas Gamma-Ray and Neutrino Workshop '89, Nucl. Phys. B (Proc. Suppl.) 14A, ed. G.B. Yodh, D.C. Wold and W.R. Kropp (North-Holland) 299.

Loredo, T.J. and Lamb, D.Q. (1989), Proc. 14th Texas Symposium on Relativistic Astrophysics, Ann. N.Y. Acad. Sci. 571, 601, ed. E. Fenyves.

Stephens, M.A. (1970), J. R. Stat. Soc. 32, 115.

Sturrock, P.A. (1973), Ap. J. 182, 569.