Reliable ABC model choice via random forests

Pierre Pudlo, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier and Christian P. Robert

Université de Montpellier 2, I3M, Montpellier, France; Institut de Biologie Computationnelle (IBC), Montpellier, France; CBGP, INRA, Montpellier, France; Université Paris Dauphine, CEREMADE, Paris, France; and University of Warwick, Coventry, UK

Submitted to Proceedings of the National Academy of Sciences of the United States of America

Author contributions: PP, JMM, AE and CPR designed and performed research; PP, JMM, AE, JMC and MG analysed data; and PP, JMM, AE and CPR wrote the paper.

Approximate Bayesian computation (ABC) methods provide an elaborate approach to Bayesian inference on complex models, including model choice. Both theoretical arguments and simulation experiments indicate, however, that model posterior probabilities are poorly evaluated by ABC. We propose a novel approach based on a machine learning tool named random forests to conduct selection among the highly complex models covered by ABC algorithms. We strongly shift the way Bayesian model selection is both understood and operated, since we replace the evidential use of model posterior probabilities by predicting the model that best fits the data with random forests and computing an associated posterior error rate. Compared with past implementations of ABC model choice, the ABC random forest approach offers several improvements: (i) it has a larger discriminative power among the competing models, (ii) it is robust to the number and choice of statistics summarizing the data, (iii) the computing effort is drastically reduced (with a minimum gain in computation efficiency around a factor of about fifty), and (iv) it includes an embedded and cost-free error evaluation conditional on the actual analyzed dataset. Random forests will undoubtedly extend the range of size of datasets and complexity of models that ABC can handle. We illustrate the power of the ABC random forest methodology by analyzing controlled experiments as well as real population genetics datasets.

Approximate Bayesian computation | model selection | summary statistics | k-nearest neighbors | likelihood-free methods | random forests | posterior predictive | error rate | Harlequin ladybird | Bayesian model choice

Abbreviations: ABC, approximate Bayesian computation; RF, random forest; LDA, linear discriminant analysis; MAP, maximum a posteriori; nn, nearest neighbors; CART, classification and regression tree; SNP, single nucleotide polymorphism

Since its introduction (1, 2, 3), the approximate Bayesian computation (ABC) method has found an ever increasing range of applications covering diverse types of complex models (see, e.g., 4, 5, 6, 7). The principle of ABC is to conduct Bayesian inference on a dataset through comparisons with numerous simulated datasets. However, it suffers from two major difficulties. First, to ensure reliability of the method, the number of simulations must be large; hence, it proves difficult to apply ABC to large datasets (e.g., in population genomics, where ten to a hundred thousand markers are commonly genotyped). Second, calibration has always been a critical step in ABC implementation (8, 9). More specifically, the major feature in this calibration process involves selecting a vector of summary statistics that quantifies the difference between the observed data and the simulated data. The construction of this vector is therefore paramount, and examples abound of poor ABC performances related to specific choices of those statistics.
In particular, in the setting of ABC model choice, the summaries play a crucial role in providing consistent or inconsistent inference (10, 11, 12).

We advocate here a drastic modification of the way ABC model selection is conducted: we propose to both step away from a mere mimicking of exact Bayesian solutions like posterior probabilities, and reconsider the very problem of constructing efficient summary statistics. First, given an arbitrary pool of available statistics, we now completely bypass the selection of a subset of those. This new perspective directly proceeds from machine learning methodology. Second, we also entirely bypass the ABC estimation of model posterior probabilities, as we deem the numerical ABC approximations of such probabilities fundamentally untrustworthy, even though the approximations can preserve the proper ordering of the compared models. Having abandoned approximations of posterior probabilities, we implement the crucial shift to using posterior error rates for model selection towards assessing the reliability of the selection made by the classifier. The statistical technique of random forests (RF) (13) represents a trustworthy machine learning tool well adapted to complex settings as is typical for ABC treatments, and which allows an efficient computation of posterior error rates. We show here how RF improves upon existing classification methods in significantly reducing both the classification error and the computational expense.

Model choice

Bayesian model choice (14, 15) compares the fit of M models to an observed dataset x0. It relies on a hierarchical modelling, setting first prior probabilities on model indices m ∈ {1, ..., M} and then prior distributions π(θ|m) on the parameter θ of each model, characterized by a likelihood function f(x|m, θ). Inferences and decisions are based on the posterior probabilities of each model, π(m|x0).

ABC algorithms for model choice. To approximate posterior probabilities of competing models, ABC methods (16) compare observed data with a massive collection of pseudo-data generated from the prior; the comparison proceeds via a normalized Euclidean distance on a vector of statistics S(x) computed for both observed and simulated data. Standard ABC estimates the posterior probabilities π(m|x0) at stage (B) of Algorithm 1 below as the frequencies of those models within the k nearest-to-x0 simulations, proximity being defined by the distance between s0 and the simulated S(x)'s. Selecting a model means choosing the model with the highest frequency in the sample of size k produced by ABC, such frequencies being approximations to posterior probabilities of models. We stress that this solution means resorting to a k-nearest neighbor (k-nn) estimate of those probabilities, for a set of simulations drawn at stage (A), whose records constitute the so-called reference table.

In fact, this interpretation provides a useful path to convergence properties of ABC parameter estimators (17) and properties of summary statistics to compare hidden Markov random fields (18).

Algorithm 1 General ABC algorithm
(A) Generate Nref simulations (m, θ, S(x)) from the joint π(m) π(θ|m) f(x|m, θ).
(B) Learn from this set to infer about m or θ at s0 = S(x0).
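For illustration only, stage (A) and the standard k-nn version of stage (B) can be sketched in a few lines of Python. The functions sample_model_index(), sample_theta(m) and simulate(m, theta) (the latter returning the summary vector S(x)) are hypothetical placeholders for the user's prior and simulator, and scikit-learn's NearestNeighbors stands in for the normalized Euclidean comparison; this is not the DIYABC implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_reference_table(n_ref, sample_model_index, sample_theta, simulate):
    # Stage (A): draw (m, theta, S(x)) from pi(m) pi(theta|m) f(x|m, theta).
    models, thetas, stats = [], [], []
    for _ in range(n_ref):
        m = sample_model_index()           # m ~ pi(m)
        theta = sample_theta(m)            # theta ~ pi(theta | m)
        stats.append(simulate(m, theta))   # S(x) for x ~ f(x | m, theta)
        models.append(m)
        thetas.append(theta)
    return np.asarray(models), thetas, np.vstack(stats)

def knn_model_frequencies(stats, models, s0, k=500):
    # Stage (B), standard ABC: model frequencies among the k simulations
    # closest to s0 after coordinate-wise scaling (a k-nn estimate of pi(m | s0)).
    scale = stats.std(axis=0)
    scale[scale == 0.0] = 1.0              # guard against constant statistics
    nn = NearestNeighbors(n_neighbors=k).fit(stats / scale)
    _, idx = nn.kneighbors((s0 / scale).reshape(1, -1))
    neighbours = models[idx[0]]
    return {m: float(np.mean(neighbours == m)) for m in np.unique(models)}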
A major calibration issue with ABC imposes selecting the summary statistics S(x). When considering the specific goal of model selection, the ABC approximation to the posterior probabilities will eventually produce a right ordering of the fit of competing models to the observed data, and thus will select the right model for a specific class of statistics, when the information carried by the data becomes important (12). The state-of-the-art evaluation of ABC model choice is thus that some statistics produce nonsensical decisions and that there exist sufficient conditions for statistics to produce consistent model prediction, albeit at the cost of a possibly substantial information loss due to the summaries. The toy example comparing MA(1) and MA(2) models in SI and Fig. 1 clearly exhibits this potential loss.

It may seem tempting to collect the largest possible number of summary statistics to capture more information from the data. However, ABC algorithms, like k-nn and other local methods, suffer from the curse of dimensionality (see, e.g., Section 2.5 in (19)) and yield poor results when the number of statistics is large. Selecting summary statistics is therefore paramount, as shown by the recent literature (see (9) for a survey of ABC parameter estimation). Excursions into machine learning are currently limited, being mostly a dimension reduction device that preserves the recourse to k-nn methods; see, e.g., the call to boosting in (20) for selecting statistics in problems pertaining to parameter estimation (21). For model choice, two projection techniques are considered. First, (22) show that the Bayes factor itself is an acceptable summary (of dimension one) when comparing two models, but its practical evaluation via a pilot ABC simulation induces a poor approximation of model evidences (10, 11). The recourse to a regression layer like linear discriminant analysis (LDA) (23) is discussed below and in SI (Classification methods section). Given the fundamental difficulty in producing reliable tools for model choice based on summary statistics (11), we now propose to switch to a better adapted machine learning approach based on random forest (RF) classifiers.
ABC model choice via random forests. SI provides a review of classification methods. The so-called Bayesian classifier, based on the maximum a posteriori (MAP) model, minimizes the 0-1 error (24). However, estimating the posterior probabilities has a major impact on the performances of the classifier, due to the substitution of a classification exercise by a more difficult regression problem (24). This difficulty drives us to a paradigm shift, namely to give up the attempt at both estimating posterior probabilities by ABC and selecting summary statistics. Instead, our version of stage (B) in Algorithm 1 relies on a classifier that can handle an arbitrary number of statistics and extract the maximal information from the reference table obtained at stage (A). For this purpose, we resort to random forest (RF) classifiers (13) and call the resulting algorithm ABC-RF.

Refraining here from a detailed entry to RF algorithms (see SI for such details), we recall that the technique stems from the bagging algorithms of (25), applying to both classification and regression. RF grows many overfitted decision trees trained with a randomized CART (classification and regression tree, see 26) algorithm on bootstrap sub-samples from the ABC reference table: it takes advantage of the weak dependency of these almost unbiased trees to reduce variance by aggregating the tree classifiers towards a majority-rule decision.
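For concreteness, the following minimal sketch trains such a forest on the reference table and returns the majority-vote model at the observed summaries. It uses scikit-learn's RandomForestClassifier rather than the authors' own software, and the settings (n_estimators, max_features) are illustrative defaults, not the tuning used in the paper; stats, models and s0 are the arrays of the previous sketch, with s0 = S(x0).

from sklearn.ensemble import RandomForestClassifier

def abc_rf_choose_model(stats, models, s0, n_trees=500):
    # Each tree is grown on a bootstrap sub-sample, with a random subset of
    # covariates tried at each split; the final choice is the majority vote.
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        max_features="sqrt",
        bootstrap=True,
        oob_score=True,      # out-of-bag error, reused below for calibration
        n_jobs=-1,
    )
    rf.fit(stats, models)
    chosen = rf.predict(s0.reshape(1, -1))[0]
    return chosen, rf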
The justification for choosing RF to conduct an ABC model selection is that, both formally and experimentally, RF classification was shown to be mostly insensitive both to strong correlations between predictors and to the presence of noisy variables, even in relatively large numbers (19, Chapter 5), a characteristic that k-nn classifiers miss. For instance, consistency for a simplified RF procedure is such that the rate of convergence only depends on the intrinsic dimension of the problem (27). Consistency of the original algorithm was also proven for additive regression models (28), demonstrating that RF can apprehend large dimensions. Such robustness justifies adopting an RF strategy to learn from an ABC reference table towards Bayesian model selection. Within an arbitrary collection of summary statistics, some may exhibit strong correlations and others be uninformative about the model index, but this does not jeopardize the RF performances. For model selection, RF is thus in competition with the two local classifiers commonly implemented within ABC and mimicking exact Bayesian solutions. It is arguably superior to local logistic regression, as implemented in the DIYABC software (29); the latter includes a linear model layer within the k-nn selection (30), but suffers from the curse of dimensionality, which forces a selection among statistics, and is extremely costly; see, e.g., how (23) reduces the dimension using a linear discriminant projection before resorting to local logistic regression.

The outcome of RF is a model index, corresponding to the most frequently predicted model index within the aggregated decision trees. This is the model best suited to the observed data. It is worth stressing that there is no direct connection between the frequencies of the model allocations of the data among the tree classifiers and the posterior probabilities of the competing models. In practice, the decision frequencies of the trees happen to show a strong bias towards 0 or 1, and thus produce an unreliable quantitative indicator. We therefore propose to rely on an alternative posterior error estimation to measure the confidence in model choice produced by RF.
Posterior error rate as confidence report

Machine learning classifiers miss a distinct advantage of posterior probabilities, namely that the latter evaluate a confidence degree in the selected (MAP) model. An alternative to those probabilities is the prior error rate, which provides an indication of the global quality of a given classifier m̂ on the whole feature space. This rate is the expected value of the misclassification error over the hierarchical prior,

\sum_{m} \pi(m) \int \mathbf{1}\{\hat{m}(S(y)) \neq m\} \, f(y \mid \theta, m) \, \pi(\theta \mid m) \, \mathrm{d}y \, \mathrm{d}\theta,

and it can be evaluated from simulations (θ, m, S(y)) drawn from the prior, independently of the reference table (18), or with the out-of-bag error in RF (19, Chapter 15), a procedure that requires no further simulation (see SI). Machine learning relies on this prior error to calibrate classifiers (e.g., the number k of neighbors in k-nn and local logistic models, or the tuning parameters of RF). But this indicator remains poorly relevant, since the only point of importance in the dataset space is the observed dataset s0 = S(x0).
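Both evaluations of the prior error rate are straightforward to code. The sketch below, again only illustrative and scikit-learn based, estimates it either as a Monte Carlo average over an independent set of prior simulations (stats_test, models_test) or from the out-of-bag score of a forest fitted with oob_score=True, as in the earlier sketch.

import numpy as np

def prior_error_rate(classifier, stats_test, models_test):
    # Monte Carlo average of the misclassification error over an independent
    # set of simulations (theta, m, S(y)) drawn from the hierarchical prior.
    return float(np.mean(classifier.predict(stats_test) != models_test))

def oob_prior_error(rf):
    # Out-of-bag alternative: no simulation beyond the reference table itself.
    return 1.0 - rf.oob_score_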
A first step addressing this issue is to obtain error rates conditional on the data, as in (18). However, the statistical methodology available for this purpose suffers from the curse of dimensionality. We thus replace this conditional error with the average of the misclassification loss 1{m̂(S(y)) ≠ m} taken over the posterior predictive distribution, namely

\sum_{m} \pi(m \mid s_0) \int \mathbf{1}\{\hat{m}(S(y)) \neq m\} \, f(y \mid \theta, m) \, \pi(\theta \mid m, s_0) \, \mathrm{d}y \, \mathrm{d}\theta.    [1]

This solution answers criticisms on the prior error evaluation, since it weights the misclassification error by the posterior distribution; [1] is called the posterior error rate.

Fig. 1: Illustration of the discrepancy between true posterior probabilities and their ABC approximations. The aim is to choose between two nested time series models, namely moving averages of order 1 and 2 (denoted MA(1) and MA(2), respectively; see SI for more details). Each point of the plot gives the two posterior probabilities of MA(2) for a dataset simulated either from the first (blue) or second (orange) model. Even though the first two autocovariance statistics are informative for this model choice, values on the x-axis (true posterior probability of MA(2)), equal to the exact posterior probabilities of MA(2), differ substantially from their ABC counterparts on the y-axis (ABC posterior probability of MA(2)).

The practical derivation of the posterior error rate is easily conducted via a secondary ABC algorithm, described below (see Algorithm 2). This algorithm relies on a natural proximity between s0 and S(y) stemming from the RF, namely the number of times both inputs fall into the same tip of an RF tree. The sample (m, θ, S(y)) of size k × Npp produced in step (c) constitutes an ABC approximation of the posterior predictive distribution. The posterior error rate [1] is then approximated in step (d) by averaging prediction errors over this sample.

Algorithm 2 Computation of the posterior error
(a) Use the trained RF to compute the proximity between each (m, θ, S(x)) of the reference table and s0 = S(x0).
(b) Select the k simulations with the highest proximity to s0.
(c) For each (m, θ) in the latter set, compute Npp new simulations S(y) from f(y|θ, m).
(d) Return the frequency of erroneous RF predictions over these k × Npp simulations.
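A possible implementation of Algorithm 2, under the same assumptions as the previous sketches (a fitted scikit-learn forest rf, the reference table stats, models, thetas, and the hypothetical simulate function), obtains the RF proximity from the leaf indices returned by rf.apply; it is a sketch, not the authors' code.

import numpy as np

def posterior_error_rate(rf, stats, models, thetas, s0, simulate, k=500, n_pp=20):
    # (a) proximity of each reference-table entry to s0: number of trees in
    #     which both observations fall into the same leaf (tip).
    leaves_ref = rf.apply(stats)                 # shape (n_ref, n_trees)
    leaves_obs = rf.apply(s0.reshape(1, -1))     # shape (1, n_trees)
    proximity = (leaves_ref == leaves_obs).sum(axis=1)
    # (b) keep the k simulations with the highest proximity to s0
    nearest = np.argsort(proximity)[-k:]
    # (c)-(d) simulate n_pp pseudo-datasets per retained (m, theta) and return
    #         the frequency of erroneous RF predictions over the k * n_pp draws
    errors = []
    for i in nearest:
        m, theta = models[i], thetas[i]
        new_stats = np.vstack([simulate(m, theta) for _ in range(n_pp)])
        errors.append(np.mean(rf.predict(new_stats) != m))
    return float(np.mean(errors))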
Illustrations

To illustrate the power of the ABC-RF methodology, we now report several controlled experiments as well as two genuine population genetic examples.

Insights from controlled experiments. SI details controlled experiments on a toy problem, comparing MA(1) and MA(2) time-series models, and two controlled synthetic examples from population genetics, based on SNP and microsatellite data. The toy example is particularly revealing of the discrepancy between the posterior probability of a model and the version conditioning on the summary statistics s0. Fig. 1 shows how far from the diagonal are realizations of the pairs (π(m|x0), π(m|s0)), even though the autocorrelation statistic is quite informative (8). Note in particular the vertical accumulation of points near p(m = 2|x0) = 1. Table S1 demonstrates the further gap in predictive power of the full Bayes solution, with a true error rate of 12%, versus the best solution based on the summaries (RF) barely achieving a 17% error rate.

For both controlled genetics experiments in SI, the computation of the true posterior probabilities of the three models is impossible. The predictive performances of the competing classifiers can nonetheless be compared on a test sample.

Table 1: Harlequin ladybird data: estimated prior error rates for various classification methods and sizes of reference table.

                                              Prior error rates (%) for a reference table of size
Classification method                         Nref = 10,000   Nref = 20,000   Nref = 50,000
linear discriminant analysis (LDA)                    39.91           39.30           39.04
standard ABC (k-nn) on DIYABC summaries               57.46           53.76           51.03
standard ABC (k-nn) on LDA axes                       39.18           38.46           37.91
local logistic regression on LDA axes                 41.04           37.08           36.05
random forest (RF) on DIYABC summaries                40.18           38.94           37.63
RF on DIYABC summaries and LDA axes                   36.86           35.62           34.44

Performances of classifiers used in stage (B) of Algorithm 1. A set of 10,000 prior simulations was used to calibrate the number of neighbors k in both standard ABC and local logistic regression, and the size Nboot of the bootstrap sub-samples for the trees of RF. Prior error rates were estimated as average misclassification errors on an independent set of 10,000 prior simulations, constant over methods and sizes of the reference tables.
Results, summarized in Tables S2 and S3 in the SI, legitimate our support of RF as the optimal classifier, with gains of several percent. Those experiments demonstrate in addition that the posterior error rate can vary widely compared with the average prior rate, hence making a case for its significance in data fitting (for details, see Section 3 in the SI). A last feature worth mentioning is that, while LDA alone does not perform uniformly well over all examples, the conjunction of LDA and RF always produces an improvement, with the first LDA axes appearing among the most active summaries of the trained forests (Fig. S6 and S8). This stresses both the appeal of LDA as extra summaries and the amalgamating effect of RF, namely its ability to incorporate highly relevant statistics within a wide set of possibly correlated or non-informative summaries.
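As an illustration of this conjunction, the fragment below (scikit-learn based and purely indicative) appends the LDA projections of the reference table to the raw summaries, retrains the forest and ranks the covariates by the forest's importance measure; stats, models and s0 are the arrays of the earlier sketches. The last columns of stats_aug correspond to the LDA axes, so their position in the ranking mirrors what Fig. S6 and S8 display for the actual analyses.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier

lda = LinearDiscriminantAnalysis().fit(stats, models)
stats_aug = np.hstack([stats, lda.transform(stats)])            # adds at most M - 1 columns
s0_aug = np.hstack([s0, lda.transform(s0.reshape(1, -1))[0]])

rf_aug = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1)
rf_aug.fit(stats_aug, models)
ranking = np.argsort(rf_aug.feature_importances_)[::-1]         # most active covariates first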
Microsatellite dataset: retracing the invasion routes of the Harlequin ladybird. The original challenge was to conduct inference about the introduction pathway of the invasive Harlequin ladybird (Harmonia axyridis) for the first recorded outbreak of this species in eastern North America. The dataset, first analyzed in (31) and (23) via ABC, includes samples from five natural and biocontrol populations genotyped at 18 microsatellite markers. The model selection requires the formalization and comparison of 10 complex competing scenarios corresponding to various possible routes of introduction (see analysis 1 in (31) and SI for details). We now compare our results from the ABC-RF algorithm with other classification methods and with the original solutions of (31) and (23). RF and other classifiers discriminating among the 10 scenarios were trained on either 10,000, 20,000 or 50,000 simulated datasets. We included all summary statistics computed by the DIYABC software for microsatellite markers (29), namely 130 statistics, complemented by the nine LDA axes as additional summary statistics. More details about this example can be found in the SI.

In this example, discriminating among models based on the observation of summary statistics is difficult. The overlapping groups of Fig. S10 in the SI reflect that difficulty, whose source is the relatively low information carried by the 18 autosomal microsatellite loci considered here. Prior error rates of the learning methods on the whole reference table are given in Table 1. As expected in such high-dimensional settings (19, Section 2.5), the k-nn classifiers behind the standard ABC methods perform uniformly badly when trained on the 130 numerical summaries, even when well calibrated. On a much smaller set of covariates, namely the nine LDA axes, these local methods (standard ABC and local logistic regression) behave much more nicely. The best classifier in terms of prior error rates is a RF trained on the 130 summaries and the nine LDA axes, whatever the size of the reference table. Additionally, Fig. S11 shows that RFs are clearly able to automatically determine the (most) relevant statistics for model comparison, including in particular some crude estimates of admixture rates defined in (32), some of them not selected by the experts in (31). We stress here that the level of information of the summary statistics displayed in Fig. S11 is relevant for model choice but not for parameter estimation issues. In other words, the set of best summaries found with ABC-RF should not be considered as an optimal set for further parameter estimation under a given model with standard ABC techniques (3).

The evolutionary scenario selected by our RF strategy fully agrees with the earlier conclusion of (31), based on approximations of posterior probabilities with local logistic regression solely on the LDA axes (i.e., the same scenario displays the highest ABC posterior probability and the largest number of selections among the decisions taken by the aggregated trees of RF). Another noteworthy feature of this re-analysis is the posterior error rate of the best ABC-RF, approximated by 40% when running Algorithm 2 on k = 500 neighbors and Npp = 20 simulated datasets per neighbor. In agreement with this, the posterior probability of the chosen scenario in (31) is relatively low (about 60%). It is worth stressing here that posterior error rates and posterior probabilities are not commensurable, i.e., they cannot be measured on the same scale. For instance, a posterior probability of 60% is not the equivalent of a posterior error rate of 40%, as the former is a transform of a vector of evidences, while the latter is an average performance over hypothetical datasets. These quantities are therefore not to be assessed on the same ground, one being a Bayesian construct of the probability of a model, the other one a weighted evaluation of the chances of selecting the wrong model.
Fig. 2: Human SNP data: projection of the reference table on the first four LDA axes (LD1 to LD4). Colors correspond to model indices. (See SI for the description of the models.) The location of the additional datasets is indicated by a large black star.

Table 2: Human SNP data: estimated prior error rates for classification methods and three sizes of reference table.

                                                  Prior error rates (%) for a reference table of size
Classification method                             Nref = 10,000   Nref = 20,000   Nref = 50,000
linear discriminant analysis (LDA)                         9.91            9.97           10.03
standard ABC (k-nn) using DIYABC summaries                23.18           20.55           17.76
standard ABC (k-nn) using only LDA axes                    6.29            5.76            5.70
local logistic regression on LDA axes                      6.85            6.42            6.07
random forest (RF) using DIYABC initial summaries          8.84            7.32            6.34
RF using both DIYABC summaries and LDA axes                5.01            4.66            4.18

Same comments as in Table 1.
SNP dataset: inference about Human population history. Because ABC-RF performs well with a substantially lower number of simulations compared to standard ABC methods, it is expected to be of particular interest for the statistical processing of massive Single Nucleotide Polymorphism (SNP) datasets, whose production is on the increase in the field of population genetics. We analyze here a dataset including 50,000 SNP markers genotyped in four Human populations (33). The four populations include Yoruba (Africa), Han (East Asia), British (Europe) and American individuals of African ancestry, respectively. Our intention is not to bring new insights into Human population history, which has been and is still studied in great detail using genetic data, but to illustrate the potential of ABC-RF in this context. We compared six scenarios (i.e., models) of evolution of the four Human populations, which differ from each other by one ancient and one recent historical event: (i) a single out-of-Africa colonization event giving an ancestral out-of-Africa population which secondarily split into one European and one East Asian population lineage, versus two independent out-of-Africa colonization events, one giving the European lineage and the other one giving the East Asian lineage; (ii) the possibility of a recent genetic admixture of Americans of African origin with their African ancestors and individuals of European or East Asian origin. The SNP dataset and the compared scenarios are further detailed in the SI. We used all the summary statistics provided by DIYABC for SNP markers (29), namely 130 statistics in this setting, complemented by the five LDA axes as additional statistics.

To discriminate among the six scenarios of Fig. S12 in SI, RF and other classifiers have been trained on three nested reference tables of different sizes. The estimated prior error rates are reported in Table 2. Unlike the previous example, the information carried here by the 50,000 SNP markers is much higher, because it induces better separated simulations on the LDA axes (Fig. 2) and much lower prior error rates (Table 2). Even in this case, RF using both the initial summaries and the LDA axes provides the best results.

ABC-RF on the Human dataset selects Scenario 2 as the forecasted scenario, an answer which is not visually obvious on the LDA projections of Fig. 2. But, considering previous population genetics studies in the field, it is not surprising that this scenario, which includes a single out-of-Africa colonization event giving an ancestral out-of-Africa population with a secondary split into one European and one East Asian population lineage, and a recent genetic admixture of Americans of African origin with their African ancestors and European individuals, was selected among the six compared scenarios. This selection is associated with a high confidence level, as indicated by an estimated posterior error rate equal to zero. As in the previous example, we used Algorithm 2 with k = 500 neighbors and then simulated Npp = 20 replicates per neighbor to estimate the posterior error rate.

Computation time is a particularly important issue in the present example. Simulating the 10,000 SNP datasets used to train the classification methods requires seven hours on a computer with 32 processors (Intel Xeon(R) CPU, 2 GHz). In that context, we are delighted to observe that the RF classifier constructed on the summaries and the LDA axes with a 10,000-simulation reference table has a smaller prior error rate than all other classifiers, even when they are trained on a 50,000-simulation reference table. It is worth noting that standard ABC treatments for model choice are based in practice on reference tables of substantially larger sizes: 10^5 to 10^6 simulations per scenario (23, 34). For the above setting in which six scenarios are compared, standard ABC treatments would request a minimum computation time of 17 days (using the same computation resources). According to the comparative tests that we carried out on various example datasets, we found that RF globally allowed a minimum computation speed gain around a factor of 50 in comparison to standard ABC treatments (see also Section 4 of SI for other considerations regarding computation speed gain).
Conclusion

The present paper is purposely focused on selecting a model, which is a classification problem trained on ABC simulations. Indeed, there exists a fundamental and numerical discrepancy between genuine posterior probabilities and probabilities based on summary statistics (10, 11). When statistics follow the consistency conditions of (12), the discrepancy remains, but the resulting algorithm asymptotically selects the proper model as the size of the data grows.

We defend here the paradigm shift of quantifying our confidence in the selected model by the computation of a posterior error rate, along with the abandonment of approximating posterior probabilities, since the latter cannot be assessed at a reasonable computational cost. The posterior error rate produces an estimated error as an average over the a posteriori most likely part of the parameter space, including the information contained in the data. It further remains within the Bayesian paradigm and is a convergent evaluation of the true error made by RF itself, whence it represents a natural substitute for the usually uncertain ABC approximation of posterior probabilities.

Compared with past ABC implementations, ABC-RF offers improvements at least at five levels: (i) on all experiments we studied, it has a lower prior error rate; (ii) it is robust to the size and choice of summary statistics, as RF can handle many superfluous statistics with no impact on the performance rates (which mostly depend on the intrinsic dimension of the classification problem (27, 28), a characteristic confirmed by our results); (iii) the computing effort is considerably reduced, as RF requires a much smaller reference table compared with alternatives (i.e., a few thousand versus hundreds of thousands to billions of simulations); (iv) the method is associated with an embedded and free error evaluation which assesses the reliability of the ABC-RF analysis; and (v) RF can be easily and cheaply calibrated (with no further simulations) from the reference table via the reliable out-of-bag error. As a consequence, ABC-RF allows for a more robust handling of the degree of uncertainty in the choice between models, possibly in contrast with earlier and over-optimistic assessments. Due to a massive gain in computing and simulation efforts, ABC-RF will undoubtedly extend the range and complexity of datasets (e.g., number of markers in population genetics) and models handled by ABC. Once a given model has been chosen and confidence evaluated by ABC-RF, it becomes possible to estimate parameter distributions under this (single) model using standard ABC techniques (e.g., 35) or alternative methods such as those proposed by (36).

ACKNOWLEDGMENTS. The use of random forests was suggested to JMM and CPR by Bin Yu during a visit at CREST, Paris, in 2013. We are grateful to our colleagues at CBGP for their feedback and support, to the Department of Statistics at Warwick for its hospitality, and to G. Biau for his help about the asymptotics of random forests. Some parts of the research were conducted at BIRS, Banff, Canada, and the authors (PP and CPR) took advantage of this congenial research environment. The authors also acknowledge the independent research conducted on classification tools for ABC by M. Gutmann, R. Dutta, S. Kaski, and J. Corander.
References

1. Tavaré S, Balding D, Griffith R, Donnelly P (1997) Inferring coalescence times from DNA sequence data. Genetics 145:505-518.
2. Pritchard J, Seielstad M, Perez-Lezaun A, Feldman M (1999) Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16:1791-1798.
3. Beaumont M, Zhang W, Balding D (2002) Approximate Bayesian computation in population genetics. Genetics 162:2025-2035.
4. Beaumont M (2008) in Simulations, Genetics and Human Prehistory, eds Matsumura S, Forster P, Renfrew C (McDonald Institute Monographs, McDonald Institute for Archaeological Research, Cambridge), pp 134-154.
5. Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf M (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface 6:187-202.
6. Beaumont M (2010) Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics 41:379-406.
7. Csilléry K, Blum M, Gaggiotti O, François O (2010) Approximate Bayesian computation (ABC) in practice. Trends in Ecology and Evolution 25:410-418.
8. Marin J, Pudlo P, Robert C, Ryder R (2011) Approximate Bayesian computational methods. Statistics and Computing pp 1-14.
9. Blum M, Nunes M, Prangle D, Sisson S (2013) A comparative review of dimension reduction methods in Approximate Bayesian Computation. Stat Sci 28:189-208.
10. Didelot X, Everitt R, Johansen A, Lawson D (2011) Likelihood-free estimation of model evidence. Bayesian Analysis 6:48-76.
11. Robert C, Cornuet JM, Marin JM, Pillai N (2011) Lack of confidence in ABC model choice. Proceedings of the National Academy of Sciences 108(37):15112-15117.
12. Marin J, Pillai N, Robert C, Rousseau J (2014) Relevant statistics for Bayesian model choice. J Roy Stat Soc B (to appear).
13. Breiman L (2001) Random forests. Machine Learning 45:5-32.
14. Berger J (1985) Statistical Decision Theory and Bayesian Analysis (Springer-Verlag, New York), second edition.
15. Robert C (2001) The Bayesian Choice (Springer-Verlag, New York), second edition.
16. Grelaud A, Marin JM, Robert C, Rodolphe F, Tally F (2009) Likelihood-free methods for model choice in Gibbs random fields. Bayesian Analysis 3(2):427-442.
17. Biau G, Cérou F, Guyader A (2014) New insights into Approximate Bayesian Computation. Annales de l'IHP (Probability and Statistics).
18. Stoehr J, Pudlo P, Cucala L (2014) Adaptive ABC model choice and geometric summary statistics for hidden Gibbs random fields. Statistics and Computing pp 1-13.
19. Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics (Springer-Verlag, New York), second edition.
20. Freund Y, Schapire RE, et al. (1996) Experiments with a new boosting algorithm. Vol. 96, pp 148-156.
21. Aeschbacher S, Beaumont MA, Futschik A (2012) A novel approach for choosing summary statistics in Approximate Bayesian Computation. Genetics 192:1027-1047.
22. Prangle D, Blum MGB, Popovic G, Sisson SA (2013) Diagnostic tools of approximate Bayesian computation using the coverage property. ArXiv e-prints.
23. Estoup A, et al. (2012) Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics. Molecular Ecology Resources 12:846-855.
24. Devroye L, Györfi L, Lugosi G (1996) A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York), Vol. 31 (Springer-Verlag, New York), pp xvi+636.
25. Breiman L (1996) Bagging predictors. Mach Learn 24:123-140.
26. Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and Regression Trees (CRC Press).
27. Biau G (2012) Analysis of a random forest model. Journal of Machine Learning Research 13:1063-1095.
28. Scornet E, Biau G, Vert JP (2014) Consistency of random forests. Technical Report, arXiv:1405.2881.
29. Cornuet JM, et al. (2014) DIYABC v2.0: a software to make Approximate Bayesian Computation inferences about population history using Single Nucleotide Polymorphism, DNA sequence and microsatellite data. Bioinformatics (to appear).
30. Cleveland W (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74:829-836.
31. Lombaert E, Guillemaud T, Thomas C, et al. (2011) Inferring the origin of populations introduced from a genetically structured native range by Approximate Bayesian Computation: case study of the invasive ladybird Harmonia axyridis. Molecular Ecology 20:4654-4670.
32. Choisy M, Franck P, Cornuet JM (2004) Estimating admixture proportions with microsatellites: comparison of methods based on simulated data. Mol Ecol 13:955-968.
33. 1000 Genomes Project Consortium, Abecasis G, Auton A, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56-65.
34. Bertorelle G, Benazzo A, Mona S (2010) ABC as a flexible framework to estimate demography over space and time: some cons, many pros. Mol Ecol 19:2609-2625.
35. Beaumont M, Zhang W, Balding D (2002) Approximate Bayesian computation in population genetics. Genetics 162:2025-2035.
36. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa V, Foll M (2013) Robust demographic inference from genomic and SNP data. PLoS Genet p e1003905.
Reliable ABC model choice via random forests: Supporting Information

Pierre Pudlo, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, and Christian P. Robert

Université de Montpellier 2, I3M, Montpellier, France; Institut de Biologie Computationnelle (IBC), Montpellier, France; CBGP, INRA, Montpellier, France; Université Paris Dauphine, CEREMADE, Paris, France; and University of Warwick, Coventry, UK

Table of contents
1. Classification methods
2. A revealing toy example: MA(1) versus MA(2) models
3. Examples based on controlled simulated population genetic datasets
4. Supplementary information about the Harlequin ladybird example
5. Supplementary information about the Human population example
6. Computer software and codes
7. Summary statistics available in the DIYABC software

1. Classification methods

Classification methods aim at forecasting a variable Y that takes values in a finite set, e.g. {1, ..., M}, based on a predicting vector of covariates X = (X1, ..., Xd) of dimension d. They are fitted with a training database (xi, yi) of independent replicates of the pair (X, Y). We exploit such classifiers in ABC model choice by predicting a model index (Y) from the observation of summary statistics on the data (X). The classifiers are trained with numerous simulations from the hierarchical Bayes model that constitute the ABC reference table. For a more detailed entry on classification, we refer the reader to the entry (1) and to the more theoretical (2).
Standard classifiers. Discriminant analysis covers a first family of classifiers including linear discriminant analysis (LDA) and naive Bayes. Those classifiers rely on a full likelihood function corresponding to the joint distribution of (X, Y), specified by the marginal probabilities of Y and the conditional density f(x|y) of X given Y = y. Classification follows by ordering the probabilities Pr(Y = y | X = x). For instance, linear discriminant analysis assumes that each conditional distribution of X is a multivariate Gaussian distribution with unknown mean and covariance matrix, where the covariance matrix is assumed to be constant across classes. These parameters are fitted on a training database by maximum likelihood; see e.g. Chapter 4 of (1). This classification method is quite popular as it provides a linear projection of the covariates on a space of dimension M - 1, called the LDA axes, which separates the classes as much as possible. Similarly, naive Bayes assumes that each density f(x|y), y = 1, ..., M, is a product of marginal densities. Despite this rather strong assumption of conditional independence of the components of X, naive Bayes often produces good classification results. Note that one can assume that the marginals are univariate Gaussians and fit those by maximum likelihood estimation, or else resort to a nonparametric kernel density estimator to recover these marginal densities when the training database is large enough.

Logistic and multinomial regressions use a conditional likelihood based on a modeling of Pr(Y = y | X = x), as special cases of a generalized linear model. Modulo a logit transform log{p/(1 - p)}, this model assumes a linear dependency on the covariates; see e.g. Chapter 4 in (1). Logistic regression results rarely differ from LDA estimates since the decision boundaries are also linear. The sole difference stands with the procedure used to fit the classifiers.
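For illustration, a compact sketch of these standard classifiers written with scikit-learn (the paper itself does not prescribe an implementation, and x_train, y_train denote a generic NumPy training database): LDA exposes both the ordered class probabilities and the projection on the at most M - 1 LDA axes used elsewhere as extra summaries, while naive Bayes and multinomial logistic regression are fitted on the same data.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def fit_standard_classifiers(x_train, y_train):
    # Full-likelihood classifiers: shared-covariance Gaussian (LDA) and
    # conditionally independent Gaussian marginals (naive Bayes).
    lda = LinearDiscriminantAnalysis().fit(x_train, y_train)
    nb = GaussianNB().fit(x_train, y_train)
    # Conditional-likelihood classifier: (multinomial) logistic regression.
    logit = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    return lda, nb, logit

# lda.transform(x) gives the coordinates on the LDA axes;
# each fitted classifier orders the classes through predict_proba(x).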
  • 116. ers. Local methods. k-nearest neighbor (k-nn) classi
  • 117. ers require no model
  • 118. tting but mere computations on the training database. More precisely, it builds upon a distance on the feature space, X 3 X. In order to make a classi
  • 119. cation when X = x, k-nn derives the k training points that are the closest in distance to x and classi
  • 120. es this new datapoint x according to a major-ity vote among the classes of the k neighbors. The accuracy of k-nn heavily depends on the tuning of k, which should be calibrated, as explained below. Local logistic (or multinomial) regression adds a linear re-gression layer to these procedures and dates back to (3). In order to make a decision at X = x, given the k nearest neigh-bors in the feature space, one weights them by a smoothing kernel (e.g., the Epanechnikov kernel) and a multinomial clas-si
  • 121. er is then
  • 122. tted on this weighted sub-sample of the training database. More details on this procedure can be found in (4). Likewise, the accuracy of the classi
Random forest construction. RF aggregates decision trees built with a slight modification of the CART algorithm (5). The latter procedure produces a binary tree that sets rules as labels of the internal nodes and predictions of $Y$ as labels of the tips (terminal nodes). At a given internal node, the rule is of the form $X_j < t$, which determines a left-hand branch rising from that vertex and a right-hand branch corresponding to $X_j \geq t$. To predict the value of $Y$ when $X = x$ from this tree means following a path from the root by applying these binary rules and returning the label of the tip at the end of the path.

The randomized CART algorithm used to create the trees in the forest recursively infers the internal and terminal labels of each tree from the root on a training database $(x_i, y_i)$ as follows. Given a tree built down to a node $v$, daughter nodes $v_1$ and $v_2$ are determined by partitioning the data remaining at $v$ in a way highly correlated with the outcome $Y$. Practically, this means minimizing an empirical divergence criterion (the sum of impurities of the resulting nodes $v_1$ and $v_2$) towards selecting the most discriminating covariate $X_j$, among a random subset of the covariates of size $n_{\text{try}}$, and the best threshold $t$. Assuming $\hat p(v, y)$ denotes the relative frequency of $y$ among the part of the learning database that falls at node $v$, and $N(v)$ the size of this part of the database, the Gini criterion we minimize is $N(v_1)Q(v_1) + N(v_2)Q(v_2)$, where
$$Q(v_i) = \sum_{y=1}^{M} \hat p(v_i, y)\,\{1 - \hat p(v_i, y)\}.$$
(See Chapter 9 in (1) for criteria other than the Gini index above.) The recursive algorithm stops when all terminal nodes $v$ are homogeneous, i.e., $Q(v) = \sum_{y=1}^{M} \hat p(v, y)\{1 - \hat p(v, y)\} = 0$, and the label of the tip $v$ is the only value of $y$ for which $\hat p(v, y) = 1$. This leads to Algorithm S1, whose decision boundaries are noisy but approximately unbiased.

The RF algorithm aggregates randomized CART trees trained on bootstrap sub-samples of size $N_{\text{boot}}$ from the original training database (i.e., the reference table in our context). The prediction at a new covariate value $X = x$ is the most frequent response predicted by the trees in the forest. Three tuning parameters have to be calibrated: the number $B$ of trees in the forest, the number $n_{\text{try}}$ of covariates that are sampled at each node by the randomized CART, and the size $N_{\text{boot}}$ of the bootstrap sub-sample. Following (6), if $d$ is the total number of predictors, the default number of covariates $n_{\text{try}}$ is $\sqrt{d}$ and the default $N_{\text{boot}}$ is the size of the original training database. The out-of-bag error is the average number of times an observation from the training database is misclassified by trees trained on bootstrap samples that do not include this observation, and it is instrumental in tuning the above parameters.
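To make the split criterion used in Algorithm S1 below concrete, here is a minimal NumPy sketch (ours, not the paper's code) of the Gini impurity $Q(v)$ and of the threshold search for a single candidate covariate at a node; an actual CART implementation would be more careful and far more efficient.

import numpy as np

def gini(labels, n_classes):
    """Q(v) = sum_y p_hat(v, y) * (1 - p_hat(v, y)) over the labels falling at node v."""
    p = np.bincount(labels, minlength=n_classes + 1)[1:] / labels.size
    return float(np.sum(p * (1.0 - p)))

def best_split(x_col, labels, n_classes):
    """Threshold t minimizing N(v1) Q(v1) + N(v2) Q(v2) for the rule x_col < t."""
    best_t, best_cost = None, np.inf
    for t in np.unique(x_col)[1:]:          # candidate thresholds between observed values
        left, right = labels[x_col < t], labels[x_col >= t]
        cost = left.size * gini(left, n_classes) + right.size * gini(right, n_classes)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Tiny usage example on artificial data (labels in {1, 2, 3}).
rng = np.random.default_rng(0)
labels = rng.integers(1, 4, size=500)
x_col = rng.normal(size=500) + 0.5 * labels
print(best_split(x_col, labels, n_classes=3))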
Algorithm S1 Randomized CART
  start the tree with a single root
  repeat
    pick a non-homogeneous tip $v$, i.e., a tip such that $Q(v) \neq 0$
    attach to $v$ two daughter nodes $v_1$ and $v_2$
    draw a random subset of covariates of size $n_{\text{try}}$
    for all covariates $X_j$ in the random subset do
      find the threshold $t_j$ in the rule $X_j < t_j$ that minimizes $N(v_1)Q(v_1) + N(v_2)Q(v_2)$
    end for
    find the rule $X_j < t_j$ that minimizes $N(v_1)Q(v_1) + N(v_2)Q(v_2)$ in $j$ and set this best rule at node $v$
  until all tips $v$ are homogeneous ($Q(v) = 0$)
  set the labels of all tips

Algorithm S2 RF for classification
  for $b = 1$ to $B$ do
    draw a bootstrap sub-sample $Z$ of size $N_{\text{boot}}$ from the training database
    grow a tree $T_b$ trained on $Z$ with Algorithm S1
  end for
  output the ensemble of trees $\{T_b;\ b = 1, \ldots, B\}$

Notice that the frequencies of the responses predicted by the trees of Algorithm S2 do not reflect any posterior-related quantities and thus should not be returned to the user. Indeed, if it is fairly easy to reach the decision $y$ at covariate value $X = x$, almost all trees will produce the same prediction $y$ and the frequency of this class $y$ will be much higher than $\Pr(Y = y \mid X = x)$.
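A compact sketch of the aggregation step of Algorithm S2 is given below, using scikit-learn's DecisionTreeClassifier (with max_features standing in for $n_{\text{try}}$) as a proxy for the randomized CART of Algorithm S1; the toy data and tuning values are our assumptions, and, as stressed above, the vote frequencies across trees must not be read as posterior probabilities.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy reference table: model indices y in {1, ..., M} and d summary statistics per row.
n_ref, d, M = 10_000, 6, 3
y = rng.integers(1, M + 1, size=n_ref)
x = rng.normal(size=(n_ref, d)) + 0.5 * y[:, None]

B, n_boot, n_try = 100, n_ref, int(np.sqrt(d))
trees = []
for _ in range(B):
    rows = rng.integers(0, n_ref, size=n_boot)        # bootstrap sub-sample of the reference table
    # A fully grown tree with n_try random covariates considered per split plays
    # the role of the randomized CART of Algorithm S1.
    trees.append(DecisionTreeClassifier(max_features=n_try).fit(x[rows], y[rows]))

x_obs = x[:1]                                          # stand-in for the observed summaries
votes = np.array([tree.predict(x_obs)[0] for tree in trees])
prediction = np.bincount(votes).argmax()               # majority vote; vote frequencies are not posteriors
print("predicted model index:", prediction)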
The way we build a RF classifier given a collection of statistical models is to start from an ABC reference table including a set of simulation records made of model indices, parameter values and summary statistics for the associated simulated data. This table then serves as the training database for a RF that forecasts the model index based on the summary statistics. Once more, we stress that the frequency of each model amid the tree predictions does not reflect any posterior probability. We therefore propose the computation of a posterior error rate (see main text) that renders a reliable and fully Bayesian error evaluation.
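In practice one would rather rely on an off-the-shelf RF implementation. The sketch below uses scikit-learn's RandomForestClassifier on the same artificial reference table, with the default tuning described above ($n_{\text{try}} = \sqrt{d}$ and bootstrap samples of the size of the training database) and the out-of-bag error kept as a by-product; the paper's own analyses are not based on this code.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Artificial reference table standing in for the ABC output: model index + summaries.
n_ref, d, M = 10_000, 6, 3
y = rng.integers(1, M + 1, size=n_ref)
x = rng.normal(size=(n_ref, d)) + 0.5 * y[:, None]

# Defaults matching the text: ntry = sqrt(d) covariates per node and bootstrap
# samples of the size of the training database; oob_score returns the out-of-bag accuracy.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            bootstrap=True, oob_score=True).fit(x, y)

x_obs = x[:1]                                   # the observed summary statistics would go here
print("selected model:", rf.predict(x_obs)[0])
print("out-of-bag prior error rate:", 1.0 - rf.oob_score_)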
Calibration of the tuning parameters. Many machine learning algorithms involve tuning parameters that need to be determined carefully in order to obtain good results (in terms of what is called the prior error rate in the main text). Usually, the predictive performances (averaged over the prior in our context) of classifiers are evaluated on new data (validation procedures) or fake new data (cross-validation procedures); see, e.g., Chapter 7 of (1). This is the standard way to compare the performances of various possible values of the tuning parameters and thus calibrate these parameters.

For instance, the value of $k$ for both $k$-nn and local logistic regression, as well as $N_{\text{boot}}$ for RF, need to be calibrated. But, while $k$-nn performances heavily depend on the value of $k$, the results of RF are rather stable over a large range of values of $N_{\text{boot}}$, as illustrated on Fig. S1. The plots in this figure display an empirical evaluation of the prior error rates of the classifiers against different values of their tuning parameter, with a validation sample made of a fresh set of $10^4$ simulations from the hierarchical Bayesian model. Because of the moderate Monte Carlo noise within the empirical error, we first smooth out the curve before determining the calibration of the algorithms. Fig. S1 displays this derivation for the ABC analysis of the Harlequin ladybird data with machine learning tools. The last case is quite characteristic of the plateau structure of errors in RFs.

The validation procedure described above requires new simulations from the hierarchical Bayesian model, which we can always produce because of the very nature of ABC. But such simulations might be computationally intensive when analyzing large datasets or complex models. The cross-validation procedure is an alternative (which we do not present here), while RF offers a separate evaluation procedure: it takes advantage of the fact that bootstrap samples do not contain the whole reference table, the leftovers being available for testing. The resulting evaluation of the prior error rate is the out-of-bag estimator; see, e.g., Chapter 15 of (1). Calibration for other classifiers involves new prior simulations or a computationally heavy cross-validation approximation of the error. Moreover, calibrating local logistic regression may prove computationally unfeasible since, for each dataset of the validation sample (the second reference table), the procedure involves searching for nearest neighbors in the (first) reference table, then fitting a weighted logistic regression on those neighbors.
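The sketch below illustrates the validation-based calibration on toy simulations: the prior error rate of $k$-nn is estimated on a fresh validation sample for a grid of $k$ values. In practice these noisy estimates would then be smoothed, as in Fig. S1, and RF would instead rely on its out-of-bag error; all numbers below are placeholders.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def toy_table(n, d=6, M=3):
    """Stand-in for n fresh simulations from the hierarchical Bayesian model."""
    y = rng.integers(1, M + 1, size=n)
    return rng.normal(size=(n, d)) + 0.5 * y[:, None], y

x_train, y_train = toy_table(10_000)    # reference table used to train the classifier
x_val, y_val = toy_table(10_000)        # fresh validation sample drawn from the prior

# Empirical prior error rate of k-nn over a grid of candidate k values.
for k in (50, 100, 300, 1000, 3000):
    knn = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    print(f"k = {k:5d}   prior error rate = {1.0 - knn.score(x_val, y_val):.3f}")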
[Fig. S1 about here: three panels plotting the prior error rate against $k$ ($k$-nn, top), $k$ (local logistic regression, middle) and $N_{\text{boot}}$ (RF, bottom).]

Fig. S1. Calibration of $k$-nn, the local logistic regression, and RF. Plot of the empirical prior error rate (in black) of three classifiers, namely $k$-nn (top), the local logistic regression (middle) and RF (bottom), as a function of their tuning parameter ($k$ for the first two methods, $N_{\text{boot}}$ for RF), when analyzing the Harlequin ladybird data with a reference table of 10,000 simulations (top and middle) or 50,000 simulations (bottom). To remove the noise of these estimated errors, computed on a validation set composed of 10,000 independent simulations, the errors are smoothed by a spline method that produces the red curve. The optimal values of the parameters are $k = 300$, $k = 3{,}000$ and $N_{\text{boot}} = 40{,}000$, respectively.

2. A revealing toy example: MA(1) versus MA(2) models
Given a time series $(x_t)$ of length $T = 100$, we compare fits by moving average models of order either 1 or 2, MA(1) and MA(2), namely
$$x_t = \epsilon_t - \theta_1 \epsilon_{t-1} \quad\text{and}\quad x_t = \epsilon_t - \theta_1 \epsilon_{t-1} - \theta_2 \epsilon_{t-2}\,, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)\,,$$
respectively. As previously suggested (7), a possible set of (insufficient) summary statistics is made of the first two (or higher) autocorrelations, a set that yields an ABC reference table of size $N_{\text{ref}} = 10^4$ with two covariates, displayed on Fig. S2. For both models, the priors are uniform distributions on the stationarity domains (8):
- for MA(1), the single parameter $\theta_1$ is drawn uniformly from the segment $(-1, 1)$;
- for MA(2), the pair $(\theta_1, \theta_2)$ is drawn uniformly over the triangle defined by $-2 < \theta_1 < 2$, $\theta_1 + \theta_2 > -1$, $\theta_1 - \theta_2 < 1$.

In this example, we can evaluate the discrepancy between the true posterior probabilities and those based on summaries. The true marginal likelihoods can be computed by numerical integrations of dimension 1 and 2, respectively, while the posterior probabilities based on the summary statistics are derived from the ABC reference table by a kernel density estimation. Fig. 1 of the main text shows how different the (estimated) posterior probabilities are when based on (i) the whole series of length $T = 100$ and (ii) only the summary statistics, even though the latter remain informative about the problem. This graph induces us to caution as to the degree of approximation provided by ABC about the true posterior probabilities, and it brings numerical support to the severe warnings of (9).

[Fig. S2 about here: scatter plot of the ABC reference table, i.e., the first two autocorrelations of the $10^4$ simulated series under the MA(1) and MA(2) models.]
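For completeness, here is a sketch of how such a reference table could be simulated. It assumes equal prior weights on the two models and fixes $\sigma = 1$ (neither is restated in this excerpt), closes the MA(2) triangle with the standard bound $\theta_2 < 1$, and uses empirical autocorrelations at lags 1 and 2 as summaries; it is an illustration, not the code behind Fig. S2.

import numpy as np

rng = np.random.default_rng(0)
T, n_ref = 100, 10_000

def simulate_ma(theta):
    """Simulate x_t = eps_t - theta_1 eps_{t-1} - ... for a MA(q) model, with sigma fixed to 1."""
    q = len(theta)
    eps = rng.normal(size=T + q)
    return eps[q:] - sum(theta[i] * eps[q - 1 - i:T + q - 1 - i] for i in range(q))

def autocorr(x, lag):
    """Empirical autocorrelation at the given lag, used here as a summary statistic."""
    x = x - x.mean()
    return float(np.dot(x[lag:], x[:-lag]) / np.dot(x, x))

summaries, model_index = [], []
for _ in range(n_ref):
    if rng.random() < 0.5:                          # equal prior weights on the two models (assumed)
        theta = np.array([rng.uniform(-1.0, 1.0)])  # MA(1): theta_1 uniform on (-1, 1)
        model_index.append(1)
    else:                                           # MA(2): uniform draw on the triangle by rejection;
        while True:                                 # the bound theta_2 < 1 closes the region
            t1, t2 = rng.uniform(-2.0, 2.0), rng.uniform(-1.0, 1.0)
            if t1 + t2 > -1.0 and t1 - t2 < 1.0:
                break
        theta = np.array([t1, t2])
        model_index.append(2)
    series = simulate_ma(theta)
    summaries.append([autocorr(series, 1), autocorr(series, 2)])

reference_table = np.array(summaries)               # N_ref x 2 matrix of covariates (cf. Fig. S2)
model_index = np.array(model_index)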