
Machine learning in the Xenon experiment

Gijs Leguijt

June 22, 2018

Student number 11000279

Assignment Report Bachelor Project Physics and Astronomy

Size 15 EC

Conducted between April 3, 2018 and June 22, 2018

Institute Nikhef

Faculty Faculteit der Natuurwetenschappen, Wiskunde en Informatica

University Universiteit van Amsterdam

Supervisor Prof. dr. P. Decowski

Second Examiner Dr. H. Snoek

Abstract

The XENON1T-experiment tries to make a direct detection of a dark matter particle scattering off a xenon-atom. As the cross-section of this reaction is very low, the chance that a dark matter particle scatters twice is negligible. This thesis evaluates the performance of machine learning algorithms aiming to remove the multiple-scatter signals from the data. The three algorithms used are a support vector machine, a random forest classifier and a multi-layered perceptron. All three algorithms achieved higher accuracy than the current analysis, but as the support vector machine took several times longer to process the large amount of data, only the random forest classifier and the perceptron are recommended for implementation.

Contents

1 Samenvatting
2 Introduction
3 Dark matter
  3.1 Proof for dark matter
  3.2 Constituents of dark matter
4 The XENON1T-detector
  4.1 Current data-analysis
    4.1.1 Largest S2
    4.1.2 Largest other S2
    4.1.3 50%-width
    4.1.4 Goodness of fit
5 Machine learning
  5.1 Overfitting
  5.2 Decision tree
  5.3 Random forest classifier
  5.4 Support vector machine
  5.5 Multi-layered perceptron
  5.6 Stochastic gradient descent
  5.7 ROC-curve
6 Results
  6.1 Data preparation
  6.2 Improving current cuts
  6.3 Comparing different algorithms
7 Conclusion


1 Samenvatting

There is almost five times as much dark matter as all normal mass. We know that dark matter exists, and yet we have never seen a dark-matter particle. To detect the particles anyway, large detectors are built, but even with large detectors the chance that a dark-matter particle collides is small: most of the signal is background. Since the chance of a collision is so small, all particles that collide twice in the detector are therefore not dark matter particles.

It is therefore important to filter away all particles that collide multiple times. This is currently done without machine learning, but with machine learning it could be done more efficiently. This thesis tries to improve the current analysis by applying machine learning, using several algorithms. It turns out that all tested algorithms are indeed better than the current analysis method at distinguishing particles that collide once from particles that do so multiple times.


2 Introduction

On March 28, 2018, researchers announced the discovery of a galaxy with a very low concentration of dark matter [1]. This absence, counter-intuitively, was said to prove the existence of dark matter. The details of this paradox will be explained later; for now it is sufficient to know that this claim made it into the news worldwide, which shows that dark matter is a big topic in modern-day physics.

All around the globe, experiments are trying to find dark matter, which would contain over five times as much mass as all matter known to us [2]! But if that much mass is present as dark matter, why has no detector found it yet? The problem is that dark matter hardly interacts with any matter known to us. It can be compared with neutrinos: they also pass through most matter without any interaction, and only very rarely leave a trail in the huge neutrino detectors.

Since no detector has ever found a dark matter particle, there are still multiple theories about what its individual constituents could be. Currently, most attention is given to the WIMP-model (Weakly Interacting Massive Particle) [3], which is also the target of the XENON1T-detector, situated in a mine in Italy. XENON1T consists of roughly 1000 kg of liquid xenon [4] in which all interactions are closely monitored. Although a lot of work has been put into reducing background noise, to the point that this detector has the lowest background noise in its category [4], most of the signal still consists of interactions that did not include any dark matter particle. As a result, a lot of data needs to be filtered, in particular events in which a particle scattered multiple times within the detector. Since the cross-section of dark matter particles is very low, of the order of < 10⁻⁴⁰ cm², where the precise value of the upper bound depends on the WIMP-mass [5], the chance that a dark matter particle scatters multiple times within the detector is negligible. Discriminating between single and double scatters is currently done using cuts in 2D-parameter spaces [6].

Of course dark matter is not the only field that has received major attention: machine learning is one of the other hot topics in science. One can occasionally find a self-driving car on the road in some countries [7], the world's best Go player was beaten by a computer [8], and, maybe less known, machine learning is making its way into big research experiments as well [9]. Whereas the human mind is not naturally capable of working in more than three dimensions, there is no such limitation for computers. A machine can operate in high-dimensional spaces, seeking correlations between many variables at once. As a result, letting a computer find its own way of splitting the data may yield better results than telling the computer where the splits have to be made [10].

So on the one hand there is a big detector searching for dark matter collisions in its data, which is like trying to find a needle in a haystack, while on the other there is a powerful data-analysis tool that has classification as one of its main strengths. The aim of this thesis is to merge the two sides, trying to improve the current splitting of the data, such that XENON1T becomes even more sensitive.

More specifically, the goal of this work is twofold. First, the effect of adding machine learning to the analysis will be evaluated. Subsequently, three different learning algorithms will be compared with each other: the Random Forest Classifier (RFC), the Support Vector Machine (SVM) and the Multi-Layered Perceptron (MLP).

3 Dark matter

3.1 Proof for dark matter

As stated in the introduction, there are multiple experiments searching for dark matter. But if no individual particle has been found yet, what is the motivation for these searches? Strong hints for its existence are found in the rotation curves of galaxies [11]. The speed of orbiting bodies is governed by the mass enclosed in the orbit and the distance to the centre of mass. For a typical galaxy one would predict the lower curve in figure 1 when only considering the visible matter. However, observations yield the upper curve, which differs significantly from the expectation.


Figure 1: From [11]. The upper line shows the measured rotation curve of the galaxy NGC 3198. The two lower curves represent the contributions to the curve from the disk and from the dark matter halo. In the absence of the halo, the observed plateau would not occur, and the velocity of stars would decrease with increasing distance to the centre, after an initial increase up to roughly 5 kpc due to an accumulation of mass near the centre.


One way of explaining these high velocities far from the centre of the galaxy is to introduce additional mass in the galaxy. Since only the effect of its gravity is visible, this additional mass is called dark matter. Looking at clusters of galaxies instead of individual systems, one would need even more dark matter to explain the high velocities encountered in these clusters.

Independently of the rotation curves, the existence of dark matter can also be deduced from the Cosmic Microwave Background (CMB) [12]. The shape of the CMB spectrum has been measured with great accuracy by increasingly sensitive satellites [13]. Inhomogeneities in the spectrum result from a balance between gravity and photon pressure [12]. Since dark matter hardly couples to photons, it creates gravitational potential wells instead of starting to oscillate due to radiation pressure, which leaves a specific signature on the CMB. Current theories on the history of the universe can successfully reproduce the CMB spectrum, and the resulting bound on the amount of dark matter is consistent with the galaxy rotation curves [14].

Although there are multiple observations that can be explained by dark matter, they are all related to gravitational effects. Therefore, there have been attempts to explain the observations with new theories of gravity acting on large scales [15]. This is where the discovery of the dark-matter-less galaxy comes into play. Because this galaxy shows the expected rotation curve, due to the absence of dark matter, theories with different gravitational laws face difficulties trying to explain both the galaxies with dark matter and those without.

Another discovery in favour of the dark matter theories is the Bullet Cluster [16], see figure 2, in which two galaxy clusters collide. In the absence of dark matter, most mass would be contained in the interstellar gas. When the clusters collided, the stars were hardly affected and moved through, while the gas was slowed down much more strongly. As a result, most of the mass, and thus most of the gravitational effect, should have separated from the stars. Only if there is something else that carries most of the mass in the cluster and hardly interacts would most of the gravity stay co-moving with the stars, which is indeed the case in figure 2. The fact that the gas does not contain most of the mass is a strong argument for the existence of dark matter.


[Figure 2 image credit: Chandra X-ray Observatory (CXC), cluster 1E 0657-56.]

Figure 2: From [16]. The Bullet Cluster is powerful evidence for the existence of dark matter. As the two galaxy clusters collided, the individual galaxies moved through due to the vast distances between them. However, the interstellar gas clouds slowed down, as their higher density resulted in more collisions. Since most ordinary mass is contained in the gas, one would expect that most of the gravity would have stayed with the gas and separated from the galaxies. Observations show otherwise: the gas clouds, shown in red, got separated from the stars, while the gravitational effect, shown in blue, did not slow down. Therefore, something else must carry most of the mass, something that hardly interacts, such that it would not slow down. Dark matter fits this observation perfectly.

3.2 Constituents of dark matter

Since no dark matter particle exists in the standard model, modifications or new theories have to be developed to describe these particles. The dark matter candidates are generally divided into two categories, MAssive Compact Halo Objects (MACHOs) and Weakly Interacting Massive Particles (WIMPs). The first category is partly occupied by less exotic objects, like black holes and neutron stars, while there is still a lot of speculation on what particles make up the WIMPs.

Since the discovery of the neutrino's mass, neutrinos have been proposed as dark matter candidates [17]. However, neutrinos alone are not sufficient to explain the observed effects of dark matter [18]. Supersymmetry is beyond the scope of this thesis; however, it produces a whole zoo of possible WIMPs, such as neutralinos, sneutrinos and gravitinos [14]. With WIMPs considered the most promising dark matter category, they form the focus of the XENON-experiment.


4 The XENON1T-detector

The XENON1T-detector is the third in a series of xenon experiments and the predecessor of XENONnT, which will be an even bigger detector [19]. It aims to find dark matter by detecting the light produced by a WIMP colliding with a xenon atom. To this end the detector contains a target mass of 1000 kg of xenon, partly in a gaseous state and partly liquid. The light is recorded by PhotoMultiplier Tubes (PMTs) situated both at the top and bottom of the detector. The array of 248 PMTs can reconstruct the (x, y)-position of the collision by analysing the yield of the individual PMTs. In principle the difference in light travel time could be used to retrieve the z-position, but the speed of light in comparison to the size of the experiment makes this practically impossible. Instead, the z-position is reconstructed using the fact that both liquid and gaseous xenon are present.

In a collision, a xenon atom is ionised, and in response it forms an excited Xe2-molecule. The individual xenon atoms are transparent to the light that is emitted when the molecule de-excites, and this light is the first signal seen from a collision (S1). The electrons that resulted from the ionization are accelerated upwards by an electric field into the gaseous xenon, where they receive sufficient energy to ionise the gas, which creates a second signal (S2). The higher in the detector the collision took place, the smaller the distance between the interaction site and the gaseous phase, resulting in a smaller time interval between the S1 and the S2. Therefore, the z-position can be reconstructed by comparing the arrival times of both signals, as shown in the left panel of figure 3; the right hand side shows the XENON1T-detector itself.
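As a minimal sketch, the z-reconstruction described above is a single multiplication, assuming the drift velocity of 1.44 mm/µs quoted later for the width cut; the function name and interface are illustrative, not part of the actual XENON analysis code:

```python
def depth_from_drift_time(t_s1_us, t_s2_us, v_drift_mm_per_us=1.44):
    """Depth of the interaction below the liquid-gas interface (mm),
    reconstructed from the S1-S2 arrival-time difference (in microseconds)."""
    drift_time = t_s2_us - t_s1_us   # electrons drift upwards at constant speed
    return drift_time * v_drift_mm_per_us
```

For example, an S2 arriving 500 µs after the S1 would place the collision roughly 720 mm below the liquid-gas interface.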

Figure 3: From [20]. A view into the XENON1T-detector, where the left panel shows the creation of a signal in the detector. As a particle moves through the liquid xenon, it collides with an atom, resulting in an ionization. Upon de-excitation, light is emitted, which is recorded as the S1-signal. Due to the collision, there are free electrons in the liquid xenon that get accelerated upwards by an electric field. When the electrons reach the gaseous xenon, they have enough energy to ionise the xenon, resulting in a second signal, the S2. The right panel shows a more detailed view of the detector.


Apart from being transparent to the emitted light, xenon also has a large cross-section, which enlarges the probability of finding dark matter. This also makes it useful as a screening substance: in addition to the 1 t used for detection, there is over 2 t of xenon used solely for shielding purposes. To further reduce background signals, the detector is placed in a big water tank in the Laboratori Nazionali del Gran Sasso, which is situated in a mine in Italy [4].

4.1 Current data-analysis

As mentioned earlier, the shielding with xenon, water and 3600 m of rock is not sufficient to remove all background noise. For example, even though all components of the detector have been selected for low radioactivity, there are still neutrons and electrons in the detector that originate from decaying isotopes. Neutrons that scatter only once in the detector are hard to distinguish from WIMPs; however, neutrons are likely to scatter multiple times before leaving the detector. To distinguish between multiple- and single-scatter events, cuts are made on the following parameters (or features, in machine-learning jargon): the 'largest S2', 'largest other S2', '50% width' and the 'Goodness of Fit' (GoF) [6].

4.1.1 Largest S2

The S2 signal originates from the accelerated electrons that ionised the gaseous xenon. Its size depends both on the energy of the incident particle and on the z-coordinate of the collision. A higher-energy collision within the liquid xenon results in a larger number of liberated electrons, and therefore more ionization in the gaseous phase. Because the electrons are accelerated on their way up, their ability to ionise the gas is affected by the z-coordinate of the collision: collisions lower in the detector allow for more acceleration and result in higher S2 peaks. The largest S2 signal is the highest peak, labelled S2, that is recorded by the PMTs during the event. It is used as the reference against which the other features are plotted.

4.1.2 Largest other S2

If a particle were to scatter twice in the detector, the second collision would be less energetic because of the energy lost in the first. Therefore, the second largest S2 peak can be coupled to the second collision. Even in single-scatter events a second S2 can be measured, because of noise or electrons delayed by collisions; however, these accidental S2's are in general smaller than S2's originating from a second collision.

Currently this cut is implemented as [6]:

LOS2 < 0.00832 · S2 + 72.3, (1)


where ‘largest other S2’ has been abbreviated as LOS2 and S2 corresponds to the largest S2

signal.
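As a sketch, the cut of eq. (1) reduces to a simple boolean test per event; the function and variable names are illustrative, not taken from the actual analysis framework:

```python
def passes_los2_cut(s2, largest_other_s2):
    """True if the event survives the 'largest other S2' cut of eq. (1),
    i.e. the second S2 is small enough to be consistent with a single scatter."""
    return largest_other_s2 < 0.00832 * s2 + 72.3
```

For a largest S2 of 10000, the cut line lies at 0.00832 · 10000 + 72.3 = 155.5, so a 'largest other S2' of 100 survives while 200 is removed.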

Figure 4: The current cut in the (Largest S2, Largest other S2)-parameter space. The cut tries to discriminate between single- and multi-scatter events. Even in single-scatters, a second S2 signal can be observed, due to the delay of some of the electrons; however, in multi-scatters this second signal is often relatively large, allowing discrimination between the two cases. Points that lie above the blue line, eq. (1), are filtered away, as they are not likely to be single-scatter events.

In figure 4 the (Largest S2, Largest other S2)-parameter space is shown, including the

current cut: eq.(1). Data-points above the line are removed, while the events below the line

are kept for further analysis.

4.1.3 50%-width

If the second largest S2 is well separated from the largest S2 it could be identified by the

‘largest other S2’-cut. However, if the time-separation between the two peaks is small enough

for the distributions to overlap, the peaks can be classified as a single S2, see figure 5.

Although the second case in figure 5 is classified as one peak, its shape is different from that of a genuine single peak. The 50%-width depends on the shape of the curve, and can be used to discriminate between single and double-scatters in case 2 of figure 5.


Figure 5: Schematic view of two consecutive S2-signals (signal count versus time). The left panel shows a case that can be identified by the 'largest other S2'-cut; however, this cut fails to classify the right panel. As the time between the two signals is short, there is some overlap between the peaks, causing the detector to see only one peak. As this "one peak" has a different shape than a genuine single peak, this false peak can be identified by its half-width, as the half-width depends on the peak shape.

Another parameter that influences the width of the signal is the height at which the collision occurred. Electrons originating from lower collisions take longer to reach the PMTs at the top of the detector, and subsequently diffuse further apart, creating a broader signal. To disentangle the two effects on the width of the signal, the actual width is divided by the expected width [6],

∆t_expected(t) = √( 4.0325 · D · t / v² + w₀² ), (2)

where D is the diffusion constant in the detector, 22.8 cm²/s, v the drift velocity of the electrons in the detector, 1.44 mm/µs, and w₀ = 348.6 ns an offset that takes into account that the pulse is formed by multiple electron-signals. The dependence on the position in the detector is reflected in the appearance of t, the time between collision and detection of the signal: as the field inside the detector is known, this time translates into a distance covered, and therefore into the vertical position.
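In code, eq. (2) is a direct translation once the constants are put into consistent units; the sketch below converts everything to SI seconds and metres, which is an assumption about unit conventions, not part of the source:

```python
import math

D = 22.8e-4    # diffusion constant, m^2/s  (22.8 cm^2/s)
V = 1.44e3     # electron drift velocity, m/s  (1.44 mm/us)
W0 = 348.6e-9  # width offset, s  (348.6 ns)

def expected_width(t):
    """Expected 50% width (s) of an S2 signal for drift time t (s), eq. (2)."""
    return math.sqrt(4.0325 * D * t / V**2 + W0**2)
```

At t = 0 the width reduces to the offset w₀, and it grows with drift time as the electron cloud diffuses, as described above.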

Figure 6 includes the lines:

∆t50%(t) / ∆t_expected(t) < √( ppf(1 − 10⁻⁵, S2/scg) / (S2/scg − 1) ), (3)

∆t50%(t) / ∆t_expected(t) > √( ppf(10⁻⁵, S2/scg) / (S2/scg − 1) ), (4)

the current cut in this parameter space, with scg a dimensionless constant, 21.3, and ppf the percent-point function. Only points that lie between the curves (3) and (4) stay.
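The acceptance test of eqs. (3) and (4) can be sketched as follows. The source does not name the distribution behind the percent-point function, so it is passed in as a callable here (in practice it could be, e.g., scipy.stats.chi2.ppf); the reading of the equations as ppf divided by S2/scg − 1 under the square root is an assumption about the garbled original layout:

```python
import math

def passes_width_cut(norm_width, s2, ppf, scg=21.3):
    """True if the normalised 50% width (dt_50% / dt_expected) lies between
    the lower bound of eq. (4) and the upper bound of eq. (3)."""
    n = s2 / scg                              # dimensionless S2 scale
    lower = math.sqrt(ppf(1e-5, n) / (n - 1))
    upper = math.sqrt(ppf(1 - 1e-5, n) / (n - 1))
    return lower < norm_width < upper
```

Since single-scatter events are centred around a normalised width of 1, values close to 1 survive while outliers in either direction are removed.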


Figure 6: The 50%-width cut consists of two lines: eqs. (3) and (4). As the width of the signal also depends on the travel time of the electrons, the width is divided by the expected width, eq. (2), so that single-scatter events are centred around a normalised width of 1. Points that differ too much from the expected width, and therefore lie outside the blue lines, are filtered away.

4.1.4 Goodness of fit

In the previous two cuts the peaks were separated in time, the GoF-cut uses the spatial

distribution of the peaks instead. Figure 7 shows the signal of two peaks in the PMT’s

situated at the top of the detector. Because the electrons are accelerated in the vertical

direction, the resulting shapes are roughly circular.

If the signal were a single-scatter, there would be a single circular shape; however, the distortion by the second collision results in a different pattern. To quantify how well a signal resembles a single-scatter, it is compared with simulated patterns, where the differences of all PMTs to the best fit are added. As a result, a higher value of the GoF actually means the pattern did not fit well with the simulated patterns, and "badness of fit" might have been a more accurate name.


Figure 7: From [6]. A top view of the detector, showing the individual PMTs. The event shown is a double-scatter event, as it consists of two separated regions of high signal. The purple cross marks the reconstructed position. To identify single- and multi-scatters, the measured signals are compared to simulated patterns corresponding to the same reconstructed position. The Goodness of Fit (GoF) is defined as the summed difference of each individual PMT-signal to the simulation, meaning a large GoF corresponds to a bad fit.

Figure 8: Whereas the previous cuts were based on time-separation, the current GoF-cut tries to

discriminate between single-scatters and multi-scatters based on space-separation. The yield of all

PMTs is compared to simulated single-scatter patterns, where a better fit corresponds to a lower GoF.

The resulting cut, eq.(5), removes all points that lie above the blue line.


The result of the GoF-cut is plotted in figure 8, with the cut being:

GoF < 0.0404 · S2 + 594 · S2^0.0737 − 686. (5)

Since a lower GoF means a better fit, only the points that lie below the curve are considered.
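Like the other cuts, eq. (5) reduces to a per-event boolean test; as before, the names are illustrative rather than taken from the analysis framework:

```python
def passes_gof_cut(gof, s2):
    """True if the event survives the goodness-of-fit cut of eq. (5):
    a small GoF means the hit pattern resembles a single scatter."""
    return gof < 0.0404 * s2 + 594 * s2 ** 0.0737 - 686
```

For S2 = 10000 the cut line sits near GoF ≈ 889, so events with a noticeably worse (larger) GoF are removed as multi-scatter candidates.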

5 Machine learning

As mentioned in the introduction, machine learning is about letting the computer find the best way to classify the data, instead of imposing how the cuts have to be made. To illustrate the difference with conventional methods, a cartoon example is used. This example is a classification problem: such problems consist of data with features (or variables) that are used to predict the class of a certain event.

If a computer had to discriminate between Superman and Spiderman using conventional methods, it would need characteristics of both. Its input could for example look like table 1, where the values of specific features have been pre-defined. This would also work with continuous variables (like the amount of red in a picture), but for the sake of the argument only binary variables have been listed.

Feature   Superman   Spiderman
Cape        +1          -1
Mouth       +1          -1
Gloves      -1          +1
etc.

Table 1: Cartoon classification problem. The value +1 corresponds to present, while -1 means the specific feature is absent in that class.

The machine learning approach is different in the sense that the feature boundaries have not been imposed: the machine would still be told to look at the presence of a cape, gloves, the amount of red, etc., but the second and third columns of table 1 would be left empty. Instead, a dataset is used to let the algorithm complete the table by itself.
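A minimal sketch of "letting the algorithm complete the table": given labelled examples, each cell of table 1 can be filled with the majority feature value seen for that class. This is a toy illustration of the idea, not one of the algorithms used in this thesis:

```python
from collections import Counter

def learn_table(examples):
    """examples: list of (class_name, {feature: +1 or -1}) pairs.
    Returns {class_name: {feature: majority value}}, i.e. the filled table."""
    observed = {}
    for cls, feats in examples:
        for feature, value in feats.items():
            observed.setdefault(cls, {}).setdefault(feature, []).append(value)
    # majority vote per (class, feature) cell
    return {cls: {f: Counter(vs).most_common(1)[0][0] for f, vs in feats.items()}
            for cls, feats in observed.items()}
```

Feeding it a handful of labelled superhero pictures would reproduce the columns of table 1 without ever writing them down by hand.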

The above situation is an example of supervised learning. Machine learning is also usable for unsupervised problems, in which case the number of columns (classes) would not be specified, or, by the use of deep learning, for problems without a pre-defined number of rows (features). However, in this study the number of classes is known, events are either single-scatters or multi-scatters, and therefore supervised learning is sufficient.

The data set that is used to fill the columns is referred to as the training-set [21]: the algorithm gets access to both the features and the classes of the data-points in order to fit a function in feature-space. Afterwards it can assign classes to new points based on their feature values. To reduce the risk of overfitting, the performance of the trained algorithm is tested on another set, the validation-set. Although both the features and the classes of this set are again specified, only the features are shown to the machine. The outcome is then compared to the known classes of the data-points to test the performance of the algorithm.

To avoid overfitting, under-fitting or some other problem in the classification, parameters can be tuned and the cycle repeated in order to improve performance on the validation-set. However, by repeating these steps, there is a risk of overfitting on the combined training + validation-set. To measure the final performance of the algorithm, it is tested on a test-set, after which the algorithmic parameters cannot be changed without biasing the test. Figure 9 shows the process of optimising the algorithm and the use of the different data-sets.

[Figure 9 flow chart: training-set → validation-set → test-set → final test. The decision boundary is optimised on the training-set and tested on the validation-set; imposed model parameters, e.g. the class weights, are adjusted manually between cycles.]

Figure 9: The process of training an algorithm consists of multiple steps. At first, different parameters

are imposed on the algorithm, which are then used to make the best cut in the training-set. As

these cuts could be subject to overfitting or have other problems, the performance of the algorithm is

evaluated on the validation-set. To achieve better results on the validation-set, the imposed parameters

can be changed, going into a new cycle of training and evaluation. When the desired accuracy on the

validation-set is reached, an independent test is needed to check the performance of the algorithm.

Therefore, it is tested on a new set, the test-set, which gives the final accuracy of the classifier.

The number of points that form each data-set depends on the specific problem, for example on the desired accuracy and the amount of available data. To use the data as efficiently as possible, one should keep the test-set at the minimum size that fulfils the accuracy requirements and use as much data as possible in the training-set. Adding more data to the training-set improves the ability to fit functions in the parameter space, and also lowers the chance of overfitting [21].
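The three-way split described above can be sketched with a seeded shuffle; the fractions are illustrative (in practice a library routine such as scikit-learn's train_test_split does the same job):

```python
import random

def split_data(events, f_train=0.7, f_val=0.15, seed=0):
    """Shuffle events and split them into training-, validation- and test-sets."""
    rng = random.Random(seed)
    shuffled = events[:]            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = int(f_train * len(shuffled))
    n_val = int(f_val * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Fixing the seed makes the split reproducible, which matters when the training/validation cycle of figure 9 is repeated many times.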

5.1 Overfitting

Even though models with higher complexity are able to get better results on the training-set, they are not, generally speaking, better models. This paradox is shown in figure 10.


Figure 10: Example of overfitting. The blue and red points are points on the curve y = 2x + 3 (green line) with added noise. The blue points (training-set) have been used to fit a 10th-order polynomial, which fits them reasonably well. However, the red points (validation-set) do not match the fitted curve at all, so the fit has to be adjusted.

Even though the 10th-order polynomial fits the training-set better than a linear model would have done, it is clear that the model cannot predict unseen data with reasonable accuracy. Had the red points lain within x ∈ [0, 8], the results would have been similar, but less pronounced.
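The situation of figure 10 can be reproduced in a few lines: noisy samples of y = 2x + 3 fitted with a 10th-order polynomial match the training points closely but fail on points outside the fitted range (all numbers here are invented for illustration):

```python
# Minimal overfitting demonstration in the spirit of figure 10.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def truth(x):
    return 2 * x + 3            # the noiseless ground truth

x_train = np.linspace(0, 8, 20)        # "blue" points, used for fitting
x_val = np.linspace(8.5, 10, 4)        # "red" points, unseen by the fit
y_train = truth(x_train) + rng.normal(0, 1, x_train.size)
y_val = truth(x_val) + rng.normal(0, 1, x_val.size)

p = Polynomial.fit(x_train, y_train, 10)   # 10th-order polynomial fit

train_err = np.mean((p(x_train) - y_train) ** 2)
val_err = np.mean((p(x_val) - y_val) ** 2)
print(train_err < val_err)  # True: good on seen data, poor on unseen data
```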


5.2 Decision tree

The algorithms used to find patterns within the training-set are called ‘classifiers’. The first one considered here is the decision tree. A decision tree uses consecutive cuts in feature-space to split the parameter region into rectangular decision regions. Within each decision region, points are assigned to a class by ‘majority vote’: all points get the class that is most abundant in that part of the parameter space [22].

To clarify the decision-tree procedure, weekend-days will be classified against midweek-days based on the average temperature and the number of people that visited an (unspecified) beach that day. Figure 11 shows both the feature-space (left panel) and the decision tree (right panel), including the decision regions.

Figure 11: Example of a classification problem. The goal is to identify weekend- and midweek-days

based on the temperature and the number of visitors on a beach. The left panel shows the feature-

space, with fake data. Green dots correspond to weekend-days, while the midweek is represented by

red dots. The colored regions show how the decision tree has divided the parameter space, where the

green regions are predicted to contain green dots. The cuts have been shown with black lines, where

the broadest line corresponds to the first cut. The right panel shows the decision tree that corresponds

to this splitting of data. The left branch of each cut corresponds to “yes”, while the right branch is

“no”.

On weekend-days more people have time to visit the beach, and therefore a high number of visitors is a good indicator of a weekend-day. After this first cut is made, the bottom rectangle is further divided. Although at high temperatures the beach can be crowded even midweek, this is not the case at lower temperatures. As a result, the second cut identifies days with low temperatures as part of the weekend. This argument does not hold for low numbers of visitors; however, in the absence of additional cuts these days are classified as weekend-days because of the majority vote.

5.3 Random forest classifier

Although every cut is designed to be the optimal split at that step, better results may sometimes be obtained by starting with a sub-optimal cut that enables a better split in the subsequent step. Another problem with decision trees is that different starting cuts can result in very different final trees. Even though these different trees can reach similar prediction accuracies, an experiment like XENON1T needs algorithms that are reproducible. Finally, it is easy to overfit the data by adding too many cuts [23].

To counter these problems, a Random Forest Classifier (RFC) is used; instead of using

one tree, multiple trees are used to predict the class of the data-points. The final outcome is

the weighted average of the individual predictions, where the weight is the fraction of points

that resulted in the majority vote [24]. The number of trees used to create the forest is a

parameter imposed on the algorithm. Using more trees results in better accuracy; however, it also costs more computation time.

To make sure different trees are created, only a randomly picked subset of the features is available for each cut. For example, using the features of the current data-analysis, the

first cut can be made either on the ‘largest other S2’ or the ‘50%-width’, while the second

cut is only allowed in the ‘largest other S2’- and ‘GoF’-spaces. The best cut will be different

depending on the features that can be used, and in this way different random trees are created

[25].

To evaluate the best possible cut, the function

Φ = C1 · Σ_{i: y_i = +1} |y_i − h(x_i)| + C2 · Σ_{i: y_i = −1} |y_i − h(x_i)|,   (6)

which counts the number of misclassifications, is minimised. Here x_i is a vector with all feature-values of the i-th data-point and y_i is the class to which it belongs (either +1 or −1); h(x_i) is the prediction, depending on how the cut is made. For the first cut in the above example it

would be of the form

h(x_i) = { +1 if N_visitors > 800
         { −1 if N_visitors ≤ 800,   (7)

where the value 800 is the number to be optimised to find the best cut.

C1 and C2 in eq.(6) are class-weights, determining the penalty for misclassifications: higher values allow less violation in the training-set. In unbalanced data-sets (sets with a different number of points in each class) the two sums in eq.(6) can differ heavily in size. To make sure that misclassifications of the less abundant class are also taken into account, the class-weights are chosen inversely proportional to the number of data-points in each class [26].
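In scikit-learn, the inverse-frequency weighting described above is exactly what `class_weight='balanced'` does: class k gets weight n_samples / (n_classes · n_k). A minimal sketch on an unbalanced toy set (the data and the 90/10 imbalance are invented, echoing the single-scatter fraction mentioned later):

```python
# Random forest with inverse-frequency class weights on unbalanced data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = np.where(rng.random(1000) < 0.1, 1, -1)  # ~10% rare class (+1)
X[y == 1] += 2.0                             # shift the rare class so it is learnable

clf = RandomForestClassifier(n_estimators=100,   # number of trees in the forest
                             max_features=2,     # features available per cut
                             class_weight='balanced',
                             random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```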

5.4 Support vector machine

Although decision trees are usable in multi-dimensional problems, the possible cuts are limited by their construction. For example, a diagonal boundary in figure 11 would be difficult to realise with a decision tree. Support Vector Machines (SVMs), originally constructed by Vapnik [27], are classifiers that can fit differently shaped curves in feature-space, dividing it into decision regions.

The working principle of an SVM is to map the feature-space (which can be multi-dimensional) into a higher-dimensional space (the computational space). In this richer space a separating hyperplane is constructed (a hyperplane is the higher-dimensional equivalent of a plane in 3D and a straight line in 2D), which divides the data-points into the different classes [28]. Hereafter, the map is inverted so that the decision region is transformed back to the feature-space.

Even though the separating hyperplane is linear, the decision regions can be a higher order

polynomial, a Gaussian or other shaped function, depending on the map between feature and

computational space [28]:

x_i → φ(x_i) = (φ_1(x_i), φ_2(x_i), ..., φ_n(x_i)),   (8)

where φ(xi) is the map from feature to computational space.

A hyperplane that separates classes without failure would have to fulfil:

w · xi + b ≥ +1 ∀ xi ∈ Class 1 (9)

w · xi + b ≤ −1 ∀ xi ∈ Class 2, (10)

where the hyperplane is specified by w and b [28].

If the ≥ or ≤ sign holds with equality, the point lies exactly on the margin and is called a support vector. The further the absolute value of the outcome differs from 1, the larger the distance to the separating plane. As the support vectors lie closest to the boundary, they predominantly determine its shape [29].

The class y of a data-point is, in our 2-class problem, either −1 or +1, and therefore, in the perfect case the following equation would hold:

y_i(w · x_i + b) ≥ 1,   (11)

for all data-points i. However, if separation on the training set is not flawless, positive predictions get multiplied by negative true values and vice versa, such that eq.(11) becomes

y_i(w · x_i + b) ≥ 1 − ξ_i   (12)

instead, where ξ_i is a measure of the misclassification of point i on the training set and

ξ_i ≥ 0.   (13)

The optimum separation is reached when the distance of all points to the boundary is maximised, while the number of misclassifications, and their size, is minimised. To find this hyperplane the following function has to be minimised under the constraints (12) and (13) [28]:

Φ(w, Ξ) = (1/2)|w|² + C · Σ_i ξ_i.   (14)

C in eq.(14) again determines the effect of misclassifications. Alternatively, to weight the classes differently, the following function can be minimised:

Φ(w, Ξ) = (1/2)|w|² + C1 · Σ_{i: y_i = +1} ξ_i + C2 · Σ_{i: y_i = −1} ξ_i,   (15)

which is more suitable for our problem.

To evaluate the discrimination power of the support vector machine, figure 11 has been

remade with a support vector machine, see figure 12. Instead of the consecutive cuts, the

decision region is now a smooth function. However, the position of the green and red areas is

still roughly equal.

[Figure 12: number of visitors against temperature (°C), titled ‘Days at the beach, classified by svm’.]

Figure 12: The same data as in figure 11 has been re-classified, this time using a Support Vector

Machine (SVM). Comparing the two figures the green and red areas are similar in position, however,

the SVM resulted in a smoother decision boundary.
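The soft-margin formulation of eq.(15) maps onto scikit-learn's `SVC`, where the per-class penalties C1, C2 are set via `class_weight` (the toy data and parameter values below are invented for illustration):

```python
# Soft-margin SVM with an RBF kernel, i.e. a Gaussian map into
# the higher-dimensional computational space.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # separable toy labels

# class_weight maps each class to its C_i; 'balanced' would instead set
# the weights inversely proportional to the class frequencies (eq. 15).
svm = SVC(kernel='rbf', C=100.0, class_weight={-1: 1.0, 1: 1.0})
svm.fit(X, y)
print(svm.n_support_)   # number of support vectors per class
```

The support vectors reported by `n_support_` are exactly the points closest to the boundary discussed above: they alone determine the fitted decision surface.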

5.5 Multi-layered perceptron

Although computers can complete a large number of computations per unit of time, there are still problems that humans solve better or faster, such as face recognition. Therefore, algorithms have been proposed that mimic the human brain, in an attempt to obtain better results.

Figure 13: From [30]. A schematic view of a multi-layered perceptron, two-layered in this case. The input layer is a vector with all the feature-values and the output layer yields a vector whose components equal the probability of belonging to the corresponding class. The white circles are the individual neurons, and each connecting line has an associated weight.


Instead of one powerful processing unit, Multi-Layered Perceptrons (MLPs) consist of many simple processors, in analogy to the neurons that make up our brain. A schematic view of an MLP is shown in figure 13.

The left column is the input layer, which is processed by the hidden layers in the middle before it reaches the output layer. The neurons (represented by white circles) are connected by lines that each carry a weight w, where w_ij^(l) is the weight from the i-th neuron in the l-th layer to the j-th neuron in the (l+1)-th layer. The output of this j-th neuron depends on the input of all neurons in the previous layer [31]:

x_j^(l+1) = f( Σ_{i=1}^{N} w_ij^(l) · x_i^(l) + θ_j^(l) ),   (16)

where N is the number of neurons in the l-th layer and θ_j^(l) is an offset belonging to the neuron on the left-hand side of the equation. The function f is a non-linear activation function; the non-linearity ensures that stacking layers does not reduce to a single linear map. Although there are multiple choices for the activation function, a common form is the Rectified Linear Unit (ReLU) [32]:

ReLU(x) = max(0, x).   (17)

As in eq.(6), the difference between prediction and truth is minimised. However, the parameters to be optimised are now w_ij^(l) and θ_j^(l), for all i, j and l. For our problem, with four features, an algorithm using three hidden layers of ten neurons each would have 30 + 40 + 100 + 100 + 20 = 290 parameters to be optimised (the 30 hidden neurons yield 30 offsets θ_j^(l), and there are 260 connections with corresponding weights).
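The forward pass of eq.(16), and the parameter count above, can be sketched directly in NumPy (random weights, purely illustrative):

```python
# One-layer forward step of eq. (16): each neuron j of layer l+1 applies
# ReLU to a weighted sum of layer l plus an offset theta_j.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer_forward(x_l, W, theta):
    # x_l: (N,) activations of layer l; W: (N, M) weights w_ij;
    # theta: (M,) offsets. Returns the (M,) activations of layer l+1.
    return relu(x_l @ W + theta)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # 4 input features
W1, t1 = rng.normal(size=(4, 10)), np.zeros(10)
W2, t2 = rng.normal(size=(10, 10)), np.zeros(10)
W3, t3 = rng.normal(size=(10, 10)), np.zeros(10)
out = layer_forward(layer_forward(layer_forward(x, W1, t1), W2, t2), W3, t3)
print(out.shape)  # (10,)

# Parameter count from the text: 4*10 + 10*10 + 10*10 + 10*2 = 260 weights
# plus 30 hidden offsets gives 290 parameters.
n_params = 4 * 10 + 10 * 10 + 10 * 10 + 10 * 2 + 30
print(n_params)  # 290
```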

5.6 Stochastic gradient descent

Solving the optimisation problems of eq.(6) and (15) exactly is not easy; therefore, a numerical approach is used. From a chosen starting point the gradient of Φ is computed, after which a step is made towards a lower value of Φ. By repeating these steps, lower and lower values of Φ are reached, until eventually a minimum is found. This minimum can be either global or local, so multiple starting positions can be tried in succession, lowering the chance of ending in a local minimum.

The above method is called gradient descent. However, since computing the gradient using all available data can be computationally demanding, a subset of the data can be used for each step, in which case gradient descent reduces to Stochastic Gradient Descent (SGD).

As the individual steps in SGD are less demanding than in full gradient descent, it can converge in less computation time. However, as less data is used for each step, more steps may be needed to reach the minimum, and a balance between the number of steps and the time per step has to be found.

A similar balance applies to the size of the steps taken. If SGD is used to optimise the weights w_ij^(l) of the MLP, it takes the following form:

w_ij,t^(l) = w_ij,t−1^(l) − η · Φ′(w_ij,t−1^(l), x_i, y_i),   (18)

where consecutive steps are separated by time steps t and Φ′ denotes the partial derivative of Φ with respect to w_ij,t−1^(l) [33].

The step-size is represented by η. Higher values of η cover a larger region of the parameter space in fewer steps; however, this comes with the risk of never finding the minimum of Φ, as shown in figure 14.

Figure 14: Stochastic Gradient Descent (SGD) is a method to find the minimum of a function. Full gradient descent calculates the full derivative; however, this can be computationally intensive for larger data-sets. A solution is to use a fraction of the data to determine the direction of each step: SGD. The figure shows the effect of the step-size on the effectiveness of the algorithm, where the density of blue lines symbolises the gradient and the minimum of the function lies within the inner circle. Small steps (yellow arrows) eventually reach the minimum, but many steps are needed. If the step-size is too large (red arrows), there is a risk of overshooting, and the minimum might never be found.

To make sure a single η is compatible with all dimensions of feature-space, the feature-values should be of similar size. This creates the need to normalise all features, such that neither the yellow nor the red arrows in figure 14 apply to the algorithm.

5.7 ROC-curve

After the data has been split and the algorithms chosen and optimised, the performance of the machine-learning code still needs to be evaluated. The fraction of correctly classified points in the test-set is not a fair representation of how well the algorithm performed. Imagine a data-set with 10,000 multiple-scatters and only a hundred single-scatters: a classifier that predicts every data-point to be a multiple-scatter would reach an accuracy of 99%, even though no single-scatters are correctly classified.

ROC-curves, or Receiver Operating Characteristic curves, are a different way of measuring the performance, countering the above problem [34]. A ROC-curve is a plot of the true positive rate (tpr) against the false positive rate (fpr), defined as

tpr = (correctly classified positives) / (total positives),   (19)

fpr = (incorrectly classified negatives) / (total negatives),   (20)

where ‘positive’ corresponds to one class (the single-scatters) and ‘negative’ to the other (the multiple-scatters).

As follows from eq.(19) and (20), both values are confined between 0 and 1. Figure 15

shows three different curves, one which has perfect classification (purple curve), one with de-

cent classification (blue curve) and one that did not succeed in separating the classes (green

curve).

Figure 15: Adjusted from [35]. ROC-curves are a measure of the performance of your algorithm,

where the true positive rate, eq.(19), is plotted against the false positive rate, eq.(20). In the perfect

case, no points would be misclassified, and the purple line would be obtained. In general, this is not

possible and ROC-curves of working algorithms look more like the blue line. If an algorithm fails to discriminate between the classes, it is completely random whether a point is classified correctly: getting 20% of the positive points correct then results in misclassifying 20% of the negative points as well, giving the green curve.


Classifiers predict a point to belong to a class with a certain probability. By changing the

threshold on the certainty needed to be classified as positive, a tpr can be assigned to every

fpr. The blue curve in figure 15 is the standard form of a ROC-curve. Lowering the threshold

increases the number of correctly classified positives, while also increasing the fpr. After a

certain threshold, all the positives have been identified, and increasing the fpr even further

does not yield an increase in correctly classified points.

The green curve, the diagonal line, signals a failure of classification: for every percentage of correctly classified positives, the same percentage of negatives is misclassified, meaning the classification is completely random. Predicting 50% of one class correctly results in 50% of the other class being misclassified. Therefore, counter-intuitively, the line that looks like perfect correlation actually corresponds to no correlation between the algorithm and the data.
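Both the accuracy pitfall and the threshold-scanning behind a ROC-curve can be sketched with scikit-learn (the class sizes mirror the 10,000-vs-100 example above; the classifier scores are simulated):

```python
# Accuracy pitfall and ROC-curve on a heavily unbalanced toy set.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y = np.concatenate([np.ones(100), -np.ones(10_000)])  # 100 single-scatters

# A trivial classifier calling everything "multiple-scatter" (-1):
acc_trivial = np.mean(-np.ones_like(y) == y)
print(acc_trivial)   # ~0.99, yet no single-scatter is classified correctly

# Simulated classifier scores: positives tend to score higher.
scores = np.concatenate([rng.normal(1.0, 1.0, 100),
                         rng.normal(0.0, 1.0, 10_000)])
# roc_curve scans the decision threshold, giving a tpr for every fpr.
fpr, tpr, thresholds = roc_curve(y, scores)
print(roc_auc_score(y, scores))   # area under the ROC-curve
```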

6 Results

6.1 Data preparation

The data used in this thesis was simulated by M. Vonk [6], who created 500,000 S2-signals with a Monte-Carlo program. Because S1-signals are only used for time-measurements, they were not needed in the simulation: the time of collision is output directly by the simulation. Although the data contained more features, only the four mentioned earlier were used: ‘largest S2’, ‘largest other S2’, the ‘50%-width’ normalised by the expected width, eq.(2), and the ‘goodness of fit’. For a detailed explanation of the data-simulation I refer to his work [6].

The data was split into 10%, 20% and 70% for the test-set, validation-set and training-set respectively. The 350,000 points in the training-set determined the normalisation: all features, including those of the test- and validation-set, were divided by the highest value of that feature in the training-set, after which the mean value of that feature in the training-set was subtracted. This resulted in the training-set having zero mean for all features and a difference between maximum and minimum of approximately one. The same holds for the other two sets, but only approximately.
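The normalisation described above can be sketched as follows (the arrays are stand-ins for the real feature tables; the constants are always taken from the training-set):

```python
# Normalise features: divide by the training-set maximum, then subtract
# the training-set mean; reuse the same constants for the other sets.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 500, size=(350, 4))   # stand-in for the real features
X_test = rng.uniform(0, 500, size=(50, 4))

col_max = X_train.max(axis=0)                  # per-feature maximum (training only)
col_mean = (X_train / col_max).mean(axis=0)    # per-feature mean after scaling

X_train_n = X_train / col_max - col_mean
X_test_n = X_test / col_max - col_mean         # same constants, only approximate

# Training features now have mean zero and a max-min range of roughly one.
print(np.allclose(X_train_n.mean(axis=0), 0))  # True
```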

As evident in figures 4, 6 and 8, the multiple-scatters are more abundant. The single-

scatter events make up about 10% of the data-points, which has to be accounted for in the

training of the algorithms.


6.2 Improving current cuts

The current data-analysis has a true positive rate of 0.928 with a corresponding false positive

rate of 0.163. To improve these cuts, the fpr needs to be lowered while the tpr remains

unchanged. Figure 16 shows the performance of a support vector machine on the test-set.

The current cuts are marked with the red dot, while the black dot is the corresponding point

on the ROC-curve.

The parameters imposed on the SVM are:

Parameter     Value       Explanation
class_weight  ‘balanced’  weights inversely proportional to the amount of data per class
kernel        ‘rbf’       radial basis function
probability   True        to retrieve the confidence in the classification
C             1e2         weight of misclassification
gamma         0.4         determines the region of effect of the data-points

Table 2: Parameters imposed on the support vector machine. The package used is scikit-learn [24].
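These parameters map directly onto the constructor of scikit-learn's `SVC`; a sketch without the actual training data:

```python
# The SVM configuration of table 2, written out as an SVC constructor call.
from sklearn.svm import SVC

svm = SVC(kernel='rbf',             # radial basis function kernel
          C=1e2,                    # weight of misclassification
          gamma=0.4,                # region of effect of the data-points
          class_weight='balanced',  # weights inversely proportional to class size
          probability=True)         # retrieve the confidence in the classification
print(svm.kernel, svm.get_params()['C'])  # rbf 100.0
```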

[Figure 16: ROC-curve (tpr against fpr) of the SVM with rbf-kernel; machine cut fpr = 0.085, current cut fpr = 0.163.]

Figure 16: The ROC-curve corresponding to the SVM, with parameters as in table 2. The red dot shows the accuracy of the current cut, while the black dot is the accuracy of the machine cut, corresponding to the same tpr. As the fpr of the machine cut is about a factor of two smaller, the background is reduced by the same factor, meaning the machine algorithm did considerably better than the current data-cuts.


After it became clear that adding machine learning to the analysis yielded better results

than the current method, there were two hypotheses:

1. The two-dimensional cuts were not optimal and could be improved.

2. The multi-dimensional approach of the machine algorithm enabled better classification.

To check both hypotheses, a support vector machine was used to mimic the two-dimensional

cuts. The algorithm had no access to other features than the ‘Largest S2’ and the feature

considered, making it a purely 2D-problem.

[Figure 17: three ROC-curves (left column) and the corresponding decision boundaries in the ‘largest S2’ plane (right column). ‘Largest other S2’ cut: current fpr = 0.198, machine fpr = 0.160. ‘50%-width’ cut: current fpr = 0.817, machine fpr = 0.775. ‘Goodness of fit’ cut: current fpr = 0.896, machine fpr = 0.888.]

Figure 17: To check whether the improvement of figure 16 was due to the high-dimensional analysis

or due to the 2D-cuts not being optimal, a SVM with the same parameters was used to classify the

scatter events based on only two features. The left column shows the ROC-curves of these algorithms,

including both the current accuracy and the machine accuracy. The SVM got better results in all

three cases, however, it is not sufficient to explain the factor of two encountered in figure 16. The right

column shows the decision boundaries that correspond to the machine cuts. Even though a larger

region of the parameter spaces is green, the fpr of the cuts is lower. This is due to the fact that the

green areas that lie outside the cuts have a low density of data, while the red areas inside the cuts

remove a large amount of points.

Figure 17 shows the results of these tests, using the same parameters as in figure 16 so that the results can be compared. The left column shows the ROC-curves, while the corresponding decision boundaries are plotted on the right.

As can be seen in figure 17, changing the two-dimensional cuts results in a better per-

formance. Although the machine cuts use larger regions in feature-space, they still remove

more multiple-scatters. This can be seen by looking at figure 4, 6 and 8, which show that the

density of data in the ‘extra-regions’ is small compared to the parts that are removed with

respect to the current cuts.

However, even though the first hypothesis holds, the improved two-dimensional cuts cannot explain the factor-two difference in fpr seen in figure 16, meaning the second hypothesis must hold as well. The fact that machine learning is indeed able to reach better classification than the current analysis justifies comparing the support vector machine with different algorithms, trying to improve the cuts further.


6.3 Comparing different algorithms

Going through the same cycle of training and validation the following parameters were chosen

for the RFC- and MLP-classifiers:

Parameter            Value       Explanation
class_weight         ‘balanced’  weights inversely proportional to the amount of data per class
n_estimators         100         number of trees that form the forest
min. samples / leaf  6           minimum number of data-points left in a rectangle
max. features        2           number of features available for each cut

Table 3: Parameters for the random forest classifier from the scikit-learn package [24].

Parameter           Value         Explanation
hidden-layer sizes  (10, 10, 10)  three layers, all with 10 neurons
activation          ‘relu’        using a ReLU as activation function
solver              ‘sgd’         using stochastic gradient descent
alpha               1e-4          similar to the C-parameter of the other classifiers

Table 4: Parameters of the multi-layered perceptron [24].
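As with table 2, these parameters map directly onto the scikit-learn constructors; a sketch with the training data omitted:

```python
# The RFC and MLP configurations of tables 3 and 4 as constructor calls.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rfc = RandomForestClassifier(n_estimators=100,        # trees in the forest
                             min_samples_leaf=6,      # minimum points per leaf
                             max_features=2,          # features tried per cut
                             class_weight='balanced')
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10),  # three layers of 10 neurons
                    activation='relu',
                    solver='sgd',                     # stochastic gradient descent
                    alpha=1e-4)                       # regularisation strength
print(rfc.n_estimators, mlp.solver)  # 100 sgd
```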

[Figure 18: ROC-curves (tpr against fpr) of the RFC (machine cut fpr = 0.079) and the MLP (machine cut fpr = 0.084), both compared to the current cut fpr = 0.163.]

Figure 18: To try to achieve better results, different machine algorithms were tested. The random

forest classifier and multi-layered perceptron both got slightly better results, where the random forest

performed best. However, the main difference with the SVM was in run-time. The SVM took multiple

hours to train, while these two algorithms finished the training phase within 15 minutes.


The results of both classifiers were similar to the support vector machine, see figure 18.

However, where the training of the SVM, using the full training set, took multiple hours, both

the RFC and the MLP finished training within 15 minutes.

7 Conclusion

As is evident from the previous chapters, adding machine learning to the current analysis

enables better results. A larger number of multi-scatters can be removed, without affecting

how many single-scatters are identified.

All three algorithms, the RFC, SVM and MLP, yield a comparable improvement in ac-

curacy, where the amount of multiple-scatter events that survive the data-cuts is roughly

halved. However, as the support vector machine faced difficulties with the large amount of

data, I would not recommend the algorithm for the XENON-experiment, where even larger

data-sets can be expected.

Even though the random forest classifier achieved slightly better classification than the

multi-layered perceptron, I would consider both for implementation. It could also be possible

that a combination of the two would yield even better results, just like the RFC is made up

of multiple decision trees.

While the machine classifiers achieve better results, they lack the transparency of the current cuts, which all have a well-founded physical justification. If machine learning is not adopted for this reason, a second look could still be taken at the current two-dimensional cuts: slightly adjusting them, to avoid small regions with a high multiple-scatter density, might yield minor improvements in accuracy even without machine learning.

References

[1] P. Van Dokkum, et al., Nature 555, 629 (2018).

[2] P. A. Ade, et al., Astronomy & Astrophysics 594, A13 (2016).

[3] G. Bertone, Nature 468, 389 (2010).

[4] E. Aprile, et al., The European Physical Journal C 77, 881 (2017).

[5] E. Aprile, et al., Physical review letters 111, 021301 (2013).

[6] M. Vonk, Master thesis (2018).

[7] M. Bojarski, et al., arXiv preprint arXiv:1604.07316 (2016).

[8] D. Silver, et al., Nature 529, 484 (2016).

[9] S. Marsland, Machine learning: an algorithmic perspective (CRC press, 2015).


[10] F. Sebastiani, ACM computing surveys (CSUR) 34, 1 (2002).

[11] T. S. van Albada, K. Begeman, R. Sancisi, J. Bahcall, Dark Matter in the Universe (World Scientific, 2004), pp. 7–23.

[12] W. Hu, S. Dodelson, Annual Review of Astronomy and Astrophysics 40, 171 (2002).

[13] E. Hivon, et al., The Astrophysical Journal 567, 2 (2002).

[14] G. Bertone, D. Hooper, J. Silk, Physics Reports 405, 279 (2005).

[15] E. Verlinde, Journal of High Energy Physics 2011, 29 (2011).

[16] M. Markevitch, arXiv preprint astro-ph/0511345 (2005).

[17] L. Bergstrom, Nuclear Physics B-Proceedings Supplements 138, 123 (2005).

[18] L. Bergstrom, Reports on Progress in Physics 63, 793 (2000).

[19] XENON Collaboration, G. Plante, et al., Columbia University (2016).

[20] J. Aalbers, PhD thesis (2018).

[21] W.-L. Chao, Disp. Ee. Ntu. Edu. Tw (2011).

[22] J. R. Quinlan, Machine learning 1, 81 (1986).

[23] L. Breiman, et al., The Annals of Statistics 26, 801 (1998).

[24] F. Pedregosa, et al., Journal of Machine Learning Research 12, 2825 (2011).

[25] G. Biau, E. Scornet, Test 25, 197 (2016).

[26] J. J. Rodriguez, L. I. Kuncheva, C. J. Alonso, IEEE transactions on pattern analysis and

machine intelligence 28, 1619 (2006).

[27] V. Vapnik, A. Y. Lerner, Avtomat. i Telemekh 24, 774 (1963).

[28] E. Osuna, R. Freund, F. Girosi (1997).

[29] C.-C. Chang, C.-J. Lin, ACM transactions on intelligent systems and technology (TIST)

2, 27 (2011).

[30] M. W. Gardner, S. Dorling, Atmospheric environment 32, 2627 (1998).

[31] A. K. Jain, J. Mao, K. M. Mohiuddin, Computer 29, 31 (1996).

[32] L. Noriega, School of Computing. Staffordshire University (2005).

[33] T. Zhang, Proceedings of the twenty-first international conference on Machine learning

(ACM, 2004), p. 116.

[34] T. Fawcett, Pattern recognition letters 27, 861 (2006).

[35] B. Buxton, W. Langdon, S. Barrett, Measurement and Control 34, 229 (2001).
