Machine learning in the Xenon experiment
Gijs Leguijt
June 22, 2018
Studentnumber 11000279
Assignment Report Bachelor Project Physics and Astronomy
Size 15 EC
Conducted between April 3, 2018 and June 22, 2018
Institute Nikhef
Faculty Faculteit der Natuurwetenschappen, Wiskunde en Informatica
University Universiteit van Amsterdam
Supervisor Prof. dr. P. Decowski
Second Examiner Dr. H. Snoek
Abstract
The XENON1T-experiment aims to make a direct discovery of a dark matter particle
scattering off a xenon atom. As the cross-section of this reaction is very low, the
chance that a dark matter particle scatters twice is negligible. This thesis evaluates the
performance of machine learning algorithms that aim to remove the multiple-scatter sig-
nals from the data. The three different algorithms used are a support vector machine,
a random forest classifier and a multi-layered perceptron. All three algorithms achieved
higher accuracy than the current analysis, but as the support vector machine took several
times longer to process the large amount of data, only the random forest classifier and
the perceptron are recommended for implementation.
Contents
1 Samenvatting
2 Introduction
3 Dark matter
3.1 Proof for dark matter
3.2 Composition of dark matter
4 The XENON1T-detector
4.1 Current data-analysis
4.1.1 Largest S2
4.1.2 Largest other S2
4.1.3 50%-width
4.1.4 Goodness of fit
5 Machine learning
5.1 Overfitting
5.2 Decision tree
5.3 Random forest classifier
5.4 Support vector machine
5.5 Multi-layered perceptron
5.6 Stochastic gradient descent
5.7 ROC-curve
6 Results
6.1 Data preparation
6.2 Improving current cuts
6.3 Comparing different algorithms
7 Conclusion
1 Samenvatting
There is almost five times as much dark matter as all normal mass. We know that dark
matter exists, and yet we have never seen a dark matter particle. To detect these particles
anyway, large detectors are being built, but even with large detectors the chance that a dark
matter particle collides is small; most of the signal is background. Since the chance of a
collision is so small, all particles that collide twice in the detector are therefore not dark
matter particles.
It is therefore important to filter away all particles that collide multiple times. This is
currently done without machine learning, but with machine learning it could be done more
efficiently. This thesis tries to improve the current analysis by applying machine learning,
and uses several algorithms for this purpose. It turns out that all tested algorithms are
indeed better at distinguishing between particles that collide once and particles that collide
multiple times than the current analysis method.
2 Introduction
On March 28, 2018, researchers announced the discovery of a galaxy with a very low concen-
tration of dark matter [1]. This absence, counter-intuitively, was said to prove the existence
of dark matter. The details of this paradox will be explained later; for now it is sufficient to
know that this claim made it into the news worldwide, which shows that dark matter is a big
topic in modern-day physics.
All around the globe, experiments are trying to find dark matter, which would contain
over five times as much mass as all matter known to us [2]! But if there is that much mass
represented as dark matter, why has no detector found it yet? The problem is that dark
matter hardly interacts with any matter known to us. It is comparable to neutrinos, which
also pass through most matter without any interaction and only very rarely leave a trail in
the huge neutrino detectors.
Since no detector has ever found a dark matter particle, there are still multiple theo-
ries about what its individual constituents would be. Currently, most attention is given to
the WIMP model (Weakly Interacting Massive Particle) [3], which is also the target of the
XENON1T-detector, situated in a mine in Italy. XENON1T consists of roughly 1000 kg of
liquid xenon [4] in which all interactions are closely monitored. Although a lot of work
has been put into reducing background noise, to the point that this detector has the low-
est background noise in its category [4], most of the signal still consists of interactions that
did not involve any dark matter particle. As a result, a lot of data needs to be filtered, in
particular events in which particles scattered multiple times within the detector. Since the
cross-section of dark matter particles is very low, of the order of < 10⁻⁴⁰ cm², where the
precise value of the upper bound depends on the WIMP mass [5], the chance that a WIMP
scatters multiple times within the detector is negligible. Discriminating between single and
double scatters is currently done using cuts in 2D parameter spaces [6].
Of course, dark matter is not the only field that has received major attention: machine
learning is one of the other hot topics in science. One can occasionally find a self-driving car
on the road in some countries [7], the world's best Go player was beaten by a computer [8],
and, maybe less known, machine learning is making its way into big research experiments as
well [9]. Whereas the human mind is not naturally capable of working in more than three
dimensions, there is no such limitation for computers. A machine can operate in high-
dimensional spaces, seeking correlations between many variables at once. As a result, letting
a computer find its own way of splitting the data may yield better results than telling the
computer where the splits have to be made [10].
So on the one hand there is a big detector searching for dark matter collisions in its
data, which is like trying to find a needle in a haystack, while on the other there is a pow-
erful data-analysis tool that has classification as one of its main strengths. The aim of this
thesis is to merge the two sides, trying to improve the current splitting of the data, such that
XENON1T becomes even more sensitive.
More specifically, the goal of this work is twofold. First, the effect of adding
machine learning to the analysis will be evaluated. Subsequently, three different learning
algorithms will be compared with each other: the Random Forest Classifier (RFC), the
Support Vector Machine (SVM) and the Multi-Layered Perceptron (MLP).
3 Dark matter
3.1 Proof for dark matter
As stated in the introduction, there are multiple experiments searching for dark matter.
But if no individual particle has been found yet, what is the motivation for these searches?
Strong hints for its existence are found in the rotation curves of galaxies [11]. The speed of
orbiting bodies is governed by the mass enclosed in the orbit and the distance to the centre
of mass. For a typical galaxy one would predict the lower curve in figure 1 when only
considering the visible matter. However, observations yield the upper curve, which differs
significantly from the expectation.
Figure 1: From [11]. The upper line shows the measured rotation curve of the galaxy NGC 3198. The
two lower curves represent the contributions to the curve due to the disk and due to the dark matter
halo. In the absence of the halo, the observed plateau would not occur, and the velocity of stars would
decrease with increasing distance to the centre, after an initial increase up to roughly 5 kpc due to the
accumulation of mass near the centre.
One way of explaining these high velocities far from the centre of the galaxy is to introduce
additional mass in the galaxy. Since only the effect of its gravity is visible, this additional
mass is called dark matter. Looking at clusters of galaxies, instead of individual systems,
one needs even more dark matter to explain the high velocities encountered in these clusters.
Independently of the rotation curves, the existence of dark matter can also be deduced
from the Cosmic Microwave Background (CMB) [12]. The shape of the CMB spectrum has
been measured with great accuracy by increasingly sensitive satellites [13]. Inhomogeneities
in the spectrum result from a balance between gravity and photon pressure [12]. Since dark
matter hardly couples to photons, it creates gravitational potential wells instead of starting
to oscillate due to radiation pressure, which leaves a specific signature on the CMB. Current
theories of the history of the universe can successfully reproduce the CMB spectrum, and
the resulting bound on the amount of dark matter is consistent with the galaxy rotation
curves [14].
Although there are multiple observations that can be explained by dark matter, they are
all related to gravitational effects. Therefore, there have been attempts to explain the
observations with modified theories of gravity acting on large scales [15]. This is where the
discovery of the dark-matter-less galaxy comes into play. Because this galaxy shows the
expected rotation curve, due to the absence of dark matter, theories with different
gravitational laws face difficulties trying to explain both the galaxies with dark matter and
those without.
Another discovery in favour of the dark matter theories is the Bullet Cluster [16], see
figure 2, in which two galaxy clusters collide. In the absence of dark matter, most mass would
be contained in the interstellar gas. When the clusters collided, the stars were hardly affected
and moved through, while the gas was slowed down much more strongly. As a result, most of
the mass, and thus most of the gravitational effect, should have been separated from the
stars. Observations show instead that most of the gravity stayed co-moving with the stars,
which is indeed the case in figure 2; this implies that something else carried most of the mass
in the cluster while hardly interacting. The fact that the gas does not contain most of the
mass is a strong argument for the existence of dark matter.
[Image: Chandra X-Ray Observatory (CXC) view of the cluster 1E 0657-56.]
Figure 2: From [16]. The Bullet Cluster is powerful evidence for the existence of dark matter. As
the two galaxy clusters collided, the individual galaxies moved through due to the vast distances
between them. However, the interstellar gas clouds slowed down, as their higher density resulted
in more collisions. Since most ordinary mass is contained in the gas, one would expect that most
of the gravity would have stayed with the gas and been separated from the galaxies. Observations
show otherwise: the gas clouds, shown in red, got separated from the stars, while the gravitational
effect, shown in blue, did not slow down. Therefore, something else must carry most of the mass,
something that hardly interacts, such that it would not slow down. Dark matter fits this observation
perfectly.
3.2 Composition of dark matter
Since dark matter does not exist in the standard model, modifications or new theories have
to be developed to describe these particles. The dark matter candidates are generally divided
into two categories: MAssive Compact Halo Objects (MACHOs) and Weakly Interacting
Massive Particles (WIMPs). The first category is partly occupied by less exotic objects, like
black holes and neutron stars, while there is still a lot of speculation about what particles or
objects make up the WIMPs.
Since the discovery that neutrinos have mass, they have been proposed as dark matter
candidates [17]. However, neutrinos alone are not sufficient to explain the observed effects
of dark matter [18]. Supersymmetry is beyond the scope of this thesis; however, it produces
a whole zoo of possible WIMPs such as neutralinos, sneutrinos and gravitinos [14]. With
WIMPs being considered the most promising dark matter category, they form the focus of
the XENON-experiment.
4 The XENON1T-detector
The XENON1T-detector is the third in a series of xenon experiments and the predecessor
of XENONnT, which will be an even bigger detector [19]. It aims to find dark matter by
detecting the light produced by a WIMP colliding with a xenon atom. To this end the detector
contains a target mass of 1000 kg xenon, partly in a gaseous state and partly liquid. The
light is recorded by Photo Multiplier Tubes (PMT) situated both at the top and bottom of
the detector. The array of 248 PMTs can reconstruct the (x,y)-position of the collision by
analysing the yield of the individual PMTs. In principle the difference in light travel time
could be used to retrieve the z-position, but the speed of light in comparison to the size of
the experiment makes this practically impossible. Instead, the z-position is constructed using
the fact that there is both liquid and gaseous xenon present.
In a collision, a xenon atom is ionised, and in response an excited Xe2-molecule is formed.
The individual xenon atoms are transparent to the light that is emitted when the molecule
de-excites, and this light is the first signal seen from a collision (S1). The electrons that
resulted from the ionisation are accelerated upwards by an electric field into the gaseous
xenon, where they receive sufficient energy to ionise the gas, which creates a second signal
(S2). The higher in the detector the collision took place, the shorter the distance between
the interaction site and the gaseous phase, resulting in a smaller time interval between the
S1 and the S2. Therefore, the z-position can be reconstructed by comparing the arrival times
of both signals, as shown in the left panel of figure 3; the right-hand side shows the
XENON1T-detector itself.
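The z-reconstruction described above amounts to multiplying the S1-S2 time difference by the electron drift velocity (quoted later in the text as 1.44 mm/µs). The following is a minimal sketch of that idea; the function name and argument convention are my own, not XENON1T's actual code:

```python
# Minimal sketch of z-reconstruction from the S1-S2 time difference.
# v_drift is the electron drift velocity quoted later in the text;
# function name and units are illustrative assumptions.
v_drift = 1.44  # mm/us

def depth_below_surface(t_s1_us, t_s2_us):
    """Depth of the collision below the liquid surface, in mm."""
    drift_time = t_s2_us - t_s1_us  # time the electrons drifted upwards, in us
    return v_drift * drift_time
```

A collision whose electrons drift for 100 µs, for example, would be placed 144 mm below the liquid surface.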
Figure 3: From [20]. A view into the XENON1T-detector, where the left panel shows the creation of a
signal in the detector. As a particle moves through the liquid xenon, it collides with an atom, resulting
in an ionisation. Upon de-excitation, light is emitted, which is recorded as the S1-signal. Due to the
collision, there are free electrons in the liquid xenon that get accelerated upwards by an electric field.
When the electrons reach the gaseous xenon, they have enough energy to ionise the xenon, resulting
in a second signal, the S2. The right panel shows a more detailed view of the detector.
Apart from xenon being transparent to the emitted light, it also has a large cross-section,
which enlarges the probability of detecting dark matter. This also makes it useful as a
screening substance: in addition to the 1 t used for detection, over 2 t of xenon is used solely
for shielding purposes. To further reduce background signals, the detector is placed in a big
water tank in the Laboratori Nazionali del Gran Sasso, which is situated in a mine in
Italy [4].
4.1 Current data-analysis
As mentioned earlier, the shielding with xenon, water and 3600 m of rock is not sufficient to
remove all background noise. For example, even though all components of the detector have
been selected for low radioactivity, there are still neutrons and electrons in the detector that
originate from decaying isotopes. Neutrons that scatter only once in the detector are hard to
distinguish from WIMPs; however, neutrons are likely to scatter multiple times before leaving
the detector. To distinguish between multiple- and single-scatter events, cuts are made on the
following parameters (or features, in machine-learning jargon): the ‘largest S2’, the ‘largest
other S2’, the ‘50%-width’ and the ‘Goodness of Fit’ (GoF) [6].
4.1.1 Largest S2
The S2 signal originates from the accelerated electrons that ionise the gaseous xenon. Its size
depends both on the energy of the incident particle and on the z-coordinate of the collision. A
higher-energy collision within the liquid xenon results in a larger number of liberated electrons,
and therefore more ionisation in the gaseous phase. Because the electrons are accelerated on
their way up, their ability to ionise the gas is affected by the z-coordinate of the collision:
collisions lower in the detector allow more acceleration and result in higher S2 peaks. The
largest S2 signal is the highest peak, labelled S2, that is recorded by the PMTs during the
event. It is used as the reference against which the other features are plotted.
4.1.2 Largest other S2
If a particle scatters twice in the detector, the second collision is less energetic because of
the energy lost in the first. Therefore, the second-largest S2 peak can be attributed to the
second collision. Even in single-scatter events a second S2 can be measured, because of
noise or electrons delayed by collisions; however, these accidental S2’s are in general smaller
than S2’s originating from a second collision.
Currently this cut is implemented as [6]:
LOS2 < 0.00832 · S2 + 72.3, (1)
where ‘largest other S2’ has been abbreviated as LOS2 and S2 corresponds to the largest S2
signal.
Figure 4: The current cut in the (Largest S2, Largest other S2)-parameter space. The cut tries to
discriminate between single- and multi-scatter events. Even in single-scatters, a second S2 signal can
be observed, due to the delay of some of the electrons; however, in multi-scatters this second signal is
often relatively large, allowing discrimination between the two cases. Points that lie above the blue
line, eq. (1), are filtered away, as they are not likely to be single-scatter events.
In figure 4 the (Largest S2, Largest other S2)-parameter space is shown, including the
current cut: eq.(1). Data-points above the line are removed, while the events below the line
are kept for further analysis.
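As a sketch of how eq. (1) could be applied to data (the function and array names are hypothetical; this is not the actual XENON1T analysis code):

```python
import numpy as np

def passes_los2_cut(largest_s2, largest_other_s2):
    """Boolean mask implementing eq. (1): True for events that are kept."""
    return largest_other_s2 < 0.00832 * largest_s2 + 72.3

# Hypothetical example events: (largest S2, largest other S2)
s2 = np.array([10000.0, 1000.0])
los2 = np.array([50.0, 500.0])
mask = passes_los2_cut(s2, los2)  # first event kept, second removed
```

For the first event the threshold is 0.00832 · 10000 + 72.3 = 155.5, so its largest other S2 of 50 passes; the second event's 500 lies far above its threshold of 80.6 and is removed.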
4.1.3 50%-width
If the second-largest S2 is well separated from the largest S2, it can be identified by the
‘largest other S2’-cut. However, if the time separation between the two peaks is small enough
for the distributions to overlap, the peaks can be classified as a single S2, see figure 5.
Although the second case in figure 5 is classified as one peak, its shape is different from that
of a genuine single peak. The 50%-width depends on the shape of the curve, and can be used
to discriminate between single and double scatters in case 2 of figure 5.
Figure 5: Schematic view of two consecutive S2-signals. The left panel shows a case that can be
identified by the ‘largest other S2’-cut; however, this cut fails to classify the right panel. As the time
between the two signals is short, there is some overlap between the peaks, so the detector sees only
one peak. As this “one peak” has a different shape than a genuine single peak, this false peak can be
identified by its half-width, as the half-width depends on the peak shape.
Another parameter that influences the width of the signal is the height at which the
collision occurred. Electrons originated from lower collisions take longer to reach the PMT’s
at the top of the detector, and subsequently diffuse further apart, creating a broader signal.
To discriminate between the two effects on the width of the signal, the actual width is divided
by the expected width [6],
∆t_expected(t) = √( 4.0325 · D · t / v² + w₀² ), (2)
where D is the diffusion constant in the detector, 22.8 cm²/s, v the drift velocity of the
electrons in the detector, 1.44 mm/µs, and w₀ = 348.6 ns an offset that takes into account
that the pulse is formed by multiple electron signals. The dependence on the position in the
detector is reflected in the appearance of t, the time between collision and detection of the
signal: as the field inside the detector is known, this time translates into a distance covered,
and therefore into the vertical position.
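Eq. (2) can be sketched in code by converting the quoted constants to a common time unit (µs); the function name and the unit choices are my own assumptions for illustration:

```python
import numpy as np

# Constants from the text, converted to a common time unit (us):
D = 22.8 * 1e-6   # diffusion constant, 22.8 cm^2/s expressed in cm^2/us
v = 0.144         # drift velocity, 1.44 mm/us expressed in cm/us
w0 = 0.3486       # offset, 348.6 ns expressed in us

def expected_width(t_us):
    """Expected 50% width (us) of an S2 for drift time t (us), eq. (2)."""
    return np.sqrt(4.0325 * D * t_us / v**2 + w0**2)
```

At zero drift time the width reduces to the offset w₀, and it grows with drift time as the electron cloud diffuses.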
Figure 6 includes the lines:

∆t_50%(t) / ∆t_expected(t) < √( ppf(1 − 10⁻⁵, S2/scg) / (S2/scg − 1) ), (3)

∆t_50%(t) / ∆t_expected(t) > √( ppf(10⁻⁵, S2/scg) / (S2/scg − 1) ), (4)

the current cut in this parameter space, with scg a dimensionless constant, 21.3, and ppf the
percent-point function. Only points that lie between the curves (3) and (4) are kept.
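Assuming that ppf denotes the percent-point function of the χ²-distribution with S2/scg degrees of freedom (an assumption on my part: the text does not name the distribution, and `scipy.stats.chi2.ppf` is the natural candidate), the bounds of eqs. (3) and (4) could be sketched as:

```python
import numpy as np
from scipy.stats import chi2  # assumed choice of percent-point function

scg = 21.3  # dimensionless constant from the text

def width_bounds(s2):
    """Lower and upper bounds on the normalised 50% width, eqs. (3)-(4)."""
    n = s2 / scg  # rough number of electrons contributing to the S2
    lower = np.sqrt(chi2.ppf(1e-5, n) / (n - 1))
    upper = np.sqrt(chi2.ppf(1 - 1e-5, n) / (n - 1))
    return lower, upper

lo, hi = width_bounds(5000.0)  # bounds bracket the single-scatter value of 1
```

As expected, the bounds tighten around 1 for larger S2, since more electrons give a better-determined width.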
Figure 6: The 50%-width cut consists of two lines: eqs. (3) and (4). As the width of the signal also
depends on the travel time of the electrons, the width is divided by the expected width, eq. (2), so
that single-scatter events are centred around a normalised width of 1. Points that differ too much
from the expected width, and therefore lie outside the blue lines, are filtered away.
4.1.4 Goodness of fit
In the previous two cuts the peaks were separated in time; the GoF-cut uses the spatial
distribution of the peaks instead. Figure 7 shows the signal of two peaks in the PMTs
situated at the top of the detector. Because the electrons are accelerated in the vertical
direction, the resulting shapes are roughly circular.
If the signal were a single scatter, there would be a single circular shape; however, the
distortion by the second collision results in a different pattern. To quantify how well a signal
resembles a single scatter, it is compared with simulated patterns, where the differences to
the best fit are summed over all PMTs. As a result, a higher value of the GoF actually means
the pattern did not fit the simulated patterns well, and “badness of fit” might have been
a more accurate name.
Figure 7: From [6]. A top view of the detector, showing the individual PMTs. The event shown is
a double-scatter event, as it consists of two separated regions of high signal. The purple cross marks
the reconstructed position. To identify single- and multi-scatters, the measured signals are compared
to simulated patterns corresponding to the same reconstructed position. The Goodness of Fit (GoF)
is defined as the summed difference of each individual PMT-signal to the simulation, meaning a large
GoF corresponds to a bad fit.
Figure 8: Whereas the previous cuts were based on time separation, the current GoF-cut tries to
discriminate between single- and multi-scatters based on spatial separation. The yield of all
PMTs is compared to simulated single-scatter patterns, where a better fit corresponds to a lower GoF.
The resulting cut, eq. (5), removes all points that lie above the blue line.
The result of the GoF-cut is plotted in figure 8, with the cut being:

GoF < 0.0404 · S2 + 594 · S2^0.0737 − 686. (5)

Since a lower GoF means a better fit, only the points that lie below the curve are kept.
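The GoF-cut of eq. (5) can likewise be sketched as a boolean mask (the function and array names are hypothetical):

```python
import numpy as np

def passes_gof_cut(largest_s2, gof):
    """Boolean mask implementing eq. (5): True for events that are kept."""
    return gof < 0.0404 * largest_s2 + 594.0 * largest_s2**0.0737 - 686.0

# For a largest S2 of 10000 the threshold is roughly 890, so a small GoF
# passes while a large one is cut away (illustrative numbers).
keep = passes_gof_cut(np.array([10000.0, 10000.0]), np.array([100.0, 5000.0]))
```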
5 Machine learning
As mentioned in the introduction, machine learning is about letting the computer find the
best way to classify the data, instead of imposing how the cuts have to be made. To illus-
trate the difference with conventional methods, a cartoon example is used. This example is a
classification problem: such problems consist of data with features (or variables) that are
used to predict the class of a certain event.
If a computer had to discriminate between Superman and Spiderman using conventional
methods, it would need characteristics of both. Its input could for example look like
table 1, where the values of specific features have been pre-defined. This would also work with
continuous variables (like the amount of red in a picture), but for the sake of the argument
only binary variables have been listed.
                Class
Feature     Superman   Spiderman
Cape           +1         -1
Mouth          +1         -1
Gloves         -1         +1
etc.

Table 1: Cartoon classification problem. The values +1 correspond to present, while -1 means the
specific feature is absent in that class.
The machine learning approach is different in the sense that the feature boundaries have
not been imposed: the machine would still be told to look at the presence of a cape, gloves,
the amount of red, etc., but the second and third columns of table 1 would be left empty.
Instead, a dataset is used to let the algorithm complete the table by itself.
The above situation is an example of supervised learning. Machine learning is also usable
for unsupervised problems, in which case the number of columns would not be specified, or,
using deep learning, for problems without a pre-defined number of rows. However, in
this study the number of classes is known, events are either single-scatters or multi-scatters,
and therefore supervised learning is sufficient.
The data set that is used to fill the columns is referred to as the training-set [21]: the
algorithm gets access to both the features and the classes of the data points in order to fit a
function in feature space. Afterwards it can assign classes to new points based on their
feature values. To reduce the risk of overfitting, the performance of the trained algorithm is
tested on another set, the validation-set. Although both features and classes of this set are
again specified, only the features are shown to the machine. The outcome is then compared
to the known classes of the data points to test the performance of the algorithm.
To avoid overfitting, underfitting or some other problem in the classification, parameters
can be tuned and the cycle repeated in order to improve performance on the validation-set.
However, by repeating these steps, there is a risk of overfitting on the combined training +
validation-set. To measure the final performance of the algorithm, it is tested on a test-set,
after which the algorithmic parameters cannot be changed without creating a bias in the
testing. Figure 9 shows the process of optimising the algorithm, and the use of the different
data-sets.
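The three-way split described above can be sketched with scikit-learn; the 60/20/20 proportions and the mock data are my own choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Mock events: 4 features each (e.g. the four cut variables) and a binary
# label, 1 = single-scatter, 0 = multi-scatter (invented data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# First split off the test-set, then divide the rest into training- and
# validation-sets: 60% train, 20% validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
```

The test-set is split off first so that it is never touched during the tune-and-retrain cycle.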
[Flow diagram: Training-set → Validation-set → Test-set → Final test. Loop: optimising the
decision boundary on the training-set + testing on the validation-set; manually adjusting
imposed model parameters, e.g. class weight.]
Figure 9: The process of training an algorithm consists of multiple steps. At first, different parameters
are imposed on the algorithm, which are then used to make the best cut in the training-set. As
these cuts could be subject to overfitting or have other problems, the performance of the algorithm is
evaluated on the validation-set. To achieve better results on the validation-set, the imposed parameters
can be changed, going into a new cycle of training and evaluation. When the desired accuracy on the
validation-set is reached, an independent test is needed to check the performance of the algorithm.
Therefore, it is tested on a new set, the test-set, which gives the final accuracy of the classifier.
The number of points that form each data-set depends on the specific problem, for example
on the desired accuracy and the amount of available data. To use the data as efficiently as
possible, one should keep the test-set at the minimum size that fulfils the accuracy
requirements and use as much data as possible in the training-set. Adding more data to the
training-set improves the ability to fit functions in the parameter space, and also lowers the
chance of overfitting [21].
5.1 Overfitting
Even though models with higher complexity are able to obtain better results on the training-set,
they are, generally speaking, not better models. This paradox is shown in figure 10.
[Plot: “Overfitting”. Legend: fitted points, new points, noiseless truth, fit of order 10.
x ranges from 0 to 10, y from 0 to 30.]
Figure 10: Example of overfitting. The blue and red points are points on the line y = 2x + 3 (green
line) with added noise. The blue points (training-set) have been used to fit a 10th-order polynomial,
which fits them reasonably well. However, the red points (validation-set) do not match the fitted
curve at all, so the fit has to be adjusted.
Even though the 10th-order polynomial fits the training-set better than a linear model
would have done, it is clear that the model is not able to predict unseen data with reasonable
accuracy. If the red points had lain within x ∈ [0, 8], the results would have been similar,
but less pronounced.
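The setup of figure 10 can be reproduced in a few lines; the noise scale and random seed here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 8, 12)
y_train = 2 * x_train + 3 + rng.normal(scale=1.0, size=x_train.size)

# A 10th-order polynomial nearly interpolates the training points, while a
# straight line leaves larger residuals on them -- yet the line generalises
# far better beyond x = 8.
c10 = np.polyfit(x_train, y_train, deg=10)
c1 = np.polyfit(x_train, y_train, deg=1)

resid10 = np.sum((np.polyval(c10, x_train) - y_train) ** 2)
resid1 = np.sum((np.polyval(c1, x_train) - y_train) ** 2)
```

The smaller training residual of the high-order fit is exactly the misleading signal the text warns about: it says nothing about performance on new points.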
5.2 Decision tree
The algorithms used to find patterns within the training-set are called ‘classifiers’. The
first one that will be considered is the decision tree. A decision tree uses consecutive cuts in
feature space to split the parameter region into rectangular decision regions. In these decision
regions, points are assigned to a specific class by a ‘majority vote’: all points get the same
class as the one that is most abundant in that part of the parameter space [22].
To clarify the decision-tree procedure, the classification of weekend-days versus midweek-
days will be performed based on the average temperature and the number of people that
visited an (unspecified) beach that day. Figure 11 shows both the feature space (left panel)
and the decision tree (right panel), including the decision regions.
Figure 11: Example of a classification problem. The goal is to identify weekend- and midweek-days
based on the temperature and the number of visitors on a beach. The left panel shows the feature
space, with mock data. Green dots correspond to weekend-days, while the midweek-days are
represented by red dots. The coloured regions show how the decision tree has divided the parameter
space, where the green regions are predicted to contain green dots. The cuts are shown with black
lines, where the thickest line corresponds to the first cut. The right panel shows the decision tree that
corresponds to this splitting of the data. The left branch of each cut corresponds to “yes”, while the
right branch is “no”.
On weekend-days there are more people that have time to visit the beach, and therefore
a high number of visitors is a good indicator of weekend-days. After this first cut is made,
the bottom rectangle is further divided. Although at high temperatures the beach can be
crowded even midweek, this is not the case at lower temperatures. As a result, the second
cut identifies days with low temperatures as part of the weekend. This argument does not hold
for low numbers of visitors; however, in the absence of additional cuts these days are classified
as weekend-days because of the majority vote.
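The beach example can be sketched in a few lines with scikit-learn's decision tree; the data points, thresholds and tree depth below are made up for illustration only:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical beach data: [temperature (°C), number of visitors].
# Label 1 = weekend-day, 0 = midweek-day (the green and red dots of figure 11).
X = [[22, 1100], [25, 1000], [7, 250], [5, 150],   # weekend-days
     [23, 600], [20, 500], [18, 400], [16, 300]]   # midweek-days
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Two consecutive cuts suffice: one on the number of visitors, one on temperature.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[24, 1200], [19, 450], [6, 100]]))  # weekend, midweek, weekend
```

The fitted tree reproduces the rectangular decision regions of figure 11: a crowded beach or a cold day is classified as a weekend-day.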
5.3 Random forest classifier
Although every cut is designed to construct the optimal splitting at that step, it is possible
that better results can be obtained by starting with a less optimal cut that enables a better
splitting in the consecutive step. Another problem with decision trees is that different
starting cuts can result in several distinct final trees. Even though the different cuts can
result in similar prediction accuracies, an experiment like XENON1T needs algorithms that
are reproducible. Finally, it is easy to overfit data by adding too many cuts [23].
To counter these problems, a Random Forest Classifier (RFC) is used; instead of using
one tree, multiple trees are used to predict the class of the data-points. The final outcome is
the weighted average of the individual predictions, where the weight is the fraction of points
that resulted in the majority vote [24]. The number of trees used to create the forest is a
parameter imposed on the algorithm. Using more trees results in better accuracy, however,
it also means more computation time.
To make sure different trees are created, only a randomly picked fraction of the features
are available for every cut. For example, using the features of the current data-analysis, the
first cut can be made either on the ‘largest other S2’ or the ‘50%-width’, while the second
cut is only allowed in the ‘largest other S2’- and ‘GoF’-spaces. The best cut will be different
depending on the features that can be used, and in this way different random trees are created
[25].
To evaluate what is the best possible cut, the function

Φ = C_1 Σ_{i: y_i = +1} |y_i − h(x_i)| + C_2 Σ_{i: y_i = −1} |y_i − h(x_i)|,    (6)

which counts the number of misclassifications, is minimised. Here x_i is a vector with all
feature-values of the i-th data-point and y_i is the class to which it belongs (either +1 or −1).
The prediction h(x_i) depends on how the cut is made. For the first cut in the above example it
would be of the form:
h(x_i) = +1 if N_visitors > 800,  and  h(x_i) = −1 if N_visitors < 800,    (7)

where the value 800 would be the number to be optimised to find the best cut.
C_1 and C_2 in eq. (6) are class-weights, determining the effect of misclassifications. Higher
values for these parameters mean a higher penalty, allowing fewer violations in the training
set. In unbalanced data-sets (sets that have a different number of points for each class) the
sums in eq. (6) can differ heavily in size. To make sure that misclassifications of the less
abundant class are also taken into account, the class weights are chosen inversely proportional
to the number of data-points per class [26].
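In scikit-learn, the package used later in this thesis [24], these inversely proportional weights correspond to the ‘balanced’ option. A minimal sketch with a made-up 90/10 class split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Unbalanced toy set: 90 multiple-scatters (class -1), 10 single-scatters (class +1).
y = np.array([-1] * 90 + [+1] * 10)

# 'balanced' sets each class weight to n_samples / (n_classes * n_class_samples).
weights = compute_class_weight("balanced", classes=np.array([-1, +1]), y=y)
print(weights)  # the rare class gets the larger weight: [0.5555..., 5.0]
```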
5.4 Support vector machine
Although decision trees are usable in multi-dimensional problems, the possible cuts are limited
by their construction. For example, a diagonal line in figure 11 would be difficult to
realise with decision trees. Support Vector Machines (SVMs), originally constructed
by Vapnik [27], are classifiers that are able to fit differently shaped curves in feature space,
dividing it into decision regions.
The working principle of a SVM is mapping a feature-space (that can be multi-dimensional)
into a higher dimensional space (computational space). In this richer space a separating hy-
perplane is constructed (a hyperplane is the higher dimensional equivalent of a plane in 3D
and a straight line in 2D), which divides the data points into different classes [28]. Hereafter,
the map is inverted so that the decision region is transformed back to the feature space.
Even though the separating hyperplane is linear, the decision regions can be a higher order
polynomial, a Gaussian or other shaped function, depending on the map between feature and
computational space [28]:
x_i → φ(x_i) = (φ_1(x_i), φ_2(x_i), ..., φ_n(x_i)),    (8)

where φ(x_i) is the map from feature to computational space.
A hyperplane that separates the classes without failure would have to fulfil

w · x_i + b ≥ +1  ∀ x_i ∈ Class 1,    (9)
w · x_i + b ≤ −1  ∀ x_i ∈ Class 2,    (10)

where the hyperplane is specified by w and b [28].
If the ≥ or ≤-sign is replaced by an equality, the point in question lies exactly on the margin
and is called a support vector. The further the absolute value of the outcome differs from 1, the
larger the distance to the separating plane. As the support vectors lie closest to the boundary,
they predominantly determine its shape [29].
The class y_i of a data-point is, in our 2-class problem, either −1 or +1, and therefore, in the
perfect case, the following equation would hold:

y_i(w · x_i + b) ≥ 1,    (11)
for all data-points i. However, if separation of the training set is not flawless, positive
predictions get multiplied by negative true values and vice versa, so that eq. (11) becomes

y_i(w · x_i + b) ≥ 1 − ξ_i    (12)

instead, where ξ_i is a measure of the misclassification of point i and

ξ_i ≥ 0.    (13)
The optimum separation is reached when the distance of all points to the boundary is
maximised, while the number of misclassifications, and their size, is minimised. To find this
hyperplane the following function has to be minimised under the constraints (12) and (13)
[28]:

Φ(w, Ξ) = (1/2)|w|² + C Σ_i ξ_i.    (14)
C in eq. (14) again determines the effect of misclassifications. Alternatively, to include different
weighting for the classes, the following function can be minimised:

Φ(w, Ξ) = (1/2)|w|² + C_1 Σ_{i: y_i = +1} ξ_i + C_2 Σ_{i: y_i = −1} ξ_i,    (15)

which is more suitable for our problem.
To evaluate the discrimination power of the support vector machine, figure 11 has been
remade with a support vector machine, see figure 12. Instead of the consecutive cuts, the
decision region is now a smooth function. However, the position of the green and red areas is
still roughly equal.
Figure 12: The same data as in figure 11 has been re-classified, this time using a Support Vector
Machine (SVM). Comparing the two figures the green and red areas are similar in position, however,
the SVM resulted in a smoother decision boundary.
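A minimal sketch of such a classifier, using scikit-learn's SVC with an rbf kernel on generated blob data; the data set and parameter values here are illustrative only:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated point clouds as a stand-in for the two classes.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# rbf kernel: the linear hyperplane in computational space becomes a smooth
# decision boundary in feature space, as in figure 12.
clf = SVC(kernel="rbf", C=1e2, gamma=0.4, class_weight="balanced", probability=True)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy; high for well-separated blobs
```

With probability=True the classifier also exposes predict_proba, the per-class confidence that is swept over when constructing a ROC-curve.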
5.5 Multi-layered perceptron
Although computers are able to complete a large number of computations per unit of time,
there are still problems that humans solve better or faster, for example face recognition.
Therefore, algorithms have been proposed that mimic the human brain, aiming for better
results.
Figure 13: From [30]. A schematic view of a multi-layered perceptron, two-layered in this case. The
input layer is a vector with all the feature values and the output layer yields a vector with components
equal to the probability to belong to the corresponding class. The white circles are the individual
neurons, and the connecting lines all have an associated weight.
Instead of one powerful processing unit, Multi-Layered Perceptrons (MLPs) consist of
many non-complex processors, in analogy to the neurons that make up our brain. A schematic
view of an MLP is shown in figure 13.
The left column is the input layer, which is processed by the hidden layers in the middle
before it reaches the output layer. The neurons (represented by white circles) are connected
by lines that correspond to a weight w, where w^(l)_ij is the weight from the i-th
neuron in the l-th layer to the j-th neuron in the (l + 1)-th layer. The output of this j-th
neuron depends on the input of all neurons in the previous layer [31]:
x^(l+1)_j = f( Σ_{i=1}^{N} w^(l)_ij x^(l)_i + θ^(l)_j ),    (16)
where N is the number of neurons in the l-th layer and θ^(l)_j is an offset belonging to the
neuron on the left-hand side of the equation. The function f is a non-linear activation function;
the non-linearity ensures that stacking layers does not yield a trivial (linear) result. Although
there are multiple choices for the activation function, a common form is the Rectified Linear
Unit (ReLU) [32]:
ReLU(x) = max(0, x). (17)
As in eq. (6), the difference between prediction and truth will be minimised. However, the
parameters to be optimised are now w^(l)_ij and θ^(l)_j, for all i, j and l. For our problem, with four
features, an algorithm using three hidden layers of ten neurons each would have
30 + 40 + 100 + 100 + 20 = 290 parameters to optimise (the 30 hidden neurons yield 30 offsets θ^(l)_j,
plus 260 connections with corresponding weights).
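The parameter count above can be verified in a few lines, mirroring the counting in the text (4 input features, three hidden layers of 10 neurons, 2 output classes):

```python
# Layer sizes: input, three hidden layers, output.
layer_sizes = [4, 10, 10, 10, 2]

# One weight per connection between consecutive layers: 40 + 100 + 100 + 20 = 260.
weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

# One offset theta per hidden neuron: 10 + 10 + 10 = 30.
offsets = sum(layer_sizes[1:-1])

print(weights + offsets)  # 290
```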
5.6 Stochastic gradient descent
Solving the optimisation problems of eq. (6) and (15) exactly is not easily done; therefore,
a numerical approach is used. From a chosen starting point, the gradient of Φ is
computed, after which a step is made towards a lower value of Φ. By repeating these steps,
lower and lower values of Φ are reached, until eventually a minimum is found. This minimum
can be either global or local, and therefore multiple starting positions can be tried in
succession, lowering the chance of ending up in a local minimum.
The above method is called gradient descent. However, since computing the gradient
using all available data can be computationally demanding, a subset of the data can be
used for each step, in which case gradient descent reduces to Stochastic Gradient Descent (SGD).
As the individual steps in SGD are less demanding than in full gradient descent, it can
converge in less computation time. However, as less data is used for each step, more steps
may be needed to reach the minimum, and a balance between the number of steps and
the time per step is needed.
A similar balance applies to the size of the steps taken. If SGD were used to optimise
the w^(l)_ij of the MLP, it would take the following form:

w^(l)_ij,t = w^(l)_ij,t−1 − η Φ′(w^(l)_ij,t−1, x_i, y_i),    (18)

where the consecutive steps are separated by time steps t and Φ′ indicates the partial derivative
of Φ with respect to w^(l)_ij,t−1 [33].
The step-size is represented by η. Higher values of η enable coverage of a larger region
of parameter space in fewer steps; however, this comes with the risk of never finding the
minimum of Φ, as shown in figure 14.
Figure 14: Stochastic Gradient Descent (SGD) is a method to find the minimum of a function. Full
gradient descent calculates the full derivative; however, this can be computationally intensive for larger
data-sets. A solution is to use a fraction of the data to determine the direction of each step: SGD. The
above figure shows the effect of the step-size on the effectiveness of the algorithm, where the density of
blue lines symbolises the gradient and the minimum of the function lies within the inner circle. Small
steps, yellow arrows, eventually reach the minimum, but it takes many steps to reach the end-point.
However, if the step-size is too large, red arrows, there is a risk of overshooting the minimum, and the
minimum might never be found.
To make sure one η is compatible with all dimensions in feature-space, the feature-values
should be of similar size. This creates the need to normalise all features, such that
neither the yellow nor the red arrows in figure 14 apply to the algorithm.
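The effect of the step-size η can be illustrated with a one-dimensional toy example; the sketch below uses plain (non-stochastic) gradient descent on the made-up function Φ(w) = (w − 3)², whose gradient is 2(w − 3):

```python
def gradient_descent(eta, steps=50, w=0.0):
    """Repeatedly step against the gradient of Phi(w) = (w - 3)**2."""
    for _ in range(steps):
        w = w - eta * 2 * (w - 3)  # update rule of eq. (18), with the full gradient
    return w

print(gradient_descent(eta=0.1))  # close to the minimum at w = 3 (yellow arrows)
print(gradient_descent(eta=1.1))  # overshoots ever further and diverges (red arrows)
```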
5.7 ROC-curve
After the data has been split and the algorithms chosen and optimised, there is still the need
to evaluate the performance of the machine-learning code. The number of correctly
classified points in the test-set is not a fair representation of how well the algorithm performed.
Imagine a data-set with 10,000 multiple-scatters and only a hundred single-scatters.
If a classifier predicted every data-point to be a multiple-scatter, it would reach an accuracy
of 99%, even though no single-scatters are correctly classified.
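The numbers of this example in code:

```python
# Predict 'multiple-scatter' for every point in a 10,000 / 100 data-set.
n_multi, n_single = 10_000, 100
accuracy = n_multi / (n_multi + n_single)
print(f"{accuracy:.1%}")  # 99.0%, while not a single single-scatter is found
```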
ROC-curves, or Receiver Operating Characteristic curves, are a different way of measuring
the performance, countering the above problem [34]. A ROC-curve is a plot of the true
positive rate (tpr) against the false positive rate (fpr), which are defined as

tpr = (correctly classified positives) / (total positives),    (19)
fpr = (incorrectly classified negatives) / (total negatives),    (20)

respectively, where the term ‘positive’ corresponds to one class (the single-scatters) and
‘negative’ to the other (the multiple-scatters).
As follows from eq. (19) and (20), both values are confined between 0 and 1. Figure 15
shows three different curves: one with perfect classification (purple curve), one with decent
classification (blue curve) and one that did not succeed in separating the classes (green
curve).
Figure 15: Adjusted from [35]. ROC-curves are a measure of the performance of your algorithm,
where the true positive rate, eq. (19), is plotted against the false positive rate, eq. (20). In the perfect
case, no points would be misclassified, and the purple line would be obtained. In general, this is not
possible, and ROC-curves of working algorithms look more like the blue line. If an algorithm fails to
discriminate between the classes, it is completely random whether it classifies a point correctly. Getting
20% of the positive points correct would then result in misclassifying 20% of negative points as well,
and the ROC-curve looks like the green curve.
Classifiers predict a point to belong to a class with a certain probability. By changing the
threshold on the certainty needed to be classified as positive, a tpr can be assigned to every
fpr. The blue curve in figure 15 is the standard form of a ROC-curve. Lowering the threshold
increases the number of correctly classified positives, while also increasing the fpr. After a
certain threshold, all the positives have been identified, and increasing the fpr even further
does not yield an increase in correctly classified points.
The green curve, corresponding to the diagonal line, shows a problem in the classification.
For every percentage of correctly classified positives, there is an equal percentage of misclassified
negatives, meaning the classification is completely random. Predicting 50% of one class
correctly results in 50% of the other class being misclassified. Therefore, counter-intuitively,
the line that looks like perfect correlation actually corresponds to no correlation between the
algorithm and the data.
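This threshold sweep is exactly what scikit-learn's roc_curve performs; a sketch with made-up classifier scores (hypothetical P(single-scatter) values):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0])                   # 1 = single-scatter
y_score = np.array([0.9, 0.8, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1])  # classifier confidence

# One (fpr, tpr) pair per distinct threshold, from strict to loose.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)  # both run from 0 to 1, tracing out the ROC-curve
```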
6 Results
6.1 Data preparation
The data used in this thesis was simulated by M. Vonk [6]. With a Monte-Carlo program he
created 500,000 S2-signals. Because S1-signals are only used for time-measurements, they were
not needed in the simulation, as the time of collision is directly outputted by the simulation.
Although the data contained more features, only the four mentioned earlier were used: the
‘largest S2’, the ‘largest other S2’, the ‘50%-width’ normalised by the expected width, eq. (2), and
the ‘goodness of fit’. For a detailed explanation of the data-simulation I refer to his work [6].
The data was split into 10%, 20% and 70% for the test-set, validation-set and training-set
respectively. The 350,000 points in the training-set determined the normalisation: all features,
including those of the test- and validation-set, were divided by the highest value of that
feature in the training-set, after which the mean value of the feature in the training-set
was subtracted. This resulted in the training-set having a mean equal to zero for all features
and a difference between maximum and minimum of approximately one. The same applies
to the other two sets, but only approximately.
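The normalisation described above can be sketched as follows, with random stand-in data (the real sets hold 350,000 / 100,000 / 50,000 points and four features):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.uniform(0, 50, size=(350, 4))  # stand-in for the 350,000 training points
test  = rng.uniform(0, 50, size=(50, 4))

# Divide every feature by its training-set maximum ...
scale = train.max(axis=0)
train_n = train / scale
# ... then subtract the training-set mean; the test-set reuses both statistics.
mean = train_n.mean(axis=0)
train_n -= mean
test_n = test / scale - mean

print(train_n.mean(axis=0).round(12))             # zero mean per feature
print(train_n.max(axis=0) - train_n.min(axis=0))  # max - min close to one
```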
As evident in figures 4, 6 and 8, the multiple-scatters are more abundant. The single-
scatter events make up about 10% of the data-points, which has to be accounted for in the
training of the algorithms.
6.2 Improving current cuts
The current data-analysis has a true positive rate of 0.928 with a corresponding false positive
rate of 0.163. To improve these cuts, the fpr needs to be lowered while the tpr remains
unchanged. Figure 16 shows the performance of a support vector machine on the test-set.
The current cuts are marked with the red dot, while the black dot is the corresponding point
on the ROC-curve.
The parameters imposed on the SVM are:

Parameter     Value       Explanation
class_weight  ‘balanced’  weights inversely proportional to the number of data-points per class
kernel        ‘rbf’       radial basis function
probability   True        to retrieve the confidence in the classification
C             1e2         weight of misclassifications
gamma         0.4         determines the region of influence of the data-points

Table 2: Parameters imposed on the support vector machine. The package used is scikit-learn [24].
[Figure: ROC-curve of the SVM with rbf-kernel; machine cut fpr = 0.085, current cut fpr = 0.163.]
Figure 16: The ROC-curve corresponding to the SVM, with parameters as in table 2. The red
dot shows the accuracy of the current cut, while the black dot is the accuracy of the machine cut,
corresponding to the same tpr. As the fpr of the machine cut is about a factor of two smaller, the
background is reduced by the same factor, meaning the machine algorithm did considerably better
than the current data-cuts.
After it became clear that adding machine learning to the analysis yielded better results
than the current method, there were two hypotheses:
1. The two-dimensional cuts were not optimal and could be improved.
2. The multi-dimensional approach of the machine algorithm enabled better classification.
To check both hypotheses, a support vector machine was used to mimic the two-dimensional
cuts. The algorithm had no access to other features than the ‘Largest S2’ and the feature
considered, making it a purely 2D-problem.
[Figure: ROC-curves and decision boundaries of the three two-dimensional cuts. ‘Largest other S2’ cut: current fpr = 0.198, machine fpr = 0.160. ‘50%-width’ cut: current fpr = 0.817, machine fpr = 0.775. ‘Goodness of fit’ cut: current fpr = 0.896, machine fpr = 0.888.]
Figure 17: To check whether the improvement of figure 16 was due to the high-dimensional analysis
or due to the 2D-cuts not being optimal, a SVM with the same parameters was used to classify the
scatter events based on only two features. The left column shows the ROC-curves of these algorithms,
including both the current accuracy and the machine accuracy. The SVM got better results in all
three cases, however, it is not sufficient to explain the factor of two encountered in figure 16. The right
column shows the decision boundaries that correspond to the machine cuts. Even though a larger
region of the parameter spaces is green, the fpr of the cuts is lower. This is due to the fact that the
green areas that lie outside the cuts have a low density of data, while the red areas inside the cuts
remove a large amount of points.
Figure 17 shows the results of these tests, where the same parameters were used as in figure
16, to make sure the results could be compared. The left column shows the ROC-curves,
while the corresponding decision boundaries are plotted on the right.
As can be seen in figure 17, changing the two-dimensional cuts results in a better per-
formance. Although the machine cuts use larger regions in feature-space, they still remove
more multiple-scatters. This can be seen by looking at figure 4, 6 and 8, which show that the
density of data in the ‘extra-regions’ is small compared to the parts that are removed with
respect to the current cuts.
However, even though the first hypothesis holds, the improved cuts are not able to explain
the factor two difference in fpr seen in figure 16, meaning both the hypotheses are valid. The
fact that machine learning is indeed able to reach better classification than the current anal-
ysis justifies a comparison of the support vector machine with different algorithms, trying to
improve the cuts further.
6.3 Comparing different algorithms
Going through the same cycle of training and validation the following parameters were chosen
for the RFC- and MLP-classifiers:
Parameter            Value        Explanation
class_weight         ‘balanced’   weights inversely proportional to the number of data-points per class
n_estimators         100          number of trees that form the forest
min. samples / leaf  6            the minimum number of data-points left in a rectangle
max. features        2            the number of features available for each cut

Table 3: Parameters for the random forest classifier from the scikit-learn package [24].

Parameter            Value        Explanation
hidden-layer sizes   (10,10,10)   three layers, each with 10 neurons
activation           ‘relu’       using a ReLU as activation function
solver               ‘sgd’        using stochastic gradient descent
alpha                1e-4         similar to the C-parameter of the other classifiers

Table 4: Parameters of the multi-layered perceptron [24].
[Figure: ROC-curves of the RFC (machine cut fpr = 0.079) and the MLP (machine cut fpr = 0.084), both compared with the current cut (fpr = 0.163).]
Figure 18: To try to achieve better results, different machine algorithms were tested. The random
forest classifier and multi-layered perceptron both got slightly better results, where the random forest
performed best. However, the main difference with the SVM was in run-time. The SVM took multiple
hours to train, while these two algorithms finished the training phase within 15 minutes.
The results of both classifiers were similar to the support vector machine, see figure 18.
However, where the training of the SVM, using the full training set, took multiple hours, both
the RFC and the MLP finished training within 15 minutes.
7 Conclusion
As is evident from the previous chapters, adding machine learning to the current analysis
enables better results. A larger number of multi-scatters can be removed, without affecting
how many single-scatters are identified.
All three algorithms, the RFC, SVM and MLP, yield a comparable improvement in accuracy,
with the number of multiple-scatter events that survive the data-cuts roughly
halved. However, as the support vector machine faced difficulties with the large amount of
data, I would not recommend that algorithm for the XENON-experiment, where even larger
data-sets can be expected.
Even though the random forest classifier achieved slightly better classification than the
multi-layered perceptron, I would consider both for implementation. It could also be possible
that a combination of the two would yield even better results, just like the RFC is made up
of multiple decision trees.
While the machine classifiers are able to achieve better results, they lack the transparency
of the current cuts, which all have a well founded physical justification. If for this reason
machine learning is not chosen as data-splitter, then maybe a second look could be taken
at the current two-dimensional cuts. Slightly adjusting these, to avoid small regions with
high multiple-scatter density, might result in minor improvements of the accuracy, even if no
machine learning is used.
References
[1] P. Van Dokkum, et al., Nature 555, 629 (2018).
[2] P. A. Ade, et al., Astronomy & Astrophysics 594, A13 (2016).
[3] G. Bertone, Nature 468, 389 (2010).
[4] E. Aprile, et al., The European Physical Journal C 77, 881 (2017).
[5] E. Aprile, et al., Physical review letters 111, 021301 (2013).
[6] M. Vonk, Master thesis (2018).
[7] M. Bojarski, et al., arXiv preprint arXiv:1604.07316 (2016).
[8] D. Silver, et al., Nature 529, 484 (2016).
[9] S. Marsland, Machine learning: an algorithmic perspective (CRC press, 2015).
[10] F. Sebastiani, ACM computing surveys (CSUR) 34, 1 (2002).
[11] T. S. van Albada, K. Begeman, R. Sanscisi, J. Bahcall, Dark Matter in the Universe
(World Scientific, 2004), pp. 7–23.
[12] W. Hu, S. Dodelson, Annual Review of Astronomy and Astrophysics 40, 171 (2002).
[13] E. Hivon, et al., The Astrophysical Journal 567, 2 (2002).
[14] G. Bertone, D. Hooper, J. Silk, Physics Reports 405, 279 (2005).
[15] E. Verlinde, Journal of High Energy Physics 2011, 29 (2011).
[16] M. Markevitch, arXiv preprint astro-ph/0511345 (2005).
[17] L. Bergstrom, Nuclear Physics B-Proceedings Supplements 138, 123 (2005).
[18] L. Bergstrom, Reports on Progress in Physics 63, 793 (2000).
[19] X. collaboration, G. Plante, et al., Columbia University (2016).
[20] J. Aalbers, PhD thesis (2018).
[21] W.-L. Chao, Disp. Ee. Ntu. Edu. Tw (2011).
[22] J. R. Quinlan, Machine learning 1, 81 (1986).
[23] L. Breiman, et al., The annals of statistics 26, 801 (1998).
[24] F. Pedregosa, et al., Journal of Machine Learning Research 12, 2825 (2011).
[25] G. Biau, E. Scornet, Test 25, 197 (2016).
[26] J. J. Rodriguez, L. I. Kuncheva, C. J. Alonso, IEEE transactions on pattern analysis and
machine intelligence 28, 1619 (2006).
[27] V. Vapnik, A. Y. Lerner, Avtomat. i Telemekh 24, 774 (1963).
[28] E. Osuna, R. Freund, F. Girosi (1997).
[29] C.-C. Chang, C.-J. Lin, ACM transactions on intelligent systems and technology (TIST)
2, 27 (2011).
[30] M. W. Gardner, S. Dorling, Atmospheric environment 32, 2627 (1998).
[31] A. K. Jain, J. Mao, K. M. Mohiuddin, Computer 29, 31 (1996).
[32] L. Noriega, School of Computing. Staffordshire University (2005).
[33] T. Zhang, Proceedings of the twenty-first international conference on Machine learning
(ACM, 2004), p. 116.
[34] T. Fawcett, Pattern recognition letters 27, 861 (2006).
[35] B. Buxton, W. Langdon, S. Barrett, Measurement and Control 34, 229 (2001).