
School of Physics and Astronomy

Queen Mary, University of London

ATLAS detector event classification with TensorFlow

Thomas Charman

Supervisor: Dr. Adrian Bevan

SPA7016U Physics Research Project, 45 Credit Units

Submitted in part fulfilment of the requirements for the degree of MSci Physics from Queen Mary, University of London, March 2017.


Abstract

This report focuses on the classification of signal and background events in a dataset of Monte Carlo simulations. The simulations are based on two different theories that predict the decay X → hh → bbτ+τ−, where X in both cases is a previously undetected particle. A framework is discussed in which the problem of event classification in high energy physics is mapped to the capability of neural networks. The mathematical innards of neural networks are reviewed. Details of implementing neural networks in the TensorFlow software library for numerical computation are given, and technical specifications are presented for the purpose of reproducibility. Hyper-parameters and features of a neural network are explained, and the process by which they are selected is outlined. Results of the analysis are given, showing that the classifier is not yet competitive with rival classifiers. Lastly, the statistical reliability of such methods is discussed, and comments are made on the viability of the use of TensorFlow in high energy physics problems.


Acknowledgements

- I wish to thank Dr. Adrian Bevan for his guidance, which has been essential not only in the completion of this project, but also in aiding my decisions concerning future endeavours in the field.

- Tom Stevenson is owed my thanks for his help, which has been invaluable to me. His assistance in preparing the data used in this project, as well as taking the time to discuss a wide range of concepts with me, made this project possible in the first place.

- I would also like to thank Dr. Dan Traynor for his work in maintaining the local computing cluster at Queen Mary, a facility which saved me untold hours, as well as Dr. Alex Owen and Dr. Cozmin Timis for help with getting to grips with the TensorFlow software and its prerequisites.

- Prof. Peter Sollich and Dr. Christopher White are also thanked for the time they gave up in order to talk over theoretical concepts.


Contents

Abstract

Acknowledgements

1 Introduction
  1.1 The ATLAS Detector & the bbτ+τ− Decay Channel
  1.2 TensorFlow

2 Neural Networks

3 Picking Features
  3.1 Features for hh → bbτ+τ−

4 Hyper-Parameter Selection
  4.1 Hidden Layers
  4.2 Activation Functions & Optimisation Algorithms

5 Results & Analysis
  5.1 General Methodology
  5.2 Best Performing Networks
  5.3 Addressing Generalisation

6 Conclusions
  6.1 Summary
  6.2 Future Research

A Full Features List

B Hyper-Parameter Scan

C Technical Specifications


1. Introduction

Since the discovery of the Standard Model (SM) Higgs boson (h) in 2012 [1, 2] there is an increased motivation to test for its resonant pair production. Such pair production, of the form X → hh, is predicted by extensions to the SM. The unknown particle X can be accounted for by one of many variations on a so-called two-Higgs-doublet model (2HDM) [3], in which it would be a heavy scalar Higgs (H). Another candidate for X comes in the form of the Randall-Sundrum graviton (RSG) [4, 5]. A particular decay channel of a resonant pair of SM Higgs bosons is hh → bbτ+τ−. In this analysis Monte Carlo simulations of the RSG and H particles decaying to bbτ+τ− will be considered separately as a source of signal events, which will be classified when mixed with simulated background events.

Multivariate analysis (MVA) has become a relatively commonplace solution to classification problems in high energy physics (HEP). Many different algorithms exist that will function as either a binary or multi-class classifier, that is to say, they are designed to separate data into two or many classes. Many of these algorithms are implemented in the ROOT [6] Toolkit for Multivariate Data Analysis (TMVA) [7], the de facto standard software for performing MVAs in HEP. A particular choice of algorithm lies in neural networks (NNs) or deep neural networks (DNNs). The difference between NNs and DNNs is merely the number of so-called hidden layers between the input and output. There is no universal rule for the minimum number of hidden layers a network must have before it is termed “deep”. Whilst choosing the size of the NNs used in this report no limitation was placed on the number of hidden layers, so the term NNs will be used for all networks in this report, but in principle networks of all sizes are considered. NNs of any type are fundamentally based on the perceptron [8], an algorithm invented by Frank Rosenblatt in 1957. Since the inception of the perceptron there have been many examples in the literature of these methods being successfully applied to HEP problems [9–13]. There is a tendency in the field to use the TMVA implementation of boosted decision trees (BDTs), or other implementations such as XGBoost [14], in preference to NNs for a number of tasks [15]. BDTs are thought to perform better without extensive configuration and are therefore favoured by some; however, others have differing opinions. In particular, Petra Perner et al. [16] came to the conclusion that the difference in error is not as clear cut.

This work focuses on the use of TensorFlow [17], a new open source software library for numerical computation, originally developed by the Google Brain Team. Specifically, modern implementations of NN building blocks are used to create a signal versus background classifier for the aforementioned signal events, as detected by the ATLAS detector [18]. In the coming chapter a summary of the ATLAS detector and previous analyses of the bbτ+τ− decay channel will be presented, then TensorFlow and its features will be briefly discussed. Secondly, a chapter detailing the mathematics surrounding NNs will motivate the use of these algorithms generally, and the choice of loss function will be given. Next come details of feature selection: first, motivation will be given as to what makes features useful for a NN, then using these principles the features of this analysis will be discussed. Chapter 4 will focus on the tuning of NNs by the selection of their hyper-parameters and optimisation algorithm, before results are presented in chapter 5. Conclusions will be drawn regarding the performance of the classifier versus a classifier being used internally on the ATLAS analysis of the LHC's second data taking phase (run 2). Finally, comments are made regarding what could be done to improve this work in the future and the viability of TensorFlow for use in HEP.

1.1. The ATLAS Detector & the bbτ+τ− Decay Channel

The ATLAS detector is a particle detector at the Large Hadron Collider (LHC). It spans a length of 44 m, is 25 m in height and weighs approximately 7000 tonnes. It has been observing the products of proton-proton collisions since 2010, and detected a particle that we now know to be the Higgs boson in 2012. The detector is formed of cylindrical layers, at the centre of which lies the interaction point, the intersection between two beam pipes. During operation both pipes contain beams of protons travelling at near light-speed in opposite directions; the protons collide with a centre of mass energy of approximately 13 TeV. Figure 1.1 shows a cut-away representation of the detector revealing some of the internal structure. Each layer is different in the way that it detects particles. Radiation trackers made of silicon or xenon, calorimeters and muon detectors are some of the particular components used in ATLAS. Data from these components can be used to infer physical properties of the particles decaying in the beam pipe. For example, by observing the curvature of charged particles through the radiation trackers their momentum can be calculated. Measurements like this can be thought of as primitive data; another type of data, known as derived, is acquired by combining primitives. Derived data is formed based on knowledge of physical laws that are applied to the primitives. The difference between these data types as they pertain to machine learning and NNs will be discussed in chapters 3 and 4.

Figure 1.1: A cut-away representation of the ATLAS detector with some specific components labelled [18].

Analyses of the bbτ+τ− decay channel have been published by the ATLAS collaboration [19] using data from the LHC's first data taking phase (run 1). The CMS collaboration have also published an update to the search using partial data from the second run [20]. Both analyses report no findings of an excess over the expected backgrounds. In the discussion of backgrounds in these analyses it is noted that the tt background is a significant contributor. In this analysis the simulated tt background will be separated from simulated hh → bbττ signal, though in principle the methods presented could be used on any collider analysis and, due to the universal approximation theorem [21], also have much wider applications. The past analyses also contain details of the final state that was considered in the search: the ATLAS result looks only at a final state in which one of the tau particles decays hadronically and the other leptonically, whereas the CMS result takes into account different final states. Given that the simulations used in this work pertain to the ATLAS detector, we too shall only consider the one-hadronic one-leptonic final state of the decay channel.

1.2. TensorFlow

Released in November 2015, TensorFlow is a relatively new tool for performing machine learning. The software has implementations of a wide variety of machine learning algorithms, with a focus on the NNs that will be used in this paper. TensorFlow comes with a particular nomenclature with which programs are described in the documentation; in an effort to keep in line with this, a summary of important elements of a TensorFlow program is included in table 1.1.

Key TensorFlow Elements

Name         Description
Operations   Mathematical computations in the abstract.
Kernels      Implementations of operations for a specific device.
Sessions     The means by which programmers send or retrieve data from the computational graph.
Tensors      The TensorFlow implementation of a multi-dimensional array, similar to an nd-array in NumPy [22].
Variables    Persistent, mutable Tensors; by default they will be adapted by training algorithms.

Table 1.1: A summary of important elements in TensorFlow.

Given these elements, a simple neural network may be thought of as a computational graph made of operations and tensors. At each execution of the model, data are passed through the graph as a tensor, and operations act on the data and variables along the way. A function known as an error or loss function is evaluated, and a training algorithm is used to update the variables. At this stage the importance of the persistence of the variables with regard to the computational graph is highlighted; if the variables behaved like the data tensor they would no longer be seen by the graph after a single execution. Instead, when the graph is executed a second time it is done so with the updated variable tensors, the loss function is evaluated once more, and this is repeated until some requirement is fulfilled. In practice batches of data are fed to the network at each execution to reduce computation time. Normally the requirement for stopping the training is that a fixed number of epochs have taken place, where an epoch is defined as a fixed number of executions. More detail on techniques such as early stopping, and the specific implementations used in this work, will be given in chapter 5.
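As an illustration of this nomenclature, the fragment below builds a tiny computational graph and runs it through a session. It is a minimal sketch assuming the TensorFlow 1.x Python API that was current at the time of writing; the tensor shapes and names are purely illustrative.

import tensorflow as tf

# Operations and tensors define the graph; nothing is computed until a
# session executes it.
x = tf.placeholder(tf.float32, shape=[None, 2], name="inputs")   # data enters as a tensor
w = tf.Variable(tf.random_normal([2, 1]), name="weights")        # persistent, mutable tensor
b = tf.Variable(tf.zeros([1]), name="bias")
y = tf.matmul(x, w) + b                                          # an operation node in the graph

# A session is the means by which data is sent into, and retrieved from,
# the computational graph.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[0.5, -1.0]]}))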


2. Neural Networks

This section outlines some of the mathematical formalism surrounding NNs that act as classifiers. These kinds of NNs can be used to solve binary or multi-class classification problems, where the classes to which each data candidate may belong are mutually exclusive. Many other types of NN exist, however their details will not be discussed in this work. In simple terms a NN is comprised of connected layers of nodes, as in figure 2.1. Nodes fall into three types: input, output and hidden, with layers being homogeneous with regard to node type and therefore inheriting their name. The input layer is comprised of one input node per dimension of a single entry in the dataset that one wishes to pass into the NN for classification. Each output node is a predictive unit corresponding to one of K classes, and it is the goal of the intermediate hidden layers (full of hidden units) to provide a relationship between the inputs and outputs such that the predictive unit with the highest value corresponds to the correct class for the given data. The output layer should therefore be comprised of K nodes. In order to delve further into the details, a good mathematical representation of the hidden layers is required. One such representation is given by Bishop in Pattern Recognition and Machine Learning [23]. In this section the nomenclature used by Bishop will be outlined and then adapted for our specific use.

Figure 2.1: A basic neural network containing an input layer of d nodes corresponding to data of dimensionality d, a hidden layer of m hidden units h_i and an output layer of K predictive units y_i.

The building blocks of the NN, called activations, resemble Fisher discriminants [24] and take the form

a_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0}    (2.1)


where the w_{ji} terms are known as weights and the w_{j0} as biases. Collectively the weights and biases shall be referred to as adaptive parameters, due to the fact that (as shall be made clear later) they are to be adapted by a training algorithm. In order to take the activations and turn them into something closer to a perceptron [8] they must be passed through an activation function denoted H,

h_j = H(a_j)    (2.2)

becoming what are known as hidden units. These activation functions may be non-linear and have the restriction that they must be differentiable; crucially this is different from the perceptron, which uses a non-differentiable step function. Considering a network with only a single hidden layer as in figure 2.1, there are further steps required to get from the hidden units to the predictions y_k. The complete network function, as an argument of a vector of data points x and a matrix of adaptive parameters w, should return predictions given by

y_k(x, w) = O( \sum_{j=1}^{m} w^{(2)}_{kj} H( \sum_{i=1}^{d} w^{(1)}_{ji} x_i + w^{(1)}_{j0} ) + w^{(2)}_{k0} ),    (2.3)

where now, as in Bishop's representation, the superscript number in brackets labels the layer to which adaptive parameters belong, not counting the input layer (as it does not contain them). The network function (2.3) is formed by performing the same construction on the hidden unit (2.2) as was performed on the original data point x_i in (2.1), with a special type of activation function, an output function O. Notably the hidden units must be summed over in the same way that the data points are; however, instead of summing over the dimensions of the data, in order to reproduce the network in figure 2.1 we sum up to m, the desired number of hidden units, which is referred to as the “size” of the hidden layer. A common choice of output function for binary classification problems is the logistic sigmoid function

O(z) = \frac{1}{1 + \exp(-z)}    (2.4)

where in each of these z merely denotes the argument of the output function. For multi-class problems where K > 2, a generalisation of the logistic sigmoid, the softmax function

O(z)_k = p(k|x) = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)}    (2.5)

gives the probability of being in class k given the data x, where i is summed over all classes. The index k appears on the output function as it must be calculated for each of the K classes. It is at this stage that the true meaning of the predictive units is solidified: each should give a number between zero and one that represents the probability of the data belonging to the corresponding class, and logically all predictive units should sum to one. For the sake of compactness we shall, as Bishop does, re-write (2.3) by introducing x_0 = 1 in order to absorb the biases into the sums, yielding

y_k(x, w) = O( \sum_{j=0}^{m} w^{(2)}_{kj} H( \sum_{i=0}^{d} w^{(1)}_{ji} x_i ) ).    (2.6)

Now it is our goal to generalise this network function to one not only of an arbitrary number of hidden layers, but also such that each hidden layer can be of arbitrary size and have an arbitrary activation function. For this purpose, networks will now be described in terms of the number of hidden layers, instead of the number of layers that contain adaptive parameters (hidden plus output) as before. We may start by writing a function for a network of two hidden layers

y_k(x, w) = O( \sum_{j_2=0}^{m_2} w_{k j_2} H_2( \sum_{j_1=0}^{m_1} w_{j_2 j_1} H_1( \sum_{i=0}^{d} w_{j_1 i} x_i ) ) ).    (2.7)

In order to arrive at the two hidden layer function (2.7) the same process was used as for the original network function (2.3). Now there are two activation functions and two hidden layer sizes, denoted m_1 and m_2. Clearly adding a layer to the network simply involves repeated application of this process and picking up the required number of additional parameters. A function for a network of n hidden layers may therefore be written as

y_k(x, w) = O( \sum_{j_n=0}^{m_n} w_{k j_n} H_n( \dots H_2( \sum_{j_1=0}^{m_1} w_{j_2 j_1} H_1( \sum_{i=0}^{d} w_{j_1 i} x_i ) ) \dots ) )    (2.8)

where there are n different versions of the activation function and hidden layer size. Our new hidden units obey the notation

h_{nj} = H_n(a_j)    (2.9)

eliminating the need for the superscript labelling of the layer. The new labelling identifies the hidden layer by the left-hand index of the hidden unit, or by the right-hand index of the weights and biases. In equation (2.8) the hidden units are shown in their expanded form and distinguished by the fact that their j indices take on the subscript n corresponding to the layer they belong to. Figure 2.2 depicts a network of n hidden layers, as described in equation (2.8), except that every hidden layer is depicted with the same size m.

Figure 2.2: A more complex neural network containing an input layer of d nodes corresponding to data of dimensionality d, n hidden layers of m hidden units each h_{ij} (where i indexes the hidden layer and j indexes a particular unit), and an output layer of K predictive units y_k.

There are now a great many parameters to keep track of. It is therefore important to distinguish between the adaptive parameters and the parameters which we must pick by hand, known as hyper-parameters. A relationship can be written between the number of hidden layers n and the number of hyper-parameters, as follows

# of hyper-parameters = 2n + 1.    (2.10)

As previously stated there are simply n activation functions and n hidden layer sizes to determine. Often it is sufficient to set all activation functions to the same function, and in this work all hidden layers share the same size for simplicity. The adaptive parameters are far more numerous, and will be optimised by means of a training algorithm. In TensorFlow the Variable object is the natural choice for implementing the adaptive parameters, as training algorithms in the software will update them by default. The adaptive parameters are initialised randomly from a Gaussian distribution, resulting in poor predictive power to begin with. In order to improve this, some figure of merit, known as a loss function, must be used in order for the training algorithm to measure quantitatively the performance of the NN. We must also provide the algorithm with a dataset to train on, complete with a set of targets or labels that, for the training set, reveal the correct classification for each entry. Due to this requirement, methods such as these are referred to as supervised learning methods. A natural loss function one may use to describe the error of the model given a current set of adaptive parameters is the sum-of-squares error

E(w) = \frac{1}{2} \sum_{n=1}^{N} ( y(x_n, w) - t_n )^2    (2.11)

where t_n are the targets for the given data entries x_n. Minimising this function with some algorithm does work in practice, however it has been shown that using

E(w) = - \sum_{n=1}^{N} ( t_n \ln(y_n) + (1 - t_n) \ln(1 - y_n) ),    (2.12)

known as the cross-entropy, is faster and generalises better [25] (further discussion on generalisation is given in section 5.3). The particular training algorithm that will be used to update parameters in this report is known as adaptive moment estimation, or ADAM [26]. ADAM is a variant of the gradient descent algorithm, which is widely used and has spawned many other variants [27]. The reasons for choosing between ADAM and vanilla gradient descent are given in chapter 4.
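The pieces above map directly onto TensorFlow operations. The sketch below is a minimal, hedged example of how such a classifier could be assembled with the TensorFlow 1.x API: tanh hidden activations, Gaussian-initialised Variables for the adaptive parameters, a softmax cross-entropy loss and the ADAM optimiser. The layer sizes shown are illustrative rather than those used in this work.

import tensorflow as tf

d, m, K = 14, 300, 2    # illustrative sizes: 14 features, 300 hidden units, 2 classes

x = tf.placeholder(tf.float32, [None, d])
t = tf.placeholder(tf.float32, [None, K])              # one-hot targets

# One hidden layer with a tanh activation; adaptive parameters are Variables
# initialised from a Gaussian, as described in the text.
W1 = tf.Variable(tf.random_normal([d, m], stddev=0.1))
b1 = tf.Variable(tf.zeros([m]))
h1 = tf.tanh(tf.matmul(x, W1) + b1)

W2 = tf.Variable(tf.random_normal([m, K], stddev=0.1))
b2 = tf.Variable(tf.zeros([K]))
logits = tf.matmul(h1, W2) + b2                        # softmax is applied inside the loss

# Cross-entropy (2.12), generalised to K classes via the softmax output (2.5).
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=t, logits=logits))
train_step = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)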


3. Picking Features

One of the crucial hyper-parameters, discussed in chapter 2, that needs optimising is the dimensionality of the data, d. One could simply pick d to be the highest possible value, which would be equal to the total dimensionality of each measurement, D. For example, if a measurement of position had an x, y and a z coordinate then in this regime d = D = 3. Generally it could be said that for larger values of d the model is more complex, and in theory a more complex model is able to describe more complex relationships. This may seem like nothing but a benefit; however, as detailed by MacKay [28], it is common practice, and indeed recommended, to obey the principle of Occam's razor when solving problems. It could therefore be advantageous to pick d < D provided that performance does not suffer greatly. It could also be the case that due to time constraints there is a limitation placed on the maximum value of d, as computation time scales with d. If time is specifically limited then it may be useful to fix an upper bound on d in advance. Whether or not the dimensionality is fixed, Occam's razor insists that the best solution to the problem is the least complex model that provides the best performance. In other words, if two models have equal or very similar discriminating capability then the less complex of the two should be preferred. Therefore the issue that remains is how the particular subset of the D measurements used to train the NN should be picked. Members of D are known as features, and the process of picking the d members used is known as feature selection.

There are additional motivations for reducing the dimensionality of a dataset. Reduction helps to avoid what is known as the curse of dimensionality. This so-called curse refers to the problems encountered when a dataset is of very large dimension; these are many-fold but generally include a higher computational cost and lower statistical significance. The reduction in statistical significance comes from the fact that a parameter space of extremely large volume is much more difficult to densely populate with data, and therefore datasets of high dimensionality suffer from sparsity. A smaller parameter space also leads to improvements in generalisation, a term given to the problem of ensuring that the predictive power of a NN will hold on previously unseen data sets. For example, if a NN trained on Monte Carlo simulations of ATLAS events could correctly classify real data from the experiment it is said to be generalised. Further comments regarding generalisation will come later.

The information the NN function (as in chapter 2, equation (2.8)) is able to exploit in order to group data into different classes comes from the features that are included. In general the NN combines the features non-linearly. This non-linearity is important to keep in mind when considering which features should be picked from the overall set, and makes the task far from trivial. Nevertheless, there are a few simple relationships that can be used to help with feature selection, as sketched below. Considering an iterative process whereby a single feature is our starting point, it can be said that a second feature should not be added to the set if it is very highly correlated with the starting feature. This is because if the two features are correlated then it is assumed that the second feature does not add a significant amount of new information to the NN and will negligibly improve discriminating power. By plotting a histogram of a given feature and colouring it such that the desired classes are marked according to their truth labels, it is possible to view any linear separations of class that already exist in the data. It can be said that features whose distributions are shaped such that the data is already to some degree separated into the desired classes will provide good separating power. The converse however is not true: because the NN is not limited to linear combinations of features, it cannot be assumed that features whose distributions are not linearly separable are not worthy of inclusion.
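As a concrete illustration of the first of these relationships, the sketch below performs a simple greedy screening that drops any feature whose absolute linear correlation with an already-kept feature exceeds a threshold. This is only a hypothetical first pass (it says nothing about non-linear relationships, which is why empirical testing of the classifier remains necessary); the feature names and threshold are illustrative.

import numpy as np

def drop_correlated(X, names, threshold=0.95):
    """Greedy first pass: keep a feature only if its absolute linear
    correlation with every already-kept feature is below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for i in range(len(names)):
        if all(corr[i, j] < threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

# Toy example: the third column is an exact copy of the first and is dropped.
rng = np.random.RandomState(0)
X = rng.randn(1000, 2)
X = np.column_stack([X, X[:, 0]])
print(drop_correlated(X, ["feature_a", "feature_b", "feature_a_copy"]))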

3.1. Features for hh → bbτ+τ−

In order to discuss the features for the hh → bbτ+τ− decay, the selection criteria must first be made clear. The Monte Carlo simulations used in this analysis were produced subject to selection criteria requiring exactly one hadronically decaying tau particle, exactly one lepton (muon or electron), and exactly two b-jets in the final state. A representation of such a decay can be seen in figure 3.1, where it is clear that the lepton required for selection comes from a leptonically decaying tau particle.

Figure 3.1: A schematic of a proton-proton collision that undergoes some unspecified process (denoted by the striped circle) resulting in the formation of two standard model Higgs bosons h, which in turn decay into products that meet the selection criteria for this analysis. Here l± represents a muon, an electron or either of their anti-particles (depending on sign).

There are many different measurements made within the ATLAS detector that could be used to identify the signature of the signal. Given that the goal of this analysis is to produce a signal versus background classifier, it makes sense to focus on the features that will provide discriminating power between the two, and therefore understanding the background process is crucial. Figure 3.2 shows a process in which a proton-proton collision produces the tt background. At first glance it would appear that these processes would have a final state that would be rejected by the selection criteria, as they are missing a hadronically decaying tau particle. In fact the W+ boson hadronises into two quarks, producing a similar response in the detector to the hadronic jet of the missing tau particle. It should be noted that, alternatively to figure 3.2, the W+ boson could decay leptonically with the W− boson hadronising, and this would still meet the selection criteria. The W boson can also decay into a real tau particle, however the branching fraction of the W± → hadrons decay mode, (67.60 ± 0.27)% (relative to other W± decay modes), is large compared to that of W± → τ±, which has a fraction of


(11.25 ± 0.20)% [29]. If a W boson decays to a real tau, as opposed to the fake tau (hadrons), the other W boson must decay leptonically or the process will be rejected. Furthermore, if the real tau coming from the first boson decays leptonically the process will also be rejected, because in this case there would be two leptons and no hadronic tau.

Figure 3.2: A schematic of a proton-proton collision that undergoes some unspecified process (denoted by the striped circle) resulting in the formation of a top and anti-top quark that in turn decay into products that meet (or fake) the selection criteria for this analysis. Here l− represents a muon or an electron, qu represents a charm or up quark, and qd denotes an anti-down or anti-strange quark.

Now it is our goal to identify which observables will be useful in distinguishing the signal and background processes, checking that they meet our general criteria for features of a NN and then testing their discriminating power. Comparing figures 3.1 and 3.2 it is immediately identifiable that the origin of the b-jets is different. For the signal both b-jets share a parent (the standard model Higgs), whereas in the background the b-jets are always heterogeneous with respect to their parent. It is therefore expected that in signal events the angle of separation between the two jets would be smaller than in background events. In order to discuss such an angle, first the coordinate system used to describe the ATLAS detector must be established, as it is in the overview of the detector [18]. The interaction point is taken as the origin, with the beam travelling along the z-axis. The (x, y) plane is perpendicular to the beam direction, with positive x pointing towards the centre of the LHC ring and positive y pointing up. The azimuthal angle φ is measured around the z-axis, with θ, the polar angle, measured from the z-axis. The quantity known as pseudo-rapidity (η) is defined as in equation (3.1), which allows us to define the separation between two jets, ∆R, as in (3.2).

η = -\ln( \tan(θ/2) ),    (3.1)

∆R = \sqrt{ (∆η)^2 + (∆φ)^2 }.    (3.2)

It is this quantity, ∆R(b, b), that is expected to be smaller for the signal than for the background, based on the origin of the b-jets. Indeed figure 3.3 shows that simulations agree with this expectation: a sharp peak in the signal can be seen at ∆R(b, b) < 1 rad, whereas the background shows a much more spread out peak at around 3 rad. We can therefore surmise that ∆R(b, b) provides a good way to distinguish signal from background in our simulated events.


Figure 3.3: A histogram of ∆R(b, b), the angular separation between the two b-jets, coloured according to whether the event was simulated as signal or background. Here the signal is the RSG at a mass point of 900 GeV with the usual tt background. The histograms have been weighted such that they have unit area.

Another important concept when it comes to selecting features for high energy physics problems is that of reconstructed mass. In general, measuring the 3-momentum and energy of decay products allows you to reconstruct each particle's 4-momentum, p = (E, \vec{p}), where \vec{p} is the ordinary 3-momentum.

p · p = p_µ p^µ = -E^2 + \vec{p} · \vec{p},    (3.3)

E^2 = \vec{p} · \vec{p} + m^2.    (3.4)

Considering a particle decay A → BC, we can assume by conservation of 4-momentum that pA = pB + pC. Imagining a situation where the 3-momenta and energies of particles B and C have been measured at ATLAS, we can reconstruct their 4-momenta. We can then use the aforementioned conservation law to deduce the 4-momentum of A. Once this has been obtained we can use equation (3.3) and the relativistic dispersion relation (3.4) to determine mA. By obtaining mA in this fashion it is said that the mass has been reconstructed. There are a number of different masses that can be reconstructed in order to aid the separation of signal and background. First the reconstructed mass of the entire process shall be considered. Instead of labelling this overall mass as mRSG for the RSG or mH for H, we will call all masses obtained for the full process mhh, as both produce two SM Higgs bosons. Of course in the case of the background process, the two top-quarks are merely mimicking the signature of the Higgs bosons, and in reality the quantity mhh has no physical connection to a Higgs boson for a background event. Given that the signal events are produced by a resonance, their mass spectrum is expected to be a sharp symmetrical peak centred about the particular simulated mass point. In contrast the background events do not come from a resonance, and are therefore expected to form an asymmetric peak starting at the mass of two top-quarks and exponentially falling off with increasing mass. It can be seen in figure 3.4 that this is indeed the case.
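The reconstruction described above amounts to summing measured 4-momenta and taking the invariant length of the result. The sketch below is a toy illustration of that arithmetic (in units where c = 1); it is not the analysis code, and the example momenta are invented.

import numpy as np

def invariant_mass(four_momenta):
    """Reconstruct the mass of a parent particle from the summed 4-momenta
    (E, px, py, pz) of its measured decay products, using m^2 = E^2 - |p|^2
    as follows from (3.3) and (3.4)."""
    E, px, py, pz = np.sum(np.asarray(four_momenta), axis=0)
    m_squared = E ** 2 - (px ** 2 + py ** 2 + pz ** 2)
    return np.sqrt(max(m_squared, 0.0))   # guard against small negative values

# Toy example: two back-to-back 5 GeV massless jets reconstruct to a
# 10 GeV parent.
p_b1 = (5.0, 0.0, 0.0, 5.0)
p_b2 = (5.0, 0.0, 0.0, -5.0)
print(invariant_mass([p_b1, p_b2]))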

Figure 3.4: A histogram of mhh coloured according to whether the event was simulated as signal or background. Here the signal is the RSG at a mass point of 600 GeV with the usual tt background. The histograms have been weighted such that they have unit area.

Mass reconstruction can also be carried out on individual branches of the decay to gain additional information about the process. Considering the signal process as in figure 3.1, reconstructing the mass of the h → τ−τ+ portion of the diagram ought to give a different result to reconstructing a mass from the corresponding decay products in the background process of figure 3.2. Both of these reconstructions would ideally account for the neutrinos that exist in the decay products; however, because the ATLAS detector has no sensitivity to neutrinos, conventional methods limit us to reconstructing the visible mass mvis. This quantity is reconstructed using the decay products circled in blue in figure 3.5.

Figure 3.5: The portions of the signal and background processes relating to the features mvis and mMMC. The dashed blue circles show the particles used to reconstruct mvis, whereas the dashed red circles show that mMMC is reconstructed by attempting to account for neutrinos (which cannot be detected by ATLAS).

The missing mass calculator (MMC) technique is a statistical tool that aims to reconstruct a mass that does account for the neutrinos [30]. Figure 3.5 also shows, in red circles, the decay products and their parents that the missing mass calculator attempts to reconstruct into a mass, referred to as mMMC. The expected difference between the signal and background distributions of simulated events has its origin, as for mhh, in the fact that the signal comes from a resonance decay whereas the background does not. Unlike mhh, where the reconstructed invariant masses are very different, it is predicted instead that both the signal and background distributions peak at around the same value. The similarity in the location of the peaks comes from the fact that the mass the method is attempting to reconstruct is the same for both processes (it is not until the RSG or H part of the process that the mass of any unknown particle would present itself). The only difference in mMMC between signal and background would therefore be the same difference in the shapes of the peaks as for mhh, due to the fact that only the signal process originates from a resonance. Figure 3.6 agrees with these predictions, and shows that the background has a wider peak with a larger tail than in the case of mhh.

Figure 3.6: A histogram of mMMC coloured according to whether the event was simulated as signal or background. Here the signal is H at a mass point of 300 GeV with the usual tt background. The histograms have been weighted such that they have unit area.

There are many other measurements that may serve as features for a NN trained to separate these two signal and background processes; however they are left out of the discussion for the sake of avoiding repetition. Some of the features not mentioned in this chapter are selected based on similar arguments to those already presented. Further features have not been discussed simply because the NN is able to gain some discriminating power from features that are useful for no obvious or intuitive physical reason. That is not to say that the NN has some mysterious power to see separations in the data that are not really present, but merely that these separations manifest in parameter spaces that exceed the number of dimensions beyond which sensible visualisations break down. The features that fall into this category are picked or left out of the NN by empirical testing of the classifier's performance with and without them. A list of the features used in this work is included in table A.1 in appendix A. All features discussed in this chapter are of a derived nature with respect to the discussion in chapter 2; further comment will be made on primitive and derived features in chapter 5.


4. Hyper-Parameter Selection

Getting high performance from a NN training relies on good hyper-parameter selection. As stated in chapter 2, the hyper-parameters that need to be optimised are: the dimensionality of the data, the number of hidden layers, the size of each hidden layer and the choice of each activation function. In the previous chapter the dimensionality of the data was discussed in the context of feature selection. This hyper-parameter stands out as somewhat unique, in that it is picked based on rules regarding the features themselves and so will be very different based on what the data actually represents. The other hyper-parameters can be selected in a process that is much more similar across different datasets.

4.1. Hidden Layers

Two closely linked hyper-parameters are the number of hidden layers and the size of each hidden layer. To develop an understanding of what each hidden unit does, we will begin by considering a NN with a single hidden layer. The restriction to a single hidden layer constrains separating lines in the parameter space to being straight, because non-linear responses from hidden units come from the weighted mixing of linear responses from previous hidden units. To aid the discussion, a range of datasets shall be considered that consist merely of points described by two spatial coordinates; the parameter space is therefore simply the X, Y plane reduced to a unit square. It would be possible to get a non-linear response from a hidden unit in the first hidden layer if a node in the input layer were fed with a non-linear combination of features, for example with X^2. In the following examples we will consider NNs whose input layer consists of two nodes, one for the feature X and the other for Y. It is also important at this stage to introduce the concept of a decision surface. A decision surface is a contour line in the feature space at a special value related to decision making. If a NN outputs a number between 0 and 1, where 0 is the response the NN gives for something it considers strongly to be background and 1 for signal, then the relevant decision surface is the contour line where the response is equal to 0.5. The decision boundary is termed a surface because in general the feature space can be of many dimensions.

The plots in figure 4.1 show the easiest kind of dataset to separate into two classes. Full signal and background separation can be achieved by some hypothetical NN with a single hidden unit. It is also easy to see that for this very simple example the decision surface and the output from the hidden unit are identical. In fact a NN with no hidden layers whatsoever would be able to fully separate this data. This idea can be extended to form a guideline to be taken into account when picking the number of hidden layers: if the output of a hidden unit in the last hidden layer of a NN is identical to the output of the overall NN, then the information required to produce that output must have been present in the previous layer, and therefore that layer can be removed without loss of accuracy.

Figure 4.1: Fake data points of signal (squares) and background (circles). The blue line in the left plot shows the response from a single hidden unit; the red line in the right plot shows the decision surface.

Considering a dataset where signal and background are harder to separate, a hidden layer becomes necessary, and one with more than a single hidden unit at that. This is demonstrated in figure 4.2, where the response from three hidden units has been plotted. It is clear from the diagram that any fewer than three lines would not result in complete separation of signal and background. Another observation is that the decision surface appears to envelope an area similar to that of the output from the hidden units, with some smoothing that results in rounded corners. This smoothing effect is the result of the activation functions H that were introduced in chapter 2.

Figure 4.2: Fake data points of signal (squares) and background (circles). The blue lines in the left plot show the responses from three different hidden units; the red line in the right plot shows the decision surface.

So far NNs with only a single hidden layer have been discussed in order to get an idea of what happens in a hidden unit. Next we shall discuss what happens in hidden units beyond the first hidden layer. Figure 4.3 shows a hypothetical response from a NN trained on a fake dataset, for a NN with four hidden units in the first hidden layer and two hidden units in the second. All four responses from the first hidden layer are shown in the top left plot; the responses from the second hidden layer are divided across the plots in the top right and bottom left corners. It is clear from looking at the responses from the second hidden layer that they are simply mixings of the responses from the first hidden layer.

Figure 4.3: Plots showing the output of a hypothetical neural network with two hidden layers of size four and two respectively. The response from the first hidden layer, where the arrows mark on which side of each line the response favoured signal (top left). The response from the second hidden layer with one unit per plot (top right, bottom left), and the overall decision surface (bottom right).

It is interesting to note that a NN that was given X × Y as a feature would have been able to attain separation of the data with fewer layers. The hidden layers can thus be understood as mimicking non-linear combinations of the features. It has indeed been shown that NNs with multiple hidden layers can mimic multiplication, powering, and division of features [31]. The number of hidden layers and their respective sizes for this analysis were picked by trial and error, but the starting point was chosen based on the fact that the NN was not given any combinations of features (so multiple layers would be required), and that the data was not easy to separate (so these layers would potentially need a large number of units each).


4.2. Activation Functions & Optimisation Algorithms

As expressed in chapter 2, it is quite possible to have a different activation function at every hidden layer of a NN. For the types of NN built in this analysis the activation function chosen at every hidden layer was the same. Activation functions have the same role as the step function in Rosenblatt's perceptron [8], however they must never be multi-valued for a given input. There are a few functions that are popular choices of activation function; as seen in figure 4.4 they are all approximately 's' shaped, and therefore sigmoidal, though only one is the true vanilla sigmoid function. In general, functions which are symmetric about the origin are preferred [32]; for this reason the vanilla sigmoid function is a poor choice compared to the other two. Choosing between the softsign and tanh functions is difficult due to their similarity. In this analysis both activations were tried on a subset of the overall data and the tanh function was chosen. This choice was motivated merely by the fact that the tanh function seems to be the more common choice, as there was no perceived difference in performance.

Figure 4.4: A graphical representation of three popular activation functions. The legend shows the mathematical form of the functions (apart from for the common hyperbolic tangent function). Here f(x) = sigmoid(x) and h(x) = softsign(x).

The gradient descent algorithm is the most common choice for optimising NNs. As previously discussed in chapter 2, the algorithm sets the values of the adaptive parameters so as to minimise a loss function. Iterative steps are taken, during which the algorithm uses the differential of the function that it is optimising (its gradient) and aims to find a local minimum (hence gradient descent). The fact that the algorithm will find a local minimum instead of the global minimum of the function means that the algorithm performs best on functions with a single minimum. Gradient descent in its normal form updates the adaptive parameters after a batch of data has been processed; alternatively, if the batch size is reduced to one, the algorithm is called stochastic gradient descent. The algorithm used in this analysis is the ADAM optimiser. It is a variant of gradient descent that stores exponentially decaying averages of past gradients and past squared gradients. The algorithm uses this information to speed up convergence, and in general ADAM, or another faster version of gradient descent, should be used if time is precious; otherwise stochastic gradient descent is slower but more robust [27].
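For reference, the sketch below contrasts a plain gradient descent update with a single ADAM update following the published algorithm [26]; the decay constants and step size shown are the usual published defaults, not values tuned for this work.

import numpy as np

def gradient_descent_step(theta, grad, lr=0.01):
    """Vanilla gradient descent: step the parameters against the gradient."""
    return theta - lr * grad

def adam_step(theta, grad, state, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update: exponentially decaying averages of past gradients (m)
    and past squared gradients (v), with bias correction for early steps."""
    m, v = state
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad ** 2
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v)

theta = np.zeros(3)
state = (np.zeros(3), np.zeros(3))
theta, state = adam_step(theta, np.array([0.5, -1.0, 2.0]), state, t=1)
print(theta)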


5. Results & Analysis

5.1. General Methodology

All NN results presented were trained using the TensorFlow software. All networks used tanh activation functions, the softmax output function and a cross-entropy cost function. The TensorFlow tf.nn.softmax_cross_entropy_with_logits combined implementation of the output and cost functions was used to avoid any potential divergences. The value of the cost function is monitored at each epoch and checked to see if it is the lowest for the given training; if no improvement is found within five epochs the training terminates. A value known as the learning rate determines the speed at which the algorithm attempts to converge on a minimum; it was set to decay exponentially. The decay is governed by the following formula at each epoch

learning rate = initial rate × (decay base)^(epoch / decay epochs).    (5.1)

The values in equation (5.1) were set such that the learning rate decayed to 10% of its initial value in 10 epochs, where the initial rate was 0.001. In all cases the hold-out method was used, whereby training and testing datasets were isolated from one another in advance. The NN is trained on the training set and then performance is measured with the testing set, previously unseen by the network. A means of accounting for event weights was implemented and will be detailed in section 5.2.
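The stated numbers fix the schedule (5.1) completely: an initial rate of 0.001, a decay base of 0.1 and a decay period of 10 epochs. A minimal sketch implementing the formula is shown below; TensorFlow's tf.train.exponential_decay offers an equivalent built-in schedule, though whether it was used here is not stated.

def decayed_learning_rate(epoch, initial_rate=0.001, decay_base=0.1, decay_epochs=10):
    """Exponential decay of the learning rate, equation (5.1)."""
    return initial_rate * decay_base ** (epoch / float(decay_epochs))

for epoch in (0, 5, 10):
    print(epoch, decayed_learning_rate(epoch))
# epoch 0 gives 0.001, epoch 10 gives 0.0001 (10% of the initial value)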

Figure 5.1: The signal and background NN output distributions for the best performing training of the H signal at 300 GeV.


Histograms of the NN output for signal events and background events were normalised to unit area and plotted on the same axes, as in figure 5.1. The NN output was also plotted for the testing data, to check that there are no large discrepancies between the testing and training outputs. Discrepancies would indicate that the NN had learned patterns in the noise present in the data rather than the patterns that yield separation in general. A NN that learns a pattern in noise is said to have been over-trained, whilst the issue of ensuring a network will perform as predicted on unseen data is known as generalisation. Comparing these distributions is done in the standard way by plotting a receiver operating characteristic (ROC) curve, which effectively gives a plot of background rejection versus signal efficiency. The area under curve (AUC) of the ROC curve is then equal to 1 for completely separated classes, and would have an expectation value of 0.5 if labels were assigned to events at random. Comparing ROC curves is little more useful than just comparing AUCs, therefore to save space NN outputs will be discussed by comparing the latter.
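As an illustration of how such a figure of merit can be computed from a trained network's outputs, the sketch below uses scikit-learn; the report does not state which tool was actually used to evaluate the AUCs, and the toy labels and scores are invented.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# y_true: 1 for signal, 0 for background; y_score: NN output on the test set.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.8])

auc = roc_auc_score(y_true, y_score)                  # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
background_rejection = 1.0 - fpr                      # to plot against signal efficiency (tpr)
print(auc)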

Signal (mass)    Hidden layers    Nodes per hidden layer    AUC
RSG (900 GeV)    2                600                       0.973
                 4                300                       0.991
                 8                150                       0.971
                 16               75                        0.870
H (900 GeV)      2                600                       0.967
                 4                300                       0.989
                 8                150                       0.945
                 16               75                        0.898
RSG (600 GeV)    2                600                       0.971
                 4                300                       0.984
                 8                150                       0.912
                 16               75                        0.843
H (600 GeV)      2                600                       0.986
                 4                300                       0.985
                 8                150                       0.957
                 16               75                        0.899
RSG (300 GeV)    2                600                       0.881
                 4                300                       0.861
                 8                150                       0.738
                 16               75                        0.697
H (300 GeV)      2                600                       0.940
                 4                300                       0.927
                 8                150                       0.872
                 16               75                        0.700

Table 5.1: Results contrasting NNs of few hidden layers with many nodes each and NNs of many hidden layers with few nodes each.

Hyper-parameters were picked by first performing a scan that aimed to contrast deep networks of many hidden layers but few nodes in each with shallow networks of few hidden layers but many nodes in each. The results in table 5.1 show the AUCs for networks of 2 to 16 hidden layers. This initial sweep through configurations was used to get a general picture of the number of hidden layers and nodes per hidden layer that would perform best. After these results had been analysed, two hyper-parameter scans were carried out whereby one hyper-parameter was iterated over whilst all others were fixed. For a fixed number of nodes per hidden layer (300), the number of hidden layers was iterated between 1 and 10. It was found that on average NNs with 3 hidden layers performed best. Secondly, the number of nodes per hidden layer was iterated between 75 and 750 for NNs of 3 hidden layers. Full results of both these scans can be found in appendix B.

All NNs were trained on the same fourteen features, detailed in table A.1, as previously mentioned. The feature set contains both primitive and derived features, which, given the discussion in chapter 4, ought to be at least somewhat unnecessary. The reason for this is that a NN of multiple hidden layers should be able to combine primitive features in such a way that the mathematical operations that result in the derived features are approximated. Some of the derived features in the set may in principle be redundant, however the feature set presented was the one for which the best AUC was attained on average across all signals. Training specific signals and their mass points on different feature sets was not viable given the time constraints placed on this work, but would be interesting to try in the future.

5.2. Best Performing Networks

The best performing NNs for each of the signal and mass point combinations that were considered are detailed in table 5.2. It is clear that for both the H and RSG the higher mass points are easier to classify than their lower mass counterparts. It is also evident that the best performing NNs for high mass points had fewer nodes per hidden layer in general. An interpretation of these relationships is that because the higher mass points are easier to classify, fewer nodes are required to do so (signal and background are less mixed up). The number of hidden layers for each of the best performing NNs does not vary greatly, especially compared with the variation in nodes per hidden layer. Given the interpretation that additional hidden layers allow the network to approximate more complex mathematical functions, this can be taken to mean that for each of the signals considered a similar amount of complexity was required.

Signal (mass)    Hidden layers    Nodes per hidden layer    AUC
RSG (900 GeV)    3                225                       0.991
H (900 GeV)      3                225                       0.989
RSG (600 GeV)    4                300                       0.984
H (600 GeV)      2                600                       0.986
RSG (300 GeV)    2                600                       0.881
H (300 GeV)      3                750                       0.946

Table 5.2: The best performances in terms of AUC for each signal at each mass point.

These results have been compared with internal results from the ATLAS run 2 analyses of the same decay. Across all signals and mass points it can be said that the discriminating power of these NNs is not yet competitive with the best performing BDTs. Discussions with those who work on this analysis have led to the conclusion that the event weight implementation is suspected to be the reason for the discrepancy in accuracy. In basic terms the difference between implementations is that TMVA takes event weights into account when training (as it was designed for use in HEP), whereas in this work event weights were implemented by artificially duplicating events based on their event weight. Specifically, the event weight, which ranges from roughly 0–4, was multiplied by 10, rounded down to the nearest integer, and then the corresponding entry was duplicated by a factor of that integer. This is problematic for two reasons: firstly, duplicating events is not best practice when performing Monte Carlo simulations (otherwise generating more simulated events would be trivial), and secondly, event weight precision is reduced by using this method. Event weights commonly have up to ten digits after the decimal place, and so by multiplying by only 10 before casting to an integer much of this precision is lost; however, a problem with using a larger number than 10 is greatly increased computational demand. Another potential reason for the lack of performance of the NNs is the fact that only NNs in which the number of nodes per hidden layer was constant across all hidden layers were used. Similarly to the case of picking different feature sets for different signals, picking different sizes for different hidden layers could have proved fruitful but was also infeasible due to time constraints.
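For clarity, the duplication scheme described above can be written in a few lines. This is an illustrative reconstruction of the described procedure rather than the actual analysis code, and details such as the handling of events whose scaled weight floors to zero are assumptions.

import numpy as np

def duplicate_by_weight(events, weights, scale=10):
    """Approximate event weights by duplication: multiply each weight by
    `scale`, floor to an integer, and repeat the event that many times.
    Precision below 1/scale is lost, and events whose scaled weight floors
    to zero are dropped entirely."""
    repeats = np.floor(np.asarray(weights) * scale).astype(int)
    return np.repeat(events, repeats, axis=0)

events = np.array([[1.0, 2.0], [3.0, 4.0]])
print(duplicate_by_weight(events, [0.37, 2.5]).shape)   # (3 + 25, 2) = (28, 2)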

5.3. Addressing Generalisation

As mentioned previously in section 5.1, the hold-out method was used to combat over-training, and whilst the training and test distributions in figure 5.1 appear to agree closely, the issue of generalisation has not been properly addressed. When it comes to high energy physics analyses, having reproducible results is in general more important than getting the result that was desired. For this reason it is important to note that this result does not take into account whatsoever the variance of a NN training. A crude test was performed where the same NN was run five separate times and the resulting AUC was found not to change; however this is not enough to rule out the role of variance when it comes to analysing a NN output. The problem is that for a given configuration the variance may be found to be low on Monte Carlo simulated events of a given mass point, but this says nothing about the potential variance for other mass points, real data or alternative network configurations. Several methods exist within the field of machine learning that are designed to reduce over-training. One such method, dropout [33], is widely used and has an implementation in TensorFlow; it is suggested that the use of this technique in HEP analyses is looked at in more detail.
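As a pointer for such future work, the sketch below shows how dropout could be attached to a hidden layer using the TensorFlow 1.x API (tf.nn.dropout); the keep probability is a placeholder so that units are dropped only during training and kept during evaluation. The layer sizes and keep probability are illustrative, not values tested in this work.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 14])
keep_prob = tf.placeholder(tf.float32)                 # probability of keeping a unit

W1 = tf.Variable(tf.random_normal([14, 300], stddev=0.1))
b1 = tf.Variable(tf.zeros([300]))
h1 = tf.tanh(tf.matmul(x, W1) + b1)
h1_dropped = tf.nn.dropout(h1, keep_prob)              # randomly zeroes hidden units

# Feed keep_prob = 0.5 (for example) when training and keep_prob = 1.0 when
# evaluating, so that no units are dropped while measuring performance.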


6. Conclusions

6.1. Summary

The issue of separating signal and background events in data from an experiment such as the ATLAS detector at the LHC has been explained. The significance of being able to classify events has been framed in terms of the analysis of the X → hh → bbτ+τ− decay. Monte Carlo simulations for X as a Randall-Sundrum graviton and as a two-Higgs-doublet model heavy Higgs have been considered. NNs have been proposed as a means of classification, and trained on the simulated data of the signals at a range of mass points (300, 600, 900 GeV). Basics of the TensorFlow software, which is relatively new in terms of its usage in the field, have been given. The mathematical underpinnings of the neural network have been detailed in chapter 2. With this understanding of the elements that a NN is comprised of, both mathematically and in the software, it is hoped that the usage of NNs as a black box is shown to be unnecessary and far from best practice.

Feature selection was introduced as a necessary part of building a successful NN classifier. The concept of Occam’s razor was used to motivate picking a subset of the overall set of available features. The ∆R(b, b), m_hh and m_MMC features were highlighted specifically, and motivations were given as to why they make good candidates to be members of this subset. In a similar fashion the selection of hyper-parameters was discussed. The role of a hidden unit in a single-layer network was demonstrated using plots of synthetic data imagined to have been classified by some hypothetical NN. The discussion continued by extending this analogy to a NN with two hidden layers. Armed with this understanding, the process of picking both hyper-parameters and features by trial and error is made faster.

Finally, the results of the analysis have been detailed. First, the general methodology that applies to all NNs that were trained was explained. Then details were given of the specific hyper-parameter configurations that were scanned over. Interpretations were made, in line with the understanding developed in previous chapters, of the relationship between the difficulty of the classification problem and the hyper-parameter values that performed the task best. It was found that the classification power of these NNs is not yet competitive with the rival algorithms being used on the ATLAS run 2 data. The reason for this is thought to lie in the difference in the treatment of event weights. Comments were made on the issue of generalisation, and it was noted that this work lacks a quantifiable way to measure the variance of the results, and therefore whether or not the NNs generalise. The dropout method was proposed as a technique that may help in this regard.


6.2. Future Research

There is a particular difference between traditional machine learning problems and HEP analyses that would be interesting to address in the future. When training a NN classifier on a problem with a real world application, for example image recognition, the number of records (events) in each class is in general similar. For example, if a NN is to be trained to classify images as either oranges or apples, it would usually be trained on a set of images in which each class appears roughly 50% of the time. In the Monte Carlo simulations of particle physics decays, interest usually lies in searching for an as yet undetected particle. Naturally it is expected that particles which have yet to be discovered are produced far less frequently than those that are already known (which make up the background). For this reason the events in the simulations are heavily skewed towards background, that is to say there are far more background events than signal events in a given simulated dataset. It is therefore proposed that artificially sampling an equal number of signal and background events at each training iteration may change the output distribution compared with random sampling from the entire simulated dataset.
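A minimal sketch of such a balanced sampling scheme is shown below; the array names, batch size and sampling with replacement are illustrative assumptions rather than a tested implementation.

import numpy as np

def balanced_batch(X, y, batch_size=256, rng=np.random):
    """Draw a mini-batch with equal numbers of signal (y == 1) and background (y == 0) events."""
    sig = np.flatnonzero(y == 1)
    bkg = np.flatnonzero(y == 0)
    half = batch_size // 2
    idx = np.concatenate([rng.choice(sig, half, replace=True),
                          rng.choice(bkg, half, replace=True)])
    rng.shuffle(idx)                 # mix signal and background within the batch
    return X[idx], y[idx]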

Finally, a short comment on the viability of the use of TensorFlow in the field of HEP. The TensorFlow software and the corresponding documentation and tutorials available on the web form an excellent basis for someone looking to understand and implement their own NNs. The software is very versatile, and at one stage prototypes of the NNs used in this work were running across four GPU cards simultaneously. Such parallelism is hard to achieve with other packages used in HEP. Despite this large advantage, the fact that there is no native way in TensorFlow to account for event weights such as those used in this analysis is a barrier to entry for any high energy physicist. On the other hand, problems of an image recognition nature are also sometimes dealt with in HEP; for these, the lack of event weighting is not an issue, and the use of TensorFlow is thoroughly recommended.
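For illustration, a minimal sketch of how parts of a TensorFlow graph can be pinned to different GPUs is given below; the device strings, tensor shapes and four-way split are assumptions made for the example, not the prototype configuration used in this work.

import tensorflow as tf

towers = []
for i in range(4):
    with tf.device('/gpu:%d' % i):                   # pin this branch of the graph to GPU i
        x = tf.random_normal([128, 14])              # one mini-batch of 14 features per GPU
        w = tf.Variable(tf.truncated_normal([14, 300], stddev=0.1))
        towers.append(tf.nn.relu(tf.matmul(x, w)))

merged = tf.add_n(towers)                            # combine the per-GPU results

config = tf.ConfigProto(allow_soft_placement=True)   # fall back to CPU if a GPU is unavailable
with tf.Session(config=config) as sess:
    sess.run(tf.initialize_all_variables())
    sess.run(merged)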


Bibliography

[1] Georges Aad et al. (ATLAS Collaboration). Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Phys. Lett., B716:1–29, 2012. doi: 10.1016/j.physletb.2012.08.020.

[2] Serguei Chatrchyan et al. (CMS Collaboration). Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Phys. Lett., B716:30–61, 2012. doi: 10.1016/j.physletb.2012.08.021.

[3] G. C. Branco, P. M. Ferreira, L. Lavoura, M. N. Rebelo, Marc Sher, and Joao P. Silva. Theory and phenomenology of two-Higgs-doublet models. Phys. Rept., 516:1–102, 2012. doi: 10.1016/j.physrep.2012.02.002.

[4] Lisa Randall and Raman Sundrum. A large mass hierarchy from a small extra dimension. Phys. Rev. Lett., 83:3370–3373, 1999. doi: 10.1103/PhysRevLett.83.3370.

[5] Lisa Randall and Raman Sundrum. An alternative to compactification. Phys. Rev. Lett., 83:4690–4693, 1999. doi: 10.1103/PhysRevLett.83.4690.

[6] Rene Brun and Fons Rademakers. ROOT - an object oriented data analysis framework. In AIHENP'96 Workshop, Lausanne, volume 389, pages 81–86, 1996.

[7] A. Hoecker et al. TMVA - Toolkit for Multivariate Data Analysis. ArXiv Physics e-prints, March 2007.

[8] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.

[9] Liliana Teodorescu. Artificial neural networks in high-energy physics. 2008. URL http://cds.cern.ch/record/1100521.

[10] Hermann Kolanoski. Application of artificial neural networks in particle physics. 1995. doi: 10.1016/0168-9002(95)00743-1. URL https://pdfs.semanticscholar.org/8da9/68cfb8e41331a7385b3367c464e5706e0111.pdf.

[11] M. Pogwizd, L. J. Elgass, and P. C. Bhat. Bayesian Learning of Neural Networks for Signal/Background Discrimination in Particle Physics. ArXiv e-prints, July 2007.

[12] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Enhanced Higgs Boson to τ+τ− Search with Deep Learning. Phys. Rev. Lett., 114(11):111801, 2015. doi: 10.1103/PhysRevLett.114.111801.


[13] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for Exotic Particles in High-Energy Physics with Deep Learning. Nature Commun., 5:4308, 2014. doi: 10.1038/ncomms5308.

[14] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. CoRR, abs/1603.02754, 2016. URL http://arxiv.org/abs/1603.02754.

[15] Byron P. Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor. Boosted decision trees, an alternative to artificial neural networks. Nucl. Instrum. Meth., A543(2-3):577–584, 2005. doi: 10.1016/j.nima.2004.12.018.

[16] Petra Perner, Uwe Zscherpel, and Carsten Jacobsen. A comparison between neural networks and decision trees based on data from industrial radiographic testing. Pattern Recognition Letters, 22(1):47–54, 2001. ISSN 0167-8655. doi: 10.1016/S0167-8655(00)00098-2. URL http://www.sciencedirect.com/science/article/pii/S0167865500000982. Machine Learning and Data Mining in Pattern Recognition.

[17] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

[18] Georges Aad et al. (ATLAS Collaboration). The ATLAS experiment at the CERN Large Hadron Collider. Journal of Instrumentation, 3(08):S08003, 2008. URL http://stacks.iop.org/1748-0221/3/i=08/a=S08003.

[19] Georges Aad et al. (ATLAS Collaboration). Searches for Higgs boson pair production in the hh → bbττ, γγWW∗, γγbb, bbbb channels with the ATLAS detector. Phys. Rev. D, 92:092004, Nov 2015. doi: 10.1103/PhysRevD.92.092004. URL http://link.aps.org/doi/10.1103/PhysRevD.92.092004.

[20] CMS Collaboration. Search for resonant Higgs boson pair production in the bbτ+τ− final state using 2016 data. 2016.

[21] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991. ISSN 0893-6080. doi: 10.1016/0893-6080(91)90009-T. URL http://www.sciencedirect.com/science/article/pii/089360809190009T.

[22] David Ascher, Paul F. Dubois, Konrad Hinsen, James Hugunin, and Travis Oliphant. Numerical Python. Lawrence Livermore National Laboratory, Livermore, CA, ucrl-ma-128569 edition, 1999.

[23] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738.

[24] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936. ISSN 2050-1439. doi: 10.1111/j.1469-1809.1936.tb02137.x. URL http://dx.doi.org/10.1111/j.1469-1809.1936.tb02137.x.

[25] Patrice Y. Simard, Dave Steinkraus, and John C. Platt. Best practices for convolutional neural networks applied to visual document analysis. Institute of Electrical and Electronics Engineers, Inc., August 2003. URL https://www.microsoft.com/en-us/research/publication/best-practices-for-convolutional-neural-networks-applied-to-visual-document-analysis/.

[26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

[27] Sebastian Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016. URL http://arxiv.org/abs/1609.04747.

[28] David MacKay. Bayesian methods for adaptive models. 1991.

[29] C. Patrignani et al. Review of Particle Physics. Chin. Phys., C40(10):100001, 2016. doi: 10.1088/1674-1137/40/10/100001.

[30] A. Elagin, P. Murat, A. Pranko, and A. Safonov. A New Mass Reconstruction Technique for Resonances Decaying to di-tau. Nucl. Instrum. Meth., A654:481–489, 2011. doi: 10.1016/j.nima.2011.07.009.

[31] Kai-Yeung Siu and Vwani Roychowdhury. Optimal depth neural networks for multiplication and related problems. Advances in Neural Information Processing Systems, pages 59–59, 1993.

[32] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

[33] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2627435.2670313.


Appendix A: Full Features List

Feature            Description

m_bb               The reconstructed mass of the di-b-jet system.
m_hh               The reconstructed mass of the di-Higgs system, or its mimicked signature in the case of the background.
m_vis              The reconstructed mass of the visible products of the di-tau system, or the fake tau where relevant.
∆R(b, b)           The angular separation between the two b jets.
p_T(b2)            The transverse momentum of the sub-leading b jet.
p_T(b, b)          The transverse momentum of the di-b-jet system.
p_T(l)             The transverse momentum of the lepton.
p_T(τ)             The transverse momentum of the hadronically decaying tau.
p_T(l, τ)          The transverse momentum of the lepton–tau system, where the tau decays hadronically.
E_T^miss           The missing transverse energy.
H_T                The scalar sum of E_T^miss and the p_T of all jets, the lepton, and the hadronically decaying tau.
∆φ(l, τ)           The difference in φ angle between the lepton and the hadronically decaying tau.
m_T(l, E_T^miss)   The transverse mass of the lepton–E_T^miss system¹.
∆p_T(l, τ)         The difference in transverse momentum between the lepton and the hadronically decaying tau.

Table A.1: A full list of features used to train NNs in chapter 5.

¹ m_T(l, E_T^miss) = √(2 p_T^l E_T^miss (1 − cos ∆φ))
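As a purely illustrative aside, a minimal sketch of how two of the features above could be computed from reconstructed quantities is given below; the function and argument names are assumptions and do not correspond to the analysis code used in this work.

import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular separation dR = sqrt(deta^2 + dphi^2), with dphi wrapped into (-pi, pi]."""
    dphi = (phi1 - phi2 + np.pi) % (2.0 * np.pi) - np.pi
    return np.sqrt((eta1 - eta2) ** 2 + dphi ** 2)

def transverse_mass(pt_lep, met, dphi):
    """Transverse mass m_T = sqrt(2 * pt_lep * met * (1 - cos(dphi))), as in footnote 1."""
    return np.sqrt(2.0 * pt_lep * met * (1.0 - np.cos(dphi)))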


Appendix B: Hyper-Parameter Scan

Signal (mass)     Hidden layers   AUC
RSG (900 GeV)     1               0.987
                  2               0.979
                  3               0.978
                  4               0.987
                  5               0.987
                  6               0.981
                  7               0.977
                  8               0.945
                  9               0.942
                  10              0.920
H (900 GeV)       1               0.981
                  2               0.981
                  3               0.986
                  4               0.988
                  5               0.984
                  6               0.984
                  7               0.971
                  8               0.956
                  9               0.908
                  10              0.878
RSG (300 GeV)     1               0.803
                  2               0.876
                  3               0.879
                  4               0.861
                  5               0.840
                  6               0.825
                  7               0.755
                  8               0.711
                  9               0.652
                  10              0.634
H (300 GeV)       1               0.890
                  2               0.932
                  3               0.931
                  4               0.929
                  5               0.921
                  6               0.901
                  7               0.878
                  8               0.854
                  9               0.788
                  10              0.753

Table B.1: A scan over number of hidden layers, with nodes per hidden layer fixed at 300.


Signal (mass)     Nodes per hidden layer   AUC
RSG (900 GeV)     75                       0.979
                  150                      0.989
                  225                      0.991
                  300                      0.989
                  375                      0.983
                  450                      0.985
                  525                      0.968
                  600                      0.979
                  675                      0.974
                  750                      0.982
H (900 GeV)       75                       0.982
                  150                      0.985
                  225                      0.989
                  300                      0.986
                  375                      0.981
                  450                      0.987
                  525                      0.984
                  600                      0.986
                  675                      0.988
                  750                      0.982
RSG (300 GeV)     75                       0.847
                  150                      0.861
                  225                      0.857
                  300                      0.866
                  375                      0.869
                  450                      0.877
                  525                      0.880
                  600                      0.880
                  675                      0.877
                  750                      0.880
H (300 GeV)       75                       0.876
                  150                      0.914
                  225                      0.924
                  300                      0.931
                  375                      0.932
                  450                      0.940
                  525                      0.940
                  600                      0.937
                  675                      0.940
                  750                      0.946

Table B.2: A scan over nodes per hidden layer, with the number of hidden layers fixed at 3.


Appendix C: Technical Specifications

• CPU - Intel(R) Xeon(R) E5530

• GPU - Nvidia Tesla K40c with driver version 352.93

• TensorFlow 0.9.0

• CUDA 7.5 with CuDNN (prerequisites for TensorFlow GPU utilisation)

• Python 2.7.11 with Anaconda 4.1.0 (64-bit)

• GCC 4.4.7

• ROOT 6.06
