alexander radovic college of william and mary · cvn @ dune adam aurisano university of cincinnati...
TRANSCRIPT
CVN @ DUNE
Adam Aurisano University of Cincinnati
Alexander Radovic College of William and Mary
Evan Niner Fermilab
Fernanda Psihas Indiana University
Alex Himmel Fermilab
Alexander Radovic CVN@DUNE1
Introduction• Convolutional neural networks have proven to be extremely
effective for event identification at the NOvA experiment: https://arxiv.org/abs/1604.01444 • MicroBoone has recently show us that liquid argon
detector readout is a similarly excellent domain for CNNs: https://arxiv.org/abs/1611.05531 • This leads to a natural next question. Can we take the art
based tools and expertise from the NOvA CNN, named CVN, and apply it on DUNE MC to rapidly produce a competitive PID?
• If we can it could prove to be a powerful tool for benchmarking alternative detector designs, hit finding tools, and TDR numbers.
Alexander Radovic CVN@DUNE2
Neural Networks
y
Alexander Radovic CVN@DUNE3
Neural Networks
x = input vector
y
y = � (Wx+ b)
� =
Alexander Radovic CVN@DUNE4
Convolutional Neural Networks
http://setosa.io/ev/image-kernels/
Input Feature MapKernel
Instead of training a weight for every input pixel, try learning weights that describe kernel operations, convolving that kernel across the entire image to exaggerate useful features. Inspired by research showing that cells in the visual cortex are only responsive to small portions of the visual field.
Convolutional Neural Networks
Feature Map
https://developer.nvidia.com/deep-learning-courses
Instead of training a weight for every input pixel, try learning weights that describe kernel operations, convolving that kernel across the entire image to exaggerate useful features. Inspired by research showing that cells in the visual cortex are only responsive to small portions of the visual field.
Alexander Radovic CVN@DUNE6
Convolutional Layers• Every trained kernel operation is the same across an entire
input image or feature map. • Each convolutional layer trains an array of kernels to
produce output feature maps.• Weights for a given
convolutional layer are a 4D tensor of NxMxHxW (number of incoming features, number of outgoing features, height, and width)
Alexander Radovic CVN@DUNE7
Pooling Layers• Intelligent downscaling of input feature maps. • Stride across images taking either the maximum or average
value in a patch. • Same number of feature maps, with each individual feature
map shrunk by an amount dependent on the stride of the pooling layers.
CVN@DUNE8
The LeNetIn its simplest form a convolutional neural network is a series of convolutional, max pooling, and MLP layers:
The “LeNet” circa 1989
http://deeplearning.net/tutorial/lenet.html http://yann.lecun.com/exdb/lenet/
Modern CNNsRenaissance in CNN use over the last few years, with increasingly complex network-in-network models that allow for deeper learning of more complex features.
“Going deeper with convolutions” arXiv:1409.4842
The brilliance of this inception module is that it uses kernels of several sizes but keeps the number of feature maps under control by use of a 1x1 convolution.
Alexander Radovic CVN@DUNE10
Modern CNNs
The brilliance of this inception module is that it uses kernels of several sizes but keeps the number of feature maps under control by use of a 1x1 convolution.
Renaissance in CNN use over the last few years, with increasingly complex network-in-network models that allow for deeper learning of more complex features.
Alexander Radovic CVN@DUNE11
Modern CNNs
The “GoogleNet” circa 2014
Convolution Pooling Softmax Other
The brilliance of this inception module is that it uses kernels of several sizes but keeps the number of feature maps under control by use of a 1x1 convolution.
Renaissance in CNN use over the last few years, with increasingly complex network-in-network models that allow for deeper learning of more complex features.
Alexander Radovic CVN@DUNE12
CVN @ NOvA
CC Classifier Outputµν0 0.2 0.4 0.6 0.8 1
PO
T20
10×Ev
ents
/ 18
0
20
40
60eνAppeared
µνSurvived NC background
backgroundeνBeam
CC Classifier Outputeν0 0.2 0.4 0.6 0.8 1
PO
T20
10×Ev
ents
/ 18
0
5
10
15eνAppeared
µνSurvived NC background
backgroundeνBeam
NOvA CVN achieves 73% efficiency and 76% purity on νe selection at the optimized cut. Equivalent to 30% more exposure with the previous PIDs. Full details here: https://arxiv.org/abs/1604.01444
s/ps+ b
After oscillations, cosmic rejection cuts, data quality cuts:
Our Input• Our input “image”, is a set of three maps of wire hits. One for each
of the U and V wire plane views and a third for the collection plane. These maps are chosen to be 500 wires long and 3200 TDC ticks wide (split into 200 time chunks). Each pixel is then the sum of the adc recorded in that time/space slice.
• The map size was chosen primarily due to disk space and GPU memory constraints. Even this comparatively coarse image is 12.5 times larger than the NOvA CVN inputs.
• The wire hits themselves are from “gaus hit”. I’m using the peak time for the time of each hit and the integral adc under the signal for the adc.
• Using the full set of mergeana files from http://dune-data.fnal.gov/mc/mcc7/index.html: • prodgenie_{a}nu_dune10kt_1x2x6_mcc7.0 • prodgenie_{a}nue_dune10kt_1x2x6_mcc7.0 • prodgenie_{a}nutau_dune10kt_1x2x6_mcc7.0
Alexander Radovic CVN@DUNE14
Our Input
Alexander Radovic CVN@DUNE15
U V Collection
Our Input
Alexander Radovic CVN@DUNE16
U V Collection
The Training Sample• 500k events, only preselection
requiring 100 hits split across any number of planes.
• Labels are from GENIE truth, neutrino vs. antineutrino is ignored.
• No oscillation information, just the raw input flux/fluxswapped distributions.
• Pushed into LevelDB databases: 80% for training and 20% for testing.
• Rescale adc to go from 0 to 255 and truncate to chars for dramatically reduced file size at almost no loss of information. Final bin is overflow for >1000 adc.
ART
Even
ts
ν e Q
E
ν e R
ES
ν e D
IS
ν e C
OH
ν µ Q
E
ν µ R
ES
ν µ D
IS
ν µ C
OH
ν τ Q
E
ν τ R
ES
ν τ D
IS
ν τ C
OH
NC
17 Alexander Radovic
Num
ber o
f “Pi
xels
”
Truncated and Rescaled Integral ADC
Our Architecture• Minimal variation on the architecture on
the CVN paper as a first pass. Major change is taking both the induction planes as the “X” view and the collection plane as the “Y” view.
• The architecture attempts to categorize events as {νµ, νe, ντ } × {QE,RES,DIS}, or NC. Softmax output so these can be combined to form other PIDS.
• Designed to be a universal classifier. • Built in the excellent CAFFE
framework: http://caffe.berkeleyvision.org/
? =PL^
*VU]VS\[PVU��� Z[YPKL �
4H_ 7VVSPUN��� Z[YPKL �
395
*VU]VS\[PVU��
*VU]VS\[PVU��
395
4H_ 7VVSPUN��� Z[YPKL �
0UJLW[PVU4VK\SL
0UJLW[PVU4VK\SL
4H_ 7VVSPUN��� Z[YPKL �
0UJLW[PVU4VK\SL
@ =PL^
*VU]VS\[PVU��� Z[YPKL �
4H_ 7VVSPUN��� Z[YPKL �
395
*VU]VS\[PVU��
*VU]VS\[PVU��
395
4H_ 7VVSPUN��� Z[YPKL �
0UJLW[PVU4VK\SL
0UJLW[PVU4VK\SL
4H_ 7VVSPUN��� Z[YPKL �
0UJLW[PVU4VK\SL
0UJLW[PVU4VK\SL
(]N 7VVSPUN��
:VM[TH_�6\[W\[
Alexander Radovic 18
Training PerformanceFirst check, how does training and test performance track over time?
Alexander Radovic
Training Performance
Alexander Radovic
Second check, how does the accuracy break down by event type?
NuMu PID0 0.2 0.4 0.6 0.8 1
Test
Sam
ple
Even
ts
0
1000
2000
3000
NuMu CCNuENuTau CCNC
Training Performance
Alexander Radovic
Third check, what sort of PID separation am I getting?
NuE PID0 0.2 0.4 0.6 0.8 1
Test
Sam
ple
Even
ts
0
200
400
600
800
1000
NuMu CCNuENuTau CCNC
Conclusions• Code to make CVN training samples currently running in the
DUNE art framework. First commit of the module coming soon.
• Working on adding the caffe UPS product so that we can also add the inference code- at that point we can start to benchmark with real cuts and oscillation parameters.
• Promising first pass but it’s clear that with only 500k events overtraining could be a problem. Can we generate a much larger but minimally reconstructed sample?
• Useful tool in and of itself, but would Pandora or shower/track CNN designers be interested in taking a CVN PID result as an input?
• Very naive first start, huge amounts of room to improve. Expect rapid gains in performance and many more diagnostic plots.
Alexander Radovic CVN@DUNE22
HEP Deep Learning Papers“A Convolutional Neural Network Neutrino Event Classifier” JINST 11 (2016) no.09, P09001 “Revealing Fundamental Physics from the Daya Bay Neutrino Experiment using Deep Neural Networks” arXiv:1601.07621 "“Background rejection in NEXT using deep neural networks” arXiv:1609.06202 “Searching for exotic particles in high-energy physics with deep learning” Nature Communications 5, Article number: 4308 (2014)"“Jet-Images -- Deep Learning Edition” arXiv:1511.05190 “Jet Substructure Classification in High-Energy Physics with Deep Neural Networks” Phys.Rev. D93 (2016) no.9, 094034"“Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks” arXiv:1609.00607 “Machine learning techniques in searches for tt¯h in the h→bb¯ decay channel” arXiv:1610.03088"“Learning to Pivot with Adversarial Networks” arXiv:1611.01046"“Convolutional Neural Networks Applied to Neutrino Events in a Liquid Argon Time Projection Chamber” arXiv:1611.05531
Interesting ReadingSome high level discussion of the use of these tools in science: http://www.nature.com/news/can-we-open-the-black-box-of-ai-1.20731 I would recommend starting by reading this: http://deeplearning.net/tutorial/lenet.html then I'd read through the various links on the caffe homepage Our own paper which might be of interest: https://arxiv.org/abs/1604.01444 Some other useful papers: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf <-- introduces the idea of networks in networks http://arxiv.org/pdf/1409.4842v1.pdf <-- introduces a specific google network in network called an inception module which we've found to be very powerful http://www.cs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf <-- semantic segmentation Developments in the googlenet, which is great way to track general trends in the field http://arxiv.org/abs/1502.03167 <-- introduces batch normalization http://arxiv.org/pdf/1512.00567.pdf <-- smarter kernel sizes http://arxiv.org/abs/1602.07261 <-- introducing residual layers Also fun though I've yet to think of a good hep application: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Deep LearningDeep Neural Networks
Convolutional Neural Networks
Neural Turing Machines
Recurrent Neural NetworkUnsupervised Learning
Adversarial Networks
Deep LearningDeep Neural Networks
Convolutional Neural Networks
Neural Turing Machines
Recurrent Neural NetworkUnsupervised Learning
Adversarial Networks
Neural Networks
y
Neural Networks
x = input vector
y
y = � (Wx+ b)
� =
Training A Neural NetworkL(
W,x
)
W
Start with a “Loss” function which characterizes the performance of the network. For supervised learning:
L(W,X) =
1
N
NexamplesX
1
�yi log (f(xi))� (1� yi) log (1� f(xi))
Training A Neural Network
L(W,X) =
1
N
NexamplesX
1
�yi log (f(xi))� (1� yi) log (1� f(xi))
Add in a regularization term to avoid overfitting:L0 = L+
1
2
X
j
w2j
Start with a “Loss” function which characterizes the performance of the network. For supervised learning:
Training A Neural Network
L(W,X) =
1
N
NexamplesX
1
�yi log (f(xi))� (1� yi) log (1� f(xi))
Add in a regularization term to avoid overfitting:L0 = L+
1
2
X
j
w2j
Update weights using gradient descent:
Propagate the gradient of the network back to specific nodes using back propagation. AKA apply the chain rule:
w0
j = wj � ↵rwjL
rwjL =�L
�f
�f
�gn
�gn�gn�1
...�gk+1
�gk
�gk�wj
Start with a “Loss” function which characterizes the performance of the network. For supervised learning:
Deep Neural NetworksWhat if we try to keep all the input data? Why not rely on a wide, extremely Deep Neural Network (DNN) to learn the features it needs? Sufficiently deep networks make excellent function approximators:
http://cs231n.github.io/neural-networks-1/
However, until recently they proved almost impossible to train.
The Deep Learning RevolutionMost famously “big data” and accelerated computing made it easier to find and run deep neural networks:
Better Activation FunctionsBut there were also some major technical breakthroughs. One being more effective back propagation due to better weight initialization and saturation functions:
The problem with sigmoids: ReLU:
http://deepdish.io/
�� (x)
�x
= � (x) (1� � (x))
Sigmoid gradient goes to 0 when x is far from 1. Makes back propagation impossible! Use ReLU to avoid saturation.
ReLU (x)
�x
=
(1 when x > 0
0 otherwise
Smarter TrainingAnother is stochastic gradient descent (SGD). In SGD we avoid some of the cost of gradient descent by evaluating as few as one event at a time. The performance of conventional gradient descent is approximated as the various noisy sub estimates even out, with the stochastic behavior even allowing for jumping out of local minima.
http://hduongtrong.github.io/
Dropout• Same goal as conventional regularization- prevent
overtraining. • Works by randomly removing whole nodes during training
iterations. At each iteration, randomly set XX% of weights to zero and scale the rest up by 1/(1 – 0.XX).
• Forces the network not to build complex interdependencies in the extracted features.