
Page 1: CVN @ DUNE

Adam Aurisano, University of Cincinnati
Alexander Radovic, College of William and Mary
Evan Niner, Fermilab
Fernanda Psihas, Indiana University
Alex Himmel, Fermilab

Page 2: Introduction

• Convolutional neural networks have proven to be extremely effective for event identification at the NOvA experiment: https://arxiv.org/abs/1604.01444
• MicroBooNE has recently shown us that liquid argon detector readout is a similarly excellent domain for CNNs: https://arxiv.org/abs/1611.05531
• This leads to a natural next question: can we take the art-based tools and expertise from the NOvA CNN, named CVN, and apply them to DUNE MC to rapidly produce a competitive PID?
• If we can, it could prove to be a powerful tool for benchmarking alternative detector designs, hit-finding tools, and TDR numbers.

Page 3: Neural Networks

[Diagram: a single artificial neuron mapping an input vector to an output y.]

Page 4: Neural Networks

x = input vector

y = σ(Wx + b)

σ = the activation function (a sigmoid)

[Diagram: a single neuron producing output y, with a plot of the sigmoid σ.]
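
As a rough illustration (not part of the original slides), the mapping y = σ(Wx + b) can be written in a few lines of NumPy; the layer sizes and values below are purely hypothetical.

```python
# A minimal sketch (assumed, not from the slides) of the single-layer forward pass y = sigmoid(W x + b).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, b, x):
    """One fully connected layer: weight matrix W, bias vector b, input vector x."""
    return sigmoid(W @ x + b)

# Toy usage: 3 inputs mapped to 2 outputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
b = np.zeros(2)
x = np.array([0.5, -1.0, 2.0])
print(forward(W, b, x))
```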

Page 5: Convolutional Neural Networks

http://setosa.io/ev/image-kernels/

[Figure: an input image, a kernel, and the resulting feature map.]

Instead of training a weight for every input pixel, try learning weights that describe kernel operations, convolving that kernel across the entire image to exaggerate useful features. Inspired by research showing that cells in the visual cortex are only responsive to small portions of the visual field.
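
The kernel operation described above can be sketched directly; this toy convolution (an assumption, not the setosa.io demo code) slides a 3x3 kernel over an image to produce a feature map.

```python
# A minimal sketch (assumed, not from the slides) of convolving a 3x3 kernel across a 2D image.
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution: slide the kernel over the image and sum elementwise products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A Sobel-style edge kernel, similar in spirit to the demo the slide links to.
image = np.random.rand(8, 8)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
print(convolve2d(image, sobel_x).shape)  # (6, 6) feature map
```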

Page 6: Convolutional Neural Networks

https://developer.nvidia.com/deep-learning-courses

[Figure: feature maps produced by a convolutional layer.]

Instead of training a weight for every input pixel, try learning weights that describe kernel operations, convolving that kernel across the entire image to exaggerate useful features. Inspired by research showing that cells in the visual cortex are only responsive to small portions of the visual field.

Page 7: Convolutional Layers

• Every trained kernel operation is the same across an entire input image or feature map.
• Each convolutional layer trains an array of kernels to produce output feature maps.
• Weights for a given convolutional layer are a 4D tensor of NxMxHxW (number of incoming features, number of outgoing features, height, and width); see the sketch below.
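
For concreteness, here is a hypothetical NumPy sketch of a convolutional layer whose weights form the NxMxHxW tensor described above; the shapes, bias term, and initialization are illustrative assumptions, not the CVN code.

```python
# A rough sketch (assumed, not from the slides) of a convolutional layer whose weights
# are a 4D tensor of shape (n_in, n_out, H, W), as described on this page.
import numpy as np

def conv_layer(feature_maps, weights, biases):
    """feature_maps: (n_in, height, width); weights: (n_in, n_out, kH, kW); biases: (n_out,)."""
    n_in, n_out, kh, kw = weights.shape
    _, ih, iw = feature_maps.shape
    out = np.zeros((n_out, ih - kh + 1, iw - kw + 1))
    for m in range(n_out):
        for n in range(n_in):                      # sum contributions from every input map
            for i in range(out.shape[1]):
                for j in range(out.shape[2]):
                    out[m, i, j] += np.sum(feature_maps[n, i:i + kh, j:j + kw] * weights[n, m])
        out[m] += biases[m]
    return np.maximum(out, 0.0)                    # ReLU activation, as discussed later in the deck

x = np.random.rand(3, 16, 16)                      # 3 input feature maps
W = np.random.randn(3, 8, 3, 3) * 0.1              # 3 in, 8 out, 3x3 kernels
print(conv_layer(x, W, np.zeros(8)).shape)         # (8, 14, 14)
```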

Page 8: Pooling Layers

• Intelligent downscaling of input feature maps.
• Stride across images taking either the maximum or average value in a patch.
• Same number of feature maps, with each individual feature map shrunk by an amount dependent on the stride of the pooling layers (see the max-pooling sketch below).
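
A minimal max-pooling sketch, assuming 2x2 patches with stride 2; the actual patch sizes and strides used in CVN are not specified on this slide.

```python
# A minimal max-pooling sketch (assumed, not from the slides): stride over a feature
# map and keep the maximum value in every patch.
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = patch.max()
    return out

fmap = np.random.rand(14, 14)
print(max_pool(fmap).shape)  # (7, 7): same map, shrunk by the stride
```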

Page 9: The LeNet

In its simplest form a convolutional neural network is a series of convolutional, max pooling, and MLP layers:

The "LeNet", circa 1989

http://deeplearning.net/tutorial/lenet.html
http://yann.lecun.com/exdb/lenet/

Page 10: Modern CNNs

Renaissance in CNN use over the last few years, with increasingly complex network-in-network models that allow for deeper learning of more complex features.

"Going deeper with convolutions", arXiv:1409.4842

The brilliance of this inception module is that it uses kernels of several sizes but keeps the number of feature maps under control by use of a 1x1 convolution.
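
To illustrate the 1x1 convolution trick mentioned above, here is a hypothetical sketch (not from the slides or the GoogLeNet code) showing how a 1x1 convolution linearly mixes channels to reduce the number of feature maps before the larger kernels run; the channel counts are made up.

```python
# A 1x1 convolution is just a learned linear mix across the channel dimension, so many
# incoming feature maps can be squeezed down before the expensive 3x3 / 5x5 kernels run.
import numpy as np

def conv1x1(feature_maps, weights):
    """feature_maps: (n_in, H, W); weights: (n_out, n_in). Mix channels pixel by pixel."""
    return np.tensordot(weights, feature_maps, axes=([1], [0]))  # -> (n_out, H, W)

x = np.random.rand(256, 28, 28)          # 256 feature maps: expensive to convolve directly
W = np.random.randn(64, 256) * 0.05
print(conv1x1(x, W).shape)               # (64, 28, 28): same spatial size, 4x fewer maps
```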

Page 11: Modern CNNs

The brilliance of this inception module is that it uses kernels of several sizes but keeps the number of feature maps under control by use of a 1x1 convolution.

Renaissance in CNN use over the last few years, with increasingly complex network-in-network models that allow for deeper learning of more complex features.

Page 12: Modern CNNs

The "GoogLeNet", circa 2014

[Figure: the GoogLeNet architecture diagram. Legend: Convolution, Pooling, Softmax, Other.]

The brilliance of this inception module is that it uses kernels of several sizes but keeps the number of feature maps under control by use of a 1x1 convolution.

Renaissance in CNN use over the last few years, with increasingly complex network-in-network models that allow for deeper learning of more complex features.

Page 13: CVN @ NOvA

[Figures: νµ CC and νe CC classifier output distributions for appeared νe, survived νµ, NC background, and beam νe background, shown after oscillations, cosmic rejection cuts, and data quality cuts. Cuts optimized on s/√(s+b).]

NOvA CVN achieves 73% efficiency and 76% purity on νe selection at the optimized cut, equivalent to 30% more exposure than with the previous PIDs. Full details here: https://arxiv.org/abs/1604.01444

Page 14: Our Input

• Our input "image" is a set of three maps of wire hits: one for each of the U and V wire plane views and a third for the collection plane. These maps are chosen to be 500 wires long and 3200 TDC ticks wide (split into 200 time chunks). Each pixel is then the sum of the ADC recorded in that time/space slice. A sketch of this pixel-map building follows below.
• The map size was chosen primarily due to disk space and GPU memory constraints. Even this comparatively coarse image is 12.5 times larger than the NOvA CVN inputs.
• The wire hits themselves are from "gaushit". I'm using the peak time for the time of each hit and the integral ADC under the signal for the ADC.
• Using the full set of mergeana files from http://dune-data.fnal.gov/mc/mcc7/index.html:
  • prodgenie_{a}nu_dune10kt_1x2x6_mcc7.0
  • prodgenie_{a}nue_dune10kt_1x2x6_mcc7.0
  • prodgenie_{a}nutau_dune10kt_1x2x6_mcc7.0
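
As a rough sketch of the pixel-map construction described in the first bullet (this is not the actual CVN art module; the helper name and hit format are assumptions), one plane view could be filled like this:

```python
# A rough sketch (assumed, not the actual CVN module) of filling one of the three pixel
# maps: 500 wires x 200 time chunks, where each chunk covers 16 of the 3200 TDC ticks
# and each pixel sums the integral ADC of the hits landing in that slice.
import numpy as np

N_WIRES, N_TDC, N_CHUNKS = 500, 3200, 200
TICKS_PER_CHUNK = N_TDC // N_CHUNKS  # 16

def fill_pixel_map(hits):
    """hits: iterable of (wire, peak_time_tick, integral_adc) for one plane view."""
    pixel_map = np.zeros((N_WIRES, N_CHUNKS))
    for wire, peak_time, adc in hits:
        if 0 <= wire < N_WIRES and 0 <= peak_time < N_TDC:
            pixel_map[wire, int(peak_time) // TICKS_PER_CHUNK] += adc
    return pixel_map

# Toy usage with hypothetical hits (wire, peak time, ADC).
hits = [(120, 1530.4, 85.0), (121, 1542.1, 240.0), (121, 1550.0, 30.0)]
print(fill_pixel_map(hits)[121, int(1542.1) // TICKS_PER_CHUNK])  # 270.0: two hits summed
```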

Page 15: Our Input

[Event display: the U, V, and collection plane pixel maps for an example event.]

Page 16: Our Input

[Event display: the U, V, and collection plane pixel maps for a second example event.]

Page 17: The Training Sample

• 500k events, only preselection requiring 100 hits split across any number of planes.
• Labels are from GENIE truth; neutrino vs. antineutrino is ignored.
• No oscillation information, just the raw input flux/flux-swapped distributions.
• Pushed into LevelDB databases: 80% for training and 20% for testing.
• Rescale ADC to go from 0 to 255 and truncate to chars for dramatically reduced file size at almost no loss of information. The final bin is overflow for >1000 ADC (see the sketch below).

[Figures: number of art events in the training sample broken down by interaction type (νe, νµ, ντ × QE, RES, DIS, COH, plus NC); number of "pixels" vs. truncated and rescaled integral ADC.]
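
A minimal sketch of the ADC compression described in the last bullet; this is assumed, not the actual CVN code, and the linear mapping below is one plausible reading of "rescale to 0-255 with overflow above 1000 ADC".

```python
# Rescale integral ADC into 0-255 and truncate to unsigned chars, with the last bin
# acting as overflow for anything above 1000 ADC.
import numpy as np

ADC_OVERFLOW = 1000.0

def compress_adc(pixel_map):
    """Map raw ADC sums onto unsigned chars: linear in [0, 1000], 255 for anything above."""
    scaled = np.clip(pixel_map, 0.0, ADC_OVERFLOW) * 255.0 / ADC_OVERFLOW
    return scaled.astype(np.uint8)   # 1 byte per pixel instead of a float

raw = np.array([[0.0, 37.5, 999.0], [1000.0, 4096.0, 250.0]])
print(compress_adc(raw))
# [[  0   9 254]
#  [255 255  63]]
```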

Page 18: Our Architecture

• Minimal variation on the architecture in the CVN paper as a first pass. The major change is taking both the induction planes as the "X" view and the collection plane as the "Y" view.
• The architecture attempts to categorize events as {νµ, νe, ντ} × {QE, RES, DIS}, or NC. Softmax output, so these can be combined to form other PIDs (see the sketch below).
• Designed to be a universal classifier.
• Built in the excellent Caffe framework: http://caffe.berkeleyvision.org/

[Architecture diagram: each of the X View and Y View branches is a stack of strided convolution, strided max pooling, LRN, and inception-module layers; the two branches then feed an average pooling layer and a softmax output.]
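
Because the softmax categories are exclusive and sum to one, combined PIDs can be formed by summing category scores. A hypothetical sketch follows; the category names and ordering are assumptions, not the CVN output format.

```python
# A hypothetical sketch (not the actual CVN code) of building a flavour PID from the
# softmax output: sum the interaction-mode scores for the flavour of interest.
import numpy as np

CATEGORIES = [f"{flav} {mode}" for flav in ("numu", "nue", "nutau")
              for mode in ("QE", "RES", "DIS")] + ["NC"]

def flavour_pid(softmax_scores, flavour):
    """softmax_scores: array ordered as CATEGORIES; flavour: 'numu', 'nue', or 'nutau'."""
    scores = dict(zip(CATEGORIES, softmax_scores))
    return sum(v for k, v in scores.items() if k.startswith(flavour))

# Toy softmax output for one event.
p = np.array([0.02, 0.03, 0.05, 0.50, 0.20, 0.10, 0.01, 0.01, 0.03, 0.05])
print(flavour_pid(p, "nue"))   # 0.80: summed nue QE + RES + DIS scores
```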

Page 19: Training Performance

First check: how does training and test performance track over time?

Page 20: Training Performance

Second check: how does the accuracy break down by event type?

Page 21: Training Performance

Third check: what sort of PID separation am I getting?

[Figures: test-sample events vs. NuMu PID and vs. NuE PID, broken down into NuMu CC, NuE CC, NuTau CC, and NC.]

Page 22: Conclusions

• Code to make CVN training samples is currently running in the DUNE art framework. First commit of the module coming soon.
• Working on adding the Caffe UPS product so that we can also add the inference code; at that point we can start to benchmark with real cuts and oscillation parameters.
• Promising first pass, but it's clear that with only 500k events overtraining could be a problem. Can we generate a much larger but minimally reconstructed sample?
• Useful tool in and of itself, but would Pandora or shower/track CNN designers be interested in taking a CVN PID result as an input?
• Very naive first start, with huge amounts of room to improve. Expect rapid gains in performance and many more diagnostic plots.

Page 23: HEP Deep Learning Papers

• "A Convolutional Neural Network Neutrino Event Classifier", JINST 11 (2016) no.09, P09001
• "Revealing Fundamental Physics from the Daya Bay Neutrino Experiment using Deep Neural Networks", arXiv:1601.07621
• "Background rejection in NEXT using deep neural networks", arXiv:1609.06202
• "Searching for exotic particles in high-energy physics with deep learning", Nature Communications 5, Article number: 4308 (2014)
• "Jet-Images -- Deep Learning Edition", arXiv:1511.05190
• "Jet Substructure Classification in High-Energy Physics with Deep Neural Networks", Phys. Rev. D93 (2016) no.9, 094034
• "Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks", arXiv:1609.00607
• "Machine learning techniques in searches for tt̄h in the h→bb̄ decay channel", arXiv:1610.03088
• "Learning to Pivot with Adversarial Networks", arXiv:1611.01046
• "Convolutional Neural Networks Applied to Neutrino Events in a Liquid Argon Time Projection Chamber", arXiv:1611.05531

Page 24: Interesting Reading

Some high level discussion of the use of these tools in science:
http://www.nature.com/news/can-we-open-the-black-box-of-ai-1.20731

I would recommend starting by reading this: http://deeplearning.net/tutorial/lenet.html, then I'd read through the various links on the Caffe homepage.

Our own paper which might be of interest: https://arxiv.org/abs/1604.01444

Some other useful papers:
http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf <-- introduces the idea of networks in networks
http://arxiv.org/pdf/1409.4842v1.pdf <-- introduces a specific Google network-in-network called an inception module, which we've found to be very powerful
http://www.cs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf <-- semantic segmentation

Developments in the GoogLeNet, which are a great way to track general trends in the field:
http://arxiv.org/abs/1502.03167 <-- introduces batch normalization
http://arxiv.org/pdf/1512.00567.pdf <-- smarter kernel sizes
http://arxiv.org/abs/1602.07261 <-- introducing residual layers

Also fun, though I've yet to think of a good HEP application:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Page 25: Deep Learning

Deep Neural Networks
Convolutional Neural Networks
Neural Turing Machines
Recurrent Neural Networks
Unsupervised Learning
Adversarial Networks

Page 26: Deep Learning

Deep Neural Networks
Convolutional Neural Networks
Neural Turing Machines
Recurrent Neural Networks
Unsupervised Learning
Adversarial Networks

Page 27: Neural Networks

[Diagram: a single artificial neuron mapping an input vector to an output y.]

Page 28: Neural Networks

x = input vector

y = σ(Wx + b)

σ = the activation function (a sigmoid)

[Diagram: a single neuron producing output y, with a plot of the sigmoid σ.]

Page 29: Training A Neural Network

[Figure: the loss L(W, x) as a function of the weights W.]

Start with a "Loss" function which characterizes the performance of the network. For supervised learning:

L(W, X) = \frac{1}{N} \sum_{i=1}^{N_{\mathrm{examples}}} \left[ -y_i \log f(x_i) - (1 - y_i) \log\big(1 - f(x_i)\big) \right]

Page 30: Training A Neural Network

Start with a "Loss" function which characterizes the performance of the network. For supervised learning:

L(W, X) = \frac{1}{N} \sum_{i=1}^{N_{\mathrm{examples}}} \left[ -y_i \log f(x_i) - (1 - y_i) \log\big(1 - f(x_i)\big) \right]

Add in a regularization term to avoid overfitting:

L' = L + \frac{1}{2} \sum_j w_j^2

Page 31: Training A Neural Network

Start with a "Loss" function which characterizes the performance of the network. For supervised learning:

L(W, X) = \frac{1}{N} \sum_{i=1}^{N_{\mathrm{examples}}} \left[ -y_i \log f(x_i) - (1 - y_i) \log\big(1 - f(x_i)\big) \right]

Add in a regularization term to avoid overfitting:

L' = L + \frac{1}{2} \sum_j w_j^2

Update weights using gradient descent:

w'_j = w_j - \alpha \nabla_{w_j} L

Propagate the gradient of the network back to specific nodes using back propagation, i.e. apply the chain rule:

\nabla_{w_j} L = \frac{\partial L}{\partial f} \, \frac{\partial f}{\partial g_n} \, \frac{\partial g_n}{\partial g_{n-1}} \cdots \frac{\partial g_{k+1}}{\partial g_k} \, \frac{\partial g_k}{\partial w_j}
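
Pages 29-31 together describe cross-entropy loss, L2 regularization, and gradient descent via the chain rule. A compact sketch for a single sigmoid neuron follows (assumed, not from the slides); the backpropagation here is just the chain rule written out by hand.

```python
# One gradient descent step combining cross-entropy loss, an L2 regularization term,
# and the chain-rule gradient for a single sigmoid neuron.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, b, X, y, alpha=0.1, lam=1e-3):
    """One gradient descent step on N examples X (N, n_features) with binary labels y."""
    f = sigmoid(X @ W + b)                               # network output f(x_i)
    loss = np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f)) + 0.5 * lam * np.sum(W ** 2)
    # Chain rule: dL/dW = dL/df * df/dz * dz/dW, which for sigmoid + cross entropy
    # collapses to (f - y) x_i, plus the regularization gradient lam * W.
    grad_W = X.T @ (f - y) / len(y) + lam * W
    grad_b = np.mean(f - y)
    return W - alpha * grad_W, b - alpha * grad_b, loss

# Toy usage on a separable 2D problem.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
W, b = np.zeros(2), 0.0
for _ in range(200):
    W, b, loss = train_step(W, b, X, y)
print(round(loss, 3), W)
```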

Page 32: Deep Neural Networks

What if we try to keep all the input data? Why not rely on a wide, extremely Deep Neural Network (DNN) to learn the features it needs? Sufficiently deep networks make excellent function approximators:

http://cs231n.github.io/neural-networks-1/

However, until recently they proved almost impossible to train.

Page 33: The Deep Learning Revolution

Most famously, "big data" and accelerated computing made it easier to find and run deep neural networks:

Page 34: Better Activation Functions

But there were also some major technical breakthroughs. One being more effective back propagation due to better weight initialization and saturation functions:

http://deepdish.io/

The problem with sigmoids:

\frac{\partial \sigma(x)}{\partial x} = \sigma(x)\,\big(1 - \sigma(x)\big)

The sigmoid gradient goes to 0 when x is far from 0. That makes back propagation impossible! Use the ReLU to avoid saturation:

\frac{\partial\,\mathrm{ReLU}(x)}{\partial x} = \begin{cases} 1 & \text{when } x > 0 \\ 0 & \text{otherwise} \end{cases}
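
A small sketch comparing the two gradients (assumed, not from the slides): the sigmoid derivative vanishes for large |x|, while the ReLU derivative stays at 1 for any positive input.

```python
# Compare the sigmoid and ReLU gradients at a few points.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # goes to 0 when |x| is large: saturation

def relu_grad(x):
    return (x > 0).astype(float)  # stays 1 for any positive input

for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  d(sigmoid)/dx={sigmoid_grad(np.array(x)):.5f}  d(ReLU)/dx={relu_grad(np.array(x)):.0f}")
```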

Page 35: Smarter Training

Another is stochastic gradient descent (SGD). In SGD we avoid some of the cost of gradient descent by evaluating as few as one event at a time. The performance of conventional gradient descent is approximated as the various noisy sub-estimates even out, with the stochastic behavior even allowing for jumping out of local minima.

http://hduongtrong.github.io/
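
A tiny sketch of the stochastic part of SGD (assumed, not from the slides): each update uses a small random mini-batch rather than the full dataset, and the noisy gradient estimates average out over many steps.

```python
# Mini-batch SGD: the gradient is evaluated on a random subset of the data at each step.
import numpy as np

def sgd(grad_fn, w, data, alpha=0.05, batch_size=8, steps=1000, seed=0):
    """grad_fn(w, batch) returns the loss gradient evaluated on a subset of the data."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        w = w - alpha * grad_fn(w, batch)
    return w

# Toy usage: fit the mean of some data by minimizing the squared error.
data = np.random.default_rng(2).normal(loc=3.0, size=(500, 1))
grad = lambda w, batch: 2.0 * np.mean(w - batch)   # d/dw of mean (w - x)^2
print(round(float(sgd(grad, 0.0, data)), 2))        # close to 3.0
```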

Page 36: Dropout

• Same goal as conventional regularization: prevent overtraining.
• Works by randomly removing whole nodes during training iterations. At each iteration, randomly set XX% of weights to zero and scale the rest up by 1/(1 - 0.XX). A sketch follows below.
• Forces the network not to build complex interdependencies in the extracted features.
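
A minimal dropout sketch (assumed, not from the slides): zero a random fraction p of the activations during training and scale the survivors by 1/(1 - p), so the expected activation is unchanged; at test time the layer does nothing.

```python
# Dropout applied to a layer's activations during training.
import numpy as np

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
    if not training or p == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= p   # randomly drop a fraction p of nodes
    return activations * keep_mask / (1.0 - p)       # scale survivors up by 1/(1 - p)

a = np.ones(10)
print(dropout(a, p=0.5))   # roughly half the entries are 0, the rest are 2.0
```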