R in High Energy Physics (A somewhat personal account)
Adam Lyon, Fermi National Accelerator Laboratory, Computing Division - DØ Experiment
PHYSTAT Workshop on Statistical Software, MSU - March 2004
Outline:
- Some background
- Why is R interesting to us?
- Some non-analysis examples
- Using R in HEP
- Some thoughts on where this can go
Some background on me…
- Graduate student on DØ (a 400 person Fermilab experiment). Marc Paterno and I were some of the first to use C++ for analysis at DØ (in the days of PAW)… and the first at DØ to use Bayesian statistics for a limit calculation.
- Postdoc on CLEO (Cornell, a 200 person experiment). Used PAW, ROOT & Mathematica for several analyses; involved in the experiment's transition to C++.
- Back to DØ (now a 700 person Fermilab experiment) as an associate scientist in the Computing Division. Used R for non-HEP analysis applications. Pondering (with Marc Paterno and Jim Kowalkowski, also of FNAL/CD) how R can be made useful in HEP analyses.
First use of R
- Marc (C++ & statistics expert & troublemaker) came across R and showed it to Jim and me. It looked neat, but we didn't have any reason to use it until…
- Monitoring of DØ's Data Handling System. DØ has 601 Terabytes of data on tape. SAM (a DØ & CDF joint project) is our:
  • File storage system (knows where all files live)
  • File delivery system (gets those files to you worldwide)
  • File cataloging system (stores meta-data for file cataloging)
  • Analysis bookkeeping system (remembers what you did)
Data Handling at DØ
- SAM typically delivers ~150 TB of data to users per month.
- It's perhaps a 0th generation GRID.
- SAM is a very complicated system, with no monitoring except for huge dumps of text and log files.
- Monitoring was sorely needed -- lots of things can go wrong.
- Usage statistics were needed for future planning and discovery of bottlenecks.
samTV

[Screenshot of the samTV monitoring display]
Monitoring with R

Turn a text file like this (from parsing big log files):

station procId  time       event           fromStation  dur
cabsrv1 2983599 1074593577 OpenFile        enstore      9343
cabsrv1 2983599 1074604748 RequestNextFile NA           11171
cabsrv1 2983599 1074609598 OpenFile        enstore      4850
cabsrv1 2983599 1074620392 RequestNextFile NA           10794
cabsrv1 2983599 1074620392 OpenFile        fnal-cabsrv1 0
cabsrv1 2983599 1074631505 RequestNextFile NA           11113
cab     3085189 1076666381 OpenFile        cab          415
cab     3085189 1076673379 RequestNextFile NA           6998
cab     3085189 1076673379 OpenFile        cab          0
cab     3085189 1076680426 RequestNextFile NA           7047
cab     3085189 1076680753 OpenFile        enstore      327
cab     3085189 1076687836 RequestNextFile NA           7083
cab     3085189 1076687836 OpenFile        enstore      0
cab     3085189 1076694821 RequestNextFile NA           6985
cab     3085189 1076695114 OpenFile        cab          293
cab     3085189 1076702701 RequestNextFile NA           7587
cab     3085189 1076702701 OpenFile        enstore      0
cab     3085189 1076710021 RequestNextFile NA           7320
cab     3085189 1076710021 OpenFile        enstore      0
cab     3085189 1076717651 RequestNextFile NA           7630
cab     3085189 1076717651 OpenFile        cab          0

(705,000 more lines like the above!)
Into plots like this… R code:

library(lattice)
d = read.table("data.dat", head=T)
w = d[ d$event=="OpenFile", ]
w$min = w$dur/60.0
bwplot( fromStation ~ min | station, data=w, subset=(min<60),
        xlab="Minutes", main="Wait Time for …" )
Box and Whisker Plots

[Plot: box-and-whisker wait-time distributions by source and station, produced by the bwplot call above]
Why is R interesting to us?
- Seems to be the "state of the art" in statistics.
- Enormous library of user-contributed add-on packages:
  • Huge number of statistical tests, fitting, smoothing, …
  • More advanced stuff too: genetic algorithms, support vector machines, kriging (would have been useful for my thesis!)
- Advanced graphics based on William Cleveland's Visualizing Data.
- SQL (MySQL, Oracle, SQLite, Postgres, ODBC) and XML support (see the sketch below). Hooks to COM and CORBA. Interfaces for Python, Perl, Tk (GUIs), Java.
- Pretty easy interface to C, C++, Fortran.
- Some nice conveniences (R can save its state).
- It's multiplatform. It's free!
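Since database access comes up often in our monitoring work, here is a minimal sketch of what the SQL hooks look like through R's DBI interface, using SQLite as the backend (the package, table, and column names here are my own illustration; the other listed backends work similarly):

library(DBI)
# Open an in-memory SQLite database (a file path works the same way)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Push a small data frame into a table, then query it with plain SQL
dbWriteTable(con, "events",
             data.frame(station = c("cab", "cabsrv1"),
                        dur     = c(415, 9343)))
dbGetQuery(con, "SELECT station, AVG(dur) AS mean_dur
                 FROM events GROUP BY station")
dbDisconnect(con)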
The R Language
- "Not unlike S."
- The author of S (John Chambers) received the 1998 ACM Software System Award. The ACM's citation notes that Dr. Chambers' work "will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers." (http://www.acm.org/announcements/ss99.html)
- I guess he did good!
What Makes the R/S Language Interesting?
- "Programming with Data": the fundamental purpose of the language (as I see it) is to provide general tools for efficient data manipulation and analysis while allowing extensions to those tools to be programmed easily.
- Has a specific purpose. You wouldn't write your online data acquisition system in R/S, but analyzing the output from online monitoring is certainly a good task for it.
- R/S is a functional language: vectorized functions, apply, lazy evaluation.
- R/S is an object oriented language (but with a functional bent): functions with the same name are dispatched based on argument types, and it has notions of inheritance and other OO features (see the sketch below).
- Is R/S ideal? Don't know, but we've been very surprised by how some complicated tasks can be accomplished with astonishingly simple code.
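To make the dispatch point concrete, here is a small sketch of S3-style dispatch (the generic and the classes are invented for illustration): one function name, with the implementation chosen by the class of the argument.

# 'area' is a generic; R dispatches on the class of the first argument
area <- function(shape, ...) UseMethod("area")
area.circle    <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h
area.default   <- function(shape, ...) stop("no area method for this class")

c1 <- structure(list(r = 2),        class = "circle")
r1 <- structure(list(w = 3, h = 4), class = "rectangle")
area(c1)   # dispatches to area.circle:    12.57
area(r1)   # dispatches to area.rectangle: 12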
Some non-analysis examples

samTV: Plot the mean wait times by file source for each SAM station.

> nrow(w)
[1] 399135
> w[1:2,]
  station  procId       time    event fromStation  dur   min
1 cabsrv1 2983599 1074593577 OpenFile     enstore 9343 155.7
2 cabsrv1 2983599 1074609598 OpenFile     enstore 4850  80.8
> w.means = aggregate(w$min, list(station=w$station, src=w$fromStation), mean)
> w.means[1:2,]
  station src           x
1     cab cab 6.861695109
2 cabsrv1 cab 8.171100917

(the aggregation takes 2.2 seconds)
samTV cont'd
> dotplot(src ~ x | station,
data=w.means,
scales=list(cex=1.3),
main=list("Mean Process Wait Times", cex=1.5),
xlab=list("Wait time (minutes)", cex=1.5),
cex=1.7,
par.strip.text= list(cex=1.7) )
Non-analysis Examples

We've found R to be great for slogging through text files and database query results to make extremely useful and pretty plots.
Non-analysis applications
- Performance of DB server middleware (Marc Paterno).
- Data transfer speed vs. data size for two different servers.
- Fit to a model of startup time plus constant throughput:

modpollux = nls( speed ~ alpha*(1 - alpha*beta/(alpha*beta + mb)),
                 data=client[pollux,],
                 start=c(alpha=2.0, beta=0.50), trace=T )
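A hedged follow-up using the names from the call above (and assuming the data points are already plotted): once nls() converges, the fit can be inspected and the fitted curve overlaid on the data.

summary(modpollux)   # parameter estimates with standard errors
# Evaluate the fitted model on a grid and overlay it on an existing plot
mb.grid <- seq(min(client$mb), max(client$mb), length.out = 100)
lines(mb.grid, predict(modpollux, newdata = data.frame(mb = mb.grid)))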
What have we learned so far?
- There seems to be an "R way": do it the functional way! Use the apply commands and vectorized functions instead of for loops, and use higher order functions (see the sketch below).
- One of R's strengths is its user contributions, but this means some functionality is repeated (e.g. three histogram functions -- albeit each serves a slightly different purpose).
- The learning curve is long (R can do lots!), but there are extensive manuals, online documentation, and published books and papers.
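For instance, a small sketch of that "R way", with made-up numbers in the spirit of the samTV data -- the explicit loops disappear into vectorized operations and higher order functions:

dur     <- c(9343, 11171, 4850, 10794, 327, 7630)     # durations in seconds
station <- c("cabsrv1", "cabsrv1", "cabsrv1", "cab", "cab", "cab")

mins <- dur / 60             # vectorized: divides every element, no loop
tapply(mins, station, mean)  # grouped means in one call, no grouping loop

# Higher order style: apply several summaries to the same vector
sapply(list(mean = mean, median = median, max = max),
       function(f) f(mins))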
R in HEP
- We are aware of no one using R, or any other statistical package, in the HEP community. Why?
- Our needs are quite specific and… my postdoc supervisor (Ed Thorndike): "Trust no one" -- "or at least trust no one outside of HEP."
- With very few exceptions, all of our scientific software tools are written within the community. Many people write their own, reinventing lots of wheels.
- Most are unaware of tools from the statistics community and how they could apply to us.
- Many of us (including me) have little to no formal statistical training and had no exposure to statistical tools (e.g. SAS, SPSS, MATLAB, R).
R in HEP: Maybe this is changing, a little
- Root, the most widely used HEP analysis tool, has TGraphSmooth, which implements a Loess smoother (R functions translated into C++).
- Software is getting more complicated (we are doing lots more than just whipping up quick and dirty Fortran). There is some realization that we can't do it all ourselves (e.g. databases; SAM uses consultants).
- But a problem remains: our datasets tend to be huge.
HEP datasets and R
- R seems to want to hold everything in memory (we recently discovered externalVector, but haven't tried it yet).
- In HEP, we typically run successive skims to reduce the data size (601 TB down to hundreds of Meg or a few Gig). There are hard trade-offs between the size and utility of skims. Usually skims are output to a more convenient format (e.g. Root files).
- For example, I use a 4th generation skim with 412 variables and 232K rows (1.9 Gig).
- Even our last-stage skims are probably too large for R. Efficient handling of large datasets is one reason why Root is very successful.
Three strategies for reading HEP data in R

Realize that I don't need all 412 variables for all rows in memory at the same time; in fact I usually concentrate on just a few variables at a time, and can perform even further event requirements.

1. If the data is small enough, bring it into R.
2. If the data can be reduced to something R can hold, bring that subset into R -- you then have the full power of R. Perhaps this means using that data for a while, then loading a new set to tackle another aspect of the problem.
3. If you can't even do the above, then have some R apparatus read the data one row at a time and update an R object (e.g. histograms). [But you don't get the full power of R.] A sketch of this strategy follows.
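A minimal sketch of strategy 3 (the file name, column names, and chunk size are my own illustration, and I read chunk-wise rather than strictly row-by-row): accumulate histogram bin contents while streaming a large text file through R, so the full dataset never has to fit in memory.

breaks <- seq(0, 200, by = 2)           # fixed binning shared by all chunks
counts <- rep(0, length(breaks) - 1)    # running bin contents
con <- file("bigskim.dat", open = "r")  # hypothetical whitespace-separated skim
repeat {
  chunk <- tryCatch(read.table(con, nrows = 10000,
                               col.names = c("pt", "eta", "phi")),
                    error = function(e) NULL)  # NULL once input is exhausted
  if (is.null(chunk)) break
  sel <- chunk$pt >= 0 & chunk$pt < 200        # keep values inside the bins
  h <- hist(chunk$pt[sel], breaks = breaks, plot = FALSE)
  counts <- counts + h$counts                  # update the R-side histogram
}
close(con)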
Reading Root files into R: Do it the R way!

root.apply("myTree", "myFile.root", myFunction)

- The C++ and R code were written over the evenings of one weekend (my wife was out of town, the dog was asleep).
- You supply an R function that receives an entry from your Root file (as a list). The function can make requirements on the data, returning nothing (NULL) if they fail.
- The function returns a new list of variables to pass back to R; these can be new derived variables not in the Root entry.
- The return value of root.apply is a data frame (an R database).
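A tiny hypothetical user function following that contract (the field names under the MET branch are invented; the real example on the next slide uses the actual EM and MET branches):

# Keep only entries with large missing ET; reject the rest with NULL
highMet = function(entry) {
  if (entry$MET$met < 20) return(NULL)   # requirement failed: cut the entry
  list(met  = entry$MET$met,             # variables to keep...
       met2 = entry$MET$met^2)           # ...including new derived ones
}
d = root.apply("myTree", "myFile.root", highMet)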
Example -- selecting dielectrons

# Select events with two good electrons.
# Only the EM and MET branches are needed.
selectDiE = function(entry) {
  # Make a data frame of the electron data
  es = as.data.frame( entry$EM ) ; attach(es)

  # Make the requirements for a good electron
  # (a boolean vector over the electrons, e.g. T,T,F,T,F)
  goodECuts = ( id == 10 | abs(id) == 11 ) &
              pt > 25.0 & emfrac > 0.9 & fiducial == 1

  # If nothing passed, then stop (cut the entry)
  if ( ! any(goodECuts) ) return(NULL)

  # Get electron etas; make the requirements for good etas
  etas = abs(eta)
  goodEtaCuts = etas < 1.05 | ( etas > 1.7 & etas < 2.3 )

  # If no electron had a good eta, then stop
  if ( ! any(goodEtaCuts) ) return(NULL)

  # Join (AND) the requirements: electrons meeting all cuts
  goodEs = goodEtaCuts & goodECuts

  # Make a new data frame with the passing electrons; are there 2 or more?
  goodEsDF = es[goodEs,]
  if ( nrow(goodEsDF) < 2 ) return(NULL)

  # Construct the return list: order passing electrons by decreasing pt
  goodElectronsOrder = order( -goodEsDF$pt )
  e1 = goodEsDF[ goodElectronsOrder[[1]], ]
  names(e1) <- paste("e1", names(e1), sep=".")
  e2 = goodEsDF[ goodElectronsOrder[[2]], ]
  names(e2) <- paste("e2", names(e2), sep=".")

  # Return the two leading electrons plus the MET branch
  return( c(as.list(e1), as.list(e2), entry$MET) )
}
Example -- selecting dielectrons (the same function, with explicit es$ column references instead of attach):

selectDiE = function(entry) {
  # Make a data frame of the electron data
  es = as.data.frame( entry$EM )

  # Make the requirements for a good electron
  goodECuts = ( es$id == 10 | abs(es$id) == 11 ) &
              es$pt > 25.0 & es$emfrac > 0.9 & es$fiducial == 1
  if ( ! any(goodECuts) ) return(NULL)

  # Make the requirements for good etas
  etas = abs(es$eta)
  goodEtaCuts = etas < 1.05 | ( etas > 1.7 & etas < 2.3 )
  if ( ! any(goodEtaCuts) ) return(NULL)

  # Electrons meeting all cuts; require at least two
  goodEs = goodEtaCuts & goodECuts
  goodEsDF = es[goodEs,]
  if ( nrow(goodEsDF) < 2 ) return(NULL)

  # Construct the return list: order the electrons by decreasing pt
  goodElectronsOrder = order( -goodEsDF$pt )
  e1 = goodEsDF[ goodElectronsOrder[[1]], ]
  names(e1) <- paste("e1", names(e1), sep=".")
  e2 = goodEsDF[ goodElectronsOrder[[2]], ]
  names(e2) <- paste("e2", names(e2), sep=".")

  return( c(as.list(e1), as.list(e2), entry$MET) )
}
Analyzing dielectrons

d = root.apply("Global", "mydata.root", selectDiE)

Handed back a data frame with the variables I wanted. Can now attack this data with the full power of R.
Dielectrons

> d = root.apply(…)
> given.met = equal.count(d$met, number=4, overlap=0.1)
> summary(given.met)

Intervals:
         min       max count
1 0.05888367  4.103577  3044
2 3.84661865  6.498108  3045
3 6.21868896  9.914124  3046
4 9.42181396 88.125061  3043

Overlap between adjacent intervals:
[1] 307 306 308

> xyplot(e2.pt ~ e1.pt | given.met, data=d)
Extracting signal and background from data (from Marc Paterno)

Given a data sample, extract the amount of signal and background ("bump fitting") -- a common HEP problem. Try a MC example:
1. Generate data based on a signal distribution (a Breit-Wigner [Cauchy] of given mass and width) and a background distribution (proportional to 1/(a+b*x)^3).
2. Fit this data with the background and signal distributions, but with unknown parameters.
Bump Fitting: Generate the background distribution

- bf returns a function that, when given a uniform random variable on [0,1), returns a value distributed as the background with parameters a and b (inverse-CDF sampling).
- rbackground generates n values from the distribution.
- A clever use of higher order functions and vectorized functions.

bf = function(a, b) {
  # Inverse CDF of the background density 2*a^2*b / (a+b*x)^3:
  # for u uniform on [0,1), x = (a/b)*(1/sqrt(1-u) - 1)
  function(x) { temp = 1 - x; (a/b)*(1/sqrt(temp) - 1) }
}

rbackground = function(n, a, b) {
  transform = bf(a, b)
  transform(runif(n))
}
Bump Fitting: Generate the signal, then make the data

- rsignal generates n random Breit-Wigner values, requiring that the values be positive and less than max; values that fail are thrown away, and the function calls itself recursively to make up the amount that was lost.
- rexperiment joins the signal and background into one distribution.

rsignal = function(n, mass, width, max) {
  temp = rcauchy(n, mass, width)
  # Require the values be positive and less than max; drop failures
  temp = temp[temp > 0 & temp < max]
  num.more = n - length(temp)
  if (num.more > 0) {
    # Recursively generate replacements for the values thrown away
    more = rsignal(num.more, mass, width, max)
    temp = append(temp, more)
  }
  temp
}

rexperiment = function(nsig, mass, width, nback, a, b) {
  # Join the signal and background into one data set
  append(rsignal(nsig, mass, width, a*b/2),
         rbackground(nback, a, b))
}
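A hedged usage sketch (the event counts and parameter values are my own illustration, not the slide's): generate one pseudo-experiment and look at it.

set.seed(1)
# 2000 signal events in a bump at mass 40 (width 3) on top of 8000
# background events with a = 100, b = 2 (so the signal max is a*b/2 = 100)
d <- rexperiment(2000, 40, 3, 8000, 100, 2)
hist(d[d < 100], breaks = 100, main = "Generated signal + background")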
Bump fitting: the fit

- Use an unbinned maximum likelihood fitter (fitdistr, from the MASS package).
- Profiling with Rprof significantly sped up the fit (replacing ^ with repeated multiplication).

library(MASS)   # for fitdistr

dbackground = function(x, a, b) {
  d = a + b*x
  2*a*a*b/(d*d*d)   # multiplication instead of ^, per Rprof
}

mydistr = function(x, f, m, s, a, b) {
  (1-f)*dbackground(x, a, b) + f*dcauchy(x, m, s)
}

fres2 = fitdistr(data, densfun=mydistr,
                 start=list(f=FRAC, m=40.0, s=3.0, a=100, b=2.))
Bump Fitting: results

[Plot: histogram of the generated data with the true distribution, signal fit, background fit, and total fit overlaid; the bottom panel shows the residuals (true - fit).]
What are we considering next?
- Continue to learn more about R.
- Further develop the "Three Strategies".
- Explore doing a physics analysis in R.

Summary
- We are exploring the use of R, a statistical analysis package from the statistics community, in an HEP environment.
- R has already proven useful for analyzing monitoring and benchmarking data.
- We have ideas on how R can be used to read large datasets.
- We've done some "proof of principle" studies of physics analysis with R.
- As we learn more about R, we expect to be more surprised by its capabilities.
Options for R and Root Interfacing (after discussions; there is no interest from the R community in the non-I/O functions of Root)

In order of work required:

1) R and Root remain separate -- use the more appropriate tool for the task. Use text files to communicate between the two if necessary.

2) Root loads R's math and low-level statistical libraries as shared objects. A minimalist approach for some functionality: some access to R's C-level math and statistics functions. These C functions take basic C types, so no translation is necessary. But no upper-level functions written in the R language are available.

3) R and Root remain separate, but with an R package to read Root Trees directly into R data frames. Still use the best tool for a particular task, but now it's easier to get HEP data into R.

4) Allow calling of selected high-level R functions from within Root. Root runs the R interpreter, and translation is necessary: R functions must understand Root objects, and Root must understand R return objects. Exposing only some R functions may reduce the amount of translation.
More Advanced Integration Options

5) An R prompt from the Root prompt. R needs seamless knowledge of the objects in the current Root session; at the end of the R session, new R variables are translated into Root objects. Root runs the R interpreter, with translation for all types of Root variables into R and all types of R variables returned to Root. A major undertaking.

6) A Root prompt from within R. Harder than 5: R is C but Root is C++. I don't see much interest in this.

Things get interesting starting at 3). I have a version 0.0.1 prototype for reading Root trees into R, which is required for option 3 and everything beyond it. I'll try to work on this as time permits.

Both Root and R interface to Python -- translate with Python as an intermediary? Not sure if that's performant enough.