R in High Energy Physics (A somewhat personal account)
Adam Lyon, Fermi National Accelerator Laboratory, Computing Division - DØ Experiment
PHYSTAT Workshop on Statistical Software, MSU - March 2004
Outline:
- Some background
- Why is R interesting to us?
- Some non-analysis examples
- Using R in HEP
- Some thoughts on where this can go
Some background on me…
- Graduate student on DØ (a 400 person Fermilab experiment). Marc Paterno and I were some of the first to use C++ for analysis at DØ (in the days of PAW)… and the first at DØ to use Bayesian statistics for a limit calculation.
- Postdoc on CLEO (Cornell, a 200 person experiment). Used PAW, ROOT & Mathematica for several analyses; involved in the experiment's transition to C++.
- Back to DØ (now a 700 person Fermilab experiment) as an associate scientist in the Computing Division. Used R for non-HEP analysis applications. Pondering (with Marc Paterno and Jim Kowalkowski, also of FNAL/CD) how R can be made useful in HEP analyses.
First use of R
- Marc (C++ & statistics expert & troublemaker) came across R and showed it to Jim and me. It looked neat, but we didn't have any reason to use it until…
- Monitoring of DØ's Data Handling System. DØ has 601 Terabytes of data on tape. SAM (a DØ & CDF joint project) is our:
  • File storage system (knows where all files live)
  • File delivery system (gets those files to you worldwide)
  • File cataloging system (stores meta-data for file cataloging)
  • Analysis bookkeeping system (remembers what you did)
Data Handling at DØ
- SAM typically delivers ~150 TB of data to users per month.
- It's perhaps a 0th generation GRID.
- SAM is a very complicated system, with no monitoring except for huge dumps of text and log files.
- Monitoring was sorely needed -- lots of things can go wrong.
- Usage statistics were needed for future planning and discovery of bottlenecks.
samTV

[Screenshot of the samTV monitoring display]
Monitoring with R

Turn a text file like this (from parsing big log files):

station procId  time       event           fromStation  dur
cabsrv1 2983599 1074593577 OpenFile        enstore      9343
cabsrv1 2983599 1074604748 RequestNextFile NA           11171
cabsrv1 2983599 1074609598 OpenFile        enstore      4850
cabsrv1 2983599 1074620392 RequestNextFile NA           10794
cabsrv1 2983599 1074620392 OpenFile        fnal-cabsrv1 0
cabsrv1 2983599 1074631505 RequestNextFile NA           11113
cab     3085189 1076666381 OpenFile        cab          415
cab     3085189 1076673379 RequestNextFile NA           6998
cab     3085189 1076673379 OpenFile        cab          0
cab     3085189 1076680426 RequestNextFile NA           7047
cab     3085189 1076680753 OpenFile        enstore      327
cab     3085189 1076687836 RequestNextFile NA           7083
cab     3085189 1076687836 OpenFile        enstore      0
cab     3085189 1076694821 RequestNextFile NA           6985
cab     3085189 1076695114 OpenFile        cab          293
cab     3085189 1076702701 RequestNextFile NA           7587
cab     3085189 1076702701 OpenFile        enstore      0
cab     3085189 1076710021 RequestNextFile NA           7320
cab     3085189 1076710021 OpenFile        enstore      0
cab     3085189 1076717651 RequestNextFile NA           7630
cab     3085189 1076717651 OpenFile        cab          0

(705,000 more lines like the above!)
Into plots like this… R code:

library(lattice)
d = read.table("data.dat", head=T)
w = d[ d$event=="OpenFile", ]
w$min = w$dur/60.0
bwplot( fromStation ~ min | station, data=w, subset=(min<60),
        xlab="Minutes", main="Wait Time for …" )
Box and Whisker Plots

[Plot: box-and-whisker wait-time distributions by source and station, produced by the bwplot call above]
Why is R interesting to us?
- Seems to be the "state of the art" in statistics.
- Enormous library of user-contributed add-on packages:
  • Huge number of statistical tests, fitting, smoothing, …
  • More advanced stuff too: genetic algorithms, support vector machines, kriging (would have been useful for my thesis!)
- Advanced graphics based on William Cleveland's Visualizing Data.
- SQL (MySQL, Oracle, SQLite, Postgres, ODBC) and XML support (see the sketch below). Hooks to COM and CORBA. Interfaces for Python, Perl, Tk (GUIs), Java.
- Pretty easy interface to C, C++, Fortran.
- Some nice conveniences (R can save its state).
- It's multiplatform. It's free!
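Since database access comes up often in our monitoring work, here is a minimal sketch of what the SQL hooks look like through R's DBI interface, using SQLite as the backend (the package, table, and column names here are my own illustration; the other listed backends work similarly):

library(DBI)
# Open an in-memory SQLite database (a file path works the same way)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Push a small data frame into a table, then query it with plain SQL
dbWriteTable(con, "events",
             data.frame(station = c("cab", "cabsrv1"),
                        dur     = c(415, 9343)))
dbGetQuery(con, "SELECT station, AVG(dur) AS mean_dur
                 FROM events GROUP BY station")
dbDisconnect(con)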
The R Language
- "Not unlike S."
- The author of S (John Chambers) received the 1998 ACM Software System Award. The ACM's citation notes that Dr. Chambers' work "will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers." (http://www.acm.org/announcements/ss99.html)
- I guess he did good!
What Makes the R/S Language Interesting?
- "Programming with Data": the fundamental purpose of the language (as I see it) is to provide general tools for efficient data manipulation and analysis while allowing extensions to those tools to be programmed easily.
- Has a specific purpose. You wouldn't write your online data acquisition system in R/S, but analyzing the output from online monitoring is certainly a good task for it.
- R/S is a functional language: vectorized functions, apply, lazy evaluation.
- R/S is an object oriented language (but with a functional bent): functions with the same name are dispatched based on argument types, and it has notions of inheritance and other OO features (see the sketch below).
- Is R/S ideal? Don't know, but we've been very surprised by how some complicated tasks can be accomplished with astonishingly simple code.
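To make the dispatch point concrete, here is a small sketch of S3-style dispatch (the generic and the classes are invented for illustration): one function name, with the implementation chosen by the class of the argument.

# 'area' is a generic; R dispatches on the class of the first argument
area <- function(shape, ...) UseMethod("area")
area.circle    <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h
area.default   <- function(shape, ...) stop("no area method for this class")

c1 <- structure(list(r = 2),        class = "circle")
r1 <- structure(list(w = 3, h = 4), class = "rectangle")
area(c1)   # dispatches to area.circle:    12.57
area(r1)   # dispatches to area.rectangle: 12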
Some non-analysis examples

samTV: Plot the mean wait times by file source for each SAM station.

> nrow(w)
[1] 399135
> w[1:2,]
  station  procId       time    event fromStation  dur   min
1 cabsrv1 2983599 1074593577 OpenFile     enstore 9343 155.7
2 cabsrv1 2983599 1074609598 OpenFile     enstore 4850  80.8
> w.means = aggregate(w$min, list(station=w$station, src=w$fromStation), mean)
> w.means[1:2,]
  station src           x
1     cab cab 6.861695109
2 cabsrv1 cab 8.171100917

(the aggregation takes 2.2 seconds)
samTV cont'd
> dotplot(src ~ x | station,
data=w.means,
scales=list(cex=1.3),
main=list("Mean Process Wait Times", cex=1.5),
xlab=list("Wait time (minutes)", cex=1.5),
cex=1.7,
par.strip.text= list(cex=1.7) )
Non-analysis Examples

We've found R to be great for slogging through text files and database query results to make extremely useful and pretty plots.
Non-analysis applications
- Performance of DB server middleware (Marc Paterno).
- Data transfer speed vs. data size for two different servers.
- Fit to a model of startup time plus constant throughput:

modpollux = nls( speed ~ alpha*(1 - alpha*beta/(alpha*beta + mb)),
                 data=client[pollux,],
                 start=c(alpha=2.0, beta=0.50), trace=T )
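A hedged follow-up using the names from the call above (and assuming the data points are already plotted): once nls() converges, the fit can be inspected and the fitted curve overlaid on the data.

summary(modpollux)   # parameter estimates with standard errors
# Evaluate the fitted model on a grid and overlay it on an existing plot
mb.grid <- seq(min(client$mb), max(client$mb), length.out = 100)
lines(mb.grid, predict(modpollux, newdata = data.frame(mb = mb.grid)))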
What have we learned so far?
- There seems to be an "R way": do it the functional way! Use the apply commands and vectorized functions instead of for loops, and use higher order functions (see the sketch below).
- One of R's strengths is its user contributions, but this means some functionality is repeated (e.g. three histogram functions -- albeit each serves a slightly different purpose).
- The learning curve is long (R can do lots!), but there are extensive manuals, online documentation, and published books and papers.
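For instance, a small sketch of that "R way", with made-up numbers in the spirit of the samTV data -- the explicit loops disappear into vectorized operations and higher order functions:

dur     <- c(9343, 11171, 4850, 10794, 327, 7630)     # durations in seconds
station <- c("cabsrv1", "cabsrv1", "cabsrv1", "cab", "cab", "cab")

mins <- dur / 60             # vectorized: divides every element, no loop
tapply(mins, station, mean)  # grouped means in one call, no grouping loop

# Higher order style: apply several summaries to the same vector
sapply(list(mean = mean, median = median, max = max),
       function(f) f(mins))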
R in HEP
- We are aware of no one using R, or any other statistical package, in the HEP community. Why?
- Our needs are quite specific and… my postdoc supervisor (Ed Thorndike): "Trust no one" -- "or at least trust no one outside of HEP."
- With very few exceptions, all of our scientific software tools are written within the community. Many people write their own, reinventing lots of wheels.
- Most are unaware of tools from the statistics community and how they could apply to us.
- Many of us (including me) have little to no formal statistical training and had no exposure to statistical tools (e.g. SAS, SPSS, MATLAB, R).
R in HEP: Maybe this is changing, a little
- Root, the most widely used HEP analysis tool, has TGraphSmooth, which implements a Loess smoother (R functions translated into C++).
- Software is getting more complicated (we are doing lots more than just whipping up quick and dirty Fortran). There is some realization that we can't do it all ourselves (e.g. databases; SAM uses consultants).
- But a problem remains: our datasets tend to be huge.
HEP datasets and R
- R seems to want to hold everything in memory (we recently discovered externalVector, but haven't tried it yet).
- In HEP, we typically run successive skims to reduce the data size (601 TB down to hundreds of Meg or a few Gig). There are hard trade-offs between the size and utility of skims. Usually skims are output to a more convenient format (e.g. Root files).
- For example, I use a 4th generation skim with 412 variables and 232K rows (1.9 Gig).
- Even our last-stage skims are probably too large for R. Efficient handling of large datasets is one reason why Root is very successful.
Three strategies for reading HEP data in R

Realize that I don't need all 412 variables for all rows in memory at the same time; in fact I usually concentrate on just a few variables at a time, and can perform even further event requirements.

1. If the data is small enough, bring it into R.
2. If the data can be reduced to something R can hold, bring that subset into R -- you then have the full power of R. Perhaps this means using that data for a while, then loading a new set to tackle another aspect of the problem.
3. If you can't even do the above, then have some R apparatus read the data one row at a time and update an R object (e.g. histograms). [But you don't get the full power of R.] A sketch of this strategy follows.
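A minimal sketch of strategy 3 (the file name, column names, and chunk size are my own illustration, and I read chunk-wise rather than strictly row-by-row): accumulate histogram bin contents while streaming a large text file through R, so the full dataset never has to fit in memory.

breaks <- seq(0, 200, by = 2)           # fixed binning shared by all chunks
counts <- rep(0, length(breaks) - 1)    # running bin contents
con <- file("bigskim.dat", open = "r")  # hypothetical whitespace-separated skim
repeat {
  chunk <- tryCatch(read.table(con, nrows = 10000,
                               col.names = c("pt", "eta", "phi")),
                    error = function(e) NULL)  # NULL once input is exhausted
  if (is.null(chunk)) break
  sel <- chunk$pt >= 0 & chunk$pt < 200        # keep values inside the bins
  h <- hist(chunk$pt[sel], breaks = breaks, plot = FALSE)
  counts <- counts + h$counts                  # update the R-side histogram
}
close(con)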
Reading Root files into R: Do it the R way!

root.apply("myTree", "myFile.root", myFunction)

- The C++ and R code were written over the evenings of one weekend (my wife was out of town, the dog was asleep).
- You supply an R function that receives an entry from your Root file (as a list). The function can make requirements on the data, returning nothing (NULL) if they fail.
- The function returns a new list of variables to pass back to R; these can be new derived variables not in the Root entry.
- The return value of root.apply is a data frame (an R database).
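A tiny hypothetical user function following that contract (the field names under the MET branch are invented; the real example on the next slide uses the actual EM and MET branches):

# Keep only entries with large missing ET; reject the rest with NULL
highMet = function(entry) {
  if (entry$MET$met < 20) return(NULL)   # requirement failed: cut the entry
  list(met  = entry$MET$met,             # variables to keep...
       met2 = entry$MET$met^2)           # ...including new derived ones
}
d = root.apply("myTree", "myFile.root", highMet)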
Example -- selecting dielectrons

# Select events with two good electrons.
# Only the EM and MET branches are needed.
selectDiE = function(entry) {
  # Make a data frame of the electron data
  es = as.data.frame( entry$EM ) ; attach(es)

  # Make the requirements for a good electron
  # (a boolean vector over the electrons, e.g. T,T,F,T,F)
  goodECuts = ( id == 10 | abs(id) == 11 ) &
              pt > 25.0 & emfrac > 0.9 & fiducial == 1

  # If nothing passed, then stop (cut the entry)
  if ( ! any(goodECuts) ) return(NULL)

  # Get electron etas; make the requirements for good etas
  etas = abs(eta)
  goodEtaCuts = etas < 1.05 | ( etas > 1.7 & etas < 2.3 )

  # If no electron had a good eta, then stop
  if ( ! any(goodEtaCuts) ) return(NULL)

  # Join (AND) the requirements: electrons meeting all cuts
  goodEs = goodEtaCuts & goodECuts

  # Make a new data frame with the passing electrons; are there 2 or more?
  goodEsDF = es[goodEs,]
  if ( nrow(goodEsDF) < 2 ) return(NULL)

  # Construct the return list: order passing electrons by decreasing pt
  goodElectronsOrder = order( -goodEsDF$pt )
  e1 = goodEsDF[ goodElectronsOrder[[1]], ]
  names(e1) <- paste("e1", names(e1), sep=".")
  e2 = goodEsDF[ goodElectronsOrder[[2]], ]
  names(e2) <- paste("e2", names(e2), sep=".")

  # Return the two leading electrons plus the MET branch
  return( c(as.list(e1), as.list(e2), entry$MET) )
}
Example -- selecting dielectrons (the same function, with explicit es$ column references instead of attach):

selectDiE = function(entry) {
  # Make a data frame of the electron data
  es = as.data.frame( entry$EM )

  # Make the requirements for a good electron
  goodECuts = ( es$id == 10 | abs(es$id) == 11 ) &
              es$pt > 25.0 & es$emfrac > 0.9 & es$fiducial == 1
  if ( ! any(goodECuts) ) return(NULL)

  # Make the requirements for good etas
  etas = abs(es$eta)
  goodEtaCuts = etas < 1.05 | ( etas > 1.7 & etas < 2.3 )
  if ( ! any(goodEtaCuts) ) return(NULL)

  # Electrons meeting all cuts; require at least two
  goodEs = goodEtaCuts & goodECuts
  goodEsDF = es[goodEs,]
  if ( nrow(goodEsDF) < 2 ) return(NULL)

  # Construct the return list: order the electrons by decreasing pt
  goodElectronsOrder = order( -goodEsDF$pt )
  e1 = goodEsDF[ goodElectronsOrder[[1]], ]
  names(e1) <- paste("e1", names(e1), sep=".")
  e2 = goodEsDF[ goodElectronsOrder[[2]], ]
  names(e2) <- paste("e2", names(e2), sep=".")

  return( c(as.list(e1), as.list(e2), entry$MET) )
}
Analyzing dielectrons

d = root.apply("Global", "mydata.root", selectDiE)

Handed back a data frame with the variables I wanted. Can now attack this data with the full power of R.
Dielectrons

> d = root.apply(…)
> given.met = equal.count(d$met, number=4, overlap=0.1)
> summary(given.met)

Intervals:
         min       max count
1 0.05888367  4.103577  3044
2 3.84661865  6.498108  3045
3 6.21868896  9.914124  3046
4 9.42181396 88.125061  3043

Overlap between adjacent intervals:
[1] 307 306 308

> xyplot(e2.pt ~ e1.pt | given.met, data=d)
Extracting signal and background from data (from Marc Paterno)

Given a data sample, extract the amount of signal and background ("bump fitting") -- a common HEP problem. Try a MC example:
1. Generate data based on a signal distribution (a Breit-Wigner [Cauchy] of given mass and width) and a background distribution (proportional to 1/(a+b*x)^3).
2. Fit this data with the background and signal distributions, but with unknown parameters.
Bump Fitting: Generate the background distribution

- bf returns a function that, when given a uniform random variable on [0,1), returns a value distributed as the background with parameters a and b (inverse-CDF sampling).
- rbackground generates n values from the distribution.
- A clever use of higher order functions and vectorized functions.

bf = function(a, b) {
  # Inverse CDF of the background density 2*a^2*b / (a+b*x)^3:
  # for u uniform on [0,1), x = (a/b)*(1/sqrt(1-u) - 1)
  function(x) { temp = 1 - x; (a/b)*(1/sqrt(temp) - 1) }
}

rbackground = function(n, a, b) {
  transform = bf(a, b)
  transform(runif(n))
}
Bump Fitting: Generate the signal, then make the data

- rsignal generates n random Breit-Wigner values, requiring that the values be positive and less than max; values that fail are thrown away, and the function calls itself recursively to make up the amount that was lost.
- rexperiment joins the signal and background into one distribution.

rsignal = function(n, mass, width, max) {
  temp = rcauchy(n, mass, width)
  # Require the values be positive and less than max; drop failures
  temp = temp[temp > 0 & temp < max]
  num.more = n - length(temp)
  if (num.more > 0) {
    # Recursively generate replacements for the values thrown away
    more = rsignal(num.more, mass, width, max)
    temp = append(temp, more)
  }
  temp
}

rexperiment = function(nsig, mass, width, nback, a, b) {
  # Join the signal and background into one data set
  append(rsignal(nsig, mass, width, a*b/2),
         rbackground(nback, a, b))
}
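A hedged usage sketch (the event counts and parameter values are my own illustration, not the slide's): generate one pseudo-experiment and look at it.

set.seed(1)
# 2000 signal events in a bump at mass 40 (width 3) on top of 8000
# background events with a = 100, b = 2 (so the signal max is a*b/2 = 100)
d <- rexperiment(2000, 40, 3, 8000, 100, 2)
hist(d[d < 100], breaks = 100, main = "Generated signal + background")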
Bump fitting: the fit

- Use an unbinned maximum likelihood fitter (fitdistr, from the MASS package).
- Profiling with Rprof significantly sped up the fit (replacing ^ with repeated multiplication).

library(MASS)   # for fitdistr

dbackground = function(x, a, b) {
  d = a + b*x
  2*a*a*b/(d*d*d)   # multiplication instead of ^, per Rprof
}

mydistr = function(x, f, m, s, a, b) {
  (1-f)*dbackground(x, a, b) + f*dcauchy(x, m, s)
}

fres2 = fitdistr(data, densfun=mydistr,
                 start=list(f=FRAC, m=40.0, s=3.0, a=100, b=2.))
Bump Fitting: results

[Plot: histogram of the generated data with the true distribution, signal fit, background fit, and total fit overlaid; the bottom panel shows the residuals (true - fit).]
What are we considering next?
- Continue to learn more about R.
- Further develop the "Three Strategies".
- Explore doing a physics analysis in R.

Summary
- We are exploring the use of R, a statistical analysis package from the statistics community, in an HEP environment.
- R has already proven useful for analyzing monitoring and benchmarking data.
- We have ideas on how R can be used to read large datasets.
- We've done some "proof of principle" studies of physics analysis with R.
- As we learn more about R, we expect to be more surprised by its capabilities.
Options for R and Root Interfacing (after discussions; there is no interest from the R community in the non-I/O functions of Root)

In order of work required:

1) R and Root remain separate -- use the more appropriate tool for the task. Use text files to communicate between the two if necessary.

2) Root loads R's math and low-level statistical libraries as shared objects. A minimalist approach for some functionality: some access to R's C-level math and statistics functions. These C functions take basic C types, so no translation is necessary. But no upper-level functions written in the R language are available.

3) R and Root remain separate, but with an R package to read Root Trees directly into R data frames. Still use the best tool for a particular task, but now it's easier to get HEP data into R.

4) Allow calling of selected high-level R functions from within Root. Root runs the R interpreter, and translation is necessary: R functions must understand Root objects, and Root must understand R return objects. Exposing only some R functions may reduce the amount of translation.
More Advanced Integration Options

5) An R prompt from the Root prompt. R needs seamless knowledge of the objects in the current Root session; at the end of the R session, new R variables are translated into Root objects. Root runs the R interpreter, with translation for all types of Root variables into R and all types of R variables returned to Root. A major undertaking.

6) A Root prompt from within R. Harder than 5: R is C but Root is C++. I don't see much interest in this.

Things get interesting starting at 3). I have a version 0.0.1 prototype for reading Root trees into R, which is required for option 3 and everything beyond it. I'll try to work on this as time permits.

Both Root and R interface to Python -- translate with Python as an intermediary? Not sure if that's performant enough.