The Phystat Repository
For Physics Statistics Code
M. Fischler, J. Linnemann, M. Paterno, P. Canal
phystat.org
SAMSI, March 7, 2006, Duke University
The phystat.org Repository
• A broadly accessible collection of
– Tools and utilities
– Modules and Libraries
– Code fragments and technical documentation
Pertaining to statistics used in physics
• Idea emerged as an adjunct of the PHYSTAT Conferences on Statistical Problems in Particle Physics, Astrophysics, and Cosmology
– Small workshop held in August at FNAL
Observations at PHYSTAT and at the Workshop:
• Many of the papers presented at PHYSTAT05 (Oxford) and 03 (SLAC) would benefit from a common place to cite code and technical expositions concerning statistics techniques
– Citing a package for more detail about what was done in a physics publication is a primary motivator for the Phystat repository
• Many of the participants have code modules and tools which they would like to make more readily available to the physics community
The Useful Statistics Repository Would Contain
• Tools and utilities
– Useful stand-alone packages
• Modules and Libraries
– Working code intended as building blocks for others’ programs
• Major Integrated Toolsets
• Code fragments
– Illustrating the precise statistical algorithms applied to major experiments’ analyses
– Not necessarily intended to run intact outside their original environment
• Technical documentation of statistical algorithms
– Perhaps more detailed than would be appropriate for archival journal papers
Does Such a Repository Have To Be Created?
• Existing arXiv-style repositories
– Are not a place for code and libraries
• Existing code repositories (e.g., SourceForge, R Project)
– Would not be appropriate for code fragments or expositions documenting experiments’ algorithms
– Physics Statistics code would get lost in the mass of packages
• Code collections by individual physicists
– Continuity issues: Will it be there in 10 yrs?
The Phystat Repository Strategy
• Institutional responsibility is key
– To ensure that archived material will remain available over time
– Assigned package numbers (e.g., PHYSTAT/0603-001/v2) will be suitable for use as citations, without concern that they will become invalid
• We should be as inclusive as possible
– No restrictions based on which platforms or languages a package works with
– No acceptance/refereeing wrestling
– The broadest possible acceptance of licensing approaches
• Don’t be too ambitious
– The repository content will come from the community, not from the repository maintainers
phystat.org
• Universal download access
– Sophisticated search and browsing aids
– Multi-view classification of contents
• Mildly moderated content submission
– As unrestrictive as possible
• Support for value added
– User comments
– Validation and endorsement
• FNAL Computing Division commitment
– Support for site mechanism, archival storage, and content moderation
Intended Scope of the Repository
• Hypothesis testing
– Model comparison
– Classical and Bayesian tests
• Fitting/parameter estimation
• Limit setting
• Categorization
– Decision tree, Neural Net, …
• Random Distribution Generation
• {Your suggestions here}
– E.g., if people feel Phystat is a good place to share tracking algorithms, it can be flexible
Using the phystat.org Repository
• www.phystat.org – organized using Plone
• Main page has:
– How-to instructions (and links) for
• Finding packages
• Submitting/modifying a package
• Commenting, validating, and so forth
– Links to all the PHYSTAT conferences
– Links to related web resources
– Navigation to each type of package
– Search tools
Phystat.org
Using the phystat.org Repository
• Navigation leads to several types of page:
– Package lists
• Created dynamically as result of searches or selection of categories of packages
• Contain names, one-line descriptions
– Package pages
• Full description of one package
• Download button
– Submit-a-package form
• Fields for descriptions, uploads
Using the phystat.org Repository
• Searches by
– Category
• Executable utility, Library, Code Fragment, Root macro…
– Language
• C++, R, Python, Fortran, …
– Purpose
• Fitting, categorization, hypothesis testing
– Keywords
• Package pages
– Description
– Download
• Multiple versions allowed
– User discussion
– Validation links
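The search facets listed on this slide (category, language, purpose, keywords) can be sketched as a simple filter over catalogue entries. This is only an illustration, not how the Plone site implements search: the class PackageEntry, the function search, and the three "demo-" packages are all hypothetical and invented for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class PackageEntry:
    """Hypothetical catalogue entry; field names are illustrative only."""
    name: str
    one_line: str          # one-line description shown in package lists
    category: str          # e.g. "Library", "Code Fragment"
    language: str          # e.g. "C++", "R", "Python", "Fortran"
    purpose: str           # e.g. "Fitting", "Categorization"
    keywords: FrozenSet[str] = frozenset()


def search(
    entries: Sequence[PackageEntry],
    category: Optional[str] = None,
    language: Optional[str] = None,
    purpose: Optional[str] = None,
    keyword: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Return (name, one-line description) pairs matching every given facet.

    A facet left as None is ignored; comparisons are case-insensitive.
    """
    def same(value: str, wanted: Optional[str]) -> bool:
        return wanted is None or value.lower() == wanted.lower()

    hits = []
    for e in entries:
        if not (same(e.category, category)
                and same(e.language, language)
                and same(e.purpose, purpose)):
            continue
        if keyword is not None and keyword.lower() not in {k.lower() for k in e.keywords}:
            continue
        hits.append((e.name, e.one_line))
    return hits


# Made-up entries, used only to exercise the filter.
CATALOGUE = [
    PackageEntry("demo-fitter", "Toy least-squares fitter",
                 "Library", "C++", "Fitting", frozenset({"chi-square"})),
    PackageEntry("demo-limits", "Toy upper-limit calculator",
                 "Executable utility", "Fortran", "Limit setting", frozenset({"poisson"})),
    PackageEntry("demo-tree", "Toy decision-tree classifier",
                 "Code Fragment", "C++", "Categorization", frozenset({"decision tree"})),
]

if __name__ == "__main__":
    for name, one_line in search(CATALOGUE, language="C++"):
        print(f"{name}: {one_line}")
```

The result is a package list in the sense of the earlier slide: names plus one-line descriptions, built dynamically from the chosen facets.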
Submitting Content
• The author should prepare:
– A package name
– A one-line description (suitable for reading in lists of packages)
– A full description (a paragraph suitable to let users decide whether to download)
– Tarball containing
• Code (if applicable)
• Build tools (if applicable)
• Documentation (if available)
• Test/sample data (if available)
• Scripts that would reproduce figures from a paper (if applicable)
– Answers to:
• type, purpose, language, platforms
• Pulldowns make entering these easy
– (Optional) keywords
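An author could check a package against this preparation list before opening the submission form. The sketch below is a hypothetical pre-flight check, not part of phystat.org: the Submission class, its field names, and the problems function are invented for illustration. Treating every item except keywords as required is also an assumption, since the slide marks only keywords as optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Submission:
    """Hypothetical record mirroring the preparation list on the slide."""
    name: str = ""
    one_line: str = ""
    full_description: str = ""
    package_type: str = ""       # "type" on the slide
    purpose: str = ""
    language: str = ""
    platforms: List[str] = field(default_factory=list)
    tarball: Optional[str] = None            # path to the prepared tarball
    keywords: List[str] = field(default_factory=list)   # optional


def problems(sub: Submission) -> List[str]:
    """List what is still missing; an empty list means ready to submit."""
    issues = []
    text_fields = ("name", "one_line", "full_description",
                   "package_type", "purpose", "language")
    for label in text_fields:
        if not getattr(sub, label).strip():
            issues.append(f"missing {label}")
    if not sub.platforms:
        issues.append("missing platforms")
    if not sub.tarball:
        issues.append("missing tarball")
    # The one-line description is meant to be read in package lists.
    if "\n" in sub.one_line.strip():
        issues.append("one_line spans several lines")
    return issues


if __name__ == "__main__":
    draft = Submission(name="demo-fitter", one_line="Toy least-squares fitter")
    for issue in problems(draft):
        print(issue)
```

Keywords never appear in the report, in line with their optional status on the slide.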
Submitting Content
• “Come as you are” philosophy
– Don’t want to discourage busy physicists from submitting citable work because documentation is in poor shape
• Goal is that submitting a prepared package will take five minutes or less
– Check boxes for type, purpose, language
– Pulldown list for keywords
• Package will become publicly visible after moderator verifies it is suitable
Policies
• This is a code (and papers) repository
– Packages contain source code and/or technical or theoretical documentation
– Build instructions and files should be included where relevant
– phystat.org does not distribute executables
• (Loose) Content Control
– Must be relevant to some area of physics
– Must be related to statistics, probability, fitting, categorization, or similar area
– The moderator(s) are not trying to be judges of quality
Policies
• License Issues
– Submitters must agree to let our site freely distribute the package (of course)
– Submissions are allowed to attach whatever license agreements they wish
• As long as we can distribute the package
– The author – not the repository – is responsible for any enforcement of copyright and license issues.
• Repository “held harmless” against improper use by downloaders
Policies
• Steering Committee
– 5-10 people active in statistics in physics
– Probable initial configuration includes:
• Jim Linnemann (initial chair) (Atlas, D0)
• Louis Lyons (CDF)
• Harrison Prosper (D0, CMS, Cosmology)
• Glen Cowan (PDG statistics editor)
• Kyle Cranmer (Atlas)
• Roger Barlow (Babar)
– Meet primarily by e-mail
– Set policies, directions of value-added work, and so forth
Repository Support Activities (“Phase I of Phystat”)
• Establishment of web site
– With mechanisms for browsing, submission/updating, and discussion
– With assignment of submission numbers suitable for use as citations in papers
• Licensing and filtering policies
– Must satisfy FNAL/DOE criteria
• Community consensus on content policies
– And formation of steering committee
• Dissemination of info about Phystat
Value-Added Activities (“Phase II”)
• These are all potential
– Depending on community desires and time available
– Some done by supporters/moderators
– Others depend on participation by outside physicists
• Classification/validation related:
– Distinguish actively maintained usable packages from archival entries
– Organizing user feedback synopsis
– Lists of known working platforms for pkgs
– Basic functional validation/certification
– Organization of community comparisons among packages
Possible Value-Added Activities (“Phase II”)
• Extending Scope
– Keep a “code wanted” list
• People express needs for specific capabilities
– Looking for and interfacing to relevant software produced by stats community
– Blobel: how about mathematical methods?
• Improving Capabilities
– Integrating related packages
– Soliciting/supporting/adding extensions to submitted code
– Portability enhancements
You can make phystat.org Valuable
• Add to the Contents of phystat.org
– Submit packages to be disseminated
– Submit code fragments defining how your analysis did statistics
• You can reliably cite your submitted code by its phystat number, much like a paper in arXiv, e.g. Prosper et al., phystat.org/0603004 or Prosper et al., phystat.org/0603004v2
– Submit documents explaining choices of statistical approaches
• phystat.org is pretty empty today
– But there is a large backlog of code and tools potentially valuable to the HEP community!
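A citation such as phystat.org/0603004v2 can be split into its parts with a few lines of code. The slides do not define the identifier layout, so the sketch below assumes, by analogy with arXiv, four digits of year and month, a three-digit serial number, and an optional version suffix. The names PhystatId, parse_phystat_id, and format_phystat_id are hypothetical. The slash-separated form shown on the strategy slide (PHYSTAT/0603-001/v2) is deliberately not handled.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: YYMM + 3-digit serial + optional "vN" (inferred, not documented).
_ID = re.compile(
    r"^(?:phystat\.org/)?(?P<yymm>\d{4})(?P<serial>\d{3})(?:v(?P<version>\d+))?$"
)


class PhystatId(NamedTuple):
    yymm: str                 # e.g. "0603", read here as March 2006
    serial: int               # running number within that month (assumed)
    version: Optional[int]    # None when the citation names no version


def parse_phystat_id(text: str) -> PhystatId:
    """Parse 'phystat.org/0603004v2' or '0603004' under the assumed layout."""
    m = _ID.match(text.strip())
    if m is None:
        raise ValueError(f"not a recognised phystat identifier: {text!r}")
    version = m.group("version")
    return PhystatId(
        yymm=m.group("yymm"),
        serial=int(m.group("serial")),
        version=int(version) if version is not None else None,
    )


def format_phystat_id(pid: PhystatId) -> str:
    """Rebuild the citation string from its parts."""
    base = f"phystat.org/{pid.yymm}{pid.serial:03d}"
    return base if pid.version is None else f"{base}v{pid.version}"


if __name__ == "__main__":
    print(parse_phystat_id("phystat.org/0603004v2"))
    print(format_phystat_id(parse_phystat_id("0603004")))
```

Keeping the version optional mirrors the two example citations on the slide: one names the package as a whole, the other a specific version of it.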
What Next
• Make use of phystat.org!
– Browse for packages you may be able to use
– Browse to see how various experiments tackled your statistics issues
– Use repository to download versions of major packages
• Add value to packages
– Validation and endorsement comments
– Report problems and make suggestions
• Comment about repository mechanics