
Data Logistics in Particle Physics: Ready or Not, Here it Comes…

Prof. Paul Sheldon, Vanderbilt University


Outline

How Strange is the Universe? 5 Modern Mysteries.

In trying to resolve these mysteries, particle physicists face a significant data logistics problem.

Solution should be flexible enough to encourage the creative approaches that will maximize productivity.

REDDnet breaks “data-tethered” compute model, allows unfettered access w/o strong central control.

Is the Universe Even Stranger Than We Have Imagined?

One piece of evidence: rotational velocities of stars in galaxies

Pick a star, how fast is it moving around galactic center?

Mass of galaxy is much, much larger than you get by counting the stars in the galaxy

$v^2 = \frac{GM}{r}$ (1st Year Physics!)
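The first-year relation between orbital speed and enclosed mass, v^2 = GM/r, can be turned around to estimate the mass inside a star's orbit from its measured speed. The sketch below is only an illustration: the speed (220 km/s) and radius (8 kpc) are assumed example inputs, not numbers from this talk, and the function names are made up for the example.

```python
import math

G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30   # solar mass, kg
KPC = 3.086e19     # one kiloparsec, m


def enclosed_mass(v_m_per_s, r_m):
    """Mass inside radius r implied by circular speed v, from v^2 = G*M/r."""
    return v_m_per_s ** 2 * r_m / G


def circular_speed(mass_kg, r_m):
    """Circular orbital speed at radius r around an enclosed mass."""
    return math.sqrt(G * mass_kg / r_m)


# Assumed example inputs (not from the talk): a star at 8 kpc moving at 220 km/s.
v = 220e3
r = 8 * KPC
m = enclosed_mass(v, r)
print(f"Implied enclosed mass: {m / M_SUN:.2e} solar masses")
```

Comparing a dynamical mass obtained this way with the mass found by counting stars is the comparison the rotation-curve argument relies on.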

We Don’t Know What The Majority of Matter in the Universe Is.

This “extra” matter is 90% of the Universe!

Conventional explanations have mostly been ruled out
– Planets, dust, …

Most of the matter in the Universe is probably an exotic form of matter — heretofore unknown!

But there is a good chance particle physicists will make some soon at the LHC at CERN!

~10% normal matter, 90% “other” matter

5 Mysteries for a New Millennium

What is the majority of matter in the universe made of?

Does space have more than three dimensions?

Where is all the anti-matter created by the Big Bang?

What is this bizarre thing called “Dark Energy?”

Why do things have mass?

Answering These Questions Presents Many Challenges…

Experiments require significant infrastructure, large collaborations: 2500 Physicists!

CERN Large Hadron Collider: 2007 Start

27 km tunnel in Switzerland & France (100 m below ground)

CMS

Petascale Computing Required

[Figure: CMS computing requirement by year. Vertical axis: MSI2000, 0 to 350. Horizontal axis: Year, 2007 to 2010. Annotation: 2008: ~50,000 8 GHz P4s]

CMS will generate Petabytes of data per year and require Petaflops of CPU…

But physics is done in small groups, geographically distributed

Distributed Resources, People

Why Distributed Resources?
• Sociology
• Politics
• Funding

To maximize the quality and rate of scientific discovery, all physicists must have equal ability to access and analyze the experiment's data…

CMS Collaboration: >37 Countries, >163 Institutes

LHC Data Grid Hierarchy

[Figure: LHC data grid hierarchy]

Experiment, Online System

Tier 0 +1: CERN Center (PBs of Disk; Tape Robot)

Tier 1: FNAL Tier1, IN2P3 Tier1, INFN Tier1, RAL Tier1

Tier 2: Caltech Tier2, UERJ Tier2, other Tier2 Centers

Tier 3: Institutes, Vanderbilt Tier3

Tier 4: Workstations/Laptops

Other labels in the figure: Physics data cache

Link speeds labeled in the figure: ~PByte/sec, ~150-1500 MB/s, 10-40+ Gbps, 10 Gbps, 1 to 10 Gbps, 1-10 Gbps

>10 Tier1 and ~100 Tier2 Centers

The small Analysis Groups doing the physics work at the Tier 3/4 Level.

Data Logistics Yin and Yang

Uncertainty reigns at the most important level — where the physics will get done.

Physicists will evolve novel use cases that will not jibe with expectations or any plans/rules/edicts.

Tier   | High Level Control  | Infrastructure Ready? | Tested | Use Cases
Tier 0 | Strong, Centralized | Most                  | Much   | Understood
Tier 4 | Anarchy             | Little/None           | None   | ?????

Use Cases: What we Do Know

Physicists will:
• need access to 10-100 TB Data Sets for short term periods.
• run over this data many times, refining, improving their analysis.
• use local computing resources where they may not have much storage available.
• make “opportunistic use” of compute resources at Tier 3 sites and Grid sites.
• perform “production runs” at Tier 2 sites.
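To get a feel for what moving a 10-100 TB data set means, a back-of-the-envelope estimate of transfer time helps. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits/s) and, by default, a link that is fully used. The 1 and 10 Gbps values echo the link speeds quoted for the lower tiers; the function name and the `efficiency` parameter are made up for this illustration.

```python
def transfer_time_hours(size_tb, link_gbps, efficiency=1.0):
    """Hours needed to move size_tb terabytes over a link_gbps link.

    efficiency is the assumed fraction of the nominal link rate actually
    achieved (1.0 means the link is fully used, which is optimistic).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600.0


for size_tb in (10, 100):
    for gbps in (1, 10):
        hours = transfer_time_hours(size_tb, gbps)
        print(f"{size_tb:>4} TB over {gbps:>2} Gbps: {hours:7.1f} h")
```

Even under these ideal assumptions, 100 TB over a 1 Gbps link takes more than nine days, which is why where the data lives matters so much for short-term, repeated analysis passes.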

REDDnet at Tier 3

Opportunistic computing vs data-tethered computing
– CMS has no formal solution for Tier 3 storage
– Compute on resources, even those where data not hosted

On-demand working storage
– improve data logistics
– Acts local: familiar user tools

Demonstrate at a Tier 3
– Performance
– Reliability
– … and convenience
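The difference between data-tethered and opportunistic computing can be shown with a toy site-selection model. This is purely illustrative and is not REDDnet's or CMS's software: the `Site` class, the site names, and the dataset name are all hypothetical, and the model simply assumes that in the opportunistic case jobs can read their input from shared working storage over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str                      # hypothetical site label
    free_cpus: int                 # idle CPUs available right now
    hosted: set = field(default_factory=set)  # datasets stored locally


def tethered_sites(sites, dataset):
    """Data-tethered: only sites that host the dataset AND have free CPUs."""
    return [s.name for s in sites if dataset in s.hosted and s.free_cpus > 0]


def opportunistic_sites(sites):
    """Opportunistic: any site with free CPUs; input is assumed to be
    readable from shared working storage rather than local disk."""
    return [s.name for s in sites if s.free_cpus > 0]


# Hypothetical snapshot: the only site hosting the data is fully busy.
sites = [
    Site("site_a", free_cpus=0, hosted={"dataset_x"}),
    Site("site_b", free_cpus=200),
    Site("site_c", free_cpus=50),
]

print("tethered:     ", tethered_sites(sites, "dataset_x"))
print("opportunistic:", opportunistic_sites(sites))
```

In this snapshot the tethered model finds nowhere to run, while the opportunistic model can use the idle CPUs at the two sites that do not host the data.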

REDDnet

SC06 Depots

Near Term Plan of Work

Provide T3 scratch space

Host/mirror popular datasets on REDDnet

Participate in Data and Service Challenges
– Summer 07 Challenge Starting Soon
– Network and Data Transfer Load tests

Integrate with existing CMS tools

Develop a Tier 3 Analysis environment
– Initial small test community
– Test with individual analyses
– Run on the Grid
