Enabling Knowledge Discovery in a Virtual Universe
Harnessing the Power of Parallel Grid Resources for Astrophysical Data Analysis
Jeffrey P. Gardner, Andrew Connolly, Cameron McBride
Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University
How to turn simulation output into scientific knowledge (circa 1995, using 300 processors)
Step 1: Run simulation
Step 2: Analyze simulation on workstation
Step 3: Extract meaningful scientific knowledge
(happy scientist)
How to turn simulation output into scientific knowledge (circa 2000, using 1,000 processors)
Step 1: Run simulation
Step 2: Analyze simulation on server (in serial)
Step 3: Extract meaningful scientific knowledge
(happy scientist)
How to turn simulation output into scientific knowledge (circa 2006, using 4,000+ processors)
Step 1: Run simulation
Step 2: Analyze simulation on ???
(unhappy scientist)
Mining the Universe can be (Computationally) Expensive
The size of simulations is no longer limited by computational power; it is limited by the parallelizability of data analysis tools.
This situation will only get worse in the future.
How to turn simulation output into scientific knowledge (circa 2012, using 100,000 processors?)
Step 1: Run simulation
Step 2: Analyze simulation on ???
By 2012, we will have machines with many hundreds of thousands of cores!
The Challenge of Data Analysis in a Multiprocessor Universe
Parallel programs are difficult to write! Steep learning curve for parallel programming.
Parallel programs are expensive to write! Lengthy development time.
The parallel world is dominated by simulations: code is often reused for many years by many people, so you can afford to invest lots of time writing it.
Example: GASOLINE (a cosmology N-body code) required 10 FTE-years of development.
The Challenge of Data Analysis in a Multiprocessor Universe
Data analysis does not work this way: rapidly changing scientific inquiries mean less code reuse.
Simulation groups do not even write their analysis code in parallel!
The data mining paradigm mandates rapid software development!
How to turn observational data into scientific knowledge
Step 1: Collect data
Step 2: Analyze data on workstation
Step 3: Extract meaningfulscientific knowledge
(happy astronomer)
The Era of Massive Sky Surveys
Paradigm shift in astronomy: sky surveys.
Available data is growing at a much faster rate than computational power.
Good News for “Data Parallel” Operations
Data Parallel (or “Embarrassingly Parallel”) example: 1,000,000 QSO spectra. Each spectrum takes ~1 hour to reduce, and each spectrum is computationally independent of the others.
There are many workflow management tools that will distribute your computations across many machines (a minimal sketch follows below).
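As a hedged illustration of this pattern, here is a minimal C sketch that stripes independent spectrum reductions across MPI ranks (the talk mentions MPI later as the lowest common denominator). The reduce_spectrum() helper and the in-memory task numbering are hypothetical placeholders, not part of any workflow tool mentioned here.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical stand-in for reducing one spectrum (~1 hour of real work). */
    static void reduce_spectrum(long id) {
        printf("reducing spectrum %ld\n", id);
    }

    int main(int argc, char **argv) {
        const long n_spectra = 1000000L;   /* 1,000,000 QSO spectra */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each spectrum is computationally independent, so a rank-strided
           loop needs no communication at all: embarrassingly parallel. */
        for (long i = rank; i < n_spectra; i += size)
            reduce_spectrum(i);

        MPI_Finalize();
        return 0;
    }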
Tightly-Coupled Parallelism (what this talk is about)
Data and computational domains overlap
Computational elements must communicate with one another
Examples: group finding, N-point correlation functions, new object classification, density estimation. (A pair-counting sketch follows below.)
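To make the tightly-coupled flavor of these examples concrete, here is a hedged serial sketch of the pair counting at the heart of a 2-point correlation function. In a parallel run, pairs can straddle two processors' domains, which is exactly why the computational elements must communicate. All names here are illustrative, not taken from N tropy.

    #include <math.h>
    #include <stddef.h>

    /* Illustrative particle type: a point in 3-D space. */
    typedef struct { double x, y, z; } Particle;

    /* Count pairs separated by r_min <= r < r_max (serial sketch).
     * In a tightly coupled parallel version, pairs that straddle
     * domain boundaries force interprocessor communication. */
    long count_pairs(const Particle *p, size_t n, double r_min, double r_max) {
        long count = 0;
        for (size_t i = 0; i < n; i++) {
            for (size_t j = i + 1; j < n; j++) {
                double dx = p[i].x - p[j].x;
                double dy = p[i].y - p[j].y;
                double dz = p[i].z - p[j].z;
                double r = sqrt(dx*dx + dy*dy + dz*dz);
                if (r >= r_min && r < r_max)
                    count++;
            }
        }
        return count;
    }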
The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
Build a library that is:
Sophisticated enough to take care of all of the nasty parallel bits for you.
Flexible enough to be used for your own particular astrophysics data analysis application.
Scalable: scales well to thousands of processors.
The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
Astrophysics uses dynamic, irregular data structures: astronomy deals with point-like data in an N-dimensional parameter space.
The most efficient methods on this kind of data use space-partitioning trees.
The most common data structure is the kd-tree (a node-layout sketch follows below).
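As a hedged sketch of what such a tree looks like in code (the field names are illustrative, not N tropy's actual layout), a kd-tree node recursively splits point-like data along one coordinate axis:

    #include <stddef.h>

    #define NDIM 3   /* e.g. 3-D spatial data; parameter spaces may use more */

    /* Illustrative kd-tree node: each node bounds an axis-aligned box and
     * splits its particles along one coordinate axis. */
    typedef struct KDNode {
        double bnd_min[NDIM];   /* lower corner of this cell's bounding box */
        double bnd_max[NDIM];   /* upper corner                             */
        int    split_dim;       /* axis along which this node splits        */
        double split_pos;       /* position of the splitting plane          */
        size_t first, last;     /* index range of particles in this cell    */
        struct KDNode *left;    /* particles below the split                */
        struct KDNode *right;   /* particles at or above the split          */
    } KDNode;

Tree walks prune whole subtrees by testing their bounding boxes against the query, which is what makes space-partitioning methods efficient on point-like data.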
Challenges for scalable parallel application development:
Things that make parallel programs difficult to write: thread orchestration, data management.
Things that inhibit scalability: granularity (synchronization), load balancing, data locality.
Overview of existing paradigms: GSA
There are existing globally shared address space (GSA) compilers and libraries: Co-Array Fortran, UPC, ZPL, Global Arrays.
The Good: these are quite simple to use, and they can manage data locality well.
The Bad: existing GSA approaches tend not to scale very well because of fine granularity.
The Ugly: none of these support irregular data structures.
Overview of existing paradigms: GSA
There are other GSA approaches that do lend themselves to irregular data structures, e.g. Linda (tuple space).
The Good: almost universally flexible.
The Bad: these tend to scale even worse than the previous GSA approaches; the granularity is too fine.
Challenges for scalable parallel application development:
Things that make parallel programs difficult to write: thread orchestration, data management.
Things that inhibit scalability: granularity, load balancing, data locality.
(GSA addresses data management and data locality.)
Overview of existing paradigms: RMI
rmi_broadcast(…, (*myFunction));
[Figure: “Remote Method Invocation”. A master thread works through a computational agenda and issues rmi_broadcast(), which invokes myFunction() through the RMI layer on each of processors 0-3. myFunction() is coarsely grained.]
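A hedged sketch of what this coarse-grained pattern looks like from the caller's side: rmi_broadcast() appears on the slide, but the signature, the serial stand-in body, and the helper below are assumptions made for illustration, not a real library API.

    #include <stdio.h>

    /* Work function to run on every processor; kept coarsely grained so
     * that each invocation amortizes the cost of the remote call. */
    static void myFunction(int proc_id, void *arg) {
        printf("proc %d: processing agenda item %d\n", proc_id, *(int *)arg);
    }

    /* Assumed stand-in for the RMI layer: a real implementation would
     * ship the call to every remote processor; here we just loop. */
    static void rmi_broadcast(int n_procs, void *arg,
                              void (*fn)(int, void *)) {
        for (int p = 0; p < n_procs; p++)
            fn(p, arg);
    }

    int main(void) {
        /* The master thread walks its computational agenda and broadcasts
         * one coarse-grained work item at a time. */
        for (int item = 0; item < 4; item++)
            rmi_broadcast(4, &item, myFunction);
        return 0;
    }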
Challenges for scalable parallel application development:
Things that make parallel programs difficult to write: thread orchestration, data management.
Things that inhibit scalability: granularity, load balancing, data locality.
(RMI addresses thread orchestration, granularity, and load balancing.)
N tropy: A Library for Rapid Development of kd-tree Applications
No existing paradigm gives us everything we need.
Can we combine existing paradigms beneath a simple, yet flexible API?
N tropy: A Library for Rapid Development of kd-tree Applications
Use RMI for orchestration. Use GSA for data management.
A Simple N tropy Example: N-body Gravity Calculation
Cosmological “N-body” simulation: 100,000,000 particles, 1 TB of RAM.
[Figure: the 100-million-light-year simulation volume, spatially decomposed across processors 0-8.]
A Simple N tropy Example: N-body Gravity Calculation
ntropy_Dynamic(…, (*myGravityFunc));
[Figure: the master thread’s computational agenda is the list of particles P1, P2, … Pn on which to calculate the gravitational force; ntropy_Dynamic() dispatches myGravityFunc() through the N tropy master layer and the per-processor N tropy thread service layers onto processors 0-3.]
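A hedged sketch of how a caller might hand work to this dispatch loop: ntropy_Dynamic() and myGravityFunc appear on the slide, but the argument list, types, and serial stand-in body below are illustrative assumptions, not the library's documented API.

    #include <stdio.h>

    /* Illustrative user work function: compute the gravitational force on
     * one particle. Behind this call, the tree walk may need data from
     * the entire simulation volume (see the next slide). */
    static void myGravityFunc(long particle_id, void *ctx) {
        (void)ctx;
        printf("force on particle %ld\n", particle_id);
    }

    /* Serial stand-in for the library's dispatcher. The real call farms
     * agenda items out to the thread service layer on each processor;
     * this signature is an assumption for illustration. */
    static void ntropy_Dynamic(long n_items, void *ctx,
                               void (*work_fn)(long, void *)) {
        for (long i = 0; i < n_items; i++)
            work_fn(i, ctx);
    }

    int main(void) {
        ntropy_Dynamic(8, NULL, myGravityFunc);  /* tiny agenda for demo */
        return 0;
    }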
A Simple N tropy Example: N-body Gravity Calculation
Cosmological “N-body” simulation: 100,000,000 particles, 1 TB of RAM.
To resolve the gravitational force on any single particle requires the entire dataset.
[Figure: the 100-million-light-year volume decomposed across processors 0-8; the force on one particle draws on data from every processor’s domain.]
A Simple N tropy Example:N-body Gravity Calculation
[Figure: on each of processors 0-3, myGravityFunc() runs on the N tropy thread service layer, which in turn sits on the N tropy GSA layer; the GSA layer joins the per-processor pieces of the kd-tree (nodes 0-14) into a single globally addressable tree.]
N tropy Performance Features
GSA allows performance features to be provided “under the hood”: interprocessor data caching means fewer than 1 in 100,000 off-PE requests actually result in communication (a cache sketch follows below).
RMI allows further performance features: dynamic load balancing, so the workload can be dynamically reallocated as the computation progresses.
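As a hedged sketch of the caching idea only (the names and structures below are invented for illustration; N tropy's real cache is surely more sophisticated), an off-PE read first checks a local software cache and falls back to communication on a miss:

    #include <string.h>

    #define CACHE_LINES 4096
    #define NODE_BYTES  64

    /* Invented, direct-mapped software cache for remote kd-tree nodes,
     * keyed on global node ID. Initialize valid_id[] to -1 before use. */
    typedef struct {
        long valid_id[CACHE_LINES];
        char data[CACHE_LINES][NODE_BYTES];
    } NodeCache;

    /* Stand-in for the expensive path: fetch a node from the processing
     * element (PE) that owns it (e.g. via a one-sided get). */
    static void fetch_remote_node(long node_id, char out[NODE_BYTES]) {
        memset(out, 0, NODE_BYTES);
        memcpy(out, &node_id, sizeof node_id);
    }

    /* Read a possibly off-PE node: only a cache miss triggers actual
     * communication, which is how fewer than 1 in 100,000 off-PE
     * requests end up on the network. */
    const char *read_node(NodeCache *c, long node_id) {
        long line = node_id % CACHE_LINES;
        if (c->valid_id[line] != node_id) {          /* miss: communicate */
            fetch_remote_node(node_id, c->data[line]);
            c->valid_id[line] = node_id;
        }
        return c->data[line];                        /* hit: purely local */
    }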
N tropy Performance: 10 million particles, spatial 3-point correlation function, 3->4 Mpc
[Figure: timing curves for three configurations: no interprocessor data cache and no load balancing; interprocessor data cache without load balancing; and interprocessor data cache with load balancing.]
Why does the data cache make such a huge difference?
[Figure: myGravityFunc() on processor 0 walking the kd-tree (nodes 0-14). Each particle’s tree walk touches many of the same nodes as its neighbors’ walks, so once a remote node is cached locally, subsequent requests for it never leave the processor.]
N tropy “Meaningful” Benchmarks
The purpose of this library is to minimize development time!
Development time for:
1. Parallel N-point correlation function calculator: 2 years -> 3 months
2. Parallel friends-of-friends group finder: 8 months -> 3 weeks (a serial sketch of the linking criterion follows below)
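For readers unfamiliar with the second benchmark, here is a hedged serial sketch of the friends-of-friends criterion itself: link every pair of particles closer than a linking length, and call each connected component a group. It illustrates the standard algorithm, not N tropy's parallel implementation, and all names are invented.

    #include <stddef.h>

    typedef struct { double x, y, z; } Particle;

    /* Union-find root lookup with path halving: each connected
     * component of linked particles becomes one group. */
    static size_t find_root(size_t *parent, size_t i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];  /* path compression */
            i = parent[i];
        }
        return i;
    }

    /* Friends-of-friends: union every pair closer than link_len.
     * O(n^2) serial sketch; real codes prune pairs with a kd-tree. */
    void friends_of_friends(const Particle *p, size_t n,
                            double link_len, size_t *parent) {
        double ll2 = link_len * link_len;
        for (size_t i = 0; i < n; i++) parent[i] = i;
        for (size_t i = 0; i < n; i++) {
            for (size_t j = i + 1; j < n; j++) {
                double dx = p[i].x - p[j].x, dy = p[i].y - p[j].y,
                       dz = p[i].z - p[j].z;
                if (dx*dx + dy*dy + dz*dz < ll2)
                    parent[find_root(parent, i)] = find_root(parent, j);
            }
        }
    }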
Conclusions
Most approaches for parallel application development rely on a single paradigm, which inhibits both scalability and generality.
Almost all current HPC programs are written in MPI (“paradigm-less”): MPI is a “lowest common denominator” upon which any paradigm can be imposed.
Conclusions
Many “real-world” problems, especially those involving irregular data structures, demand a combination of paradigms
N tropy provides both: Remote Method Invocation (RMI) and Globally Shared Addressing (GSA).
Conclusions
Tools that selectively deploy several parallel paradigms (rather than just one) may be what is needed to parallelize applications that use irregular/adaptive/dynamic data structures.
More information: go to Wikipedia and search for “Ntropy”.