TRANSCRIPT
The Challenges Ahead for Visualizing and Analyzing Massive Data Sets
Hank Childs
Lawrence Berkeley National Laboratory
February 26, 2010
[Example data sets: 27 billion element Rayleigh-Taylor instability (MIRANDA, BG/L); 2 trillion element mesh; 2 billion element thermal hydraulics (Nek5000, BG/P)]
Overview of This Mini-Symposium
• Peterka: we can visualize the results on the supercomputer itself
• Bremer: we can understand and gain insight from these massive data sets
• Childs: visualization and analysis will be a crucial problem on the next generation of supercomputers
• Pugmire: we can make our algorithms work at massive scale
How does the {peta-, exa-} scale affect visualization?
• Large # of time steps
• Large ensembles
• High-res meshes
• Large # of variables
The soon-to-be “good ole days” … how visualization is done right now*
[Diagram: pure parallelism. A parallel simulation code writes pieces of data (P0-P9) to disk; each visualization processor (Processor 0, 1, 2, ...) reads a piece, processes it, and renders it in a parallelized visualization data flow network.]
* Your mileage may vary: Are you running the full machine? How much data do you output?
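To make the pure-parallelism pipeline above concrete, here is a minimal sketch of the per-processor read/process/render loop, assuming an MPI-style decomposition in which each rank takes a round-robin share of the pieces on disk; the reader, filter, and renderer below are stand-ins, not the API of any particular tool.

    # Sketch of "pure parallelism": every rank reads its pieces from disk,
    # processes them, and renders them. I/O typically dominates this loop.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    N_PIECES = 10                                   # e.g., P0..P9 written by the simulation

    def read_piece(piece_id):
        # Stand-in for reading one piece from disk.
        return np.random.rand(64, 64, 64)

    def process(block):
        # Stand-in for a visualization filter (e.g., thresholding or contouring).
        return block > 0.5

    def render(selection):
        # Stand-in for local rendering; real pipelines composite images across ranks.
        return selection.sum()

    for piece_id in range(rank, N_PIECES, nprocs):  # round-robin piece assignment
        image = render(process(read_piece(piece_id)))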
Pure parallelism performance is determined by the number of bytes to process and by I/O rates.
• Vis is almost always >50% I/O, and sometimes 98% I/O.
• The amount of data to visualize is typically O(total memory).
• Relative I/O (the ratio of total memory to I/O rate) is the key metric.
[Chart: FLOPs, memory, and I/O, compared for a terascale machine and a "petascale machine"]
Anecdotal evidence: relative I/O is getting slower.
Machine name        Main memory   I/O rate   Time to write memory to disk
ASC Purple          49.0 TB       140 GB/s   5.8 min
BGL-init            32.0 TB       24 GB/s    22.2 min
BGL-cur             69.0 TB       30 GB/s    38.3 min
Petascale machine   ??            ??         >40 min
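The last column is simply main memory divided by I/O rate. A quick check of the arithmetic (treating 1 TB as 1000 GB, which reproduces the table's figures):

    # Time to write all of main memory to disk = memory size / I/O rate.
    machines = {
        "ASC Purple": (49.0, 140.0),   # (main memory in TB, I/O rate in GB/s)
        "BGL-init":   (32.0, 24.0),
        "BGL-cur":    (69.0, 30.0),
    }
    for name, (mem_tb, io_gb_per_s) in machines.items():
        minutes = mem_tb * 1000.0 / io_gb_per_s / 60.0
        print(f"{name}: {minutes:.1f} min")   # 5.8, 22.2, 38.3 min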
Why is relative I/O getting slower?
• "I/O doesn't pay the bills," and yet I/O is becoming a dominant cost in the overall supercomputer procurement.
• Simulation codes aren't as exposed to the I/O bottleneck as visualization is.
Recent runs of trillion cell data sets provide further evidence that I/O dominates
• Weak scaling study: ~62.5M cells/core

  # cores    Problem size  Type       Machine
  8K         0.5 TZ        AIX        Purple
  16K        1 TZ          Sun Linux  Ranger
  16K        1 TZ          Linux      Juno
  32K        2 TZ          Cray XT5   JaguarPF
  64K        4 TZ          BG/P       Dawn
  16K, 32K   1 TZ, 2 TZ    Cray XT4   Franklin
  (TZ = trillion zones/cells)

2T cells, 32K procs on Jaguar and on Franklin:
- Approx. I/O time: 2-5 minutes
- Approx. processing time: 10 seconds
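Those figures make the imbalance explicit; a back-of-the-envelope calculation of the I/O share of the end-to-end time, using the approximate numbers above:

    # Fraction of total time spent in I/O for the 2T-cell runs
    # (approx. 2-5 minutes of I/O vs. ~10 seconds of processing).
    processing_s = 10.0
    for io_minutes in (2.0, 5.0):
        io_s = io_minutes * 60.0
        fraction = io_s / (io_s + processing_s)
        print(f"I/O = {io_minutes:.0f} min -> {100 * fraction:.0f}% of total time")  # 92%, 97%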
Visualization works because it uses the brain’s highly effective visual processing system.
Trillions of data points → millions of pixels
But is this still a good idea at the peta-/exascale?
• (Note that visualization is often reducing the data … so we are frequently *not* trying to render all of the data points.)
One idea: add more pixels!
• 35M pixel powerwall
• Bonus: big displays act as collaboration centers.
• But the visual acuity of the human eye is <30M pixels! (Source: Sawant & Healey, NC State)
Summary: what are the challenges?
• Scale: we can't read all of the data at full resolution any more. What can we do?
• Insight: there is a lot more data than pixels. How are we going to understand it?
How can we deal with so many cells per pixel?
• What should the color of a single pixel be when, say, 9 cells fall behind it?
  - "Random": pick one of the 9 colors?
  - The average of the 9 colors? (brown)
  - The color of the minimum value?
  - The color of the maximum value?
• We need infrastructure to allow users to have confidence in the pictures we deliver.
[Diagram: nine differently colored cells projecting onto a single pixel]
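As a sketch of what those per-pixel reduction policies look like in code: the function, the policy names, and the use of NumPy below are illustrative choices, not the interface of any particular visualization tool.

    import numpy as np

    def reduce_cells_to_pixel(values, policy="average", rng=None):
        # Collapse the data values of all cells behind one pixel into a single
        # value, which is then pushed through the color map. Different policies
        # can give visibly different pictures.
        if policy == "random":                 # pick one covering cell arbitrarily
            rng = rng or np.random.default_rng()
            return float(rng.choice(values))
        if policy == "average":                # blend; may show a value no cell has
            return float(np.mean(values))
        if policy == "min":                    # emphasize low values
            return float(np.min(values))
        if policy == "max":                    # emphasize high values (e.g., hot spots)
            return float(np.max(values))
        raise ValueError(f"unknown policy: {policy}")

    # Example: nine cells behind one pixel.
    cells = np.array([0.1, 0.2, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 0.9])
    for p in ("random", "average", "min", "max"):
        print(p, reduce_cells_to_pixel(cells, p))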
Data insight often goes far beyond pictures (see Bremer talk)
Multi-resolution techniques use coarse representations first, then refine.
[Diagram: the same parallelized visualization data flow network, but only selected pieces of the data on disk (e.g., P2 and P4) are read as the user "dives" into a region of interest.]
Multi-resolution: pros and cons
• Summary:
  - "Dive" into the data; enough diving recovers the original data (a sketch of the dive follows this list).
• Pros:
  - Avoids the I/O and memory requirements of pure parallelism.
  - Confidence in pictures: the multi-res hierarchy addresses the "many cells to one pixel" issue.
• Cons:
  - Is it meaningful to process a simplified version of the data?
  - How do we generate hierarchical representations? What costs do they incur?
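A minimal sketch of the "dive", assuming the data has already been preprocessed into a hierarchy in which each block stores a coarse sampling of a region plus finer child blocks; the Block class and the refinement criterion below are hypothetical.

    from dataclasses import dataclass, field
    from typing import List
    import numpy as np

    @dataclass
    class Block:
        # One node of a hypothetical multi-resolution hierarchy: coarse samples
        # for a region, plus finer child blocks covering the same region.
        data: np.ndarray
        children: List["Block"] = field(default_factory=list)

    def dive(block, interesting, depth=0, max_depth=3):
        # Visit coarse data first; read and refine children only where needed,
        # so most of the full-resolution data is never touched.
        yield depth, block.data
        if depth < max_depth and interesting(block.data):
            for child in block.children:
                yield from dive(child, interesting, depth + 1, max_depth)

    # Tiny two-level example: refine only where values exceed a threshold.
    interesting = lambda d: d.max() > 0.9
    root = Block(np.array([0.2, 0.95]),
                 children=[Block(np.array([0.1, 0.3])), Block(np.array([0.92, 0.99]))])
    for depth, data in dive(root, interesting, max_depth=1):
        print(depth, data)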
In situ processing does visualization as part of the simulation.
[Diagram: in situ, each processor of the parallel simulation code (Processor 0 through Processor 9) gets access to its data directly, then runs the process and render stages of the parallelized visualization data flow network, with no disk I/O in between.]
In situ: pros and cons
• Pros:
  - No I/O!
  - Lots of compute power available.
• Cons:
  - Very memory constrained.
  - Many operations are not possible.
  - Once the simulation has advanced, you cannot go back and analyze it.
  - The user must know what to look for a priori.
  - It's an expensive resource to hold hostage!
(A minimal in situ hook is sketched after this list.)
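A minimal sketch of what an in situ hook can look like inside a simulation's main loop; the coupling shown here (a visualize_in_situ callback invoked every few time steps) is an illustration, not the interface of any particular simulation or visualization package.

    import numpy as np

    def visualize_in_situ(field, step):
        # Hypothetical vis hook: runs on the simulation's own processors, touching
        # the field in place (no copy, no write to disk). Real couplings must
        # respect the tight memory budget noted above, and the analysis has to be
        # chosen a priori.
        print(f"step {step}: min={field.min():.3f} max={field.max():.3f}")

    def simulate(n_steps=100, vis_every=10):
        field = np.random.rand(64, 64, 64)                    # stand-in simulation state
        for step in range(n_steps):
            field = 0.5 * (field + np.roll(field, 1, axis=0)) # stand-in time step
            if step % vis_every == 0:
                visualize_in_situ(field, step)
        return field

    simulate()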
Now we know the tools … what problem are we trying to solve?
• Three primary use cases:
  - Exploration (examples: scientific discovery, debugging)
  - Confirmation (examples: data analysis, images/movies, comparison)
  - Communication (examples: data analysis, images/movies)
Notional decision process
• Need all of the data at full resolution?
  - No → Multi-resolution (debugging & scientific discovery)
  - Yes → Do you know what you want to do a priori?
    - Yes → In situ (data analysis & images/movies)
    - No → Pure parallelism (anything, and especially comparison)
(The exploration, confirmation, and communication use cases map onto these outcomes.)
There are also roles for more minor techniques that weren't discussed, such as streaming and data subsetting.
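The same notional decision process, written as a tiny function; the boolean inputs are of course a simplification of what is really a judgment call.

    def choose_technique(need_full_resolution, know_what_you_want_a_priori):
        # Notional decision process from the slide above, not a hard rule.
        if not need_full_resolution:
            return "multi-resolution"     # e.g., debugging, scientific discovery
        if know_what_you_want_a_priori:
            return "in situ"              # e.g., data analysis, images/movies
        return "pure parallelism"         # anything, and especially comparison

    print(choose_technique(need_full_resolution=True, know_what_you_want_a_priori=False))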
Prepare for difficult conversations in the future.
• Multi-resolution:
  - Do you understand what a multi-resolution hierarchy should look like for your data?
  - Who do you trust to generate it?
  - Are you comfortable with your I/O routines generating these hierarchies while they write?
  - How much overhead are you willing to tolerate on your dumps? 33+%?
  - Are you willing to accept that your visualizations are not of the "real" data?
Prepare for difficult conversations in the future.
• In situ:
  - How much memory are you willing to give up for visualization?
  - Will you be angry if the vis algorithms crash?
  - Do you know what you want to generate a priori? Can you re-run simulations if necessary?
Summary
• Is there a problem with massive data?
  - Yes: I/O is a major problem.
  - Yes: obtaining insight is a major problem.
• Why is there a problem? Whose fault is it?
  - As we scale up, some things get cheap, while other things (like I/O) stay expensive.
• What can we do about it?
  - Multi-resolution / in situ.
• Will it hurt?
  - Yes.
• Can we do it?
  - Yes; see the next three talks.
• Questions???
• Hank Childs, LBL & UC Davis
• Contact info:
  - [email protected] / [email protected]
  - http://vis.lbl.gov/~hrchilds