automatic data selection for in situ analysis

Science Problem: Cognitive capacity (human/scientist understanding), storage and I/O have not kept up with our capacity to generate massive amounts physics-based simulation data. Data must be reduced, but which data do we keep or look at?

Technical Solution: Apply automatic data selection operations using various metrics and criteria to emulate the exploration during physics simulations (in situ) and make the data bandwidth use more “information dense.” Physics simulations are reduced to a representative set based on the selected metric that is used.

Science Impact: Scientist and analyst time is more efficient and bandwidth (cognitive, storage and I/O) is “richer,” as more scientifically-relevant simulation information is packed into it.

Woodring, Myers, Nouanesengsy, Wendelberger, Patchett, Fasel, Ahrens

Automatic Data Selection for In Situ Analysis

discovery

events flow control(lightweight)

(heavyweight)

detectors

triggers

dynamic computationalraw data

productsdata

Combining Statistics, Sampling, and Indexing to Quantitatively Scale Massive Scientific Data

Science Problem: Climate and cosmological analysis and exploration via queries on their massive data sets may still result in a massive amount of data for the scientist to comb through and/or transfer.

Technical Solution: Combine bitmap indexing with stratified random sampling (stratified by bins, random within bins) with precalculated errors to quickly and quantitatively scale data.

Science Impact: Climate and cosmological scientists are able to interactively and automatically scale data queries down to smaller samples with automatic error estimates, accelerating their scientific analysis workflow.

Su, Agrawal, Woodring, Myers, Wendelberger, and Ahrens. “Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices.” To appear in HPDC 2013.

Science Problem: Storage and I/O have not kept up with climate and cosmological simulation capacity to generate data. Data must be reduced, but what has been lost in the process of reduction?

Technical Solution: After every reducing transformation (T), assess and record the residuals/errors via compact measures (M) comparing reduced (a) to “original” data (A) by reconstructed data (B).

Science Impact: Measuring the data uncertainty via a quantitative quality provenance shows that climate and cosmological scientists are able to use reduced data for scientific analysis.

Woodring, Shafii, Biswas, Myers, Wendelberger, Hamann, and Shen. “Metrics and Workflow for Quantifying the Quality of Reduction Transformations on Large-Scale, Scientific Scalar Data.” Submitted to LDAV 2013.

Metrics and Workflow for Quantifying the Quality of ReductionTransformations on Large-Scale, Scientific Scalar Data

Expected Impact Milestones and Status

Novel Ideas

Principal Investigator: James Ahrens et al., LANL

Sept. 25, 2013

In our first year, we have developed several data selection algorithms, designed a prototype visualization and analysis system that utilizes selected data, and quantified the effects on selecting scientific data.Milestone Expected Actual• Data Selection Algorithms 03/13

03/13• Data Product Explorer 09/13

09/13• Selection Quality Measurements 09/13

09/13• Advanced Selection Algorithms 03/14• Perceptually-Driven Presentation 03/14• Selection in Data Product Explorer 09/14• Domain-Driven Selection 03/15• Bandwidth Utilization Adjustment 09/15• Advanced Presentation Methods 09/15

• Save compute time as only selected data are stored with limited I/O bandwidth and capacity

• Not all data can be saved from an exascale simulation, therefore we must be prescriptive on what data are saved

• We will provide data selection algorithms• Save analysis time as selected data are

presented to the scientist• The resulting data will still be massive,

more than any one scientist can look at• Interaction, query, and highlighting

methods drive scientists to view key data

Perception, cognitive capacity, and computer bandwidth are limited, but the scale of data continues to increase. In order to save both compute cycles and analyst time, we explicitly select and present data to the scientist:• Time, space, and variable selection algorithms• Store and index selected data products• Interactive presentation and query methods• Illustratively and artistically highlight to

perceptually drive scientists to key data

IMD Exploration of Exascale In Situ Visualization and Analysis Approaches

Exascale and “Big Data”running simulations, running experiments, static repositories, etc.

Data Selection Algorithms (time, space, variable, product, etc.)

Analysis with Perceptually-Driven Highlighting

raw image geo-metry …

automatic data selection for in situ analysis

Documents

reduced data

data sampling

data bandwidth

data queries

selected data

original data

data uncertainty

scientific scalar data