
Page 1:

Peter Fox

Xinformatics – ITEC 6961/CSCI 6960/ERTH-6963-01

Week 11, April 26, 2011

Information integration, life-cycle and visualization

Page 2:

Contents
• Review of last class, reading

• Information quality, uncertainty (again) and bias (again) in a project setting.

• Your projects?

• Last few classes


Page 3:

The semantics of data and information quality, uncertainty and bias and applications in atmospheric science

Peter Fox, and
Stephan Zednik¹, Gregory Leptoukh², Chris Lynnes², Jianfu Pan³

1. Tetherless World Constellation, Rensselaer Polytechnic Inst.

2. NASA Goddard Space Flight Center, Greenbelt, MD, United States

3. Adnet Systems, Inc.

Page 4:

Definitions (ATM)

• Quality – is in the eye of the beholder – worst-case scenario… or a good challenge.

• Uncertainty – has aspects of accuracy (how accurately the real-world situation is assessed; this also includes bias) and precision (down to how many digits).

• Bias has two aspects:
– Systematic error resulting in the distortion of measurement data, caused by prejudice or faulty measurement technique.
– A vested interest, or strongly held paradigm or condition, that may skew the results of sampling, measuring, or reporting the findings of a quality assessment:
• Psychological: for example, when data providers audit their own data, they usually have a bias to overstate its quality.
• Sampling: sampling procedures that result in a sample that is not truly representative of the population sampled. (Larry English)

Page 5:

Acronyms

ACCESS Advancing Collaborative Connections for Earth System Science

ACE Aerosol-Cloud-Ecosystems

AGU American Geophysical Union

AIST Advanced Information Systems Technology

AOD Aerosol Optical Depth

AVHRR Advanced Very High Resolution Radiometer

GACM Global Atmospheric Composition Mission

GeoCAPE Geostationary Coastal and Air Pollution Events

GEWEX Global Energy and Water Cycle Experiment

GOES Geostationary Operational Environmental Satellite

GOME-2 Global Ozone Monitoring Experiment-2

JPSS Joint Polar Satellite System

LST Local Solar Time

MDSA Multi-sensor Data Synergy Advisor

MISR Multiangle Imaging SpectroRadiometer

MODIS Moderate Resolution Imaging Spectroradiometer

NPP National Polar-Orbiting Operational Environmental Satellite System Preparatory Project

Page 6:

Acronyms (cont.)

OMI Ozone Monitoring Instrument

OWL Web Ontology Language

PML Proof Markup Language

QA4EO Quality Assurance for Earth Observation

REST Representational State Transfer

TRL Technology Readiness Level

UTC Coordinated Universal Time

WADL Web Application Description Language

XML eXtensible Markup Language

XSL eXtensible Stylesheet Language

XSLT XSL Transformation

Page 7:

Data quality needs: fitness for purpose

• Measuring Climate Change:
– Model validation: gridded contiguous data with uncertainties
– Long-term time series: bias assessment is a must, especially for sensor degradation, and orbit and spatial sampling changes

• Studying phenomena using multi-sensor data:
– Cross-sensor bias assessment is needed

• Realizing Societal Benefits through Applications:
– Near-real-time transport/event monitoring: in some cases, coverage and timeliness might be more important than accuracy
– Pollution monitoring (e.g., air-quality exceedance levels): accuracy

• Educational (users are generally not well versed in the intricacies of quality; just taking all the data as usable can impair educational lessons): only the best products

Page 8:

Where are we in respect to this data challenge?

“The user cannot find the data;
if he can find it, he cannot access it;
if he can access it, he doesn't know how good they are;
if he finds them good, he cannot merge them with other data.”

– The Users View of IT, NAS 1989

Page 9:

Level 2 data


Page 10:

Level 2 data

• Swath for MISR, orbit 192 (2001)

Page 11:

Level 3 data


Page 12:

MODIS vs. MERIS

Same parameter, same space & time – different results. Why?

[Figure: side-by-side MODIS and MERIS AOD maps]

A threshold used in MERIS processing effectively excludes high aerosol values. Note: MERIS was designed primarily as an ocean-color instrument, so aerosols are “obstacles”, not signal.

Page 13:

Why so difficult?

• Quality is perceived differently by data providers and data recipients.

• There are many different qualitative and quantitative aspects of quality.

• Methodologies for dealing with data quality are just emerging.
• Almost nothing exists for remote sensing data quality.
• Even the most comprehensive review, Batini’s book, demonstrates that there are no preferred methodologies for solving many data quality issues.

• Little funding was allocated in the past to data quality as the priority was to build an instrument, launch a rocket, collect and process data, and publish a paper using just one set of data.

• Each science team handled quality differently.

Page 14:

Spatial and temporal sampling – how to quantify to make it useful for modelers?

• Completeness: the MODIS dark-target algorithm does not work over deserts
• Representativeness: monthly aggregation is not enough for MISR, and even for MODIS
• Spatial sampling patterns are different for MODIS Aqua and MISR Terra: “pulsating” areas over ocean are oriented differently due to the different orbital directions during daytime measurement – a cognitive bias

[Figure: MODIS Aqua AOD, July 2009 and MISR Terra AOD, July 2009]

Page 15:

Quality Control vs. Quality Assessment

• Quality Control (QC) flags in the data (assigned by the algorithm) reflect “happiness” of the retrieval algorithm, e.g., all the necessary channels indeed had data, not too many clouds, the algorithm has converged to a solution, etc.

• Quality assessment is done by analyzing the data “after the fact” through validation, intercomparison with other measurements, self-consistency, etc. It is presented as bias and uncertainty. It is rather inconsistent, and can be found scattered across papers and validation reports.

Page 16:

Challenges in dealing with Data Quality

• Q: Why now? What has changed?
• A: With the recent revolutionary progress in data systems, dealing with data from many different sensors has finally become a reality. Only now is a systematic approach to remote sensing quality on the table.
• NASA is beefing up efforts on data quality.
• ESA is seriously addressing these issues.
• QA4EO: an international effort to bring communities together on data quality.
• Many information and computer science research questions.

Page 17:

Intercomparison of data from multiple sensors

Data from multiple sources to be used together:
• Current sensors/missions: MODIS, MISR, GOES, OMI
• Future decadal missions: ACE, NPP, JPSS, GeoCAPE
• European and other countries’ satellites
• Models

Harmonization needs:
• It is not sufficient just to have the data from different sensors and their provenances in one place.
• Before comparing and fusing data, things need to be harmonized:
– Metadata: terminology, standard fields, units, scale
– Data: format, grid, spatial and temporal resolution, wavelength, etc.
– Provenance: source, assumptions, algorithm, processing steps
– Quality: bias, uncertainty, fitness-for-purpose, validation

Dangers of easy data access without proper assessment of joint data usage: it is easy to use data incorrectly.

Page 18:

Anomaly Example: South Pacific Anomaly

The MODIS Level 3 data-day definition leads to an artifact in the correlation.

Page 19:

…is caused by an Overpass Time Difference


Page 20:

Different kinds of reported data quality

• Pixel-level Quality: algorithmic guess at the usability of a data point
– Granule-level Quality: statistical roll-up of pixel-level quality

• Product-level Quality: how closely the data represent the actual geophysical state

• Record-level Quality: how consistent and reliable the data record is across generations of measurements

Different quality types are often erroneously assumed to have the same meaning.

Ensuring data quality at these different levels requires different focus and action.
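To make the pixel-to-granule roll-up concrete, here is a minimal sketch (not from the slides) that summarizes pixel-level QC flags into granule-level percentages; the QC level convention (0 = best … 3 = do not use) is an assumption for the example.

    # Minimal sketch (assumed data layout): roll pixel-level QC flags up to a
    # granule-level quality summary ("statistical roll-up of pixel-level quality").
    import numpy as np

    def granule_quality_summary(qc_flags: np.ndarray) -> dict:
        """Fraction of pixels at each QC level (assumed 0=best ... 3=do not use)."""
        levels, counts = np.unique(qc_flags, return_counts=True)
        total = qc_flags.size
        return {int(level): count / total for level, count in zip(levels, counts)}

    # Example: a toy 2x4 granule of QC flags.
    qc = np.array([[0, 0, 1, 3],
                   [2, 0, 1, 3]])
    print(granule_quality_summary(qc))  # {0: 0.375, 1: 0.25, 2: 0.125, 3: 0.25}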

Page 21:

Sensitivity of Aerosol and Chl Relationship to Data-Day Definition

Correlation coefficients: MODIS AOT at 550 nm and SeaWiFS Chl

Difference between the correlation of A and B:
A: MODIS AOT on an LST data-day and SeaWiFS Chl
B: MODIS AOT on a UTC data-day and SeaWiFS Chl

Artifact: the difference between using the LST-based and the calendar (UTC-based) data-day.
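The artifact is easiest to see in code. A hedged sketch of the two data-day definitions, assuming local solar time is approximated as UTC + longitude/15 hours (illustrative only, not the operational MODIS code):

    # Minimal sketch of the two data-day definitions. LST is approximated
    # as UTC + lon/15 hours; pixels near the dateline can land in different
    # data-days under the two definitions, which biases daily aggregates.
    from datetime import datetime, timedelta

    def utc_data_day(t_utc: datetime) -> str:
        """Calendar (UTC-based) data-day: the UTC date of the observation."""
        return t_utc.strftime("%Y-%m-%d")

    def lst_data_day(t_utc: datetime, lon_deg: float) -> str:
        """Local-solar-time data-day: shift by lon/15 hours, then take the date."""
        t_lst = t_utc + timedelta(hours=lon_deg / 15.0)
        return t_lst.strftime("%Y-%m-%d")

    # A pixel observed at 23:30 UTC at 170 deg E: the two definitions disagree.
    t = datetime(2009, 7, 1, 23, 30)
    print(utc_data_day(t), lst_data_day(t, 170.0))  # 2009-07-01 2009-07-02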

Page 22:

Sensitivity Study: Effect of the Data-Day definition on the correlation of Ocean Color data with Aerosol data

• Correlation between MODIS Aqua AOD (Ocean group product) and MODIS Aqua AOD (Atmosphere group product)
• Pixel count distribution
• Only half of the Data-Day artifact is present, because the Ocean group uses the better Data-Day definition!

Page 23:

Factors contributing to uncertainty and bias in Level 2

• Physical: instrument, retrieval algorithm, aerosol spatial and temporal variability…

• Input: ancillary data used by the retrieval algorithm

• Classification: erroneous flagging of the data
• Simulation: the geophysical model used for the retrieval
• Sampling: the averaging within the retrieval footprint

(Borrowed from the SST study on error budget)

Page 24:

What is Level 3 quality?

It is not defined in Earth Science…
• If Level 2 errors are known, the corresponding Level 3 error can be computed, in principle
• Processing from L2 → L3 daily → L3 monthly may reduce random noise but can also exacerbate systematic bias and introduce additional sampling bias
• However, at best, standard deviations (mostly reflecting variability within a grid box), and sometimes pixel counts and quality histograms, are provided
• Convolution of natural variability with sensor/retrieval uncertainty and bias – need to understand their relative contributions to differences between data
• This does not address sampling bias
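A sketch of the "in principle" computation above, under the usual textbook assumptions (independent random L2 errors averaged in a grid box; a common systematic bias passes through the mean untouched); all numbers are made up:

    # Illustrative sketch: propagate Level 2 pixel uncertainties to a Level 3
    # grid-box mean, assuming independent random errors. Note what it shows:
    # random noise shrinks as 1/sqrt(N), while a common systematic bias does not.
    import numpy as np

    def l3_mean_and_error(values, sigmas, common_bias=0.0):
        values = np.asarray(values, dtype=float)
        sigmas = np.asarray(sigmas, dtype=float)
        n = values.size
        mean = values.mean() + common_bias           # bias survives averaging
        random_err = np.sqrt(np.sum(sigmas**2)) / n  # random part shrinks ~1/sqrt(N)
        return mean, random_err, common_bias

    vals = [0.20, 0.25, 0.22, 0.19]    # toy L2 AOD retrievals in one grid box
    sigs = [0.05, 0.05, 0.05, 0.05]
    mean, rand_err, bias = l3_mean_and_error(vals, sigs, common_bias=0.03)
    print(mean, rand_err, bias)        # 0.245, 0.025, 0.03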

Page 25:

Why can’t we just apply L2 quality to L3?

Aggregation to L3 introduces new issues where aerosols co-vary with some observing or environmental conditions – sampling bias:
• Spatial: sampling polar areas more than equatorial ones
• Temporal: sampling only one time of day (not obvious when looking at L3 maps)
• Vertical: not sensitive to a certain part of the atmosphere, thus emphasizing other parts
• Contextual: bright-surface or clear-sky bias
• Pixel Quality: filtering or weighting by quality may mask out areas with specific features

Page 26:

Addressing Level 3 data “quality”

• Terminology: Quality, Uncertainty, Bias, Error budget, etc.
• Quality aspects (examples):
– Completeness:
• Spatial (MODIS covers more than MISR)
• Temporal (the Terra mission has been in space longer than Aqua)
• Observing Condition (MODIS cannot measure over sun glint while MISR can)
– Consistency:
• Spatial (e.g., not changing over the sea–land boundary)
• Temporal (e.g., trends, discontinuities and anomalies)
• Observing Condition (e.g., variations in retrieved measurements due to viewing conditions, such as viewing geometry or cloud fraction)
– Representativeness:
• Neither pixel count nor standard deviation fully expresses the representativeness of the grid-cell value
– Data Quality Glossary development: http://tw.rpi.edu/web/project/MDSA/Glossary

Page 27:

More terminology

• ‘Even a slight difference in terminology can lead to significant differences between data from different sensors. It gives an IMPRESSION of data being of bad quality while in fact they measure different things. For example, the MODIS and MISR definitions of the aerosol "fine mode" are different, so the direct comparison of fine modes from MODIS and MISR does not always give good correlation.’
– Ralph Kahn, MISR Aerosol Lead

Page 28:

Three projects with data quality flavor

• Multi-sensor Data Synergy Advisor (**) – product level
• Data Quality Screening Service – pixel level
• Aerosol Status – record level

Page 29:

Multi-Sensor Data Synergy Advisor (MDSA)

• Goal: provide science users with clear, cogent information on salient differences between data candidates for fusion, merging and intercomparison
– Enable scientifically and statistically valid conclusions
• Develop MDSA on current missions:
– NASA – Terra, Aqua, (maybe Aura)
• Define implications for future missions

Page 30:

How does MDSA work?

MDSA is a service designed to characterize the differences between two datasets and advise a user (human or machine) on the advisability of combining them. It:
• Provides the Giovanni online analysis tool
• Describes parameters and products
• Documents the steps leading to the final data product
• Enables better interpretation and utilization of parameter difference and correlation visualizations
• Provides clear and cogent information on salient differences between data candidates for intercomparison and fusion
• Provides information on data quality
• Provides advice on available options for further data processing and analysis

Page 31:

Giovanni Earth Science Data Visualization & Analysis Tool

• Developed and hosted by NASA Goddard Space Flight Center (GSFC)
• Multi-sensor and model data analysis and visualization online tool
• Supports dozens of visualization types
• Generates dataset comparisons
• ~1500 parameters
• Used by modelers, researchers, policy makers, students, teachers, etc.

Page 32:

Giovanni Allows Scientists to Concentrate on the Science

Web-based tools like Giovanni allow scientists to compress the time needed for pre-science preliminary tasks: data discovery, access, manipulation, visualization, and basic statistical analysis.

[Figure: two timelines. The Old Way – months (Jan–Oct) of pre-science work: find data; retrieve high-volume data; learn formats and develop readers; extract parameters; perform spatial and other subsetting; perform filtering/masking; identify quality and other flags and constraints; develop analysis and visualization; accept/discard/get more data (satellite, model, ground-based); then days for exploration, initial analysis, deriving conclusions, writing and submitting the paper. The Giovanni Way – minutes: web-based services (Giovanni, Mirador) read data, extract the parameter, subset spatially, filter quality, reformat, reproject, explore, analyze and visualize, leaving the time to DO SCIENCE.]

Scientists have more time to do science!

Page 33:

Data Usage Workflow


Page 34:

Data Usage Workflow

Integration

Reformat

Re-project

Filtering

Subset / Constrain

Page 35:

Data Usage Workflow


Integration Planning

Precision Requirements

Quality Assessment Requirements

Intended Use

Integration

Reformat

Re-project

Filtering

Subset / Constrain

Page 36:

Challenge

• Giovanni streamlines data processing, performing required actions on behalf of the user
– but automation amplifies the potential for users to generate and use results they do not fully understand
• The assessment stage is integral for the user to understand the fitness-for-use of the result
– but Giovanni does not assist in assessment
• We are challenged to instrument the system to help users understand results

Page 37:

Advising users on Product quality

We need to be able to advise MDSA users on:
• Which product is better? Or, better formulated: which product has better quality over certain areas?
• How to address harmonization of quality across products
• How does sampling bias affect product quality?
– Spatial: sampling polar areas more than equatorial ones
– Temporal: sampling only one time of day
– Vertical: not sensitive to a certain part of the atmosphere, thus emphasizing other parts
– Pixel Quality: filtering by quality may mask out areas with specific features
– Clear sky: e.g., measuring humidity only where there are no clouds may lead to a dry bias
– Surface-type related

Page 38:

Research approach

• Systematizing quality aspects
– Working through the literature
– Identifying aspects of quality and their dependence on measurement and environmental conditions
– Developing data quality ontologies
– Understanding and collecting internal and external provenance
• Developing rulesets that allow inferring which pieces of knowledge to extract and assemble
• Presenting the data quality knowledge with good visuals, statements and references

Page 39:

Data Quality Ontology Development (Quality flag)

Working together with Chris Lynnes’s DQSS project, started from the pixel-level quality view.

Page 40:

Data Quality Ontology Development (Bias)

http://cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1286316097170_183793435_22228&partName=htmltext

Page 42:

MDSA Aerosol Data Ontology Example

Ontology of Aerosol Data made with cmap ontology editor

Page 43:

RuleSet Development

[DiffNEQCT:
  (?s rdf:type gio:RequestedService),
  (?s gio:input ?a), (?a rdf:type gio:DataSelection),
  (?s gio:input ?b), (?b rdf:type gio:DataSelection),
  (?a gio:sourceDataset ?a.ds), (?b gio:sourceDataset ?b.ds),
  (?a.ds gio:fromDeployment ?a.dply), (?b.ds gio:fromDeployment ?b.dply),
  (?a.dply rdf:type gio:SunSynchronousOrbitalDeployment),
  (?b.dply rdf:type gio:SunSynchronousOrbitalDeployment),
  (?a.dply gio:hasNominalEquatorialCrossingTime ?a.neqct),
  (?b.dply gio:hasNominalEquatorialCrossingTime ?b.neqct),
  notEqual(?a.neqct, ?b.neqct)
  ->
  (?s gio:issueAdvisory giodata:DifferentNEQCTAdvisory)]
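The rule above is written in a Jena-style rule syntax: it fires an advisory when two sun-synchronous data selections in one request have different nominal equatorial crossing times. A rough Python equivalent using rdflib and a SPARQL ASK query is sketched below; the gio: namespace URI and the instance data are made-up placeholders, not the project's actual schema.

    # Hedged sketch: the DiffNEQCT advisory as a SPARQL ASK over an RDF graph.
    # The gio: namespace URI and the instance data are made up for the example.
    from rdflib import Graph, Literal, Namespace, RDF

    GIO = Namespace("http://example.org/gio#")

    g = Graph()
    # Two data selections whose source datasets come from sun-synchronous
    # deployments with different nominal equatorial crossing times.
    for sel, dply, neqct in [("selA", "terra", "10:30"), ("selB", "aqua", "13:30")]:
        g.add((GIO[sel], RDF.type, GIO.DataSelection))
        g.add((GIO.request, GIO.input, GIO[sel]))
        g.add((GIO[sel], GIO.sourceDataset, GIO[sel + "_ds"]))
        g.add((GIO[sel + "_ds"], GIO.fromDeployment, GIO[dply]))
        g.add((GIO[dply], RDF.type, GIO.SunSynchronousOrbitalDeployment))
        g.add((GIO[dply], GIO.hasNominalEquatorialCrossingTime, Literal(neqct)))

    ask = """
    ASK {
      ?s gio:input ?a, ?b .
      ?a gio:sourceDataset/gio:fromDeployment/gio:hasNominalEquatorialCrossingTime ?ta .
      ?b gio:sourceDataset/gio:fromDeployment/gio:hasNominalEquatorialCrossingTime ?tb .
      FILTER (?ta != ?tb)
    }"""
    issue_advisory = bool(g.query(ask, initNs={"gio": GIO}))
    print(issue_advisory)  # True -> attach a DifferentNEQCTAdvisory to the request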

Page 44:

Thus - Multi-Sensor Data Synergy Advisor

• Assemble semantic knowledge base

– Giovanni Service Selections

– Data Source Provenance (external provenance - low detail)

– Giovanni Planned Operations (what service intends to do)

• Analyze service plan

– Are we integrating/comparing/synthesizing?

• Are similar dimensions in data sources semantically comparable? (semantic diff)

• How comparable? (semantic distance)

– What data usage caveats exist for data sources?

• Advise regarding general fitness-for-use and data-usage caveats

Page 45:

Provenance Distance Computation

Provenance: origin or source from which something comes, intention for use, who/what it was generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery – documented in detail sufficient to allow reproducibility.

Based on provenance “distance”, we tell users how different data products are.

Issues:
• Computing the similarity of two provenance traces is non-trivial
• Factors in provenance have varied weight on how comparable the results of processing are
• Factors in provenance are interdependent in how they affect the final results of processing
• Need to characterize the similarity of external (pre-Giovanni) provenance
• The number of dimensions/factors that affect comparability is quickly overwhelming
• Not all of these dimensions are independent – most of them are correlated with each other
• Numerical studies comparing datasets can be used, when available and where applicable to Giovanni analysis

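One simple way to operationalize a provenance "distance" is a weighted sum of per-factor dissimilarities. The sketch below is purely illustrative – the factor names, weights and 0/1 scoring are assumptions, and the naive weighted sum deliberately ignores the interdependence between factors noted above.

    # Illustrative sketch of a weighted provenance distance between two products.
    # Factor names, weights and the 0/1 scoring are assumptions for the example;
    # real factors are interdependent, which this naive weighted sum ignores.
    def provenance_distance(trace_a: dict, trace_b: dict, weights: dict) -> float:
        """0.0 = identical provenance on all weighted factors, 1.0 = fully different."""
        total = sum(weights.values())
        score = sum(w for factor, w in weights.items()
                    if trace_a.get(factor) != trace_b.get(factor))
        return score / total

    modis = {"platform": "Aqua", "algorithm": "dark-target", "ancillary": "NCEP"}
    misr = {"platform": "Terra", "algorithm": "MISR-heterogeneous", "ancillary": "NCEP"}
    weights = {"platform": 1.0, "algorithm": 3.0, "ancillary": 1.0}
    print(provenance_distance(modis, misr, weights))  # 0.8 -> quite different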

Page 46:

Assisting in Assessment


Integration Planning

Precision Requirements

Quality Assessment Requirements

Intended Use

Integration

Reformat

Re-project

Filtering

Subset / Constrain

MDSA Advisory Report

Provenance & Lineage Visualization

Page 47:

Multi-Domain Knowledgebase


Page 48:

Advisor Knowledge Base

Advisor Rules test for potential anomalies and create associations between service metadata and anomaly metadata in the Advisor KB.

Page 49:

Semantic Advisor Architecture

[Architecture diagram – RPI]

Page 50:

… complexity

Page 51:

Presenting data quality to users

We split quality (viewed here broadly) into two categories:
• Global or product-level quality information, e.g. consistency, completeness, etc., which can be presented in tabular form.
• Regional/seasonal. This is where we've tried various approaches:
– maps with outlined regions, one map per sensor/parameter/season
– scatter plots with error estimates, one per combination of Aeronet station, parameter, and season, with different colors representing different wavelengths, etc.

Page 52:

Advisor Presentation Requirements

• Present metadata that can affect the fitness-for-use of the result
• In comparison or integration of data sources:
– Make obvious which properties are comparable
– Highlight differences (that affect comparability) where present
• Present descriptive text (and, if possible, visuals) for any data-usage caveats highlighted by the expert ruleset
• Presentation must be understandable by Earth scientists!! Oh, you laugh…

Page 53:

Advisory Report

• Tabular representation of the semantic equivalence of comparable data source and processing properties
• Advise of and describe potential data anomalies/bias

Page 54:

Advisory Report (Dimension Comparison Detail)


Page 55:

Advisory Report (Expert Advisories Detail)


Page 56:

Quality Comparison Table for Level-3 AOD (Global example)

Quality Aspect: Completeness

Total Time Range:
– MODIS: Terra 2/2/2000–present; Aqua 7/2/2002–present
– MISR: Terra 2/2/2000–present

Local Revisit Time:
– MODIS: Terra 10:30 AM; Aqua 1:30 PM
– MISR: Terra 10:30 AM

Revisit Time:
– MODIS: global coverage of the entire Earth in 1 day; coverage overlap near the poles
– MISR: global coverage of the entire Earth in 9 days; coverage in 2 days in the polar regions

Swath Width:
– MODIS: 2330 km
– MISR: 380 km

Spectral AOD:
– MODIS: AOD over ocean for 7 wavelengths (466, 553, 660, 860, 1240, 1640, 2120 nm); AOD over land for 4 wavelengths (466, 553, 660, 2120 nm)
– MISR: AOD over land and ocean for 4 wavelengths (446, 558, 672, 866 nm)

AOD Uncertainty or Expected Error (EE):
– MODIS: ±0.03 ± 5% (over ocean; QAC >= 1); ±0.05 ± 20% (over land; QAC = 3)
– MISR: 63% fall within 0.05 or 20% of Aeronet AOD; 40% are within 0.03 or 10%

Successful Retrievals:
– MODIS: 15% of the time
– MISR: 15% of the time (slightly more, because of retrieval over the glint region as well)

Page 57:

Data Quality Screening Service for Remote Sensing Data

The DQSS filters out bad pixels for the user.
• Default user scenario:
– Search for data
– Select the science team's recommendation for quality screening (filtering)
– Download screened data
• More advanced scenario:
– Search for data
– Select custom quality screening parameters
– Download screened data

Page 58:

The quality of data can vary considerably

Version 5 Level 2 Standard Retrieval Statistics (Best / Good / Do Not Use):

• Total Precipitable Water: 38% / 38% / 24%
• Carbon Monoxide: 64% / 7% / 29%
• Surface Temperature: 5% / 44% / 51%

Page 59:

Percent of Biased Data in MODIS Aerosols Over Land Increases as the Confidence Flag Decreases

* Compliant data are within ±(0.05 + 0.2·AOD_Aeronet)

Statistics from Hyer, E., J. Reid, and J. Zhang, 2010: An over-land aerosol optical depth data set for data assimilation by filtering, correction, and aggregation of MODIS Collection 5 optical depth retrievals, Atmos. Meas. Tech. Discuss., 3, 4091–4167.
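The compliance criterion is a simple envelope check: a retrieval is compliant if it falls within ±(0.05 + 0.2·AOD_Aeronet) of the Aeronet value. A minimal sketch with made-up numbers:

    # Minimal sketch: fraction of MODIS retrievals falling within the expected
    # error envelope +/-(0.05 + 0.2 * AOD_Aeronet). The data below are made up.
    import numpy as np

    def fraction_within_ee(modis_aod, aeronet_aod):
        modis_aod = np.asarray(modis_aod)
        aeronet_aod = np.asarray(aeronet_aod)
        envelope = 0.05 + 0.2 * aeronet_aod
        compliant = np.abs(modis_aod - aeronet_aod) <= envelope
        return compliant.mean()

    modis = np.array([0.12, 0.55, 0.90, 0.27])
    aeronet = np.array([0.10, 0.40, 0.95, 0.20])
    print(fraction_within_ee(modis, aeronet))  # 0.75 (3 of 4 within the envelope)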

Page 60:

The effect of bad-quality data is often not negligible

[Figure: Total Column Precipitable Water (kg/m²) at quality levels Best, Good, and Do Not Use; Hurricane Ike, 9/10/2008]

Page 61:

DQSS replaces bad-quality pixels with fill values

[Figure: the original data array (total column precipitable water), a mask based on user criteria (quality level < 2), and the screened output in which good-quality data pixels are retained]

The output file has the same format and structure as the input file (except for extra mask and original_data fields).
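A minimal sketch of the screening step, assuming a numeric quality array where lower is better (the slide's criterion keeps quality level < 2) and an assumed fill value of -9999; the variable names are illustrative, not the DQSS implementation:

    # Illustrative sketch of DQSS-style screening: keep pixels whose quality
    # level passes the user criterion, replace the rest with a fill value, and
    # carry the original array along (as in the extra "original_data" field).
    import numpy as np

    FILL_VALUE = -9999.0  # assumed fill value for the example

    def screen(data: np.ndarray, quality: np.ndarray, max_level: int = 1):
        """Return (screened, mask): pixels with quality > max_level are filled."""
        mask = quality <= max_level  # True where data are good enough
        screened = np.where(mask, data, FILL_VALUE)
        return screened, mask

    water = np.array([[12.1, 30.5], [45.2, 8.7]])  # toy precipitable water field
    qual = np.array([[0, 2], [1, 3]])              # 0=best ... 3=do not use
    screened, mask = screen(water, qual)
    print(screened)  # 12.1 and 45.2 retained; quality-2 and -3 pixels set to -9999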

Page 62:

AeroStat?

• Different papers provide different views on whether MODIS and MISR measure aerosols well.

• Peer-reviewed papers usually are well behind the latest version of the data.

• It is difficult to verify the results of a published paper and to resolve controversies between different groups, because the results are hard to reproduce: the groups might have dealt with different data, or used different quality controls or flags.

• It is important to have an online shareable environment where data processing and analysis can be done in a transparent way by any user of this environment and can be shared amongst all the members of the aerosol community.


Page 63:

Monthly AOD Standard deviation

Areas with high AOD standard deviation might point to areas of high uncertainty. The next slide shows these areas on a map of mean AOD.

Page 64:

AeroStat: Online Platform for the Statistical Intercomparison of Aerosols

[Figure: AeroStat workflow – explore & visualize Level 3; compare Level 3; Level 3 is too aggregated, so switch to high-resolution Level 2; explore & visualize Level 2; correct Level 2; compare Level 2 before and after; merge Level 2 into a new Level 3. (EGU 2011)]

Page 65:

Types of Bias Correction

• Relative (cross-sensor) linear climatological – spatial basis: region; temporal basis: season. Pros: not influenced by data in other regions; good sampling. Cons: difficult to validate.
• Relative (cross-sensor) non-linear climatological – spatial basis: global; temporal basis: full data record. Pros: complete sampling. Cons: difficult to validate.
• Anchored parameterized linear – spatial basis: near Aeronet stations; temporal basis: full data record. Pros: can be validated. Cons: limited areal sampling.
• Anchored parameterized non-linear – spatial basis: near Aeronet stations; temporal basis: full data record. Pros: can be validated. Cons: limited insight into the correction.
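As a concrete instance of the "anchored parameterized linear" row, the correction can be sketched as an ordinary least-squares fit of satellite AOD against collocated Aeronet AOD, inverted to correct the satellite values. Entirely illustrative; the numbers are made up:

    # Illustrative sketch of an anchored parameterized linear bias correction:
    # fit satellite AOD against collocated Aeronet AOD near stations, then
    # invert the fit to correct satellite retrievals. Data are made up.
    import numpy as np

    # Collocated pairs near Aeronet stations (the "anchor").
    aeronet = np.array([0.10, 0.20, 0.40, 0.80, 1.20])
    satellite = np.array([0.16, 0.27, 0.49, 0.95, 1.40])

    # Least-squares fit: satellite ~ slope * aeronet + offset.
    slope, offset = np.polyfit(aeronet, satellite, deg=1)

    def correct(sat_aod):
        """Invert the fitted relation to estimate Aeronet-anchored AOD."""
        return (np.asarray(sat_aod) - offset) / slope

    print(slope, offset)       # fitted linear bias model (~1.13, ~0.04 here)
    print(correct(satellite))  # corrected values, close to the Aeronet anchor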

Page 66:

Data Quality Issues

• Validation of aerosol data shows that not all data pixels labeled as “bad” are actually bad, when looked at from a bias perspective.
• But many pixels are biased differently, for various reasons.

From Levy et al., 2009

Page 67:

Quality & bias assessment using FreeMind, from the Aerosol Parameter Ontology

FreeMind allows capturing the various relations between aspects of aerosol measurements, algorithms, conditions, validation, etc. “Traditional” worksheets do not support the complex, multi-dimensional nature of the task.

Page 68:

Reference: Hyer, E. J., Reid, J. S., and Zhang, J., 2011: An over-land aerosol optical depth data set for data assimilation by filtering, correction, and aggregation of MODIS Collection 5 optical depth retrievals, Atmos. Meas. Tech., 4, 379-408, doi:10.5194/amt-4-379-2011

(General) Statement: Collection 5 MODIS AOD at 550 nm during Aug–Oct over central South America strongly overestimates at large AOD and, in the non-burning season, underestimates at small AOD, as compared to Aeronet; good comparisons are found at moderate AOD.

Region & season characteristics: the central region of Brazil is a mix of forest, cerrado, and pasture, and is known to have low AOD most of the year except during the biomass burning season.

(Example) A scatter plot of MODIS AOD at 550 nm vs. Aeronet AOD from the reference (Hyer et al., 2011) shows severe overestimation of MODIS Collection 5 AOD (dark-target algorithm) at large AOD during Aug–Oct 2005–2008 over Brazil. (Constraints) Only the best quality of MODIS data (Quality = 3) is used; data with scattering angle > 170° are excluded. (Symbols) Red lines define the region of Expected Error (EE); green is the fitted slope. Results: tolerance = 62% within EE; RMSE = 0.212; r² = 0.81; slope = 1.00. For low AOD (< 0.2), slope = 0.3; for high AOD (> 1.4), slope = 1.54.

(Dominating factors leading to aerosol-estimate bias):
1. The large positive bias in the AOD estimate during the biomass burning season may be due to a wrong assignment of aerosol absorbing characteristics. (Specific explanation) A constant single scattering albedo of ~0.91 is assigned for all seasons, while the true value is closer to ~0.92–0.93. [Notes or exceptions: biomass burning regions in southern Africa do not show as large a positive bias as in this case; this may be due to different optical characteristics or single scattering albedo of the smoke particles – Aeronet observations of SSA confirm this.]
2. Low AOD is common in the non-burning season. In low-AOD cases, biases are highly dependent on lower boundary conditions. In general a negative bias is found, due to uncertainty in the surface reflectance characterization, which dominates when the signal from atmospheric aerosol is low.

[Figure: scatter plot; axis: Aeronet AOD 0–2; region: central South America; stations: Mato Grosso, Santa Cruz, Alta Floresta]

Page 69:

AeroStat Ontology


Page 70:

Completeness: Observing Conditions for MODIS AOD at 550 nm Over Ocean

Region – Ecosystem – % of retrievals within Expected Error – Average Aeronet AOD – AOD estimate relative to Aeronet

• US Atlantic Ocean – dominated by fine-mode aerosols (smoke & sulfate) – 72% – 0.15 – overestimated (by 7%)*
• Indian Ocean – dominated by fine-mode aerosols (smoke & sulfate) – 64% – 0.16 – overestimated (by 7%)*
• Asian Pacific Oceans – dominated by fine aerosol, not dust – 56% – 0.21 – overestimated (by 13%)
• “Saharan” Ocean – outflow regions in the Atlantic dominated by dust in spring – 56% – 0.31 – random bias (1%)*
• Mediterranean – dominated by fine aerosol – 57% – 0.23 – underestimated (by 6%)*

* Remer, L. A., et al., 2005: The MODIS Aerosol Algorithm, Products and Validation. Journal of the Atmospheric Sciences, Special Section, 62, 947–973.

Page 71:

Completeness: Observing Conditions for MODIS AOD at 550 nm Over Land

Region – Ecosystem – % of retrievals within Expected Error – Correlation w.r.t. Chinese ground-based Sun photometer – AOD estimate relative to the ground-based sensor

• Yanting, China – agriculture site (central China) – 45% – slope = 1.04, offset = −0.063, r² = 0.83 – slightly overestimated
• Fukang, China – semi-desert site (northwest China) – 7% – slope = 1.65, offset = 0.074, r² = 0.58 – overestimated (by more than 100% at large AOD values)
• Beijing – urban site, industrial pollution – 35% – slope = 0.38, offset = 0.086, r² = 0.46 – severely underestimated (by more than 100% at large AOD values)

* Li, Z., et al., 2007: Validation and understanding of Moderate Resolution Imaging Spectroradiometer aerosol products (C5) using ground-based measurements from the handheld Sun photometer network in China, JGR, 112, D22S07, doi:10.1029/2007JD008479.

Page 72:

Summary

• Quality is very hard to characterize; different groups will focus on different and inconsistent measures of quality.
– Modern ontology representations to the rescue!
• Products with known quality (whether good or bad) are more valuable than products with unknown quality.
– Known quality helps you correctly assess fitness-for-use.
• Harmonization of data quality is even more difficult than characterizing the quality of a single data product.

Page 73:

Summary

• The Advisory Report is not a replacement for proper analysis planning
– But it benefits all user types by summarizing general fitness-for-use, integrability, and data-usage caveat information
– Science user feedback has been very positive
• Provenance trace dumps are difficult to read, especially for non-software engineers
– Science user feedback: “Too much information in the provenance lineage; I need a simplified abstraction/view”
• Transparency → Translucency
– make the important stuff stand out
• Uncertainty and bias are evident at many levels!!

Page 74:

Future Work

• Advisor suggestions to correct for potential anomalies
• Views/abstractions of provenance based on specific user-group requirements
• Continued iteration on visualization tools based on user requirements
• Present a comparability index / research techniques to quantify comparability

Page 75:

Reading for this week
• None

Page 76:

What is next
• Week 12 – course review and summary; written part of the final project due
• Week 13 – project presentations