peter fox (twc/rpi), and … stephan zednik 1 , gregory leptoukh 2 ,
DESCRIPTION
Informatics takes on data and information quality, uncertainty and bias (in atmospheric science). Peter Fox (TWC/RPI), and … Stephan Zednik 1 , Gregory Leptoukh 2 , Chris Lynnes 2 , Jianfu Pan 3. Tetherless World Constellation, Rensselaer Polytechnic Inst. - PowerPoint PPT PresentationTRANSCRIPT
Informatics takes on data and information quality, uncertainty and bias
(in atmospheric science)
Peter Fox (TWC/RPI), and …Stephan Zednik1, Gregory Leptoukh2,
Chris Lynnes2, Jianfu Pan3
1. Tetherless World Constellation, Rensselaer Polytechnic Inst.
2. NASA Goddard Space Flight Center, Greenbelt, MD, United States
3. Adnet Systems, Inc. 1
2
Where are we in respect to this data challenge?
“The user cannot find the data; If he can find it, cannot access it;If he can access it, ;he doesn't know how good they are; if he finds them good, he can not merge
them with other data”
The Users View of IT, NAS 1989
Definitions (ATM)• Quality
– Is in the eyes of the beholder – worst case scenario… or a good challenge
• Uncertainty– has aspects of accuracy (how accurately the real world situation
is assessed, it also includes bias) and precision (down to how many digits)
• Bias has two aspects:– Systematic error resulting in the distortion of measurement data
caused by prejudice or faulty measurement technique – A vested interest, or strongly held paradigm or condition that may
skew the results of sampling, measuring, or reporting the findings of a quality assessment:
• Psychological: for example, when data providers audit their own data, they usually have a bias to overstate its quality.
• Sampling: Sampling procedures that result in a sample that is not truly representative of the population sampled. (Larry English)
3
Data quality needs: fitness for purpose/ use
• Measuring Climate Change:– Model validation: gridded contiguous data with uncertainties– Long-term time series: bias assessment is the must , especially
sensor degradation, orbit and spatial sampling change• Studying phenomena using multi-sensor data:
– Cross-sensor bias characterization is needed• Realizing Societal Benefits through Applications:
– Near-Real Time for transport/event monitoring - in some cases, coverage and timeliness might be more important that accuracy
– Pollution monitoring (e.g., air quality exceedance levels) – accuracy• Educational (users generally not well-versed in the
intricacies of quality; just taking all the data as usable can impair educational lessons) – only the best products
4
MODIS vs. MERIS
Same parameter Same space & time
Different results – why?
MODIS MERIS
A threshold used in MERIS processing effectively excludes high aerosol values. Note: MERIS was designed primarily as an ocean-color instrument, so aerosols are “obstacles” not signal.
5
Spatial and temporal sampling – how to quantify to make it useful for modelers?
• Completeness: MODIS dark target algorithm does not work for deserts• Representativeness: monthly aggregation is not enough for MISR and
even MODIS• Spatial sampling patterns are different for MODIS Aqua and MISR Terra:
“pulsating” areas over ocean are oriented differently due to different orbital direction during day-time measurement Cognitive bias
MODIS Aqua AOD July 2009 MISR Terra AOD July 2009
6
Anomaly Example: South Pacific Anomaly
Anomaly
7
MODIS Level 3 dataday definition leads to artifact in correlation
…is caused by an Overpass Time Difference
8
Correlation between MODIS Aqua AOD (Ocean group product) and MODIS-Aqua AOD (Atmosphere group product)
Pixel Count distribution
Only half of the Data Day artifact is present because the Ocean Group uses the better
Data Day definition!
Sensitivity Study: Effect of the Data Day definition on Ocean Color data correlation with
Aerosol data
9
Why so difficult?
• Quality is perceived differently by data providers and data recipients.
• There are many different qualitative and quantitative aspects of quality.
• Methodologies for dealing with data qualities are just emerging
• Almost nothing exists for remote sensing data quality• Even the most comprehensive review (Batini’s book)
demonstrates that there are no preferred methodologies for solving many data quality issues
• Little funding was allocated in the past to data quality as the priority was to build an instrument, launch a rocket, collect and process data, and publish a paper using just one set of data.
• Each science team handled quality differently. 10
More terminology
• ‘Even a slight difference in terminology can lead to significant differences between data from different sensors. It gives an IMPRESSION of data being of bad quality while in fact they measure different things. For example, MODIS and MISR definitions of the aerosol "fine mode" is different, so the direct comparison of fine modes from MODIS and MISR does not always give good correlation.’
• Ralph Kahn, MISR Aerosol Lead. 11
Quality Control vs. Quality Assessment
• Quality Control (QC) flags in the data (assigned by the algorithm) reflect “happiness” of the retrieval algorithm, e.g., all the necessary channels indeed had data, not too many clouds, the algorithm has converged to a solution, etc.
• Quality assessment is done by analyzing the data “after the fact” through validation, intercomparison with other measurements, self-consistency, etc. It is presented as bias and uncertainty. It is rather inconsistent and can be found in papers, validation reports all over the place.
12
Different kinds of reported data quality
• Pixel-level Quality: algorithmic guess at usability of data point– Granule-level Quality: statistical roll-up of Pixel-level Quality
• Product-level Quality: how closely the data represent the actual geophysical state
• Record-level Quality: how consistent and reliable the data record is across generations of measurements
Different quality types are often erroneously assumed having the same meaning
Ensuring Data Quality at these different levels requires different focus and action
13
14
Three projects with data & information quality flavor
• Multi-sensor Data Synergy Advisor (**)– Product level
• Goal: Provide science users with clear, cogent information on salient differences between data candidates for fusion, merging and intercomparison and enable scientifically and statistically valid conclusions
• Develop MDSA on current missions – Terra, Aqua, (maybe Aura)• Define implications for future missions
• Data Quality Screening Service– Pixel level
• Aerosol Status– Record level
Giovanni Earth Science Data Visualization & Analysis Tool
• Developed and hosted by NASA/ Goddard Space Flight Center (GSFC)
• Multi-sensor and model data analysis and visualization online tool
• Supports dozens of visualization types
• Generate dataset comparisons
• ~1500 Parameters
• Used by modelers, researchers, policy makers, students, teachers, etc.
15
Web-based tools like Giovanni allow scientists to compress
the time needed for pre-science preliminary tasks:
data discovery, access, manipulation, visualization,
and basic statistical analysis.
DO SCIENCE
Submit the paper
Minutes
Web-based Services:
Perform filtering/masking
Find data Retrieve high volume data
Extract parameters
Perform spatial and other subsetting
Identify quality and other flags and constraints
Develop analysis and visualization
Accept/discard/get more data (sat, model, ground-based)
Learn formats and develop readers
Jan
Feb
Mar
May
Jun
Apr
Pre-Science
Days for explorationUse the best data for the final analysis
Write the paperDerive conclusions
Exploration
Use the best data for the final analysis
Write the paper
Initial Analysis
Derive conclusions
Submit the paper
Jul
Aug
Sep
Oct
The Old Way: The Giovanni Way:
Read Data
Reformat
Analyze
Explore
Reproject
Visualize
Extract Parameter
Giovanni
Mirador
Scientists have more time to do science!
DO SCIENCE
Giovanni Allows Scientists to Concentrate on the Science
Filter Quality Subset
Spatially
16
Data Discovery Assessment Access Manipulation Visualization Analyze
Data Usage Workflow
17
Data Discovery Assessment Access Manipulation Visualization Analyze
Data Usage Workflow
18Integration
Reformat
Re-project
Filtering
Subset / Constrain
*Giovanni helpsstreamline / automate
Data Discovery Assessment Access Manipulation Visualization Analyze
Data Usage Workflow
19
Integration Planning
Precision Requirements
Quality Assessment Requirements
Intended Use
Integration
Reformat
Re-project
Filtering
Subset / Constrain
*Giovanni helpsstreamline / automate
Informatics approach
• Systematizing quality aspects– Working through literature– Identifying aspects of quality and their
dependence of measurement and environmental conditions
– Developing Data Quality ontologies– Understanding and collecting internal and external
provenance• Developing rulesets allows to infer pieces of
knowledge to extract and assemble• Presenting the data quality knowledge with
good visual, statement and references 20
Data Quality Ontology Development (Quality flag)
Working together with Chris Lynnes’s DQSS project, started from the pixel-level quality view. 21
Data Quality Ontology Development (Bias)
http://cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1286316097170_183793435_22228&partName=htmltext 22
Modeling quality (Uncertainty)
Link to other cmap presentations of quality ontology:
http://cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1299017667444_1897825847_19570&partName=htmltext
23
MDSA Aerosol Data Ontology Example
Ontology of Aerosol Data made with cmap ontology editor24
RuleSet Development
[DiffNEQCT:(?s rdf:type gio:RequestedService),(?s gio:input ?a),(?a rdf:type gio:DataSelection),(?s gio:input ?b),(?b rdf:type gio:DataSelection),(?a gio:sourceDataset ?a.ds),(?b gio:sourceDataset ?b.ds),(?a.ds gio:fromDeployment ?a.dply),(?b.ds gio:fromDeployment ?b.dply),(?a.dply rdf:type gio:SunSynchronousOrbitalDeployment),(?b.dply rdf:type gio:SunSynchronousOrbitalDeployment),(?a.dply gio:hasNominalEquatorialCrossingTime ?a.neqct),(?b.dply gio:hasNominalEquatorialCrossingTime ?b.neqct),notEqual(?a.neqct, ?b.neqct)->(?s gio:issueAdvisory giodata:DifferentNEQCTAdvisory)]
25
Data Discovery Assessment Access Manipulation Visualization Analyze Re-
Assessment
Assisting in Assessment
26
Integration Planning
Precision Requirements
Quality Assessment Requirements
Intended Use
Integration
Reformat
Re-project
Filtering
Subset / Constrain
MDSA Advisory Report
Provenance & Lineage Visualization
Advisor Knowledge Base
27Advisor Rules test for potential anomalies, create
association between service metadata and anomaly metadata in Advisor KB
Semantic Advisor Architecture
RPI
28
Advisory Report (Dimension Comparison Detail)
29
Advisory Report (Expert Advisories Detail)
30
Summary
• Quality is very hard to characterize, different groups will focus on different and inconsistent measures of quality– Modern ontology representations to the rescue!
• Products with known Quality (whether good or bad quality) are more valuable than products with unknown Quality.– Known quality helps you correctly assess fitness-for-use
• Harmonization of data quality is even more difficult that characterizing quality of a single data product
31
Summary• Advisory Report is not a replacement for proper analysis
planning– But provides benefit for all user types summarizing general
fitness-for-usage, integrability, and data usage caveat information– Science user feedback has been very positive
• Provenance trace dumps are difficult to read, especially to non-software engineers– Science user feedback; “Too much information in provenance
lineage, I need a simplified abstraction/view”
• Transparency Translucency– make the important stuff stand out
32
Future Work• Advisor suggestions to correct for potential
anomalies
• Views/abstractions of provenance based on specific user group requirements
• Continued iteration on visualization tools based on user requirements
• Present a comparability index / research techniques to quantify comparability
33
Extra material
34
Acronyms
ACCESS Advancing Collaborative Connections for Earth System ScienceACE Aerosol-Cloud-EcosystemsAGU American Geophysical UnionAIST Advanced Information Systems TechnologyAOD Aerosol Optical DepthAVHRR Advanced Very High Resolution RadiometerGACM Global Atmospheric Composition MissionGeoCAPE Geostationary Coastal and Air Pollution EventsGEWEX Global Energy and Water Cycle ExperimentGOES Geostationary Operational Environmental SatelliteGOME-2 Global Ozone Monitoring Experiment-2JPSS Joint Polar Satellite SystemLST Local Solar TimeMDSA Multi-sensor Data Synergy AdvisorMISR Multiangle Imaging SpectroRadiometerMODIS Moderate Resolution Imaging SpectraradiometerNPP National Polar-Orbiting Operational Environmental Satellite System Preparatory
Project 35
Acronyms (cont.)
OMI Ozone Monitoring InstrumentOWL Web Ontology LanguagePML Proof Markup LanguageQA4EO QA for Earth ObservationsREST Representational State TransferTRL Technology Readiness LevelUTC Coordinated Universal Time WADL Web Application Description LanguageXML eXtensible Markup LanguageXSL eXtensible Stylesheet LanguageXSLT XSL Transformation
36
Quality & Bias assessment using FreeMind
from the Aerosol Parameter Ontology
FreeMind allows capturing various relations between various aspects of aerosol measurements, algorithms, conditions, validation, etc. The “traditional” worksheets do not support complex multi-dimensional nature of the task
37
Reference: Hyer, E. J., Reid, J. S., and Zhang, J., 2011: An over-land aerosol optical depth data set for data assimilation by filtering, correction, and aggregation of MODIS Collection 5 optical depth retrievals, Atmos. Meas. Tech., 4, 379-408, doi:10.5194/amt-4-379-2011
Title: MODIS Terra C5 AOD vs. Aeronet during Aug-Oct Biomass burning in Central Brazil, South America(General) Statement: Collection 5 MODIS AOD at 550 nm during Aug-Oct over Central South America highly over-estimates for large AOD and in non-burning season underestimates for small AOD, as compared to Aeronet; good comparisons are found at moderate AOD.Region & season characteristics: Central region of Brazil is mix of forest, cerrado, and pasture and known to have low AOD most of the year except during biomass burning season
(Example) : Scatter plot of MODIS AOD and AOD at 550 nm vs. Aeronet from ref. (Hyer et al, 2011) (Description Caption) shows severe over-estimation of MODIS Col 5 AOD (dark target algorithm) at large AOD at 550 nm during Aug-Oct 2005-2008 over Brazil. (Constraints) Only best quality of MODIS data (Quality =3 ) used. Data with scattering angle > 170 deg excluded. (Symbols) Red Lines define regions of Expected Error (EE), Green is the fitted slopeResults: Tolerance= 62% within EE; RMSE=0.212 ; r2=0.81; Slope=1.00For Low AOD (<0.2) Slope=0.3. For high AOD (> 1.4) Slope=1.54
(Dominating factors leading to Aerosol Estimate bias): 1. Large positive bias in AOD estimate during biomass burning season may
be due to wrong assignment of Aerosol absorbing characteristics.(Specific explanation) a constant Single Scattering Albedo ~ 0.91 is assigned for all seasons, while the true value is closer to ~0.92-0.93.
[ Notes or exceptions: Biomass burning regions in Southern Africa do not show as large positive bias as in this case, it may be due to different optical characteristics or single scattering albedo of smoke particles, Aeronet observations of SSA confirm this]
2. Low AOD is common in non burning season. In Low AOD cases, biases are highly dependent on lower boundary conditions. In general a negative bias is found due to uncertainty in Surface Reflectance Characterization which dominates if signal from atmospheric aerosol is low.
0 1 2 Aeronet AOD
MO
DIS
AO
D
Central South America
* Mato Grosso
* Santa Cruz
* Alta Floresta
2
1
38