supersite relational database project: ‘federated pm data warehouse’ ‘federated pm data...

17
Supersite Relational Database Project: ‘Federated PM Data Warehouse’ Rudolf Husar, PI Center for Air Pollution Impact and Trend Analysis (CAPITA) Washington University, St. Louis, MO Nov 27, 2001 Proposal Presentation to the Supersite Program a sub- project of St. Louis Midwest Supersite Project, Jay Turner, PI

Upload: rosamund-bates

Post on 01-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Supersite Relational Database Project:‘Federated PM Data Warehouse’

Rudolf Husar, PI

Center for Air Pollution Impact and Trend Analysis (CAPITA)

Washington University, St. Louis, MO

Nov 27, 2001

Proposal Presentation to the Supersite Program

a sub- project of St. Louis Midwest Supersite Project, Jay Turner, PI

Page 2: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Design, Populate and Maintain aSupersite Relational Database System

• Include monitoring data from Supersites and auxiliary projects

• Facilitate cross-Supersite [regional or comparative] data analyses

• Support analyses by a variety of research groups

Purpose of the Project:

Page 3: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

EPA Specs of the Supersite Relational Data System (from RFP)

• Data Input:– Data input electronically

– Modest amount of metadata on sites, instruments, data sources/version, contacts etc.

– Simple data structures, formats and convenient submission procedures

• Data Storage and Maintenance:– Data storage in relational database(s), possibly distributed over multiple servers

– A catalog of data holdings and request logs

– Supersite data updates quarterly

• Data Access:– User-friendly web-access by multiple authorized users

– Data query by parameter, method, location, date/time, or other metadata

– Multiple data output formats (ASCII, spreadsheet, other (dbf, XML)

Page 4: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

General Approach to SRDS Design

• Based on consensus, adopt a uniform relational data structure, suitable for regional and cross-Supersite data integration and analysis.

• We propose a star schema with spatial, temporal, parameter and method dimensions.

• The ‘original’ data are to be maintained at the respective providers or custodians (Supersites, CIRA, CAPITA..).

• We propose the creation of flexible ‘adapters’ and web-submission forms for the transfer of data subsets into the uniformly formatted ‘Federated Data Warehouse’.

• Data users would access the data warehouse manually or through software.

• We propose data access using modern ‘web services’ protocol, suitable for building data viewers and processors (filtering, aggregation and fusion)

Page 5: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

The ‘Minimal’ Star Schema

• The minimal Site table includes SiteID, Name and Lat/Lon.

• The minimal Parameter table consists of ParamterID, Description and Unit

• The time dimensional table is usually skipped since time is self-describing

• The minimal Fact (Data) table consists of the Obs_Value and the three dimensional codes for Obs_DateTime, Site_ID and Parameter_ID

For integrative, cross-Supersite analysis, data queries by time, location and parameter, the database has to have time, location and parameter as dimensions

The above minimal (multidimensional) schema was used in the CAPITA data exploration software, Voyager for the past 22 years, encoding 1000+ datasets.

Most Supersite data require a more elaborate schema to fully capture the content

Page 6: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

From Heterogeneous to Homogeneous Schema• Individual Supersite SQL databases have varied designs, usually following the ‘snowflake’ pattern (see

Database Schema Design for the Federated Data Warehouse)

• Though they have complicated schemata, these SQL servers can be queried along spatial, temporal and parameter dimensions. However, the query to retrieve the same information depends on the of the particular database.

• A way to homogenize the distributed data is access all the data through a Data Adapter using only a subset of the tables/fields from any particular database (red)

• The proposed extracted uniform (abstract) schema is the Minimal Star Schema, (possibly expanded). The final form of the uniformly extracted data schema will be arrived at by consensus.

Subset used

Uniform Schema

Fact

Table

Data AdapterExtraction of

homogeneous data from heterogeneous sources

Page 7: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Live Demo of the Data Warehouse Prototype

Uniform Data Query regardless of the native schema: Query by parameter, location, time, method

Currently online data are accessible from the CIRA (IMPROVE) and CAPITA SQL servers

The hidden DataAdopter- accepts the uniform query- accesses the data server- transforms the original to uniform data- delivers uniforms DataSets

A rudimentary viewer displays the data in a table for browsing.

http://capita.wustl.edu/DSViewer/DSviewer.aspx

Page 8: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Federated Data Warehouse Architecture• Tree-tier architecture consisting of

– Provider Tier: Back-end servers containing heterogeneous data, maintained by the federation members – Proxy Tier: Retrieves designated Provider data and homogenizes it into common, uniform Datasets – User Tier: Accesses the Proxy Server and uses the uniform data for presentation, integration or processing

• The Provider servers interact only with the Proxy Server in accordance with the Federation Contract– The contract sets the rules of interaction (accessible data subsets, types of queries)– Strong server security measures enforced, e.g. through Secure Socket layer

• The data User interacts only with the generic Proxy Server using flexible Web Services interface– Generic data queries, applicable to all data in the Warehouse (e.g. space, time, parameter data sub-cube)– The data query is addressed to a Web Service provided by the Proxy Server of the Federation – Uniformly formatted, self-describing data packages are handed to the user for presentation or further processing

SQLDataAdapter1

CustomDataAdapter

SQLDataAdapter2

SQLServer1

SQLServer2

LegacyServer

Presentation

Data Access & Use

Provider Tier Heterogeneous Data

Proxy Tier

Data Homogenization, etc.

Member ServersProxy Server

User Tier

Data Consumption

Processing

Integration

Federated Data Warehouse

Fire Wall, Federation ContractWeb Service, Uniform Query & Data

Page 9: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Data Re-Use and Synergy

• Data producers maintain their own workspace and resources (data, reports, comments).

• Part of the resources are shared by creating a common virtual resources.

• Web-based integration of the resources can be across several dimensions:Spatial scale: Local – global data sharing

Data content: Combination of data generated internally and externally

• The main benefits of sharing are data re-use, data complementing and synergy.

• The goal of the system is to have the benefits of sharing outweigh the costs.

Content

Content

User

User

User

LocalLocal

GlobalGlobal

Virtual Shared Resources

Data, KnowledgeTools, Methods

User

User

Shared part of resources

Page 10: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Data Entry to the Supersite Relational Data System:

EPA

Supersite Data

Coordinated

Supersite

Relational

Tables

EOSDIS

Data

Archive

NARSTO ORNLDES, Data Ingest

Supersite

SQL

Server

DES-SQLTransformer

DataAdapterSupersite & other

SQL Data

DataQuery

TableOutput

Direct Web Data Input

1. Automatic translation and transfer of NARSTO-archived DES data to SQL

2. Web-submission of of relational tables by the data producers/custodians

3. Batch transfer of large auxiliary datasets to the SQL server

Page 11: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Data Warehouse Features

• As much as possible, data should reside in their respective home environment. ‘Uprooted’ data in decoupled databases tend to decay i.e. can not be easily updated, maintained, enriched.

• Data from the providers will be transferred to the ‘federated data warehouse’ through (1) on-line DataAdapters, (2) Manual web submission and (3) Semi-automated transfer from the NARSTO archive.

• Retrieval of uniform data from the data warehouse facilitates integration and comparison along the key dimensions (space, time, parameter, method)

• The open architecture data warehouse (see Web Services) promotes the building of further value chains: Data Viewers, Data Integration Programs, Automatic Report Generators etc..

Page 12: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Data Preparation Procedures:

• Data gathering, QA/QC and standard formatting is to be done by individual projects

• The data exchange standards, data ingest and archives are by ORNL and NASA

• Data ingest is to automated, aided by tools and procedures supplied by this project– NARSTO DES-SQL translator

– Web submission tools and procedures

– Metadata Catalog and I/O facilities

• Data submissions and access will be password protected as set by the community.

• Submitted data will be retained in a temporary buffer space and following verification transferred to the shared SQL database.

• The data access, submissions etc. will be automatically recorded an summarized in human-readable reports.

Page 13: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Data Catalog

• Data Catalog will be maintained through CAPITA website.

• Limited metadata (based on user consensus) will be recorded for each dataset

• User feedback on individual datasets will be through comments/feedback pages

• An example is the data section of the St. Louis Supersite website.

Page 14: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Related CAPITA Projects

• EPA Network Design Project (~$150K/yr –April 2003). Development of novel quantitative methods of network optimization. The network performance evaluation is conducted using the complete PM FRM data set in AIRS which will be available for input into the SRDS.

• EPA WebVis Project (~$120K/yr - April 2003). Delivery current visibility data to the public through a web-based system. The surface met data are being transferred into the SQL database (Since March 2001) and will be available to SRDS.

• NSF Collaboration Support Project (~$140K/yr – Dec 2004). Continuing development of interactive web sites for community discussions and for web-based data sharing; (directly applicable to this project)

• NOAA ASOS Analysis Project (~$50K/yr - May 2002). Evaluate the potential utility of the ASOS visibility sensors (900 sites, one minute resolution) as PM surrogate. Data now available for April-October 2001 – can be incorporated into to the Supersite Relational Data System.

• St. Louis Supersite Project website (~$50K/yr – Dec 2003) . The CAPITA group maintains the St. Louis Supersite website and some auxiliary data. It will also be used for this project

Page 15: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Federated Data Warehouse Applications

XML WebServices

Satellite

Vector

GIS Data

XDim Data

OLAPCube

SQLTable

HTTPServices

Text Data

WebPage

TextData

Time Chart

Scatter Chart

Text, Table

Data View & Process TierLayered Map

Cursor

Data Warehouse Tier

Data View

Manager

Connection

Manager

Data Access

Manager

Cursor-Query

Manager

OpenGISServices

Data are rendered by linked Data Views (map, time, text)

Distributed data of multiple types (spatial, temporal text)

The Broker handles the views, connections, data access, cursor

Page 16: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Supersite Relational Data System: Schedule

• First four four months to design of the relational database, associated data transformers, I/O; submitted to the Supersite workgroups for comment

• In six months, Supersite data preparation and entry begins• In Year 2 and Year 3, data sets will be updated by providers as needed; system accessible to data

user community

Year 1 - 2002 Year 2 - 2003 Year 2 - 2004

RDMS Design Feed

back

Impl. &

Test SQL Supersite Data Entry

Auxiliary Data Entry

Other Coordinated Data Entry

Supersite, Coordinated and Auxiliary Data Updates

Page 17: Supersite Relational Database Project: ‘Federated PM Data Warehouse’ ‘Federated PM Data Warehouse Rudolf Husar, PI Center for Air Pollution Impact and

Personnel, Management and Facilities

Personnel• PI, R. B. Husar (10%), Kari Hoijarvi (25%). Software experience at CAPITA, Microsoft, Visala. • 20% of project budget ($12k/yr) to consultants: J. Watson, DRI, W. White and J. Turner, WU.• Collaborators, (CAPITA associates): B. Schichtel, CIRA, S. Falke, EPA, M. Bezic, Microsoft.

Management • This project is a sub-project of the St. Louis-Midwest Supersite project, Dr. Jay Turner, PI.• Special focus is on supporting large scale, crosscutting, and integrative analysis.• This project will leverage the other CAPITA data sharing projects

Resources and Facilities• CAPITA has the ‘largest known privately held collection of air quality, metrological and emission

data’, available in uniform Voyager format and extensively accessed from the CAPITA website• The computing and communication facilities include two servers, ten workstations and laptops,

connected internally and externally through high-speed networks.• Software development tools includes Visual Studio, part of the .NET distributed development

environment