Post on 18-Dec-2015
Addressing the Challenges of Multi-Domain Data Integration
with the SemantEco Framework
Evan W. Patton, Patrice Seyed, Deborah L. McGuinness
Presented at AGU Fall Meeting 2013
2
Overview
• Motivation & History
• The SemantEco Ontology
• The SemantEco Pipeline
• Domain Extensions
• Performance Analysis
• Lessons Learned
• Conclusions & Future Work
3
The Problem
• Real-life motivating example:
– In 2009, in Bristol County, Rhode Island, children became ill with symptoms such as diarrhea. The cause was found to be water contaminated with E. coli, and citizens were asked to boil water until the issue was resolved.
– Public concerns: “When did the contamination begin?”, “How did this happen?”, “How can we keep it from happening again?”
– We need environmental informatics systems that can automatically integrate and analyze water quality data.
4
The Problem
1. Raw data come from multiple sources and in different formats – difficult to integrate and query.
2. The semantics of the water quality data are not explicitly encoded in the data – machines cannot process the data automatically.
3. The amount of data is large, due to the large spatial region, long time span, and large number of pollutants and regulated limits – analysis can be time-consuming and complex.
5
SemantEco Ontology
escim:Measurement:
  SubClassOf: repr:Measurement
    unit:hasUnit exactly 1 unit:Unit
    escim:ofCharacteristic exactly 1 escim:Characteristic
    escim:hasValue exactly 1 xsd:decimal

water:WaterMeasurement:
  SubClassOf: escim:Measurement
6
SemantEco Ontology
ThresholdViolation:
  SubClassOf: escim:Measurement

ColiformRegulationViolation:
  SubClassOf: ThresholdViolation
  IntersectionOf:
    water:WaterMeasurement
    escim:ofCharacteristic escim:FecalColiform
    escim:hasValue some xsd:decimal[> 400]
    unit:hasUnit escim:MPN_per_mL
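
The ColiformRegulationViolation definition can be read as a threshold rule over a measurement's type, characteristic, value, and unit. The following is a minimal Python sketch of that reading only. The `Measurement` dataclass and `is_coliform_violation` function are hypothetical names introduced for illustration; SemantEco itself derives this class through ontology inference, not hand-written code.

```python
from dataclasses import dataclass
from decimal import Decimal


# Hypothetical stand-in for an escim:Measurement individual (not SemantEco code).
@dataclass(frozen=True)
class Measurement:
    characteristic: str     # value of escim:ofCharacteristic
    value: Decimal          # value of escim:hasValue (xsd:decimal)
    unit: str               # value of unit:hasUnit
    is_water: bool = True   # True if typed as water:WaterMeasurement


COLIFORM_LIMIT = Decimal("400")  # threshold taken from the class definition


def is_coliform_violation(m: Measurement) -> bool:
    """Return True only if every condition of the intersection holds."""
    return (
        m.is_water
        and m.characteristic == "escim:FecalColiform"
        and m.unit == "escim:MPN_per_mL"
        and m.value > COLIFORM_LIMIT
    )


clean = Measurement("escim:FecalColiform", Decimal("120"), "escim:MPN_per_mL")
polluted = Measurement("escim:FecalColiform", Decimal("650"), "escim:MPN_per_mL")
other = Measurement("escim:Arsenic", Decimal("650"), "escim:MPN_per_mL")

for sample in (clean, polluted, other):
    print(sample.characteristic, sample.value, is_coliform_violation(sample))
```

Because the restriction is a strict inequality, a value of exactly 400 is not classified as a violation in this sketch.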
7
Limitations
• How to show more than just water data?
• How to incorporate additional datasets with minimal modifications to queries?
• How to provide facets along different dimensions of the data?
8
Modular Approach
• The SemantEco framework employs modules to add functionality, data, and domains
• Modules can be hot-deployed, making the system extensible/upgradable at runtime
• Application-level functionality and provenance are captured by a set of core classes that can be repurposed for different applications
10
SemantEco Data Pipeline
User Request
Ontology
Data Load
Forward Inference
Query Answering
Module Processing
12
SemantEco Query Pipeline
SELECT ?measure ?characteristic ?value ?unit ?time
WHERE {
  <#site1> a pol:PollutedSite ;    # inferred by regulation ontology
           escim:hasMeasurement ?measure .
  ?measure a escim:Measurement ;
           escim:ofCharacteristic ?characteristic ;
           escim:hasValue ?value ;
           unit:hasUnit ?unit ;
           time:inXSDDateTime ?time .
}
13
SemantEco Query Pipeline
SELECT ?measure ?characteristic ?value ?unit ?time
WHERE {
  <#site1> a escim:MeasurementSite ;
           escim:hasMeasurement ?measure .
  ?measure a escim:Measurement ;
           escim:ofCharacteristic ?characteristic ;
           escim:ofCharacteristic escim:FecalColiform ;    # added by Characteristics Module
           escim:hasValue ?value ;
           unit:hasUnit ?unit ;
           time:inXSDDateTime ?time .
}
14
SemantEco Query Pipeline
SELECT ?measure ?characteristic ?value ?unit ?time
WHERE {
  <#site1> a escim:MeasurementSite ;
           escim:hasMeasurement ?measure .
  ?measure a escim:Measurement ;
           escim:ofCharacteristic ?characteristic ;
           escim:ofCharacteristic escim:FecalColiform ;
           escim:hasValue ?value ;
           unit:hasUnit ?unit ;
           time:inXSDDateTime ?time .
  FILTER( ?time < xsd:date("2009-09-08") )    # added by Time Module
}
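
The query pipeline slides show each module contributing its own constraint to a shared base query. The following Python sketch illustrates that idea under simplifying assumptions: the `Query` class and the `characteristics_module` and `time_module` functions are hypothetical names, and the real framework's module interface is not shown in these slides.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Query:
    """Hypothetical in-memory form of a SELECT query that modules can extend."""
    variables: List[str]
    patterns: List[str] = field(default_factory=list)  # complete triple-pattern statements
    filters: List[str] = field(default_factory=list)   # FILTER expressions

    def to_sparql(self) -> str:
        lines = ["SELECT " + " ".join(self.variables), "WHERE {"]
        lines += ["  " + p for p in self.patterns]
        lines += ["  FILTER( " + f + " )" for f in self.filters]
        lines.append("}")
        return "\n".join(lines)


def base_query(site: str) -> Query:
    """Build the unconstrained measurement query for one site."""
    return Query(
        variables=["?measure", "?characteristic", "?value", "?unit", "?time"],
        patterns=[
            f"{site} a escim:MeasurementSite ; escim:hasMeasurement ?measure .",
            "?measure a escim:Measurement ; escim:ofCharacteristic ?characteristic ; "
            "escim:hasValue ?value ; unit:hasUnit ?unit ; time:inXSDDateTime ?time .",
        ],
    )


def characteristics_module(characteristic: str) -> Callable[[Query], None]:
    """Return a module that restricts results to one characteristic."""
    def apply(q: Query) -> None:
        q.patterns.append(f"?measure escim:ofCharacteristic {characteristic} .")
    return apply


def time_module(before: str) -> Callable[[Query], None]:
    """Return a module that keeps only measurements taken before a date."""
    def apply(q: Query) -> None:
        q.filters.append(f'?time < xsd:date("{before}")')
    return apply


query = base_query("<#site1>")
for module in (characteristics_module("escim:FecalColiform"), time_module("2009-09-08")):
    module(query)  # each module adds its own constraint to the shared query

print(query.to_sparql())
```

Keeping patterns and filters as separate lists lets modules be applied in any order while the FILTER clauses are still emitted after the triple patterns.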
15
Domain Extensions
• Air quality data from the Environmental Protection Agency
• Bird species count data from the Avian Knowledge Network eBird database
• Fish species count data from the Santa Barbara Long Term Ecological Research (LTER) group
16
Performance Analysis
• Analysis on San Francisco, CA 94107, one of the largest datasets in the system
• Water data: 74920 triples
• Species data: 5605 triples
• Time to completion:
– Water: 0:03.790
– Species: 0:00.813
– Combined: 2:14.015
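
A quick calculation makes the gap between the separate and combined runs concrete. This sketch assumes the timings are written as minutes:seconds.milliseconds, which the slide does not state explicitly; `to_seconds` is a hypothetical helper.

```python
def to_seconds(stamp: str) -> float:
    """Convert an 'm:ss.fff' timing (assumed format) into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


water = to_seconds("0:03.790")
species = to_seconds("0:00.813")
combined = to_seconds("2:14.015")

separate_total = water + species
ratio = combined / separate_total

print(f"separate runs: {separate_total:.3f} s")
print(f"combined run:  {combined:.3f} s")
print(f"combined / separate: {ratio:.1f}x")
```

Under that assumption, reasoning over the combined data takes far longer than the two domains processed one after the other, even though the species data adds comparatively few triples.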
17
Performance Redux
• Partitioning of the knowledge base as a function of declared domain
– Water domain: 0:03.778
– Bird domain: 0:00.632
– “Combined”: 0:11.126
• Transformation of the regulation ontology into a ruleset executed with a traditional RETE engine improves performance ~66 %
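
Partitioning by declared domain can be pictured as grouping triples before any reasoning runs, so each reasoner pass sees only one domain's data. The sketch below is an illustration only: the prefix-to-domain table, the sample triples, and the function names are assumptions, and the slides do not describe how SemantEco actually declares or stores domains.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping from subject prefix to declared domain.
DOMAIN_OF_PREFIX = {"water:": "water", "bird:": "bird"}


def declared_domain(triple: Triple) -> str:
    """Pick a domain from the subject's prefix; unknown prefixes go to 'shared'."""
    subject = triple[0]
    for prefix, domain in DOMAIN_OF_PREFIX.items():
        if subject.startswith(prefix):
            return domain
    return "shared"


def partition(triples: Iterable[Triple]) -> Dict[str, List[Triple]]:
    """Group triples by declared domain so each group can be reasoned over alone."""
    parts: Dict[str, List[Triple]] = defaultdict(list)
    for triple in triples:
        parts[declared_domain(triple)].append(triple)
    return dict(parts)


# Made-up sample data for illustration.
triples = [
    ("water:m1", "escim:ofCharacteristic", "escim:FecalColiform"),
    ("water:m1", "escim:hasValue", "650"),
    ("bird:obs7", "escim:ofCharacteristic", "bird:SpeciesCount"),
    ("unit:MPN_per_mL", "rdf:type", "unit:Unit"),
]

for domain, part in partition(triples).items():
    print(domain, len(part))
```

Each partition can then be handed to the reasoner separately, which avoids testing water regulations against bird observations when their disjointness is not stated in the ontology.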
18
Lessons Learned
• Best to represent thresholding as rules to improve inference performance
• Partition data deliberately when disjointness is not, or cannot be, made ontologically explicit
• Design modules against high-level ontologies for reusability in other domains
• Inferencing keeps queries simple to understand at the cost of making debugging more complex
19
Conclusions
• Semantic technologies and toolchains support integrating data from multiple domains and presenting it in a single portal
• Ontologies can assist in machine interpretation of data for non-expert end users
• Semantic software design is critical to a flexible, robust platform for building integrative applications
• The SemantEco framework has shown itself to be reusable and extensible in both semantic environmental and ecological monitoring, and is proving to be useful more broadly
20
Future Work
• Cross-domain query answering
• Support for moving average regulations
• Support for registering ontology transformations for Input/Output
• Applications of the framework in other scientific domains (e.g. health)
• Expand our collaborations on semantic monitoring (contact dlm@cs.rpi.edu)
21
Acknowledgements
• Dr. A. Patrice Seyed
• Prof. Deborah L. McGuinness, Advisor
• Drs. F. Joshua Dein & R. Sky Bristol (USGS)
• Students: Ping Wang, Jin Guang Zheng, Theodora Kampelou, Linyun Fu, Matthew Ma, Chen Wang, Lynn Zheng, Robin Liu, Katherine Chastin, Brendan Ashby, & Irene Khan
• National Science Foundation
– Graduate Research Fellowship (EWP)
– DataONE Initiative (APS)
• RPI Tetherless World Constellation
• Contact pattoe@rpi.edu or dlm@cs.rpi.edu for collaboration opportunities