Design and Creation of Ontologies for Environmental (Multimedia) Information Retrieval*

Vipul Kashyap
National Library of Medicine
kashyap@nlm.nih.gov

Workshop on Science and the Semantic Web
October 24, 2002

* Work done by the author when at MCC and LSDIS Lab, UGA


Page 1: Design and Creation of Ontologies for Environmental (Multimedia) Information Retrieval * Vipul Kashyap National Library of Medicine kashyap@nlm.nih.gov



Science on the Semantic Web Workshop – 2

Outline

Ontologies for Information Retrieval: The InfoSleuth System

The Ontology Design Process:
– “Reverse Engineering” from a database schema
– Ontology refinement based on user queries
– Using a data dictionary and Thesaurus

Ontology-based Multimedia Information Retrieval:
– Information Extraction from Textual Data
– Information Extraction from Image Data

Conclusions and Future Work


Ontologies for Information Retrieval: The InfoSleuth System

[Figure labels: ontology-based retrieval query; KQML/OKBC agents; Document Database (e.g., Verity); Structured Database (e.g., Oracle); Image Database (features, patterns, semantic objects)]


A Multimedia GIS Query using an ontological model

Get me all regions (blocks, counties) having a population greater than 500 and area greater than 50 acres having an urban land cover and such that all the nearby fires have excellent containment

[Figure: ontological model]
Fire: name, containment
Fire isLocatedNear Region
Region: county, block, area, population, spatial_location, land_cover

select county, block, spatial_location
from region
where area > 50 and population > 500
and land_cover = 'urban'
and region.isLocatedNear.containment = 'excellent'
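The query above can be read as a filter over objects of the ontological model. The following is a minimal sketch of that reading, not the InfoSleuth implementation: the class layout, the field name is_located_near, and the sample data are assumptions made for illustration, and spatial_location is left out for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Fire:
    name: str
    containment: str

@dataclass
class Region:
    county: str
    block: str
    area: float        # acres
    population: int
    land_cover: str
    is_located_near: list = field(default_factory=list)  # nearby Fire objects (assumed layout)

def matches(region):
    """True when the region satisfies every condition of the slide's query."""
    return (
        region.area > 50
        and region.population > 500
        and region.land_cover == "urban"
        and all(f.containment == "excellent" for f in region.is_located_near)
    )

# Made-up sample data.
regions = [
    Region("A", "1", 80.0, 900, "urban", [Fire("f1", "excellent")]),
    Region("B", "2", 80.0, 900, "urban", [Fire("f2", "excellent"), Fire("f3", "poor")]),
    Region("C", "3", 20.0, 900, "urban", []),
]
hits = [(r.county, r.block) for r in regions if matches(r)]
```

Note that all() over an empty list is true, so in this sketch a region with no nearby fires passes the containment condition; whether that is the intended reading of "all the nearby fires" is a design decision.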


Ontologies for Information Retrieval

Provide a concise, uniform, declarative description of semantic information

Independent of the syntactic representations and conceptual models of the underlying information bases

Domain models provide wider access by supporting multiple world views on the same underlying data

EDEN ontology defined in the context of the InfoSleuth system:
– important and crucial for capturing elements of environmental information


Sources for Ontology construction

Pre-existing Database Schemas
– data directed component

Collection of a representative set of queries, possibly parameterized based on the application user interface
– application directed component

Thesauri and Vocabularies (e.g., EEA Thesaurus)
– knowledge directed component

Ontology = knowledge-based middle ground between applications and data!


The Ontology Design Process

[Flowchart; box labels as extracted from the slide]

Ontology from Database Schema:
– Choose new database schema
– Abstract details from database schema
– Determine entities and attributes
– Group information, analyze foreign keys and dependencies
– Determine relationships

Ontology from Queries:
– Choose new query
– Drop entities and attributes
– Add new entities and attributes
– Add new subclasses and superclasses
– No more queries

– Evaluate ontology
– Implement and test


Environmental Databases

CERCLIS 3
– http://www.epa.gov/enviro/html/cerclis/

ITT

HAZDAT
– http://www.atsdr.cdc.gov/hazdat.html

ERPIMS
– http://ns1.ktc.com/personal/larnold/erpims.htm

Basel Convention Database
– http://www.unep.ch/basel


Grouping Information in Multiple Tables

Database Schema

Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id

Site_Characteristic: site_id (PK, FK to Site), rsic_code (PK, FK to Ref_Sic), sc_date

Ref_Sic: rsic_code (PK), rsic_code_desc

Site_Alias: site_id (PK, FK to Site), site_alias_id (PK), sa_name

Ontology

Site: date, name, code, alias_name, description
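The grouping step on this slide, where several tables linked by foreign keys collapse into one ontology entity, can be sketched as a graph traversal. This is an illustrative sketch only: the SCHEMA encoding and the function names are hypothetical, composite keys are simplified, and only a subset of the Site columns is listed.

```python
from collections import deque

# Hypothetical encoding of the slide's schema:
# column -> None (plain attribute), "PK", or ("FK", referenced_table).
SCHEMA = {
    "Site": {"site_id": "PK", "site_name": None, "site_epa_id": None},
    "Site_Characteristic": {
        "site_id": ("FK", "Site"),
        "rsic_code": ("FK", "Ref_Sic"),
        "sc_date": None,
    },
    "Ref_Sic": {"rsic_code": "PK", "rsic_code_desc": None},
    "Site_Alias": {"site_id": ("FK", "Site"), "site_alias_id": "PK", "sa_name": None},
}

def linked_tables(table):
    """Tables joined to `table` by a foreign key in either direction."""
    out = set()
    for name, cols in SCHEMA.items():
        for kind in cols.values():
            if isinstance(kind, tuple):
                if kind[1] == table:
                    out.add(name)
                if name == table:
                    out.add(kind[1])
    return out

def group_entity(root):
    """Collect the non-key columns of every table reachable from `root`."""
    seen, queue = {root}, deque([root])
    while queue:
        for nxt in linked_tables(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(c for t in seen for c, k in SCHEMA[t].items() if k is None)

site_attributes = group_entity("Site")
```

On a full schema an unrestricted traversal would pull in unrelated tables, so in practice the designer decides where a group ends; the traversal only proposes candidates. Renaming columns to ontology terms (for example sa_name to alias_name) is a separate manual step.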


Identifying Relationships

Database Schema

Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id

Action: site_id (PK, FK to Site), rat_code (PK, FK to Ref_action_type), act_code_id (PK)

Ref_action_type: rat_code (PK), rat_name, rat_def

Waste_Src_Media_Contaminated: wsmrc_nmbr (PK), site_id (PK, FK to Action), rat_code (FK to Action), act_code_id (FK to Action)

Remedial_Response: site_id, act_code_id, rat_code

Ontology

[Figure labels: Site, Contaminant, RemedialResponse, PerformedAt, actionName]


Ontology refinement based on user queries

Addition of new attributes
– At NPL sites with a land use category of INDUSTRIAL, what is the cleanup level range for LEAD ...
– Add an attribute landUseCategory to the entity Site in the ontology

Addition of new relationships
– What is the range of concentrations for ARSENIC as a contaminant of concern in the SURFACE SOIL at NPL sites?
– Add a relationship HasContaminant between the entities Site and Contaminant in the ontology

Addition of class-subclass relationships and new entities
– How many Superfund sites are in Edison County, New Jersey?
– Add an entity SuperFundSite as a subclass of Site in the ontology
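The three refinement operations can be sketched against a small in-memory ontology structure. The Ontology class and its method names are hypothetical and chosen for illustration; only the entity, attribute, and relationship names come from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    attributes: dict = field(default_factory=dict)    # entity -> set of attribute names
    relationships: set = field(default_factory=set)   # (name, source, target) triples
    superclass: dict = field(default_factory=dict)    # entity -> parent entity

    def add_entity(self, name, parent=None):
        self.attributes.setdefault(name, set())
        if parent is not None:
            self.add_entity(parent)
            self.superclass[name] = parent

    def add_attribute(self, entity, attr):
        self.add_entity(entity)
        self.attributes[entity].add(attr)

    def add_relationship(self, name, source, target):
        self.add_entity(source)
        self.add_entity(target)
        self.relationships.add((name, source, target))

    def all_attributes(self, entity):
        """Attributes of an entity, including those inherited from superclasses."""
        attrs = set()
        while entity is not None:
            attrs |= self.attributes.get(entity, set())
            entity = self.superclass.get(entity)
        return attrs

onto = Ontology()
onto.add_attribute("Site", "landUseCategory")                   # from the LEAD query
onto.add_relationship("HasContaminant", "Site", "Contaminant")  # from the ARSENIC query
onto.add_entity("SuperFundSite", parent="Site")                 # from the Superfund query
```

Because SuperFundSite is a subclass of Site, it inherits landUseCategory without the attribute being declared twice.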


Using a data dictionary (EDR) to enhance the ontology

[Figure labels: Site; state; StateName, StateCode, StateAbbr; Map; coding_scheme1, coding_scheme2, coding_scheme3]

select * from Site where state = 'TX' or state = 'California'

select coding_scheme1 from Map where coding_scheme3 = 'TX'

{ “Texas”, “California” }   { “TX”, “CA” }
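The lookup against the Map table can be sketched with an in-memory SQLite database. This sketch assumes that coding_scheme1 holds full state names and coding_scheme3 holds abbreviations, which is consistent with the query on the slide; the coding_scheme2 values and the helper name to_state_name are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Map (coding_scheme1 TEXT, coding_scheme2 TEXT, coding_scheme3 TEXT);
    INSERT INTO Map VALUES ('Texas', '48', 'TX'), ('California', '06', 'CA');
""")

def to_state_name(value):
    """Translate an abbreviation (coding_scheme3) into a full name (coding_scheme1).

    Values with no match in Map are returned unchanged.
    """
    row = conn.execute(
        "SELECT coding_scheme1 FROM Map WHERE coding_scheme3 = ?", (value,)
    ).fetchone()
    return row[0] if row else value

# The mixed constants from the slide's query, normalized to one coding scheme.
names = [to_state_name(v) for v in ("TX", "California")]
```

Normalizing the constants before the query reaches the Site table lets a user mix coding schemes without knowing which one a given database uses.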


Enhancing the Ontology by using a Thesaurus

abandoned site
  THEME POLLUTION
  BT land setup
  NT disused military site

[Figure labels: LandSetup, Site, AbandonedSite, DisusedMilitarySite, SuperfundSite]
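A thesaurus entry of this shape can be turned into subclass edges mechanically, reading BT (broader term) as a superclass and NT (narrower term) as a subclass. The sketch below is illustrative: the function names and the CamelCase naming rule are assumptions, and it does not decide how the result connects to existing entities such as Site.

```python
ENTRY = """abandoned site
THEME POLLUTION
BT land setup
NT disused military site"""

def camel(term):
    """'abandoned site' -> 'AbandonedSite'."""
    return "".join(word.capitalize() for word in term.split())

def subclass_edges(entry):
    """Return (subclass, superclass) pairs implied by the BT and NT lines."""
    lines = [line.strip() for line in entry.splitlines() if line.strip()]
    concept = camel(lines[0])
    edges = []
    for line in lines[1:]:
        tag, _, term = line.partition(" ")
        if tag == "BT":
            edges.append((concept, camel(term)))
        elif tag == "NT":
            edges.append((camel(term), concept))
    return edges

edges = subclass_edges(ENTRY)
```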


Information Extraction from Text and Multimedia Data



Information Extraction from Textual Data

[Figure labels: Fire isLocatedNear Region; containment; county, block, state; result columns fire.name, region.county, containment (value “excellent”)]

containment = “excellent”:

<ACCRUE>(<SENTENCE>(<AND>(<NUMBER>(X), X < 25), <WORD>(%), <WORD>(active)),
  <PHRASE>(full, containment, <STEM>(was), expected),
  <PHRASE>(the, fire, <STEM>(is), contained))

county, block, state:

<ACCRUE>(<SENTENCE>(<PHRASE>(<OR>(New, Las, San), [region.county]),
  <OR>(county, block, state)))

isLocatedNear:

<PARAGRAPH>(FIRE, REGION)


Mapping “domain specific” model elements to media specific metadata

county(x,y) gets mapped to:
– word(x), phrase(x), accrue(<list-of-subtrees>)

containment(x, “excellent”) gets mapped to:
– sentence(<set-of-words>), stem(x), accrue(<list-of-subtrees>)

isLocatedNear(x, y) gets mapped to:
– paragraph(x,y)


Mapping SQL queries to Topic Expressions

select county from region
where isLocatedNear.containment = 'excellent'

<PARAGRAPH>(
  <ACCRUE>(<SENTENCE>(<AND>(<NUMBER>(X), X < 25), <WORD>(%), <WORD>(active)),
    <PHRASE>(full, containment, <STEM>(was), expected),
    <PHRASE>(the, fire, <STEM>(is), contained)),
  <ACCRUE>(<SENTENCE>(<PHRASE>(<OR>(New, Las, San), [region.county]), county))
)
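The composition above, where each ontology predicate contributes a fragment and isLocatedNear wraps two fragments in a PARAGRAPH operator, can be sketched as plain string building. The op helper and the constant names are hypothetical; the sketch reproduces the slide's notation and does not call any search-engine API.

```python
def op(name, *args):
    """Render an operator node in the <NAME>(arg, ...) notation used on the slides."""
    return f"<{name}>({', '.join(args)})"

# Fragment for containment(x, "excellent"), as given on the slide.
CONTAINMENT_EXCELLENT = op(
    "ACCRUE",
    op("SENTENCE",
       op("AND", op("NUMBER", "X"), "X < 25"),
       op("WORD", "%"),
       op("WORD", "active")),
    op("PHRASE", "full", "containment", op("STEM", "was"), "expected"),
    op("PHRASE", "the", "fire", op("STEM", "is"), "contained"),
)

# Fragment for county(x, y); [region.county] marks the slot to be filled.
COUNTY = op(
    "ACCRUE",
    op("SENTENCE",
       op("PHRASE", op("OR", "New", "Las", "San"), "[region.county]"),
       "county"),
)

def is_located_near(fire_expr, region_expr):
    """isLocatedNear(x, y) maps to co-occurrence within one paragraph."""
    return op("PARAGRAPH", fire_expr, region_expr)

query_expr = is_located_near(CONTAINMENT_EXCELLENT, COUNTY)
```

Building the expression from nested calls keeps the parentheses balanced by construction, which is easy to get wrong when such expressions are written by hand.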


Limitations of Current Indexing Technologies: “selection operation”

select county from region

<ACCRUE>(<SENTENCE>(<PHRASE>(<OR>(New, Las, San), WILDCARD),
  <OR>(county, block, state)))

=> post-processing of patterns returned (WILDCARD as place-holder)

Problem:
– WILDCARD may match a lot of words in the same sentence
– WILDCARD may match different words in different sentences


Using NLP and statistical techniques

WILDCARD matches a number of words in the same sentence
– Yeltsin was appointed the Prime Minister when sleeping
– the: article; Prime Minister: noun; when: conjunction; sleeping: verb
=> Use part of speech tagging to reduce number of possibilities

WILDCARD matches different words in different sentences
– Yeltsin was appointed Prime Minister
– Yeltsin was appointed President
=> use frequency statistics to give a level of confidence
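The frequency-statistics idea can be sketched by counting how often each filler appears in the wildcard position and reporting relative frequency as a confidence. The function name, the regular expression, and the three-sentence corpus (the repeated sentence is made up) are assumptions for illustration.

```python
import re
from collections import Counter

def wildcard_confidence(sentences, pattern):
    """Map each wildcard filler to its relative frequency across matching sentences."""
    counts = Counter()
    for sentence in sentences:
        match = re.search(pattern, sentence)
        if match:
            counts[match.group(1)] += 1
    total = sum(counts.values())
    return {filler: n / total for filler, n in counts.items()} if total else {}

# Toy corpus; the duplicated sentence stands in for a second supporting document.
sentences = [
    "Yeltsin was appointed Prime Minister",
    "Yeltsin was appointed President",
    "Yeltsin was appointed President",
]
confidence = wildcard_confidence(sentences, r"Yeltsin was appointed (.+)$")
```

Relative frequency is only one possible confidence measure; it is used here because the slide names frequency statistics without specifying a formula.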


Definition Support

INCIDENT MANAGEMENT SITUATION REPORT

Friday August 1, 1997 - 0530 MDT

NATIONAL PREPAREDNESS LEVEL II

CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires ha[...] staffed for structure protection.

SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena [...] The fire is active on the southern perimeter, which is burning into a continuous stand of black s[...] fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern [...] 35% contained, while protection of the historic cabin continues.

CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Wehassigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. [...] contained. Major areas of heat have been mopped-up. All crews and overhead will mop-up wher[...] burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this we[...] depending on the results of infrared scanning.

Phrase: SIMELS, Galena District, BLM.

Slot: fire.name

Value: SIMELS

Structure: <name> , <place> , <unit> .
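The structure "<name> , <place> , <unit> ." can be sketched as a regular expression that fills the fire.name slot from the opening phrase of a report paragraph. The pattern and the slot names other than fire.name are assumptions for illustration.

```python
import re

# Pattern for the slide's structure "<name> , <place> , <unit> ."
STRUCTURE = re.compile(
    r"^(?P<name>[^,]+),\s*(?P<place>[^,]+),\s*(?P<unit>[^,.]+)\."
)

def fill_slots(paragraph):
    """Return the slots found at the start of a report paragraph, or None."""
    match = STRUCTURE.match(paragraph)
    if not match:
        return None
    return {
        "fire.name": match.group("name").strip(),
        "place": match.group("place").strip(),
        "unit": match.group("unit").strip(),
    }

slots = fill_slots(
    "SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats"
)
```

The same pattern also applies to the second paragraph of the report, which opens with "CHINIKLIK MOUNTAIN, Galena District, BLM."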


MIDAS*: Information Extraction from Multimedia Data

Query: Get me all regions (blocks, counties) having a population greater than 500 and area greater than 50 acres having an urban land cover

select county, block, area, population, spatial_location, land_cover
from region
where area > 50
and population > 500
and land_cover = 'urban'
and relief = 'moderate'

* Media Independent DomAin Specific correlation


Get me all regions (counties, blocks) having 50 < population < 100, 25 < area < 50, and low density urban area land cover ...

media independent correlation across domain specific metadata

correlation across image and structured data at an intensional domain level


[Screenshot labels: Population, Area, Boundaries, Land cover, Relief]

SQL queries to structured data (Census DB)

SQL Gateway to textual data (TIGER/Line DB)

Image Processing routines for Image Data


Mapping “domain specific” model elements to media specific metadata

contained(<concept>, <image>) gets mapped to:
– latitude/longitude, image-coordinates
– bounding box of region
– image type: LULC, DEM

land_cover(x, “low density urban”) gets mapped to:
– percentage(<pixel-color>, <bounding-box>)

relief(x, “moderate”) gets mapped to:
– standard-deviation(<pixel-value>, <bounding-box>)
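The two image-level measures named above, percentage of a pixel class and standard deviation of pixel values inside a bounding box, can be sketched on small grids. The rasters, the class code "U", and the function names are made up for illustration; the slides do not give the thresholds that turn these numbers into “low density urban” or “moderate”, so none are applied here.

```python
from statistics import pstdev

def crop(grid, box):
    """Rows of `grid` restricted to box = (row0, col0, row1, col1), end-exclusive."""
    r0, c0, r1, c1 = box
    return [row[c0:c1] for row in grid[r0:r1]]

def percentage(grid, pixel_class, box):
    """Share (0 to 100) of cells in the box that carry `pixel_class`."""
    cells = [v for row in crop(grid, box) for v in row]
    return 100.0 * sum(v == pixel_class for v in cells) / len(cells)

def standard_deviation(grid, box):
    """Population standard deviation of the cell values in the box."""
    return pstdev([v for row in crop(grid, box) for v in row])

# Toy rasters: a land-use/land-cover grid and an elevation grid (metres).
LULC = [["U", "U", "F"],
        ["U", "W", "F"]]
DEM = [[10, 12, 30],
       [11, 13, 31]]

box = (0, 0, 2, 2)
urban_pct = percentage(LULC, "U", box)
relief_sd = standard_deviation(DEM, box)
```

Both functions take the bounding box as an argument, mirroring how contained(<concept>, <image>) supplies the box that the land_cover and relief mappings then operate on.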


Need for characterization of Domain Vocabularies

Geological Region (land cover classification)
– Urban: Residential, Commercial, Industrial
– Forest Land: Deciduous, Evergreen, Mixed
– Water: Lakes, Reservoirs, Streams and Canals

Geological Region (administrative subdivision; labels as extracted)
– State
– County
– City, Rural Area
– Tract
– Block Group
– Block

Another source of domain ontology construction: Classification Standards


Conclusions and Future Work

Role of semantic content in handling data/information overload
– Domain specific ontologies: an approach for capturing semantic content

Design and construction of domain ontologies
– labor intensive, time consuming, difficult endeavor
– Re-use readily available information (schemas, queries, data dictionaries, thesauri) to minimize the involvement of the domain expert

Metadata is the key for multimedia information retrieval
– Use an expanded notion of metadata as schema and a declarative SQL-like query language
– Pragmatic incorporation of NLP, image, speech, and video processing, and computer vision techniques
– Exploit synergy across multiple media for better precision and performance

Extrapolate this technique into other domains:
– Medical and Bio-Informatics
– Telecommunication
– IP networks (use of CIM information model by DMTF)

Ontology Extraction from Textual Data:
– Clustering techniques to identify central concepts and taxonomic relationships
– NLP techniques to identify concept associations
– Consensus analysis techniques to establish ontologies