

Page 1: Indexing Mathematical Abstracts by Metadata and Ontology IMA Workshop, April 26-27, 2004 Su-Shing Chen, University of Florida suchen@cise.ufl.edu

Indexing Mathematical Abstracts by Metadata and Ontology

IMA Workshop, April 26-27, 2004

Su-Shing Chen, University of Florida

suchen@cise.ufl.edu

Page 2:

Abstract

OAI extensions to federated search and other services for MathML-based metadata indexing and subject classification of mathematical abstracts.

Construction of an ontology or conceptual maps of mathematics. Mathematical formulas are considered as elements of the ontology.

Ontology indexing by clustering mathematical abstracts or full papers into an information visualization interface, so that users may select using the ontology as well as metadata.
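The slides do not say how formulas become indexable elements. As a minimal sketch only, assuming formulas arrive as presentation MathML, the identifier, operator, and number tokens could be counted and used as index terms. The function name `formula_terms` and the sample formula are illustrative assumptions, not part of the talk.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"


def formula_terms(mathml: str) -> Counter:
    """Count identifier (mi), operator (mo), and number (mn) tokens.

    Hypothetical helper: the talk does not specify an indexing scheme.
    """
    root = ET.fromstring(mathml)
    terms = Counter()
    for elem in root.iter():
        tag = elem.tag.replace(MATHML_NS, "")  # drop the namespace prefix
        if tag in ("mi", "mo", "mn") and elem.text:
            terms[f"{tag}:{elem.text.strip()}"] += 1
    return terms


# Illustrative formula: x^2 + y^2 = 1
sample = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    "<mrow><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo>"
    "<msup><mi>y</mi><mn>2</mn></msup><mo>=</mo><mn>1</mn></mrow>"
    "</math>"
)
terms = formula_terms(sample)
```

Token counts like these could then be stored next to the textual metadata of an abstract, so that a formula contributes terms to the same index.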

Page 3:

A DL Server with OAI Extensions: Managing the Metadata Complexity

[Figure: DL Server architecture. Labels: DL Server; Harvester; Harvest API; Data Provider (OAI_DC); Data Provider (OAI_XXX); Service Provider (Federated Search); Service Provider (Data Mining).]

Page 4:

[Figure: DL Server deployment diagram. Labels: Internet; User; Data Provider; Service Provider; Server; Harvester; Data Provider; Service Provider 1 ... Service Provider N; Java DataBase Connectivity (JDBC); Digested Metadata; Harvested Metadata; Service Providers' Data.]

Page 5:

A DL Server with OAI Extensions: Managing the Metadata Complexity

Built-in capabilities:

• Harvester – harvests various OAI-compliant data providers
• Data provider – exposes harvested and existing metadata sets
• Service provider – federated search and data mining capabilities on metadata sets

Page 6:

Harvester

Harvester Interface:

• URL to harvest
• Selective harvesting parameters

[Figure: harvesting flow. Labels: DL Server; Harvester; Harvest API; parameters; harvest; Data Providers; Harvested metadata.]
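The harvester interface takes a URL and selective harvesting parameters. As a rough sketch, a selective request can be written as an OAI-PMH `ListRecords` URL with the protocol's `from`, `until`, and `set` arguments. The base URL `http://example.org/oai`, the set name `math`, and the function names are placeholders, and the talk's own Harvest API is not shown in the slides.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request with optional selective arguments."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # lower datestamp bound
    if until_date:
        params["until"] = until_date    # upper datestamp bound
    if set_spec:
        params["set"] = set_spec        # restrict harvesting to one set
    return f"{base_url}?{urlencode(params)}"


def resume_url(base_url, token):
    """Follow-up request: resumptionToken is the only argument besides verb."""
    return f"{base_url}?{urlencode({'verb': 'ListRecords', 'resumptionToken': token})}"


# Placeholder endpoint and set name, for illustration only.
url = list_records_url("http://example.org/oai",
                       from_date="2004-01-01", set_spec="math")
```

A harvester would fetch `url`, store the returned records, and keep calling `resume_url` while the response carries a non-empty resumption token.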

Page 7:

Harvester Interface

Page 8:

Harvester Interface

Page 9:

Data Provider

Expose single or combined harvested metadata sets to other harvesters

Reformat metadata from different data providers so that it can be harvested by other service providers (e.g., metadata originally in Dublin Core is reformatted to MARC before being exposed)
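A reformatting step of this kind can be pictured as a field-by-field crosswalk. The sketch below assumes a record is a dictionary of unqualified Dublin Core elements. The tag choices loosely follow the Library of Congress Dublin Core to MARC crosswalk and should be checked against it before use. The names `DC_TO_MARC` and `dc_to_marc` are illustrative and not taken from the DL Server.

```python
# Assumed mapping: Dublin Core element -> (MARC tag, subfield code).
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
    "publisher": ("260", "b"),
    "date": ("260", "c"),
}


def dc_to_marc(record):
    """Convert {element: [values]} into sorted (tag, subfield, value) tuples.

    Elements without a mapping are skipped rather than guessed.
    """
    fields = []
    for element, values in record.items():
        if element not in DC_TO_MARC:
            continue
        tag, code = DC_TO_MARC[element]
        for value in values:
            fields.append((tag, code, value))
    return sorted(fields)


# Illustrative record; "rights" has no mapping above and is dropped.
marc_fields = dc_to_marc({
    "title": ["On Lattices"],
    "creator": ["A. Author", "B. Author"],
    "rights": ["open"],
})
```

Skipping unmapped elements keeps the sketch short; a real data provider would more likely log them or carry them in a general note field.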

Page 10:

Service Provider: Federated Search

Emulating a federated search service on existing and combined harvested metadata sets

Federated search across potentially other search protocols
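Searching across combined harvested sets amounts to running one query against each set and merging the hits. The sketch below assumes each set is an in-memory list of records with `id` and `title` keys. Those keys, the source names, and the substring matching are illustrative simplifications, not the service's actual design.

```python
def search_source(records, query):
    """Case-insensitive substring match on the title field of one metadata set."""
    q = query.lower()
    return [r for r in records if q in r["title"].lower()]


def federated_search(sources, query):
    """Query every source and merge the hits, dropping duplicate identifiers."""
    seen = set()
    merged = []
    for name, records in sources.items():
        for record in search_source(records, query):
            if record["id"] in seen:
                continue  # same item harvested from more than one provider
            seen.add(record["id"])
            merged.append({**record, "source": name})
    return merged


# Illustrative harvested sets; identifiers and titles are made up.
sources = {
    "provider_a": [
        {"id": "oai:a:1", "title": "Lattice Theory Notes"},
        {"id": "oai:a:2", "title": "Graph Colouring"},
    ],
    "provider_b": [
        {"id": "oai:a:1", "title": "Lattice Theory Notes"},
        {"id": "oai:b:7", "title": "Lattices and Codes"},
    ],
}
hits = federated_search(sources, "lattice")
```

Tagging each hit with its source lets an interface show where a record came from, even after duplicates are removed.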

Page 11:

Federated Search

Page 12:

Federated Search

Page 13:

Federated Search

Page 14:

Service Provider: Data Mining

Knowledge discovery on harvested metadata sets

Metadata classification using the Self-Organizing Map (SOM) algorithm

Improving retrieval effectiveness by providing concept browsing and search services

Page 15:

Self-Organizing Map Algorithm

Competitive and unsupervised learning algorithm

Artificial neural network algorithm for visualizing and interpreting complex data sets

Providing a mapping from a high-dimensional input space to a two-dimensional output space
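The points above can be made concrete with a small, pure-Python SOM. This is a textbook-style sketch with a Gaussian neighbourhood and linearly decaying learning rate and radius. The grid size, decay schedule, and sample data are assumptions, and it is not the SOM Categorizer used in the talk.

```python
import math
import random


def best_matching_unit(weights, x):
    """Competitive step: the grid node whose weight vector is closest to x."""
    return min(weights,
               key=lambda n: sum((w - v) ** 2 for w, v in zip(weights[n], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Unsupervised training of a rows x cols map on high-dimensional vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                     # learning rate decays to 0
        radius = max(radius0 * (1 - frac), 0.5)   # neighbourhood shrinks
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])     # pull node toward input
    return weights


# Illustrative 4-dimensional inputs in [0, 1]: two loose groups.
data = [[0.0, 0.1, 0.0, 0.1], [0.1, 0.0, 0.1, 0.0],
        [0.9, 1.0, 0.9, 1.0], [1.0, 0.9, 1.0, 0.9]]
som = train_som(data, rows=3, cols=3)
positions = [best_matching_unit(som, x) for x in data]  # 2-D map coordinates
```

Each input vector ends up represented by a `(row, column)` coordinate, which is the two-dimensional output space a browsing interface can draw.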

Page 16:

Data Mining Service Provider System Architecture

[Figure: system architecture. Components: Browser; Concept Harvester; SOM Categorizer; Input Vector Generator; Noun Phraser; Metadata Database. Arrow labels: concept browsing request; concept search request; request; response; fetch metadata; save SOM.]
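The architecture names a Noun Phraser and an Input Vector Generator but gives no detail on either. As a loose sketch, the code below replaces the noun phraser with plain word tokenization, which is a deliberate simplification, and turns each metadata text into a binary term vector of the kind a SOM can consume. The stopword list and function names are assumptions.

```python
import re

# Small illustrative stopword list, not the one used by the system.
STOPWORDS = {"a", "an", "and", "of", "the", "for", "in", "on", "to"}


def extract_terms(text):
    """Crude stand-in for a noun phraser: lowercase words minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]


def build_vocabulary(documents):
    """Sorted list of every term seen across the metadata texts."""
    return sorted({t for doc in documents for t in extract_terms(doc)})


def input_vector(text, vocabulary):
    """Binary vector: 1 where the vocabulary term occurs in the text."""
    present = set(extract_terms(text))
    return [1 if term in present else 0 for term in vocabulary]


docs = ["Indexing of mathematical abstracts",
        "Clustering mathematical formulas"]
vocab = build_vocabulary(docs)
vectors = [input_vector(d, vocab) for d in docs]
```

With a real noun phraser the vocabulary would hold multi-word phrases rather than single words, but the vector construction would stay the same.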

Page 17:

Concept Harvester

Screenshot of the SOM Categorizer

Page 18:

Construction of Two-level Concept Hierarchy

Constructing the SOM for each harvested metadata set

SOMs of the lower layer are added to the upper-layer SOM.

VTETD

Page 19:

Top-level Concept Browsing

Page 20:

Bottom-level Concept Browsing

Page 21:

MEDLINE Database

Developed by the National Library of Medicine (NLM)

Bibliographic citations and abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries.

Covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences.

Over 12 million citations

Searchable via PubMed or the NLM Gateway

Page 22:

MeSH (Medical Subject Headings)

MEDLINE uses MeSH as its controlled vocabulary for indexing database articles

Indexers scan an entire article and assign MeSH headings (or MeSH descriptors) to each article

MeSH descriptors are arranged in both an alphabetic list and a hierarchical structure.

Updated annually to reflect the changes in medicine and medical terminology

Page 23:

Our Experimentation

Problems

It is well known that searching by descriptors will greatly improve the search precision.

However, it is very difficult for naïve users to know and use exact MeSH descriptors to search.

In addition, as the database of MEDLINE grows, information overload would prevent users from finding relevant information of their interest.

Proposed Approach

Categorizations according to MeSH terms, MeSH major topics, and the co-occurrence of MeSH descriptors

Clustering using the results of MeSH term categorization through the Knowledge Grid

Visualization of categories and hierarchical clusters
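Categorization by co-occurrence of MeSH descriptors can be illustrated by counting how often two descriptors are assigned to the same citation. The sketch below assumes each citation is simply a list of descriptor strings. The sample data are made up, and nothing here reproduces the Knowledge Grid pipeline from the talk.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(citations):
    """Count unordered descriptor pairs that appear on the same citation."""
    counts = Counter()
    for descriptors in citations:
        # sorted(set(...)) gives each pair one canonical order, counted once
        for pair in combinations(sorted(set(descriptors)), 2):
            counts[pair] += 1
    return counts


# Illustrative citations, each with its assigned descriptors.
citations = [
    ["Humans", "Neoplasms", "Algorithms"],
    ["Neoplasms", "Humans"],
    ["Algorithms", "Humans"],
]
pair_counts = cooccurrence(citations)
```

Pairs with high counts are candidates for being grouped into one category, and the resulting counts can feed a later clustering step.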

Page 24:

Data Access Services

MeSH Major Topic Tree View

SOM Tree View

Page 25:

Knowledge Grid

Knowledge Grid Architecture

[Figure: Knowledge Grid architecture. Labels: High level K-Grid layer; Core K-Grid layer; Generic and Data Grid Services; DA (Data Access Service); TAAS (Tools and Algorithms Access Service); EPM (Execution Plan Management); RPS (Result Presentation Service); KDS (Knowledge Directory Service); RAEM (Resource Allocation and Execution Management); KMR; KBR; KEPR.]

Courtesy of Cannataro and Talia (Knowledge Grid: An Architecture for Distributed Knowledge Discovery)

Page 26:

Future Directions

Develop a federated search service for OAI-compliant mathematical abstracts.

Develop an ontology or conceptual maps for mathematics.

Develop an ontology search service for mathematical abstracts and full papers.

Develop an interoperable architecture with other services, such as OCR of mathematical formulas.

Page 27:

Acknowledgement

Many thanks to the NSF NSDL Program.

Collaborators – Joe Futrelle (NCSA), Ed Fox (Virginia Tech)

Student Team – Hyunki Kim, Chee Yoong Choo, Xiaoou Fu, Yu Chen