Data portal based on Open Archives Initiative Protocols and Apache Lucene
Post on 28-Jan-2016
WDC-MARE – World Data Center for Marine Environmental Sciences
Data portal based on Open Archives Initiative Protocols and Apache Lucene
Uwe Schindler, uschindler@wdc-mare.org
Michael Diepenbroek, mdiepenbroek@wdc-mare.org
MARUM, University of Bremen, Germany
EGU 2006, Vienna, 2006-04-03
Data Portals
WDC-MARE with its information system PANGAEA provides data portals for several EU/international projects: CARBOOCEAN, EUR-OCEANS, IODP
Problem: Not all data are stored centrally, so all datasets provided in portals must be consolidated from different sources!
Example: CARBOOCEAN data portal
• Data stays at the data providers
• Metadata is harvested by the portal
• Search queries are handled by the centralized catalogue
• Scientist gets link to data at the provider
Open Archives Protocol
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed by the Open Archives Initiative.
• Google uses it during web crawling (Google Scholar)
• Almost all digital libraries support it (most famous ones: arXiv and the CERN Document Server)
• Very simple to implement (XML over HTTP based)
• Repository software for databases or file system metadata providers is widely available
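The "XML over HTTP" point can be illustrated with a minimal sketch of one harvesting step: building a ListRecords request and reading record headers from the response. The verb, the `metadataPrefix`, `set` and `from` arguments, and the response namespace come from the OAI-PMH 2.0 specification; the base URL, the set name, and the sample response below are made up for illustration, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, set_spec=None, from_date=None):
    """Build a ListRecords request URL; OAI-PMH arguments are plain query parameters."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_headers(response_xml):
    """Return a list of (identifier, datestamp) pairs and the resumption token, if any."""
    root = ET.fromstring(response_xml)
    records = [
        (
            header.findtext("oai:identifier", namespaces=OAI_NS),
            header.findtext("oai:datestamp", namespaces=OAI_NS),
        )
        for header in root.iterfind(".//oai:record/oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return records, token or None


# Hypothetical, heavily shortened response used only for this example.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:dataset-1</identifier>
        <datestamp>2006-03-01</datestamp>
      </header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://example.org/oai", "oai_dc", set_spec="carbon")
records, token = parse_headers(SAMPLE_RESPONSE)
```

A harvester would repeat the request with the returned resumption token until the repository stops sending one.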
Current OAI-PMH software
1. Limited to Dublin Core metadata (libraries)!
2. Limited full text search functionality due to relational databases in the background!
3. No geographic retrievals (because of Dublin Core limitation)!
4. End user interface is part of the software; this limits usability in content management systems
Requirements for portal software
1. Open for any XML metadata format
2. Any mappings to document fields should be done by XPath
3. Possibility to map incompatible XML schemas during harvesting by XSL
4. No relational database, only a full text search engine that contains everything needed for operation
5. Range queries for specific fields (date/time or numeric)
6. Web service interface for the end user software that is accessible from any language (Java/JSP, PHP, Perl, ...)
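Requirement 5 is worth a small illustration, because a full text engine that compares terms as strings needs date and numeric values in a form whose string order matches their natural order. The sketch below shows one general way to do that with fixed-width encodings. It does not claim to be how the portal software or Lucene handles this internally; the function names and the toy in-memory index are assumptions made for the example.

```python
from datetime import date


def encode_date(d):
    """Encode a date as a fixed-width string (YYYYMMDD) that sorts chronologically."""
    return f"{d.year:04d}{d.month:02d}{d.day:02d}"


def encode_int(n, width=10):
    """Zero-pad a non-negative integer so that string order equals numeric order."""
    if n < 0:
        raise ValueError("this sketch handles non-negative values only")
    return str(n).zfill(width)


def range_query(index, field, lower, upper):
    """Return ids of documents whose encoded field value lies in [lower, upper]."""
    return [
        doc_id
        for doc_id, fields in index.items()
        if field in fields and lower <= fields[field] <= upper
    ]


# Toy index: document id -> encoded field values (hypothetical data).
index = {
    "ds1": {"date": encode_date(date(2005, 6, 1)), "depth": encode_int(250)},
    "ds2": {"date": encode_date(date(2006, 1, 15)), "depth": encode_int(1200)},
    "ds3": {"date": encode_date(date(1999, 12, 31)), "depth": encode_int(40)},
}

by_date = range_query(
    index, "date", encode_date(date(2005, 1, 1)), encode_date(date(2006, 12, 31))
)
by_depth = range_query(index, "depth", encode_int(100), encode_int(2000))
```

Without the padding, the string "40" would sort after "1200", which is why the encoding step matters for numeric fields.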
MetadataPortal Java Package
[Architecture diagram. Components shown: data sources (OAI-PMH repositories, XML files); OAI-Harvester (OAI protocol in HTTP, optionally for a specific set); Filesystem-Harvester (filesystem directory, FTP, …); XSL; Lucene indexes; VirtualIndex; Mini PanHTTP Server, Jetty HTTP Server, Tomcat, Apache Axis; Portal 1 (Webserver, PHP); Portal 2 (Webserver, JSP)]
Stored:
xmldata (same format everywhere, XSL before indexing), identifier, lastModified, sets
Searchable:
field1: "/oai_dc:dc/dc:author"
field2: "/oai_dc:dc/dc:title"
field3: "java:org.test.LatLon.parse(/oai_dc:dc/dc:coverage)" *
default: "."
*) xmlns:java="http://xml.apache.org/xalan/java"
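The field configuration above maps XPath expressions to searchable fields. As a rough sketch of that idea, the following code evaluates simple absolute paths against one harvested record and collects the values per field. It uses Python's standard library, which supports only a subset of XPath, so the `evaluate` helper is an assumption written for this example and not the package's own mechanism. The extension-function field (field3) is left out because the slide does not give the coverage format. The sample record mirrors the slide's `dc:author`; standard Dublin Core records use `dc:creator`.

```python
import xml.etree.ElementTree as ET

# Namespaces of unqualified Dublin Core records in OAI-PMH.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Field name -> path, following the slide's configuration (field3 omitted).
FIELDS = {
    "field1": "/oai_dc:dc/dc:author",
    "field2": "/oai_dc:dc/dc:title",
    "default": ".",
}


def qname(prefixed):
    """Turn 'dc:title' into ElementTree's '{namespace}title' form."""
    prefix, local = prefixed.split(":")
    return "{%s}%s" % (NS[prefix], local)


def evaluate(root, path):
    """Evaluate '.' (all text) or a simple absolute path like /oai_dc:dc/dc:title."""
    if path == ".":
        return [" ".join(t.strip() for t in root.itertext() if t.strip())]
    steps = path.strip("/").split("/")
    if root.tag != qname(steps[0]):
        return []  # the path's first step does not match the record's root element
    if len(steps) == 1:
        return [root.text or ""]
    return [el.text or "" for el in root.findall("/".join(steps[1:]), NS)]


def build_document(xml_text):
    """Map one metadata record to a dict of field name -> list of values."""
    root = ET.fromstring(xml_text)
    return {name: evaluate(root, path) for name, path in FIELDS.items()}


# Hypothetical record used only for this example.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sea surface pCO2</dc:title>
  <dc:author>A. Example</dc:author>
  <dc:author>B. Example</dc:author>
</oai_dc:dc>"""

doc = build_document(SAMPLE_RECORD)
```

Keeping the mapping in a plain table like `FIELDS` is what makes the approach open to any XML metadata format: a new format only needs new paths, not new code.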
CARBOOCEAN Data Portal
Metadata standard harvested for search: DIF v9.4
Searchable fields: Bounding box, date/time, parameters, authors, investigators, title
Data centers:
• World Data Center for Marine Environmental Sciences (WDC-MARE), University of Bremen and Alfred-Wegener-Institute in Bremerhaven, Germany
• Carbon Dioxide Information Analysis Center (CDIAC), Environmental Sciences Division at Oak Ridge National Laboratory, USA
• French National Oceanographic Data Centre, SISMER (Systèmes d'Informations Scientifiques pour la Mer) at the Ifremer in Brest, France
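A bounding box search of the kind listed among the searchable fields comes down to an overlap test between the query box and each dataset's box. The sketch below shows that test in isolation; the `BoundingBox` type and the sample coordinates are assumptions for illustration, and boxes that cross the antimeridian are deliberately not handled.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Geographic box in decimal degrees (hypothetical helper type)."""
    west: float
    south: float
    east: float
    north: float


def intersects(a, b):
    """True if the two boxes overlap or touch (no antimeridian handling)."""
    return (
        a.west <= b.east
        and b.west <= a.east
        and a.south <= b.north
        and b.south <= a.north
    )


# Hypothetical dataset extents and a query box over the North Atlantic.
datasets = {
    "north_atlantic_cruise": BoundingBox(-40.0, 30.0, -10.0, 60.0),
    "southern_ocean_cruise": BoundingBox(0.0, -70.0, 30.0, -50.0),
}
query = BoundingBox(-30.0, 40.0, 0.0, 50.0)

hits = [name for name, box in datasets.items() if intersects(box, query)]
```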
Thank you!