7th European Conference on Digital Libraries, 17-22 August 2003, Trondheim, Norway

An Architecture for Online Information Integration on Concurrent Resource Access on a Z39.50 Environment. Michalis Sfakakis (1) and Sarantos Kapidakis (2). (1) National Documentation Centre / National Hellenic Research Foundation. (2) Laboratory on Digital Libraries and Electronic Publishing, Archive and Library Sciences Department / Ionian University.

TRANSCRIPT

Page 1

An Architecture for Online Information Integration on Concurrent Resource Access on a Z39.50 Environment

Michalis Sfakakis (1) and Sarantos Kapidakis (2)

(1) National Documentation Centre / National Hellenic Research Foundation

(2) Laboratory on Digital Libraries and Electronic Publishing, Archive and Library Sciences Department / Ionian University

7th European Conference on Digital Libraries, 17-22 August 2003, Trondheim, Norway

Page 2

Presentation Summary

Main Contributions

Resource Access in a Network Environment (models, characteristics, issues, implementations)

Proposed Architecture (goal, critical points, characteristics, benefits)

Technical Details of the Proposed Architecture

Conclusions

Future Research

Page 3

Main Contributions

Analysis of problems (in a networked environment) for:
• Concurrent resource access via parallel search
• Information integration

Proposal of an architecture for these problems:
• Able to improve online information integration
• Taking into account the restrictions imposed by the:
    Network environment
    Z39.50 information retrieval protocol

Page 4

Resource Access in Union Catalogues

Give access to library content from one central point

Functional requirements
• Consistent searching & indexing
• Consolidation of records (information integration)
• Performance & management

… conformance to current implementation models
• Centralized (the vast majority of the current implementations): conforms well to all functional requirements
• Distributed (current approaches: virtual union catalogues): conformance to the functional requirements varies

Page 5

Why Virtual Union Catalogues (VUC)

Why Centralized → Distributed:

Local autonomy and control of the participating systems

Retention of the specific resource characteristics

User ability to dynamically define their own collections of resources

Vast and increasing number of available resources

Page 6

Pre-requirements for VUC

Ensure systems interoperability, derived from the implementation of international metadata standards and information retrieval protocols

Provide information integration (indicated by user studies)

Achieve acceptable performance from the systems which emulate the union catalogue

Have the ability to search in parallel

Have adequate network performance

Page 7

Is it possible to implement VUC now?

Depends on:

Current technology and network improvements

Existence and wide acceptance of metadata standards (e.g. DC, MARC, MODS, etc.)

Wide acceptance of the Z39.50 information retrieval protocol and its associated profiles

Page 8

Requirements for Information Integration

Information integration (consolidation of records) is a two-step process:
• Identification of the duplicate records
• Presentation: creation of a union record or, according to the Z39.50 duplicate detection model, the clustering of records in ‘equivalence classes’ and the selection of a representative record

Its effectiveness & quality are affected by the:
• Differences in semantic models and formats of the metadata
• Metadata quality (i.e. specificity, completeness of fields, syntactic correctness and consistency as implemented by authority files)
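
The two-step process on this slide can be illustrated with a minimal sketch. It is not the Z39.50 duplicate detection service itself: the match key (a single `isbn` field), the `completeness` score and the sample records are all assumptions made up for the illustration.

```python
from collections import defaultdict

def cluster_records(records, match_key):
    """Step 1: group records into equivalence classes by match_key(record)."""
    classes = defaultdict(list)
    for record in records:
        classes[match_key(record)].append(record)
    return list(classes.values())

def completeness(record):
    """Count non-empty fields; a crude, assumed stand-in for metadata quality."""
    return sum(1 for value in record.values() if value)

def representative(equivalence_class):
    """Step 2: pick the most complete record; ties go to the earliest one."""
    return max(equivalence_class, key=completeness)

# Invented sample data: two copies of one title plus a different title.
records = [
    {"isbn": "1111111111", "title": "Sample Title A", "year": ""},
    {"isbn": "1111111111", "title": "Sample Title A", "year": "1999"},
    {"isbn": "2222222222", "title": "Sample Title B", "year": "2001"},
]
classes = cluster_records(records, match_key=lambda r: r["isbn"])
unique = [representative(c) for c in classes]
```

Choosing the most complete record as the representative is only one possible policy; building a merged union record from all members of a class would be the alternative the slide mentions.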

Page 9

Methods for Information Integration

Depending on the challenge:
• High-quality duplicate detection and merging on large amounts of data, offline, without hard time restrictions
    Development of centralized union catalogues, or creation of collections by harvesting techniques
• Good de-duplication quality on medium to small amounts of data, online, with the results presented to the user in an acceptable response time
    Development of virtual union catalogues

Page 10

Z39.50 Information Retrieval Protocol

A complicated, stateful, client/server protocol, widely used in the area of libraries

For every session (Z-association) a server:
• Holds a search history (at least the last query)
• During the session the client can request data from any result set included in the search history
• The search history stays alive during the session
• The session can be abruptly terminated by the server (timeout), on ‘lack of activity’
    The timeout period is server dependent

Depending on the implementation level, a server could implement, in a number of variations, the:
• Sort service
• Duplicate detection service

Page 11

Summary of VUC Implementation Issues

Network dependent:
• Network links performance & availability

Protocol dependent:
• Interoperability level (e.g. supported services and their implementation variations)
• Timeout period and session reactivation

Participating systems dependent:
• Performance, availability, extensibility, metadata encoding and semantics

De-duplication complexity & cost:
• Highly affected by the different semantic models & formats, quality, completeness, consistency and the amount of the metadata

Overall system performance

Page 12

Current VUC Implementations

Server side:
• The majority support the basic services (e.g. Init, Search, Present, Scan)
• A small number support the sort service
• A minority support the duplicate detection service

Client side:
• Has to deal with heterogeneity in the received result data
• Must overcome timeout issues, avoiding session reactivation
• Has to de-duplicate incoming results, even if each individual server reply contains no duplicates
• The majority of the implementations do not perform any integration, due to performance issues
• Primitive duplicate detection approaches, based on some coded data (e.g. ISBN, ISSN, LC number, etc.)
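
The coded-data approach in the last point could look roughly like the sketch below. The field names (`isbn`, `issn`, `lccn`) and the normalization rule are assumptions for illustration, not taken from any particular client.

```python
import re

CODED_FIELDS = ("isbn", "issn", "lccn")  # hypothetical field names

def normalize_code(value):
    """Uppercase the code and drop everything except letters and digits."""
    return re.sub(r"[^0-9A-Za-z]", "", value).upper()

def coded_keys(record):
    """Return the set of (field, normalized value) pairs present in a record."""
    return {
        (field, normalize_code(record[field]))
        for field in CODED_FIELDS
        if record.get(field)
    }

def is_duplicate(a, b):
    """Two records match if they share at least one coded identifier."""
    return bool(coded_keys(a) & coded_keys(b))
```

Records that carry no coded identifier can never match under this rule, which is one reason the slide calls such approaches primitive.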

Page 13

User – VUC System Interactions

The user:

Defines the desired collection of resources

Sends a search request, specifying a desired number of records (Presentation Set) to display each time

After receiving the Presentation Set, may or may not request subsequent Presentation Sets

[Figure: User Interaction with the Virtual Union Catalogue System, which accesses Resources 1…j, j+1…k and l+1…r through their Z39.50 Servers]

Page 14

Goal of the Proposed Architecture

To improve information integration in online access of a distributed system, which:

Accesses resources concurrently via the network

Applies good-quality online duplicate detection procedures (so that each record located in several resources is presented only once)

Page 15

Critical Points of the Proposed Architecture

We have to deal with:

Performance of the network links and availability of the resources

Complexity and cost of the duplicate detection algorithms, especially on large numbers of records

Extraction of the Presentation Set in a reasonable response time

Page 16

Characteristics of the Proposed Architecture

What we do:

We do not apply the duplicate detection algorithms in one shot: the duplicate detection process is applied to each received set of data, comparing it against the previously processed results

Incremental comparison and elimination of the duplicates in every Presentation Set: the processed results are sorted and do not contain duplicates

Usage of the sort or duplicate detection service, when supported

While the user is reading the results, the system prepares the next few sets of unique records
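
A minimal sketch of the incremental idea: each received batch is compared only against what has already been processed, never against the full result sets. The class name and the `match_key` callback are hypothetical, and a hash set stands in for the sorted processed results described on the slide.

```python
class IncrementalDeduplicator:
    """Keeps the unique records seen so far and filters each new batch."""

    def __init__(self, match_key):
        self.match_key = match_key   # assumed function: record -> comparable key
        self.seen_keys = set()       # keys of the already processed records
        self.unique_records = []     # processed results, free of duplicates

    def add_batch(self, batch):
        """Return only the records of `batch` not seen in earlier batches."""
        new_records = []
        for record in batch:
            key = self.match_key(record)
            if key not in self.seen_keys:
                self.seen_keys.add(key)
                self.unique_records.append(record)
                new_records.append(record)
        return new_records

dedup = IncrementalDeduplicator(match_key=lambda r: r["id"])
first = dedup.add_batch([{"id": "a"}, {"id": "b"}, {"id": "a"}])
second = dedup.add_batch([{"id": "b"}, {"id": "c"}])
```

Because each batch is small, the work per step stays small as well, which is what lets the next sets be prepared while the user is still reading.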

Page 17

Benefits of the Proposed Architecture

Avoid downloading large amounts of data over the network and unnecessarily loading the servers

Apply the duplicate detection algorithm to a small number of records, especially in the first steps

Every record is compared against an already processed set during de-duplication

We exploit the time the user spends reading the presented data, without exhausting the system resources

Page 18

Overview of the Proposed Architecture

Modules: Request Interface, Data Integrator, Resource Communicator

Components: Data Provider, Local Result Set Manager, De-duplicator, Data Presenter

Interaction is accomplished by messages or synchronous data transmissions

[Figure: architecture overview. Labels: Resources 1…j, j+1…k, l+1…r; Z39.50 Servers; Resource Communicator; Profiles of the Z39.50 Servers; Data Integrator; Input Queue; Output Queue; Data Provider; Local Result Set Manager; De-duplicator; Local Result Set; Presentation Set; Data Presenter; Request Interface; User Interaction]

Page 19

Modules of the Proposed Architecture

The Request Interface: receives every user request (search or present), dispatches it to the appropriate modules and waits for the Presentation Set

The Resource Communicator: accesses the resources and supplies the data for the integration

The Data Integrator: receives the data sets, performs the information integration and manages the unique records so that they are ready for presentation

[Figure: module diagram. Labels: Resources 1…j, j+1…k, l+1…r; Z39.50 Servers; Resource Communicator; Data Integrator; Request Interface; User Interaction]

Page 20

Components of the Proposed Architecture

The Local Result Set Manager: holds and arranges (e.g. sorts) the de-duplicated records and prepares the Presentation Set

The Data Provider: receives data from the Resource Communicator module and sends the records one at a time for further processing

The De-duplicator(s): receives a record from the Local Result Set Manager and compares it with all the unique records in the Local Result Set

The Data Presenter: dispatches the request for data received from the Request Interface to the Local Result Set Manager and returns the next unique records for presentation

[Figure: component diagram. Labels: Request Interface; Resource Communicator; Profiles of the Z39.50 Servers; Data Integrator; Input Queue; Output Queue; Data Provider; Local Result Set Manager; De-duplicator; Local Result Set; Presentation Set; Data Presenter]

Page 21

[Figure: module diagram. Labels: Resources 1…j, j+1…k, l+1…r; Z39.50 Servers; Resource Communicator; Data Integrator; Request Interface; User Interaction]

Page 22

Accomplishing a search request – Module Interactions

1. The Request Interface requests p records from the Data Integrator and waits for (at most p) records

2. The Request Interface also forwards the search request, including the number p, to the Resource Communicator and continues monitoring for user requests

3. The Resource Communicator waits for messages from the Request Interface and, when it receives a new search request, concurrently starts the following sequence of steps for every server:
    3.1 Translates the search request into the appropriate message format for the server, sends it and waits for its reply
    3.2 Adds the number of hits from all the replies and sends the total to the Request Interface
    3.3 If the server supports either the duplicate detection or the sort service, invokes it after the server's initial response to the search request
    3.4 Requests a number of records (e.g. p) from every server that replied to its last request
    3.5 Sends the arrived data to the Data Integrator
    3.6 Waits for further commands, but if there is no communication with the server for a period close to its timeout, the procedure jumps to step 3.4

4. The Data Integrator de-duplicates part of the received data, prepares the set of unique records and, when p records are found, sends them to the Request Interface
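
The concurrent part of step 3 can be sketched with a thread pool. `FakeServer` is a hypothetical in-memory stand-in for a Z39.50 session (no real protocol client is used), and the sketch covers only the search, the summing of hits and the first request for p records; the optional sort or duplicate detection service and the timeout handling are left out.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeServer:
    """Hypothetical stand-in for a Z39.50 server session."""

    def __init__(self, name, records, available=True):
        self.name = name
        self.records = records
        self.available = available
        self.result_set = []

    def search(self, query):
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        self.result_set = [r for r in self.records if query in r]
        return len(self.result_set)

    def present(self, start, count):
        return self.result_set[start:start + count]

def search_one(server, query, p):
    """Search one server, then fetch up to p records from its result set."""
    try:
        hits = server.search(query)
    except ConnectionError:
        return 0, []  # an unavailable server contributes nothing
    return hits, server.present(0, p)

def search_all(servers, query, p):
    """Run the per-server sequence concurrently and total the hit counts."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        results = list(pool.map(lambda s: search_one(s, query, p), servers))
    total_hits = sum(hits for hits, _ in results)
    records = [r for _, batch in results for r in batch]
    return total_hits, records

servers = [
    FakeServer("A", ["digital libraries", "digital archives", "maps"]),
    FakeServer("B", ["digital libraries"]),
    FakeServer("C", ["digital maps"], available=False),
]
total_hits, records = search_all(servers, "digital", p=2)
```

In this invented example the summed hit count is 3 although only two distinct records were returned, since one record is held by both available servers.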

Page 23

Module Interactions: Comments & Clarifications

All modules work in parallel

The number of records requested from each server could vary, depending on its performance, its timeout, the network links and the Result Set size

For the sake of overall system performance, the Resource Communicator detects whether a server is down, using the Profiles of the Z39.50 Servers, and continues the interaction with the other modules

The calculated number of hits is not the actual one: the hits are summed over all servers before de-duplication, so records held by several resources are counted more than once

To avoid session reactivation, imposed by the server timeout, the Resource Communicator could request data from any server at any time

A threshold value activates the Data Integrator to ‘request data’ from the Resource Communicator
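
One way to decide when to request data purely to keep a session alive is sketched below. The `margin` parameter (the share of the timeout treated as "close to expiry") and the session table layout are assumptions for illustration; the slides do not specify how close to the timeout the request is issued.

```python
def needs_keepalive(last_contact, now, timeout, margin=0.2):
    """True when the idle time has entered the last `margin` share of the timeout."""
    idle = now - last_contact
    return idle >= timeout * (1.0 - margin)

def servers_to_refresh(sessions, now):
    """Names of the sessions that should receive a data request before they expire."""
    return [
        name
        for name, (last_contact, timeout) in sessions.items()
        if needs_keepalive(last_contact, now, timeout)
    ]

# Hypothetical session table: name -> (time of last contact, server timeout), in seconds.
sessions = {"A": (100.0, 60.0), "B": (130.0, 300.0), "C": (120.0, 50.0)}
due = servers_to_refresh(sessions, now=150.0)
```

Since the timeout period is server dependent, the per-server value would presumably come from the Profiles of the Z39.50 Servers.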

Page 24

[Figure: component diagram. Labels: Request Interface; Resource Communicator; Profiles of the Z39.50 Servers; Data Integrator; Input Queue; Output Queue; Data Provider; Local Result Set Manager; De-duplicator; Local Result Set; Presentation Set; Data Presenter]

Page 25

Accomplishing a search request – Component Interactions

1. The Data Provider starts to transfer data, possibly rearranging them. If the number of records it holds is less than a threshold (e.g. 5p), the Data Provider sends a ‘request data’ message to the Resource Communicator

2. While the Local Result Set Manager holds fewer unique records than a threshold (e.g. 3p), it tries to read from the Data Provider and, for every record found, calls the De-duplicator to compare the record:
    2.1 The De-duplicator compares the record with the records in the Local Result Set and then sends the results back to the Local Result Set Manager
    2.2 The Local Result Set Manager receives the results from the duplicate detection process and arranges the record into the Local Result Set
    2.3 If the number of new unique records in the Local Result Set becomes p, it copies the p new unique records into the Presentation Set and activates the Data Presenter

3. When the Presentation Set is filled with (the p) records, the Data Presenter component dispatches the records to the Request Interface module and waits to receive the next ‘request data’ message from it. If the component does not receive any request during its predefined timeout period, it terminates the system
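
The threshold logic of these steps can be simulated sequentially. This is a toy sketch under stated assumptions: the components run in one thread instead of in parallel, records are plain hashable values compared by equality, and `fetch` is a hypothetical callback standing in for the ‘request data’ message to the Resource Communicator.

```python
from collections import deque

class Pipeline:
    """Toy Data Provider + Local Result Set Manager + De-duplicator."""

    def __init__(self, fetch, p):
        self.fetch = fetch            # assumed callback: returns the next batch, or []
        self.p = p
        self.provider = deque()       # Data Provider buffer
        self.local_result_set = []    # unique records, in arrival order
        self.seen = set()             # what the de-duplicator compares against
        self.presented = 0            # unique records already handed to the user

    def _refill_provider(self):
        # Step 1: below the 5p threshold, ask for more data.
        if len(self.provider) < 5 * self.p:
            self.provider.extend(self.fetch())

    def next_presentation_set(self):
        # Step 2: de-duplicate while fewer than 3p unique records are pending.
        while len(self.local_result_set) - self.presented < 3 * self.p:
            self._refill_provider()
            if not self.provider:
                break                 # the resources are exhausted
            record = self.provider.popleft()
            if record not in self.seen:
                self.seen.add(record)
                self.local_result_set.append(record)
        # Step 3: hand over the next (at most p) unique records.
        start = self.presented
        self.presented = min(start + self.p, len(self.local_result_set))
        return self.local_result_set[start:self.presented]

batches = iter([[1, 2, 2, 3], [3, 4, 5, 1], [6, 7]])
pipeline = Pipeline(fetch=lambda: next(batches, []), p=2)
pages = [pipeline.next_presentation_set() for _ in range(5)]
```

Keeping up to 3p unique records pending means the next Presentation Set is usually ready before the user asks for it, which is the purpose of the two thresholds.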

Page 26

Component Interactions: Comments & Clarifications

The combination of the threshold values in the Data Provider & Local Result Set Manager controls the ‘request data’ activity towards the Resource Communicator

The Local Result Set Manager keeps two orderings of the unique records in order to:
• Improve the performance of the De-duplicator
• Present and facilitate easy access to the stored records
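
The slides do not say what the two orderings are. One plausible reading, sketched here as an assumption, is a key-sorted ordering that lets the de-duplicator use binary search, alongside a presentation ordering from which Presentation Sets are sliced. All names are hypothetical.

```python
import bisect

class LocalResultSet:
    """Unique records kept in two orderings (an illustrative interpretation)."""

    def __init__(self, match_key):
        self.match_key = match_key
        self.sorted_keys = []        # ordering 1: match keys, sorted for binary search
        self.in_arrival_order = []   # ordering 2: records as they will be presented

    def add_if_unique(self, record):
        """Insert the record unless its key is already present; report success."""
        key = self.match_key(record)
        position = bisect.bisect_left(self.sorted_keys, key)
        if position < len(self.sorted_keys) and self.sorted_keys[position] == key:
            return False
        self.sorted_keys.insert(position, key)
        self.in_arrival_order.append(record)
        return True

    def page(self, start, count):
        """Slice of the presentation ordering, e.g. one Presentation Set."""
        return self.in_arrival_order[start:start + count]

lrs = LocalResultSet(match_key=str.lower)
added = [lrs.add_if_unique(t) for t in ["Odyssey", "Iliad", "odyssey", "Aeneid"]]
```

With the sorted ordering, each duplicate check needs a logarithmic number of key comparisons instead of a scan over every stored record.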

Page 27

Conclusions

The online de-duplication process for resources accessed concurrently in a network environment:
• Is a requirement identified by user studies
• Is challenged by a number of issues relevant to:
    Performance of the participating servers
    Their network links
    The complexity and cost of the duplicate detection algorithms

These issues make inefficient any approach to applying information integration:
• In online environments
• Especially when large amounts of data must be processed

In our proposed system:
• We do not try to integrate all the results from all the resources at once
• We attack this problem by:
    Retrieving a small number of records, regardless of whether the servers provide de-duplicated or sorted results
    Applying the de-duplication process to small amounts of sorted records
    Creating a presentation set of unique records to display to the user
    Exploiting the time the user spends reading the presented data, without wasting the system resources

Page 28

Future Research

To better approximate the number of records satisfying the search request

To derive priorities for the servers and their resources

To select or adapt a good de-duplication algorithm for different record completeness and different provision of records by the servers

To optimize the number of records requested from a server

To implement the system and evaluate its performance