integration and warehousing of social metadata for …disi.unitn.it/~dmirylenka/socinfo11.pdf ·...

Integration and Warehousing of Social Metadata for Search and Assessment of Scientific Knowledge

Daniil Mirylenka, Fabio Casati, and Maurizio Marchese

Department of Information Engineering and Computer Science, University of Trento, Via Sommarive 14, 38123, Trento, Italy

{dmirylenka, casati, marchese}@disi.unitn.it

Abstract. With the advancement of the Web, novel types of science-related data and metadata are emerging from a growing number of various sources. Alongside traditional bibliographic data provided by digital libraries, great amounts of social metadata (such as bookmarks, "reads", tags, comments and "likes") are created and accumulated by social networking services. We believe that these metadata can be fruitfully used for improving search and assessment of scientific knowledge. The individual sources of scientific metadata differ largely in their focus, functionality, data coverage and data quality, and are currently limited to their own databases and data types. We suggest that we can enhance the current individual services by integrating their data and metadata. In this paper we discuss the opportunities and challenges of such integration for the purpose of facilitating the discovery and evaluation of scientific knowledge, and present a framework for integration and warehousing of both bibliographic and social scientific metadata.

1 Introduction

Dissemination and evaluation of scientific knowledge are essential to the progress of science in any field. On a daily basis researchers search for scientific contributions, being guided by various reputation metrics in judging their quality and relevance. With the advent of the Web, the opportunity for new models of scientific knowledge dissemination and evaluation has emerged. Digital libraries have enabled effective search over large collections of bibliographic metadata about published contributions and their authors, and provided access to citation-based metrics such as the number of citations and the h-index [6].

The Social Web has created new types of scientific data and metadata. Being no longer restricted to published articles, scientific knowledge may now be contained in different types of resources such as publication preprints or user blogs. Social networking services have also influenced the way scientific knowledge is disseminated. Using the Web, researchers now generate large amounts of usage metadata by expressing their opinions on scientific resources, either explicitly or implicitly – by adding them to personal and group libraries, "liking", sharing, downloading or sending them by e-mail to colleagues. Moreover, they semantically enrich and structure the information by tagging, annotating and linking the resources.


There are, however, a number of problems that prevent these social and bibliographic metadata from being fully exploited. First, today's scientific digital libraries differ in their focus, data coverage and data quality, most often restricting the search to one particular database. Second, web users usually participate in a limited number of social networking services, thus partitioning the potentially available social metadata and, similarly, limiting the search to a few sources at a time, typically one.

In our current work, we propose a conceptual model, design and implementation of a socially-oriented Scientific Resource Space (SRS) – an IT platform for the integration of various sources of bibliographic and social scientific metadata. Our final goal is to use this platform and the warehoused metadata to facilitate the discovery, navigation and assessment of scientific knowledge. We argue that, by integrating the metadata from various sources, we will be able to improve upon existing services by providing: (a) enhanced search over a greater amount of data and metadata, (b) optimized search and navigation taking into account larger amounts of user-provided structural metadata, such as tags and links between resources, (c) improved ranking algorithms based on the combination of traditional citation-based and novel social usage metrics. Moreover, we suggest that the proposed platform can be the primary tool for exploring and analyzing the social metrics and the space of scientific resources in general.

In the following section, we first present a critical overview of the state of the art in the integration of bibliographic and social metadata. In Section 3 we describe our conceptual model for the proposed socially-oriented Scientific Resource Space, while in Section 4 we detail the architecture and the processes involved in the implementation of the SRS platform. In Section 5 we discuss our early experiences and conclude the paper.

2 State of the Art

Many scholarly digital libraries are populated in part or in full by collecting data from other digital libraries or web pages. Therefore, the problem of data and metadata integration has been widely addressed in this field. Depending on the scope and sizes of their datasets, different libraries employed different approaches to collecting and maintaining their data.

The Digital Bibliography & Library Project (DBLP) is one of the most widely used bibliographic databases in the Computer Science domain. DBLP gets its data from a predefined number of sources, relying largely on human effort during both the data acquisition and cleaning phases [11]. This approach allows for high data quality, but is only feasible with small to medium datasets.

Scholarly search engines, such as Google Scholar or Microsoft Academic Search, represent another approach to the integration of bibliographic metadata. Although they disclose no precise information about their architectures and algorithms, it is known that they obtain their data by crawling the web and extracting bibliographic records from the publication texts¹.

¹ http://scholar.google.com/scholar/inclusion.html, http://academic.research.microsoft.com/About/Help.htm


Attempts at personalization and socialization of digital libraries led to the creation of a number of specialized social networking services (SNS) for scientists [7]. Besides search functionality, sites like Mendeley, CiteULike or Connotea allow users to create and maintain libraries of scientific resources, and to tag, annotate, group, share and comment on them. In contrast to traditional digital libraries, resource metadata can also be provided by users, who are allowed (and encouraged) to add new resources to the database. In general, this approach creates the opportunity for collecting large amounts of metadata, but results in lower metadata quality, for instance through resource duplication.

Various data models and protocols for scholarly metadata integration have been proposed, with their focus being mainly on bibliographic metadata. The Dublin Core [19] metadata standard has been adopted by the Open Archives Initiative (OAI) [9] to enable interoperability between digital libraries through a metadata harvesting protocol (OAI-PMH) [10]. This allowed for the creation of large bibliographic databases, such as OAIster, that integrate data from the sources that support OAI-PMH (by exposing their data in a predefined format). Other models of scientific metadata include The Bibliographic Ontology [5], The SWRC Ontology [18], the Common European Research Information Format (CERIF) [1], the CCLRC Scientific Metadata Model [12], the MESUR ontology [17] and others.
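To make the harvesting step concrete, the following is a minimal sketch of how a harvester might build an OAI-PMH ListRecords request. The verb, the oai_dc metadata prefix and the from/until/set arguments are defined by the protocol; the repository base URL and the helper name list_records_url are hypothetical and serve only as an illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None,
                     until_date: Optional[str] = None,
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # Optional selective-harvesting arguments defined by the protocol.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used only for illustration.
url = list_records_url("http://repository.example.org/oai", from_date="2011-01-01")
```

Passing a from date restricts the harvest to records changed since that date, which is what makes repeated, incremental harvesting practical.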

Attempts have also been made to address the integration of social metadata. The World Wide Web Consortium (W3C) has investigated the opportunities for a federated social web [16]. Among other activities, this includes the work on OStatus (a standard for exchanging status updates), Portable Contacts, and the Semantic Microblogging Framework (SMOB) [14]. The Semantic Web community has proposed ontologies for modeling different aspects of the social web [3][2][4][15].

3 Conceptual model

In this work we rely on the definition of the Scientific Resource Space (SRS) [13] and extend it with the notion of social metadata. In brief, SRS provides homogeneous programmatic access to scientific resources present on the web by abstracting the various kinds of data and metadata into a uniform conceptual model to support uniform access logic.

Our conceptual model for a socially-oriented SRS (Figure 1) revolves around scientific resources and relations between them, as well as their bibliographic and social metadata. Scientific Resource is the central concept in our model, and it represents any identifiable content on the web that is of scientific value or is involved in the research process, be it a publication, review, dataset, experiment or even a blog entry or wiki page. The main attributes of a Scientific Resource are URI, type, format and title. For example, consider the Scientific Resource with the following attributes: (a) title: "Data integration: A theoretical perspective", (b) URI: http://doi.acm.org/10.1145/543613.543644, (c) type: conference paper, and (d) format: PDF.

Connected to Scientific Resource with many-to-many relations are the entities Author, Publisher and Venue, representing respectively contributors to the creation of the resource and to its dissemination. For our example paper, the Author would be an entity representing a scientist, Maurizio Lenzerini, Venue would represent the Symposium on Principles of Database Systems, and Publisher would represent the Association for Computing Machinery (ACM).

Fig. 1. Conceptual model of Scientific Resource Space (SRS)

Connections between scientific resources are modeled as relations of different types, of which an important example is a citation relation between publications. Others include relations between papers and the corresponding presentation slides, or between experiments and the datasets they are based on, or between experiments and the papers reporting on them. Versioning and similarity between scientific resources are among other aspects that can be modeled via relations.

The main focus of our model is the social activities around scientific resources, such as how people use and disseminate the resources, and what they say about them. SocialMetadata captures these activities with its three subtypes. FreeText represents unstructured texts, such as comments or notes attached to resources, which we do not intend to interpret. LabelText is a text that can serve for classification of resources, a typical example being users' tags in social bookmarking sites. We may or may not want to interpret these labels, establish relations between them or merge them into a single categorization scheme. The third type of SocialMetadata is Action, which models any kind of user activity towards a resource, such as sharing, ranking, "liking", bookmarking or downloading. Depending on the type and value associated with it, an action may express users' interest in a resource and their assessment of its quality or relevance. The interpretation of Actions is, however, left to applications.

The presented conceptual model is also the underlying model of our metadata warehouse, and therefore explicitly includes some of the attributes of the data integration process. Source stands for the source system, such as DBLP, Mendeley or CiteULike, that provided the particular metadata element. Time is the time when the metadata element was acquired from the source. User is an optional attribute representing the web user who created, explicitly or implicitly, the metadata element within the Source. In the case of SocialMetadata, User is the same subject who performs an activity involving a scientific resource.
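The conceptual model described in this section can be sketched as a set of plain data classes. This is only an illustrative reading of the model, not the actual SRS schema: the class and field names are assumptions, relations between resources are omitted for brevity, and the tag, user name and acquisition date in the usage example are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Provenance:
    """Integration attributes attached to a metadata element."""
    source: str                 # source system, e.g. "DBLP" or "Mendeley"
    time: datetime              # when the element was acquired from the source
    user: Optional[str] = None  # web user who created the element, if known


@dataclass
class SocialMetadata:
    """Common base of the three subtypes of social metadata."""
    provenance: Provenance


@dataclass
class FreeText(SocialMetadata):
    text: str = ""              # comment or note, left uninterpreted


@dataclass
class LabelText(SocialMetadata):
    label: str = ""             # e.g. a user's tag


@dataclass
class Action(SocialMetadata):
    kind: str = ""                 # e.g. "bookmark", "like", "download"
    value: Optional[float] = None  # e.g. a rating, when the action carries one


@dataclass
class ScientificResource:
    uri: str
    type: str
    format: str
    title: str
    authors: List[str] = field(default_factory=list)
    venues: List[str] = field(default_factory=list)
    publishers: List[str] = field(default_factory=list)
    social: List[SocialMetadata] = field(default_factory=list)


# The example resource from the text, with one invented tag attached.
paper = ScientificResource(
    uri="http://doi.acm.org/10.1145/543613.543644",
    type="conference paper",
    format="PDF",
    title="Data integration: A theoretical perspective",
    authors=["Maurizio Lenzerini"],
    venues=["Symposium on Principles of Database Systems"],
    publishers=["Association for Computing Machinery (ACM)"],
)
tag = LabelText(Provenance("CiteULike", datetime(2011, 1, 1), user="alice"),
                label="data-integration")  # hypothetical tag, user and date
paper.social.append(tag)
```

Keeping the provenance on every social metadata element is what later allows an application to ask which source, and which user, a tag or action came from.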

4 Socially-oriented Scientific Resource Space and Metadata Integration

The proposed SRS model presents a facade between the client applications and various data sources, providing uniform access to the integrated data of the latter. It is composed of the integration layer and the set of APIs through which it is accessed by the applications. The integration layer consists of the adapter layer, the Metadata Warehouse, and the on-demand data acquisition engine. The adapter layer encapsulates the particularities of the data sources and their data models and helps to cope with the heterogeneity of scientific metadata. Each adapter is responsible for getting metadata according to the protocols and APIs of the source and transforming it into the model of the Scientific Resource Space. The transformed metadata can then be subjected to warehousing or, after being processed, served directly to the client application. In the following, we describe the metadata warehouse and the on-demand data acquisition in more detail.
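One way to picture the adapter layer is as a common interface that each source-specific adapter implements. The sketch below is an assumption about how such an interface could look, not the SRS code: the names SourceAdapter, fetch, to_srs and InMemoryAdapter are hypothetical, and the example adapter serves canned records instead of calling a real source API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class SourceAdapter(ABC):
    """Hides one source's protocol and data model behind a common interface."""

    source_name: str

    @abstractmethod
    def fetch(self, query: str) -> Iterable[dict]:
        """Return raw records in the source's own format."""

    @abstractmethod
    def to_srs(self, raw: dict) -> Dict[str, object]:
        """Map one raw record onto SRS attributes."""

    def search(self, query: str) -> List[Dict[str, object]]:
        # Every transformed record is stamped with the source it came from.
        return [dict(self.to_srs(r), source=self.source_name)
                for r in self.fetch(query)]


class InMemoryAdapter(SourceAdapter):
    """Stand-in for a real source: filters canned records by title."""

    source_name = "ExampleSource"  # hypothetical source

    def __init__(self, records: List[dict]):
        self._records = records

    def fetch(self, query: str) -> Iterable[dict]:
        return [r for r in self._records if query.lower() in r["name"].lower()]

    def to_srs(self, raw: dict) -> Dict[str, object]:
        # This source calls the title "name" and the URI "link".
        return {"title": raw["name"], "uri": raw["link"], "type": "publication"}


adapter = InMemoryAdapter([
    {"name": "A paper on dataspaces", "link": "http://example.org/1"},
    {"name": "A paper on ranking", "link": "http://example.org/2"},
])
results = adapter.search("dataspaces")
```

With this shape, adding a new source means writing one adapter class, while the warehouse and the on-demand engine only ever see records in the SRS model.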

4.1 Metadata Warehousing

The central component of the SRS integration layer is the Metadata Warehouse module (Figure 2), whose implementation largely follows the traditional ETL (Extract, Transform, Load) process. The scientific metadata is first gathered from a source by the corresponding adapter and stored in a so-called source dump, a set of preliminary tables dedicated to this source. The metadata is then loaded into the staging area, where it is joined with metadata from other sources. At this stage, metadata elements are preliminarily merged based on the identifiers provided by the sources to ensure that there are no duplicates at the source level. During the following cleaning phase the staging area is analyzed to discover and merge entities duplicated across different sources. After being cleaned, the metadata is finally loaded from the staging area into the target database, where it is made available to the applications. At each stage of the process only incremental changes are made to the corresponding tables, which is achieved by computing the difference between the desired and the current state of the tables.
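The incremental update described above can be illustrated with in-memory tables. This is a minimal sketch under simplifying assumptions: tables are modeled as dictionaries keyed by a row identifier, the function names are hypothetical, and a real warehouse would express the same difference in SQL against the source dump and staging tables.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Table = Dict[str, Row]  # rows keyed by their identifier


def dedupe_by_source_id(records: List[Row]) -> Table:
    """Source-level merge: keep one row per (source, source_id) pair."""
    table: Table = {}
    for rec in records:
        key = f"{rec['source']}:{rec['source_id']}"
        # Later records fill in or overwrite attributes of earlier ones.
        table.setdefault(key, {}).update(rec)
    return table


def diff(current: Table, desired: Table) -> Tuple[Table, Table, List[str]]:
    """Return the inserts, updates and deletes that turn current into desired."""
    inserts = {k: v for k, v in desired.items() if k not in current}
    updates = {k: v for k, v in desired.items()
               if k in current and current[k] != v}
    deletes = [k for k in current if k not in desired]
    return inserts, updates, deletes


def apply_changes(current: Table, inserts: Table, updates: Table,
                  deletes: List[str]) -> None:
    """Apply only the incremental changes to the table, in place."""
    for key in deletes:
        del current[key]
    current.update(inserts)
    current.update(updates)


# Invented example data: one stale row in staging, three rows in the dump.
staging: Table = {"DBLP:1": {"source": "DBLP", "source_id": "1", "title": "Old title"}}
dump = [
    {"source": "DBLP", "source_id": "1", "title": "New title"},
    {"source": "DBLP", "source_id": "1", "title": "New title"},  # duplicate
    {"source": "DBLP", "source_id": "2", "title": "Another paper"},
]
desired = dedupe_by_source_id(dump)
inserts, updates, deletes = diff(staging, desired)
apply_changes(staging, inserts, updates, deletes)
```

Here one row is inserted and one is updated, so the cost of a load is proportional to what changed rather than to the size of the table.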

The applications built on top of SRS focus on different uses of the scientific metadata. In order to provide useful functionality with reasonable performance, they require efficient access to their own representations of the scientific resource space. For instance, Reseval [8] – a tool for evaluating scientific contributions, authors and institutions – uses various research impact metrics. The numbers of citations and self-citations of a paper or an author are the primary units of data for Reseval, accessed very frequently and used to construct more complex metrics. For efficient access, these numbers cannot be calculated dynamically and have to be precomputed. SRS addresses this problem by creating application-specific views that contain all the data needed by the application in a suitable format, and are updated at the final stage of the ETL process.
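As an illustration of such a precomputed view, the sketch below builds per-paper citation and self-citation counts in a single pass. The input data is invented, the function name build_citation_view is hypothetical, and the definition of a self-citation (the citing and cited papers share at least one author) is an assumption, since the text does not define it.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Invented input: who wrote each paper, and (citing, cited) pairs.
authors: Dict[str, Set[str]] = {
    "p1": {"alice"},
    "p2": {"alice", "bob"},
    "p3": {"carol"},
}
citations: List[Tuple[str, str]] = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]


def build_citation_view(authors: Dict[str, Set[str]],
                        citations: List[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Precompute per-paper citation and self-citation counts in one pass."""
    total: Counter = Counter()
    self_cites: Counter = Counter()
    for citing, cited in citations:
        total[cited] += 1
        # Assumption: a self-citation is one where the papers share an author.
        if authors[citing] & authors[cited]:
            self_cites[cited] += 1
    return {
        paper: {"citations": total[paper], "self_citations": self_cites[paper]}
        for paper in authors
    }


view = build_citation_view(authors, citations)
```

An application can then read the counts for a paper with a single lookup, and the view only needs to be rebuilt when the ETL process changes the underlying citation data.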


In order to enable source-dependent requests, SRS propagates the information about the sources of metadata elements through all of the stages of the process to the target database and the application-specific views. At any time, for any metadata element, it is possible to know which source or sources it originates from. In the case of Reseval this enables the computation of metrics with respect to any source or combination of sources.
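The effect of propagating source information can be shown with a small example in which every citation carries the set of sources that reported it. The data and the function name citation_count are hypothetical; the sketch only illustrates how a metric can be restricted to a chosen combination of sources.

```python
from typing import FrozenSet, Iterable, List, Tuple

# (citing, cited, sources that reported this citation); invented data.
Citation = Tuple[str, str, FrozenSet[str]]

citations: List[Citation] = [
    ("p2", "p1", frozenset({"DBLP", "MAS"})),
    ("p3", "p1", frozenset({"MAS"})),
    ("p4", "p1", frozenset({"Scopus"})),
]


def citation_count(paper: str, citations: Iterable[Citation],
                   sources: Iterable[str]) -> int:
    """Count citations of `paper` reported by at least one of `sources`."""
    wanted = set(sources)
    return sum(1 for _, cited, seen_in in citations
               if cited == paper and seen_in & wanted)


dblp_only = citation_count("p1", citations, ["DBLP"])
dblp_or_mas = citation_count("p1", citations, ["DBLP", "MAS"])
all_sources = citation_count("p1", citations, ["DBLP", "MAS", "Scopus"])
```

Because a citation reported by two sources is stored once with both source labels, combining sources does not count it twice.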

[Figure 2: architecture diagram. Data sources: digital libraries (Scopus, DBLP), search engines (MAS, Google Scholar), general social web (Delicious, Twitter) and scientific social web (CiteULike, Mendeley). SRS integration layer: source adapters, Metadata Warehouse (ETL), on-demand data acquisition, application views. SRS APIs: Search API, Graph API. Applications and mashups: Reseval and others.]

Fig. 2. High-level architecture of Scientific Resource Space (SRS)

Fig. 3. On-the-fly search application

4.2 On-demand data acquisition and integration

For some applications it is possible to answer queries by forwarding requests to the services provided by the sources and integrating the results on the fly. This functionality is implemented by the on-demand data acquisition engine of SRS. It allows a small portion of up-to-date metadata to be fetched from the sources and used to answer the query, without making it undergo the heavy, off-line warehousing process. The adapter layer is still involved to translate the query into the language of the source and map the results back into the SRS model. The integration can, however, be done on demand and in real time.

One example of an application using this implementation of SRS is a scientific metasearch². In this application the search queries are forwarded to the sources (in the specific case, Mendeley, Microsoft Academic Search (MSAS) and CiteULike), and the search results are obtained and transformed into the model of SRS. Results from different sources are then matched against each other to identify results representing the same resources. The matched results are merged into a single resource combining the metadata of all of them. For instance, the search results for the term "dataspaces" (Figure 3) contain 8 entities corresponding to the paper "From databases to dataspaces...", six of them coming from Mendeley and the other two from MSAS and CiteULike. In the search results of our system they are all merged into one resource, for which the citation data comes from MSAS, while readership statistics and tags are aggregated over the corresponding entities in Mendeley and CiteULike.

² http://metasearch.mateine.org
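The matching and merging of search results can be sketched as follows. The text does not say how results are matched, so matching on a normalized title is an assumption made for illustration, as are the function names, the choice of the largest reported citation count, and all of the sample records.

```python
import re
from collections import defaultdict
from typing import Dict, List


def match_key(title: str) -> str:
    """Normalize a title so that trivially different spellings compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_results(results: List[dict]) -> List[dict]:
    """Group results by normalized title and combine their metadata."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for r in results:
        groups[match_key(r["title"])].append(r)

    merged = []
    for group in groups.values():
        merged.append({
            "title": group[0]["title"],
            "sources": sorted({r["source"] for r in group}),
            # Assumption: take the largest citation count any source reports.
            "citations": max((r.get("citations", 0) for r in group), default=0),
            # Readership and tags are aggregated over all matching entities.
            "readers": sum(r.get("readers", 0) for r in group),
            "tags": sorted({t for r in group for t in r.get("tags", [])}),
        })
    return merged


# Invented search results from three sources.
results = [
    {"source": "Mendeley", "title": "A Paper on Dataspaces", "readers": 40, "tags": ["dataspaces"]},
    {"source": "Mendeley", "title": "A paper on dataspaces.", "readers": 5, "tags": ["integration"]},
    {"source": "MSAS", "title": "A Paper on Dataspaces", "citations": 120},
    {"source": "CiteULike", "title": "A paper on  dataspaces", "readers": 7, "tags": ["dataspaces"]},
    {"source": "MSAS", "title": "An Unrelated Paper", "citations": 3},
]
merged = merge_results(results)
```

Title matching alone is fragile for short or generic titles; a more careful matcher would probably also compare authors and publication year before merging.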

The aggregated resources can optionally be augmented with metadata from other sources, re-ranked and filtered according to the user preferences. The user can explore the results by reordering and filtering them, and by following the links to resources within the various source systems.
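Re-ranking and filtering by user preference could, for example, combine citation and readership counts with user-chosen weights. The scoring formula below (a weighted sum of log-damped counts), the default weights and the sample data are all assumptions made for illustration; the text does not specify a ranking function.

```python
import math
from typing import List


def score(resource: dict, w_citations: float = 0.5, w_readers: float = 0.5) -> float:
    """Weighted sum of log-damped citation and readership counts."""
    return (w_citations * math.log1p(resource.get("citations", 0))
            + w_readers * math.log1p(resource.get("readers", 0)))


def rerank(resources: List[dict], min_readers: int = 0, **weights: float) -> List[dict]:
    """Drop resources below the readership threshold, then sort by score."""
    kept = [r for r in resources if r.get("readers", 0) >= min_readers]
    return sorted(kept, key=lambda r: score(r, **weights), reverse=True)


# Invented merged resources.
resources = [
    {"title": "A", "citations": 100, "readers": 2},
    {"title": "B", "citations": 10, "readers": 80},
    {"title": "C", "citations": 0, "readers": 0},
]
by_citations = rerank(resources, w_citations=1.0, w_readers=0.0)
by_readers = rerank(resources, min_readers=1, w_citations=0.0, w_readers=1.0)
```

With these inputs, ranking by citations alone puts A first, while ranking by readership alone puts B first and filters out C, which shows how the same merged result set can be ordered differently for different users.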

5 Preliminary results and conclusion

At present, we are using the first implementation of SRS and experimenting with a prototype search application² following the on-demand metadata acquisition and integration approach. In our experiments we have used Microsoft Academic Search (MSAS), CiteULike and Mendeley as sources. All these sources have provided primary metadata about publications, such as authors, venue and publication year. In addition, citation statistics (the number of citations) have been obtained from Microsoft Academic Search, while CiteULike and Mendeley have also provided some usage statistics (mainly the number of users who bookmarked the publication). This application has allowed us to compare the search results returned by these services and start exploring the differences between them. This has supported our intuition that joining the search results from different sources can improve the coverage and the diversity of search results.

We have also learnt some lessons regarding the benefits and the limitations of on-demand data acquisition. On the one hand, this approach enables the use of more sources. Specifically, we can leverage the fact that sources are normally more likely to provide a search API than direct access to their data. On the other hand, this approach does not allow us to influence the search algorithms of the sources, but only to reorder the retrieved results. In contrast, a full metadata warehousing solution requires all the data to be gathered and processed in advance, but it provides complete control over the implementation and fine-tuning of the search algorithms. Another limitation of the on-demand approach is the response time.

We have built an initial implementation of the Metadata Warehouse and used it to build a number of research applications. One example of such applications is a survey on how researchers find references for their papers³. Given a user name, the application suggests a number of recent publications of the user. The user can choose a publication and specify, for each reference of this publication, the way in which it was found (for example, by searching in a digital library, or as a suggestion of a colleague). The results of this survey can later be used as another source of metadata for SRS, and thus made available to other applications. Another application built on top of SRS investigates the potential of various social networks as sources of reference recommendation⁴.

In this paper we have focused on the management and use of social and bibliographic metadata available on the Web for search and evaluation of scientific resources. We have discussed the challenges and opportunities of the integration of these metadata, and proposed an integration solution called Scientific Resource Space (SRS). We have then described the model and the architecture of SRS and discussed some preliminary results. Future work includes: (1) a rigorous investigation of the difference in the ranking of the search results obtained from different metadata and (2) the exploration of novel social metrics based both on social metadata and on the combination of bibliographic and social metadata.

³ http://survey.mateine.org/
⁴ http://discover.mateine.org

References

[1] A. Asserson, K. Jeffery, and A. Lopatenko. CERIF: past, present and future: an overview. In CRIS, 2002.

[2] J. Breslin and S. Decker. SIOC: an approach to connect web-based communities. International Journal of Web Based Communities IJWBC, 2(2), 2006.

[3] D. Brickley and L. Miller. FOAF vocabulary specification, 2005.

[4] Y. Ding, I. Toma, S. Kang, M. Fried, and Z. Yan. Data mediation and interoperation in social web: Modeling, crawling and integrating social tagging data. In SWSM, 2008.

[5] B. D'Arcus and F. Giasson. Bibliographic ontology specification. Retrieved October 8, 2010.

[6] J. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 2005.

[7] D. Hull, S. R. Pettifer, and D. B. Kell. Defrosting the digital library: bibliographic tools for the next generation web. PLoS Computational Biology, 4(10), 2008.

[8] M. Imran, M. Marchese, A. Ragone, A. Birukou, F. Casati, J. Jos, and J. Laconich. Reseval: An open and resource-oriented research impact evaluation tool. Research Evaluation, 2010.

[9] C. Lagoze and H. Van de Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework, 2001.

[10] C. Lagoze, H. Van de Sompel, M. Nelson, and S. Warner. Open Archives Initiative Protocol for Metadata Harvesting, v. 2.0, 2002.

[11] M. Ley and P. Reuther. Maintaining an Online Bibliographical Database: the Problem of Data Quality, 2006.

[12] B. Matthews and S. Sufi. The CCLRC Scientific Metadata Model, Version 2, 2002.

[13] C. Parra, M. Baez, F. Daniel, F. Casati, M. Marchese, and L. Cernuzzi. A scientific resource space management system, 2010.

[14] A. Passant, U. Bojars, J. Breslin, T. Hastrup, M. Stankovic, P. Laublet, et al. An Overview of SMOB 2: Open, Semantic and Distributed Microblogging. In ICWSM, 2010.

[15] A. Passant and P. Laublet. Meaning of a tag: A collaborative approach to bridge the gap between tagging and linked data. In LDOW2008, 2008.

[16] E. Prodromou and H. Halpin. W3C Federated Social Web Incubator Group, 2010.

[17] M. A. Rodriguez, J. Bollen, and H. V. D. Sompel. A practical ontology for the large-scale modeling of scholarly artifacts and their usage. In ICDL, 2007.

[18] Y. Sure, S. Bloehdorn, P. Haase, J. Hartmann, and D. Oberle. The SWRC Ontology – Semantic Web for Research Communities. EPIA, 2005.

[19] S. Weibel, J. Kunze, C. Lagoze, and M. Wolf. Dublin core metadata for resource discovery. Internet Engineering Task Force RFC, 2413, 1998.
