
  • Linked Stream Data Processing

    Danh Le-Phuoc, Josiane Xavier Parreira, and Manfred Hauswirth

    Digital Enterprise Research Institute, National University of Ireland, Galway {danh.lephuoc,josiane.parreira,manfred.hauswirth}@deri.org

    Abstract. Linked Stream Data has emerged as an effort to represent dynamic, time-dependent data streams following the principles of Linked Data. Given the increasing number of available stream data sources like sensors and social network services, Linked Stream Data allows an easy and seamless integration, not only among heterogeneous stream data, but also between streams and Linked Data collections, enabling a new range of real-time applications.

    This tutorial gives an overview of Linked Stream Data processing. It describes the basic requirements for the processing, highlighting the challenges that are faced, such as managing the temporal aspects and memory overflow. It presents the different architectures for Linked Stream Data processing engines, along with their advantages and disadvantages. The tutorial also reviews the state-of-the-art Linked Stream Data processing systems and provides a comparison among them regarding design choices and overall performance. A short discussion of the current challenges and open problems is given at the end.

    Keywords: Linked Stream Data, Data Stream Management Systems, Linked Data, Sensors, query processing.

    1 Introduction

    We are witnessing a paradigm shift in which real-time, time-dependent data is becoming ubiquitous. Sensor devices have never been so popular. For example, mobile phones (accelerometer, compass, GPS, camera, etc.), weather observation stations (temperature, humidity, etc.), patient monitoring systems (heart rate, blood pressure, etc.), location tracking systems (GPS, RFID, etc.), building management systems (energy consumption, environmental conditions, etc.), and cars (engine monitoring, driver monitoring, etc.) are continuously producing an enormous amount of information in the form of data streams. On the Web as well, services like Twitter, Facebook and blogs deliver streams of (typically unstructured) real-time data on various topics.

    Integrating these new information sources—not only among themselves, but also with other existing sources—would enable a vast range of new, real-time applications in the areas of smart cities, green IT, e-health, to name a few. However, due to the heterogeneous nature of such diverse streams, harvesting the data is still a difficult and labor-intensive task, which currently requires a lot of “hand-crafting.”

    T. Eiter and T. Krennwallner (Eds.): Reasoning Web 2012, LNCS 7487, pp. 245–289, 2012. © Springer-Verlag Berlin Heidelberg 2012

  • 246 D. Le-Phuoc, J.X. Parreira, and M. Hauswirth

    Recently, there have been efforts to lift stream data to a semantic level, e.g., by the W3C Semantic Sensor Network Incubator Group¹ and [25,103,126]. The goal is to make stream data available according to the Linked Data principles [22], a concept known as Linked Stream Data [99]. Just as Linked Data facilitates the data integration process among heterogeneous collections, Linked Stream Data has the same goal with respect to data streams. Moreover, it also bridges the gap between stream and more static data sources.

    Besides the existing work on Linked Data, Linked Stream Data also benefits from the research in the area of Data Stream Management Systems (DSMS). However, particular aspects of Linked Stream Data prevent existing work in these two areas from being directly applied. One distinguishing aspect of streams that the Linked Data principles do not consider is their temporal nature. Usually, Linked Data is considered to change infrequently. Data is first crawled and stored in a centralised repository before further processing. Updates on a dataset are usually limited to a small fraction of the dataset and occur infrequently, or the whole dataset is replaced by a new version entirely. Query processing in Linked Data databases, as in traditional relational databases, is pull-based and one-time, i.e., the data is read from the disk, the query is executed against it once, and the output is a set of results for that point in time. In contrast, in Linked Stream Data, new data items are produced continuously, the data is often valid only during a time window, and it is continually pushed to the query processor. Queries are continuous, i.e., they are registered once and then evaluated continuously over time against the changing dataset. The results of a continuous query are updated as new data appears. Therefore, current Linked Data query processing engines are not suitable for handling Linked Stream Data. It is interesting to note that in recent years, there has been work that points out the dynamics of Linked Data collections [115,107]. Although these collections change at a much slower pace than streams, it has been observed that centralised approaches will not be suitable if freshness of the results is important, i.e., if the query results must be consistent with the actual “live” data under certain guarantees; an element of “live” query execution will therefore be needed [107]. Though this differs from stream data, some of the properties and techniques for Linked Stream Data processing may also be applicable to this area.
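
    The contrast between one-time, pull-based querying and continuous, push-based querying can be illustrated with a minimal Python sketch. It is not taken from any of the systems discussed in this tutorial: the names `StreamItem`, `one_time_query` and `ContinuousQuery`, the threshold filter, and the time-based sliding window are all assumptions chosen for illustration, and items are assumed to arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Iterable, List


@dataclass(frozen=True)
class StreamItem:
    """Hypothetical stream element: a value with the time it was produced."""
    timestamp: float  # seconds (assumed unit)
    value: float


def one_time_query(stored: Iterable[StreamItem], threshold: float) -> List[StreamItem]:
    """Pull-based, one-time: scan the stored data once and return one answer."""
    return [item for item in stored if item.value > threshold]


class ContinuousQuery:
    """Registered once, then re-evaluated each time a new item is pushed."""

    def __init__(self, window_seconds: float, threshold: float,
                 on_result: Callable[[List[StreamItem]], None]) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.on_result = on_result
        self._window: Deque[StreamItem] = deque()

    def push(self, item: StreamItem) -> None:
        # Assumption: items arrive in non-decreasing timestamp order.
        self._window.append(item)
        # Expire items that are no longer valid, i.e. outside (t - w, t].
        cutoff = item.timestamp - self.window_seconds
        while self._window and self._window[0].timestamp <= cutoff:
            self._window.popleft()
        # Re-evaluate the query over the current window and publish the result.
        self.on_result([i for i in self._window if i.value > self.threshold])
```

    In this sketch the caller of `one_time_query` decides when to ask, whereas `ContinuousQuery` is driven by the arrival of data: registering `on_result=results.append` and pushing items one by one yields an updated result after every item, and old items drop out of the answer once they leave the window.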

    Data Stream Management Systems, on the other hand, are designed to handle and scale with fast-changing, temporal data, such as Linked Stream Data. However, Linked Stream Data is usually represented as an extension of RDF, the most popular standard for Linked Data representation. This contrasts with the relational storage model used in DSMS. It has been shown that in order to efficiently process RDF data using the relational model, the data needs to be heavily replicated [125,91]. Replicating fast-changing RDF streams is prohibitive; therefore, DSMS cannot be directly used for storage and processing of Linked Stream Data.

    In this tutorial we will give an overview of Linked Stream Data processing. We will highlight the basic requirements, the different solutions, and the advantages and disadvantages of each approach. We will also review existing Linked Stream Data processing systems and provide a comparison among them regarding design choices and overall performance.

    1 http://www.w3.org/2005/Incubator/ssn/

    We will start by introducing a running example that will be used throughout the tutorial (Section 1.1). The example highlights the potential benefits of integrating stream data with other sources, and it also shows the challenges faced when designing a Linked Stream Data processing system. As we mentioned earlier, Linked Stream Data is closely related to Linked Data and stream processing. We assume that the attendees of this tutorial are familiar with the research in Linked Data, so in Section 2 we will focus on providing a background on the fundamentals of stream processing that also apply to Linked Stream Data. We will show the basic models and techniques, how continuous semantics are represented, the diverse operators and processing optimisation techniques, and how issues like time management and memory overflow are handled. In the Linked Stream Data processing part (Section 3), we first show the formalisation used to represent the data and the continuous queries, and the query operators needed. We then move on to the architecture and system design of Linked Stream Data processing engines, showing the different possible approaches and design choices. We highlight the state-of-the-art systems in Linked Stream Data processing and show how each of them implements the different architectures. We also provide a performance comparison in terms of query processing times and scalability. The end of this tutorial is dedicated to a short discussion of the current challenges and open problems (Section 4).

    1.1 Running Example

    Inspired by the experiments of Live Social Semantics [4,109], we use the following running example through the rest of the tutorial. The scenario focuses on the data integration problem between data streams given by a tracking system and static data sets. Similar to several real deployments in Live Social Semantics, the tracking system is used for gathering the relationship between real-world identifiers and physical spaces of conference attendees. These tracking data can then be correlated with non-stream datasets, like online information about the attendees (social network, online profiles, publication record, etc.). The benefits of correlating these two sources of information are manifold. For instance, conference rooms could be automatically assigned to the talks, based on the talk’s topic and the number of people that might be interested in attending it (based on their profiles). Conference attendees could be notified about fellow co-authors in the same location. A service that suggests which talks to attend, based on profile, citation record, and distance between talk locations, could be designed. These are just a few examples.
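
    As a rough illustration of correlating the location stream with static data, the co-author notification mentioned above could be sketched as follows. This is only a sketch under stated assumptions: the class `CoAuthorNotifier`, the method `on_location`, and the in-memory dictionary of co-authors are hypothetical, and a real Linked Stream Data engine would express this as a continuous query over RDF data rather than as hand-written Python.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class CoAuthorNotifier:
    """Joins a stream of (attendee, room) updates with static co-author data."""

    def __init__(self, coauthors: Dict[str, Set[str]]) -> None:
        # Static dataset (hypothetical): attendee -> set of their co-authors.
        self.coauthors = coauthors
        # State derived from the stream: last known room of each attendee.
        self.location_of: Dict[str, str] = {}
        self.people_in: Dict[str, Set[str]] = defaultdict(set)

    def on_location(self, attendee: str, room: str) -> List[Tuple[str, str, str]]:
        """Process one stream item; return (attendee, co-author, room) notifications."""
        previous = self.location_of.get(attendee)
        if previous == room:
            return []  # no change of location, nothing new to report
        if previous is not None:
            self.people_in[previous].discard(attendee)
        self.location_of[attendee] = room
        # Join step: people currently in the room who are also co-authors.
        nearby = self.people_in[room] & self.coauthors.get(attendee, set())
        self.people_in[room].add(attendee)
        return [(attendee, other, room) for other in sorted(nearby)]
```

    The static co-author data is loaded once, while the location state changes with every stream item, which is the asymmetry between the two kinds of sources that the running example is meant to expose.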

    For the tracking service in our example, attendees of a conference wear RFID tags that constantly stream their location in a building, i.e., which room/section/area they are currently in. Each reading streamed from the RFID tags to RFID readers has two properties: the tagid and the signal strength.
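
    A reading of this kind could be modelled as in the following sketch. Only `tagid` and `signal_strength` come from the description above; the `reader_id` and `timestamp` fields, the reader-to-room mapping, and the strongest-signal heuristic in `estimate_rooms` are assumptions added for illustration and are not claimed to be how the tracking system actually works.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Reading:
    tagid: str              # from the example: identifies the RFID tag
    signal_strength: float  # from the example: strength of the received signal
    reader_id: str          # assumption: which reader received the signal
    timestamp: float        # assumption: when the reading was received


def estimate_rooms(readings: Iterable[Reading],
                   reader_room: Dict[str, str]) -> Dict[str, str]:
    """Assign each tag to the room of the reader that heard it most strongly.

    `reader_room` is a hypothetical piece of static building metadata
    mapping reader identifiers to rooms.
    """
    best: Dict[str, Reading] = {}
    for reading in readings:
        current = best.get(reading.tagid)
        if current is None or reading.signal_strength > current.signal_strength:
            best[reading.tagid] = reading
    # Readers without a known room are skipped rather than guessed.
    return {tag: reader_room[r.reader_id]
            for tag, r in best.items() if r.reader_id in reader_room}
```

    Applied to the readings of one time window, `estimate_rooms` turns raw signal readings into the "attendee is in room X" facts that the rest of the running example relies on.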


    The static datasets include data about the conference attendees and metadata about the conference building. Each attendee has a profile containing his personal information and the tagid given to him. The profile also has data links to the attendee’s publication records in