Alejandro Llaves
Ontology Engineering Group, Universidad Politécnica de Madrid
Madrid, [email protected]
Oct 21 2015
Virtual Clusters for (RDF) Stream Processing
Outline
Some context: morph-streams++
Motivation
Use case: Sensor Cloud data integration
Topologies everywhere
Setting up a virtual cluster
Deploying Storm topologies
Conclusion
Some context...
Motivation
Integrating an unbounded stream of heterogeneous sensor observations
Solution:
– Storm topologies for real-time processing
– Semantic Sensor Network (SSN) ontology for modelling observations
– SWEET ontology for environmental phenomena
Use case: Sensor Cloud data integration (1/3)
Sensor Cloud: viticulture, water management, weather monitoring, oyster farming...
RESTful API – JSON
Network → Platform → Sensor → Phenomenon → Observation
Lack of semantic descriptions, e.g. rain_trace vs Rain.
Multiple HTTP requests to query various streams.
Source: CSIRO
Use case: Sensor Cloud data integration (2/3)
Sensor Cloud messages to field-named tuples
SWEET annotations for heterogeneous phenomena descriptions
<sample time="20150528T16:30" value="15" sensor="bom_gov_au.94961.air.air_temp"/>
["20150528T16:32", "20150528T16:30", "15", "bom_gov_au", "94961", "air", "air_temp", "43.3167", "147.0075"]
Tuple fields, in order: system time, sampling time, value, network, platform, sensor, phenomenon, latitude, longitude
SensorCloudParserBolt
SweetAnnotationsBolt
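The step from a Sensor Cloud sample to a field-named tuple, followed by a phenomenon annotation, can be sketched in Python as below. This is a minimal illustration, not the actual SensorCloudParserBolt or SweetAnnotationsBolt code. The `Observation` field names, the `parse_sample` and `annotate` helpers, and the example.org concept URIs are assumptions made for the example. System time and coordinates are passed in by hand because the slides do not say where they come from.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Field names are assumed; they follow the labels shown on the slide.
Observation = namedtuple(
    "Observation",
    ["system_time", "sampling_time", "value", "network", "platform",
     "sensor", "phenomenon", "latitude", "longitude"],
)

# Hypothetical lookup from source-specific phenomenon names to shared concepts.
# The URIs are placeholders, not real SWEET identifiers.
PHENOMENON_CONCEPTS = {
    "air_temp": "http://example.org/concepts#Temperature",
    "rain_trace": "http://example.org/concepts#Rain",
}


def parse_sample(xml_text, system_time, latitude, longitude):
    """Turn one <sample/> element into a field-named tuple."""
    attrs = ET.fromstring(xml_text).attrib
    # The sensor id encodes network.platform.sensor.phenomenon.
    network, platform, sensor, phenomenon = attrs["sensor"].split(".")
    return Observation(system_time, attrs["time"], attrs["value"], network,
                       platform, sensor, phenomenon, latitude, longitude)


def annotate(obs):
    """Return the concept URI for the tuple's phenomenon, or None if unknown."""
    return PHENOMENON_CONCEPTS.get(obs.phenomenon)


sample = ('<sample time="20150528T16:30" value="15" '
          'sensor="bom_gov_au.94961.air.air_temp"/>')
obs = parse_sample(sample, "20150528T16:32", "43.3167", "147.0075")
print(list(obs))
print(annotate(obs))
```

Keeping the annotation as a separate lookup mirrors the two-bolt split on the slide: parsing depends only on the message format, while the concept table can grow as new phenomena names appear.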
Use case: Sensor Cloud data integration (3/3)
SSN mapping
SSNConverterBolt
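As a rough idea of what an SSN mapping produces, the sketch below serializes one observation as N-Triples. It is not the SSNConverterBolt implementation. The `to_ntriples` helper and the example.org namespace are hypothetical, and the value and sampling time are attached as plain literals through placeholder properties for brevity, whereas a full SSN mapping would model them as separate resources. Only `ssn:Observation`, `ssn:observedBy` and `ssn:observedProperty` are taken from the SSN ontology.

```python
# Namespace of the SSN ontology (SSNX version) and the RDF type predicate.
SSN = "http://purl.oclc.org/NET/ssnx/ssn#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical namespace for instance URIs and for the two literal-valued
# placeholder properties used below.
EX = "http://example.org/sensorcloud/"


def to_ntriples(obs):
    """Serialize one observation dict as a list of N-Triples lines."""
    sensor_path = "/".join([obs["network"], obs["platform"], obs["sensor"]])
    sensor_uri = EX + "sensor/" + sensor_path
    property_uri = EX + "property/" + obs["phenomenon"]
    obs_uri = "/".join([EX + "observation", sensor_path,
                        obs["phenomenon"], obs["sampling_time"]])
    return [
        f"<{obs_uri}> <{RDF_TYPE}> <{SSN}Observation> .",
        f"<{obs_uri}> <{SSN}observedBy> <{sensor_uri}> .",
        f"<{obs_uri}> <{SSN}observedProperty> <{property_uri}> .",
        f'<{obs_uri}> <{EX}value> "{obs["value"]}" .',
        f'<{obs_uri}> <{EX}samplingTime> "{obs["sampling_time"]}" .',
    ]


observation = {
    "sampling_time": "20150528T16:30", "value": "15",
    "network": "bom_gov_au", "platform": "94961",
    "sensor": "air", "phenomenon": "air_temp",
}
triples = to_ntriples(observation)
for line in triples:
    print(line)
```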
Topologies everywhere
A Storm topology “is a graph of stream transformations where each node is a spout or bolt”. https://storm.apache.org/documentation/Tutorial.html
Example of simple topology
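The idea of a topology as a graph of stream transformations can be mimicked in plain Python with generators. The sketch below does not use the Storm API and runs in a single process. The names `sentence_spout`, `split_bolt` and `count_bolt` are made up for the illustration: the spout emits tuples, and each bolt consumes one stream and emits another.

```python
def sentence_spout():
    """Source node: emits a fixed list of sentences (a real spout is unbounded)."""
    for sentence in ["rain air", "air temp"]:
        yield sentence


def split_bolt(stream):
    """Transformation node: splits each sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Transformation node: emits a running count for every word seen."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Wire the graph: spout -> split -> count.
results = list(count_bolt(split_bolt(sentence_spout())))
print(results)
```

In Storm the same wiring is declared on a topology builder and each node can run as several parallel tasks across the cluster; the generator chain only shows the data flow.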
Setting up a virtual cluster (1/2)
Wirbelsturm - https://github.com/miguno/wirbelsturm/
Allows deploying (local or remote) virtual clusters.
Focus on Big Data technologies: Storm, Kafka, Zookeeper...
Uses Vagrant for “easy to configure, reproducible, and portable work environments” - https://docs.vagrantup.com/v2/why-vagrant/index.html
Uses Puppet for provisioning: installation and configuration of software packages on the cluster nodes.
Setting up a virtual cluster (2/2)
$ ./deploy
Show wirbelsturm.yaml
Check Storm GUI - http://localhost:28080/index.html
Deploying Storm topologies
Describe simple topology
Compile & deploy
Describe a topology set
Configure Kafka
Compile & deploy
Conclusion
Wirbelsturm allows easy configuration & deployment of virtual clusters, with a focus on Big Data technologies.
SSN and SWEET ontologies to model and integrate environmental sensor observations.
Parallelization of bottleneck tasks reduces the average message processing latency, but only up to a point. More about Storm parallelization: http://bit.ly/1NVyjU2
Delaying RDF conversion does not speed up the processing of Sensor Cloud messages in the tested environment.
Paper submitted to IJSWIS, special issue on Velocity and Variety Dimensions of Big Data (Llaves, Corcho et al.)
What's coming next
Flying faster with Heron - https://blog.twitter.com/2015/flying-faster-with-twitter-heron
The presented research has been funded by Ministerio de Economía y Competitividad (Spain) under the project "4V: Volumen, Velocidad, Variedad y Validez en la Gestión Innovadora de Datos" (TIN2013-46238-C4-2-R), by the EU Marie Curie IRSES project SemData (612551), and supported by an AWS in Education Research Grant award.
Alejandro [email protected]
Thanks!