myriam phd

Contextualise Sensors with Linked Data

To Improve Relevancy, Data Quality and Network Adaptability

Myriam Leggieri, PhD Thesis

Sensor out of Context


Contextualise Sensors with Linked Data


dB, Km, µPa? dBs in water have a different relative value than in air

Q1. How to model it?

-> Is it worth it?


dB, dB, Km, µPa?

Yes, because

Q1. Contextualised Model for

Q3. Relevancy Prediction

Q4. Enriched Web Content

Q5. Network Adaptability

Q2. Cross-Network Communication

Research Questions


Q1. How to model Linked Sensor Data

for

Q2. Cross-Network Communication

Q3. Relevancy Prediction

Q4. Web Data Quality

Q5. Network Adaptability

Outline

1. Linked Sensor Data Model [Q1]

2. LD4Sensors Web Service [Q2]

3. Sensor Relevancy Prediction [Q3]

4. Enriched Web Content [Q4]

5. Network Adaptability [Q5]

6. Research Answers

7. Lessons Learned and Future Work

Core Research and Results

Conclusion

Q1. How can contextual information be used to enrich sensor data?

Linked Sensor Data Model


Ontology Modularisation

Context

Network

Components

Energy Conservation



Application Ontology

Domain Ontology

Task Ontology

Upper Ontology

Ontology Aligning Inheritance & Reuse

DOLCE+DnS Ultralite (DUL)

W3C Semantic Sensor Network (SSN)

Our Ontology

Quantities, Units, Dimensions and Data Types (QUDT)





Provenance (PROV)

Event Model-F (EVENT)

Unified Code for Units of Measure (UCUM)

Friend Of A Friend (FOAF)

Measurement Unit (MUO)

Online Presence (OPO)

Review Vocabulary (REV)


Ontology Aligning Inheritance & Reuse



spt:Agent

spt:Activity

ssn:Device

ssn:Sensor

EventParticipation

ssn:Stimulus

spt:Place



OWL Full

Symmetric

Transitive

Inverse

Equivalent

room A floor 2 house H

spt:containedIn

Asserted, Inferred, Direct Relations
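The chain room A, floor 2, house H shows how inferred relations follow from asserted ones once a property is transitive. A minimal sketch in Python, assuming spt:containedIn is declared transitive; the `transitive_closure` helper and the plain-string identifiers are hypothetical and stand in for what an OWL reasoner would do:

```python
def transitive_closure(asserted):
    """Return every (subject, object) pair implied by a transitive relation."""
    closure = set(asserted)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # a -> b and b -> d together imply a -> d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Asserted (direct) containedIn statements, as on the slide.
asserted = {("room A", "floor 2"), ("floor 2", "house H")}

# Inferred statements are those in the closure that were never asserted.
inferred = transitive_closure(asserted) - asserted
print(inferred)
```

Only the pair (room A, house H) is new here, which is the "inferred" relation as opposed to the two "asserted" ones.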

Comment & Rate

from

to

title

date-time

link motivation

same thing

same domain

same date-time

same location


Social Feedback and Sharing


Q1. How can contextual information be used to enrich sensor data?

Q2. How can sensors communicate across different platforms without ad hoc solutions?



Provenance (PROV)

Event Model-F (EVENT)

Unified Code for Units of Measure (UCUM)

Friend Of A Friend (FOAF)

Measurement Unit (MUO)

Online Presence (OPO)

Review Vocabulary (REV)


Which ontologies?

Which links? How to enable inference?

How to enable cross-communication?

Non-experts Average users


Automate Inference

Automate Reasoning

Automate Link Search & Creation

Easy Browsing

REST API

GUI

LD4Sensors (LD4S) Web Service

Easy Storing


LD4Sensors (LD4S) Web Service


LD4S: Evaluation

Goals

1. Actual Gain in Building Automation

2. Uptake, Usability, Utility

Performance

3. Implementation Quality

Users feedback

Deployment

LD4S: Evaluation

1. Actual Gain in Building Automation

Deployment

80% accuracy in matching the real consumption

LD4S: Usability Evaluation


Users feedback

Tot. Participants: 38

1% had previously interacted with sensors

Survey

GUI usable and clear

API to be improved

Applicability to be made explicit

LD4S: Utility Evaluation

Time of usage

# Data accessed

# Data transmitted

Amount

Type

Uniqueness

Location

Quality

Time Sensitivity

Relevance

Web Service Resources Linked Data output

Purpose to be made more explicit

Links relevancy to purpose to be improved

Highlight importance of network/context metadata

LD4S: Uptake Evaluation

Unique accesses

Tot # accesses

Per day (over the 30 days period)


Users feedback

Satisfying for pilot evaluation

Project-driven modelling: positive feedback from partners

To be repeated over a longer time period

LD4S: Performance Evaluation

Threshold = # requests / response time (sec)

compared to

Payload size sent + received by LD4S

Performance

3. Implementation Quality

Throughput decreases as payload size increases

But not exponentially

Improvable by implementing a cache


Q2. How can sensors communicate across different platforms without ad hoc solutions?

Q3. How to identify which sensors are more relevant sources of information to define a specific small scope of interest?

Relevancy Prediction in Activity Logging

Lexical Realisation

Conceptual Objects

Concept: Cooking

Objects: Fridge, Microwave, Oven, Sink, ...

Locations: Kitchen, ...

Fridge


Sensor<switch,fridge>


Distributional Semantics

Hierarchical Clustering

Feature of interest (FoI)


Algorithm

1. From DataHub: sensors sharing location & time; activated sensors

2. EasyESA Similarity (X, Y)

3. Add to Distance Matrix

4. Clustering

Sensors in the same cluster are relevant for the same activity. Activity = Cluster
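Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the conversion distance = 1 - similarity is an assumption, the `lookup_similarity` function is a hypothetical stand-in for the relatedness service, and the scores are placeholders rather than measured values.

```python
from itertools import combinations


def distance_matrix(items, similarity):
    """Build a symmetric matrix using distance = 1 - similarity (assumed conversion)."""
    index = {item: i for i, item in enumerate(items)}
    matrix = [[0.0] * len(items) for _ in items]
    for x, y in combinations(items, 2):
        d = 1.0 - similarity(x, y)
        matrix[index[x]][index[y]] = d
        matrix[index[y]][index[x]] = d
    return matrix


# Placeholder relatedness scores for three features of interest (illustrative only).
SCORES = {
    frozenset(("fridge", "freezer")): 0.8,
    frozenset(("fridge", "tv")): 0.1,
    frozenset(("freezer", "tv")): 0.2,
}


def lookup_similarity(x, y):
    """Hypothetical stand-in for a call to a semantic relatedness service."""
    return SCORES[frozenset((x, y))]


fois = ["fridge", "freezer", "tv"]
matrix = distance_matrix(fois, lookup_similarity)
for row in matrix:
    print(row)
```

The resulting matrix is what a hierarchical clustering step would take as input.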

Predicting Sensor Relevancy for ADLs Logging 127

of the rows corresponds to a word that occurs in $\bigcup_{i=1 \ldots n} d_i$. An entry $T[i, j]$ in the table corresponds to the TF-IDF value of term $t_i$ in document $d_j$.

$$T[i, j] = tf(t_i, d_j) \cdot \log \frac{n}{df_i} \qquad (5.1)$$

where $tf(t_i, d_j)$ is the term frequency of the term $t_i$ in the document $d_j$, defined as

$$tf(t_i, d_j) = \begin{cases} 1 + \log(count(t_i, d_j)) & \text{if } count(t_i, d_j) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (5.2)$$

while $df_i = |\{d_k : t_i \in d_k\}|$ is the document frequency, i.e., the total number of documents that contain $t_i$.
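Equations 5.1 and 5.2 can be checked on a toy corpus. The sketch below assumes natural logarithms (the text does not state the base) and represents each document as a list of tokens; the function names and the corpus are illustrative only.

```python
import math


def tf(term, doc):
    """Log-scaled term frequency, as in Equation 5.2."""
    count = doc.count(term)
    return 1 + math.log(count) if count > 0 else 0.0


def tfidf_table(docs):
    """Return table[term][j] = tf(term, d_j) * log(n / df_term), as in Equation 5.1."""
    n = len(docs)
    vocabulary = sorted({token for doc in docs for token in doc})
    table = {}
    for term in vocabulary:
        df = sum(1 for doc in docs if term in doc)  # document frequency
        idf = math.log(n / df)
        table[term] = [tf(term, doc) * idf for doc in docs]
    return table


# Toy corpus (illustrative data, not from the thesis).
docs = [
    ["fridge", "kitchen", "fridge"],
    ["microwave", "kitchen"],
    ["tv", "sofa"],
]
table = tfidf_table(docs)
print(table["fridge"])
print(table["kitchen"])
```

A term that appears in every document would get weight 0 in all of them, since log(n / n) = 0.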

EasyESA. The size of the textual corpus on which semantic models rely is critical to the quality of the results. This leads to high hardware and software requirements on the implementation side (e.g., the English version of Wikipedia 2013 contains 43 GB of article data). For simplicity, we use EasyESA [Carvalho et al., 2014], a JSON web service which implements ESA based on Wikiprep-ESA (footnote 10). It can be queried for the semantic relatedness measure, concept vectors or the context windows. In particular, we query the instance available online (footnote 11), which runs ESA on the English version of Wikipedia 2013. The query asks for the semantic relatedness of pairs of sensors represented as tuples of terms like <switch, fridge>.

5.6.2. Unsupervised Hierarchical Clustering

Unsupervised methods. We chose unsupervised methods because we believe that, given the number of different activities and sensors involved, supervised methods are not likely to scale with the expansion of the Internet of Things phenomenon. In particular, we chose hierarchical clustering because it is the approach that has so far achieved the best precision [Kwon et al., 2014].

10 https://github.com/faraday/wikiprep-esa

11 http://vmdeb20.deri.ie:8890/esaservice

Relevancy Prediction: Distributional Semantics

Term Frequency - Inverse Document Frequency (TF-IDF) model

term frequency of term t in document d

tot documents / tot documents containing the term t

Relevancy Prediction: Hierarchical Clustering

Unweighted Pair Group Method with Arithmetic mean (UPGMA)

Weighted Pair Group Method with Arithmetic mean (WPGMA)

Farthest Point or Voor Hees (VH)

Reflection of Semantic Distribution

Reflection of Structural Subdivision

Reflection of Centrality
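The three linkage criteria differ only in how the distance from a merged cluster to every other cluster is updated. The sketch below is a plain-Python illustration under stated assumptions: "vh" is treated as farthest-point (complete) linkage, the `agglomerate` function and its parameters are hypothetical names, and the distance matrix is made up.

```python
def agglomerate(dist, n_clusters, method="upgma"):
    """Merge the two closest clusters until n_clusters remain.

    dist is a symmetric matrix (list of lists); method is "upgma", "wpgma" or "vh".
    """
    clusters = {i: [i] for i in range(len(dist))}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(dist)
    while len(clusters) > n_clusters:
        a, b = min(d, key=d.get)  # closest pair of clusters
        size_a, size_b = len(clusters[a]), len(clusters[b])
        new_d = {}
        for k in clusters:
            if k in (a, b):
                continue
            dka = d[(min(k, a), max(k, a))]
            dkb = d[(min(k, b), max(k, b))]
            if method == "upgma":    # mean weighted by cluster sizes
                value = (size_a * dka + size_b * dkb) / (size_a + size_b)
            elif method == "wpgma":  # plain mean of the two merged branches
                value = (dka + dkb) / 2
            else:                    # "vh": farthest point, i.e. the maximum
                value = max(dka, dkb)
            new_d[(k, next_id)] = value
        # Drop distances that involve the two merged clusters, add the new ones.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())


# Made-up distances between four sensors: 0 and 1 are close, 2 and 3 are close.
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
]
for method in ("upgma", "wpgma", "vh"):
    print(method, agglomerate(dist, 2, method))
```

On such a clearly separated toy matrix all three criteria agree; they can diverge when cluster sizes are unbalanced or distances are close.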



with the sensors manually annotated as part of such activity logging. These annotations and readings are taken from the public (footnote 14) dataset MITes [Tapia et al., 2004] and were collected during live experiment settings. We pre-processed this dataset (i.e., CSV files of sensor readings and metadata about both sensors and activities) to form HTTP PUT requests to the LD4S API for annotating and storing the data, as in Listing 5.7. Based on this comparison, the overall accuracy and precision of our system are calculated when applying each of the clustering algorithms UPGMA, WPGMA or VH.

    PUT ld4s:device/2_99

    payload: {'observed_property': 'switch',
              'location-name': ['Kitchen'],
              'foi': ['Fridge']}

    headers: {'Content-type': 'application/json',
              'Accept': 'application/x-turtle'}

Listing 5.1: HTTP PUT request forwarded to the LD4S RESTful API.
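For illustration, a request with the same payload and headers could be assembled with Python's standard library. This is a sketch only: the base URL is a placeholder (the listing gives just the prefixed name ld4s:device/2_99), the `build_put_request` helper is hypothetical, and the request is built but deliberately not sent.

```python
import json
import urllib.request

# Placeholder base URL; the real LD4S endpoint is not given in the listing.
LD4S_BASE = "http://example.org/ld4s/"


def build_put_request(resource, payload):
    """Build (without sending) a PUT request that carries a JSON payload."""
    return urllib.request.Request(
        url=LD4S_BASE + resource,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Content-type": "application/json",
            "Accept": "application/x-turtle",
        },
    )


request = build_put_request("device/2_99", {
    "observed_property": "switch",
    "location-name": ["Kitchen"],
    "foi": ["Fridge"],
})
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted because the URL is a placeholder.
```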

DataHub (see Section 5.5) was then queried for all the sensor datasets available (footnote 15), thus returning a JSON list of details of these datasets such as their ID, title, tags, license and endpoint URIs. The system keeps only those datasets that either have no license or grant open access, and that expose a SPARQL endpoint, and forwards the query in Listing 6.1 to each of them. Since the LD4S triple store is published on DataHub, its endpoint is also included in this JSON list. Consequently, our query is forwarded to the LD4S endpoint as well, so that we retrieve all the data that we had annotated and stored in the pre-processing step while also ensuring that any other potential dataset is considered.

The results obtained from each endpoint are XML files, as per the W3C standard recommendation, which the system merged and parsed to distinguish between sensors that sensed a change in status and those that just happened to share the same location. In this experiment we evaluated the worst case: only one sensor has recently sensed a change in status. The semantic relatedness must then be calculated for the highest possible number of pairs that share the same location at the same time. This is used to fill a distance matrix on which the hierarchical clustering algorithms were applied.

In addition to precision and overall accuracy, we also evaluated the performance in

14 http://courses.media.mit.edu/2004fall/mas622j/04.projects/home/thesis_data_txt.zip

15 http://ckan.net/api/3/action/package_search?q=sensor


Table 5.1.: Activities labelled in the MITes dataset (number of examples per class).

Activity               Subject 1   Subject 2
Preparing dinner       8           14
Preparing lunch        17          20
Listening to music     -           18
Taking medication      -           14
Toileting              85          40
Preparing breakfast    14          18
Washing dishes         7           21
Preparing a snack      14          16
Watching TV            -           15
Bathing                18          -
Going out to work      12          -
Dressing               24          -
Grooming               37          -
Preparing a beverage   15          -
Doing laundry          19          -
Cleaning               8           -

Internet of Things expansion. The growth of time cost is analysed more thoroughly in Section 5.7.4.

The lowest semantic similarity value calculated was -1.0 for the pair <switch, tv> and <switch, hamper>, followed by 0.00036 for the pair <switch, jewelry box> and <switch, microwave>. The highest similarity value was 0.75839 for the pair <switch, cabinet> and <switch, medicine>, followed by 0.11285 for the pair <switch, refrigerator> and <switch, freezer>.

5.7.3. Algorithms Comparison

The hypotheses we wanted to verify by applying the chosen algorithms were: 1. UPGMA: is the distance of the semantic distribution of similarities relevant in predicting the sensor-activity association? 2. WPGMA: does considering the structural subdivision of the sensor objects positively influence such prediction? 3. VH: can we rely on the

Relevancy Prediction: Evaluation Data

27 FoIs -> 351 Similarity Pairs


terms of execution time for the different HTTP requests, the SPARQL queries, the whole pre-processing step and the overall system.

5.7.1. MITes Dataset

Tapia et al. [Tapia et al., 2004] published the MITes dataset from an experiment in which human activity was recorded for two weeks. They deployed 200 switch sensors on 27 different features of interest (FoIs) in two single-person apartments. The sensors were installed in everyday objects such as drawers, refrigerators, containers, etc. to record opening-closing events (activation/deactivation events) as 2 subjects carried out everyday activities. The subjects used a software application to manually annotate each activity while they were performing it. This resulted in the annotated activities associated with readings as in Table 5.1. In our experiment we used the data from both subjects combined, since evaluating the system differently according to the subject at hand was out of the scope of this paper.

5.7.2. Similarity Results

We considered the worst case, in which only one of the sensors sharing the same location in the same time range has recently sensed a change in status for the current ongoing activity, while all the other nearby ones, which will likely do so in the near future, have to be predicted. In this case, given n sensors, the number of pairs to check for semantic relatedness is the binomial coefficient in Equation 5.10. In our case, since there are 27 different features of interest, there are 27 different types of sensors and 351 distinct pairs.

$$\binom{n}{2} = \frac{n!}{2!\,(n-2)!} \qquad (5.10)$$
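The pair count can be verified directly. A small check in Python, where `pairs_for` is an illustrative helper name; 27 is the number of FoIs in the dataset and 216 is the largest FoI count in the performance evaluation:

```python
import math
from itertools import combinations


def pairs_for(n):
    """Number of unordered pairs among n sensor types: n! / (2! (n - 2)!)."""
    return math.comb(n, 2)


n_fois = 27
n_pairs = pairs_for(n_fois)

# Cross-check by enumerating the unordered pairs explicitly.
enumerated = len(list(combinations(range(n_fois), 2)))

print(n_pairs, enumerated)   # 351 351
print(pairs_for(216))        # 23220
```

The count grows quadratically with the number of sensor types, not with the number of deployed sensors.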

Even though the binomial coefficient grows quickly, it only depends on the number of features of interest rather than on the number of actually deployed sensors. At the same time, the number of ICOs is expected to grow but the number of "types" of sensors is not, since there is only so much in the real world that can be monitored by sensors. Our method is therefore not expected to hinder the system from scaling during the

Worst case scenario: only one of the sensors sharing the same location at the same time range has recently sensed a change in status for the current ongoing activity


Figure 5.4.: Clustering performed by the Voor Hees algorithm.

When comparing our results with the annotated dataset, since we do not perform

cluster labelling, it was not possible to directly map our clusters to the labels in Table 5.1.

However, we considered the match verified whenever the sensors belonging to the same

cluster according to our system (i.e., predicted class) were the ones that sensed the same

activity in the MITes annotations (i.e., actual class). Consequently, we considered a

2-class classification problem, i.e., whether the sensors actually part of the same activity

had been clustered in the same cluster. As a result, a separate confusion matrix (Table 5.2)

was created for each of the annotated activities.

Table 5.2.: Confusion matrix displaying the number of true positives, true negatives, false positives and false negatives for a 2-class classification problem.

                     Actual class 1   Actual class 2
Predicted class 1    TP11             FP12
Predicted class 2    FN21             TN22

With these settings, we calculated precision and overall accuracy.

$$\text{Precision} = \frac{TP_{11}}{TP_{11} + FP_{12}} \qquad (5.11)$$

$$\text{Accuracy} = \frac{TP_{11} + TN_{22}}{TP_{11} + TN_{22} + FP_{12} + FN_{21}} \qquad (5.12)$$
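Both measures follow directly from the four cells of the per-activity confusion matrix. A minimal sketch; the counts below are illustrative and not taken from the evaluation:

```python
def precision(tp, fp):
    """Share of predicted positives that are actual positives."""
    return tp / (tp + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all decisions (positive and negative) that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Illustrative confusion-matrix counts for one activity (made-up numbers).
tp, fp, fn, tn = 8, 2, 3, 7

p = precision(tp, fp)
a = accuracy(tp, tn, fp, fn)
print(f"precision={p:.2f} accuracy={a:.2f}")
```

Because one confusion matrix is built per annotated activity, these two functions would be applied once per activity and per clustering algorithm.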














Relevancy Prediction: Evaluation: Precision

[Figure: Precision of the Activity Clustering. Chart of Performance % (0 to 100) for WPGMA, UPGMA and VH over the activities Dressing, Cleaning, Toileting, Laundry, Dinner, Washing Up, Snack and Lunch.]















Relevancy Prediction: Evaluation: Accuracy

[Figure: Accuracy of the Activity Clustering. Chart of Accuracy % for WPGMA, UPGMA and VH.]


Relevancy Prediction: Evaluation: Comparison with SoTA

Figure 5.6.: Comparison between accuracy percentages achieved by the clustering algorithms for some of the activities.

Table 5.3.: Comparison between the experiment setup and results for our own approach and the previous closest research efforts.

                  Kwon et al.   Wyatt et al.   Ours
# Sensors         3             100            200
# Activities      5             26             16
Collection Time   50 mins       360 mins       2 weeks
Goal              AR            AI             RSP
Algorithms        HIER          HMM            UH
Precision         79%           70%            89%
Accuracy          -             52%            69%

Our results are relevant: our system improved the accuracy by 32% and the precision by 5% with respect to these previous efforts from the state of the art.

5.7.4. Performance

The evaluated system ran on a laptop equipped with an Intel Core 2 Duo and 305 GB of disk space. We used the LD4S and EasyESA service instances running on external servers in order to support and test a modular and distributed architecture. These were

Increase of 32% accuracy and 5% precision

Relevancy Prediction: Evaluation: Performance

[Figure: Time Complexity Growth (Time Growth per Amount of FoIs). x-axis: Features of Interest (FoIs); y-axis: Time (msec); #FoIs: 27, 54, 81, 112, 135, 162, 189, 216.]

HTTP PUT requests: 3ms

Overall Execution: 18ms

Dataset Discovery on DataHub: 3ms (20 datasets)

LD4S SPARQL response: 246ms

ESA: 14ms (351 similarity pairs)

Easy-ESA response: 9ms

Highest time cost = 1 min 26 sec for comparing 216 FoIs

Possibility of updating sensor similarities at run-time

CoRE devices (RAM 4 kB and ROM 128 kB): pre-compute offline clustering


Q3. How to identify which sensors are more relevant sources of

information to define a specific small scope of interest?

Q4. How can contextualised sensors improve the quality of traditional

Web content?

Enriched Web Content: Between Web and Real Place

Hospital Dublin

?


Short-lived Data

Long-lived Data

Cost

Enriched Web Content: Between Web and Real Place

Algorithm

1. From DataHub: sensors sharing location & time

2. Extract Google Search results representing Real Places

3. Live Data Fetching

4. Result Dictionary Update

Bridging the gap between Web and Real Places

Enriched Web Content: G-Sensing

Enriched Web Content: G-Sensing Frontend

Enriched Web Content: Evaluation Deployment

DataHub

Clinic

Clinic

Clinic

30 sensors

1 km

LD4S

PUT <JSON sensor metadata>

G-Sensing

Google Places

3,692 locations; 1,455 (39.4%) have a website

Query: Acupuncture Galway Salthill

Enriched Web Content: Evaluation Coverage

How much of the area defined by the virtual locations overlaps with the city of Galway within radius r = 150 m

Coverage percentage as we vary the vicinity radius

Added value of our approach for integrating live data into physical locations' websites:

• We divided the area of Galway into squares with different side lengths l

• We counted the number of virtual locations within each square.

Enriched Web Content: Evaluation Distribution

• The number of virtual locations per square and their respective frequency show a power-law relationship

• While most squares only contain a small set of locations, a few squares contain a very large number of locations (e.g., city centres, business parks).
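The square-counting procedure can be sketched as a simple grid binning. The code below is an illustration under assumptions: locations are given as planar (x, y) coordinates in metres, the function names are hypothetical, and the sample points are made up rather than real Galway data.

```python
from collections import Counter


def count_per_square(points, side):
    """Count locations per grid square of the given side length (same unit as the coordinates)."""
    return Counter((int(x // side), int(y // side)) for x, y in points)


def frequency_of_counts(per_square):
    """Map 'locations in a square' to 'number of squares holding that many'."""
    return Counter(per_square.values())


# Made-up planar coordinates in metres.
points = [(10, 10), (20, 30), (40, 45), (120, 80), (260, 270)]

per_square = count_per_square(points, side=100)
freq = frequency_of_counts(per_square)
print(dict(per_square))
print(dict(freq))
```

Squares that contain no location do not appear in the counter. Plotting `freq` on log-log axes for several side lengths is one way to inspect whether the distribution looks like a power law.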


Enriched Web Content: Evaluation Performance

Bandwidth Overhead

• Google search result page: ~145 KB

• After enabling G-Sensing: ~175 KB (~20% increase)

Response Time

• At browser start-up: query to DataHub for data source discovery: 3 ms

  ○ 20 sensor datasets discovered

  ○ 3 sensor datasets have an open license + expose a SPARQL endpoint

  ○ 1 sensor dataset's SPARQL endpoint was accessible (LD4S): 246 ms

G-Sensing does not impede a user's browsing experience


Q4. How can contextualised sensors improve the quality of traditional

Web content?

Q5. How can contextualised sensors improve the adaptability of

mobile constrained and heterogeneous sensor networks?

Network Adaptability: Demo


LD4S

Fuzzy Logic

6LoWPAN + CoAP

Automated LD4S Annotation of new Sensors entering a network


Q5. How can contextualised sensors improve the adaptability of

mobile constrained and heterogeneous sensor networks?

Research Answers

Future Work

• Filtering
  • of links according to the resource rating/review LD4S system
  • of sensor data injected into Google Search results according to user's prefs & context

• Extending
  • derive labels of activities beyond the per-activity sensor clustering
  • sensor data injected into any Web page and content
  • sensor data sources extended to include, e.g., TripAdvisor and other user-generated content
  • collect user's feedback on auto-derived annotations for incremental learning

• Evaluation
  • Long-term large-scale user study to gather insights into how users really use the current functionalities offered by LD4S

• Other areas of research
  • Sensor-triggered data can feed back to Linguistic Linked Data knowledge
