ios press reality mining on micropost streams · bottari is an augmented reality application for...

18
Undefined 1 (2009) 1–5 1 IOS Press Reality Mining on Micropost Streams Deductive and Inductive Reasoning for Personalized and Location-based Recommendations Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University, Country Open review(s): Name Surname, University, Country Marco Balduini b , Irene Celino a,* , Daniele Dell’Aglio a , Emanuele Della Valle b,a,** , Yi Huang c , Tony Lee d , Seon-Ho Kim d and Volker Tresp c a CEFRIEL – ICT Institute, Politecnico di Milano, via Fucini 2, 20133, Milano, Italy E-mail: {irene.celino,daniele.dellaglio}@cefriel.it b Dip. Elettronica e Informazione – Politecnico di Milano, via Ponzio 34/5, 20133, Milano, Italy E-mail: [email protected],[email protected] c Siemens AG, Corporate Technology, Otto-Hahn-Ring 6, 81739 München, Germany E-mail: {yihuang,volker.tresp}@siemens.com d Saltlux, 7F, Deokil Building, 967, Daechi-dong, Gangnam-gu, Seoul 135-848 Korea E-mail: {shkim,tony}@saltlux.com Abstract. The rapid growth of published personal opinions in form of microposts such as found in Twitter is the basis of novel emerging social and commercial services. In this paper, we describe BOTTARI, an augmented reality application that permits the per- sonalized and localized recommendation of points of interests (POIs) based on the temporally weighted opinions of the com- munity. The technological basis of BOTTARI is the highly scalable LarKC platform for the rapid prototyping and development of Semantic Web applications. In particular, BOTTARI exploits LarKC’s deductive and inductive stream reasoning. We present an evaluation of BOTTARI based on a three year collection of tweets about 319 restaurants located in the 2 km 2 district of Insadong, a popular tourist area of the South Korean city of Seoul. BOTTARI is the winner of the 9th edition of the Semantic Web Challenge, co-located with the 2011 International Semantic Web Conference. BOTTARI is currently field tested in Korea by Saltlux. Keywords: reality mining, stream reasoning, personalized recommendation, social media analysis, mobile app 1. Introduction Imagine you are a tourist in Seoul. You read in your guide book that the Insadong district in Seoul is the perfect place to dine out. However, when you reach Insadong-gil (the main street of the district), hundreds of restaurant advertisements around you look almost the same (see Figure 1). You know you can open the guide book and choose one of the few restaurants listed * Corresponding author. E-mail: [email protected]. ** Corresponding author. E-mail: [email protected]. there. Alternatively, you can also use one of your mo- bile apps that recommend restaurants based on users’ reviews. But then you would probably read reviews only in a language that you understand, which might not be the local language. Instead, you decide that you would like to dine where the locals do, i.e., you would like to find out how the locals rated Insadong’s restau- rants, with an emphasis on recent ratings. Such a story would have sounded like science fic- tion only five years ago, but it can become reality by exploiting information from microposts like Twitter in combination with the most recent developments in Se- 0000-0000/09/$00.00 c 2009 – IOS Press and the authors. All rights reserved

Upload: others

Post on 13-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

Undefined 1 (2009) 1–5 1IOS Press

Reality Mining on Micropost StreamsDeductive and Inductive Reasoning for Personalized and Location-based Recommendations

Editor(s): Name Surname, University, CountrySolicited review(s): Name Surname, University, CountryOpen review(s): Name Surname, University, Country

Marco Balduini b, Irene Celino a,∗, Daniele Dell’Aglio a, Emanuele Della Valle b,a,∗∗, Yi Huang c,Tony Lee d, Seon-Ho Kim d and Volker Tresp c

a CEFRIEL – ICT Institute, Politecnico di Milano, via Fucini 2, 20133, Milano, ItalyE-mail: {irene.celino,daniele.dellaglio}@cefriel.itb Dip. Elettronica e Informazione – Politecnico di Milano, via Ponzio 34/5, 20133, Milano, ItalyE-mail: [email protected],[email protected] Siemens AG, Corporate Technology, Otto-Hahn-Ring 6, 81739 München, GermanyE-mail: {yihuang,volker.tresp}@siemens.comd Saltlux, 7F, Deokil Building, 967, Daechi-dong, Gangnam-gu, Seoul 135-848 KoreaE-mail: {shkim,tony}@saltlux.com

Abstract.The rapid growth of published personal opinions in form of microposts such as found in Twitter is the basis of novel emerging

social and commercial services. In this paper, we describe BOTTARI, an augmented reality application that permits the per-sonalized and localized recommendation of points of interests (POIs) based on the temporally weighted opinions of the com-munity. The technological basis of BOTTARI is the highly scalable LarKC platform for the rapid prototyping and developmentof Semantic Web applications. In particular, BOTTARI exploits LarKC’s deductive and inductive stream reasoning. We presentan evaluation of BOTTARI based on a three year collection of tweets about 319 restaurants located in the 2 km2 district ofInsadong, a popular tourist area of the South Korean city of Seoul. BOTTARI is the winner of the 9th edition of the SemanticWeb Challenge, co-located with the 2011 International Semantic Web Conference. BOTTARI is currently field tested in Koreaby Saltlux.

Keywords: reality mining, stream reasoning, personalized recommendation, social media analysis, mobile app

1. Introduction

Imagine you are a tourist in Seoul. You read in yourguide book that the Insadong district in Seoul is theperfect place to dine out. However, when you reachInsadong-gil (the main street of the district), hundredsof restaurant advertisements around you look almostthe same (see Figure 1). You know you can open theguide book and choose one of the few restaurants listed

*Corresponding author. E-mail: [email protected].**Corresponding author. E-mail: [email protected].

there. Alternatively, you can also use one of your mo-bile apps that recommend restaurants based on users’reviews. But then you would probably read reviewsonly in a language that you understand, which mightnot be the local language. Instead, you decide that youwould like to dine where the locals do, i.e., you wouldlike to find out how the locals rated Insadong’s restau-rants, with an emphasis on recent ratings.

Such a story would have sounded like science fic-tion only five years ago, but it can become reality byexploiting information from microposts like Twitter incombination with the most recent developments in Se-

0000-0000/09/$00.00 c© 2009 – IOS Press and the authors. All rights reserved

Page 2: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

2 M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis

Fig. 1. A picture of Insadong district: the density of restaurants isvery high (cf. also Section 4).

mantic Web technologies like deductive and inductivestream reasoning. This is exactly what happens withBOTTARI, the winner of the Semantic Web Challenge2011.

BOTTARI is an augmented reality application forpersonalized and localized restaurant recommendationin Seoul. At a first look, it may appear like other mobileapps that recommend restaurants, but BOTTARI is dif-ferent: BOTTARI is a reality mining [19] applicationthat listens to social media (specifically Twitter) andperceives the reputation of all points of interest (POIs)under consideration, i.e., Insadong’s restaurants.

In this paper, we describe some of the research andtechnical details that made BOTTARI a reality. BOT-TARI is an Android application where the user canlook at POIs and receive general information, com-ments from social media and personalized recommen-dations. The two latter features are based on informa-tion extracted from Twitter microposts using naturallanguage processing technologies. The extracted infor-mation is then mapped to ontological background in-formation describing the POIs. To extract opinions andto personalize recommendations, we employed noveldeductive and inductive stream reasoning technologiesexplored within the EU FP7 LarKC project [5]. Theoverall approach is experimentally tested using a threeyear collection of tweets about 319 restaurants locatedin the 2 km2 district of Insadong.

The paper is organized as follows. Section 2 intro-duces relevant background. Section 3 illustrates the de-

tails of BOTTARI from the user’s point of view. Sec-tion 4 describes the data and the knowledge used inthe application. Section 5 presents the pluggable sys-tem architecture and the components of BOTTARI’sback-end. The inductive and deductive stream reason-ing techniques that realize BOTTARI recommenda-tions are presented in Section 6. Sections 7 and 8 reportour experimental results and analyse the infrastructurescalability. Finally, Section 9 concludes the paper andsketches our future works.

2. Background Work

In this section, we provide some background aboutthe application setting and the technologies we use forthe development of BOTTARI. This is intended to pro-vide the reader with sufficient details to understand therest of the paper.

2.1. Location-based Information and Services

The popularity of guide books appears to be fad-ing out. Starting from 2000, when TripAdvisor1 wasfunded, a number of Web-based and collaborative ver-sions of tourist guides, like Yelp2 or Qype3, has at-tracted a growing attention from the tourist audience.

Moreover, the wide spread of mobile internet accessand the increasing availability of smart phones haveallowed users to search the Web while on the go. Asdemonstrated by large scale studies [28,29,44], mostusers perform local searches for POIs, e.g., restaurants.In particular, several studies recognized the impor-tance of the distance between the user’s location andboth the returned POIs [22,10,35] and POI’s rankings[1,35]. Interestingly, studies [38,18] have shown thatpersonal preferences play a more important role in mo-bile search applications rather than in Web searches.

The increasing availability of GPS enabled smartphones, then, paved the way for the success of theso-called Location-based Services (LBSs), i.e., mobileand Web applications that provide context-dependentinformation and functionalities. The best known exam-ple of LBS is the social application foursquare4 that al-

1TripAdvisor (http://www.tripadvisor.com/) is a pioneertravel website that assists customers in gathering travel informationby colleting reviews and opinions of travel-related content.

2Cf. http://www.yelp.com/.3Cf. http://www.qype.com/.4Cf. http://foursquare.com/.

Page 3: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis 3

lows users to “check-in” and communicate their loca-tion to people in their online communities. Other sim-ilar applications are Facebook Places5, Google Lati-tude6, and Gowalla7.

2.2. Stream Reasoning

Stream reasoning [15] is the research field aimedat reasoning upon rapidly changing information. Inthe stream reasoning settings, well known ArtificialIntelligence techniques like belief revision [23] aretoo cumbersome and slow. Instead, stream reasoningbuilds on techniques from Data Stream ManagementSystems (DSMS) [24] and Complex Event Process-ing [34].

Available solutions for stream reasoning (like [8,2,16]) share three main concepts:

– modelling the information flow as an RDF stream(i.e., a sequence of RDF triples annotated with anon-decreasing timestamp),

– reasoning “on the fly” on RDF streams, and– exploiting the temporal order of the data stream

to optimize the computation.

BOTTARI employs both deductive and inductivestream reasoning. The deductive stream reason-ing approach used in BOTTARI is based on an ex-tension of SPARQL, namely Continuous SPARQL(C-SPARQL) [8], that continuously processes RDFstreams observed through windows (as done in DSMS).The syntax and semantics of C-SPARQL were de-scribed in [4], while the C-SPARQL execution en-gine and its optimization techniques were illustratedin [6]. Our approach to publish RDF stream as LinkedData (namely Streaming Linked Data) is based on [3].Finally in [7], the deductive reasoning support to C-SPARQL was elaborated.

In the inductive stream reasoning approach usedin BOTTARI, the SUNS (Statistical Unit Node Set)approach [41,26] is employed as a scalable machinelearning framework for predicting unknown but po-tentially true statements by exploiting the regularitiesin structured data, especially in Semantic Web andLinked Data. In the SUNS framework, a regularizedmultivariate learning approach is employed. In [27]a kernel variant of SUNS approach for dealing withvery high-dimensional data was proposed. Recently, a

5Cf. http://www.facebook.com/about/location.6Cf. http://www.google.com/latitude.7Cf. http://gowalla.com/.

Fig. 2. Our sentiment analysis approach.

modularized SUNS approach was investigated in [42]which allows for easily integrating contextual informa-tion such as temporal information based on a Markovdecomposition.

2.3. Relation Prediction

The SUNS framework employes matrix factoriza-tion for relation prediction. Relation prediction has at-tracted increasing interest in machine learning commu-nity in recent years. Matrix factorization approachesare clearly among the leading approaches since theycan readily exploit structure in relational patterns, al-though a number of different approaches have beenproposed for this task [32,40,25,17,30,43]. In partic-ular, the winning entry to the NETFLIX competitionused matrix factorization approaches [39,9]. For rela-tions with an arity larger than two, tensor factoriza-tion recently have become popular, enabling the mod-eling of relations such as likes(User, Restaurant,May2010) [36,42]. A recent overview has been pro-vided in [31]. Relation prediction is very related to col-laborative filtering and multi-task learning. Low-rankmatrix factorization is also one of the most success-ful methods for collaborative filtering, e.g. [37]. SUNSis a low-rank matrix factorization based approach forrelation prediction. By design, it is suitable for deal-ing with data in Semantic Web and Linked Data whichis very large, highly sparse and incomplete. Due tothe regularization, SUNS is insensitive to its hyperparameters making it easy to use for non-experts inmachine learning. It outperforms many state-of-the-art matrix factorization approaches on different datasets [41,27,26,5].

Page 4: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

4 M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis

2.4. Sentiment Analysis

In BOTTARI, Saltlux’s proprietary sentiment anal-ysis solution is applied to microposts as illustrated inFigure 2 to detect if a message talks about a particularPOI and, in case, which opinion it expresses: positive,negative or neutral. The sentiment analysis follows atwofold approach. First, a Korean Morphological Ana-lyzer tags the messages. Then, the tagged messages aretransmitted to a rule-based module to label their “senti-ments”. The rules are trained a priori by using machinelearning algorithms. If some messages are impossibleto tag for the Morphological Analyzer, they are passedto a machine learning module using Support VectorMachines (SVM) [12] which classify the messages inpositive, negative and neutral classes. These two mod-ules are applied in a committee solution to improve theprecision of the reputations.

2.5. Large-scale Data Management with LarKC

In BOTTARI, the LarKC platform was used to pro-cess the large amounts of microposts required by theapplication. The LarKC platform [14] is a pluggableSemantic Web framework for massive distributed in-complete reasoning that was intended to remove thescalability barriers in existing reasoning systems andwas developed as part of the EU FP7 Integrated ProjectLarKC [21] – the Large Knowledge Collider.

The LarKC platform achieves scalability by: a) en-riching the current logic-based Semantic Web reason-ing methods with methods from information retrieval,machine learning, databases, and probabilistic reason-ing; b) building a distributed reasoning platform, de-ployable on a high-performance computing cluster;and c) employing cognitively inspired approaches andtechniques.

We adopted the LarKC platform to integrate thetechniques required to process the micropost stream.To this end, we developed a number of LarKC “plug-ins” that encapsulate those techniques (cf. Sections2.2, 2.3, and 2.4); the plugins are orchestrated in aLarKC workflow (cf. Section 5) that provides the rec-ommendation functionalities for the BOTTARI mobileapplication (cf. Sections 3 and 6).

3. The BOTTARI Mobile App

In Korean language, “bottari” is a cloth bundlethat carries a person’s belongings while traveling. The

BOTTARI mobile app is a location-based service thatexploits the social context to provide relevant con-tents to the user in a specific geographic location; assuch, BOTTARI lets the user “transport” the location-specific knowledge, derived from the local social me-dia, when moving in the geographic space.

BOTTARI is an Android application (for smartphones and tablets) in augmented reality (AR) that di-rects the users’ attention to points of interest (POIs) inthe neighborhood of the user’s position, with partic-ular reference to restaurants and dining places. How-ever, BOTTARI does not simply show the availableplaces; indeed it provides personalized recommenda-tions based on the local context and the “sentiment”derived by microposts analysis. As explained in Sec-tion 6, it offers different types of recommendationsbased on the combination of the different techniquesof inductive and deductive stream reasoning. For eachPOI selected by the user, BOTTARI provides detailedinformation including trust or reputation. Figure 3shows some screenshots of BOTTARI. A video dis-playing BOTTARI at work, on a mobile phone and ona tablet, is available on YouTube8.

Four types of recommendations are available (seeupper left corner of Figure 3(a)):

1. the for me button offers personalized recommen-dations as suggested by large scale mobile searchstudies like [38,18];

2. the popular button bases its recommendations onthe last 3 years of ratings extracted from the mi-croposts;

3. the emerging button, instead, focuses on the rat-ings extracted in the last month; and

4. the interesting button returns the POIs describedwith a category of interest of the user, e.g., fortourists.

Moreover, given the importance of the distance ofthe user from the recommended POIs [22,10,35], BOT-TARI offers a functionality for filtering recommendedPOIs based on their distance in the right-upper side ofits interface (see Figure 3(a)).

By pointing the device to frame the surroundingenvironment, the users see in AR the recommendedPOIs, as shown in Figure 3(b). In this view, the POIsare indicated with different icons (e.g., restaurants with

or snack bars with ) and their reputation is in-

8A video of BOTTARI is available on YouTube at http://www.youtube.com/watch?v=c1FmZUz5BOo.

Page 5: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis 5

Fig. 3. Some screenshots of the BOTTARI Android application.

dicated by thumb-up and thumb-down icons,which means that the sentiment analysis results arepositive or negative respectively. The user can selecta POI to learn more about it (b) and to display all itsdetails (c); BOTTARI can also exhibit the trend overtime in people’s sentiment about this POI (d).

Due to the success of a field test of the implemen-tation of the research prototype described in this pa-per, Saltlux decided to continue the BOTTARI devel-opment. The commercialization of the BOTTARI An-droid App by Saltlux is planned for the first quarter of2012.

4. Data and Ontology Used in BOTTARI

As mentioned in the previous sections, BOTTARIbuilds on the analysis of microposts published in Koreato provide recommendations to its users. In this sec-tion, we explain the data and knowledge used in BOT-TARI and the ontology BOTTARI clients use to issuethe queries illustrated in Section 6.

Figure 5 illustrates the Insadong area, a 2 km2 dis-trict with a high density of restaurants: the left sideshows the map of Insadong and the right side showsa heat-map displaying the number of restaurants in a5-by-7 grid. Restaurants are so dense that finding arestaurant entrance (without knowing it) requires bothhigh GPS accuracy and an image of the restaurant.

Fig. 5. Restaurant distribution in the 2 km2 Insadong district.

In developing BOTTARI, the information aboutthe 319 restaurant of Insadong was collected from

Page 6: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

6 M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis

(a) (b) (c)

Fig. 4. Dataset statistics: (a) positive ratings for POIs, (b) positive ratings by users and (c) tweet distribution over time.

Yelp9, PoiFriend10, Yahoo! Local11, TrueLocal12, Ko-rean restaurant Web sites and several Korean portals.The result is a high quality geo-referenced knowl-edge base where each restaurant is described by 44attributes (e.g., name, images, position, address, am-biance, specialities, categories, etc.). Compared to theinformation that an average tourist can obtain from tra-ditional channels, BOTTARI’s POI knowledge base isvery rich.

As reference example of guide book information,let us consider the popular RoughGuides13 that havean edition dedicated to South Korea. This lists only8 restaurants in this area of Seoul; one of them iseven outside the Insadong district. The other 7 are alsopresent in BOTTARI knowledge base. Not surprisinglythe coverage of guides book is very poor when com-pared to BOTTARI.

As remarkable example of Web 2.0 tourist site, let usconsider the well known TripAdvisor14. A user search-ing for restaurant in Insadong would find only 13 re-sults. Among those listed by TripAdvisor, 12 are alsoin BOTTARI knowledge base, but only 5 were actuallymentioned in the microposts analysed by BOTTARI.A bit more surprisingly also Web 2.0 tourist sites havea low coverage when compared to BOTTARI and thereviews are much less frequent then the number of mi-croposts collected by BOTTARI.

As described in details in Section 5, the reputationand people’s sentiment about those POIs were gath-ered from Twitter. Microposts were crawled and col-lected both from the Web and by using the Twitter API.The sentiment analysis technique described in Section

9Cf. http://www.yelp.com/.10Cf. http://www.poifriend.com/.11Cf. http://local.yahoo.com/.12Cf. http://www.truelocal.com.au/.13Cf. http://www.roughguides.com/.14Cf. http://www.tripadvisor.com/.

2.4 was employed to analyse the tweets, filter them andextract the POI’s reputation.

In total, 200 million tweets were gathered in 3 years,from February 4th, 2008 to November 23rd, 2010(1,023 days). After applying the sentiment analysistechnique – which discards the tweets that do not relateto any POI – BOTTARI was able to identify 110,000tweets produced by more than 31,000 user (see Table1 for statistics of the collected data set).

The temporal distribution of tweets is shown in Fig-ure 4(c). Clearly most tweets (85%) were collected inthe last six months. Some concrete numbers of tweetsin different time frames are shown in Table 2 in theevaluation section. The exponential growing trend oftweets over time is due to different reasons: a) thetweets from 2008 were collected one year later, in2009, and it was difficult to gather those messages be-cause of the “oblivion” in Twitter stores; b) the usageof Twitter became mainstream in Korea only in 2009;finally, c) the crawling algorithm was changed and im-proved in April 2010.

Table 1 does not only show the numbers of theentities, i.e., users and POIs, but also the numbersof the positive/negative/neutral ratings. For example,

Data type Nr.

Entities User 31,369

POI 245

POI User Sparsity

Ratings Positive 19,045 213 12,863 99.30%

Negative 14,404 181 10,448 99.24%

Neutral 75,941 245 28,056 98.90%

Total 109,390 245 31,369 98.58%

Table 1Statistics of the data set

Page 7: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis 7

Fig. 6. Ontology modelling of BOTTARI data.

there are 14,404 negative ratings about 181 POIs madeby 10,448 users. Based on these statistics we analysesome characteristics of the data set as follows:

– High sparsity: Only 1.42% of POIs are rated peruser. The sparsity of the positive ratings is evenhigher with 99.3%, where the sparsity is definedas sparsity = 1− #Ratings

#POIs×#Users .– Incompleteness: While some POIs have neither

positive nor negative ratings, a great number ofusers do not provide either positive ratings or neg-ative ratings. For example, only 41% of users pos-itively rated at least one POI.

– Multiple ratings: A user can rate a particular POIfor several times expressing different opinions.BOTTARI utilized all three kinds of ratings bysetting positive ratings to 1, negative ratings to−1 and neutral ratings to 0.1 and by ignoringtheir frequency. For instance, user u rated POIp positively twice, negatively once and neutrallyfive times. Then one obtains positive(u, p) =1, negative(u, p) = −1 and neutral(u, p) =0.1. Note that neutral ratings were interpreted asslightly positive.

Besides the temporal distribution shown in Fig-ure 4(c), we made statistics of ratings over users andPOIs in order to better understand the data distribu-tion. Especially interesting is the distribution of posi-tive ratings over POIs and over users, plotted in Fig-ure 4(a) and 4(b) respectively. In average, a user gave1.5 positive ratings and a POIs was positively rated by89 users. 31 POIs got more than 200 positive ratings.

Those POIs can be considered as the most liked or themost popular ones.

POI descriptions and microposts were converted, re-spectively, into RDF and RDF streams and then pro-cessed by the system described in Section 5. TheRDF-ized data are modelled with respect to the on-tology represented in Figure 6, which is based onthe SIOC vocabulary [11]. BOTTARI ontology takesinto account the relation between Twitter users andtheir tweets and enriches tweets description by intro-ducing their topic (i.e., the POI cited by the Twitteruser15) and the “sentiment” expressed in relation tothe topic (i.e. the user opinion about the POI). To thisend, our model adds the sentiment by means of thetwd:talksAbout property – and its sub-propertiesfor positive, negative and neutral opinions – that relatestweets with POIs.

An example of RDF stream in BOTTARI is given inListing 1. Each RDF triple uses the ontology describedabove and is annotated with the tweet’s timestamp. Thenotation used here is defined in [3].

(<:userID twd:posts :tweetID >, 2011-10-12T13:34:41)(<:tweetID twd:talksAboutPositively :poiID >,

2011-10-12T13:34:41)

Listing 1: Social media RDF stream

The geographical perspective of our ontology is in-troduced by modelling the POIs as spatial things (ac-

15Only very few tweets talk about more than one POIs.

Page 8: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

8 M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis

cording to the WGS84 vocabulary16) and enrichingthem with specific categorizations (e.g., the ambiencedescribing the atmosphere of a restaurant) and the dy-namic count of positive/negative/neutral ratings. TheRDF data are stored as an instance of the SOR triplestore enriched with geo-spatial features (cf. bottom-right part of Figure 7).

5. System Architecture

BOTTARI’s back-end is built on the inductive anddeductive stream reasoner [5] that was developed aspart of the LarKC project (see Section 2). BOTTARI’ssystem architecture is illustrated in Figure 7. It consistsof two parts: a data initiated (i.e., PUSH) segment anda query initiated (i.e., PULL) segment.

Fig. 7. Architecture of BOTTARI back-end. A data initiate (i.e.,PUSH) part of the system continuously analyse microposts, whileBOTTARI client can issue SPARQL queries to the query initiated(i.e., PULL) part of the system. The plug-able Semantic Web plat-form LarKC couples the PULL part of the system with the PUSHone.

The PUSH segment continuously analyses the mi-croposts’ stream crawled from the Web. The SEMAN-TIC MEDIA CRAWLER AND OPINION MINER crawls3.4 millions tweets related to Seoul per day, identi-fies the subset that talks about the Insadong restaurants(thousands per day) and try to mine the users’ opinionsusing the approach described in Section 2.4. The re-sult is an RDF stream of positive, negative and neutralopinions about the restaurants of Insadong modelled in

16Cf. http://www.w3.org/2003/01/geo/.

the ontology illustrated in Section 4 (see also Listing1). The RDF stream flows at an average rate of almosta hundred tweets per day, peaking at tens of tweetsper minute. The RDF stream is processed in real-timeby the STREAMING LINKED DATA SERVER (SLD

SERVER) by the means of a network of C-SPARQLqueries (see Figure 8 in Section 6). In particular, SLD

SERVER runs a C-SPARQL query with a window ofone month that prepares the data used for the emerg-ing recommendations and a second C-SPARQL querythat, aggregating the results of the previous query overthe months, prepares the data used for popular rec-ommendations. BOTTARI’s inductive stream reasoner(namely SUNS) periodically reads the results of SLD

SERVER and, through the RDF2MATRIX PLUG-IN, isresponsible for the for me recommendation.

The PULL segment is responsible to answer theSPARQL query that BOTTARI clients issue to theLarKC platform when the users press one of the fourrecommendation buttons in BOTTARI’s interface (seeupper right corner of the screenshot in Figure 3(a)).LarKC acts as an ontology-based information-integrationplatform [33] where an ontology (in our case the oneillustrated in Figure 6) has the role of an integratedconceptual model that logically spans over multipleheterogeneous data sources – in our case the data mod-els of the different plug-ins involved in generating therecommendations. In this setting, a mapping existsbetween the ontology and the data model internallyused by each plug-in. BOTTARI clients issue queriesover the integrated ontology, the QUERY REWRITER

analyses the queries and rewrites them into a set ofqueries expressed in the query languages of the plug-ins downstream, according to the specified mappings.The plug-ins are executed in parallel. Each plug-in re-ceives the re-written query and sends results to theQUERY EVALUATOR. This plug-in joins the resultsof the re-written queries and returns answers to theclients, as if the query had been evaluated on the in-tegrated ontology directly. Caching of entire queriesand intermediate results is applied in order to minimizequery latency.

The plug-ins that were deployed in the LarKC plat-form for computing BOTTARI recommendations are:

– the SOR PLUG-IN that can efficiently evaluategeo-spatial SPARQL queries (e.g., queries thatcontains geo-spatial functions such as within_

Page 9: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis 9

distance), on SOR – Saltlux’s geo-spatial en-abled RDF store17;

– the SUNS PLUG-IN that can estimate probabilitiesof unknown statements <user talksPosi-tivelyAbout POI> and provide users with alist of POIs ranked by estimated probabilities (thehigher the score is, the more likely the statementexists); and

– the SLD PLUG-IN that can query the instan-taneous results of the network of C-SPARQLqueries in SLD SERVER for retrieving the topPOIs (in terms of the number of positive tweets)of the day, of the week, of the month and of theyear.

To better clarify how the LarKC platform evaluatesBOTTARI client queries, let us consider the query inListing 2. Lines 3–5 ask for points of interests (POIs)within 200 meters from the user’s position. Lines 6 to9 add the additional constraint that the user may talkpositively about these POIs in future18. Line 10 fur-ther constraints the results to POIs that have been pos-itively discussed. Line 12 asks for POIs to be ordered.The first result will be the POI with the highest numberof positive tweets, the one with the highest estimatedprobability for the user to talk positively about it andthe one closest from the user. Finally, Line 13 limitsthe first 10 results to be returned to the user.

1. SELECT ?poi ?name ?lat ?long2. WHERE {3. ?poi a ns:NamedPlace ; ns:name ?name ;4. geo:lat ?lat ; geo:long ?long .5. FILTER(:within_distance(37.5,126.9,?lat,?long,200))6. { :user sioc:creator_of ?tweet7. ?tweet twd:talksAboutPositively ?poi .8. WITH PROBABILITY ?prob9. ENSURE PROBABILITY [0.5..1) }10. ?poi twd:numberOfPositiveTweets ?numPos .11. }12. ORDER BY DESC(?numPos*?prob*(distance(37.5,126.9,

?lat,?long,200)/200))13. LIMIT 10

Listing 2: Sample BOTTARI client query in SPARQL.

The QUERY REWRITER, knowing the three avail-able plug-ins and their respective capabilities, can

17Cf. http://bit.ly/saltlux-sor.18See [13] and Section 6.3 for the specification of the syn-

tax and the semantics of WITH PROBABILITY and ENSUREPROBABILITY.

break down the query in three parts and rewrite them inspecific queries for the different plug-ins19: Lines 3–5in a query for the SOR PLUG-IN, Lines 6–9 in a queryfor the SUNS PLUG-IN and Line 10 in a query for theSLD PLUG-IN. All three plug-ins return a ranked listof POIs. The QUERY EVALUATOR joins the three listsand produces the final list of POIs ranked as requestedby the BOTTARI client.

6. Recommendation Services in BOTTARI

In this section, we provide more details about thedifferent kinds of recommendations that BOTTARI of-fers. We start by presenting the Interesting recommen-dations, because they involve only the SOR PLUG-IN.We, then, illustrate the Popular and the Emerging rec-ommendations which both rely on the SLD and theSOR PLUG-INS. Finally, we report on the For me rec-ommendations that use the SUNS and the SOR PLUG-INS.

6.1. Semantic Information Retrieval for InterestingPOIs

The Interesting recommendation is based on theSPARQL query in Listing 3. It asks for a list of POIs(Lines 3 and 4) that belong to one of the category avail-able in BOTTARI, e.g., “attractions of interest to for-eign visitors” (Line 5), that are located in a given dis-tance from the user’s current location (Line 6), and thatare in front of the user20 (Line 7). The results are or-dered by distance from the user and are limited to the10 closest ones (Lines 8 and 9).

1. SELECT ?poi ?name ?lat ?long2. WHERE {3. ?poi a ns:NamedPlace ; ns:name ?name ;4. geo:lat ?lat ; geo:long ?long ;5. ns:category :InterestingForForeigners .6. FILTER(:within_distance(37.5,126.9,?lat,?long,200))7. FILTER(:dest_point_viewing(37.5,126.9,?lat,?long,

45,50,200))8. ORDER BY distance(37.5,126.9,?lat,?long)9. LIMIT 10

Listing 3: Query for Interesting recommendations.

19We provide sample rewritten queries in Section 6.3.20Filtering the POIs in order to load only those in front of the user

is important, because BOTTARI is an augmented reality applicationand BOTTARI’s users are supposed to look at the world around themthrough their device.

Page 10: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

10 M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis

Note that a) the query highly relies on the geo-spatial features of the SOR GEO-SPATIAL KB (see Fig-ure 7); b) the query matches also POIs that are de-scribed with a sub-category of InterestingFor-Foreigners, because the required category can bededuced; and c) this query involves only the SORPLUG-IN and requires no rewriting, because BOT-TARI ontology is an extension of the ontology used inSOR GEO-SPATIAL KB.

6.2. Stream Reasoning to Get Emerging and PopularPOIs

The opinions of users on POIs change over time,sometimes for seasonal effect (e.g., Insadong’s restau-rant clients appear to prefer meat restaurants in winterrather than in summer) and sometimes because POIscan be “in fashion” only for a limited period.

The Emerging and Popular recommendations arebased on this observation. The Emerging recommen-dations consider the list of most positively discussedPOIs in the last month, thus aim at recommendingPOIs considering the seasonal effect and the “on fash-ion” effect on a short term scale. The Popular rec-ommendations consider the list of most positively dis-cussed POIs over three years, thus averaging on theseasonal effect, but still considering the “in fashion”effect on a long term scale.

Fig. 8. The network of C-SPARQL that continuously analyses theRDF stream produced by the SOCIAL MEDIA CRAWLER AND

OPINION MINER and keeps up to date the data used for the Emerg-ing and Polular recommendations

Figure 8 illustrates how the SLD SERVER contin-uously elaborates the RDF stream produced by theSOCIAL MEDIA CRAWLER AND OPINION MINERto compute the aggregated information used for theEmerging and Popular recommendations. SLD SERVERcontinuously runs a network of C-SPARQL queries

that consume each other’s results and publish their re-sults as Linked Data. From left to right in the figure,the COUNT +1 FOR POI IN [1 DAY] query countsthe positive opinions about a POI for each day. Thepositive messages per day can be further aggregatedover the week (in the SUM +1 FOR POI IN [7 DAYS]query), over the month (in the SUM +1 FOR POI IN[4 WEEKS] query), and over the year (in the SUM +1FOR POI IN [1 YEAR] query).

Listing 4 shows the leftmost query in Figure 8 thatcounts the positive opinions for each POI per day;Listing 5 refers to the query that consumes the resultstream of the previous one, summing the positive opin-ions for each POI over the last week.

REGISTER QUERY Count +1 For POI ASCONSTRUCT { ?o twd:numberOfPositiveTweets ?n . }FROM STREAM <#bottari> [RANGE 1d STEP 1d]WHERE {{ SELECT DISTINCT ?o (count(?s) as ?n)WHERE { ?s twd:talksAboutPositively ?o }GROUP BY ?o

}}

Listing 4: Query to count positive opinions per day.

REGISTER QUERY Sum +1 For POIs ASCONSTRUCT { ?s twd:numberOfPositiveTweetsInWeek ?n . }FROM STREAM <#Count+1forPOI> [RANGE 7d STEP 7d]WHERE {{ SELECT ?s (sum(?o) as ?n)WHERE { ?s twd:numberOfPositiveTweets ?o . }GROUP BY ?s

}}

Listing 5: Query to sum the positive opinions for eachPOI over the last week.

The results of each query are published as linkeddata by the WINDOWERS. The WINDOWERS are usedto answer the BOTTARI query for the Emerging rec-ommendations; the last one in Figure 8 is used to com-pute Popular recommendations. The trend lines illus-trated in Figure 3(d) are drawn using the 12 most re-cent answers of the SUM +1 FOR POI IN [4 WEEKS]published in the WINDOWER with 12 graphs connectedto its stream of results.

When BOTTARI clients require Emerging recom-mendations, they issues to the LarKC platform theSPARQL query in Listing 6.

Page 11: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis 11

1. SELECT ?poi ?name ?lat ?long2. WHERE {3. ?poi a ns:NamedPlace ; ns:name ?name ;4. geo:lat ?lat ; geo:long ?long .5. FILTER(:within_distance(37.5,126.9,?lat,?long,200))6. FILTER(:dest_point_viewing(37.5,126.9,?lat,?long,

45,50,200))7. ?poi twd:numberOfPositiveTweetsInMonth ?n .8. ORDER BY DESC(?n*distance(37.5,126.9,?lat,?long))9. LIMIT 10

Listing 6: Query for Emerging recommendations.

Lines 3 to 6 are sent by the QUERY REWRITER tothe SOR PLUG-IN. Line 7 is rewritten in a query thatruns against the latest result of the SUM +1 FOR POIIN [4 WEEKS] C-SPARQL query published as linkeddata by the SLD SERVER. The QUERY EVALUATOR

receives the list of POIs returned by the two plug-ins and aggregates the ranking in accordance with thefunction at Line 8, i.e., trendy and close-by restaurantsfirst.

6.3. Inductive Stream Reasoning to Get For me POIs

Finally, BOTTARI can provide personalized recom-mendations using the inductive stream reasoning solu-tions introduced in Section 2.2.

Figure 9 sketches the architecture of the inductivereasoner. It is composed of the workflow describedin [41], and the SUNS and RDF2MATRIX PLUG-INS.The RDF2MATRIX PLUG-IN is necessary to transformdata and knowledge from RDF statements into the datamatrix as required as input by machine learning ap-proaches, including SUNS.

At training time, the RDF2MATRIX PLUG-IN is in-voked to convert RDF statements to data matrices.Then, the SUNS PLUG-IN trains models based on thematrices. If necessary, external machine learning li-braries can be invoked. Based on learnt models, theSUNS PLUG-IN estimates the truth value of unknownstatements of interest. To facilitate the usability of theplug-ins, a default configuration is available. The ma-chine learning experts can refine the setting of theplug-ins by modifying the defualt configuration.

In BOTTARI, the RDF2MATRIX PLUG-IN readsthe RDF streams as well as data aggregated by theSLD SERVER and transforms them to data matrices,while the SUNS PLUG-IN learns the users’ preferencesand predicts the probabilities that can be queried forthe For me recommendations.

Fig. 9. Architecture of inductive stream reasoning

When BOTTARI clients request For me recommen-dations, they issue the SPARQL query in Listing 7.Lines 7 to 10 makes use of SPARQL with probabil-ity [26]. The triple pattern at lines 7 and 8 matchesPOIs that a particular user might positively talk about.The WITH PROBABILITY clause at line 9 extendsSPARQL by letting it query probabilities. The variable?prob may assume values between 0 and 1, wherethe value 1 refers to those POIs the user already pos-itively rated. The ENSURE PROBABILITY clause atline 10 accepts only those solutions whose estimatedprobabilities is larger or equal to, in this example, 0.5.

1. SELECT ?poi ?name ?lat ?long2. WHERE {3. ?poi a ns:NamedPlace ; ns:name ?name ;4. geo:lat ?lat ; geo:long ?long .5. FILTER(:within_distance(37.5,126.9,?lat,?long,200))6. FILTER(:dest_point_viewing(37.5,126.9,?lat,?long,

45,50,200))7. { :user sioc:creator_of ?tweet8. ?tweet twd:talksAboutPositively ?poi .9. WITH PROBABILITY ?prob10. ENSURE PROBABILITY [0.5..1) }11. ORDER BY

DESC(?prob*distance(37.5,126.9,?lat,?long))12. LIMIT 10

Listing 7: Query for “For me” recommendations.

The query in Listing 7 is rewritten by the QUERYREWRITER in the query for the SUNS PLUG-IN illus-trated in Listing 8, the query for the SOR PLUG-IN (cf.Section 6.1) and the query for the QUERY EVALUA-TOR illustrated in Listing 10.

Page 12: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

12 M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis

1. CONSTRUCT {2. :user twd:mayTalkPositively [3. ns:about ?poi ;4. ns:withProbability ?prob ] }5. WHERE {6. :user sioc:creator_of ?tweet7. ?tweet twd:talksAboutPositively ?poi .8. WITH PROBABILITY ?prob9. ENSURE PROBABILITY [0.5..1) }10. ORDER BY DESC(?prob)

Listing 8: The query in Listing 7 as rewritten for theSUNS PLUG-IN.

Notably, in Listing 8 the WHERE clause (Lines 6–9)corresponds to Lines 7–10 in Listing 7. The peculiarityof the query in Listing 8 is the CONSTRUCT clause. Itallows the embedding of the probability in the resultof the query processed by the SUNS PLUG-IN with-out using annotations and reification. For instance, theresults of the query in Listing 8 can be:

:user ns:mayTalkPositively [ns:about :TheVolga ;ns:withProbability "0.8" ], [ns:about :TheVegetarian ;ns:withProbability "0.6" ] .

Listing 9: Sample result of the query in Listing 8.

Note that the query rewritten for the QUERY EVAL-UATOR in Listing 10 at Lines 6–8 assumes the pres-ence of the triples generated by the SUNS PLUG-IN

with Listing 8.

1. SELECT ?poi ?name ?lat ?long2. WHERE {3. ?poi a ns:NamedPlace ; ns:name ?name ;4. geo:lat ?lat ; geo:long ?long ;5. ns:distance ?distance .6. :user twd:mayTalkPositively [7. ns:about ?poi ;8. ns:withProbability ?prob ]9. ORDER BY DESC(?prob*?distance))10. LIMIT 10

Listing 10: The query in Listing 7 as rewritten for theQUERY EVALUATOR.

Wrapping up, the SUNS PLUG-IN provides a genericway for probabilistic materialization. The predictedRDF triples are maintained in a compact internal repre-sentation and are serialized in RDF triples on demand.

In BOTTARI, the SUNS PLUG-IN is applied to answeran ad-hoc query “Which POIs could be recommendedfor a particular user?”. When a BOTTARI client issuesthe SPARQL query for For me recommendations, theSUNS PLUG-IN is utilized to estimate the probabilitythat a user might like a POI, based on the sentiment thesame user expressed about other POIs and the opinionthat other users expressed about that POI. In this sense,we provide a personalized and collective recommenda-tion engine suggesting users the most interesting POIswith respect to their preferences.

7. Evaluation

We tested BOTTARI using the data set described inSection 4. The goal was to evaluate the quality and theefficacy of recommendations.

7.1. Methodology

In the evaluation, we applied three baselines: ran-dom guess (Random), k-nearest neighbour (KNNItem)and the most liked items (MostLiked).

Let P be the number of the POIs and U be the num-ber of the users. For the baseline KNNItem, we usedthe cosine as the similarity measure of POIs definedas similarity(pi, pj) =

<pi,pj>‖pi‖∗‖pj‖ , where pi and pj

represent the vectors of ratings of the i-th and the j-th POIs given by all users for i, j ∈ {1, . . . , P}, andwhere <·, ·> is the scalar product of two vectors and ‖·‖is the 2-norm of a vector. We set k to the total numberof the POIs. The baseline MostLiked takes all positiveratings over at all times into account. It recommendsthe most positively rated POIs to every user. This base-line was realized by SLD SERVER as explained in Sec-tion 6.2.

First, we compared SUNS with the baseline meth-ods. Then, we examined the performance of the com-bination of both inductive and deductive stream rea-soning, i.e., SUNS and MOSTLIKED.

We used two evaluation methods:

– Normalized discounted cumulative gain (nDCG):is calculated by summing over all the gains inthe rank list R with a log discount factor asnDCG(R) = 1

Z

∑k

2r(k)−1log(1+k) , where r(k) denotes

the target label for the k-th ranked item in R, andr is chosen such that a perfect ranking obtainsvalue 1. To focus more on the top-ranked items,we also consider the nDCG@n which only counts

Page 13: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis 13

Nr. of ratings %

Last day 188 0.17

Last 2 days 703 0.64

Last 7 days 5,057 4.62

Last 30 days 27,049 24.73

Last 90 days 65,600 70.01

Last 180 days 93,696 85.65

Total 109,389 100.00

Table 2Number of ratings with different time frames

the top n items in the rank list for a user. In ourevaluation R is the rank list of POIs and we reportnDCG@all.

– Accuracy at the top N POIs Acc@N=|TPN |,where TPN is the number of true positives in thetop N recommendations. In our case, since wehave just one test data point per user, Acc@N iseither equals to one or zero.

For both evaluation measures, we averaged thescores across all ranking lists (a list per user).

7.2. Settings

The evaluation was exploited in two different set-tings:

– Setting 1: We randomly withheld one positive rat-ing for each user and treated it as a test data point.We trained the models using the remaining rat-ings and then estimated the values of the test datapoints. This corresponds to the common methodof splitting the data into a training and a test sub-sets. We repeated this data split five times.

– Setting 2: We withheld the newest rating foreach user as test data. An additional baselineTimeWindow was introduced which correspondsto MOSTLIKED with different time frames: 1 day,2 days, 7 days, 30 days, 90 days and 180 days21.To this end, we removed the older ratings. Table 2shows the number of ratings that remained for thetraining.

21Also this baseline is evaluated using the network of C-SPARQLqueires in continuous execution in SLD SERVER illustrated in Sec-tion 6.2.

We implemented the baselines, SUNS and the eval-uation methods in Matlab. We ran the evaluation on alaptop with Windows XP, CPU 2.1 GHz and 3.24 GBRAM, which corresponds to a 80 e/month share in acloud environment22.

7.3. Results

Figure 10 shows the results that we obtained inSetting 1. On the left the nDCG scores of the testedmethods are plotted against the number of the la-tent variables. Since the baselines are independent ofthis number, they produced three horizontal lines eachfor Random, MOSTLIKED and KNNItem. We evalu-ated SUNS with 20, 50, 100, 150 and 200 latent vari-ables. As expected, Random was the worst. MOST-LIKED was slightly better than KNNItem. This mightbe due to the “bandwagon effect” that exists in manysocial communities. SUNS significantly outperformedall baselines after the number of the latent variablesreached more than 100. The best ranking ever was pro-duced by the combination of both SUNS and MOST-LIKED. These results confirm the idea presented in [5]that a combined approach of deductive and inductivestream reasoning works best. Figure 10 on the rightshows the accuracy of the top N POIs, where N ={5, 10, 15, 20, 25, 30}. The quality of the recommen-dations made by SUNS was much higher than that ofthe baselines. The combination of SUNS and MOST-LIKED generated the overall best recommendations.

In Setting 2, as mentioned beforehand, we varied thesize of the time window from 1 day to 180 days. Fig-ure 11 plots the nDCG scores on the left and the ac-curacy at top 10 on the right. TIMEWINDOW, the onlycurve in the figure, catched MOSTLIKED referred tonDCG@all, when using the last 30 days ratings, andwas very closed to MOSTLIKED referred to Acc@10,while the time window being 90 days. This fact tellsus that, in this setting, keeping a window of 90 days isnearly as effective as keeping the full history.

We also evaluated the recommendations given bythe experts Rough Guide and Trip Advisor describedin Section 4. Table 3 clearly indicates that both expertsand their combination performed as poorly as the ran-dom guess and were not comparable with SUNS at all.

In the evaluation, SUNS displayed excellent scalingcapability. The training took approximately 86 secondswith 200 latent variables and the recommendation of

22The calculation of the cost per month was done using https:

//www.gandi.net/hosting/vps.

Page 14: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

14 M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis

Fig. 10. Evaluation in Setting 1: nDCG scores (left) and accuracy values at top N (right).

Fig. 11. Evaluation in Setting 2: nDCG scores (left) and accuracy values at top 10 (right).

Setting 1 Trip Rough Comb Random SUNS

NDCG@5 0.0087 0.0149 0.0197 0.0134 0.2724

Acc@5 (%) 0.0099 0.0181 0.0253 0.0255 0.3191

Setting 2

NDCG@5 0.0072 0.0152 0.0190 0.0136 0.2681

Acc@5 (%) 0.0077 0.0181 0.0255 0.0255 0.3153

Table 3Comparisons of Trip Advisor (Trip), Rough Guide (Rough) and theircombination (Comb) with random guess and SUNS according tonDCG score and accuracy.

POIs for a user costs on average less than 5 millisec-onds. Internal studies have confirmed that, by exploit-ing sparsity, SUNS computational requirements scaleslinearly with the number of known ratings. In addition,

we observed that SUNS was very robust and insensi-tive to the number of latent variables. This can be ex-plained by the fact that SUNS, in contrast to other ma-trix factorization methods such as Singular Value De-composition (SVD), is regularized. This property cansimplify the use of SUNS, in particular for people with-out machine learning expertise.

8. Scalability

The SOCIAL MEDIA CRAWLER (cf. Figure 7)probes hundreds of thousands of tweets per day, butthe RDF stream produced by the OPINION MINER

contains an average of 150 RDF triples per day (cor-responding to 75 tweets). The large majority of thecrawled tweets cannot be related to Insadong’s restau-

Page 15: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis 15

Fig. 12. Settings for SLD SERVER scalability tests

rants. The flow rate of this RDF stream does not stressthe PUSH segment of BOTTARI’s back-end.

In this section, we report on the stress tests we con-ducted on SLD SERVER to assess the scalability of thesystem in future scenarios, i.e., when BOTTARI willbe deployed at larger scales (e.g., the entire Korea).

8.1. Methodology

An RDF stream, as explained in Section 2.2, is asequence of triples annotated with a non-decreasingtimestamp. The C-SPARQL query network illustratedin Figure 8 uses only logical tumbling windows23, i.e.,each query is evaluated when the difference betweenthe timestamp of the most recent triple and that of theoldest triple is equal to or greater than the window size.This means that SLD SERVER can be stressed with avariable speed test without risking to alter the seman-tics of the C-SPARQL query network. In other words,the RDF stream can be replayed at a faster rated, stress-testing SLD SERVER, but without altering the resultsof the queries. As in publish-subscribe systems [20],the most appropriate quantity to be measured is the in-put throughput, i.e., the ability to consume triples asinputs. That is computed as follows:

input throughput =size input

time to process the input

We measure input throughput by sending to SLD

SERVER a recorded portion of the RDF stream and bymeasuring the time required to process it, i.e., by com-

23We refer readers interested in understanding more of C-SPARQL windowing to Section 2.2 of [8].

puting all the answers to all the queries in the networkfor the portion of RDF stream.

To improve confidence, we repeated each experi-ment for 30 minutes and we measured average, mini-mum and maximum time required to process the por-tion of the RDF stream.

8.2. Settings

The stress test was conducted in nine different set-tings using a growing number of tweets in the cor-responding segment of the RDF stream presented toBOTTARI. The first settings is a portion of the RDFstream recorded between May, 1st and May, 3rd, 2010.The ninth contains 45 days of tweets recorded be-tween May, 1st and June, 15th, 2010. Figure 12 showsthe characteristics of those nine settings. The num-ber of rated POIs grows logarithmically in the num-ber of tweets in the portion of RDF Stream. The samelogarithmic dependency exists between the number oftweets and average number of ratings per POI. Themaximum number of ratings per POI, instead, has anS shape. For small windows (e.g., the first setting con-taining only three days of tweets) the maximum num-ber of rating per POI is a small fraction of the num-ber of rated POIs, while for large windows (e.g., thesixth setting containing one month of tweets) it tendsto become proportional to the number of rated POIs.This happens because the number of positive ratingsper POI is distributed according to a power law (seeFigure 4(a)).

The experiments were conducted on a laptop withCPU 2.2 GHz and 4 GB RAM, which corresponds toa 80 e/month share in a cloud environment24.

24The calculation of the cost per month was done using https://www.gandi.net/hosting/vps.

Page 16: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

16 M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis

8.3. Results

The results of these stress tests are illustrated in Fig-ure 13. In the first six settings, SLD SERVER keeps thepace of the growing rate of tweets per second in input.The average latency remains stable at around 2 sec-onds and consequently the input-throughput linearlyincreases with the number of tweets in the segment ofrecorded RDF stream used in the settings. In the firstsetting, the segment of recorded RDF stream contains85 tweets recorded in 3 days and the input through-put is 55 tweets per second, whereas in the sixth set-ting the segment of RDF streams contains 1.501 tweetsrecorded in 1 month (i.e., the data required by theEmerging recommendations) and the input throughputis 700 tweets per second.

Fig. 13. The results of the scalability tests performed on SLD

SERVER

From the seventh setting on, SLD SERVER is nolonger able to keep the pace. The average inputthroughput remains stable on 700 tweets per second,while the minimum throughput drops to 100 tweetsper second (i.e., the maximum latency of the systemgrows to 25 seconds). This is normal in DSMS: thesystem keeps the pace of the data in input up to the mo-ment where all computational resources are saturated.If pick rates above this saturation point are expected,queues are used to temporarily store the incoming data.

8.4. Interpretation of Results

Here, we attempt to interpret the results of thesescalability experiments. The most straight forward in-terpretation, i.e., that the SLD SERVER can supportup to 700 tweets per seconds for Insadong, is erro-neous. The correct interpretation, instead, is that SLD

SERVER could support the computation needed for the

Emerging recommendation for tens of thousands25 ofareas like Insadong. This is under the assumptions thatthese areas contain hundreds of POIs, that are posi-tively discussed in hundreds of tweets per day, that inaverage each POI is rated ten times per month, that atleast half of the POIs are rated once per month, andthat the maximum number of positive rates per monthper POI is proportional to the number of POIs. The re-sults show the possibility to deploy BOTTARI acrossentire Korea where thousands of areas like Insadongcan be located.

9. Conclusions and Future Works

In the last few years, we have been witnessing theincreasing popularity and success of Location-basedServices (LBS), especially of those with a Social Net-working flavor. Typical services bring a wide range ofuseful information about tourist attractions, local busi-nesses and points of interests (POIs) into the phys-ical world. Although these services are enormouslypopular, users still suffer from a number of short-comings. The overwhelming information flow comingfrom those channels often confuses users; it is alsovery difficult to distinguish between a fair personalopinion and a malicious or opportunistic advice.

In this paper, we presented our work to design anddevelop BOTTARI, a location-based Android applica-tion for mobile users, that recommends POIs in the In-sadong district of Seoul by making use of the rich andcollective knowledge obtained by mining the urban re-ality as sensed via micropost streams. BOTTARI wonthe first prize at the 9th Semantic Web Challenge26.

From the user’s point of view, BOTTARI is an aug-mented reality application whose interface is designedto be intuitive and easy to use, hiding the complex-ity of the technology behind it. BOTTARI is an effec-tive business analysis and market scanning tool; it alsopromises to be more effective than guide books andWeb 2.0 travel review sites that often suffer from verylow recall.

BOTTARI combines location-specific static infor-mation about POIs in Insadong (POIs description) withdynamic data coming from public micropost streams.Our dataset thus includes heterogeneous informationfrom real sources at real scale; we employed Semantic

25The exact results of the proportion are 87,500 areas like In-sadong.

26Cf. http://challenge.semanticweb.org/.

Page 17: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis 17

Web technologies to allow for a seamless integration ofheterogeneous components within an ontology-basedinformation access framework: both POI descriptionsand microposts are represented and stored in RDF anddescribed by OWL ontologies; in the data processing,SPARQL and its extensions are heavily used.

BOTTARI’s effectiveness and efficiency is the re-sult of an interdisciplinary approach. The data process-ing to provide POIs recommendations is based on anovel mix of technologies: beside a more traditionallocation-based information retrieval, we are applyingsentiment analysis and stream reasoning (both deduc-tive and inductive) to the micropost stream to obtain afull reality mining result.

The presented evaluation demonstrates that our re-sults are very good in terms of efficiency and quality ofthe produced recommendations and displays a signifi-cant better performance with respect to the baselines.Moreover, our techniques demonstrate the capabilityto efficiently deal with the data scale of social media;indeed, BOTTARI’s back-end is able to process a sig-nificantly large flow of microposts.

The efficacy of recommendations depends on BOT-TARI’s ability to keep into consideration the user con-text, in terms of time, location and feeling. It is worthnoting, however, that the history of user context canimprove efficacy of recommendation, but penalize ef-ficiency: the window-based approach offered by ourstream reasoning allows for finding the appropriatetrade-off between efficacy and efficiency.

Even if BOTTARI is currently just a research pro-totype, its potential as a commercial product is clearand encouraging and Saltlux decided to continue theBOTTARI development for its Korean customers.

Our future work will be devoted to identifying andrecommending POI “mavens” as described in [13]; wealso intend to study different strategies to aggregatepositive and negative ratings, beyond the straightfor-ward approach we are currently adopting. Finally, wewould like to better leverage the promises of a LinkedData-based approach: while in its current implementa-tion, BOTTARI adopts Linked Data only as a techni-cal mean for seamless interoperability (the open linkeddatasets available turned out to be of too low qual-ity for effective usage), we look forward to integrat-ing BOTTARI with LOD sources as soon as publisheddata sets display sufficient quality.

Acknowledgments

This work was partially supported by the EU projectLarKC (FP7-215535).

References

[1] Amin, A., Townsend, S., van Ossenbruggen, J., & Hardman,L. (2009). Fancy a drink in canary wharf?: A user study onlocation-based mobile search. In T. Gross, J. Gulliksen, P. Kotzé,L. Oestreicher, P. A. Palanque, R. O. Prates, and M. Winckler, ed-itors, INTERACT (1), volume 5726 of Lecture Notes in ComputerScience, pages 736–749. Springer.

[2] Anicic, D., Fodor, P., Rudolph, S., & Stojanovic, N. (2011). Ep-sparql: a unified language for event processing and stream reason-ing. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravin-dra, E. Bertino, and R. Kumar, editors, WWW, pages 635–644.ACM.

[3] Barbieri, D. F. & Della Valle, E. (2010). A proposal for pub-lishing data streams as linked data - a position paper. In C. Bizer,T. Heath, T. Berners-Lee, and M. Hausenblas, editors, LDOW,volume 628 of CEUR Workshop Proceedings. CEUR-WS.org.

[4] Barbieri, D. F., Braga, D., Ceri, S., Della Valle, E., & Gross-niklaus, M. (2010a). C-sparql: a continuous query language forrdf data streams. Int. J. Semantic Computing, 4(1), 3–25.

[5] Barbieri, D. F., Braga, D., Ceri, S., Della Valle, E., Huang, Y.,Tresp, V., Rettinger, A., & Wermser, H. (2010b). Deductive andInductive Stream Reasoning for Semantic Social Media Analyt-ics. IEEE Intelligent Systems, 25(6), 32–41.

[6] Barbieri, D. F., Braga, D., Ceri, S., & Grossniklaus, M. (2010c).An execution environment for c-sparql queries. In I. Manolescu,S. Spaccapietra, J. Teubner, M. Kitsuregawa, A. Léger, F. Nau-mann, A. Ailamaki, and F. Özcan, editors, EDBT, volume 426of ACM International Conference Proceeding Series, pages 441–452. ACM.

[7] Barbieri, D. F., Braga, D., Ceri, S., Della Valle, E., & Gross-niklaus, M. (2010d). Incremental Reasoning on Streams and RichBackground Knowledge. In Proc. of ESWC2010.

[8] Barbieri, D. F., Braga, D., Ceri, S., Della Valle, E., & Gross-niklaus, M. (2010e). Querying rdf streams with c-sparql. SIG-MOD Record, 39(1), 20–26.

[9] Bell, R. M., Koren, Y., & Volinsky, C. (2010). All together now:A perspective on the netflix prize. Chance.

[10] Berberich, K., König, A. C., Lymberopoulos, D., & Zhao, P.(2011). Improving local search ranking through external logs. InW.-Y. Ma, J.-Y. Nie, R. A. Baeza-Yates, T.-S. Chua, and W. B.Croft, editors, SIGIR, pages 785–794. ACM.

[11] Berrueta, D. & et al. (2007). SIOC Core Ontology Specifica-tion. W3C Member Submission, W3C.

[12] Burges, C., Scholkopf, B., & Smola, A. (1999). Advances inKernel Methods: Support Vector Learning. MIT press.

[13] Celino, I., Dell’Aglio, D., Della Valle, E., Huang, Y., Lee,T., Park, S., & Tresp, V. (2011). Towards BOTTARI: UsingStream Reasoning to Make Sense of Location-based Micro-Posts.In R. G.-C. et al, editor, ESWC 2011 Workshops, pages 80–87.LNCS 7117, Springer-Verlag Berlin Heidelberg.

[14] Cheptsov, A. & et al. (2011). Large Knowledge Collider. AService-oriented Platform for Large-scale Semantic Reasoning.In Proceedings of WIMS 2011.

Page 18: IOS Press Reality Mining on Micropost Streams · BOTTARI is an augmented reality application for personalized and localized restaurant recommendation in Seoul. At a first look, it

18 M. Balduini et al. / BOTTARI: Reality Mining through Continuous Semantic Social Media Analysis

[15] Della Valle, E., Ceri, S., van Harmelen, F., & Fensel, D. (2009).It’s a Streaming World! Reasoning upon Rapidly Changing Infor-mation. IEEE Intelligent Systems, 24(6), 83–89.

[16] Do, T., Loke, S., & Liu, F. (2011). Answer set programmingfor stream reasoning. In C. Butz and P. Lingras, editors, Ad-vances in Artificial Intelligence, volume 6657 of Lecture Notes inComputer Science, pages 104–109. Springer Berlin / Heidelberg.10.1007/978-3-642-21043-3_13.

[17] Domingos, P. & Richardson, M. (2007). Markov logic: A uni-fying framework for statistical relational learning. In L. Getoorand B. Taskar, editors, Introduction to Statistical RelationalLearning. MIT Press.

[18] Dou, Z., Song, R., & Wen, J.-R. (2007). A large-scale eval-uation and analysis of personalized search strategies. In C. L.Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy,editors, WWW, pages 581–590. ACM.

[19] Eagle, N. & Pentland, A. (2006). Reality mining: sensing com-plex social systems. Personal and Ubiquitous Computing, 10(4),255–268.

[20] Fabret, F., Jacobsen, H.-A., Llirbat, F., Pereira, J., Ross, K. A.,& Shasha, D. (2001). Filtering algorithms and implementationfor very fast publish/subscribe. In SIGMOD Conference, pages115–126.

[21] Fensel, D., van Harmelen, F., Andersson, B., Brennan, P., Cun-ningham, H., Della Valle, E., Fischer, F., Huang, Z., Kiryakov,A., il Lee, T. K., School, L., Tresp, V., Wesner, S., Witbrock, M.,& Zhong, N. (2008). Towards LarKC: a Platform for Web-scaleReasoning. In Proc. of ICSC 2008.

[22] Froehlich, J., Chen, M. Y., Smith, I. E., & Potter, F. (2006).Voting with your feet: An investigative study of the relationshipbetween place visit behavior and preference. In P. Dourish andA. Friday, editors, Ubicomp, volume 4206 of Lecture Notes inComputer Science, pages 333–350. Springer.

[23] Gaerdenfors, P., editor (2003). Belief Revision. CambridgeUniversity Press.

[24] Garofalakis, M., Gehrke, J., & Rastogi, R. (2007). Data StreamManagement: Processing High-Speed Data Streams. Springer-Verlag New York, Inc.

[25] Getoor, L., Friedman, N., Koller, D., Pferrer, A., & Taskar,B. (2007). Probabilistic relational models. In L. Getoor andB. Taskar, editors, Introduction to Statistical Relational Learning.MIT Press.

[26] Huang, Y., Tresp, V., Bundschus, M., Rettinger, A., & Kriegel,H.-P. (2010a). Multivariate prediction for learning on the seman-tic web. In P. Frasconi and F. A. Lisi, editors, ILP, volume 6489of Lecture Notes in Computer Science, pages 92–104. Springer.

[27] Huang, Y., Nickel, M., Tresp, V., & Kriegel, H.-P. (2010b). Ascalable kernel approach to learning in semantic graphs with ap-plications to linked data. In Proc. of the 1st Workshop on Miningthe Future Internet.

[28] Kamvar, M. & Baluja, S. (2006). A large scale study of wire-less search behavior: Google mobile search. In Proceedings ofthe SIGCHI conference on Human Factors in computing systems,

CHI ’06, pages 701–709, New York, NY, USA. ACM.[29] Kamvar, M. & Baluja, S. (2007). Deciphering trends in mobile

search. IEEE Computer, 40(8), 58–62.[30] Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., &

Ueda, N. (2006). Learning systems of concepts with an infiniterelational model. In Poceedings of the National Conference onArtificial Intelligence (AAAI).

[31] Kolda, T. G. & Bader, B. W. (2009). Tensor decompositionsand applications. SIAM Review.

[32] Koller, D. & Pfeffer, A. (1998). Probabilistic frame-based sys-tems. In Proceedings of the National Conference on ArtificialIntelligence (AAAI).

[33] Lenzerini, M. (2002). Data integration: A theoretical perspec-tive. In L. Popa, editor, PODS, pages 233–246. ACM.

[34] Luckham, D. (2008). The power of events: An introductionto complex event processing in distributed enterprise systems.In N. Bassiliades, G. Governatori, and A. Paschke, editors, RuleRepresentation, Interchange and Reasoning on the Web, volume5321 of Lecture Notes in Computer Science, pages 3–3. SpringerBerlin / Heidelberg. 10.1007/978-3-540-88808-6_2.

[35] Lymberopoulos, D., Zhao, P., König, A. C., Berberich, K., &Liu, J. (2011). Location-aware click prediction in mobile localsearch. In C. Macdonald, I. Ounis, and I. Ruthven, editors, CIKM,pages 413–422. ACM.

[36] Rendle, S., Freudenthaler, C., & Schmidt-Thieme, L. (2010).Factorizing personalized markov chains for next-basket recom-mendation. In World Wide Web Conference.

[37] Salakhutdinov, R. & Mnih, A. (2008). Bayesian probabilisticmatrix factorization using markov chain monte carlo. In ICML.

[38] Shen, X., Tan, B., & Zhai, C. (2005). Context-sensitive infor-mation retrieval using implicit feedback. In R. A. Baeza-Yates,N. Ziviani, G. Marchionini, A. Moffat, and J. Tait, editors, SIGIR,pages 43–50. ACM.

[39] Takacs, G., Pilaszy, I., Nemeth, B., & Tikk, D. (2007). On thegravity recommendation system. In Proceedings of KDD Cupand Workshop 2007.

[40] Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminativeprobabilistic models for relational data. In Uncertainty in Artifi-cial Intelligence (UAI).

[41] Tresp, V., Huang, Y., Bundschus, M., & Rettinger, A. (2009).Materializing and querying learned knowledge. In Proc. of IRM-LeS 2009.

[42] Tresp, V., Huang, Y., Jiang, X., & Rettinger, A. (2011). Graph-ical models for relations - modeling relational context. In Inter-national Conference on Knowledge Discovery and InformationRetrieval.

[43] Xu, Z., Tresp, V., Yu, K., & Kriegel, H.-P. (2006). Infinitehidden relational models. In Uncertainty in Artificial Intelligence(UAI).

[44] Yi, J., Maghoul, F., & Pedersen, J. O. (2008). Decipheringmobile search patterns: a study of yahoo! mobile search queries.In J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins,and X. Zhang, editors, WWW, pages 257–266. ACM.