Post on 21-Jan-2017


Applications of Machine Learning to Location-based Social Networks

Joan Capdevila Pujol e-mail: [email protected]

website: http://people.ac.upc.edu/jc

Advisors: Jordi Torres Viñals, Jesús Cerquides Bueno


Table of contents

Motivation

Location-based Social Networks (LBSNs)

App 1: GeoSRS: A social recommender system

App 2: Tweet-SCAN: An event discovery technique

Conclusions and future trends

MOTIVATION

How did Facebook start?

As a “hot or not” game.

A ML geek might have thought:

“With all this tagged data, I am going to build a classifier to decide whether the person in the pic is hot or not.”

Mark Zuckerberg probably thought: “I’d rather keep playing to scale up the network, and then…”

1,500 users

1,500,000,000 users

User engagement through several social networking services:

Linking to friends, colleagues, etc. → Social graphs
Setting school/college → User profiles
Tagging friends in pictures → Tagged images
Liking publications → Rating information
Geolocating content → Geolocated content
Reviewing businesses → Textual comments
Expressing how one feels → People's feelings
…

Community detection

Content-based recommender

Sentiment analysis

Image recognition

Topic modeling

LOCATION-BASED SOCIAL NETWORKS (LBSNs)

Social Networks (SNs)
VIRTUAL WORLD
2004 - 2010

Location-based Social Networks (LBSNs)
VIRTUAL WORLD
Mobile communication + Positioning technologies
PHYSICAL WORLD
2010 - …

Locations

Location-acquisition technologies
–  Outdoor: GPS, GSM, etc.
–  Indoor: Wi-Fi, RFID, etc.

Representation of locations
–  Absolute (e.g. latitude-longitude coordinates)
–  Symbolic (e.g. at Pl. Catalunya, at Aeroport Girona-Costa Brava)

Forms of locations
–  Point locations (e.g. Foursquare venues)
–  Regions (e.g. Twitter places)
–  Trajectories (e.g. Strava)

Research lines

Understanding users
–  User similarity/link prediction
–  Experts/influencers detection
–  Community discovery

Understanding locations
–  Generic recommendation
   •  Most interesting locations and travel routes
   •  Itinerary planning
   •  Location-activity recommenders
–  Personalized recommendation: GeoSRS [Capdevila et al. 2015]

Understanding events
–  Anomaly detection: Tweet-SCAN [Capdevila et al. 2015]
–  Crowd behavioral patterns

Zheng, Y. 2011


GEOSRS: A SOCIAL RECOMMENDER SYSTEM


Joan Capdevila, Marta Arias, and Argimiro Arratia. "GeoSRS: A hybrid social recommender system for geolocated data." Information Systems (2015).

Motivation

USERS, VENUES, TIPS

System block diagram

DATA EXTRACTION

DATA PREPROCESSING

TEXT MODELING RECOMMENDATION MIXER


Data Extraction

RESTful API

APPs

Foursquare API

HTTP METHODS
–  GET, POST, PUT, DELETE

RESOURCES
–  Venue, tip, user
   e.g. GET https://api.foursquare.com/v2/venues/40a55d80f964a52020f31ee3

ASPECTS
–  Tips of a venue, friends of a user
   e.g. GET https://api.foursquare.com/v2/venues/40a55d80f964a52020f31ee3/tips

ACTIONS
–  Approve a friendship, like a venue
   e.g. POST https://api.foursquare.com/v2/venues/40a55d80f964a52020f31ee3/like

Foursquare API

App registration
–  https://foursquare.com/developers/apps
–  Obtain the Foursquare API credentials (Client ID and Client Secret)

Access token
–  Allows apps to make requests to Foursquare on behalf of a user

Userless request
–  Specify the consumer key’s Client ID and Client Secret
–  https://api.foursquare.com/v2/venues/search?ll=40.7,-74&client_id=XX&client_secret=ZZ&v=20151125

Authenticated request
–  Specify the access token
–  https://api.foursquare.com/v2/users/self/checkins?oauth_token=AA
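The two request styles can be sketched in Python as plain URL construction. This is a minimal sketch that only builds the URLs shown on the slide and sends nothing over the network; the helper names `userless_url` and `authenticated_url` are hypothetical, and the credentials `XX`, `ZZ`, and `AA` are the slide's placeholders.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.foursquare.com/v2"  # base URL taken from the slide


def userless_url(endpoint, client_id, client_secret, version, **params):
    """Build a userless request URL: app credentials travel as query parameters."""
    query = dict(params, client_id=client_id, client_secret=client_secret, v=version)
    # safe="," keeps the comma of "lat,lng" readable instead of encoding it as %2C
    return f"{API_ROOT}/{endpoint}?{urlencode(query, safe=',')}"


def authenticated_url(endpoint, oauth_token, **params):
    """Build an authenticated request URL: the user's access token is the credential."""
    query = dict(params, oauth_token=oauth_token)
    return f"{API_ROOT}/{endpoint}?{urlencode(query, safe=',')}"


# Placeholder credentials, exactly as they appear on the slide
search_url = userless_url("venues/search", "XX", "ZZ", "20151125", ll="40.7,-74")
checkins_url = authenticated_url("users/self/checkins", "AA")
print(search_url)
print(checkins_url)
```

Keeping URL construction separate from the HTTP call makes it easy to check what would be sent before spending any of the hourly request quota.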


Foursquare API

Technical Limitations
–  Userless requests to the venues/ resource: 5,000 requests/hour
–  Userless requests to other resources: 500 requests/hour
–  Authenticated requests: 500 requests/hour per token
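To get a feel for what these quotas imply for a crawler, the arithmetic can be sketched as follows. The function names are hypothetical, and the sketch assumes requests are spread evenly over the hour and that only the authenticated quota scales with the number of tokens.

```python
import math

# Hourly quotas as listed on the slide
LIMITS_PER_HOUR = {"venues": 5000, "other": 500, "authenticated": 500}


def min_interval_seconds(limit_per_hour):
    """Smallest spacing between requests that never exceeds the hourly quota."""
    return 3600 / limit_per_hour


def hours_needed(n_requests, limit_per_hour, n_tokens=1):
    """Whole hours needed for n_requests; n_tokens > 1 only makes sense for per-token quotas."""
    return math.ceil(n_requests / (limit_per_hour * n_tokens))


print(min_interval_seconds(LIMITS_PER_HOUR["venues"]))   # seconds between venue requests
print(hours_needed(12000, LIMITS_PER_HOUR["venues"]))    # hours for 12,000 venue requests
```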

Legal Limitations

Data Extraction

Goal: extract all tips from venues in Manhattan (New York)

Medium:
–  aspect: venues/VENUE_ID/tips
–  resource: venues/search(sw, ne)

Limitations:
–  5,000 requests/hour
–  at most 50 venues per request

[Map: search bounding box with corners SW and NE]

Quadtree algorithm

The search area is split recursively into four quadcells until each cell holds at most 50 venues.
In each quadcell at the tree leaves, there are at most 50 venues.
Through venues/VENUE_ID/tips, we then retrieve the tips for each venue.
Each tip is linked to a VENUE_ID and a USER_ID.
We now have a database of triplets (USER, TIP, VENUE) on which to perform recommendation.
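The quadtree crawl can be sketched as a short recursion. This is an illustrative sketch, not the authors' implementation: `fake_search` stands in for the real venues/search call and uses synthetic venues, the bounding-box coordinates are rough made-up values, and splitting a cell whenever a response is full (50 results) is an assumption about how truncated responses are detected.

```python
import random

MAX_VENUES = 50  # cap on venues returned per search request (from the slide)


def split(bbox):
    """Split a (south, west, north, east) box into its four quadcells."""
    s, w, n, e = bbox
    mid_lat, mid_lng = (s + n) / 2, (w + e) / 2
    return [(s, w, mid_lat, mid_lng), (s, mid_lng, mid_lat, e),
            (mid_lat, w, n, mid_lng), (mid_lat, mid_lng, n, e)]


def crawl(bbox, search, max_depth=12):
    """Collect venue ids in bbox, subdividing while a response may be truncated."""
    venues = search(bbox)
    if len(venues) < MAX_VENUES or max_depth == 0:
        return set(venues)  # leaf cell: the response is assumed complete
    found = set()
    for cell in split(bbox):
        found |= crawl(cell, search, max_depth - 1)
    return found


# Synthetic stand-in for the API: 1,000 random venues in a hypothetical box
random.seed(0)
AREA = (40.70, -74.02, 40.88, -73.91)
VENUES = {f"v{i}": (random.uniform(AREA[0], AREA[2]), random.uniform(AREA[1], AREA[3]))
          for i in range(1000)}


def fake_search(bbox):
    """Return at most MAX_VENUES venue ids inside bbox, like a capped search."""
    s, w, n, e = bbox
    hits = [vid for vid, (lat, lng) in VENUES.items() if s <= lat <= n and w <= lng <= e]
    return hits[:MAX_VENUES]


all_ids = crawl(AREA, fake_search)
print(len(all_ids))
```

Each leaf costs one search request, so dense areas consume more of the hourly quota than sparse ones; the recursion only pays for subdivision where the 50-venue cap is actually hit.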


Recommendation

Tip content

Tip sentiment: positive, negative, neutral

Collaborative Recommendation

Collaborative recommendation based on tips’ sentiment (e.g. positive)

Content-based Recommendation

Content-based recommendation based on tips’ content: recommend / do not recommend

Hybridization

Simple weighted hybridization technique
–  Collaborative branch: fCOL
–  Content-based branch: fCONT
–  Hybrid: f = fCOL + α·fCONT
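The weighted combination can be illustrated with a few lines of Python. The venue names and branch scores below are made up for illustration, and the assumption that higher scores mean stronger recommendations is mine, not the slide's.

```python
def hybrid_score(f_col, f_cont, alpha):
    """Weighted hybrid from the slide: f = fCOL + alpha * fCONT."""
    return f_col + alpha * f_cont


def rank(candidates, alpha):
    """Order venues by hybrid score, best first (assumes higher is better)."""
    return sorted(candidates,
                  key=lambda venue: hybrid_score(*candidates[venue], alpha),
                  reverse=True)


# Hypothetical (fCOL, fCONT) scores per venue
scores = {"venue_a": (0.9, 0.1), "venue_b": (0.5, 0.8), "venue_c": (0.2, 0.3)}

print(rank(scores, alpha=0.0))  # collaborative branch only
print(rank(scores, alpha=1.0))  # both branches weighted equally
```

With α = 0 the ranking is purely collaborative; raising α lets the content-based branch reorder the list, which is the knob the hybrid tunes.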

Evaluation

[Figure: positions 0, 1, 2, 3, …, Nposk-1]

Results

Evaluation in terms of cumulative distribution functions (CDFs) of the recommendation error
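An empirical CDF of the error can be computed as sketched below. The error values are hypothetical and the slide does not define the error precisely, so this only illustrates how such a curve is built and read.

```python
def empirical_cdf(errors):
    """Return (error, cumulative fraction) pairs of the sorted sample."""
    xs = sorted(errors)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def fraction_within(errors, threshold):
    """Value of the empirical CDF at threshold: share of errors <= threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


# Hypothetical recommendation errors, one per evaluated case
errors = [0, 2, 1, 5, 1, 0, 3, 8]

print(empirical_cdf(errors))
print(fraction_within(errors, 1))
```

A recommender whose curve rises faster concentrates more of its cases at small errors, which is how two CDFs are compared at a glance.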

TWEET-SCAN: AN EVENT DISCOVERY TECHNIQUE

Joan Capdevila, Jesús Cerquides, Jordi Nin, and Jordi Torres. “Tweet-SCAN: An event discovery technique for geo-located tweets.” Proceedings of the 18th International Conference of the Catalan Association for Artificial Intelligence, 2015.

Motivation

Can we uncover physical world events from tweets?

Examination of data

We looked at several tweet dimensions separately, using a dataset of tweets collected during “la Mercè” 2014:

Spatial, Temporal, Textual

Spatial and temporal

Tweet-SCAN

What about using all 3 dimensions at once?

Tweet-SCAN is a technique to discover events from geolocated tweets.
It discovers dense groups of tweets that are close in space, time, and textual meaning.
These dense groups of tweets are linked to physical world events.
Textual meaning is represented through probabilistic topic models.
Tweet-SCAN can be seen as an extension of the popular DBSCAN algorithm, or as a particular case of GDBSCAN.

Probabilistic topic modeling

Latent Dirichlet Allocation (LDA) [Blei et al. 2003]
Hierarchical Dirichlet Process (HDP) [Teh et al. 2006]
–  Non-parametric version of LDA

Fig. - Xuriguera et al. 2013

Example tweet and its topic proportions:

VAN VAN MARKET - 🚐🚎🍤🍱🍜 Mercat gastronòmada #lepetitbangkok @ Parc de la Ciutadella http://t.co/5CvnUFoIDa

Topic Proportions: [(1, 0.30002802458675537), (11,0.58330530874655417)]


DBSCAN

Density-Based Spatial Clustering of Applications with Noise [Ester et al. 1996]

Example parameters: MinPts = 4; ε is the neighborhood radius drawn in the figure

Core points: at least MinPts points lie within distance ε
Border points: not core, but within distance ε of a core point
Noise points: neither core nor border

Ester et al. 1996
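The three point types can be made concrete with a compact pure-Python DBSCAN. This is a didactic sketch with a quadratic-time neighborhood search and made-up 2-D points, assuming (as in the original algorithm) that a point counts as its own neighbor; it is not the implementation used in the talk.

```python
import math


def dbscan(points, eps, min_pts):
    """Return (labels, kinds): cluster id per point (-1 = noise) and its point type."""
    n = len(points)
    # eps-neighborhood of every point (each point is its own neighbor)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue  # clusters are only seeded from unassigned core points
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()  # p is always a core point
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        frontier.append(q)  # only core points keep expanding
        cluster += 1
    kinds = ["core" if is_core[i] else "border" if labels[i] != -1 else "noise"
             for i in range(n)]
    return labels, kinds


# Hypothetical points: a dense square, one point on its edge, one far away
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (10, 10)]
labels, kinds = dbscan(points, eps=1.5, min_pts=4)
print(labels)
print(kinds)
```

The point at (2.2, 1) has too few neighbors to be core but sits within ε of the core point (1, 1), so it joins the cluster as a border point, while (10, 10) is left as noise.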


Tweet-SCAN

Neighborhood identification
–  ε1: spatial (m) – Euclidean distance
–  ε2: time (sec) – Euclidean distance
–  ε3: text – Jensen-Shannon distance (proper metric for prob. dist.)

Cardinality of the neighborhood
–  MinPts – minimum number of neighbors (tweets)
–  µ – minimum percentage of unique users in the neighborhood
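The neighborhood and cardinality tests above can be sketched as follows. Several details are assumptions made for illustration: tweet positions are taken as already projected to metres, topic proportions are sparse dictionaries that sum to 1, the Jensen-Shannon distance uses base-2 logarithms (so it lies in [0, 1]), and µ is read as the ratio of unique users to tweets in the neighborhood. The `Tweet` class and the sample data are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Tweet:
    x: float                   # easting in metres (assumed projected)
    y: float                   # northing in metres
    t: float                   # timestamp in seconds
    topics: Dict[int, float]   # sparse topic proportions, assumed to sum to 1
    user: str


def js_distance(p, q):
    """Jensen-Shannon distance between two sparse distributions (base-2 logs)."""
    div = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = (pk + qk) / 2
        if pk > 0:
            div += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * math.log2(qk / mk)
    return math.sqrt(max(div, 0.0))


def neighborhood(tweets, i, eps1, eps2, eps3):
    """Tweets within eps1 metres, eps2 seconds and eps3 topic distance of tweet i."""
    c = tweets[i]
    return [o for o in tweets
            if math.hypot(o.x - c.x, o.y - c.y) <= eps1
            and abs(o.t - c.t) <= eps2
            and js_distance(o.topics, c.topics) <= eps3]


def is_core(tweets, i, eps1, eps2, eps3, min_pts, mu):
    """Core test: enough neighbors, and enough distinct users among them."""
    nb = neighborhood(tweets, i, eps1, eps2, eps3)
    unique_ratio = len({o.user for o in nb}) / len(nb)  # nb always contains tweet i
    return len(nb) >= min_pts and unique_ratio >= mu


tweets = [
    Tweet(0, 0, 0, {1: 1.0}, "u1"),
    Tweet(10, 0, 60, {1: 1.0}, "u2"),
    Tweet(0, 20, 120, {1: 0.9, 2: 0.1}, "u3"),
    Tweet(5, 5, 300, {1: 1.0}, "u1"),
    Tweet(0, 0, 0, {2: 1.0}, "u4"),      # same place and time, different topic
    Tweet(5000, 0, 0, {1: 1.0}, "u5"),   # same topic, far away
]
print(is_core(tweets, 0, eps1=250, eps2=3600, eps3=0.8, min_pts=4, mu=0.5))
```

Plugging this core test into a DBSCAN-style expansion is what makes ε3 = 1 a useful baseline: at that value the text constraint never excludes a tweet, so the clustering falls back to space and time only.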


Experimentation

We ran Tweet-SCAN on 45,623 tweets to discover event-related tweets in an unsupervised manner.
We seek to understand the role of the parameters by comparing the resulting clusters against the 1,163 tagged event-related tweets.

Evaluation

Extrinsic clustering metrics [Amigó et al. 2008], for clusters C, labels L and n items in total

Purity

$$\mathrm{Purity} = \sum_i \frac{|C_i|}{n} \max_j \mathrm{Precision}(C_i, L_j), \qquad \mathrm{Precision}(C_i, L_j) = \frac{|C_i \cap L_j|}{|C_i|}$$

Example with 3 clusters and 3 labels:

Precision(Ci, Lj)   L1     L2     L3
C1                  9/10   1/10   0/10
C2                  0/9    8/9    1/9
C3                  0/9    0/9    9/9

Purity = 0.92

Further example shown on the slides: Purity = 1

Inverse purity

$$\mathrm{InvPurity} = \sum_i \frac{|L_i|}{n} \max_j \mathrm{Recall}(C_j, L_i), \qquad \mathrm{Recall}(C_j, L_i) = \frac{|L_i \cap C_j|}{|L_i|}$$

Same example:

Recall(Cj, Li)   C1     C2     C3
L1               9/9    0/9    0/9
L2               1/9    8/9    0/9
L3               0/10   1/10   9/10

InvPurity = 0.92

Further examples shown on the slides: InvPurity = 0.1, InvPurity = 1

F-measure

F(Li, Cj) is the harmonic mean of Precision(Cj, Li) and Recall(Cj, Li).

$$F = \sum_i \frac{|L_i|}{n} \max_j F(L_i, C_j), \qquad F(L_i, C_j) = \frac{2 \times \mathrm{Recall}(C_j, L_i) \times \mathrm{Precision}(C_j, L_i)}{\mathrm{Recall}(C_j, L_i) + \mathrm{Precision}(C_j, L_i)}$$
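The three metrics can be checked numerically from a contingency table. The sketch below assumes the worked example on the slides corresponds to the table `counts[i][j] = |C_i ∩ L_j|` given in the code; with that reading, purity and inverse purity both come out as 26/28 ≈ 0.929, in line with the 0.92 shown. A clustering that puts each of the 28 items in its own cluster gives purity 1 and inverse purity 3/28 ≈ 0.107, which would be consistent with the other values on the slides, although their figures are not available to confirm it.

```python
def purity(counts):
    """counts[i][j] = |C_i ∩ L_j|; weighted best precision per cluster."""
    n = sum(sum(row) for row in counts)
    return sum(max(row) for row in counts) / n


def inverse_purity(counts):
    """Purity with the roles of clusters and labels swapped."""
    transposed = [list(col) for col in zip(*counts)]
    return purity(transposed)


def f_measure(counts):
    """Weighted best harmonic mean of precision and recall per label."""
    n = sum(sum(row) for row in counts)
    cluster_sizes = [sum(row) for row in counts]
    label_sizes = [sum(col) for col in zip(*counts)]
    total = 0.0
    for j, l_size in enumerate(label_sizes):
        best = 0.0
        for i, c_size in enumerate(cluster_sizes):
            overlap = counts[i][j]
            if overlap:
                precision, recall = overlap / c_size, overlap / l_size
                best = max(best, 2 * precision * recall / (precision + recall))
        total += l_size / n * best
    return total


# Assumed contingency table for the slides' example (rows: clusters, columns: labels)
counts = [[9, 1, 0], [0, 8, 1], [0, 0, 9]]
print(purity(counts), inverse_purity(counts), f_measure(counts))

# Every item in its own cluster: 9 items of L1, 9 of L2, 10 of L3
singletons = [[1, 0, 0]] * 9 + [[0, 1, 0]] * 9 + [[0, 0, 1]] * 10
print(purity(singletons), inverse_purity(singletons))
```

Purity alone rewards over-splitting and inverse purity alone rewards lumping everything together, which is why the F-measure is the quantity optimized in the results.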

Evaluation results

Parameter settings shown in the result figures:
–  ε1 = 250 m, ε2 = 3600 s, MinPts = 10, µ = 0.5
–  ε1 = 250 m, ε2 = 3600 s, ε3 = 1, MinPts = 10, µ = 0.5
–  ε1 = 250 m, ε2 = 3600 s, ε3 = 0.8, MinPts = 10, µ = 0.5

Tagged events: Mostra Vins (20-09-2014), Vanvan market (20-09-2014)
Discovered events shown for ε3 = 1 and for ε3 = 0.8

Figure panels: (ε1 = 250 m, ε2 = 3600 s), (ε1 = 250 m, ε2 = 1800 s), (ε1 = 500 m, ε2 = 1800 s), (ε1 = 500 m, ε2 = 3600 s)
Optimal F values marked in the panels: Fopt = 0.64 (ε3 = 1), Fopt = 0.693 (ε3 = 0.9), Fopt = 0.62 (ε3 = 0.65), Fopt = 0.53 (ε3 = 0.9)

CONCLUSIONS

Conclusions

The birth of social networks is one of the major causes of the current volume of digitized personal data.
Social networks have kept their doors open to the developer community in order to stimulate the creation of apps.
This “openness” has materialized as RESTful APIs, which enable communication between third-party apps and social networks.
Through these APIs we can access vast amounts of data, and develop and validate machine learning tools.
However, technical and legal limitations have to be taken into account to build functional applications.
Location-based social networks bridge the virtual and physical worlds.
Classical applications such as recommender systems have to be reconsidered to take this new dimension into account.
Recommendation from textual reviews is feasible, and hybridization improves performance.
Data from SNs can be strongly biased by the SNs' own services (e.g. by their own RS).
Other novel applications, such as event discovery, gain meaning with LBSNs.
Event discovery has to consider the textual dimension to uncover meaningful events.

FUTURE TRENDS

Internet of Things

[Timeline figure: 2003, 2010, 2015, 2020]

In 2008, the number of things connected to the Internet exceeded the number of people on earth

M. Swan 2014


The Social Internet of Things (SIoT)

Atzori 2012


The internet of nano-things

Akyildiz et al. 2010


Haplotype social network?


References

Zheng, Yu. "Location-based social networks: Users." Computing with Spatial Trajectories. Springer New York, 2011. 243-276.
Capdevila, Joan, Marta Arias, and Argimiro Arratia. "GeoSRS: A hybrid social recommender system for geolocated data." Information Systems (2015).
Capdevila, Joan, Jesús Cerquides, Jordi Nin, and Jordi Torres. "Tweet-SCAN: An event discovery technique for geo-located tweets." Proceedings of the 18th International Conference of the Catalan Association for Artificial Intelligence, 2015.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
Teh, Yee Whye, et al. "Hierarchical Dirichlet processes." Journal of the American Statistical Association 101.476 (2006).
Ester, Martin, et al. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD. Vol. 96. No. 34. 1996.
Amigó, Enrique, et al. "A comparison of extrinsic clustering evaluation metrics based on formal constraints." Information Retrieval 12.4 (2009): 461-486.
Swan, Melanie. "Quantified Self Ideology. Personal data becomes Big Data." Université Paris Descartes, February 2014.
Akyildiz, Ian F., and Josep Miquel Jornet. "The internet of nano-things." IEEE Wireless Communications 17.6 (2010): 58-63.
Atzori, Luigi. "The Social Internet of Things." Presentation, University of Cagliari, Italy, 2012.

ACKNOWLEDGMENTS:


Many Thanks! Questions?