Applications of Machine Learning to Location-based Social Networks
Joan Capdevila Pujol e-mail: [email protected]
website: http://people.ac.upc.edu/jc
Advisors: Jordi Torres Viñals, Jesús Cerquides Bueno
2
Table of contents
Motivation
Location-based Social Networks (LBSNs)
App 1: GeoSRS: A social recommender system
App 2: Tweet-SCAN: An event discovery technique
Conclusions and future trends
7
A ML geek might have thought:
“With all this tagged data, I am going to build a classifier to decide whether the person in the pic is hot or not.”
8
Mark Zuckerberg probably thought: “I’d rather keep playing to scale up the network and then…”
11
User engagement through several social networking services:
– Linking to friends, colleagues, etc.
– Setting school/college
– Tagging friends in pictures
– Liking publications
– Geolocating content
– Reviewing businesses
– Expressing how one feels
– …
12
User engagement through several social networking services:
– Linking to friends, colleagues, etc. → Social graphs
– Setting school/college → User profiles
– Tagging friends in pictures → Tagged images
– Liking publications → Rating information
– Geolocating content → Geolocated content
– Reviewing businesses → Textual comments
– Expressing how one feels → People's feelings
– …
13
Community detection
Content-based Recommender
Sentiment Analysis
Image recognition
Topic modeling
Location-based Social Networks (LBSNs)
VIRTUAL WORLD
Mobile communication + Positioning technologies
PHYSICAL WORLD
2010 - …
17
Locations
Location-acquisition technologies
– Outdoor: GPS, GSM, etc.
– Indoor: Wi-Fi, RFID, etc.
Representation of locations
– Absolute (e.g. latitude-longitude coordinates)
– Symbolic (e.g. at Pl. Catalunya, at Aeroport Girona-Costa Brava)
Forms of locations
– Point locations (e.g. Foursquare venues)
– Regions (e.g. Twitter places)
– Trajectories (e.g. Strava)
18
Research lines
Understanding users
– User similarity/link prediction
– Experts/influencers detection
– Community discovery
Understanding locations
– Generic recommendation
  • Most interesting locations and travel routes
  • Itinerary planning
  • Location-activity recommenders
– Personalized recommendation: GeoSRS [Capdevila et al. 2015]
Understanding events
– Anomaly detection: Tweet-SCAN [Capdevila et al. 2015]
– Crowd behavioral patterns
Zheng, Y. 2011
GEOSRS: A SOCIAL RECOMMENDER SYSTEM
21
Joan Capdevila, Marta Arias, and Argimiro Arratia. "GeoSRS: A hybrid social recommender system for geolocated data." Information Systems (2015).
32
Foursquare API
HTTP METHODS – GET, POST, PUT, DELETE
RESOURCES – Venue, tip, user
e.g. GET https://api.foursquare.com/v2/venues/40a55d80f964a52020f31ee3
ASPECTS – Tips of a venue, friends of a user
e.g. GET https://api.foursquare.com/v2/venues/40a55d80f964a52020f31ee3/tips
ACTIONS – Approve a friendship, like a venue
e.g. POST https://api.foursquare.com/v2/venues/40a55d80f964a52020f31ee3/like
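The resource / aspect / action pattern above can be sketched as a small URL builder. This is only an illustration: the base URL and paths are copied from the slide, the function name `endpoint` is hypothetical, and no request is actually sent (the live API may differ from what the slide shows).

```python
from typing import Optional

# Base URL as shown on the slide (assumption: still valid for illustration only).
BASE_URL = "https://api.foursquare.com/v2"


def endpoint(resource: str, resource_id: str, sub: Optional[str] = None) -> str:
    """Build the URL of a resource, or of one of its aspects/actions.

    `sub` is the aspect (e.g. "tips") or action (e.g. "like") name, if any.
    """
    parts = [BASE_URL, resource, resource_id]
    if sub is not None:
        parts.append(sub)
    return "/".join(parts)


VENUE_ID = "40a55d80f964a52020f31ee3"  # venue id used in the slide examples

# Each entry pairs the HTTP method with the URL it would be sent to.
examples = [
    ("GET", endpoint("venues", VENUE_ID)),           # resource
    ("GET", endpoint("venues", VENUE_ID, "tips")),   # aspect
    ("POST", endpoint("venues", VENUE_ID, "like")),  # action
]
for method, url in examples:
    print(method, url)
```

Reading a resource or an aspect uses GET, while an action that changes state (such as liking a venue) uses POST.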
33
Foursquare API
App registration
– https://foursquare.com/developers/apps
– Obtain the Foursquare API credentials (Client ID and Client Secret)
Access token
– Allows apps to make requests to Foursquare on behalf of a user
Userless request
– Specify the consumer key’s Client ID and Client Secret
– e.g. https://api.foursquare.com/v2/venues/search?ll=40.7,-74&client_id=XX&client_secret=ZZ&v=20151125
Authenticated request
– Specify the access token
– e.g. https://api.foursquare.com/v2/users/self/checkins?oauth_token=AA
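The two request styles can be sketched with the standard library alone. The query parameter names (`client_id`, `client_secret`, `v`, `oauth_token`, `ll`) are the ones visible in the slide URLs; the helper names `userless_url` and `authenticated_url` are hypothetical, and the credentials are the slide's placeholders.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2"  # as shown on the slide


def userless_url(path: str, client_id: str, client_secret: str,
                 version: str, **params: str) -> str:
    """URL for a userless request: app credentials go in the query string."""
    query = dict(params, client_id=client_id,
                 client_secret=client_secret, v=version)
    # safe="," keeps the comma in "lat,lng" readable, as in the slide example.
    return f"{BASE_URL}/{path}?{urlencode(query, safe=',')}"


def authenticated_url(path: str, oauth_token: str, **params: str) -> str:
    """URL for a request made on behalf of a user, via an access token."""
    query = dict(params, oauth_token=oauth_token)
    return f"{BASE_URL}/{path}?{urlencode(query, safe=',')}"


print(userless_url("venues/search", "XX", "ZZ", "20151125", ll="40.7,-74"))
print(authenticated_url("users/self/checkins", "AA"))
```

In a real application the Client Secret and access token would be read from configuration rather than hard-coded.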
34
Foursquare API
Technical Limitations
Userless requests to the venues/ resource = 5,000 requests/hour
Userless requests to other resources = 500 requests/hour
Authenticated requests = 500 requests/hour per token
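These limits translate directly into a crawling budget. The arithmetic below is a rough sketch assuming requests are spread evenly over the hour; the function names are hypothetical and the 12,000-request workload is an invented example, not a figure from the talk.

```python
import math

VENUES_LIMIT_PER_HOUR = 5000  # userless requests to venues/, from the slide
OTHER_LIMIT_PER_HOUR = 500    # other userless or authenticated requests


def min_interval_seconds(limit_per_hour: int) -> float:
    """Smallest average gap between requests that stays within the limit."""
    return 3600.0 / limit_per_hour


def hours_needed(num_requests: int, limit_per_hour: int) -> int:
    """Lower bound on the number of hourly windows needed for a crawl."""
    return math.ceil(num_requests / limit_per_hour)


print(min_interval_seconds(VENUES_LIMIT_PER_HOUR))  # seconds between venue requests
print(hours_needed(12000, VENUES_LIMIT_PER_HOUR))   # hypothetical 12,000-request crawl
```

A crawler would typically sleep for at least `min_interval_seconds(...)` between calls, or pause until the next hourly window once the quota is used up.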
36
Data Extraction
Goal: extract all tips from venues in Manhattan (New York)
Medium:
– aspect: venues/VENUE_ID/tips
– resource: venues/search(sw, ne)
Limitations:
– 5,000 requests/hour
– at most 50 venues per request
Figure: search bounding box defined by its SW and NE corners
40
Quadtree algorithm
Each quadcell at the leaves of the tree contains at most 50 venues.
Through venues/VENUE_ID/tips, we then retrieve the tips for each venue.
Each tip is linked to a VENUE_ID and a USER_ID.
We now have a database of triplets (USER, TIP, VENUE) on which to perform recommendation.
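The quadtree crawl can be sketched as a recursive subdivision of the bounding box. This is a minimal sketch under stated assumptions, not the authors' implementation: `search` is a hypothetical callable standing in for venues/search(sw, ne), a response with exactly 50 venues is treated as possibly truncated, and `make_fake_search` simulates the API on synthetic venues so the code runs offline.

```python
import random
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]       # (latitude, longitude)
Venue = Tuple[str, float, float]  # (venue_id, latitude, longitude)
SearchFn = Callable[[Point, Point], List[Venue]]

MAX_RESULTS = 50  # cap on venues per search request, from the slide


def crawl_venues(sw: Point, ne: Point, search: SearchFn,
                 max_depth: int = 16) -> Dict[str, Venue]:
    """Collect all venues in the box [sw, ne] by quadtree subdivision."""
    found = search(sw, ne)
    # Fewer results than the cap: the cell is a leaf and the answer is complete.
    if len(found) < MAX_RESULTS or max_depth == 0:
        return {venue[0]: venue for venue in found}
    # Otherwise the response may be truncated: split the cell into four.
    mid_lat = (sw[0] + ne[0]) / 2
    mid_lng = (sw[1] + ne[1]) / 2
    quadrants = [
        (sw, (mid_lat, mid_lng)),               # south-west
        ((sw[0], mid_lng), (mid_lat, ne[1])),   # south-east
        ((mid_lat, sw[1]), (ne[0], mid_lng)),   # north-west
        ((mid_lat, mid_lng), ne),               # north-east
    ]
    venues: Dict[str, Venue] = {}
    for q_sw, q_ne in quadrants:
        # Keying by venue id removes duplicates on shared cell borders.
        venues.update(crawl_venues(q_sw, q_ne, search, max_depth - 1))
    return venues


def make_fake_search(venues: List[Venue]) -> SearchFn:
    """Offline stand-in for the search endpoint (hypothetical behaviour)."""
    def search(sw: Point, ne: Point) -> List[Venue]:
        inside = [v for v in venues
                  if sw[0] <= v[1] <= ne[0] and sw[1] <= v[2] <= ne[1]]
        return inside[:MAX_RESULTS]
    return search


rng = random.Random(0)
synthetic = [(f"v{i}", rng.uniform(40.70, 40.88), rng.uniform(-74.02, -73.91))
             for i in range(500)]
result = crawl_venues((40.70, -74.02), (40.88, -73.91), make_fake_search(synthetic))
print(len(result))
```

The `max_depth` guard is a safety net for cells where many venues share the same coordinates and subdivision would otherwise never terminate.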
45
Hybridization
Simple weighted hybridization technique
Collaborative branch: $f_{COL}$
Content-based branch: $f_{CONT}$
Hybrid: $f = f_{COL} + \alpha f_{CONT}$
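As a rough illustration of the weighted combination, the sketch below merges two per-venue score dictionaries. The scores, the venue names and the choice to treat a missing score as 0 are all assumptions made for the example; the slide only specifies the formula f = fCOL + α fCONT.

```python
from typing import Dict


def hybrid_scores(f_col: Dict[str, float], f_cont: Dict[str, float],
                  alpha: float) -> Dict[str, float]:
    """Combine both branches as f = f_col + alpha * f_cont, per venue."""
    venues = set(f_col) | set(f_cont)
    # Assumption: a venue that one branch did not score contributes 0 there.
    return {v: f_col.get(v, 0.0) + alpha * f_cont.get(v, 0.0) for v in venues}


# Hypothetical scores for one user.
collaborative = {"venue_a": 1.0, "venue_b": 0.5}
content_based = {"venue_a": 0.2, "venue_b": 1.0, "venue_c": 0.8}

scores = hybrid_scores(collaborative, content_based, alpha=0.5)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The weight α controls how much the content-based branch can reorder the collaborative ranking; α = 0 recovers the purely collaborative recommender.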
TWEET-SCAN: AN EVENT DISCOVERY TECHNIQUE
50
Joan Capdevila, Jesús Cerquides, Jordi Nin, and Jordi Torres. “Tweet-SCAN: An event discovery technique for geo-located tweets.” Proceedings of the 18th International Conference of the Catalan Association for Artificial Intelligence, 2015.
53
Examination of data
We looked at several tweet dimensions separately, using a dataset of tweets collected during “la Mercè” 2014.
Figure panels: Spatial, Temporal, Textual
58
Tweet-SCAN
Tweet-SCAN is a technique to discover events from geolocated tweets.
It discovers dense groups of tweets that are close in space, time and textual meaning.
These dense groups of tweets are linked to physical-world events.
Textual meaning is represented through probabilistic topic models.
Tweet-SCAN can be seen as an extension of the popular DBSCAN algorithm, or as a particular case of GDBSCAN.
59
Probabilistic topic modeling
Latent Dirichlet Allocation (LDA) [Blei et al. 2003]
Hierarchical Dirichlet Process (HDP) [Teh et al. 2006] – Non-parametric version of LDA
Fig.: Xuriguera et al. 2013
60
Probabilistic topic modeling
VAN VAN MARKET - 🚐🚎🍤🍱🍜 Mercat gastronòmada #lepetitbangkok @ Parc de la Ciutadella http://t.co/5CvnUFoIDa
Topic Proportions: [(1, 0.30002802458675537), (11,0.58330530874655417)]
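To compare two tweets by topic, the sparse list of (topic id, proportion) pairs shown above is usually expanded into a dense probability vector first. The sketch below is one way to do that; the number of topics (12) is a hypothetical value chosen only so that topic 11 fits, and renormalising the listed proportions so they sum to 1 is an assumption about how the unlisted mass is handled.

```python
from typing import List, Sequence, Tuple


def to_dense(sparse: Sequence[Tuple[int, float]], num_topics: int) -> List[float]:
    """Expand (topic_id, proportion) pairs into a dense, normalised vector."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse:
        dense[topic_id] = weight
    total = sum(dense)
    # Assumption: topics not listed get zero mass; renormalise the rest.
    return [w / total for w in dense] if total > 0 else dense


# Topic proportions of the example tweet on the slide.
tweet_topics = [(1, 0.30002802458675537), (11, 0.58330530874655417)]
vector = to_dense(tweet_topics, num_topics=12)  # 12 is a hypothetical topic count
print(vector)
```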
61
DBSCAN
Density-Based Spatial Clustering of Applications with Noise [Ester et al. 1996]
params: MinPts=4, ε =
– Core point: has at least MinPts points within distance ε
– Border point: not a core point, but within distance ε of a core point
– Noise point: neither a core point nor a border point
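The three point types in DBSCAN can be illustrated with a brute-force classifier. This sketch only labels points (it does not build the clusters), counts a point as part of its own ε-neighborhood, and uses invented coordinates and an invented ε = 1.1; only MinPts = 4 comes from the slides.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def classify_points(points: Sequence[Point], eps: float, min_pts: int) -> List[str]:
    """Label every point as "core", "border" or "noise"."""
    # eps-neighborhood of each point (the point itself is included).
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nb) >= min_pts for nb in neighborhoods]
    labels = []
    for i, nb in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nb):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical data: a dense square, one point on its edge, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 0), (5, 5)]
labels = classify_points(points, eps=1.1, min_pts=4)
print(list(zip(points, labels)))
```

The pairwise distance computation is quadratic in the number of points; practical implementations use a spatial index for the neighborhood queries.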
67
Tweet-SCAN
Neighborhood identification
– ε1: spatial (m) – Euclidean distance
– ε2: time (sec) – Euclidean distance
– ε3: text – Jensen-Shannon distance (a proper metric for probability distributions)
Cardinality of the neighborhood
– MinPts – minimum number of neighbors (tweets)
– µ – minimum percentage of unique users in the neighborhood
68
Experimentation
We used 45,623 tweets to discover event-related tweets with Tweet-SCAN in an unsupervised way.
We seek to understand the role of each parameter by comparing the resulting clusters against the 1,163 tagged event-related tweets.
69
Evaluation
Extrinsic clustering metrics [Amigó et al. 2009]
$$\mathrm{Purity} = \sum_i \frac{|C_i|}{n} \max_j \mathrm{Precision}(C_i, L_j), \qquad \mathrm{Precision}(C_i, L_j) = \frac{|C_i \cap L_j|}{|C_i|}$$
Example with three clusters (C1, C2, C3) and three labels (L1, L2, L3):
Precision(C1,L1)=9/10 Precision(C1,L2)=1/10 Precision(C1,L3)=0/10
Precision(C2,L1)=0/9 Precision(C2,L2)=8/9 Precision(C2,L3)=1/9
Precision(C3,L1)=0/9 Precision(C3,L2)=0/9 Precision(C3,L3)=9/9
Purity = 0.92
A second example figure reports Purity = 1.
72
Evaluation
Extrinsic clustering metrics [Amigó et al. 2009]
$$\mathrm{InvPurity} = \sum_i \frac{|L_i|}{n} \max_j \mathrm{Recall}(C_j, L_i), \qquad \mathrm{Recall}(C_j, L_i) = \frac{|L_i \cap C_j|}{|L_i|}$$
Example with three clusters (C1, C2, C3) and three labels (L1, L2, L3):
Recall(C1,L1)=9/9 Recall(C2,L1)=0/9 Recall(C3,L1)=0/9
Recall(C1,L2)=1/9 Recall(C2,L2)=8/9 Recall(C3,L2)=0/9
Recall(C1,L3)=0/10 Recall(C2,L3)=1/10 Recall(C3,L3)=9/10
InvPurity = 0.92
Two further example figures report InvPurity = 0.1 and InvPurity = 1.
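Purity and inverse purity can both be computed from one table of overlap counts. The sketch below uses the counts |Ci ∩ Lj| read off the numerators of the Precision and Recall examples on the slides; the function names are hypothetical. Because |Ci|/n · |Ci ∩ Lj|/|Ci| simplifies to |Ci ∩ Lj|/n, purity is the sum of row maxima divided by n, and inverse purity is the same computation on the transposed table.

```python
from fractions import Fraction
from typing import List

# COUNTS[i][j] = |C_i ∩ L_j|, taken from the slide example
# (rows: clusters C1..C3, columns: labels L1..L3).
COUNTS = [
    [9, 1, 0],
    [0, 8, 1],
    [0, 0, 9],
]


def purity(counts: List[List[int]]) -> Fraction:
    """Sum over clusters of the best-matching overlap, divided by n."""
    n = sum(sum(row) for row in counts)
    return Fraction(sum(max(row) for row in counts), n)


def inverse_purity(counts: List[List[int]]) -> Fraction:
    """Purity with the roles of clusters and labels swapped."""
    transposed = [list(column) for column in zip(*counts)]
    return purity(transposed)


print(purity(COUNTS), float(purity(COUNTS)))
print(inverse_purity(COUNTS), float(inverse_purity(COUNTS)))
```

Both values come out as 26/28, roughly 0.929, which matches the 0.92 shown on the slides if that figure was truncated rather than rounded.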
76
Evaluation
Extrinsic clustering metrics [Amigó et al. 2009]
F(Li,Cj) is the harmonic mean of Precision(Cj,Li) and Recall(Cj,Li)
$$F = \sum_i \frac{|L_i|}{n} \max_j F(L_i, C_j), \qquad F(L_i, C_j) = \frac{2 \times \mathrm{Recall}(C_j, L_i) \times \mathrm{Precision}(C_j, L_i)}{\mathrm{Recall}(C_j, L_i) + \mathrm{Precision}(C_j, L_i)}$$
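The F-measure can be evaluated on the same three-cluster example used for purity and inverse purity. This is a sketch with a hypothetical function name; the overlap counts are read off the Precision and Recall tables of the earlier slides, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from typing import List

# COUNTS[i][j] = |C_i ∩ L_j| (rows: clusters C1..C3, columns: labels L1..L3).
COUNTS = [
    [9, 1, 0],
    [0, 8, 1],
    [0, 0, 9],
]


def f_measure(counts: List[List[int]]) -> Fraction:
    """Weighted sum over labels of the best per-cluster F score."""
    cluster_sizes = [sum(row) for row in counts]
    label_sizes = [sum(column) for column in zip(*counts)]
    n = sum(cluster_sizes)
    total = Fraction(0)
    for j, label_size in enumerate(label_sizes):
        best = Fraction(0)
        for i, cluster_size in enumerate(cluster_sizes):
            overlap = counts[i][j]
            if overlap == 0:
                continue  # precision and recall are both 0, so F is 0
            precision = Fraction(overlap, cluster_size)
            recall = Fraction(overlap, label_size)
            best = max(best, 2 * recall * precision / (recall + precision))
        total += Fraction(label_size, n) * best
    return total


print(f_measure(COUNTS), float(f_measure(COUNTS)))
```

For this example the best match of each label is its own cluster, with F scores 18/19, 8/9 and 18/19.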
80
Evaluation results
Tagged events: Mostra Vins (20-09-2014), Vanvan market (20-09-2014)
Discovered events with ε3 = 1
81
Evaluation results
Tagged events: Mostra Vins (20-09-2014), Vanvan market (20-09-2014)
Discovered events with ε3 = 0.8
82
Evaluation results
Figure panels: ε1 = 250 m, ε2 = 3600 s; ε1 = 250 m, ε2 = 1800 s; ε1 = 500 m, ε2 = 1800 s; ε1 = 500 m, ε2 = 3600 s
Reported optima: Fopt = 0.64 (ε3 = 1); Fopt = 0.693 (ε3 = 0.9); Fopt = 0.62 (ε3 = 0.65); Fopt = 0.53 (ε3 = 0.9)
Conclusions
The birth of social networks is one of the major causes of the current volume of digitized personal data.
Social networks have kept their doors open to the developer community in order to stimulate the creation of apps.
This “openness” has materialized in RESTful APIs, which enable communication between third-party apps and social networks.
Through these APIs we can access vast amounts of data and develop and validate machine learning tools.
However, technical and legal limitations have to be taken into account to build functional applications.
85
Conclusions
Location-based social networks bridge the virtual and physical worlds.
Classical applications such as recommender systems have to be reconsidered to take this new dimension into account.
Recommendation from textual reviews is feasible, and hybridization improves performance.
Data from social networks can be heavily biased by the networks' own services (e.g. by their own recommender systems).
Other novel applications, such as event discovery, gain meaning with LBSNs.
Event discovery has to consider the textual dimension to uncover meaningful events.
87
Internet of Things
Timeline figure: 2003, 2010, 2015, 2020
In 2008, the number of things connected to the Internet exceeded the number of people on earth
M. Swan 2014
91
References
Zheng, Yu. "Location-based social networks: Users." Computing with Spatial Trajectories. Springer New York, 2011. 243-276.
Joan Capdevila, Marta Arias, and Argimiro Arratia. "GeoSRS: A hybrid social recommender system for geolocated data." Information Systems (2015).
Joan Capdevila, Jesús Cerquides, Jordi Nin, and Jordi Torres. "Tweet-SCAN: An event discovery technique for geo-located tweets." Proceedings of the 18th International Conference of the Catalan Association for Artificial Intelligence, 2015.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
Teh, Yee Whye, et al. "Hierarchical Dirichlet processes." Journal of the American Statistical Association 101.476 (2006).
Ester, Martin, et al. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD. Vol. 96. No. 34. 1996.
Amigó, Enrique, et al. "A comparison of extrinsic clustering evaluation metrics based on formal constraints." Information Retrieval 12.4 (2009): 461-486.
Melanie Swan. "Quantified Self Ideology. Personal data becomes Big Data." February 2014. Université Paris Descartes.
Akyildiz, Ian F., and Josep Miquel Jornet. "The internet of nano-things." Wireless Communications, IEEE 17.6 (2010): 58-63.
Luigi Atzori. A presentation on "The Social Internet of Things." University of Cagliari, Italy, 2012.