Towards Context-Aware Search and Analysis on Social Media Data
DESCRIPTION
Social media has changed the way we communicate. Social media data capture our social interactions and utterances in machine-readable format. Searching and analysing massive and frequently updated social media data brings significant and diverse rewards across many different application domains, from politics and business to social science and epidemiology. A notable proportion of social media data comes with explicit or implicit spatial annotations, and almost all social media data has temporal metadata. We view social media data as a constant stream of data points, each containing text with spatial and temporal contexts. We identify challenges relevant to each context, which we intend to subject to context-aware querying and analysis, specifically including longitudinal analyses on social media archives, spatial keyword search, local intent search, and spatio-temporal intent search. Finally, for each context, emerging applications and further avenues for investigation are discussed.
TRANSCRIPT
Towards Context-Aware Search and Analysis on
Social Media Data
Leon Derczynski
Bin Yang 杨彬
Christian S. Jensen
Evolution of communication
Functional utterances
Vowels
Velar closure: consonants
Speech
New modality: writing
Digital text
Social media
Increased machine-readable information?
Social Media = Big Data
Gartner ''3V'' definition:
1. Volume
2. Velocity
3. Variety
High volume & velocity of messages:
Twitter has ~20 000 000 users per month
They write ~500 000 000 messages per day
Massive variety: Stock markets; Earthquakes; Social arrangements; … Bieber
What is machine-readable now?
Messages now contain
- not only linguistic content
- but also:
Links (e.g. URI)
Topic markers (e.g. hashtags)
Meta-information
What kind of meta-information?
User profile (including home location)
Images
Messages replied to
Message language
Time of message
Location of message
What resources do we have now?
Large, content-rich, linked, digital streams of human communication
We transfer knowledge via communication
Sampling communication gives a sample of human knowledge
''You've only done that which you can communicate''
The metadata (time – place – imagery) gives a richer resource:
→ A sampling of human behaviour
What can we do with this resource?
Context increases the data's richness
Increased richness enables novel applications
Time and Place are interesting parts of message context
1. What kinds of applications are there?
2. What are the practical challenges?
Temporal Context
Messages have timestamps:
Two temporal retrieval scenarios:
1. Historical analyses
2. Emerging data
Historical search
Ability to retrieve from archives: Longitudinal query mode 0
Retrieve information on:
● Lifecycle of socially connected groups
● Analyse precursors to events, post-hoc
0. Weikum et al. 2011: ''Longitudinal analytics on web archive data: It’s about time'', Proc. CIDR
Historical search
Retrospective analyses into cause and effect
Social media mentions of dead crows predict WNV in humans 1
''There's a dead crow in my garden''
1. Sugumaran & Voss 2012: ''Real-time spatio-temporal analysis of West Nile Virus using Twitter Data'', Proc. Int'l conference on Computing for Geospatial Research and Applications
Emerging search
Data emerging at high velocity:
185 000 documents per minute
Gives a high temporal density
Search over this info enables:
● Live coverage of events
● Realtime identification of emerging events 2
2. Cohen et al. 2011: ''Computational journalism: A call to arms to database researchers'', Proc. CIDR
Temporal indexing
What are our requirements?
● High-frequency document creation
● Temporal cross-sections of varying size
● Time-sensitive TF/IDF: stopwords are fluid
How can we do this? - Open challenge
● Tree indexing hard to distribute
● Maybe with adaptive multi-resolution grids?
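The requirements listed above (temporal cross-sections of varying size, and term weights that shift over time) can be illustrated with a minimal sketch. It is not a solution to the open challenge: it is a single-machine toy that assumes messages arrive in timestamp order, uses whitespace tokenisation, and picks one common smoothed IDF variant. The class name TemporalIndex and its methods are hypothetical names chosen for this example.

```python
import math
from bisect import bisect_left, bisect_right
from collections import Counter


class TemporalIndex:
    """Append-only store of (timestamp, tokens); assumes in-order arrival."""

    def __init__(self):
        self.timestamps = []  # stays sorted because messages arrive in order
        self.docs = []        # token sets, parallel to self.timestamps

    def add(self, timestamp, text):
        if self.timestamps and timestamp < self.timestamps[-1]:
            raise ValueError("this sketch assumes in-order arrival")
        self.timestamps.append(timestamp)
        self.docs.append(set(text.lower().split()))

    def window(self, start, end):
        """Temporal cross-section: documents with start <= timestamp <= end."""
        lo = bisect_left(self.timestamps, start)
        hi = bisect_right(self.timestamps, end)
        return self.docs[lo:hi]

    def idf(self, start, end):
        """Smoothed IDF computed only over the given time window."""
        docs = self.window(start, end)
        n = len(docs)
        df = Counter(term for doc in docs for term in doc)
        return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


# Hypothetical messages with integer timestamps.
index = TemporalIndex()
index.add(1, "rain in london")
index.add(2, "rain again")
index.add(3, "rain rain rain")
index.add(10, "election results tonight")
index.add(11, "election night coverage")
index.add(12, "rain during election")

early = index.idf(1, 3)    # "rain" appears in every document here
late = index.idf(10, 12)   # "election" appears in every document here
```

In the early window ''rain'' behaves like a stopword (lowest weight), while in the late window ''election'' takes that role and ''rain'' becomes informative again. A distributed, high-velocity version is exactly what the slide leaves open.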
Spatial Context
Demand for spatial information:
20% of all Google searches
53% of Bing mobile searches
Heterogeneous spatial context sources
GPS locations (most reliable)
Origin bounding boxes (e.g. city)
User profile text??? 3
Author's friends' locations 4
3. Hecht et al. 2011: ''Tweets from Justin Bieber’s Heart: The Dynamics of the “Location” Field in User Profiles'', Proc. ACM CHI
4. Rout et al. 2013: ''Where's @wally? A Graph Based Method for Geolocating Users in Social Networks'', Proc. ACM Hypertext
Spatial Keyword Search
How can we query a set of social media messages?
Treat as a set of objects, each having:
Text
Location
Query parameters:
Query text
Query location
Given query and set of messages, rank by similarity:
Text similarity (Cosine, Siamese Learning Net, Oriented PCA)
Separating distance (Haversine, Manhattan, Eco-routed)
Blend these with a balancing coefficient
(just like conventional spatial keyword search)
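As a rough sketch of the ranking described above, the following combines two of the named options, cosine text similarity and Haversine distance, using a balancing coefficient. Several details are assumptions made for illustration and are not from the slides: bag-of-words cosine over whitespace tokens, a linear mapping from distance to a [0, 1] proximity score, the max_km cut-off, and the function names themselves.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cosine_similarity(text_a, text_b):
    """Cosine similarity over raw term counts (whitespace tokenisation)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def spatial_proximity(query_loc, msg_loc, max_km):
    """Assumed mapping: 1 at the query point, falling to 0 at max_km."""
    return max(0.0, 1.0 - haversine_km(query_loc, msg_loc) / max_km)


def score(query_text, query_loc, msg_text, msg_loc, alpha=0.5, max_km=10.0):
    """Blend: alpha weights location, (1 - alpha) weights text."""
    return (alpha * spatial_proximity(query_loc, msg_loc, max_km)
            + (1 - alpha) * cosine_similarity(query_text, msg_text))


# Made-up coordinates, for illustration only.
query_loc = (55.70, 12.56)
near = score("good bar", query_loc, "found a good bar", (55.70, 12.57))
far = score("good bar", query_loc, "found a good bar", (56.50, 12.57))
```

Here the same message text scores higher when posted close to the query location than when posted far away. Swapping in Manhattan or routed distance, or a learned text similarity, only changes the two component functions.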
Spatial Keyword Search
Query: ''good bar in north copenhagen''
Issued from location
Five candidate messages
Query region established
Rank by blend of location and textual similarity
Message                                          Location  Text
A  So drunk last night at @BarSyv                0.7       0.6
B  Out shoe shopping!!! #louboutintime           0.9       0.0
C  Who pays $9 for a beer?!                      0.6       0.5
D  wow found cph's greatest cocktail bar lol     0.1       1.0
E  Traffic. Traffic everywhere. Need a drink.    0.4       0.2
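To make the effect of the balancing coefficient concrete, this small sketch ranks the five candidate messages using their location and text scores as given in the slide. The linear blend and the alpha values tried are assumptions; the slides do not state which coefficient was used.

```python
# (location score, text score) per candidate message, copied from the slide.
scores = {
    "A": (0.7, 0.6),
    "B": (0.9, 0.0),
    "C": (0.6, 0.5),
    "D": (0.1, 1.0),
    "E": (0.4, 0.2),
}


def rank(scores, alpha):
    """Rank message ids; alpha weights location, (1 - alpha) weights text."""
    blended = {m: alpha * loc + (1 - alpha) * txt
               for m, (loc, txt) in scores.items()}
    return sorted(blended, key=blended.get, reverse=True)


for alpha in (0.0, 0.5, 1.0):
    print(alpha, rank(scores, alpha))
```

With text only (alpha = 0) message D ranks first, with location only (alpha = 1) message B ranks first, and with an even blend (alpha = 0.5) message A comes out on top, since it is reasonably good on both dimensions.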
Continuous Spatial Queries
Social media scenario characterised by:
Streaming data
New spatial objects constantly appearing
Two new spatial keyword query types:
Static Continuous (SCSKQ)
- Fixed query location
- Tracks newly appearing objects
Moving Continuous (MCSKQ)
- Query location transits locus
- Result updated with new objects
Novel part: fresh objects continuously introduced
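One way to picture the static continuous case is a query object that holds a fixed location and keyword set and keeps its top-k result up to date as each new message arrives. The sketch below is illustrative only and is not the method from the talk: the class name, the keyword-overlap text score, the planar distance, and the max_dist cut-off are all assumptions.

```python
import heapq
import math


class StaticContinuousQuery:
    """Keeps the k best-scoring messages seen so far for a fixed query."""

    def __init__(self, keywords, location, k=3, alpha=0.5, max_dist=10.0):
        self.keywords = {w.lower() for w in keywords}
        self.location = location
        self.k = k
        self.alpha = alpha
        self.max_dist = max_dist
        self._heap = []  # min-heap of (score, arrival number, text)
        self._seen = 0

    def _score(self, text, location):
        words = set(text.lower().split())
        text_sim = (len(words & self.keywords) / len(self.keywords)
                    if self.keywords else 0.0)
        dist = math.dist(self.location, location)  # planar distance (assumed)
        proximity = max(0.0, 1.0 - dist / self.max_dist)
        return self.alpha * proximity + (1 - self.alpha) * text_sim

    def observe(self, text, location):
        """Feed one newly arrived message; return True if the result changed."""
        self._seen += 1
        entry = (self._score(text, location), self._seen, text)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return True
        if entry[0] > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the current worst
            return True
        return False

    def results(self):
        """Current top-k texts, best first."""
        return [text for _, _, text in sorted(self._heap, reverse=True)]


# Hypothetical stream of (text, location) messages.
q = StaticContinuousQuery(["bar", "cocktail"], (0.0, 0.0), k=2)
changes = [
    q.observe("great cocktail bar", (1.0, 0.0)),
    q.observe("shoe shopping", (0.0, 0.0)),
    q.observe("traffic everywhere", (9.0, 0.0)),
    q.observe("nice bar", (0.0, 2.0)),
]
```

Because the query location is fixed, each message only needs to be scored once on arrival. In the moving case the spatial part of every retained score changes as the query location moves, so a sketch like this would have to rescore or re-fetch candidates, which is where much of the difficulty lies.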
Location Diversity
Location data unreliable
Reliability of location data... is also unreliable
''There are known knowns.. we also know there are known unknowns.. but there are also unknown unknowns'' – Donald Rumsfeld
Text mentions require disambiguation
● In profile
● In messages
● In queries
Requirement is to rank vague points given vague query
Willingness to travel
Determines useful search radius
Based on mode of transport:
Different for varying classes of Point Of Interest?
ST Social media = huge dataset
Easy data collection
Useful for e.g. town planning
Figure: distances of 14.9 km, 22.0 km, 40.6 km, 61.5 km and >100 km
Spatio-temporal Challenges
We've seen temporal and spatial challenges; let's combine!
Given all these spatio-temporal utterances, what can we do?
- Spatial gives relevance from physical or travel proximity
- Temporal gives relevance from recency and history
Adding text to the spatio-temporal points gives
explicit semantic context
Not only are ST patterns in the data, we are told what they mean!
Topic-based Retrieval
Retrieving results on a topic is useful; ''Tell me about X''
Specific terms vary between places and over time
… Spatio-temporally sensitive indexing?
Over time: en.wikipedia.org/wiki/President_of_the_United_States, 2007 vs 2011
Over place: ''Jelly'' in England English vs US English
Sentiment Monitoring
Measure how attitudes change over time and over location
Business uses: where to send marketing
Political uses: data-driven democratic.. campaigning
Governance uses: what are citizen priorities in a region
Temporal dimension enables tracking of trends and reactions
red = upbeat;
blue = complaint.
- no normalisation for vocality!
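The caveat about vocality can be made concrete with a small sketch. It assumes that ''vocality'' refers to how many messages an author produces, and that each message already carries a polarity score in [-1, 1] from some sentiment classifier that is not shown here. The function name, the tuple layout, and the hourly bucket size are hypothetical.

```python
from collections import defaultdict


def sentiment_by_cell(messages, bucket_seconds=3600):
    """Mean polarity per (region, time bucket), averaging per author first.

    messages: iterable of (author, region, timestamp, polarity).
    Averaging per author first stops one very vocal author from
    dominating a cell.
    """
    per_author = defaultdict(list)
    for author, region, ts, polarity in messages:
        cell = (region, int(ts // bucket_seconds))
        per_author[(cell, author)].append(polarity)

    per_cell = defaultdict(list)
    for (cell, _author), values in per_author.items():
        per_cell[cell].append(sum(values) / len(values))

    return {cell: sum(v) / len(v) for cell, v in per_cell.items()}


# Hypothetical data: u1 complains three times, u2 and u3 are upbeat once each.
messages = [
    ("u1", "north", 0, -1.0),
    ("u1", "north", 10, -1.0),
    ("u1", "north", 20, -1.0),
    ("u2", "north", 30, 1.0),
    ("u3", "north", 40, 1.0),
    ("u4", "south", 4000, 1.0),
]
result = sentiment_by_cell(messages)
```

A raw average over the five ''north'' messages would come out negative, because one author posted three complaints. Averaging per author first gives a mildly positive cell, reflecting that two of the three authors were upbeat. Other normalisations, for example by regional population or posting rate, are equally plausible readings of the slide.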
Local Computational Journalism
Social media is quick
Social media is uncurated
''Citizen Journalism''
News has relevance scope:
Recency
Proximity
Different events relevant in different contexts:
Rain in London
Rain in Addis Ababa
Automatic event detection 5 - and also reporting!
5. Ritter et al. 2012: ''Open domain event extraction from Twitter'', Proc. ACM SIGKDD
Summary
Social media is a rich source of ''big data''
A small sampling of all human discourse
It comes with temporal and spatial context
Context-aware search and analysis is very demanding!
- Novel, powerful applications
- Wide variety of domains
- An open set of challenges
Thank you!
Thank you for listening!
Do you have any questions?