identifying prominent life events on twitter - k-cap 2015

Identifying Prominent Life Events on Twitter

SPEAKER – TOM DICKINSON

AUTHORS – TOM DICKINSON (OU - KMI), MIRIAM FERNANDEZ (OU - KMI), LISA THOMAS (NORTHUMBRIA), PAUL MULHOLLAND (OU - KMI), PAM BRIGGS (NORTHUMBRIA) AND HARITH ALANI (OU - KMI)

Quick Overview

Some Background

What we did

Discussion

Some Background

Why are we doing this?

As content creators, we post a lot of stuff on social media

This content can range from silly cats to important life events that have happened to us

However, as users, we effectively lose access to this information about ourselves and forget what’s there

By mining this data and presenting it back to users, we can give them a tool to aid self-reflection on their own online presence

So what are we doing?

As part of the Reellives project, we are looking at making short “Reels” from a user’s social media content

These reels are intended as mini documentaries about a user’s life on social media

This presents two main problems for us to solve:

◦ R.Q. 1) How can we extract meaningful events about ourselves from our social media data?

◦ R.Q. 2) How can we present these events in a cohesive narrative?

R.Q. 1 is being tackled by KMI, where we are looking at event extraction.

R.Q. 2 is being tackled by Edinburgh, who take our output as their input to construct narratives.

So what are we doing?

[Pipeline diagram: Social Media → Storage → Life Event Detection (Extraction, Events) → Story Generation (Story, Fabula, Narrative) → “Reel”]

What is a life event?

There already exists a large body of research on event detection on social media.

However, little of it focuses on life or personal events.

Semantically, they are no different:

◦ Both types will have a time and a location

◦ An action occurs

◦ The event is experienced by one or more agents

However:

◦ With general events we care more about the broader social and political significance

◦ With life events we care more about the personal significance

What is a Life Event?

We can also get some intuition for life events from Autobiographical Memory.

Autobiographical memory is a type of memory system that deals with specific events that happened to us

◦ This is opposed to semantic memory which is our knowledge of things

It can be modelled with three separate layers:

◦ Lifetime periods: “When I was at school I had my first kiss”

◦ General events: “I got married”

◦ Event-specific knowledge: “My tie was red at the wedding”

In our work, we can consider the event-specific knowledge to be reflected in social media posts

What We Did

Types of life events

To start off our research, we looked at identifying a finite number of life events.

The types of life events we chose are inspired by work done in Autobiographical Memory:

◦ S. M. Janssen and D. C. Rubin. Age effects in cultural life scripts. Applied Cognitive Psychology

Their research showed a common consensus, amongst different age groups, of 48 life events that would happen to a fictional child over the course of their life.

From this study, we selected 5 of the top events mentioned in the paper:

◦ Getting Married

◦ Having Children

◦ Starting School

◦ A Parent’s Death

◦ First Love

We also combined all tweets annotated as positive for “about an event” into one training set, to create a more general “Is this about an event?” classifier.

What we did – Data Collection

We chose Twitter because large datasets are easy to extract from it.

Our selection methodology was based around a simple keyword search: we took the root concepts for each of our events and expanded them with synonyms from WordNet.

We extracted Tweets from Twitter’s front-end search, as opposed to their API:

◦ This is due to their API having a 7 day limit

◦ Twitter now indexes every tweet, making it available to scrape from their front-end search application

Additional details were extracted for each Tweet, using their Lookup API with the extracted Tweet ID.
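The keyword selection step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the `SYNONYMS` table is a small hypothetical stand-in for WordNet lookups, and the function names and the whole-token matching rule are assumptions made for the example.

```python
# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {
    "marriage": ["wedding", "matrimony"],
    "school": ["schooling"],
}


def expand_keywords(root_concepts, synonyms):
    """Return each root concept followed by its synonyms, lowercased and deduplicated."""
    expanded = []
    seen = set()
    for concept in root_concepts:
        for term in [concept] + synonyms.get(concept, []):
            key = term.lower()
            if key not in seen:
                seen.add(key)
                expanded.append(key)
    return expanded


def matches_any(text, keywords):
    """True if any keyword appears as a whole whitespace-separated token in the text."""
    tokens = set(text.lower().split())
    return any(keyword in tokens for keyword in keywords)


keywords = expand_keywords(["marriage", "school"], SYNONYMS)
candidate = matches_any("Our wedding was lovely", keywords)
```

A real run would replace the table with actual WordNet synsets and would need proper tokenisation to cope with hashtags and punctuation.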

What we did - Annotations

To annotate our dataset, we turned to CrowdFlower

To start with, we ran several small trials of annotation exercises on CrowdFlower to make sure our questions were satisfactory

We initially had 7 questions:

◦ Is this tweet about Getting Married?

◦ Is this tweet about an event?

◦ Was the tweet before, during, or after the event?

◦ Is the author of the tweet experiencing the event?

◦ Is anyone else experiencing the event with the author?

◦ Is anyone else named in the tweet experiencing the event?

◦ Did the event happen where it was tweeted?

This did not prove too popular, as we had a large number of quiz failures

What we did - Annotations

The obvious failings of this initial test run were too many questions and possible subjectivity in our given definition of an event.

After another trial, we finally settled on asking only two questions:

◦ Q1 - Is this tweet related to a particular topic theme? (Topic theme is the cluster we extracted from)

◦ Q2 - Is this tweet about an important life event?

We also provided users a list of the 46 life events that Janssen and Rubin identified, as a way to help them understand what we were after.

This ran much better, and our final agreement ratings were 89.5% and 87.17% respectively
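One simple way to express agreement of this kind is the share of judgements on each tweet that match the majority label, averaged over all tweets. The sketch below illustrates that idea only; it is an assumption that the figures above are comparable, since CrowdFlower's own agreement score may be computed differently. The judgement data is hypothetical.

```python
from collections import Counter


def majority_agreement(judgements):
    """Fraction of judgements on one item that match its most common label."""
    counts = Counter(judgements)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(judgements)


def mean_agreement(items):
    """Average per-item majority agreement across all annotated items."""
    return sum(majority_agreement(j) for j in items) / len(items)


# Hypothetical judgements: three annotators per tweet.
example_items = [
    ["yes", "yes", "yes"],
    ["yes", "no", "yes"],
    ["no", "no", "no"],
]
score = mean_agreement(example_items)
```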

What we did – Feature Sets

Our feature sets were divided into several groups:

◦ User features: H1) Certain types of users may be more prone to share life events on Twitter

◦ Content features: H2) Posts written in a certain way may be related to life events

◦ Semantic features: H3) Posts about life events might be semantically associated with certain entities or concepts

◦ Interaction features: H4) Users who do not normally talk with the poster might start interacting for certain types of life events
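As an illustration of the content feature group, unigram counts can be turned into a flat feature vector along these lines. This is a sketch under assumptions: the tokeniser, the vocabulary construction, and the example tweets are all hypothetical, and the slides do not say how the actual content features were computed.

```python
import re
from collections import Counter

# Assumed tokeniser: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split a post into lowercase unigram tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(texts):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def unigram_vector(text, vocab):
    """Count vector over the vocabulary; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokenize(text)).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


# Hypothetical training posts.
training_posts = ["We got married today", "married life"]
vocab = build_vocabulary(training_posts)
features = unigram_vector("married married life", vocab)
```

User, semantic, and interaction features would be appended to the same vector as extra columns.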

What we did - Classifiers

We ended up just testing two classifiers, as other work had already tested a number of different classifiers on similar datasets:

◦ J48

◦ Naïve Bayes

We did try SVMs as well but, due to poor performance, omitted them from our results.

To evaluate we used 10-fold cross validation, reporting standard classification performance measures of Precision, Recall, and F1 scores.
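The evaluation set-up above can be sketched as follows: split the item indices into k folds, then score each held-out fold with precision, recall, and F1. This is a minimal sketch, not the tooling used in the study; the round-robin fold assignment has no shuffling or stratification, which a real evaluation would likely add, and the labels are hypothetical.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k folds assigned round-robin."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def precision_recall_f1(gold, predicted, positive=1):
    """Precision, recall, and F1 for the positive class; 0.0 where undefined."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical gold labels and predictions for one held-out fold.
gold_labels = [1, 1, 0, 0, 1]
predictions = [1, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(gold_labels, predictions)
splits = list(k_fold_indices(20, k=10))
```

In a full run, a classifier would be trained on each `train` list and its predictions on `test` scored, with the scores averaged over the ten folds.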

What we did - Results

Discussion

Why the dominance of content features?

Unigrams outstripped the performance of other feature sets.

This is consistent with similar papers, and slightly disappointing.

While the classifiers were biased towards the keywords chosen, it is disappointing other feature sets did not perform well.

In the case of interaction features this might be because:

◦ We were limited in what types of interaction features we could obtain, due to the limits of Twitter’s API

◦ The dataset might have been annotated incorrectly: for example, stories about other people may have been annotated, rather than people declaring an event about themselves

◦ Due to the nature of Twitter and its followers, interaction features might just not be a good discriminator. Sites like Facebook, which tend to be private, might perform better in this area

Choice of targeting specific life events

Targeting only five specific life events limits what we can actually extract from social media

Our binary classifier worked alright, but:

◦ Due to its dependency on unigrams, it will probably not perform very well outside of these 5 events

This is no silver bullet for solving our research question

Collecting the dataset

The collected dataset was biased towards certain words due to the keyword search

A better way to collect these datasets would be to randomly sample Twitter profiles, and annotate their timelines

However, it is likely that only a small number of tweets in a user’s timeline are actually about these types of events

To achieve a decent training set, we would need to annotate lots of tweets, which is very costly

Twitter and the annotation process

Using CrowdFlower is a great way to gain lots of annotations fast

However, with Twitter data we think the annotation is flawed for these types of questions

Lack of context

◦ Is a 140 character max text string enough context to annotate these types of events?

◦ Example: Is “MadJacks Forever Memories” about getting married? MadJacks is a wedding venue in Las Vegas, so it might be

First vs third party annotation

◦ While lack of context for a third party is an issue, if the owner of the tweet annotated it, would we get better results?

Extracting useful interaction features is difficult

◦ There is no API to get conversations for tweets. Mining this manually is possible, but annoying.

◦ You can’t get access to which users have favourited a tweet

Facebook would be better…

…but it has heavy privacy controls on access to user data

While this is great for users, it’s annoying for researchers

Retrieving content from Facebook must all be done within an application:

◦ These days, a User ID is hashed with your application ID

◦ If you have a standard user ID, you can’t use the Facebook Graph API to retrieve information about it

Asking people to just give us their Facebook data through a single sign-on approach isn’t ideal either

◦ Users are reluctant to just give researchers their private data

◦ What do they get out of it? (besides the results of the research)

Is Instagram the middle ground?

Like Twitter, there are a lot of open Instagram accounts

◦ Sites like websta.me index large numbers of users and offer tag based search

Like Twitter, it is (currently) easy to extract Instagram data:

◦ While the API, like Twitter’s, is limited, it is possible to extract full user profiles

◦ Instagram works with a REST based architecture, returning user posts in JSON feeds that can be paginated allowing full extraction of posts

◦ Using the API each post can be augmented with additional information not available in the media stream
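Extracting a full profile from a paginated JSON feed, as described above, usually comes down to following a cursor until the feed runs out. The sketch below shows that loop only. The field names `items` and `next_cursor`, and the `fake_fetch_page` function standing in for an HTTP request, are hypothetical and do not describe Instagram's actual response format.

```python
def fetch_all_posts(fetch_page):
    """Collect every post by following page cursors until none is returned."""
    posts = []
    cursor = None
    while True:
        page = fetch_page(cursor)
        posts.extend(page["items"])
        cursor = page.get("next_cursor")
        if cursor is None:
            break
    return posts


# Hypothetical two-page feed, keyed by the cursor that requests each page.
FAKE_FEED = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "page-2"},
    "page-2": {"items": [{"id": 3}], "next_cursor": None},
}


def fake_fetch_page(cursor):
    """Stand-in for an HTTP call that returns one decoded JSON page."""
    return FAKE_FEED[cursor]


all_posts = fetch_all_posts(fake_fetch_page)
```

Passing the page fetcher in as a function keeps the loop testable without network access; a real fetcher would also need rate limiting and error handling.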

While we think of Instagram as only photos, most photos have short captions of similar length to tweets

◦ Comments can also provide semantic context

Future Work

We are currently looking at collecting Instagram and Facebook data for future experiments

◦ Facebook data is being collected with a trivial app that users can use

Unsupervised life event detection

◦ As opposed to targeting specific events, being able to extract any type would be of more value

◦ Currently we are looking at knowledge based approaches using ConceptNet to achieve this

Graph Classification of Posts

◦ So far we have employed fairly flat vectors when considering feature sets

◦ An alternative is to treat posts as graphs, looking at relationships from semantic resources (ConceptNet, DBpedia, etc.), interactions, and dependency parses

◦ Graph frequent pattern mining might identify new feature sets that we can look at using