Ignorance isn't Bliss: An Empirical Analysis of Attention Patterns in Online Communities
Presented at the ASE/IEEE International Conference on Social Computing 2012 in Amsterdam
Claudia Wagner, Matthew Rowe, Markus Strohmaier and Harith Alani
Amsterdam, 16.4.2012
with…
Matthew Rowe
Markus Strohmaier
Harith Alani
Motivation
Which factors impact how much attention a post gets?
We use the number of replies as a proxy measure of attention
Research Questions
Which factors impact the attention level a post gets in certain community forums?
How do these factors differ between individual community forums?
Methodology
Empirical study of attention patterns in 20 randomly selected forums
Two-stage approach
Differentiate between threadstarter posts that got at least one reply (seed posts) and threadstarter posts that got no replies at all (non-seed posts)
Predict the level of attention that seed posts will generate - i.e. the number of replies
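The two stages can be sketched as follows. This is a minimal illustration rather than the authors' code; the `Post` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Post:
    post_id: str      # hypothetical identifier of a threadstarter post
    reply_count: int  # number of replies the post received


def split_seed_posts(posts):
    """Stage 1 labels: seed posts have at least one reply, non-seed posts have none."""
    seeds = [p for p in posts if p.reply_count >= 1]
    non_seeds = [p for p in posts if p.reply_count == 0]
    return seeds, non_seeds


def rank_by_attention(seeds):
    """Stage 2 target: seed posts ordered by reply count, most replies first."""
    return sorted(seeds, key=lambda p: p.reply_count, reverse=True)


posts = [Post("a", 0), Post("b", 3), Post("c", 1), Post("d", 0)]
seeds, non_seeds = split_seed_posts(posts)
ranking = rank_by_attention(seeds)
```

Stage 1 is a binary classification problem over all threadstarter posts, while stage 2 only considers the posts labelled as seeds.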
Dataset
Most popular Irish message board, Boards.ie
725 Forums
Year 2005 and 2006
Feature Engineering
Aim
Identify the features that impact upon seeding a discussion
Identify features associated with seed posts that generate the most attention
Five Feature Groups
User Features: user account age, post count, in-degree, out-degree, post rate
Content Features: post length, complexity, readability, link count, time in day, informativeness, polarity
Title Features: length, question marks, linguistic dimensions (LIWC)
Focus Features: forum entropy, forum likelihood, topic entropy, topic likelihood, topic distance
Community Features: topical community fit, topical community distance, evolution score, inequity score
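The slides name forum entropy as a focus feature without defining it. One plausible reading, assumed here and not confirmed by the slides, is the Shannon entropy of the distribution of a user's posts over forums: low values mean the user concentrates on few forums, high values mean activity is spread widely. The function name and the forum labels below are illustrative only.

```python
import math
from collections import Counter


def forum_entropy(forum_ids):
    """Shannon entropy (in bits) of a user's post distribution over forums.

    ASSUMPTION: this definition is a guess at what the slides call
    "forum entropy"; the slides themselves do not spell it out.
    """
    counts = Counter(forum_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A user who posts in a single forum versus one who spreads posts evenly over four.
focused_user = ["golf"] * 8
spread_user = ["golf", "jobs", "astronomy", "spanish"] * 2

focused_entropy = forum_entropy(focused_user)
spread_entropy = forum_entropy(spread_user)
```

Under this reading, the same computation over LDA topic assignments instead of forum identifiers would give a topic entropy.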
Feature Computation
For each threadstarter post published in one of the 20 randomly selected forums in 2006 we computed our 28 features
[Figure: timeline over 2005 and 2006; labels: m1, 6 month]
Fit LDA model with standard parameters: T=50, beta=0.01, alpha=50/T
Seed Post Identification Experiment
Identify Posts which got replies (Binary Classification Task)
Split data of each forum into train and test data (80/20)
Train a logistic regression classifier with each feature group in isolation and all features combined
Compare performance by using F1 score and the Matthews correlation coefficient (MCC)
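Both evaluation measures can be computed from the confusion counts of a binary classifier. The sketch below uses plain Python and made-up labels (1 = seed post, 0 = non-seed post); it illustrates the metrics only and does not reproduce the study's classifier or data.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when any marginal count is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative labels only, not taken from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
matthews = mcc(tp, tn, fp, fn)
```

Unlike F1, MCC also takes true negatives into account, which makes it a useful complement when seed and non-seed posts are imbalanced.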
Seed Post Identification Results
For these 9 forums our classifiers outperform the random baseline:
Astronomy & Space: a classifier trained with content features alone performs best
Spanish: a classifier trained with title features alone performs best
Seed Post Identification Feature Impact
Analyze impact of individual features rather than groups
Interpret statistically significant coefficients of the best performing feature group learned by the logistic regression model
Rank the features of the best performing feature group using the Information Gain Ratio (IGR) as a ranking criterion
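The Information Gain Ratio of a feature is its information gain about the class label divided by the entropy of the feature itself. The sketch below computes it for a discrete feature; it assumes that continuous features such as title length have been binned first, which the slides do not describe, and the example values are invented.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def information_gain_ratio(feature_values, labels):
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting the posts by feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


# Invented example: a binned title length and whether the post got a reply.
got_reply = [1, 1, 0, 0]
informative = ["short", "short", "long", "long"]
uninformative = ["short", "long", "short", "long"]

igr_informative = information_gain_ratio(informative, got_reply)
igr_uninformative = information_gain_ratio(uninformative, got_reply)
```

Features are then ranked by this score in descending order, so a feature that separates seed from non-seed posts well appears at the top.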
Seed Post Identification Observations
In the Spanish community, title length is the most important feature (IGR=0.558, coef=-0.326)
Posts with long titles are less likely to get replies
In the Bank & Insurance forum short but complex posts which are authored by newbies are most likely to get replies
Content length coef=-0.017, p<0.05
Topic distance coef=2.890, p<0.01
Complexity has highest IGR (IGR=0.354)
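A logistic regression coefficient can be read as a change in log-odds, so exponentiating it gives an odds ratio. The sketch below applies this to the coefficients reported above. It assumes these are ordinary log-odds coefficients per unit of the feature; the slides do not say whether features were scaled, so the magnitudes should be read with care.

```python
import math


def odds_ratio(coef):
    """Multiplicative change in the odds of getting a reply per unit increase of a feature."""
    return math.exp(coef)


# Coefficients as reported on the slides.
title_length_spanish = odds_ratio(-0.326)   # below 1: longer titles lower the odds
content_length_bank = odds_ratio(-0.017)    # just below 1: longer posts slightly lower the odds
topic_distance_bank = odds_ratio(2.890)     # well above 1: higher topic distance raises the odds
```

An odds ratio below 1 corresponds to a negative coefficient and an odds ratio above 1 to a positive one, which matches the direction of the observations above.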
Seed Post Identification Observations
The number of links has a negative impact in the Work & Jobs and Golf forums, but a positive impact in the Astronomy & Space forum
Purpose of community
Links have a positive impact in content and information driven communities
Links have a negative impact in other communities
Seed Post Identification Observations
Some communities require posts to fit to the topics they usually discuss (e.g., Golf) while others are more open to diverse topics (e.g., Work & Jobs)
Specificity of community’s subject
Subject of Work & Jobs forum is very general, so high topical community distance has a positive impact
Subject of Golf forum is very specific, so high topical community distance has a negative impact
Activity Level Prediction Experiment
Identify the features that were correlated with lengthy discussions
Rank posts according to their attention level
Evaluate our predicted rank using normalized Discounted Cumulative Gain (nDCG) at varying rank positions i.e. top-k where k={1, 5, 10, 20, 50, 100}
nDCG = DCG of the predicted ranking divided by the DCG of the actual ranking
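A minimal sketch of nDCG at rank k, written to match the definition above. It assumes that the gain of a post is its actual number of replies and that the discount is log2(rank + 1); the slides do not state which DCG variant was used, and the scores and reply counts below are invented.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k items, with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(predicted_scores, true_replies, k):
    """DCG of the predicted ranking divided by the DCG of the actual ranking."""
    # Order posts by predicted score, then read off their actual reply counts.
    order = sorted(range(len(true_replies)), key=lambda i: predicted_scores[i], reverse=True)
    predicted_gains = [true_replies[i] for i in order]
    ideal_gains = sorted(true_replies, reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(predicted_gains, k) / ideal if ideal > 0 else 0.0


# Invented example: four seed posts with their actual reply counts.
true_replies = [10, 0, 3, 1]
good_scores = [0.9, 0.1, 0.5, 0.3]  # ranks the posts in their actual order
poor_scores = [0.1, 0.9, 0.5, 0.3]  # puts the post without replies first

ndcg_good = ndcg_at_k(good_scores, true_replies, k=4)
ndcg_poor = ndcg_at_k(poor_scores, true_replies, k=4)
```

Calling `ndcg_at_k` with k in {1, 5, 10, 20, 50, 100} and averaging over forums would give values comparable in form to those reported in the results.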
Activity Level Prediction Results
Averaged normalised Discounted Cumulative Gain (nDCG): a value of 1 indicates that the predicted ranking of posts perfectly matched their real ranking.
Activity Level Prediction Results
For the Astronomy & Space community content features were best for identifying seed posts and are also best for ranking posts according to the attention level they will generate.
Activity Level Prediction Results
Golf forum (343): a combination of all features worked best for identifying seed posts. Focus features alone are best for ranking posts.
Activity Level Prediction Results
Bank & Insurance forum (544): a combination of all features worked best for identifying seed posts. Community features alone are best for ranking posts.
Activity Level Prediction Summary
Factors that impact discussion initiation often differ from the factors that impact discussion length
e.g. for the Golf community
Seed Posts = all features
Activity level = focus features
Activity Level Prediction Summary
Factors that are associated with lengthy discussion tend to be different for different communities
Title length is the only feature with a small but statistically significant positive impact on the number of replies a post gets across several communities
Work & Jobs forum: title length coef=0.034, p<0.01
Satellite forum: title length coef=0.030, p<0.05
Conclusions (1)
Different community forums exhibit interesting differences in terms of how attention is generated
Most attention patterns which we identified are local and community-specific
“Global” patterns may depend heavily on the composition of the dataset
Conclusions (2)
The same features that have a positive impact on the start of discussions in one community can have a negative impact in another community
Example: number of links
Negative impact in most communities
Positive impact in information and content driven communities
Conclusions (3)
Purpose of community and specificity of community’s subject may impact their reply behavior
Communities which have a supportive purpose are most likely driven by different factors than communities with an informational purpose.
Communities around very specific topics require posts to fit to the topical focus. Communities around more general topics do not have this requirement.
Limitations & Future Work
Correlation versus Causality
We cannot answer the “what would have happened if” question with our approach
Controlled experiments where platform is manipulated
Most attention patterns are local. But how local?
Can we automatically identify the context in which attention patterns may hold?
Experimental Setup
THANK YOU
[email protected]
claudiawagner.info
src: http://adobeairstream.com/green/a-natural-predicament-sustainability-in-the-21st-century/
Attention patterns tend to be local and community-specific. Ignoring communities’ idiosyncrasies isn’t bliss.