Ignorance isn't Bliss: An Empirical Analysis of Attention Patterns in Online Communities Claudia Wagner, Matthew Rowe, Markus Strohmaier and Harith Alani Amsterdam, 16.4.2012

Upload: claudia-wagner

Post on 27-Jan-2015

DESCRIPTION

Presented at the ASE/IEEE International Conference on Social Computing 2012 in Amsterdam

TRANSCRIPT

Page 1: Ignorance isn't Bliss: An Empirical Analysis of Attention Patterns in Online Communities

Claudia Wagner, Matthew Rowe, Markus Strohmaier and Harith Alani

Amsterdam, 16.4.2012

Page 2

with…

Matthew Rowe

Markus Strohmaier

Harith Alani

Page 3

Motivation

Which factors impact how much attention a post gets?

We use the number of replies as a proxy measure of attention

Page 4

Research Questions

Which factors impact the attention level a post gets in certain community forums?

How do these factors differ between individual community forums?

Page 5

Methodology

Empirical study of attention patterns in 20 randomly selected forums

Two-stage approach

Differentiate between threadstarter posts that got at least one reply (seed posts) and threadstarter posts that got no replies at all (non-seed posts)

Predict the level of attention that seed posts will generate - i.e. the number of replies
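The two stages can be sketched as follows. This is a minimal illustration rather than the authors' code; the `Thread` record, its `reply_count` field and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    """Hypothetical record for a threadstarter post."""
    post_id: str
    reply_count: int


def split_seed_posts(threads):
    """Stage 1: separate seed posts (>= 1 reply) from non-seed posts (0 replies)."""
    seeds = [t for t in threads if t.reply_count >= 1]
    non_seeds = [t for t in threads if t.reply_count == 0]
    return seeds, non_seeds


def rank_by_attention(seeds):
    """Stage 2 target: seed posts ordered by number of replies, most replies first."""
    return sorted(seeds, key=lambda t: t.reply_count, reverse=True)


# Made-up sample data for illustration only.
threads = [Thread("a", 0), Thread("b", 3), Thread("c", 1), Thread("d", 0)]
seeds, non_seeds = split_seed_posts(threads)
ranked_ids = [t.post_id for t in rank_by_attention(seeds)]
```

Stage 1 yields the labels for the binary classification task, and the ordering from stage 2 is the ground truth that a predicted ranking is compared against.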

Page 6

Dataset

Most popular Irish message board, Boards.ie

725 Forums

Year 2005 and 2006

Page 7

Page 8

Feature Engineering

Aim

Identify the features that impact upon seeding a discussion

Identify features associated with seed posts that generate the most attention

Five Feature Groups

Page 9

Five Feature Groups

User Features

user account age, post count, in-degree, out-degree, post rate

Content Features

post length, complexity, readability, link count, time of day, informativeness, polarity

Title Features

length, question marks, linguistic dimensions (LIWC)

Focus Features

forum entropy, forum likelihood, topic entropy, topic likelihood, topic distance

Community Features

topical community fit, topical community distance, evolution score, inequity score
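The slides do not define the individual features. As one plausible reading, forum entropy can be taken as the Shannon entropy of how a user's past posts are spread over forums, so that a low value indicates a focused user. The sketch below works under that assumption; the function name and the input format are hypothetical.

```python
import math
from collections import Counter


def forum_entropy(forum_ids):
    """Shannon entropy (in bits) of a user's posts across forums.

    forum_ids: one forum identifier per post the user has written.
    Assumed definition; the slides only name the feature.
    """
    counts = Counter(forum_ids)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


focused = forum_entropy(["golf", "golf", "golf", "golf"])
spread = forum_entropy(["golf", "golf", "jobs", "jobs"])
```

A user who posts in a single forum gets an entropy of 0, while a user who splits posts evenly over two forums gets 1 bit.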

Page 10

Feature Computation

For each threadstarter post published in one of the 20 randomly selected forums in 2006 we computed our 28 features

[Figure: timeline of 2005 and 2006; labels: m1, 6 month]

Fit LDA model with standard parameters: T=50, beta=0.01, alpha=50/T

Page 11

Seed Post Identification Experiment

Identify Posts which got replies (Binary Classification Task)

Split data of each forum into train and test data (80/20)

Train a logistic regression classifier with each feature group in isolation and all features combined

Compare performance by using F1 score and the Matthews correlation coefficient (MCC)
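Both evaluation measures can be computed from the four cells of a confusion matrix. The sketch below uses the standard definitions of F1 and MCC; the counts in the example are made up for illustration and are not results from the study.

```python
import math


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) for a binary classifier from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Illustrative counts only: 40 seed posts found, 10 missed,
# 10 non-seed posts wrongly flagged, 40 correctly rejected.
f1, mcc = f1_and_mcc(tp=40, fp=10, fn=10, tn=40)
```

F1 ignores true negatives, whereas MCC uses all four cells and is 0 for a classifier that does no better than chance, which makes it a useful companion when seed and non-seed posts are unevenly distributed.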

Page 12

Seed Post Identification Results

For these 9 forums our classifiers outperform the random baseline:

Astronomy & Space: a classifier trained with content features alone performs best

Spanish: a classifier trained with title features alone performs best

Page 13

Seed Post Identification Feature Impact

Analyze impact of individual features rather than groups

Interpret statistically significant coefficients of the best performing feature group learned by the logistic regression model

Rank the features of the best performing feature group using the Information Gain Ratio (IGR) as a ranking criterion
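As a rough sketch of the ranking criterion, the Information Gain Ratio of a feature is its information gain about the class label divided by the entropy of the feature itself (the split information). The code below assumes the feature has already been discretised; how the authors binned continuous features is not stated in the slides, and the toy data is made up.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(values).values())


def information_gain_ratio(feature_values, labels):
    """Information gain of the feature about the labels, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return gain / split_info if split_info else 0.0


# Toy example: 1 = seed post, 0 = non-seed post.
labels = [1, 1, 0, 0]
perfect = information_gain_ratio(["short", "short", "long", "long"], labels)
useless = information_gain_ratio(["short", "long", "short", "long"], labels)
```

A feature that separates the classes perfectly scores 1, and a feature that carries no information about the label scores 0.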

Page 14

Seed Post Identification Observations

In the Spanish community, title length is the most important feature (IGR=0.558, coef=-0.326)

Posts with long titles are less likely to get replies

In the Bank & Insurance forum short but complex posts which are authored by newbies are most likely to get replies

Content length coef=-0.017, p<0.05

Topic distance coef=2.890, p<0.01

Complexity has highest IGR (IGR=0.354)
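One way to read the reported coefficients: in logistic regression a coefficient is the change in the log-odds of getting a reply per unit increase of the feature, so exp(coef) is the factor by which the odds are multiplied. The sketch below applies this to the values quoted above. Whether the features were standardised before fitting is not stated in the slides, so the size of "one unit" is an assumption.

```python
import math

# Coefficients as quoted on this slide; the dictionary keys are descriptive labels only.
coefficients = {
    "Spanish: title length": -0.326,
    "Bank & Insurance: content length": -0.017,
    "Bank & Insurance: topic distance": 2.890,
}

# exp(coef) is the multiplicative change in the odds of receiving a reply
# for a one-unit increase in the feature, all other features held fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
```

An odds ratio below 1 means the feature lowers the odds of a reply (longer titles in the Spanish forum), and a ratio above 1 means it raises them (larger topic distance in Bank & Insurance).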

Page 15

Seed Post Identification Observations

Number of links has a negative impact in the Work & Jobs and Golf forums, but a positive impact in the Astronomy & Space forum

Purpose of community

Links have a positive impact in content and information driven communities

Links have a negative impact in other communities

Page 16

Seed Post Identification Observations

Some communities require posts to fit to the topics they usually discuss (e.g., Golf) while others are more open to diverse topics (e.g., Work & Jobs)

Specificity of community’s subject

Subject of Work & Jobs forum is very general: high topical community distance has a positive impact

Subject of Golf forum is very specific: high community distance has a negative impact

Page 17

Activity Level Prediction Experiment

Identify the features that were correlated with lengthy discussions

Rank posts according to their attention level

Evaluate our predicted rank using normalized Discounted Cumulative Gain (nDCG) at varying rank positions i.e. top-k where k={1, 5, 10, 20, 50, 100}

nDCG = DCG of the predicted ranking divided by the DCG of the actual ranking
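The evaluation measure can be sketched as below. The slides do not say which DCG variant was used; this sketch assumes the common form with the reply count as gain and a log2 position discount. The function names and the example values are hypothetical.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(predicted_scores, reply_counts, k):
    """nDCG@k: DCG of the predicted ranking divided by DCG of the actual ranking."""
    # Order posts by predicted score, highest first, and read off their true reply counts.
    order = sorted(range(len(reply_counts)),
                   key=lambda i: predicted_scores[i], reverse=True)
    predicted = [reply_counts[i] for i in order]
    # The actual (ideal) ranking orders posts by their true reply counts.
    ideal = sorted(reply_counts, reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(predicted, k) / ideal_dcg if ideal_dcg else 0.0


reply_counts = [10, 2, 5]                      # made-up reply counts for three seed posts
perfect = ndcg_at_k([0.9, 0.1, 0.5], reply_counts, k=3)
worst_top1 = ndcg_at_k([0.1, 0.9, 0.5], reply_counts, k=1)
```

With a prediction that orders the posts exactly by reply count the score is 1; placing the post with 2 replies first instead of the one with 10 gives 2/10 at k=1.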

Page 18

Activity Level Prediction Results

[Table: averaged normalised discounted cumulative gain]

A value of 1 indicates that the predicted ranking of posts perfectly matched their real ranking.

Page 19

Activity Level Prediction Results

For the Astronomy & Space community content features were best for identifying seed posts and are also best for ranking posts according to the attention level they will generate.

Page 20

Activity Level Prediction Results

Golf forum (343)

Combination of all features worked best for identifying seed posts. Focus features alone are best for ranking posts.

Page 21

Activity Level Prediction Results

Bank & Insurance forum (544)

Combination of all features worked best for identifying seed posts. Community features alone are best for ranking posts.

Page 22

Activity Level Prediction Summary

Factors that impact discussion initiation often differ from the factors that impact discussion length

e.g. for the Golf community

Seed Posts = all features

Activity level = focus features

Page 23

Activity Level Prediction Summary

Factors that are associated with lengthy discussion tend to be different for different communities

Title length is the only feature that has a small but significant positive impact on the number of replies a post gets across several communities

Work & Jobs forum: title length coef=0.034, p<0.01

Satellite forum: title length coef=0.030, p<0.05

Page 24

Conclusions (1)

Different community forums exhibit interesting differences in terms of how attention is generated

Most attention patterns which we identified are local and community-specific

“Global” patterns may highly depend on composition of dataset

Page 25

Conclusions (2)

Same features that have a positive impact on the start of discussions in one community can have a negative impact in another community

Example: number of links

Negative impact in most communities

Positive impact in information and content driven communities

Page 26

Conclusions (3)

Purpose of community and specificity of community’s subject may impact their reply behavior

Communities which have a supportive purpose are most likely driven by different factors than communities with an informational purpose.

Communities around very specific topics require posts to fit to the topical focus. Communities around more general topics do not have this requirement.

Page 27

Limitations & Future Work

Correlation versus Causality

We cannot answer the "what would have happened if" question with our approach

Controlled experiments where platform is manipulated

Most attention patterns are local. But how local?

Can we automatically identify the context in which attention patterns may hold?

Page 28

Experimental Setup

THANK YOU

claudiawagner.info

src: http://adobeairstream.com/green/a-natural-predicament-sustainability-in-the-21st-century/

Attention patterns tend to be local and community-specific. Ignoring communities’ idiosyncrasies isn’t bliss.