mary mcglohon, carnegie mellon* matthew hurst, microsoft *work completed at microsoft

31
Community Structure and Information Flow in Usenet: Improving Analysis with a Thread Ownership Model Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft 1

Upload: mikaia

Post on 24-Feb-2016

28 views

Category:

Documents


0 download

DESCRIPTION

Community Structure and Information Flow in Usenet: Improving Analysis with a Thread Ownership Model. Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft. Motivation. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

1

Community Structure and Information Flow in Usenet:Improving Analysis with a Thread Ownership Model

Mary McGlohon, Carnegie Mellon*Matthew Hurst, Microsoft

*work completed at Microsoft

Page 2: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

2

Motivation•Comparing communities

of online social networks may lend insight into how groups form and thrive

•We would also like to understand how information diffuses between groups

Collaborations at Santa Fe Institute(Girvan & Newman)

Page 3: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

3

Why Usenet?•We delve into these questions by

analyzing data from Usenet•Public•Can be analyzed over a long time period•Has pre-defined, hierarchical community

structure•Two main goals:

▫Compare different group activity (size, reciprocity)

▫Observe diffusion between groups

Page 4: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

4

Data•Posts from 200 politically-

oriented newsgroups (bulletin boards)▫ “polit” in name

•January 2004-June 2008•Several countries,

state/provinces, and topics.

•19.6 million unique articles, 6.2 million cross-posted

RepliesParent

Page 5: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

5

Cross-posting•A large percentage of articles are cross-

posted to multiple groups.•Somebody reading one group may “reply-

to-all”, such that all groups see it.

Major issue: many are cross-posted to multiple groups.Where is conversation truly occurring?

{alt.politics, us.politics}

{alt.politics,

us.politics, pa.politics}

{alt.politics,

us.politics, pa.politics}

{alt.politics, us.politics}

Page 6: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

6

Outline•Motivation•Data description•Structural Analysis

▫Size▫Reciprocity▫Similarity

•Ownership model▫Effects of Cross-posting▫Information Flow based on Ownership▫Similarity

Page 7: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

7

Structural Analysis•We hope to compare the structure of

communities by answering the following questions:

•How do edges form?•How does the reciprocity of groups

compare?•How can we measure similarity?

Page 8: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

8

Sizes of groups•How do edges form?

•To answer, we make a network of authors for each group

•If a1 has replied to a2 at any point, there is an edge from a1 to a2

Page 9: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

9

Sizes of groups•Power law-like

relationship between number of authors and number of edges.

•Similar to densification law [Leskovec+05], only with individual networks instead of snapshots of a network over time.

log(nodes)

log(edges)t=2004

t=2008

log(Number of authors)

log(

Num

ber

of

edge

s)

alt.politics

tw.bbs.politics

Page 10: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

10

Reciprocity• Which groups have highest reciprocity?• Reciprocity: percentage of reply-edges that are

mutual• Top 10 were European newsgroups (up to 0.58):

▫hun.politika▫relcom.politics▫hsv.politics▫italia.modena.politica▫se.politik▫ it.discussioni.leggende.metropolitane▫ukr.politics▫yu.forum.politika▫ni.politics▫swnet.politik

• Lowest reciprocity occurred in tw.bbs.* (<0.1)

Page 11: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

11

Similarity•How can we measure similarity between

groups?•Use Jaccard coefficient for cross-posts:

# Shared articles (cross-posts) between 2 groups

Total number of articles in groups

•Can do the same with shared authors•Highest similarity ~0.54 (bc.politics and

ont.politics)

Page 12: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

12

Similarity•Each group is a node•Edge drawn if similarity > 0.1 (thick

edge >0.2)•Form clusters: parties, US regional,

countries, alt.politics subgroups

Page 13: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

13

Parties/topics

Page 14: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

14

US States

Page 15: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

15

English-speaking countries

Page 16: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

16

alt.politics.*

Page 17: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

17

Outline•Motivation•Data Description•Structural Analysis

▫Size▫Reciprocity▫Similarity

•Ownership model▫Information Flow based on Ownership▫Similarity

Page 18: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

18

Problem: Excessive cross-posting•We just saw that there is significant

overlap between groups in terms of articles

•However, cross-posting occurs often between unrelated groups (“edges below threshold”)

•We would like to find out in which group the activity is truly occurring

• How can we trace this?

Page 19: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

19

Solution: Thread Ownership•Answer: Assign “ownership” based on the

authors of the posts•First, assign authors to groups based on

devotion▫Devotion(a,g): what percentage of an

author a’s posts are exclusively posted to a given group g

•For each post, normalize devotion among groups where the post occurs.▫Group with highest devotion score for the

author has more “ownership” of a post

Page 20: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

20

Example: Thread Ownership•Suppose in the data authors have the

following numbers of non-cross-posts in each group:

•Then, they form a thread:{alt.politics, us.politics}

{alt.politics, us.politics}{alt.politics, us.politics,

pa.politics}

alt.politics

us.politics pa.politics

Author 1 6 4 0Author 2 0 1 3Author 3 0 1 2

{0.6, 0.4}

{0, 1}{0, 0.25, 0.75}

Page 21: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

21

Real thread

•Initially cross-posted to several groups (including talk.politics.misc), 38 groups in total

•Ownership concentrated in seattle.politics and or.politics

•Subject: “Kiss the National Parks Good-Bye”

Page 22: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

22

Applications of thread ownership•Ownership model aids in analyzing

threads▫Influence between groups: How are

threads discovered and posted to new groups?

▫Similarity of groups: How can ownership help us more precisely state when two groups are similar?

Page 23: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

23

Information flow between groups• How are threads discovered and posted

to new groups?• Idea: Extend ownership to influence

• How often does an author in group 1 respond to a post they found in group 2?▫Author finds parent post pp by browsing group gp

▫Author writes child post pc to group gc

▫Then, we say gp influences gc

Influence(gp, gc) = Devotion(a, gp) * Devotion(a, gc)

• This helps pinpoint when an author decides to cross-post late in the thread

{alt.politics, us.politics}{alt.politics, us.politics, pa.politics}

Page 24: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

24

Example: Ownership-based influence•Author 2 sees parent post•Replies, adding pa.politics.•Since Author 2 is not devoted

to alt.politics, he was most likely influenced by us.politics

•Influence(us.politics,pa.politics) =

1 * 0.75 = 0.75

{alt.politics, us.politics}

{alt.politics, us.politics, pa.politics}

alt.politics

us.politics pa.politics

Author 1 6 4 0Author 2 0 1 3

Page 25: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

25

Who influences whom?•Information often diffuses from major to

minor groups

Page 26: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

26

Ownership-based Similarity• Q: How can ownership help us more precisely state

when two groups are similar?• A: Use “shared ownership” instead of shared posts

Western states Eastern states

Page 27: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

27

Applications and future work•Potential Applications

▫Link prediction▫Information retrieval and relevance▫Ownership for email lists

•Future Work▫Using comparative measures to predict

whether group will continue

Page 28: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

28

Related work: Discussion Groups• Backstrom, L.; Kumar, R.; Marlow, C.; Novak, J.; and

Tomkins, A. 2008. Preferential behavior in online groups. WSDM ’08

• Gomez, V.; Kaltenbrunner, A.; and Lopez, V. 2008. Statistical analysis of the social network and discussion threads in slashdot. WWW ’08

• Mishne, G., and Glance, N. 2006. Leave a reply: An analysis of weblog comments. WWE ’06

• Turner, T. C.; Smith, M. A.; Fisher, D.; and Welser, H. T. 2005. Picturing usenet: Mapping computer-mediated collective action. Journal of Computer-Mediated Communication 10(4).

• Viegas, F. B., and Smith, M. 2004. Newsgroup crowds and authorlines: visualizing the activity of individuals in conversational cyberspaces. HICSS 2004

Page 29: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

29

Related work: Information Diffusion•Kossinets, G.; Kleinberg, J.; and Watts, D.

2008. The structure of information pathways in a social communication network. KDD’08

•Leskovec, J.; Kleinberg, J.; and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. KDD ’05

•Nowell, D. L., and Kleinberg, J. 2008. Tracing the flow of information on a global scale using Internet chain-letter data. PNAS 105(12):4633–4638.

Page 30: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

30

Conclusions•Case study of nearly 200 newsgroups,

including 19 million unique posts•Demonstrated “densification” law as

applies to different groups•Compared groups in terms of reciprocity

and shared posts/authors•Proposed thread ownership model to cut

down on “noise” from cross-posts•Applied ownership to diffusion between

groups, group similarity

Page 31: Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft *work completed at Microsoft

31

Contact info•Mary McGlohon•www.cs.cmu.edu/~mmcgloho

•Matthew Hurst•datamining.typepad.com

•Special thanks to Christos Faloutsos, Michael Gamon, Kathy Gill, Christian Konig, Alexei Maykov, Purna Sarkar, Hassan Sayyadi, Marc Smith