pula 5 giugno 2007

Complex networks in tagging systems. Andrea Capocci, Dipartimento di Informatica e Sistemistica, Università di Roma "Sapienza"


Page 1: Pula 5 Giugno 2007

Complex networks in tagging systems

Andrea Capocci

Dipartimento di Informatica e Sistemistica
Università di Roma "Sapienza"

Page 2: Pula 5 Giugno 2007

Tag networks

www.citeulike.org

Users save scientific publications and label them with tags (keywords).

Other examples:

Flickr.com (photos)
del.icio.us (bookmarks)
Connotea.org, BibSonomy (papers)

Page 3: Pula 5 Giugno 2007

TAGS

Page 4: Pula 5 Giugno 2007

Tagging systems as tripartite networks

Tag assignment
A tagging system is a set of tag assignments. A tag assignment is a triplet

(user, resource, tag)

CiteULike
550k tag assignments
48k distinct tags
180k distinct papers
6k distinct users
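The triplet definition maps directly onto a small data structure. The following is a minimal sketch, not the code used for the CiteULike analysis: the names `TagAssignment` and `summarize` and the toy records are invented for illustration.

```python
from typing import NamedTuple


class TagAssignment(NamedTuple):
    """One (user, resource, tag) triplet. Hypothetical helper type."""
    user: str
    resource: str
    tag: str


def summarize(assignments):
    """Count tag assignments and the distinct users, resources and tags."""
    return {
        "assignments": len(assignments),
        "users": len({a.user for a in assignments}),
        "resources": len({a.resource for a in assignments}),
        "tags": len({a.tag for a in assignments}),
    }


# Toy data, invented for illustration only.
toy = [
    TagAssignment("u1", "paper-A", "networks"),
    TagAssignment("u1", "paper-A", "scale-free"),
    TagAssignment("u2", "paper-A", "networks"),
    TagAssignment("u2", "paper-B", "tagging"),
]
print(summarize(toy))
```

Each of the three sets (users, resources, tags) is one layer of the tripartite network, and every triplet links one node from each layer.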

Page 5: Pula 5 Giugno 2007
Page 6: Pula 5 Giugno 2007
Page 7: Pula 5 Giugno 2007

Text analysis of tagging

The stream of tags can be interpreted as a text continuously written by collaborative users.

Zipf laws, preferential attachment and Yule processes in tag streams?

del.icio.us: see Cattuto et al.

Page 8: Pula 5 Giugno 2007

Sub-linear vocabulary growth

[Plot: number of distinct tags (# of tags) versus internal time]

Page 9: Pula 5 Giugno 2007

del.icio.us: vocabulary grows as x^0.8
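Sub-linear vocabulary growth can be checked on any tag stream by counting distinct tags as a function of internal time (the number of assignments so far) and fitting a slope in log-log scale. This is a rough sketch: the function names are invented here, and a plain least-squares fit is an assumption, not necessarily the fitting procedure behind the plots.

```python
import math


def vocabulary_growth(tag_stream):
    """Return the number of distinct tags seen after each assignment."""
    seen = set()
    growth = []
    for tag in tag_stream:
        seen.add(tag)
        growth.append(len(seen))
    return growth


def loglog_slope(ys):
    """Least-squares slope of log(y) against log(t), with t = 1..len(ys).

    Needs at least two points; every y must be positive.
    """
    xs = [math.log(t) for t in range(1, len(ys) + 1)]
    ls = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(ls) / n
    num = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A slope below 1 indicates sub-linear growth: new tags keep appearing, but at a decreasing rate.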

Page 10: Pula 5 Giugno 2007

Tag frequency distribution
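A rank-frequency table is the usual starting point when looking for a Zipf law in a tag stream. The sketch below only builds the table; `rank_frequency` is a name chosen for this example, and ties are broken alphabetically as an arbitrary convention.

```python
from collections import Counter


def rank_frequency(tag_stream):
    """Return (rank, tag, count) triples, most frequent tag first."""
    counts = Counter(tag_stream)
    # Sort by decreasing count, then alphabetically to make ties deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, tag, n) for rank, (tag, n) in enumerate(ordered, start=1)]
```

Plotting count against rank in log-log scale then shows whether the distribution is compatible with a power law.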

Page 11: Pula 5 Giugno 2007

Preferential attachment

Page 12: Pula 5 Giugno 2007

Few tags per resource

Page 13: Pula 5 Giugno 2007

Where is semantics?

Such properties can be modeled by Yule-Simon processes with memory (see Cattuto et al.)

But such an analysis does not capture the semantics of tags, for example hierarchical relations.
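For reference, a plain Yule-Simon process can be sketched in a few lines. This is a simplified version that assumes no memory kernel, so it is not the model of Cattuto et al.; the function name and the parameter `p_new` are chosen for this example.

```python
import random


def yule_simon_stream(n_steps, p_new, seed=0):
    """Generate a tag stream with a plain Yule-Simon process (n_steps >= 1).

    At each step a brand-new tag is invented with probability p_new;
    otherwise a uniformly random earlier entry of the stream is copied,
    which selects an existing tag with probability proportional to its
    current frequency (preferential attachment).
    """
    rng = random.Random(seed)
    stream = [0]          # start with a single tag, labelled 0
    next_tag = 1
    for _ in range(n_steps - 1):
        if rng.random() < p_new:
            stream.append(next_tag)
            next_tag += 1
        else:
            stream.append(rng.choice(stream))
    return stream
```

The process reproduces frequency statistics only: tags are anonymous integers, which is exactly why it carries no semantics.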

Page 14: Pula 5 Giugno 2007

Why does semantics matter?

Detection of tag categories.

Understanding users' strategies to improve the system, propose new services.

Spam detection.

Page 18: Pula 5 Giugno 2007

Tag co-occurrence network

Tags are nodes.

If two tags are assigned to the same resource, one puts an edge between the two tags.

Edges are weighted: each co-assignment of two tags increases the edge weight by one.

Strength instead of degree.
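The construction above can be sketched as follows. One assumption is made explicit: a "co-assignment" is taken to mean two tags given by the same user to the same resource; grouping by resource alone would be the other reading. Function names are invented for this example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_network(assignments):
    """Build {tag: {neighbour: weight}} from (user, resource, tag) triplets."""
    # Assumption: one "post" is the set of tags one user gave one resource.
    tags_by_post = defaultdict(set)
    for user, resource, tag in assignments:
        tags_by_post[(user, resource)].add(tag)

    weights = defaultdict(lambda: defaultdict(int))
    for tags in tags_by_post.values():
        for a, b in combinations(sorted(tags), 2):
            weights[a][b] += 1   # each co-assignment adds one to the weight
            weights[b][a] += 1
    return {tag: dict(nbrs) for tag, nbrs in weights.items()}


def degree(network, tag):
    """Number of distinct tags that co-occur with `tag`."""
    return len(network.get(tag, {}))


def strength(network, tag):
    """Sum of the weights of all edges incident to `tag`."""
    return sum(network.get(tag, {}).values())


# Toy data, invented for illustration only.
toy = [
    ("u1", "A", "networks"), ("u1", "A", "scale-free"),
    ("u2", "A", "networks"), ("u2", "A", "scale-free"), ("u2", "A", "www"),
]
net = cooccurrence_network(toy)
print(degree(net, "networks"), strength(net, "networks"))
```

Degree counts how many different tags a tag appears with; strength also counts how often, which is why it is the more informative quantity on a weighted network.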

Page 19: Pula 5 Giugno 2007

Distribution of strength

Page 20: Pula 5 Giugno 2007

Distribution of strength

?

Page 21: Pula 5 Giugno 2007

Nontrivial clustering & spam detection

Clustering coefficient C(k): average density of triangles around nodes with degree k.
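A minimal sketch of C(k) on an unweighted adjacency structure is given below. It assumes `adj` maps each node to the set of its neighbours, and it assigns clustering 0 to nodes with fewer than two neighbours, which is one common convention rather than something stated on the slides.

```python
from collections import defaultdict
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0   # convention: no triangles possible
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def clustering_by_degree(adj):
    """Return {k: average local clustering over nodes of degree k}."""
    buckets = defaultdict(list)
    for node in adj:
        buckets[len(adj[node])].append(local_clustering(adj, node))
    return {k: sum(vals) / len(vals) for k, vals in buckets.items()}


# Toy graph: a triangle a-b-c plus a pendant node d attached to a.
toy_adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(clustering_by_degree(toy_adj))
```

A node whose clustering deviates strongly from the typical value at its degree stands out in the C(k) plot, which is the kind of outlier the following slides inspect.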

Page 22: Pula 5 Giugno 2007

Nontrivial clustering & spam detection

Page 23: Pula 5 Giugno 2007

Nontrivial clustering & spam detection

k = 502

Page 24: Pula 5 Giugno 2007

Looking for a k = 502 page...

Page 25: Pula 5 Giugno 2007
Page 26: Pula 5 Giugno 2007

SPAM

Page 27: Pula 5 Giugno 2007

Nontrivial clustering & spam detection

spam: k = 502

Page 28: Pula 5 Giugno 2007

Co-occurrence networks and semantics

Co-occurrence networks are scale-free.

The significance of this statistical property is ambiguous.

Clustering encodes semantics (?)

Clustering can be used to detect spam.

Page 33: Pula 5 Giugno 2007

Users' strategies

Do users tag resources according to a conceptual hierarchy of tags?

Page 34: Pula 5 Giugno 2007

Semantics and hierarchy

For example

"Emergence of scaling in random networks" by A.-L. Barabási and R. Albert

Page 35: Pula 5 Giugno 2007

Semantics and hierarchy

For example

"Emergence of scaling in random networks" by A.-L. Barabási and R. Albert

scale-free networks

Page 36: Pula 5 Giugno 2007

Semantics and hierarchy

For example

"Emergence of scaling in random networks" by A.-L. Barabási and R. Albert

scale-free networks, networks

HIERARCHICAL

Page 37: Pula 5 Giugno 2007

Semantics and hierarchy

For example

"Emergence of scaling in random networks" by A.-L. Barabási and R. Albert

scale-free networks, WWW

NON HIERARCHICAL

Page 38: Pula 5 Giugno 2007

Model based on hierarchy

Conjectures

1. Tags have an underlying hierarchy.
2. With high probability, users add tags hierarchically.

Can we reproduce the co-occurrence network structure based on tag hierarchy?

Page 39: Pula 5 Giugno 2007

Model based on hierarchy

The underlying hierarchy is a random tree.

At each time step, we add a new resource, with two tags.

New tags are introduced with probability P_nt.

With probability P_sb, the second tag is a "generalization" of the first tag; otherwise it is chosen randomly.
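The model rules leave several details open, so the sketch below fills them with explicit assumptions that are not taken from the slides: the random tree is grown by attaching each new tag to a uniformly random existing tag, the first tag of a resource is either the newly created tag or a uniformly random existing one, and "generalization" means the parent in the tree. The parameters `p_nt` and `p_sb` stand for the two probabilities of the model.

```python
import random
from collections import defaultdict


def simulate(n_resources, p_nt, p_sb, seed=0):
    """Simulate the hierarchy-based tagging model under the stated assumptions.

    Returns (parent, weights): parent[tag] is the tag's parent in the tree
    (None for the root); weights maps frozenset({tag_a, tag_b}) to the
    number of resources carrying both tags.
    """
    rng = random.Random(seed)
    parent = {0: None, 1: 0}   # assumed seed tree: tag 1 under root tag 0
    weights = defaultdict(int)
    for _ in range(n_resources):
        # First tag: a brand-new tag with probability p_nt.
        if rng.random() < p_nt:
            first = len(parent)
            parent[first] = rng.randrange(first)   # random attachment
        else:
            first = rng.randrange(len(parent))
        # Second tag: the parent ("generalization") with probability p_sb,
        # otherwise a uniformly random tag different from the first.
        if rng.random() < p_sb and parent[first] is not None:
            second = parent[first]
        else:
            second = rng.randrange(len(parent) - 1)
            if second >= first:
                second += 1
        weights[frozenset((first, second))] += 1
    return parent, dict(weights)


def strengths(weights):
    """Sum of incident edge weights for every tag in the network."""
    s = defaultdict(int)
    for edge, w in weights.items():
        for tag in edge:
            s[tag] += w
    return dict(s)
```

Running `simulate` for different values of `p_sb` and comparing the resulting strength distribution and clustering with the measured ones is one way to test the two conjectures.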

Page 43: Pula 5 Giugno 2007

Results: strength distribution

Page 44: Pula 5 Giugno 2007


Results: clustering

Page 45: Pula 5 Giugno 2007

Conclusions

Tagging systems display nontrivial statistical properties, such as Zipf laws.

Co-occurrence networks are a way of discovering semantic relationships between tags (?)

Clustering in co-occurrence networks encodes semantics (?) and detects spam.

Simple models based on hierarchy partially explain such properties.

Page 46: Pula 5 Giugno 2007

Thank you
and thanks to...

Guido Caldarelli

The TAGORA group (Cattuto et al.)