
Web Communities

Prasanna Desikan (06/13/2002)


Definition

Web community: Groups of individuals who share common interests, together with the web pages most popular among them.

Web page collections with a shared topic.


Types of Communities

Explicitly defined: Communities that manifest themselves as newsgroups or as resource collections in directories such as Yahoo!

Implicitly defined: Communities that result from the nature of content creation on the web.


Terms and Definitions

Directed Bipartite Graph: A graph whose node set can be partitioned into two sets F and C such that every directed edge in the graph goes from a node u in F to a node v in C.


Terms and Definitions

Complete Bipartite Graph: A bipartite graph that contains all possible edges between a vertex of F and a vertex of C.

Core: A complete bipartite sub-graph with at least i nodes from F and at least j nodes from C. In the web world, the i pages that contain the links are referred to as ‘fans’ and the j pages that are referenced are called ‘centers’.
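
The core definition above can be checked directly on a small link graph. The following is only a minimal sketch: the function name `is_core` and the representation of links as (fan, center) pairs are assumptions made for illustration and do not come from the slides.

```python
def is_core(edges, fans, centers, i, j):
    """Return True if `fans` and `centers` form an (i, j) core.

    edges   : iterable of (source_page, target_page) directed links
    fans    : candidate set F (pages that contain the links)
    centers : candidate set C (pages that are referenced)
    """
    edge_set = set(edges)
    # A core needs at least i fans and at least j centers.
    if len(fans) < i or len(centers) < j:
        return False
    # Complete bipartite: every fan must link to every center.
    return all((f, c) in edge_set for f in fans for c in centers)


links = [("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c1")]
print(is_core(links, {"f1", "f2"}, {"c1", "c2"}, i=2, j=2))
print(is_core(links, {"f1", "f2", "f3"}, {"c1", "c2"}, i=3, j=2))
```

In the second call the link from f3 to c2 is missing, so the three fans do not form a (3, 2) core.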


Inferring Web Communities From Link Topology

A community is a core of central authoritative pages linked together by hub pages.

Identify communities corresponding to the principal and non-principal eigenvectors discovered by HITS.

For communities on broad topics: the grouping of pages discovered is relatively independent of the exact choice of root set.
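
As a rough illustration of how hub and authority scores arise, the sketch below runs the usual HITS power iteration on a tiny link graph. It only approximates the principal eigenvector; finding the non-principal eigenvectors mentioned above would need a full eigen-decomposition, which is not shown. The function name `hits`, the dictionary input format, and the iteration count are assumptions of this sketch.

```python
def hits(links, iterations=50):
    """Approximate hub and authority scores by power iteration.

    links : dict mapping each page to the list of pages it links to
    """
    pages = set(links) | {v for targets in links.values() for v in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for u, targets in links.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[v] for v in links.get(p, ())) for p in pages}
        # Normalize both vectors to unit length so the scores stay bounded.
        a_norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        h_norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        auth = {p: x / a_norm for p, x in auth.items()}
        hub = {p: x / h_norm for p, x in hub.items()}
    return hub, auth


graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))
```

The pages with the largest authority scores, together with the hubs that point to them, would correspond to the community associated with the principal eigenvector.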


Inferring Web Communities From Link Topology

Findings on Structure of Communities

Robustness: For broad topics, HITS produces stable, robust communities.

Topic Generalization: HITS tends to generalize topics that are not broad. “Michael Jordan” produces links to pages on Michael Jordan and his team. “Dennis Ritchie” produces links to pages on the C programming language.


Inferring Web Communities From Link Topology

Other Generalization: HITS tends to converge on topics with a greater density of linkage. For example, for a query on “linguistics”, the top authorities focus on the sub-topic “computational linguistics” because of its greater density of linkage on the web.

Temporal Issues: To obtain the long-term “core” of a topic, we can superimpose the results of HITS runs on the same topic spaced several months apart.


Trawling the Web for Emerging Web Communities

Trawling: Systematic enumeration of emerging communities from a web crawl.

Scan through a web crawl and identify all instances of graph structures that are indicative signatures of communities.


Trawling the Web for Emerging Web Communities

Data source: A copy of the web from Alexa.

Pre-processing the data:

Identify potential fan pages (pages that link to at least six different websites). Out of 200 million pages, around 24 million were extracted.

Eliminate mirrors. This removed around 60% of the 24 million pages.
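
The fan-identification step can be sketched as a filter over each page's outgoing links. This is a minimal illustration, not the procedure used in the experiments: the function name `potential_fans` is hypothetical, a "website" is approximated by the URL host name, and links back to the page's own host are not counted. All three choices are assumptions of this sketch.

```python
from urllib.parse import urlparse


def potential_fans(outlinks, min_sites=6):
    """Return the pages that link to at least `min_sites` different hosts.

    outlinks : dict mapping a page URL to the list of URLs it links to
    """
    fans = []
    for page, urls in outlinks.items():
        own_host = urlparse(page).netloc
        # Distinct hosts linked to, ignoring the page's own host (assumption).
        hosts = {urlparse(u).netloc for u in urls} - {own_host, ""}
        if len(hosts) >= min_sites:
            fans.append(page)
    return fans


crawl = {
    "http://fan.example/list": [f"http://site{n}.example/" for n in range(6)],
    "http://blog.example/post": ["http://site0.example/a", "http://site0.example/b"],
}
print(potential_fans(crawl))
```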


Trawling the Web for Emerging Web Communities

Prune by in-degree: Eliminate all pages that have an in-degree greater than a threshold value k. In the experiments, k is set to 50.

Iterative pruning: When looking for (i, j) cores, any potential fan with out-degree smaller than j can be pruned and the corresponding edges deleted from the graph.
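
Iterative pruning can be sketched as a loop that repeatedly drops edges until nothing changes. The fan rule follows the description above; the symmetric rule for centers (dropping any center with in-degree smaller than i) is an addition made in this sketch and is not stated on the slide. The function name `iterative_prune` and the edge-list format are likewise assumptions.

```python
from collections import Counter


def iterative_prune(edges, i, j):
    """Repeatedly remove edges that cannot belong to any (i, j) core.

    edges : iterable of (fan, center) directed links
    """
    edges = set(edges)
    changed = True
    while changed:
        out_deg, in_deg = Counter(), Counter()
        for fan, center in edges:
            out_deg[fan] += 1
            in_deg[center] += 1
        # Keep an edge only if its fan still has out-degree >= j and
        # its center still has in-degree >= i (center rule is an assumption).
        kept = {(f, c) for f, c in edges if out_deg[f] >= j and in_deg[c] >= i}
        changed = kept != edges
        edges = kept
    return edges


links = [("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"),
         ("f3", "c1"), ("f4", "c3"), ("f4", "c4")]
print(sorted(iterative_prune(links, i=2, j=2)))
```

Each pass can lower the degree of other nodes, which is why the pruning is repeated until the edge set stops changing.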


Trawling the Web for Emerging Web Communities

Inclusion-exclusion pruning: Let {c1, c2, …, cj} be the centers adjacent to a fan x. Let N(ct) be the neighborhood of ct, that is, the set of fans that point to ct. Then x is part of a core if and only if the intersection of the sets N(ct) has size at least i.

Filter nepotistic cores.
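
The inclusion-exclusion test for a single fan reduces to one set intersection. The sketch below assumes the fan x points to exactly j centers, as in the statement above, and that the link graph is available as a fan-to-centers dictionary; the names `fan_in_core` and `out_links` are hypothetical.

```python
def fan_in_core(x, out_links, i):
    """Decide whether fan x, with exactly j centers, lies in an (i, j) core.

    out_links : dict mapping each fan to the set of centers it points to
    """
    centers = out_links[x]
    if not centers:
        return False
    # N(c): the set of fans that point to center c.
    neighborhoods = [
        {fan for fan, targets in out_links.items() if c in targets}
        for c in centers
    ]
    # x is in a core iff at least i fans point to all of x's centers.
    common_fans = set.intersection(*neighborhoods)
    return len(common_fans) >= i


graph = {"x": {"c1", "c2"}, "y": {"c1", "c2"}, "z": {"c1"}}
print(fan_in_core("x", graph, i=2))
print(fan_in_core("x", graph, i=3))
```

If the test succeeds, the common fans and the j centers can be reported as a core; if it fails, x can be removed from the graph. Either way the graph shrinks.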


Trawling the Web for Emerging Web Communities

Evaluation of Communities

Fossilization: 30% of communities were fossilized. A fossil is a community core not all of whose fans exist on the web today.

Reliability: Only 4% of the trawled cores were coincidental, i.e., a collection of fan pages without any cogent theme unifying them.


Trawling the Web for Emerging Web Communities

Quality: 56% of the communities were not in Yahoo! as reconstructed from the crawl, and 29% were not in Yahoo! at the time of the paper. This indicates that trawling identifies emerging communities.


Self Organization and Identification of Web Communities

A web community is defined as a collection of web pages such that each member page has more hyperlinks (in either direction) within the community than outside of the community.

Approach: Maximum flow / minimum cut framework.

Benefits: Focused crawling, automatic population of portal categories.
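
The community definition above can be tested for any candidate set of pages by counting links inside and outside the set. This is a minimal sketch; the function name `satisfies_community_definition` and the edge-list format are assumptions, and each hyperlink is counted once for both of its endpoints because the definition counts links in either direction.

```python
from collections import Counter


def satisfies_community_definition(edges, community):
    """Check that every member has more links inside the set than outside.

    edges     : iterable of (source_page, target_page) hyperlinks
    community : set of candidate member pages
    """
    inside, outside = Counter(), Counter()
    for u, v in edges:
        # Count the link for both endpoints (direction is ignored).
        for page, other in ((u, v), (v, u)):
            if page in community:
                if other in community:
                    inside[page] += 1
                else:
                    outside[page] += 1
    return all(inside[p] > outside[p] for p in community)


left = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
right = [("w", "x"), ("w", "y"), ("w", "z"), ("x", "y"), ("x", "z"), ("y", "z")]
web = left + right + [("d", "w")]
print(satisfies_community_definition(web, {"a", "b", "c", "d"}))
print(satisfies_community_definition(web, {"a", "b"}))
```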


A Simple Community Identification Example

Figure: Maximum flow methods will separate the two subgraphs for any choice of s and t that has s in the left subgraph and t in the right subgraph, removing the three dashed links.


Approximate Flow Community


Exact Flow Community


Exact Flow Community

An artificial source s is added, with infinite-capacity edges routed to all seed vertices in S.

Each pre-existing edge is made bi-directional and rescaled to a constant value k.


Exact Flow Community

All vertices except the source, sink, and seed vertices are routed to the artificial sink with unit capacity.

A residual flow graph is produced by a maximum flow procedure.

All vertices reachable from s through edges with positive residual capacity form the desired result and satisfy our definition of a community.
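
The construction described on these slides can be sketched end to end with a simple augmenting-path maximum flow (Edmonds-Karp is used here as one possible choice; the slides do not name a specific maximum flow procedure). The function name `exact_flow_community` and the labels `_source` and `_sink` for the artificial vertices are hypothetical.

```python
from collections import defaultdict, deque


def exact_flow_community(edges, seeds, k):
    """Return the vertices reachable from the artificial source after max flow."""
    S, T = "_source", "_sink"  # hypothetical labels for the artificial vertices
    INF = float("inf")
    cap = defaultdict(lambda: defaultdict(int))  # residual capacities
    vertices = set()
    for u, v in edges:
        vertices.update((u, v))
        cap[u][v] = k  # each pre-existing edge is made bidirectional
        cap[v][u] = k  # and rescaled to the constant value k
    for seed in seeds:
        cap[S][seed] = INF  # infinite-capacity edges from s to the seeds
    for v in vertices - set(seeds):
        cap[v][T] = 1  # unit-capacity edges from non-seed vertices to the sink

    def reachable_from_source():
        """Breadth-first search over edges with positive residual capacity."""
        parent = {S: None}
        queue = deque([S])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if T not in parent:
            break  # no augmenting path left: the flow is maximum
        # Find the bottleneck capacity along the path found by the search.
        bottleneck, v = INF, T
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        # Push flow along the path and update the residual capacities.
        v = T
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
    return set(reachable_from_source()) - {S}


left = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
right = [("w", "x"), ("w", "y"), ("w", "z"), ("x", "y"), ("x", "z"), ("y", "z")]
web = left + right + [("d", "w")]
print(sorted(exact_flow_community(web, seeds={"a"}, k=2)))
```

On this toy graph of two four-page cliques joined by a single link, a seed of `a` with k = 2 yields the left clique. With k = 1 the cheapest cut isolates the seed alone, which suggests that the choice of k matters for the size of the community returned.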


Sample Results From Community Identification

The scores are the total number of inbound and outbound links that a web page has to other pages that are also in the community.


Characterization of Communities

Table 3: The fifteen most significant text features for each community, sorted in descending order of the Kullback-Leibler metric.


Discovering Seeds of New Interest Spread From Premature Pages.

A method for discovering topics that stimulate communities of people into earnest communication about the topics’ meaning and that grow into a trend of popular interest.

A community is a group of people sharing some value.


Agora Method on Links

Archive-page: The page of highest rank according to Google in a community.

Agora-pages: Pages that are linked from multiple archive-pages but are not in any community themselves. They are taken as novel topics attracting multiple communities and are called agora-topic pages.


Agora Method on Links

Step 1: A query representing the user’s interest domain is entered into a search engine (Google here, obtaining 10^5 to 10^6 pages).

Step 2: Communities are extracted from the pages obtained in Step 1, and an archive-page is selected from each community.


Agora Method on Links

Step 3: Pages that are not in the communities but are linked from multiple archive-pages are obtained as agora-pages. With all results obtained, the archive-pages (black nodes), the agora-pages (red nodes), and the links between them are visualized as in Fig. 1.
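
Step 3 amounts to a counting filter over the links leaving the archive-pages. The sketch below assumes the communities and archive-pages from Step 2 are already available; the function name `agora_pages`, the dictionary input format, and the reading of "multiple" as "at least two" are assumptions of this illustration.

```python
from collections import Counter


def agora_pages(archive_links, community_pages, min_archives=2):
    """Return pages linked from several archive-pages but in no community.

    archive_links   : dict mapping each archive-page to the pages it links to
    community_pages : set of all pages that belong to some community
    """
    linked_from = Counter()
    for targets in archive_links.values():
        # Count each archive-page at most once per target page.
        for page in set(targets):
            linked_from[page] += 1
    return {
        page
        for page, count in linked_from.items()
        if count >= min_archives and page not in community_pages
    }


archives = {"A1": ["p", "q"], "A2": ["p", "r"], "A3": ["q", "m"]}
members = {"A1", "A2", "A3", "q", "m"}
print(agora_pages(archives, members))
```

Here `p` is the only agora-page: `q` is also linked from two archive-pages but already belongs to a community.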


Fig: The output of Agora on Links, for domain query “Human Genome”


Evaluation

Stage 1. An interest domain is fixed, a group of people relevant to the domain is gathered, and the domain name is input as a query (e.g., “information retrieval”).

Stage 2. The output graph, containing both real and fake red nodes presented as if all were genuinely obtained as agora-pages, is shown to the subjects. That is, some red nodes that were not actually obtained were added, with red links to black archive-nodes. Subjects reported their individual impressions and exchanged ideas in the group.


Sample Results

Institutes shown in ‘red’ were the ones that have data sources for human or mouse genomes, and it is useful for researchers at other institutes to look at those data.

8 of the 12 ‘red’ nodes were described by the subjects as “interesting for thinking of future work”.


References

[1] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia.

[2] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Trawling the web for emerging cyber-communities. In Proc. 8th Int. World Wide Web Conf., 1999.

[3] Gary William Flake, Steve Lawrence, and C. Lee Giles. Efficient Identification of Web Communities. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[4] Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans M. Coetzee. Self-Organization and Identification of Web Communities. IEEE Computer, 35(3), 66–71, 2002.


References

[5] Naohiro Matsumura, Yukio Ohsawa, and Mitsuru Ishizuka. Discovering Seeds of New Interest Spread from Premature Pages Cited by Multiple Communities. 2001 International Conference on Web Intelligence.


Kullback-Leibler Metric

Let p and q be probability distributions with support X and Y respectively. The relative entropy, or Kullback-Leibler distance, between the two probability distributions p and q is defined as

$D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$
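
For a quick numeric feel, the relative entropy can be computed for two small discrete distributions. The sketch uses the standard definition, the sum over x of p(x) log(p(x) / q(x)) with the natural logarithm; the function name `kl_divergence` and the dictionary representation of the distributions are assumptions.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p || q) for discrete distributions given as dicts."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # terms with p(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```

The two calls give different values: the quantity is not symmetric in p and q, so it is not a distance in the strict metric sense.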
