Density link-based methods for clustering web pages


Intelligent Database Systems Lab

National Yunlin University of Science and Technology

Density link-based methods for clustering web pages

Morteza Haghir Chehreghani, Hassan Abolhassani, Mostafa Haghir Chehreghani

DSS, 2009

Presented by Jun-Yi Wu, 2010/09/08


N.Y.U.S.T.

I. M.


Outlines

· Motivation
· Objectives
· Methodology
· Experiments
· Conclusions
· Comments


Motivation

· Web information is very useful for supporting decision making, but the information explosion on the web makes it hard to obtain the required knowledge.

· Effective web clustering facilitates the retrieval of relevant documents, which in turn supports decision making.

· High-quality clustering helps users access relevant information more conveniently.


Objectives

· Use both content and link information on top of density-based algorithms.

· Density-based methods have the advantages of creating clusters of various shapes and removing noisy data.

· Propose a method that uses the web hyperlink structure to find dense units and to improve the joining process for creating hierarchical clusters.

Methodology

· The paper proposes two methods:
  ─ New density-based method
  ─ Density link-based method

Density-based Method: DBSCAN

· DBSCAN was the first density-based algorithm.

· To create a new cluster or expand an existing one, a neighborhood of radius Eps must contain at least a minimum number of points, denoted MinPts.
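The Eps/MinPts rule can be illustrated with a minimal DBSCAN sketch. This is not the presenters' code: it assumes points are numeric tuples compared by Euclidean distance, and the names `region_query` and `dbscan` are chosen here for illustration only.

```python
from math import dist  # Euclidean distance, Python 3.8+


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be re-labelled as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point
                queue.extend(j_neighbors)
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In this toy data set the two groups of four points form two clusters and the isolated point (50, 50) is labelled as noise. The neighborhood search here is a linear scan, so the sketch is quadratic in the number of points.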

New density-based method

· Clusters web data using only the textual content of documents.

· The basic algorithm is then extended to use the hyperlinks between web documents.
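A content-only method needs some measure of how similar two documents are. The slides do not say which measure the paper uses, so the sketch below assumes a common choice, cosine similarity over bag-of-words term counts. The function name `cosine_similarity` and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using raw term counts (assumed measure)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the terms the two documents share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0  # empty documents are treated as dissimilar


same = cosine_similarity("web page clustering", "web page clustering")
disjoint = cosine_similarity("web page", "link graph")
```

Identical texts score close to 1 and texts with no shared terms score 0. A real system would usually add stop-word removal and term weighting such as TF-IDF.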

New density-based method

· The method has some limitations:
  ─ A constant value for mutation to a higher level is not appropriate. A smaller value may be appropriate for smaller clusters, but larger clusters need larger values.
  ─ It is developed for web data clustering, but it does not use the hyperlink structure of the web.
  ─ Setting accurate values for the parameters of the proposed method may be difficult.

Link-based algorithm

· The hyperlink structure suggests some interesting ideas:
  ─ Combined with text content, it can help construct hierarchical clusters, with the link-based clusters as the base clusters.
  ─ The link structure can be a good hint for finding dense units.

Link-based algorithm

· First step: finding dense units
  ─ A sub-hyperlink structure is an LD_Unit if, for each core node N inside the unit, there is a subset of N's neighbors such that:
    ─ it has at most MaxN members, and
    ─ the sum of the similarities between N and the nodes of this subset is at least W.

· Second step: joining dense units
  ─ Node a is said to be an external node of b if a and b do not exist in the same LD_Unit.

Figure: example with W=2.5 and MaxN=4
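The two definitions on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the neighbor similarities of a node are assumed to be given as a list of numbers, the mapping `unit_of` from node to LD_Unit id is a hypothetical representation, and the function names are invented here.

```python
def has_dense_neighborhood(similarities, w, max_n):
    """True if some subset of at most max_n neighbors has similarity sum >= w.

    The max_n largest positive similarities give the largest sum any
    allowed subset can reach, so testing that single subset is enough.
    """
    best = sorted((s for s in similarities if s > 0), reverse=True)[:max_n]
    return sum(best) >= w


def is_external_node(a, b, unit_of):
    """True if a and b are not in the same LD_Unit.

    unit_of maps each assigned node to its unit id (hypothetical layout);
    a node missing from the mapping belongs to no unit.
    """
    unit_a = unit_of.get(a)
    return unit_a is None or unit_a != unit_of.get(b)


# Similarities between a node N and its neighbors (made-up values),
# checked with the slide's parameters W=2.5 and MaxN=4.
dense = has_dense_neighborhood([0.9, 0.8, 0.5, 0.4, 0.1], w=2.5, max_n=4)
sparse = has_dense_neighborhood([0.9, 0.8, 0.5], w=2.5, max_n=4)
```

With W=2.5 and MaxN=4, the first node qualifies because its four strongest neighbors sum to about 2.6, while the second has only three neighbors summing to about 2.2. Sorting makes each check O(d log d) in the number of neighbors d.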

Experiments

· Comparison of density-based algorithms from different aspects.

Experiments

· Use of the density-based method for clustering web pages.

Experiments

· Examination of the link-based method for clustering web pages.


Conclusions

· The proposed method has the advantage of low complexity, O(n log n), and the resulting clusters are of high quality.

· The results show that the link-based method has some advantages over the density-based method.


Comments

· Advantages
  ─ Low complexity
  ─ High quality

· Applications
  ─ Data clustering
