page segmentation by web content clusteringwims.vestforsk.no/slides/alcic.pdf · web page...

40
WIMS ’11 Page Segmentation by Web Content Clustering Sadet Alcic Heinrich-Heine-University of Duesseldorf Department of Computer Science Institute for Databases and Information Systems May 26, 2011 1 / 19

Upload: others

Post on 26-Mar-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Page Segmentation by Web Content Clustering

Sadet Alcic

Heinrich-Heine-University of DuesseldorfDepartment of Computer Science

Institute for Databases and Information Systems

May 26, 2011

1 / 19

Page 2: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Outline

1 IntroductionMotivationRelated Work

2 Web Page Segmentation by ClusteringGeneral IdeaDistance functions for web contentsClustering methods

3 Evaluation StudiesDistance functionsClustering

4 Conclusion and Future work

2 / 19

Page 3: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Introduction Motivation

Motivation

3 / 19

Page 4: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Introduction Motivation

Motivation

3 / 19

Web Page is cluttered with differentcontents

I Different news articles

I Link lists

I Commercials

I Template elements

I Functional elements

Page 5: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Introduction Motivation

Motivation

3 / 19

Web Page Segmentation

I Separation of web contents intostructural and semanticcohesive blocks

Page 6: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Introduction Motivation

Motivation

3 / 19

Page 7: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Introduction Motivation

Motivation

3 / 19

Applications

I Web Content Search

I Web Page Categorization

I Web Page Adaptation forMobile Devices

I Web Image Indexing

I ...

Page 8: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Introduction Related Work

Overview of Related Work to Web Page Segmentation

I TOP-DOWN page segmentation:I KDD’02: Lin and Ho. Discovering Informative Content Blocks from

Web Documents (Table properties)I APWeb’03: Cai et al. Extracting content structure for web pages based

on visual representation (Heuristic rules on visual and DOM properties)I TKDE’05: Kao et al. Web Intrapage Informative Structure Mining

Based on DOM (Term entropy based on heuristics)

I BOTTOM-UP page segmentation:I CIKM’02: Li et al. Using Micro Information Units for Internet Search

(Heuristic rules)I WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage

segmentation (Graph partitioning)I CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web

page segmentation (Partitioning of a histogram of text density)

4 / 19

Page 9: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Introduction Related Work

Overview of Related Work to Web Page Segmentation

I TOP-DOWN page segmentation:I KDD’02: Lin and Ho. Discovering Informative Content Blocks from

Web Documents (Table properties)I APWeb’03: Cai et al. Extracting content structure for web pages based

on visual representation (Heuristic rules on visual and DOM properties)I TKDE’05: Kao et al. Web Intrapage Informative Structure Mining

Based on DOM (Term entropy based on heuristics)

4 / 19

Basic Idea

I Start with complete Page asinitial block

I Decide for each block:I should the block be

separated?I if yes, where to separate?

! Based on heuristics

Page 10: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Introduction Related Work

Overview of Related Work to Web Page Segmentation

I BOTTOM-UP page segmentation:I CIKM’02: Li et al. Using Micro Information Units for Internet Search

(Heuristic rules)I WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage

segmentation (Graph partitioning)I CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web

page segmentation (Partitioning of a histogram of text density)

4 / 19

Basic Idea

I Start with smallest contentunits (e.g., DOM leafs)

I group them to blocks ofcoherent content

I How?

Page 11: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Introduction Related Work

Overview of Related Work to Web Page Segmentation

I BOTTOM-UP page segmentation:I CIKM’02: Li et al. Using Micro Information Units for Internet Search

(Heuristic rules)I WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage

segmentation (Graph partitioning)I CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web

page segmentation (Partitioning of a histogram of text density)

4 / 19

Our Approach

I belongs to BOTTOM-UP methods

I DOM leafs are used as basic webobjects

I Idea!: group web objects to blocks byclustering

Page 12: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering

Web Page Segmentation by Clustering

5 / 19

Page 13: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering General Idea

Page Segmentation by Clustering

General Definition: Clustering

I Clustering is the process of organizing objects into groups whosemembers are similar in some way

I A cluster is therefore a collection of objects which are similar betweenthem and are dissimilar to the objects belonging to other clusters

Open questions addressed in this work

I How can the similarity (or dissimilarity) of web objects be estimated?

I Which representation is best suitable to represent web objects?

I Which clustering method should be applied for clustering?

6 / 19

Page 14: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Distance functions for web contents

Different Representations of Web objects

I Geometric RepresentationI web browser puts every object of a web page in a 2-dim planeI extract the bounding rectangle for each object

I Semantic RepresentationI elements in DOM contain some textual contentsI extract keywords from the corresponding text

I DOM-based RepresentationI each object is a node in the DOM tree of the pageI use the position of the object in DOM tree to characterize it

A B

⇒ Different distance measures are possible

7 / 19

Page 15: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Distance functions for web contents

Geometric Distance

I Let R = [(rx , ry ), (rx,, ry

,)] and S = [(sx , sy ), (sx,, ry

,)] be twobounding rectangles

I The geometric distance of R to S is given by

dist(R, S) =(∑

i∈x ,y ti2) 1

2, with ti =

ri − si

, if ri > si,

si − ri, if ri

, < si0 if otherwise.

I Visually:

R

S

(rx', ry')

(rx, ry)

(sx, sy)

(sx', sy')

x

y

mindist(R,

S)

(0, 0)

8 / 19

Page 16: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Distance functions for web contents

Semantic Distance

I Given T1 = (dog, run, street), T2 = (puppy, walk, road)I Cosine Similarity Measure (Information Retrieval)

I Lexical word-to-word matching → sim(T1, T2) = 0

I to strict: e.g. synonym and hyponym relationships are not considered

I Instead: text similarity measure based on WordNet [Corley 05]I Words are mapped to concepts in WordNet (concept-to-concept

matching)

sim(T1, T2) =

∑wi∈T1

maxSim(wi , T2) · idf (wi )∑wi∈T1

idf (wi )

9 / 19

Page 17: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Distance functions for web contents

DOM-based Distance

1

2

3

4

5

6 1 2 3 4 5 6

B C

Alevel = 0

level = 1

level = 2

level degree = 2

level degree = 3

level degree = 0

10 / 19

Page 18: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Distance functions for web contents

DOM-based Distance

1

2

3

4

5

6 1 2 3 4 5 6

B C

Alevel = 0

level = 1

level = 2

level degree = 2

level degree = 3

level degree = 0

Requirements

I Nodes under same parent are closer than nodes under different parent

I Nodes on higher tree level are closer that nodes on lower level

10 / 19

Page 19: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Distance functions for web contents

DOM-based Distance

1

2

3

4

5

6 1 2 3 4 5 6

B C

Alevel = 0

level = 1

level = 2

level degree = 2

level degree = 3

level degree = 0

I Traverse DOM-tree in preorder traversing:

P = ( kA , kB , k1 , k2 , k3 , kC , k4 , k5 , k6 )

I For each element in P define a weight wpi that expresses the costsneeded to reach pi from its predecessor in P

10 / 19

Page 20: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Distance functions for web contents

DOM-based Distance

1

2

3

4

5

6 1 2 3 4 5 6

B C

Alevel = 0

level = 1

level = 2

level degree = 2

level degree = 3

level degree = 0

I Traverse DOM-tree in preorder traversing:

P = ( kA , kB , k1 , k2 , k3 , kC , k4 , k5 , k6 )

I For each element in P define a weight wpi that expresses the costsneeded to reach pi from its predecessor in P

10 / 19

Page 21: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Distance functions for web contents

DOM-based Distance

1

2

3

4

5

6 1 2 3 4 5 6

B C

Alevel = 0

level = 1

level = 2

level degree = 2

level degree = 3

level degree = 0

I The distance between pa, pb ∈ P, (wlog. a < b) is defined as:

d(pa, pb) =b∑

i=a+1

wpi

Example: d( k2 , k4 ) = w3 + wC + w4

10 / 19

Page 22: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Distance functions for web contents

DOM-based Distance

1

2

3

4

5

6 1 2 3 4 5 6

B C

Alevel = 0

level = 1

level = 2

level degree = 2

level degree = 3

level degree = 0

I The weight wi of a node pi ∈ P depends on the level l and the leveldegree dl of pi :

w(l) =

{c : dl = 0

dl · w(l + 1) : dl > 0,(1)

e.g., w(2) = c , w(1) = 3 ∗ w(2) = 3c, w(0) = 2 ∗ w(1) = 6c ,

10 / 19

Page 23: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Web Page Segmentation by Clustering Clustering methods

Clustering Methods

I Partitioning ClusteringI k-medoid (similar as k-means, but cluster representatives are real

objects)

I Agglomerative Hierarchical ClusteringI single link method applied to compute distance between sets of objects

I Density-based ClusteringI DBSCAN variant (able to find clusters of different density levels)

11 / 19

Page 24: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies

Evaluation Studies

12 / 19

Page 25: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

Distance-Matrix Visualization

I A distance matrix contains all pairwise distances of the objects to beclustered, e.g.

a b c

a 0 1.9 1.1b 1.9 0 2.3c 1.1 2.3 0

13 / 19

Page 26: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

57-65

1-19

66-68

29-4344-54

55-56

20-28

13 / 19

Page 27: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

57-65

1-19

66-68

29-4344-54

55-56

20-28

13 / 19

Numbers

I indicate the position of particular webobjects in the source code

Page 28: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

Distance-Matrix Visualization

1

10

20

30

40

50

60

1

10

20

30

40

50

60

1 10 20 30 40 50 60 1 10 20 30 40 50 60 1 10 20 30 40 50 60

1

10

20

30

40

50

60

a) DOM-Distance b) Geometric-Distance c) Semantic-Distance

I left and bottom axe represent the web objects ordered by theirappearance on the web page

I each pixel represents a distance value

I white means lowest distance, black means highest distance

I bright squares in the diagonal indicate possible page blocks

13 / 19

Page 29: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

57-65

1-19

66-68

29-4344-54

55-56

20-28

13 / 19

Page 30: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

Distance-Matrix Visualization

1

10

20

30

40

50

60

1

10

20

30

40

50

60

1 10 20 30 40 50 60 1 10 20 30 40 50 60 1 10 20 30 40 50 60

1

10

20

30

40

50

60

a) DOM-Distance b) Geometric-Distance c) Semantic-Distance

13 / 19

Page 31: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

Distance-Matrix Visualization

1

10

20

30

40

50

60

1

10

20

30

40

50

60

1 10 20 30 40 50 60 1 10 20 30 40 50 60 1 10 20 30 40 50 60

1

10

20

30

40

50

60

a) DOM-Distance b) Geometric-Distance c) Semantic-Distance

Results:

I DOM-distance has good correspondence

I Geometric distance has some correspondence, but there are otherbright rectangles

I Semantic distance has almost no correspondence

13 / 19

Page 32: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

Distance-Matrix Visualization

1

10

20

30

40

50

60

1

10

20

30

40

50

60

1 10 20 30 40 50 60 1 10 20 30 40 50 60 1 10 20 30 40 50 60

1

10

20

30

40

50

60

a) DOM-Distance b) Geometric-Distance c) Semantic-Distance

Results:

I DOM-distance has good correspondence

I Geometric distance has some correspondence, but there are otherbright rectangles

I Semantic distance has almost no correspondence

13 / 19

Page 33: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

57-65

1-19

66-68

29-4344-54

55-56

20-28

1

10

20

30

40

50

60

1

10

20

30

40

50

60

1 10 20 30 40 50 60 1 10 20 30 40 50 60 1 10 20 30 40 50 60

1

10

20

30

40

50

60

a) DOM-Distance b) Geometric-Distance c) Semantic-Distance

13 / 19

Page 34: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Distance functions

Distance-Matrix Visualization

1

10

20

30

40

50

60

1

10

20

30

40

50

60

1 10 20 30 40 50 60 1 10 20 30 40 50 60 1 10 20 30 40 50 60

1

10

20

30

40

50

60

a) DOM-Distance b) Geometric-Distance c) Semantic-Distance

Results:

I DOM-distance has good correspondence

I Geometric distance has some correspondence, but there are otherbright rectangles

I Semantic distance has almost no correspondence

13 / 19

Page 35: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Clustering

Evaluation - Clustering Performance

Dataset

I 78 web documents from 8 different categories from Yahoo! directory

I 23,819 web contents (in average 305 per web page)

I Web contents clustered manually by 3 volunteers

I Ground truth is combination of all three proposals

MethodI For each combination of clustering method & distance function

I compute clustering of the web contents into page blocks

I Ground Truth Clustering (GT), Computed Clustering (C)

14 / 19

Page 36: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Clustering

Evaluation - Performance Measure

Based on Contingency Table for pairs of objects:

Same cluster in C Different cluster in C

Same cluster in GT f11 f10

Different cluster in GT f01 f00

Performance Measure

Rand statistic =f00 + f11

f00 + f01 + f10 + f11

15 / 19

Page 37: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Evaluation Studies Clustering

Average Rand Statistic Results

DOM-based geometric semantic

partitioning 0.45 0.47 0.25

hierarchical 0.52 0.41 0.24

density-based 0.61 0.43 0.27

Results:I Rand statistic values are similar to the results of Distance Matrix

VisualizationI DOM distance reaches highest values with DB clustering

I the distances between together belonging objects are varying in themetrical space derived by DOM distance

I DB clustering is able to find clusters of different densities

16 / 19

Page 38: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Conclusion and Future work

Conclusion and Future Work

17 / 19

Page 39: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Conclusion and Future work

Conclusion

I Web Page Segmentation by Clustering was presented

I three different distance function for web objects based on geometric,semantic and DOM properties

I three clustering methods from different categories: partitioning,hierarchical and density-based clustering

I best clustering accordance to ground truth with DOM-Distance andDB clustering

Future Work

I combination of distance measure (linear, multiplicative, ?)

I comparison to other Web Page Segmentation methods from literature

I application of Web Page Segmentation to Web Image ContextExtraction (paper accepted, to be published)

18 / 19

Page 40: Page Segmentation by Web Content Clusteringwims.vestforsk.no/slides/alcic.pdf · Web Page Segmentation by Clustering Distance functions for web contents Di erent Representations of

WIMS ’11

Conclusion and Future work

Thank You!

19 / 19