CLEI 2007 1
Measuring Contribution of HTML Features in Web Document Clustering
Oldemar Rodríguez
School of Mathematics, UCR
and Predisoft
Esteban Meneses
Computing Research Center, ITCR
Motivation
Which HTML feature contributes most to good clustering results?
Using symbolic objects to cluster web documents.
15th World Wide Web Conference (2006)
HTML Document Clustering
Find meaningful groups from a web document collection.
Effectively represent web document clusters for further analysis.
HTML Document
Classical Representations
• Different approaches for representing a web document.
<5,22,19,4,...,38>
Vectorial Representation
• Every document is represented by a vector in n-dimensional space.
• Bag of words scheme. Each variable represents the relative weight of a term in the document.
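The bag-of-words scheme above can be illustrated with a minimal sketch. It assumes whitespace tokenization and relative term frequency as the weight; the slides do not state which weighting the authors used, and the function name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors); each vector holds relative term frequencies."""
    # Assumption: lowercase, whitespace-separated tokens.
    tokenized = [doc.lower().split() for doc in documents]
    # One vector dimension per distinct term in the collection.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors

docs = ["web document clustering", "symbolic document clustering of web pages"]
vocabulary, vectors = bag_of_words(docs)
```

Each document becomes a point in a space with one axis per vocabulary term, which is the representation that the symbolic approach later generalizes.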
Symbolic Objects
• Real-life objects are too complex to be represented by points in a vector space. [Bock & Diday, 2000]
• Symbolic objects overcome this limitation by representing concepts rather than individuals.
• In a symbolic data array each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.
Symbolic Data Table
From relational data bases to symbolic data bases

Multivariate Numeric Analysis (Data: millions…)

Individual  Age  Profession  Wage      Location
3457        36   Lawyer      2,500.00  San José
1251        28   Teacher     1,750.00  Alajuela
3245        39   Doctor      2,400.00  San José
7635        33   Teacher     1,900.00  Alajuela
3245        35   Engineer    1,850.00  Alajuela
5367        27   Engineer    1,900.00  Heredia
6486        34   Manager     1,600.00  Heredia

Multivariate Symbolic Analysis (Concepts: hundreds…)

Concept   Age      Profession          Wage (thousands)
San José  [36,39]  {Law 50%, Doc 50%}  [2.4 – 2.5]
Alajuela  [28,35]  {Tea 66%, Eng 33%}  [1.75 – 1.9]
Heredia   [27,34]  {Eng 50%, Mgr 50%}  [1.6 – 1.9]
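The step from individuals to concepts can be sketched as a simple aggregation by location. This is only an illustration under assumptions: ages and wages are summarized as [min, max] intervals and professions as relative-frequency histograms, which matches the values in the symbolic table, but the function name `to_symbolic` and the dictionary layout are hypothetical.

```python
from collections import Counter, defaultdict

# Rows copied from the table of individuals: (age, profession, wage, location).
individuals = [
    (36, "Lawyer", 2500.0, "San José"),
    (28, "Teacher", 1750.0, "Alajuela"),
    (39, "Doctor", 2400.0, "San José"),
    (33, "Teacher", 1900.0, "Alajuela"),
    (35, "Engineer", 1850.0, "Alajuela"),
    (27, "Engineer", 1900.0, "Heredia"),
    (34, "Manager", 1600.0, "Heredia"),
]

def to_symbolic(rows):
    """Aggregate individuals into one symbolic object per location."""
    groups = defaultdict(list)
    for age, profession, wage, location in rows:
        groups[location].append((age, profession, wage))
    table = {}
    for location, members in groups.items():
        ages = [m[0] for m in members]
        wages = [m[2] for m in members]
        counts = Counter(m[1] for m in members)
        table[location] = {
            "age": (min(ages), max(ages)),            # interval variable
            "profession": {p: c / len(members)        # histogram variable
                           for p, c in counts.items()},
            "wage": (min(wages), max(wages)),         # interval variable
        }
    return table

symbolic = to_symbolic(individuals)
```

Seven rows collapse into three concepts, which is the kind of reduction the "millions to hundreds" remark refers to.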
Symbolic Data Base

Data base             Knowledge  Size
Relational Data Base  100%       15 Gigabytes
Symbolic Data Base    90%        10.3 Megabytes
Symbolic Representations
• A complex representation that takes into account: term frequency, word order and phrases.
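As a rough illustration of the term-frequency part of such a representation, the sketch below builds one term histogram per HTML feature using Python's standard `html.parser`. The feature names follow those reported in the experiments (text, title, bold, anchor, header); the tag-to-feature mapping and the class name `FeatureExtractor` are assumptions, and word order and phrases are not modeled here.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to feature names (an assumption).
TAG_TO_FEATURE = {"title": "title", "b": "bold", "strong": "bold",
                  "a": "anchor", "h1": "header", "h2": "header", "h3": "header"}

class FeatureExtractor(HTMLParser):
    """Collect one term histogram per HTML feature."""

    def __init__(self):
        super().__init__()
        self.open_features = []  # features of the currently open tags
        self.histograms = {name: Counter() for name in
                           ("text", "title", "bold", "anchor", "header")}

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_FEATURE:
            self.open_features.append(TAG_TO_FEATURE[tag])

    def handle_endtag(self, tag):
        # Simplification: assumes well-nested markup.
        if tag in TAG_TO_FEATURE and self.open_features:
            self.open_features.pop()

    def handle_data(self, data):
        terms = data.lower().split()
        # Simplification: every term counts as text, including title terms.
        self.histograms["text"].update(terms)
        for feature in set(self.open_features):
            self.histograms[feature].update(terms)

def symbolic_document(html):
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.histograms

page = ("<html><head><title>Web Clustering</title></head><body>"
        "<h1>Symbolic objects</h1><p>Clustering <b>web</b> documents with "
        "<a href='#'>symbolic objects</a>.</p></body></html>")
doc = symbolic_document(page)
```

Each document is then described by several histogram-valued variables instead of a single vector, so the contribution of each feature can be weighted separately.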
The K-Means Clustering Method
[Figure: four scatter-plot panels, each with both axes running from 0 to 10.]
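For reference, the classical K-means iteration on numeric points can be sketched as follows. This is the standard textbook procedure (assign each point to its nearest center, then recompute each center as the mean of its members), not the symbolic variant discussed in the talk; the function names and the sample points are illustrative.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    """Return (labels, centers) after running K-means until labels stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct input points
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = mean_point(members)
    return labels, centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, k=2)
```

The update step relies on a mean, which is exactly what becomes problematic when the variables are intervals, histograms, or graphs rather than numbers.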
But there are some problems…
Distance Measures
Theorem: Fisher's Equality
• Total inertia = Between-class inertia + Within-class inertia
[Figure: scatter plot with both axes running from 0 to 10.]
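Fisher's equality can be checked numerically for any partition of numeric points. The sketch below assumes equal weights 1/n for all points and squared Euclidean distance; the function name `inertia_decomposition` and the sample data are illustrative.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def inertia_decomposition(points, labels):
    """Return (total, between, within) inertia for a labeled set of points."""
    n = len(points)
    g = mean_point(points)  # global center of gravity
    total = sum(squared_distance(p, g) for p in points) / n
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        g_k = mean_point(members)  # center of gravity of the class
        between += len(members) / n * squared_distance(g_k, g)
        within += sum(squared_distance(p, g_k) for p in members) / n
    return total, between, within

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = [0, 0, 0, 1, 1, 1]
total, between, within = inertia_decomposition(points, labels)
```

Because the total is fixed for a given data set, minimizing the within-class inertia is equivalent to maximizing the between-class inertia.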
Problems in the symbolic case:
1. A class is represented by its center of gravity, that is, by its vector of means.
2. What is the center of gravity?
Evaluation Criteria
1. Rand Index
2. Mutual Information
3. F-Measure
4. Entropy
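Two of these criteria can be sketched briefly. The definitions below are common textbook forms and may differ in detail from the ones used in the experiments: the Rand Index is the fraction of document pairs on which the clustering and the reference classes agree, and the entropy is the size-weighted average, over clusters, of the base-2 entropy of the class distribution inside each cluster. Function names and sample labels are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log2

def rand_index(true_labels, cluster_labels):
    """Fraction of pairs grouped consistently by both labelings."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agreements = sum(
        (true_labels[i] == true_labels[j]) == (cluster_labels[i] == cluster_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

def clustering_entropy(true_labels, cluster_labels):
    """Size-weighted average class entropy per cluster (lower is better)."""
    n = len(true_labels)
    result = 0.0
    for cluster in set(cluster_labels):
        classes = [t for t, c in zip(true_labels, cluster_labels) if c == cluster]
        counts = Counter(classes)
        h = -sum((m / len(classes)) * log2(m / len(classes))
                 for m in counts.values())
        result += len(classes) / n * h
    return result

truth = ["a", "a", "a", "b", "b", "b"]
found = [0, 0, 1, 1, 1, 1]
ri = rand_index(truth, found)
ent = clustering_entropy(truth, found)
```

A clustering that reproduces the reference classes exactly gives a Rand Index of 1 and an entropy of 0.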
Experiments
WebKB

Feature  Contribution
Text     0.2894
Title    0.2584
Bold     0.0379
Anchor   0.1689
Header   0.1009
Graph    0.1229
Tree     0.0212

20 Newsgroups

Feature  Contribution
Text     0.7035
Graph    0.2515
Tree     0.0449
Conclusions
• Symbolic representations are richer and more flexible than classical representations.
• The text of an HTML document seems to be the most important factor for clustering HTML documents.
Thank you!