exploring a hybrid of support vector machines (svms) and a heuristic based system in classifying web...

Exploring a Hybrid of Support Vector Machines (SVMs) and a Heuristic Based System in Classifying Web Pages

Santa Clara, California, USA

Ahmad Rahman, Yuliya Tarnikova and Hassan Alam

Why Classify Web Pages?

Web Page Classification is often a Pre-processing Stage in a Number of Applications

• Web Search
• Web Page Summarization
• Display of Web Pages in Small Screen Devices
• Archiving Web Pages
• Format Conversion from HTML to other formats

Why Classify Web Pages?

• Specific Algorithm
• Different way to apply Specific parameters
• Local Optimizations

Web Pages → Classify Web Pages → Apply Specific Algorithm

What Makes Web Pages Different from Each Other?

• Type of Content
– Banking and Finance
– Programming Language
– Science
– Sport
– Others?

• Manifestation
– Linguistic Difference

M. Sinha and D. Corne. A large benchmark dataset for web document clustering. Int. Conf. on Hybrid Intelligence Systems, 2002.

Sports Page

Programming Page

Banking/Finance Page

What is this?

How Do We Use Web Classes?

• Do people writing a web page on banking/finance do it differently than people writing a sports page?

• We know there will be linguistic differences, but will there be structural differences as well?

• If there are differences, how do we characterize them?

Alternate Definitions?

• Intent of the Web Page
– What is the Main Message?
  • Convey Information?
  • Help in Locating Information?
  • Allow Specific Requests to be processed?
– Manifestation
  • Text/Link Mapping
  • Specific Task Oriented tagset

Example 1: Informative Web Page (Primarily Textual Content)

Example 2: Locating Information (Primarily Links)

Example 3: Facilitator (Large Chunks of Forms)

Non-Linguistic Features: Structural and Hierarchical Information

• Number of large-story-type columns
• Largest number of forms in one column
• Text size
• Number of links
• Number of images
• Number of columns with forms
• ... and others.
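A minimal sketch of how a few of the page-level counts listed above (text size, number of links, number of images, number of forms) might be gathered from raw HTML, using Python's standard `html.parser`. The class and function names are hypothetical, the choice to measure text size in characters is an assumption, and the column-based features are not attempted here because they require layout analysis that the slides do not describe.

```python
from html.parser import HTMLParser


class StructuralFeatureCounter(HTMLParser):
    """Counts simple structural features while parsing an HTML page."""

    def __init__(self):
        super().__init__()
        self.num_links = 0
        self.num_images = 0
        self.num_forms = 0
        self.text_size = 0      # assumption: measured in characters
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.num_links += 1
        elif tag == "img":
            self.num_images += 1
        elif tag == "form":
            self.num_forms += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies; count only visible text.
        if self._skip_depth == 0:
            self.text_size += len(data.strip())


def extract_features(html):
    """Return a dict of structural counts for one HTML document."""
    parser = StructuralFeatureCounter()
    parser.feed(html)
    parser.close()
    return {
        "text_size": parser.text_size,
        "num_links": parser.num_links,
        "num_images": parser.num_images,
        "num_forms": parser.num_forms,
    }
```

A feature dictionary like this could then be turned into a fixed-order numeric vector before being passed to a classifier.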

Support Vector Machine

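• Structural Risk Minimization
– Vapnik-Chervonenkis (VC) Dimension $h$
  - Property of a set of functions $\{f(\alpha)\}$
  - Maximum number of training points that can be shattered by $\{f(\alpha)\}$
  - Example: the VC dimension of the set of oriented lines (hyperplanes) in $\mathbb{R}^n$ is $h = n + 1$
– VC Theory provides bounds on the test error, which depend on both the empirical risk and the capacity of the function class:

$$R(\alpha) \le R_{emp}(\alpha) + \Phi\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right)$$

$$\Phi\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right) = \sqrt{\frac{h\left(\log\frac{2l}{h} + 1\right) - \log\frac{\eta}{4}}{l}}$$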
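To make the VC bound concrete: in its usual form, with probability 1 − η the test error is at most the empirical risk plus a confidence term sqrt((h(log(2l/h) + 1) − log(η/4)) / l), where h is the VC dimension and l the number of training points. The sketch below simply evaluates that term; the function names are hypothetical and the numbers in the usage note are illustrative, not taken from the experiments.

```python
import math


def vc_confidence(h, l, eta):
    """Confidence term of the VC bound.

    h   -- VC dimension of the function class
    l   -- number of training points
    eta -- the bound holds with probability 1 - eta
    """
    if h <= 0 or l <= 0:
        raise ValueError("h and l must be positive")
    if not 0 < eta <= 1:
        raise ValueError("eta must lie in (0, 1]")
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)


def risk_bound(empirical_risk, h, l, eta):
    """Upper bound on the test error: empirical risk plus confidence term."""
    return empirical_risk + vc_confidence(h, l, eta)
```

For example, `vc_confidence(3, 100, 0.05)` is noticeably larger than `vc_confidence(3, 1000, 0.05)`: for a fixed capacity h, the bound tightens as the training set grows.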

Hyperplane Classification

$$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1$$

$$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1$$
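The two hyperplane constraints can be combined into the single condition y_i (w · x_i + b) ≥ 1. The following sketch checks that condition for a given hyperplane and classifies a new point by the sign of w · x + b. The function names and the toy data are assumptions made for illustration.

```python
def decision_value(w, x, b):
    """Signed value w . x + b for one point."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Label +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def satisfies_margin(w, b, points, labels):
    """True if every training point meets y_i * (w . x_i + b) >= 1."""
    return all(y * decision_value(w, x, b) >= 1
               for x, y in zip(points, labels))
```

With `w = [1.0, 0.0]` and `b = 0.0`, the points `(2, 0)` labelled +1 and `(-2, 0)` labelled -1 satisfy both constraints, whereas a +1 point at `(0.5, 0)` falls inside the margin.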

SVM Implementation

We used SVMlight, an implementation of Vapnik's Support Vector Machine [1] for the problem of pattern recognition. The optimization algorithm used in SVMlight is described in [2].

[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

[2] T. Joachims. "Making Large-Scale SVM Learning Practical". In Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.). MIT Press, 1999.
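SVMlight reads training examples as text lines of the form `<target> <feature>:<value> ...`, with feature numbers starting at 1 and listed in increasing order. As a sketch of how page feature vectors might be serialized for it, the helper below (a hypothetical name, not part of SVMlight) writes one such line and omits zero-valued features. How the authors actually prepared their input is not stated in the slides.

```python
def to_svmlight_line(label, features):
    """Format one example for an SVMlight-style input file.

    label    -- +1 or -1 (binary classification target)
    features -- sequence of numeric feature values; position k in the
                sequence becomes feature number k + 1
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):
        if value != 0:  # sparse format: zero-valued features are left out
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)
```

Writing one line per page to a file would give a training file for one pair-wise classifier.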

Initial Experiment

Database: 200 Randomly Selected Web Pages
Training Database: 100
Test Database: 100

Classes:
1. Story Pages
2. Reference Pages
3. Form Pages

SVM Performance:
On Training Data: 95%
On Test Data: 87%

SVM: Dot Product, Pair-wise
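A pair-wise SVM scheme is commonly implemented by training one binary classifier per pair of classes and letting the classifiers vote. The sketch below assumes that reading. The function name and the dictionary of decision functions are hypothetical, and ties are reported rather than resolved.

```python
from collections import Counter
from itertools import combinations


def pairwise_vote(x, classes, pair_classifiers):
    """Combine pair-wise binary classifiers by majority vote.

    pair_classifiers maps a pair (a, b) to a decision function whose
    value is >= 0 when x looks like class a and < 0 when it looks like b.
    Returns (winning class or None on a tie, vote counts).
    """
    votes = Counter({c: 0 for c in classes})
    for a, b in combinations(classes, 2):
        winner = a if pair_classifiers[(a, b)](x) >= 0 else b
        votes[winner] += 1
    top = max(votes.values())
    leaders = [c for c in classes if votes[c] == top]
    return (leaders[0] if len(leaders) == 1 else None), votes
```

With three classes there are three pair-wise classifiers, and a cyclic outcome (each class winning exactly one pairing) leaves every class with one vote, so the function returns `None` for the label.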

Hybridization

[Diagram: All Web Pages → Heuristic-Based Method → Forms / Non-Forms; Non-Forms → SVM → References / Stories]
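The hybrid arrangement can be sketched as a two-stage function: a heuristic first separates form pages, and an SVM then decides between story and reference pages for everything else. Both callables and the sign convention of the SVM output are assumptions made for illustration.

```python
def classify_page(page, is_form_page, svm_story_vs_reference):
    """Two-stage hybrid classification.

    is_form_page           -- heuristic predicate (first stage)
    svm_story_vs_reference -- decision function (second stage); assumed
                              to be >= 0 for story and < 0 for reference
    """
    if is_form_page(page):
        return "form"
    return "story" if svm_story_vs_reference(page) >= 0 else "reference"
```

Removing one class in the first stage leaves the SVM with a single two-class decision, so no voting among pair-wise classifiers is needed.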

Form Separation Heuristics

Defining the Form Probability Score (FPS) as

$$F = \sum_{\text{all forms } i} f(i)\, w(i),$$

where the individual form score is

$$f(i) = 0.2 \cdot \#(\text{submits and resets}) + 0.5 \cdot \#(\text{radio buttons and check boxes}) + \#(\text{all other active fields}),$$

and defining the "Weight" $w(i)$ for the form as the following:

$$w(i) = \begin{cases} f(i), & f(i) \in [0, 2] \\ 2 + (f(i) - 2)/2, & f(i) \in (2, 4] \\ 3 + (f(i) - 4)/4, & f(i) \in (4, 6] \\ 3.5, & f(i) > 6 \end{cases}$$

Based on these two parameters, a web page is a form if: the size of the text preceding the first form is less than 300, and F / (#links) > 0.25 and F / (#text) > 0.01.
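The form-separation rule can be sketched directly from these definitions. The function names are hypothetical. The third weight band is taken to cover form scores between 4 and 6, which makes the weight continuous at 2, 4 and 6. Treating a page with no links or no text as passing the corresponding ratio test is an assumption, since the slides do not say how a zero denominator is handled.

```python
import math


def form_score(n_submit_reset, n_radio_checkbox, n_other_fields):
    """Individual form score f(i)."""
    return 0.2 * n_submit_reset + 0.5 * n_radio_checkbox + n_other_fields


def form_weight(f):
    """Piecewise weight w(i); grows more slowly as f increases, capped at 3.5."""
    if f <= 2:
        return f
    if f <= 4:
        return 2 + (f - 2) / 2
    if f <= 6:
        return 3 + (f - 4) / 4
    return 3.5


def form_probability_score(forms):
    """F = sum over all forms of f(i) * w(i).

    forms -- iterable of (n_submit_reset, n_radio_checkbox, n_other_fields)
    """
    total = 0.0
    for counts in forms:
        f = form_score(*counts)
        total += f * form_weight(f)
    return total


def _ratio(score, denominator):
    # Assumption: a zero denominator counts as an infinitely large ratio
    # when the score is positive, and as zero otherwise.
    if denominator > 0:
        return score / denominator
    return math.inf if score > 0 else 0.0


def is_form_page(forms, text_before_first_form, n_links, text_size):
    """Apply the three thresholds: 300, 0.25 and 0.01."""
    score = form_probability_score(forms)
    return (text_before_first_form < 300
            and _ratio(score, n_links) > 0.25
            and _ratio(score, text_size) > 0.01)
```

For a page with a single form containing one submit button, two check boxes and three other fields, f = 4.2, w = 3.05 and F = 12.81. With 10 links and a text size of 500, both ratios (1.281 and about 0.0256) exceed their thresholds.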

New Experiment (1)

Training Set: First 100
Test Set: Last 100

First Stage (Heuristics): 100% on Train and Test Data

Second Stage, On Training Data: 97% Correct

            Story   Reference
Story         42        1
Reference      1       46

Second Stage, On Test Data: 90% Correct

            Story   Reference
Story         41        1
Reference      4       37

Combined:
Training: 98%
Test: 95%
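One way to relate a second-stage confusion matrix to the combined figure is sketched below. It assumes that every page of the 100-page set that does not appear in the matrix was a form page handled by the first stage, which is reported as 100% accurate. That reading is an inference from the counts, not something the slides state, and the function names are hypothetical.

```python
def matrix_accuracy(matrix):
    """Fraction of correctly classified pages in a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def combined_accuracy(matrix, set_size, stage_one_accuracy=1.0):
    """Accuracy over the whole set when stage one handles the remaining pages.

    Assumption: pages not counted in the matrix were decided by stage one.
    """
    stage_two_total = sum(sum(row) for row in matrix)
    stage_two_correct = sum(matrix[i][i] for i in range(len(matrix)))
    stage_one_total = set_size - stage_two_total
    return (stage_one_accuracy * stage_one_total + stage_two_correct) / set_size
```

Applied to the first matrix of New Experiment (1), `[[42, 1], [1, 46]]`, with a set size of 100, this gives 88 correct out of 90 in the second stage and a combined accuracy of 0.98.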

New Experiment (2)

Training Set: Last 100
Test Set: First 100

First Stage (Heuristics): 100% on Train and Test Data

Second Stage, On Training Data: 98% Correct

            Story   Reference
Story         40        1
Reference      0       42

Second Stage, On Test Data: 90% Correct

            Story   Reference
Story         39        4
Reference      5       42

Combined:
Training: 99%
Test: 91%

Average Accuracy

Hybrid
On Training Data: 98.5%

Pair-wise SVM
On Training Data: 95%
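The hybrid training figure is consistent with a plain mean of the combined training accuracies reported for the two experiments (98% and 99%). The short sketch below reproduces that arithmetic and applies the same averaging to the combined test accuracies (95% and 91%). The function name is hypothetical.

```python
def average_accuracy(percentages):
    """Arithmetic mean of per-experiment accuracies, in percent."""
    return sum(percentages) / len(percentages)


hybrid_training = average_accuracy([98, 99])  # 98.5
hybrid_test = average_accuracy([95, 91])      # 93.0
```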


Future Work?

• We want to correlate different types of pages (structure) with respect to linguistic differences

• We want to characterize the structural features we used with respect to purely linguistic features

• We want to quantify the improvement in a secondary process due to the success or failure of the web classification process

Conclusion

• SVM is a very effective solution for web page classification
• Often the pre-defined number of web classes is small
• Heuristics, if correctly applied, can be very useful in boosting the SVM ensemble
• For a problem of more than three classes, heuristics can be applied in sequence
• For problems of more than three classes, solving ties of the pair-wise classifiers becomes a major problem; this is addressed in a later paper (MCS2003)
• Current applications of this include web page summarization and re-authoring
