Exploring a Hybrid of Support Vector Machines (SVMs) and a
Heuristic Based System in Classifying Web Pages
Santa Clara, California, USA
Ahmad Rahman, Yuliya Tarnikova and Hassan Alam
Why Classifying Web Pages?
Web Page Classification is often a Pre-processing Stage in a Number of Applications
• Web Search
• Web Page Summarization
• Display of Web Pages in Small Screen Devices
• Archiving Web Pages
• Format Conversion from HTML to other formats
Why Classifying Web Pages?
• Specific Algorithm
• Different way to apply Specific parameters
• Local Optimizations
[Diagram: Web Pages → Classify Web Pages → Apply Specific Algorithm]
What Makes Web Pages Different from Each Other?
• Type of Content
  – Banking and Finance
  – Programming Language
  – Science
  – Sport
  – Others?
• Manifestation
  – Linguistic Difference
M. Sinha and D. Corne. A large benchmark dataset for web document clustering. Int. Conf. on Hybrid Intelligence Systems, 2002.
Sports Page
Programming Page
Banking/Finance Page
What is this?
How Do We Use Web Classes?
• Do people writing a web page on banking/finance do it differently than people writing a sports page?
• We know there will be linguistic differences, but will there be structural differences as well?
• If there are differences, how do we characterize them?
Alternate Definitions?
• Intent of the Web Page
  – What is the Main Message?
    • Convey Information?
    • Help in Locating Information?
    • Allow Specific Requests to be processed?
  – Manifestation
    • Text/Link Mapping
    • Specific Task Oriented tagset
Example 1: Informative Web Page (Primarily Textual Content)
Example 2: Locating Information (Primarily Links)
Example 3: Facilitator (Large Chunks of Forms)
Non-Linguistic Features: Structural and Hierarchical Information
• Number of large-story-type columns
• Largest number of forms in one column
• Text size
• Number of links
• Number of images
• Number of columns with forms
• ...and others.
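As a rough illustration of how some of these structural features could be counted, the sketch below uses Python's standard `html.parser`. It is not the extraction code used in this work: the class name `StructuralFeatureCounter`, the function `structural_features`, and the definition of text size as the number of visible characters are assumptions made for the example, and the column-based features are left out because they depend on layout analysis the slides do not describe.

```python
from html.parser import HTMLParser


class StructuralFeatureCounter(HTMLParser):
    """Counts a few non-linguistic features of an HTML page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.images = 0
        self.forms = 0
        self.text_size = 0      # assumption: visible characters, whitespace-trimmed
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1
        elif tag == "img":
            self.images += 1
        elif tag == "form":
            self.forms += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script and style bodies are not visible text, so they are skipped.
        if self._skip_depth == 0:
            self.text_size += len(data.strip())


def structural_features(html):
    """Return a dict of simple structural counts for one HTML document."""
    parser = StructuralFeatureCounter()
    parser.feed(html)
    parser.close()
    return {
        "links": parser.links,
        "images": parser.images,
        "forms": parser.forms,
        "text_size": parser.text_size,
    }


sample = (
    '<html><body><p>Hello world</p><a href="/x">go</a>'
    '<img src="a.png"><form><input type="text"></form></body></html>'
)
features = structural_features(sample)
```

The resulting dictionary can be flattened into a fixed-order numeric vector before it is passed to a classifier.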
Support Vector Machine
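• Structural Risk Minimization
  – Vapnik-Chervonenkis (VC) Dimension
    - Property of a set of functions $\{f(\alpha)\}$
    - Maximum number of training points that can be shattered by $\{f(\alpha)\}$
    - Example: the VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ (oriented lines for $n = 2$) is $h = n + 1$
  – VC Theory provides bounds on the test error, which depend on both the empirical risk and the capacity of the function class. With probability $1 - \eta$, for $l$ training points:

$$R(\alpha) \leq R_{emp}(\alpha) + \Phi\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right)$$

$$\Phi\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right) = \sqrt{\frac{h\left(\log\frac{2l}{h} + 1\right) - \log\frac{\eta}{4}}{l}}$$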
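To give a feel for how the capacity term of the VC bound behaves, the sketch below evaluates the confidence term sqrt((h(log(2l/h) + 1) − log(η/4)) / l) for a VC dimension h, l training points and confidence parameter η. The function names `vc_confidence` and `risk_bound` are chosen for this example only, and the restriction h ≤ l is a simplifying assumption that keeps the square root well defined.

```python
import math


def vc_confidence(h, l, eta):
    """Capacity term of the VC bound: sqrt((h(log(2l/h) + 1) - log(eta/4)) / l)."""
    if not (0 < h <= l):
        raise ValueError("this sketch assumes 0 < h <= l")
    if not (0 < eta <= 1):
        raise ValueError("eta must lie in (0, 1]")
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)


def risk_bound(empirical_risk, h, l, eta):
    """Upper bound on the expected risk, holding with probability 1 - eta."""
    return empirical_risk + vc_confidence(h, l, eta)


# The term grows with capacity h and shrinks as more training points are seen.
small_capacity = vc_confidence(3, 100, 0.05)
large_capacity = vc_confidence(10, 100, 0.05)
more_data = vc_confidence(3, 1000, 0.05)
```

For small training sets the bound is typically loose; its value here is in showing the trade-off between empirical risk and capacity rather than in predicting the test error.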
Hyperplane Classification
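$$\mathbf{w} \cdot \mathbf{x}_i + b \geq +1 \quad \text{for } y_i = +1$$

$$\mathbf{w} \cdot \mathbf{x}_i + b \leq -1 \quad \text{for } y_i = -1$$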
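The two hyperplane constraints can be combined into the single condition y_i (w · x_i + b) ≥ 1. The sketch below checks that condition for a hand-picked hyperplane; the weight vector, bias and sample points are made-up values for illustration, not parameters learned in this work.

```python
def dot(w, x):
    """Plain dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def satisfies_margin(w, b, x, y):
    """True if y * (w.x + b) >= 1, the combined form of both constraints."""
    return y * (dot(w, x) + b) >= 1


def classify(w, b, x):
    """Sign of the decision function: +1 on or above the hyperplane, else -1."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical separating hyperplane and labelled points.
w, b = [1.0, -1.0], 0.0
inside_positive = satisfies_margin(w, b, [2.0, 0.0], +1)   # 2.0 >= 1
inside_negative = satisfies_margin(w, b, [0.0, 2.0], -1)   # -1 * -2.0 >= 1
too_close = satisfies_margin(w, b, [0.5, 0.0], +1)         # 0.5 < 1
```

A point such as `[0.5, 0.0]` is still classified correctly by `classify`, but it lies inside the margin and so violates the constraint.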
SVM Implementation
We adopted SVMlight, an implementation of Vapnik's Support Vector Machine [1] for the problem of pattern recognition. The optimization algorithm used in SVMlight is described in [2].
[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[2] T. Joachims. "Making Large-Scale SVM Learning Practical". In Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.). MIT Press, 1999.
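SVMlight reads training examples as text lines of the form `<target> <index>:<value> ...`, with 1-based feature indices in increasing order. A minimal sketch of writing one structural feature vector in that format follows; the helper name `to_svmlight_line` and the example feature values are hypothetical, and zero-valued features are omitted on the assumption that the sparse format treats missing indices as zero.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVMlight input line.

    label: +1 or -1 (binary classification target).
    features: dense sequence of numbers; position k becomes feature index k + 1.
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    parts = ["+1" if label == 1 else "-1"]
    for index, value in enumerate(features, start=1):
        if value != 0:  # sparse format: zero entries are left out
            parts.append(f"{index}:{value}")
    return " ".join(parts)


# Hypothetical page: 3 links, 0 images, 1250 characters of text, 42 in a fourth feature.
line = to_svmlight_line(1, [3, 0, 1250, 42])
```

Each pair-wise classifier would get its own training file, with +1 and -1 assigned to the two classes being separated.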
Initial Experiment
Database: 200 Randomly Selected Web Pages
Training Database: 100
Test Database: 100
Classes:
1. Story Pages
2. Reference Pages
3. Form Pages
SVM: Dot Product, Pair-wise
SVM Performance:
On Training Data: 95%
On Test Data: 87%
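A pair-wise scheme trains one binary classifier per pair of classes and lets them vote. The sketch below shows the voting step with dot-product (linear) decision functions; the class names follow the slide, but the weights are made-up placeholders rather than trained models, and returning `None` on a tie is an assumption made for this example.

```python
from collections import Counter


def linear_decision(w, b, x):
    """Dot-product kernel decision value w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def pairwise_vote(x, models):
    """Majority vote over pair-wise classifiers.

    models maps (class_a, class_b) -> (w, b); a non-negative decision
    value votes for class_a, a negative one for class_b.
    Returns the winning class, or None when the top vote is tied.
    """
    votes = Counter()
    for (class_a, class_b), (w, b) in models.items():
        winner = class_a if linear_decision(w, b, x) >= 0 else class_b
        votes[winner] += 1
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between pair-wise classifiers
    return ranked[0][0]


# Placeholder models over a 2-D feature vector.
models = {
    ("story", "reference"): ([1.0, 0.0], 0.0),
    ("story", "form"): ([1.0, 0.0], 0.0),
    ("reference", "form"): ([0.0, 1.0], 0.0),
}
clear_winner = pairwise_vote([1.0, 1.0], models)

# A cyclic set of decisions gives every class one vote: a tie.
cyclic = {
    ("story", "reference"): ([1.0, 0.0], 0.0),
    ("story", "form"): ([-1.0, 0.0], 0.0),
    ("reference", "form"): ([0.0, 1.0], 0.0),
}
tied = pairwise_vote([1.0, 1.0], cyclic)
```

With three classes there are three pair-wise classifiers, so a three-way tie is possible whenever their decisions form a cycle.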
Hybridization
[Diagram: All Web Pages → Heuristic-Based Method → Forms / Non-Forms; Non-Forms → SVM → References / Stories]
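The hybrid can be read as a two-stage cascade: a heuristic first separates form pages, and an SVM then splits the remaining pages into stories and references. The sketch below captures only that control flow; `is_form_page` and `svm_is_story` are hypothetical callables standing in for the heuristic and the trained SVM, and the dictionary-based page representation is an assumption for the example.

```python
def classify_page(page, is_form_page, svm_is_story):
    """Two-stage cascade: heuristic first, SVM only for non-form pages."""
    if is_form_page(page):
        return "form"
    return "story" if svm_is_story(page) else "reference"


# Stand-in stages operating on a dict of precomputed features.
def toy_form_check(page):
    return page["form_fields"] > 5


def toy_svm(page):
    return page["text_size"] > page["links"] * 50


label_a = classify_page({"form_fields": 8, "text_size": 100, "links": 3},
                        toy_form_check, toy_svm)
label_b = classify_page({"form_fields": 0, "text_size": 4000, "links": 10},
                        toy_form_check, toy_svm)
label_c = classify_page({"form_fields": 0, "text_size": 300, "links": 60},
                        toy_form_check, toy_svm)
```

Because the first stage removes one class, the SVM only has to solve a two-class problem, which avoids ties among pair-wise classifiers.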
Form Separation Heuristics
Defining the Form Probability Score (FPS) as F = ∑ over all forms i of f(i) * w(i),
where the individual form score is
f(i) = #(submits & resets) * 0.2 + #(radio buttons and check boxes) * 0.5 + #(all other active fields),
and the "Weight" w(i) for the form is:
w(i) = f(i), if f(i) ∈ [0, 2],
w(i) = 2 + (f(i) – 2)/2, if f(i) ∈ (2, 4],
w(i) = 3 + (f(i) – 4)/4, if f(i) ∈ (4, 6],
w(i) = 3.5, if f(i) > 6.
Based on these two parameters, a web page is a form if:
the size of the text preceding the first form is less than 300,
and F / (#links) > 0.25,
and F / (#text) > 0.01.
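The form heuristic can be sketched directly from these definitions. The code below reads the third weight interval as (4, 6], which keeps w continuous at 4 and 6. The function names are chosen for this example, each form is represented as a tuple of three counts, and comparing F against threshold × count (instead of dividing) is an assumption introduced here so that pages with zero links or zero text do not cause a division by zero.

```python
def form_score(submits_and_resets, radios_and_checkboxes, other_fields):
    """Individual form score f(i)."""
    return 0.2 * submits_and_resets + 0.5 * radios_and_checkboxes + other_fields


def form_weight(f):
    """Piecewise-linear weight w(i) as a function of f(i), capped at 3.5."""
    if f <= 2:
        return f
    if f <= 4:
        return 2 + (f - 2) / 2
    if f <= 6:
        return 3 + (f - 4) / 4
    return 3.5


def form_probability_score(forms):
    """F = sum over forms of f(i) * w(i); forms is a list of 3-tuples of counts."""
    total = 0.0
    for counts in forms:
        f = form_score(*counts)
        total += f * form_weight(f)
    return total


def is_form_page(forms, text_before_first_form, link_count, text_size):
    """Apply the three conditions from the slide to one page."""
    F = form_probability_score(forms)
    return (
        text_before_first_form < 300
        and F > 0.25 * link_count   # same as F / #links > 0.25 when #links > 0
        and F > 0.01 * text_size    # same as F / #text > 0.01 when #text > 0
    )


# One form with 1 submit button, 2 radio buttons and 3 other fields.
page_is_form = is_form_page([(1, 2, 3)], 100, 20, 500)
late_form = is_form_page([(1, 2, 3)], 400, 20, 500)
```

The unit of the 300 threshold and of the text size is not stated on the slide, so the caller has to supply both in the same unit.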
New Experiment (1)
Training Set: First 100
Test Set: Last 100
First Stage (Heuristics): 100% on Train and Test Data
Second Stage, On Training Data: 97% Correct
            Story  Reference
Story          42          1
Reference       1         46
Second Stage, On Test Data: 90% Correct
            Story  Reference
Story          41          1
Reference       4         37
Combined:
Training: 98%
Test: 95%
New Experiment (2)
Training Set: Last 100
Test Set: First 100
First Stage (Heuristics): 100% on Train and Test Data
Second Stage, On Training Data: 98% Correct
            Story  Reference
Story          40          1
Reference       0         42
Second Stage, On Test Data: 90% Correct
            Story  Reference
Story          39          4
Reference       5         42
Combined:
Training: 99%
Test: 91%
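The combined figures for this experiment can be reproduced from the two confusion matrices under two assumptions that the slides imply but do not state: every page missing from a second-stage matrix was a form page, and the first stage classified all of those correctly. The sketch also assumes that the 83-page matrix belongs to the training set and the 90-page matrix to the test set, which is the assignment consistent with the stated percentages. Function names are illustrative.

```python
def second_stage_accuracy(matrix):
    """Fraction of non-form pages on the diagonal of a 2x2 confusion matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


def combined_accuracy(matrix, pages=100):
    """Accuracy over all pages, counting form pages as correct (assumption)."""
    non_forms = sum(sum(row) for row in matrix)
    forms = pages - non_forms            # assumed: the rest are form pages
    correct = forms + matrix[0][0] + matrix[1][1]
    return correct / pages


train_matrix = [[40, 1], [0, 42]]   # 83 non-form pages
test_matrix = [[39, 4], [5, 42]]    # 90 non-form pages

train_combined = combined_accuracy(train_matrix)
test_combined = combined_accuracy(test_matrix)
test_second_stage = second_stage_accuracy(test_matrix)
```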
Average Accuracy
Hybrid
On Training Data: 98.5%
On Test Data: 93%
Pair-wise SVM
On Training Data: 95%
On Test Data: 87%
Future Work?
• We want to correlate different types of pages (structure) with respect to linguistic differences
• We want to characterize the structural features we used with respect to purely linguistic features
• Quantify the improvement in a secondary process due to the success/failure of the web classification process
Conclusion
• SVM is a very effective solution for web page classification
• Often the pre-defined number of web classes is small
• Heuristics, if correctly applied, can be very useful in boosting the SVM ensemble
• For a problem of more than three classes, heuristics can be applied in sequence
• For problems of more than three classes, solving ties of the pair-wise classifiers becomes a major problem; this is addressed in a later paper (MCS2003)
• Current applications of this include web page summarization and re-authoring