Identification of Relevant Sections in Web Pages Using a Machine Learning Approach
DESCRIPTION
A brief introduction to machine learning, supervised and unsupervised learning, and support vector machines (SVMs), followed by the application of a supervised algorithm, an SVM, to identify the relevant sections of web pages returned in search results.
TRANSCRIPT
Identification of Relevant Sections in Web Pages Using a Machine Learning Approach
Jerrin Shaji George
NIT Calicut
November 8, 2012
Introduction
• There is a massive amount of data available on the internet.
• Extracting only the relevant content has become very important.
• A machine learning approach is suitable because it can adapt to the rapidly changing dynamics of the internet.
Machine Learning
• The science of getting computers to act without being explicitly programmed.
• A method of teaching computers to make and improve predictions or behaviors based on some data.
• Machine learning algorithms:
  ◦ Supervised machine learning
  ◦ Unsupervised machine learning
Supervised Learning
• The machine learning task of inferring a function from labeled training data.
Figure: Supervised Learning Model (courtesy scikit-learn)
Supervised Learning
• Example of a classification problem: discrete-valued output.
Figure: Copyright © Victor Lavrenko
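As a minimal sketch of classification, the code below turns labeled points into a function that assigns a discrete label to a new point. The toy two-class dataset and the nearest-centroid rule are assumptions chosen for brevity; neither comes from the slides.

```python
from math import dist

def fit_centroids(points, labels):
    """Average the training points of each class to get one centroid per class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def predict(centroids, point):
    """Return the class whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))

# Hypothetical labeled training data: class "A" near (1.5, 1.3), class "B" near (6.5, 6.2).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]
centroids = fit_centroids(train_points, train_labels)

print(predict(centroids, (2.0, 2.0)))  # a point close to the "A" group
print(predict(centroids, (6.0, 5.0)))  # a point close to the "B" group
```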
Supervised Learning
• Example of a regression problem: continuous-valued output.
Figure: Copyright © Victor Lavrenko
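For regression, a comparable sketch fits a straight line to labeled points by ordinary least squares and then predicts a continuous value. The data points are made up for illustration and lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# The learned function outputs a continuous value for an unseen input.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```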
Unsupervised Learning
• The data has no labels. The algorithm tries to find similarities between the objects in question.
Figure: Unsupervised Learning Model (courtesy scikit-learn)
Unsupervised Learning
• Example of a clustering problem.
Figure: Copyright © Victor Lavrenko
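Clustering can be sketched with a tiny two-cluster k-means on scalar values. The data, the choice of k-means, and the deterministic seeding with the smallest and largest value are all assumptions made for this illustration.

```python
def kmeans_1d(values, iterations=10):
    """Two-cluster k-means on scalars, seeded with the smallest and largest value."""
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        # Assignment step: each value joins the cluster with the nearest center.
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabeled data with two visible groups, around 1 and around 5.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = kmeans_1d(values)
print(centers, clusters)
```

No labels are supplied; the grouping emerges only from the distances between the values.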
Support Vector Machines (SVM)
• A supervised learning model.
• Used for classification and regression analysis.
• The basic SVM:
  ◦ A non-probabilistic binary linear classifier.
  ◦ Classifies each given input into one of two possible classes, which forms the output.
The SVM Algorithm
• Inputs are formulated as feature vectors.
• The feature vectors are mapped into a feature space by using a kernel function.
• A division is computed in the feature space to optimally separate the classes of training vectors.
The SVM Algorithm
φ: the feature map induced by the kernel function
Formal Definition of SVM
• An SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space.
• It can be used for classification and regression.
• A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (called the functional margin).
Optimal Separating Hyperplane
Figure: Courtesy Steve Gunn
Functional Margin
• The vectors (points) that constrain the width of the margin are the support vectors.
Figure: Image from scikit-learn
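To make the margin concrete, the sketch below measures the distance of each training point from a given separating line and picks out the closest points, which play the role of support vectors. The six points and the line x1 = 2 are hypothetical; for this toy data the line attains the largest possible margin, because the closest pair of opposite-class points is 2 apart.

```python
from math import hypot

def signed_distance(w, b, point):
    """Signed distance from a 2-D point to the line w . x + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / hypot(w[0], w[1])

# Hypothetical separating line x1 = 2, written as w . x + b = 0.
w, b = (1.0, 0.0), -2.0
points = [(1.0, 1.0), (0.0, 2.0), (1.0, 2.0), (3.0, 3.0), (4.0, 2.0), (3.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]

# label * distance is positive for every correctly classified point.
margins = [y * signed_distance(w, b, p) for p, y in zip(points, labels)]

# The margin of the line is the smallest of these distances.
margin = min(margins)

# Points that sit exactly at that distance constrain the margin.
support = [p for p, m in zip(points, margins) if abs(m - margin) < 1e-9]
print(margin, support)
```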
Mapping to Higher Dimensions
• Sometimes data is not linearly separable.
• If the original finite-dimensional space is mapped into a much higher-dimensional space, the separation is made easier in that space.
• This is achieved by the SVM using the kernel trick.
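The kernel trick rests on the fact that a kernel returns the inner product of two mapped vectors without constructing the mapping. A small check of this, assuming the homogeneous degree-2 polynomial kernel and its explicit feature map for 2-D inputs (a choice made here for illustration, not one named in the slides):

```python
from math import isclose, sqrt

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)

# Computed in the original 2-D space, without building phi(x) or phi(z).
kernel_value = poly_kernel(x, z)

# Computed explicitly in the 3-D feature space.
explicit_value = dot(phi(x), phi(z))

print(kernel_value, explicit_value)
```

Both routes give the same number up to rounding, but the kernel never leaves the original space, which is what makes very high-dimensional feature spaces affordable.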
Mapping to Higher Dimensions
• Mapping from 1D to 2D
• Mapping from 2D to 3D
Figure: Courtesy Steve Gunn
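A 1D-to-2D mapping can be sketched as follows. The data and the map x -> (x, x^2) are assumptions for illustration: no single threshold on x separates the inner points from the outer ones, but after the mapping a horizontal line does.

```python
def lift(x):
    """Map a scalar x to the 2-D point (x, x^2)."""
    return (x, x * x)

# Toy 1-D data: the outer points form one class, the inner points the other.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

lifted = [lift(x) for x in xs]

# In 2-D the horizontal line x2 = 2.5 separates the two classes.
predicted = [1 if x2 > 2.5 else -1 for _, x2 in lifted]
print(predicted == labels)
```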
Identification of Relevant Sections in a Web Page for Web Search
• Shallow techniques like keyword matching give unsatisfactory results.
• Search methodologies must focus more on contextual information than just keyword occurrences.
• The search term might not be a very differentiating term.
• It might not appear in the section at all.
• SQUINT: an SVM-based approach to identify sections of a Web page relevant to a Web search.
Overall Architecture
Feature Generation
• Word Rank Based Features
• Bigram Rank Based Features
• Coverage of Top Ranked Tokens
• Query Word Frequency
• Distance from the Query
Word Rank Based Features
• The rank of a word is defined to be its position in the list if the words were ordered by frequency of occurrence across all search results.
• The value of this feature is the frequency of the particular word in the given section.
• Bucketing can be used to reduce dimensionality.
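A minimal sketch of this feature follows. The whitespace tokenization, the 0-based ranks, the alphabetical tie-breaking, and the way buckets sum the frequencies of the words they contain are all assumptions; the slides do not specify them.

```python
from collections import Counter

def rank_words(documents):
    """Rank words by total frequency across all documents (rank 0 = most frequent)."""
    totals = Counter(word for doc in documents for word in doc.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(totals, key=lambda w: (-totals[w], w))
    return {word: rank for rank, word in enumerate(ordered)}

def bucketed_rank_features(section, ranks, bucket_size, num_buckets):
    """Sum the in-section frequencies of the words that fall in each rank bucket."""
    features = [0] * num_buckets
    for word, count in Counter(section.lower().split()).items():
        rank = ranks.get(word)
        if rank is not None and rank // bucket_size < num_buckets:
            features[rank // bucket_size] += count
    return features

# Hypothetical search results and one section to describe.
results = ["svm margin svm kernel", "kernel svm data", "data web page"]
ranks = rank_words(results)
features = bucketed_rank_features("svm kernel web svm", ranks,
                                  bucket_size=2, num_buckets=3)
print(ranks)
print(features)
```

With a bucket size of 2, six ranked words collapse into three feature dimensions, which is the dimensionality reduction the last bullet refers to.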
Bigram Rank Based Features
• A bigram is defined to be two consecutive words occurring in a section.
• E.g., the bigram "machine learning" may be more important than "machine" and "learning" separately.
• The value of the feature is calculated in the same way as for Word Rank Based Features.
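The same idea applied to bigrams might look like the sketch below. The tokenization, 0-based ranks, and tie-breaking are again assumptions, and bucketing is left out to keep the example short.

```python
from collections import Counter

def bigrams(text):
    """Return the consecutive word pairs of a text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def rank_bigrams(documents):
    """Rank bigrams by total frequency across all documents (rank 0 = most frequent)."""
    totals = Counter(pair for doc in documents for pair in bigrams(doc))
    ordered = sorted(totals, key=lambda pair: (-totals[pair], pair))
    return {pair: rank for rank, pair in enumerate(ordered)}

def bigram_rank_features(section, ranks):
    """Map the rank of each ranked bigram to its frequency in the section."""
    counts = Counter(bigrams(section))
    return {ranks[pair]: count for pair, count in counts.items() if pair in ranks}

# Hypothetical search results and one section to describe.
results = ["machine learning is fun", "machine learning with svm", "learning machine"]
ranks = rank_bigrams(results)
features = bigram_rank_features("machine learning beats machine learning", ranks)
print(features)
```

Note that the pair ("learning", "machine") is ranked separately from ("machine", "learning"), so word order is preserved in the feature.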
Coverage of Top Ranked Tokens
• Relevance may also be determined by the number of top-ranked words which occur in the section.
• The value of this feature is the coverage of top-ranked words per bucket.
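One plausible reading of "coverage per bucket" is the fraction of the words in each rank bucket that appear at least once in the section. The sketch below implements that reading; the definition, the ranked word list, and the bucket size are assumptions rather than details taken from the slides.

```python
def bucket_coverage(section, ranked_words, bucket_size):
    """Fraction of the words in each rank bucket that occur in the section."""
    present = set(section.lower().split())
    coverage = []
    for start in range(0, len(ranked_words), bucket_size):
        bucket = ranked_words[start:start + bucket_size]
        coverage.append(sum(word in present for word in bucket) / len(bucket))
    return coverage

# Hypothetical ranking, most frequent word first.
ranked_words = ["svm", "kernel", "margin", "data", "web", "page"]
coverage = bucket_coverage("the svm margin on web data", ranked_words, bucket_size=2)
print(coverage)
```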
Distance from the Query
• The intuition here is that the closer a section is to the query in the Web page, the more likely it is to be relevant.
• The value of this feature is the section-wise distance between the section in question and the nearest section which contains the query.
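A sketch of the section-wise distance, assuming a single-word query, whitespace tokenization, and `None` as the value when no section contains the query (the slides do not say how that case is handled):

```python
def distance_from_query(sections, query):
    """For each section, count the sections to the nearest one containing the query."""
    query = query.lower()
    hits = [i for i, text in enumerate(sections) if query in text.lower().split()]
    if not hits:
        # Assumption: the distance is undefined when the query never appears.
        return [None] * len(sections)
    return [min(abs(i - h) for h in hits) for i in range(len(sections))]

# Hypothetical page, already split into sections in document order.
sections = ["intro text", "about svm models", "training details",
            "results", "svm summary"]
distances = distance_from_query(sections, "svm")
print(distances)
```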
Query Word Frequency
• The value of this feature is the frequency of the query word in the section.
• The value is normalized by the number of words in the section.
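In code, this feature reduces to a count divided by the section length. The sketch assumes case-insensitive matching and whitespace tokenization, and returns 0.0 for an empty section.

```python
def query_word_frequency(section, query_word):
    """Occurrences of the query word divided by the number of words in the section."""
    words = section.lower().split()
    if not words:
        return 0.0
    return words.count(query_word.lower()) / len(words)

# Two of the five words match the query word.
value = query_word_frequency("SVM models and svm kernels", "svm")
print(value)
```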
Training Set Generation
• Query Google to get a set of pages.
• Clean each page: remove scripts, pictures, links, etc.
• Break each page into sections.
• Label each section of every page.
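The cleaning and sectioning steps can be sketched with Python's standard `html.parser`. The choice of which tags to drop and which tags start a new section is an assumption, as is the class name `SectionExtractor`; fetching the pages and labeling the sections are left out because they depend on the search service and on human judgment.

```python
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Hypothetical helper: strips scripts, styles and links, splits text into sections."""
    SKIP = {"script", "style", "a"}                # content dropped entirely
    BREAK = {"p", "div", "h1", "h2", "h3", "li"}   # tags that delimit a section

    def __init__(self):
        super().__init__()
        self.sections = []
        self._buffer = []
        self._skip_depth = 0

    def finish_section(self):
        """Close the current section, normalizing whitespace and skipping empty text."""
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.sections.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BREAK:
            self.finish_section()
        # Other tags, including <img>, contribute no text and are ignored.

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BREAK:
            self.finish_section()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

def split_into_sections(html):
    """Return the cleaned text sections of an HTML page, in document order."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    parser.finish_section()
    return parser.sections

page = ('<h1>SVM</h1><script>var x = 1;</script>'
        '<p>Support vectors <a href="#">link</a> define the margin.</p>'
        '<img src="a.png"><p>Kernels map data.</p>')
sections = split_into_sections(page)
print(sections)
```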
Learning Algorithm
• A support vector machine with a linear kernel is used.
• Given the relatively high dimensionality of the feature vector, an SVM is a reasonable choice.
• The predicted margins of the samples are used to get a non-binary metric of how relevant each section is.
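The use of margins as a relevance score can be sketched as follows. The trainer is a plain full-batch subgradient descent on the regularized hinge loss, a simple stand-in for a real SVM solver; the two-dimensional feature vectors, section names, and hyperparameters are made up for illustration.

```python
def decision_value(w, b, x):
    """Signed score w . x + b; larger means further on the relevant side."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    dim, n = len(samples[0]), len(samples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(samples, labels):
            if y * decision_value(w, b, x) < 1:   # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical, centered feature vectors: +1 = relevant section, -1 = not relevant.
samples = [(2.0, 1.0), (3.0, 2.0), (-2.0, -1.0), (-3.0, -2.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(samples, labels)

# Score unseen sections and rank them from most to least relevant.
candidates = {"section A": (3.0, 1.0), "section B": (1.0, 0.0), "section C": (-1.0, -1.0)}
ranked = sorted(candidates, key=lambda name: decision_value(w, b, candidates[name]),
                reverse=True)
print(ranked)
```

Sorting by the decision value rather than by its sign is what turns the binary classifier into a graded relevance metric.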
Conclusion
• Support vector machines are an attractive approach to data modelling.
• Evaluations suggest that using information-retrieval-inspired features and some basic hints from summarization gives respectable accuracy in detecting the most relevant section in a page.
• Thus SQUINT can have a large impact on the user's overall search experience.
References
• Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
• Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad. SQUINT: SVM for Identification of Relevant Sections in Web Pages for Web Search.
• Wikipedia article on support vector machines, http://en.wikipedia.org/wiki/Support_vector_machine
• Machine Learning course on Coursera, https://class.coursera.org/ml-2012-002/class/index