Identification of Relevant Sections in Web Pages Using a Machine Learning Approach

Jerrin Shaji George

NIT Calicut

November 8, 2012

Introduction

• There is a massive amount of data available on the internet.

• Extracting only the relevant content has become very important.

• A machine learning approach is suitable because it can adapt to the rapidly changing dynamics of the internet.

Machine Learning

• The science of getting computers to act without being explicitly programmed.

• A method of teaching computers to make and improve predictions or behaviors based on some data.

• Machine learning algorithms:

  • Supervised machine learning

  • Unsupervised machine learning

Supervised Learning

• The machine learning task of inferring a function from labeled training data.

Figure: Supervised Learning Model (courtesy scikit-learn)

Supervised Learning

• Example of a classification problem: discrete-valued output.

Figure: Copyright © Victor Lavrenko

Supervised Learning

• Example of a regression problem: continuous-valued output.

Figure: Copyright © Victor Lavrenko
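
As a rough illustration of regression, the sketch below fits a straight line to labeled training data by ordinary least squares and then predicts a continuous value for an unseen input. The data (floor area versus price) and the function name `fit_line` are hypothetical and are not taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical labeled training data: floor area -> price (arbitrary units).
areas = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]  # constructed so that price = 3 * area

slope, intercept = fit_line(areas, prices)

# Continuous-valued prediction for an input that was not in the training set.
predicted = slope * 70.0 + intercept
```

Because the toy data lie exactly on a line, the fitted slope comes out as 3 and the intercept as 0 (up to floating-point error).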

Unsupervised Learning

• The data has no labels. The algorithm tries to find similarities between the objects in question.

Figure: Unsupervised Learning Model (courtesy scikit-learn)

Unsupervised Learning

• Example of a clustering problem

Figure: Copyright © Victor Lavrenko
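
One common clustering method is k-means. The slides do not name a specific algorithm, so the following is only a minimal sketch of Lloyd's algorithm on one-dimensional points, with hypothetical data and hand-picked initial centers.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on scalars, starting from the given initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])
```

No labels are supplied; the two groups emerge only from the distances between the points.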

Support Vector Machines (SVM)

• A supervised learning model.

• Used for classification and regression analysis.

• The basic SVM:

  • A non-probabilistic binary linear classifier.

  • Classifies each given input into one of two possible classes, which forms the output.

The SVM Algorithm

• Inputs are formulated as feature vectors.

• The feature vectors are mapped into a feature space by using a kernel function.

• A division is computed in the feature space to optimally separate the classes of training vectors.
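
The steps above can be sketched for the simplest case, a linear kernel, where the feature space is the input space itself. The code below is an illustrative pure-Python trainer that minimises the regularised hinge loss by batch subgradient descent. It is not the solver used by any particular SVM library, and the data, function names, and hyperparameters (`lam`, `lr`, `epochs`) are assumptions chosen for this toy example.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.01, epochs=5000):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1.0:  # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable feature vectors with labels +1 / -1.
samples = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
```

A production system would normally rely on a dedicated solver; the point here is only to show feature vectors going in and a separating hyperplane (`w`, `b`) coming out.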

The SVM Algorithm

φ: the feature map induced by the kernel function

Formal Definition of SVM

• An SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space.

• It can be used for classification and regression.

• A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (called the functional margin).
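
For linearly separable data, the largest-margin hyperplane is usually written as the optimisation problem below. The symbols are not defined in the slides and are introduced here as assumptions: w is the normal vector of the hyperplane, b its offset, x_i the training vectors, and y_i ∈ {−1, +1} their class labels.

```latex
\[
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1,\qquad i = 1,\dots,n
\]
```

Under this scaling the two margin boundaries lie a distance of 2/‖w‖ apart, so minimising ‖w‖ maximises the margin.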

Optimal Separating Hyperplane

Figure: Courtesy Steve Gunn

Functional Margin

• The vectors (points) that constrain the width of the margin are the support vectors.

Figure: Image from scikit-learn

Mapping to Higher Dimensions

• Sometimes data is not linearly separable.

• If the original finite-dimensional space is mapped into a much higher-dimensional space, the separation is made easier in that space.

• This is achieved by the SVM using the kernel trick.
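
The kernel trick can be checked numerically on a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2D inputs, chosen here only as an example (the slides do not name a kernel at this point). It compares the inner product computed after an explicit mapping `phi` into 3D with the kernel value computed directly in the input space.

```python
def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2D inputs."""
    x1, x2 = x
    return (x1 * x1, 2 ** 0.5 * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """k(x, z) = (x . z)^2, computed without leaving the input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # inner product in the 3D feature space
implicit = poly_kernel(x, z)    # same quantity via the kernel
```

Both values agree (up to floating-point error), which is why an SVM can work in the higher-dimensional space without ever constructing `phi(x)` explicitly.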

Mapping to Higher Dimensions

• Mapping from 1D to 2D

• Mapping from 2D to 3D

Figure: Courtesy Steve Gunn

Identification of Relevant Sections in a Web Page for Web Search

• Shallow techniques like keyword matching give unsatisfactory results.

• Search methodologies must focus more on contextual information than just keyword occurrences.

• The search term might not be a very differentiating term.

• The search term might not appear in the section at all.

• SQUINT: an SVM-based approach to identify sections of a Web page relevant to a Web search.

Overall Architecture

Feature Generation

• Word Rank Based Features

• Bigram Rank Based Features

• Coverage of Top Ranked Tokens

• Query Word Frequency

• Distance from the Query

Word Rank Based Features

• The rank of a word is its position in the list of words ordered by frequency of occurrence across all search results.

• The value of this feature is the frequency of the particular word in the given section.

• Bucketing can be used to reduce dimensionality.
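
A minimal sketch of how such features might be computed follows. Whitespace tokenisation, the tie-breaking rule, the bucket size, and summing frequencies within a bucket are all assumptions made for illustration; the slides do not specify them.

```python
from collections import Counter


def rank_words(documents):
    """Words ordered by total frequency across all documents (rank 0 first)."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    return [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def word_rank_features(section, ranked_words, bucket_size=2, num_buckets=3):
    """One feature per rank bucket: in-section frequency of the bucket's words."""
    section_counts = Counter(section.lower().split())
    features = [0] * num_buckets
    for rank, word in enumerate(ranked_words[: bucket_size * num_buckets]):
        features[rank // bucket_size] += section_counts[word]
    return features


# Hypothetical text of three search results.
results = [
    "svm kernel svm margin",
    "svm kernel margin classifier",
    "kernel svm web search",
]
ranked = rank_words(results)
features = word_rank_features("svm margin svm search", ranked)
```

With these inputs the six ranked words collapse into three bucket features, `[2, 1, 1]`, instead of one feature per word.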

Bigram Rank Based Features

• A bigram is defined to be two consecutive words occurring in a section.

• E.g., "machine learning" may be more important than "machine" and "learning" separately.

• The value of the feature is calculated in the same way as for Word Rank Based Features.
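
The same ranking idea carries over to bigrams. The sketch below extracts consecutive word pairs, ranks them across hypothetical search results, and reads off the in-section frequency of the top-ranked bigram. Tokenisation by whitespace is an assumption.

```python
from collections import Counter


def bigrams(text):
    """Consecutive word pairs in the text, lower-cased."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def rank_bigrams(documents):
    """Bigrams ordered by total frequency across all documents."""
    counts = Counter(bg for doc in documents for bg in bigrams(doc))
    return [bg for bg, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


# Hypothetical text of three search results.
results = [
    "machine learning is fun",
    "machine learning with kernels",
    "learning machine code",
]
ranked = rank_bigrams(results)

# Feature value for the top-ranked bigram in one section.
section_counts = Counter(bigrams("Machine learning beats machine learning hype"))
top_feature = section_counts[ranked[0]]
```

Note that "learning machine" in the third result does not count towards "machine learning", since word order matters for bigrams.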

Coverage of Top Ranked Tokens

• Relevance may also be determined by the number of top ranked words which occur in the section.

• The value of this feature is the coverage of top ranked words per bucket.
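
One plausible reading of "coverage per bucket" is the fraction of a bucket's words that appear at least once in the section. That interpretation, along with the bucket size and the ranked word list below, is an assumption made for this sketch.

```python
def coverage_per_bucket(section, ranked_words, bucket_size=2, num_buckets=3):
    """Fraction of each rank bucket's words that occur in the section."""
    present = set(section.lower().split())
    coverage = []
    for b in range(num_buckets):
        bucket = ranked_words[b * bucket_size:(b + 1) * bucket_size]
        hits = sum(1 for w in bucket if w in present)
        coverage.append(hits / len(bucket) if bucket else 0.0)
    return coverage


# Hypothetical ranked word list (most frequent first).
ranked = ["svm", "kernel", "margin", "classifier", "search", "web"]
coverage = coverage_per_bucket("svm kernel margin svm", ranked)
```

Unlike the word rank feature, repeated occurrences of a word do not increase coverage; only its presence matters.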

Distance from the Query

• The intuition here is that the closer a section is to the query in the Web page, the more likely it is to be relevant.

• The value of this feature is the section-wise distance between the section in question and the nearest section which contains the query.
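
As a sketch, the section-wise distance can be computed from section indices. The single-word query, whitespace tokenisation, and returning `None` when the query occurs nowhere on the page are assumptions of this example.

```python
def distance_from_query(sections, query):
    """Number of sections between each section and the nearest one containing
    the query word; None for every section if the query never occurs."""
    q = query.lower()
    hits = [i for i, s in enumerate(sections) if q in s.lower().split()]
    if not hits:
        return [None] * len(sections)
    return [min(abs(i - h) for h in hits) for i in range(len(sections))]


# Hypothetical page, already split into sections.
page = [
    "welcome to the site",
    "an introduction to svm classifiers",
    "kernels and margins explained",
    "contact and legal notes",
]
distances = distance_from_query(page, "svm")
```

Here the third section never mentions the query but sits right next to a section that does, so it receives a distance of 1.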

Query Word Frequency

• The value of this feature is the frequency of the query word in the section.

• The value is normalized by the number of words in the section.
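
A minimal sketch of this feature, assuming a single-word query and whitespace tokenisation:

```python
def query_word_frequency(section, query):
    """Occurrences of the query word divided by the section length in words."""
    words = section.lower().split()
    if not words:
        return 0.0
    return words.count(query.lower()) / len(words)


value = query_word_frequency("SVM margins make the SVM robust", "svm")
```

The normalisation keeps long sections from scoring higher merely because they contain more words.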

Training Set Generation

• Query Google to get a set of pages.

• Clean each page: remove scripts, pictures, links, etc.

• Break each page into sections.

• Label each section of every page.
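
The cleaning and sectioning steps can be sketched with Python's standard `html.parser` module. The choice of which tags mark section boundaries, the decision to drop link text entirely, and the sample page and labels are all assumptions for illustration; fetching the pages from a search engine is left out.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collects visible text, split into sections at block-level tags."""
    SKIP = {"script", "style", "a"}                # content dropped entirely
    BLOCK = {"p", "div", "h1", "h2", "h3", "li"}   # assumed section boundaries

    def __init__(self):
        super().__init__()
        self.sections = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered text, normalise whitespace, and store it if non-empty.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.sections.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._flush()
        # Other tags (for example img) carry no text and are simply ignored.

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def page_to_sections(html):
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.sections


html = """<html><body>
<script>var x = 1;</script>
<h1>SVM basics</h1>
<p>An SVM finds a separating <a href="#">click here</a> hyperplane.<img src="x.png"></p>
<p>Contact us</p>
</body></html>"""
sections = page_to_sections(html)

# Hypothetical hand-assigned labels: 1 = relevant to the query, 0 = not relevant.
labels = [1, 1, 0]
```

Each (section, label) pair would then be turned into a feature vector and a target for training.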

Learning Algorithm

• A Support Vector Machine with a linear kernel is used.

• Given the relatively high dimensionality of the feature vector, an SVM is a reasonable choice.

• The predicted margin of each sample is used to get a non-binary metric of how relevant each section is.
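
With a linear kernel, the predicted margin of a sample is the signed score w · x + b. The sketch below ranks the sections of one page by that score. The weights, bias, and feature vectors are hypothetical placeholders, not values from the SQUINT evaluation.

```python
def decision_value(w, b, x):
    """Signed score w . x + b: the sign gives the class, the size the confidence."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical weights, as if learned by a linear SVM over three features.
w, b = [0.8, 0.5, -0.3], -0.6

# Hypothetical feature vectors for three sections of one page.
sections = {
    "intro": [0.2, 0.1, 2.0],
    "main": [1.5, 1.0, 0.0],
    "sidebar": [0.6, 0.4, 1.0],
}

scores = {name: decision_value(w, b, x) for name, x in sections.items()}

# Most relevant section first.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting by the raw score rather than by its sign is what turns the binary classifier into a graded relevance measure.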

Conclusion

• Support Vector Machines are an attractive approach to data modelling.

• Evaluations suggest that using information-retrieval-inspired features and some basic hints from summarization gives respectable accuracy in detecting the most relevant section in a page.

• Thus SQUINT can have a large impact on the user’s overall search experience.

References

• Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

• Siddharth Jonathan J.B., Riku Inoue, and Jyotika Prasad. SQUINT: SVM for Identification of Relevant Sections in Web Pages for Web Search.

• Wikipedia article on Machine Learning, http://en.wikipedia.org/wiki/Support_vector_machine

• Machine Learning Course on Coursera, https://class.coursera.org/ml-2012-002/class/index
