
Page 1: Web Content Summarization

WEB CONTENT SUMMARIZATION

Timothy Washington

A Look at Algorithms, Methodologies, and Live Systems

Page 2: Web Content Summarization

Why is web content summarization important?

The quantity of information available on the Internet has grown vastly and this trend will continue for years to come.

In contrast, the amount of time that web users are willing to spend looking through this information continues to decrease.

Additionally, the development of handheld web devices (e.g., smartphones and iPads) creates the need to condense text for display on their small screens.

Page 3: Web Content Summarization

Two Main Phases of Web Content Summarization

Content Selection - weighing the importance of information throughout web documents, determining what is crucial to the web user's general understanding of the material, and identifying what information is redundant or unnecessary.

Content Generation - organizing the selected content into sentences with proper syntactic structure and semantic representation.

Page 4: Web Content Summarization

Algorithms and Methodologies used in Content Selection

TF*IDF (term frequency * inverse document frequency)

Lexical Chains

Machine Learning (e.g., Noisy Channel Models and N-grams)

Clickthrough Data Evaluation

Page 5: Web Content Summarization

TF*IDF

TF*IDF is a method used for summarizing multiple web documents. It selects sentences for a summary based on the statistical scores of the words they contain. This score is computed from the combination of two quantities. For term frequency, we count the number of occurrences of a given word in the web document currently being evaluated.

With inverse document frequency, we check how many web pages (out of the group being summarized) contain a given word. The total number of web pages in the group is divided by this count, and the log of the quotient is taken.

After this is computed for each word, sentences with the highest scoring words are selected for the summary until some threshold is reached.
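A minimal from-scratch sketch of that procedure, assuming simple whitespace tokenization and period-based sentence splitting; the function names are placeholders of my own, not taken from any particular system.

import math
from collections import Counter

def tfidf_scores(docs):
    # return one {word: tf * idf} dictionary per document in the group
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: in how many documents of the group each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    return [{w: tf[w] * math.log(n_docs / df[w]) for w in tf}
            for tf in (Counter(tokens) for tokens in tokenized)]

def select_sentences(doc, word_scores, max_sentences=2):
    # pick the sentences whose words carry the highest total TF*IDF score
    sentences = [s.strip() for s in doc.split('.') if s.strip()]
    ranked = sorted(sentences,
                    key=lambda s: sum(word_scores.get(w, 0.0) for w in s.lower().split()),
                    reverse=True)
    return ranked[:max_sentences]

In a real system the cutoff would be the summary-length threshold mentioned above rather than a fixed sentence count.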

Page 6: Web Content Summarization

Lexical Chains

Lexical chaining is used in the summarization of single and multiple web documents. In lexical chaining, words with similar meanings are linked together in a data structure.

WordNet, an online lexical database that groups words into sets of synonyms, is a primary tool used in determining which words should be "chained".

The strength of a lexical chain is based on the number of terms contained within it.

Usually, the first sentence from each strong chain is selected for the summary until some threshold is reached.

Example: An arctic cold wave, the worst in 10 years, hit parts of Europe, leaving many residents dead. Hardest hit by the sub-zero temperatures were Poland, Bulgaria, and Romania. Rare snowfall occurred in southern Italy. In Poland, three weeks of chilling temperatures killed at least 85 people in November, 29 more citizens than in all of the previous winter …
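A hedged sketch of greedy lexical chaining over some nouns from the passage above, using WordNet through NLTK (the WordNet corpus must be downloaded first); the relatedness test here (shared synset or direct hypernym link) is a deliberate simplification of the heuristics full chainers use.

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def related(w1, w2):
    # True if the two nouns share a synset (synonyms) or a direct hypernym/hyponym link
    s1, s2 = set(wn.synsets(w1, pos=wn.NOUN)), set(wn.synsets(w2, pos=wn.NOUN))
    if s1 & s2:
        return True
    return any(b in a.hypernyms() or a in b.hypernyms() for a in s1 for b in s2)

def build_chains(nouns):
    chains = []                          # each chain is a list of linked words
    for word in nouns:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)
                break
        else:
            chains.append([word])        # no related chain found: start a new one
    return chains

# nouns from the example text; chain strength ~ number of terms in the chain
nouns = ["wave", "temperatures", "snowfall", "winter", "residents", "citizens", "people"]
for chain in sorted(build_chains(nouns), key=len, reverse=True):
    print(chain)

The strongest chains would then contribute their first sentences to the summary, as described above.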

Page 7: Web Content Summarization

Noisy Channel Model

In the noisy channel model, a summarization system is trained on a set of documents and handwritten summaries, recording features of sentences selected (as well as features of those not selected) for the summary.

These features may include sentence length, average TF*IDF per word, sentence position in a paragraph, etc. They are represented in vectors as binary numbers, decimals, and strings.
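A small sketch of such a feature vector for one sentence; the three features come from the slide, while the exact encoding (a dataclass carrying a training label) is my own assumption.

from dataclasses import dataclass

@dataclass
class SentenceFeatures:
    length: int          # number of words in the sentence
    avg_tfidf: float     # average TF*IDF score per word
    position: int        # index of the sentence within its paragraph
    in_summary: bool     # training label: did the handwritten summary include it?

def extract_features(sentence, word_scores, position, in_summary=False):
    words = sentence.lower().split()
    avg = sum(word_scores.get(w, 0.0) for w in words) / max(len(words), 1)
    return SentenceFeatures(len(words), avg, position, in_summary)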

Page 8: Web Content Summarization

Noisy Channel Model (continued)

When web documents are run through the summarizer, feature vectors are created for each sentence. Afterwards, a machine learning approach such as Bayes' rule (shown below) is applied to each sentence. The N sentences with the highest values, or those above a threshold, are then placed in the summary.

P(s ∈ S | F1, …, FN) = P(F1, F2, …, FN | s ∈ S) * P(s ∈ S) / P(F1, F2, …, FN)

s – the given sentence

S – the summary

F1, F2, …, FN – the sentence feature vector

P(s ∈ S | F1, …, FN) – the probability that s should be chosen for the summary

P(s ∈ S) – the prior probability (not based on any features) of a sentence being chosen for a summary

P(F1, F2, …, FN) – the probability of a sentence vector in the document set matching that of the given sentence

P(F1, F2, …, FN | s ∈ S) – the probability of a sentence vector in the summary set matching that of the given sentence
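A rough sketch of estimating these quantities from counts over the training set and scoring a new sentence. Unlike the exact joint probability in the formula above, this version assumes the discretized features are conditionally independent (i.e., naive Bayes), so it is only an approximation of the model described.

from collections import Counter

def train(feature_vectors, labels):
    # feature_vectors: tuples of discrete feature values, one per sentence
    # labels: True if the sentence appeared in the handwritten summary
    prior = sum(labels) / len(labels)          # estimate of P(s in S)
    counts = Counter()                          # (feature index, value) counts over in-summary sentences
    total_in_summary = sum(labels)
    for fv, y in zip(feature_vectors, labels):
        if y:
            for i, value in enumerate(fv):
                counts[(i, value)] += 1
    return prior, counts, total_in_summary

def score(fv, prior, counts, total_in_summary):
    # proportional to P(s in S | F1..FN); the denominator P(F1..FN)
    # is the same for every candidate sentence, so it is dropped for ranking
    p = prior
    for i, value in enumerate(fv):
        p *= (counts[(i, value)] + 1) / (total_in_summary + 2)   # add-one smoothing
    return p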

Page 9: Web Content Summarization

N-Grams

N-grams are sequences of N words in a given sentence that are evaluated based on the probability that they will occur in a given order.

The most popular forms of N-grams are unigrams (single words), bigrams (sets of two words), and trigrams (sets of three words).

In the case of summarization, N-gram probability is computed based on the statistics collected from the document-summary training set.

For example, the probability of the bigrams in a candidate sentence also appearing within sentences selected for the summary could be evaluated.

Once again this statistical probability can be compared to a threshold value used to determine whether or not a sentence is fit for summarization.

The shortcoming of such an implementation is that the subject matter of the document-summary pairs must be closely related to that of the web content.
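To make the idea concrete, here is a short sketch of a bigram version: collect bigram statistics from the summary side of the training pairs and use them to score a candidate sentence against a threshold. The smoothing scheme and the sentence-start marker are implementation choices of mine.

import math
from collections import Counter

def bigram_model(summary_sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in summary_sentences:
        words = ["<s>"] + sent.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_log_prob(sentence, unigrams, bigrams, vocab_size):
    words = ["<s>"] + sentence.lower().split()
    logp = 0.0
    for w1, w2 in zip(words, words[1:]):
        # add-one smoothing so unseen bigrams do not zero out the score
        logp += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
    return logp

# a sentence would be kept for the summary if its (length-normalized) log probability
# clears a threshold tuned on the document-summary training set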

Page 10: Web Content Summarization

Clickthrough Data Evaluation

Clickthrough data is created by recording data sets from search engines to represent user activity. Each record is a triple <u, q, p>, where user u enters a query q and selects pages p from the results.

Once this data has been collected over a given period of time, there will be millions of these records available for evaluation.

Sentences can then be selected based on their number of significant words. Here the significance of each word is determined by the number of times users selected the given web page from the search engine after entering a query containing that word (i.e., select all q from the data set whose p matches the current page).

One obvious issue with using clickthrough data is that some web pages to be summarized may never have been accessed through a search engine. In these cases it is useful to fall back on another method (such as TF*IDF) for selecting material for the summary.
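A hedged sketch of turning <u, q, p> records into per-page word significance counts as just described; the example log entries and page URL are made up for illustration.

from collections import Counter, defaultdict

def word_significance(click_log):
    # click_log: iterable of (user, query, page) triples
    significance = defaultdict(Counter)          # page -> {query word: click count}
    for user, query, page in click_log:
        for word in query.lower().split():
            significance[page][word] += 1
    return significance

log = [("u1", "arctic cold wave europe", "news.example.com/cold-wave"),
       ("u2", "europe cold deaths", "news.example.com/cold-wave")]
sig = word_significance(log)
print(sig["news.example.com/cold-wave"].most_common(3))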

Page 11: Web Content Summarization

News In Essence

Online news article summarization system developed at the University of Michigan, Ann Arbor.

Unique in that it uses a Trolling Algorithm to determine the group of documents to summarize on its own, rather than taking a preset group of documents.

Page 12: Web Content Summarization

News In Essence (continued)

Input:  SeedUrl, SitesToSearch, ExitConditions
Output: Cluster

Cluster <- SeedUrl
WeightedKeyWords <- get_common_keywords(SeedUrl, SeedUrl)
LinkedUrls <- get_links(SeedUrl)

// primary search
while UrlToTest <- next(LinkedUrls) && PrimaryExitCondition != true
    if follows_useful_rules(UrlToTest)
        LinkedUrls <- LinkedUrls + get_links(UrlToTest)
    if follows_article_rules(UrlToTest) && (similarity(SeedUrl, UrlToTest) > threshold)
        Cluster <- Cluster + UrlToTest
        WeightedKeyWords <- WeightedKeyWords + get_common_keywords(SeedUrl, UrlToTest)

SecSearchKeyWords <- max_n(WeightedKeyWords)

// secondary search
while SearchSite <- next(SitesToSearch) && SecondaryExitCondition != true
    SearchPage <- generate_search(SearchSite, SecSearchKeyWords)
    LinkedUrls <- get_links(SearchPage)
    while UrlToTest <- next(LinkedUrls) && SecondaryExitCondition != true
        if follows_useful_rules(UrlToTest)
            LinkedUrls <- LinkedUrls + get_links(UrlToTest)
        if follows_article_rules(UrlToTest) && (similarity(SeedUrl, UrlToTest) > threshold)
            Cluster <- Cluster + UrlToTest

return Cluster

Starts by looking through a single web page, known as the SeedUrl, and searches this page for links to other articles that share the same topic.

After adding the linked pages to the document set, it uses TF*IDF to compute the most important words shared across the web documents.

In the next phase it takes these words and uses them to query several news search engines, adding the articles it finds to the document set. It also follows links from these search results to see whether further pages can be used for summarization.

A series of rules determines whether linked pages should be excluded from the set. For example, a URL ending in ".jpg" is evidence of a page with no textual content, which makes it unfit for the document set.

Page 13: Web Content Summarization

Web Page Summarization using Clickthrough Data

System developed at Tsinghua University.

Determines the importance of a word using a weighted combination of its term frequency within the web pages and its term frequency within the clickthrough query data.

Each frequency is weighted by a variable between 0 and 1.

The closer this variable is to 0, the more a word's importance is based on term frequency within the web pages. Likewise, the closer the variable is to 1, the more the ranking is based on the clickthrough data.
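A minimal sketch of that weighting, written as code because the slide's original formula image is not in this transcript; the parameter name alpha and the simple linear interpolation are my own rendering of the description above.

def word_importance(word, tf_page, tf_query, alpha):
    # alpha close to 0 -> rely on the page's own term frequencies
    # alpha close to 1 -> rely on term frequencies from the clickthrough query data
    return (1 - alpha) * tf_page.get(word, 0) + alpha * tf_query.get(word, 0)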

Page 14: Web Content Summarization

Web Page Summarization using Clickthrough Data (continued)

After the system determines the important terms it applies Luhn’s algorithm (shown below) to each sentence.

1) Set a limit L for the distance at which any two significant words can be considered related.

2) Find the portion of the sentence that is bracketed by significant words no more than L non-significant words apart.

3) Count the number of significant words contained in this portion and divide the square of this number by the total number of words in the portion. The result is the significance factor of the sentence.

Sample: A default could devastate the economy, leading to a crashing dollar and skyrocketing interest rates, among other things.

L = 4

Left bracket word = economy

Right bracket word = interest

Significant words in portion = 3

Total words in portion = 7

Significance factor of sentence = 3^2 / 7 ≈ 1.29
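The three steps above, written out as a small sketch; note that this version counts every token inside the bracketed portion, so its exact numbers can differ from the hand count on the slide.

def luhn_significance(sentence_words, significant, L=4):
    # positions of significant words within the sentence
    positions = [i for i, w in enumerate(sentence_words) if w in significant]
    best, start = 0.0, 0
    for i in range(1, len(positions) + 1):
        # close the current portion when the gap exceeds L or we run out of significant words
        if i == len(positions) or positions[i] - positions[i - 1] - 1 > L:
            span = positions[start:i]
            if len(span) >= 2:                       # a portion must be bracketed by significant words
                portion_len = span[-1] - span[0] + 1
                best = max(best, len(span) ** 2 / portion_len)
            start = i
    return best

words = ("a default could devastate the economy leading to a crashing dollar "
         "and skyrocketing interest rates among other things").split()
print(luhn_significance(words, {"economy", "dollar", "interest"}, L=4))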

Page 15: Web Content Summarization

Web Page Summarization using Clickthrough Data (continued)

Page 16: Web Content Summarization

NewsBlaster

Page 17: Web Content Summarization

NewsBlaster (continued)

Weights sentences in web pages based on various features, including the following:

Location: If a sentence appears late in a document, its weight is reduced, as such sentences are more likely to contain information that is redundant or less important.

Publication Date: The more recent the article a sentence belongs to, the greater the weight it is given. This helps ensure that web users receive the most up-to-date information.

Length: Sentences are weighted negatively if they are longer than 30 words or shorter than 15 words. Long sentences are thought to contain redundant information, whereas short sentences are thought to be limited to supporting information.

Pronoun: A negative weight is given to sentences that start with pronouns, as these usually contain supporting rather than primary information.

Sentences with the highest weights are selected. Afterwards they are fused together using a separate algorithm.
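A hedged sketch of how such a sentence weight could be assembled from the features listed above; only the four criteria come from the slide, while the individual bonus and penalty values, the pronoun list, and the function signature are placeholders of my own.

PRONOUNS = {"he", "she", "it", "they", "this", "that", "these", "those"}

def sentence_weight(sentence, position_ratio, article_age_days):
    # position_ratio: 0.0 = start of the document, 1.0 = end of the document
    words = sentence.lower().split()
    weight = 1.0
    if position_ratio > 0.8:                              # appears late in the document
        weight -= 0.2
    weight += max(0.0, 0.3 - 0.01 * article_age_days)     # favour more recent articles
    if len(words) > 30 or len(words) < 15:                # length penalty
        weight -= 0.2
    if words and words[0] in PRONOUNS:                    # starts with a pronoun
        weight -= 0.1
    return weight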

Page 18: Web Content Summarization

Tests Run on NewsBlaster

Three document summary sets from NewsBlaster were tested based on the following criteria:

Summary Length – the length of the summary compared to the average length of the original documents.

Precision – the portion of information in the summary that belongs in the summary (in this case, based on the number of sentences in each summary relevant to the title of the summary).

Sentence Structure – since this is an abstractive system (constructing and generating new sentences), each sentence was tested for proper grammatical structure by running it through the Stanford parser.
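A small sketch of how the first two measures could be computed; the function names and the relevance judgments passed in are assumptions, and the grammaticality check via the Stanford parser is not reproduced here.

def length_ratio(summary, documents):
    # summary length relative to the average length of the original documents
    avg_doc_len = sum(len(d.split()) for d in documents) / len(documents)
    return len(summary.split()) / avg_doc_len

def precision(summary_sentences, relevant_flags):
    # relevant_flags[i] is True if sentence i was judged relevant to the summary's title
    return sum(relevant_flags) / len(summary_sentences)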

Page 19: Web Content Summarization

Results for Summary Length

Page 20: Web Content Summarization

Result Set for Precision

Page 21: Web Content Summarization

Conclusion

Many advances have been made in this field with the development of algorithms and methodologies such as TF*IDF, clickthrough data evaluation, and machine learning.

It is also clear that further research is needed in this area, above all in improving precision, since many summarization systems still have issues there.