Semantic HTML Page Segmentation using Type Analysis

Xin Yang, Peifeng Xiang, Yuanchun Shi
Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China

{yang-x02, xpf97}@mails.tsinghua.edu.cn; [email protected]

Abstract

Semantic information is necessary for Semantic Web processing and is useful to Web adaptation services such as personalizing users’ browsing on small-screen devices. However, in most existing HTML documents semantic information is encoded only implicitly. This paper describes a page segmentation method that parses Web pages into rectangular segments, called blocks, each containing some semantic information. Existing page segmentation techniques are either built mainly on the HTML DOM structure or purely vision-based, and are not accurate enough in either visual presentation or semantic sense. Our approach is automatic and based on a refined typing system that tightly couples type analysis with indispensable visual cues to organize blocks into a tree structure, aiming to achieve a high degree of coherence in both semantic and visual views. Experimental results show that our method achieves better accuracy and completeness than existing ones.

Keywords: Page Segmentation, Block, Visual Cues, Type Recognition, Pattern Discovery, Semantic Structural Tree.

1. Introduction

The Semantic Web is likely to be the next-generation Web. Its basic infrastructure encompasses both online and offline databases filled with an enormous number of semantic objects. However, most existing pages, a necessary part of online resources, are encoded as HTML documents, in which semantic information is implicit but is presented visually in a structured way. For example, in Figure 1, the information in the red rectangle together represents the topic “Headline News”, and the information in the blue rectangles represents subtopics, each associated with a piece of news.

Essentially, Web pages are composed of several such rectangular areas, each of which contains useful semantic information on a single topic; following [3], we call such an area a block. Semantic page segmentation is a preliminary step for advanced Semantic Web processing, and [7][13][14][15] show its great potential. For example, information retrieval and extraction can achieve much better results by treating sets of blocks as the basic processing objects instead of the whole page, e.g., [13][15]. In addition, specific Web adaptation services, such as personalizing users’ browsing on small-screen devices, can also benefit considerably from using semantic blocks directly as input units, e.g., [7][14].

Existing solutions to page segmentation fall into two categories. The first class is based on non-visual cues such as HTML DOM tags, content, and links, e.g., [1][2][4][5][6][8][9][11]. Methods of this class often achieve low accuracy because they overlook visual cues. The second class takes the opposite approach; for example, [3] proposes a purely vision-based method. Such methods often achieve only a limited degree of semantic coherence because they rely too heavily on visual cues while failing to make full use of them.

From a human’s point of view, each Web page is a set of semantic blocks that are separate in visual presentation but semantically related to each other. [12] stresses the simple observation that semantically related items exhibit consistency in presentation style and spatial locality. This observation is especially useful for template-based Web pages such as news front pages and e-commerce sites. We make three further observations (Section 3.1) and use them to guide type analysis throughout the semantic page segmentation process, with both non-visual and visual cues taking effect. Additionally, Web pages may contain semantically free items besides blocks, such as blank tables and white separators. We filter them out to tidy the tree structure and simplify the algorithm.

Figure 1. A fragment of a news front page

2006 1st International Symposium on Pervasive Computing and Applications

1-4244-0326-X/06/$20.00 ©2006 IEEE.

We utilize the idea of pattern discovery from [12] but implement it in an essentially different way, mainly by taking into account some indispensable visual cues. The contributions include:

- Defining a refined typing system built on basic types.

- Filtering out semantically free items through type recognition.

- Coupling type analysis with visual cues by dynamically inserting and removing separator items and adjusting the relationship between adjacent items.

Section 2 presents a brief overview of related work. Section 3 describes our technique in detail, and Section 4 follows with experimental results. Finally, Section 5 gives a discussion.

2. Related Work

Recently, the Semantic Web has drawn increasing attention from researchers. Many contributions have emerged in areas such as Web page segmentation and information extraction, both of which are related to this issue.

On the one hand, many approaches have been proposed for Web page segmentation.

[8] and [11] both use HTML tag information as cues, while [2][4][5][9] focus on content and link information. [1] even tries to detect specific templates by making use of link information. In [6] a new model called FOM (Function-based Object Model) is proposed to construct hierarchical structures for Web pages.

The methods above all try to extract semantics directly from Web pages but ignore the actual visual presentation style. [3] discusses their respective limitations and presents a vision-based algorithm, called VIPS (Vision-based Page Segmentation), to extract the semantic structure of Web pages. It is based on the assumption that humans unconsciously divide Web pages into semantic segments by virtue of visual cues. Being independent of the tag tree, it works well even when the HTML structure differs considerably from the actual layout structure. However, it takes no semantic cues into account and does not fully utilize visual cues, which leads to a limited degree of semantic coherence within blocks.

On the other hand, some information extraction techniques involve algorithms related to page segmentation.

In [10], a flexible algorithm called MDR (Mining Data Records in Web Pages) is used to mine data records in Web pages. Data records are lists of regularly structured objects containing some information, which are somewhat similar to blocks. Compared with earlier automatic techniques, this algorithm works more accurately and effectively and can discover non-continuous data records. However, because its original objective is to fill database tables, it overlooks the structural relationship among different data records and therefore is not suitable for general use.

In [12], a framework coupling structural analysis of documents with semantic analysis using a domain ontology is developed to partition HTML documents into unlabeled partition trees by grouping together elements with related semantics. It exploits the key observation that semantically related items exhibit consistency in presentation style and spatial locality, and it tries to discover structural recurrence patterns for semantically related items under each subtree through a bottom-up process. However, it has two inherent limitations. First, it uses the full HTML tag path as the type of each node, which makes it time-consuming and unsuitable for real-time processing. Second, it relies on pattern discovery but overlooks visual cues, so it is not accurate enough and can hardly achieve completeness.
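To make the notion of a structural recurrence pattern concrete, the following is a minimal sketch, not the algorithm of [12]. It assumes the children of one subtree have already been reduced to a flat sequence of type labels, and it searches for the consecutively repeated subsequence that covers the most items. The function name `find_recurrence`, the `min_repeats` parameter, and the labels are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_recurrence(
    types: List[str], min_repeats: int = 2
) -> Optional[Tuple[int, Tuple[str, ...], int]]:
    """Return (start index, pattern, repeat count) of the consecutively
    repeated pattern covering the most items, or None if nothing repeats."""
    n = len(types)
    best = None  # (coverage, start, length, repeats)
    for length in range(1, n // min_repeats + 1):
        for start in range(0, n - length * min_repeats + 1):
            pattern = types[start:start + length]
            repeats = 1
            pos = start + length
            # Count how many times the pattern repeats back to back.
            while types[pos:pos + length] == pattern:
                repeats += 1
                pos += length
            coverage = repeats * length
            if repeats >= min_repeats and (best is None or coverage > best[0]):
                best = (coverage, start, length, repeats)
    if best is None:
        return None
    _, start, length, repeats = best
    return start, tuple(types[start:start + length]), repeats


if __name__ == "__main__":
    children = ["HEADING", "LINK", "TEXT", "LINK", "TEXT", "LINK", "TEXT"]
    print(find_recurrence(children))
```

In this toy input, the heading is followed by three link and text pairs, so each pair would be a candidate for one semantically related group. The brute-force search is only for illustration; a real implementation would need a more efficient strategy and a tolerance for imperfect repeats.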

Our approach is unique in that it borrows the idea of pattern discovery and makes it work in parallel with visual cues in the type analysis process. Meanwhile, we filter out semantically free items through a type recognition process. Therefore, page segmentation can be achieved more accurately and comprehensively in both the visual and the semantic sense.

3. Semantic Segmentation

3.1. The Basic Idea

Our technique is based on the simple observation mentioned in Section 1. Examining the relationship between the HTML DOM structure and the actual presentation style, we make three further observations, which give rise to the basic idea:

- Items with similar semantics usually have similar HTML tags. This gives rise to a refined typing system built on basic types. Each item is bound to a basic Atomic Type according to its tag and location in the HTML DOM tree. Semantically free items can then be filtered out through a type recognition process.

- Similar semantic blocks usually contain items with similar HTML tag sequences. The typing system can therefore be extended by binding each semantic block to a sequence of atomic types, namely a Composite Type. This is done in parallel with the pattern discovery algorithm.

- Similar semantic blocks are usually located in the same subtree and have the same parent. This gives rise to the idea of using visual cues to assist our type analysis. We take two measures, both of which work effectively:

  - Dynamically inserting and removing separator items during the pattern discovery process.

  - Adjusting the relationship between adjacent items.
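The first two observations can be illustrated with a minimal sketch. It is not the paper's implementation: the tag-to-type table, the `Item` class, and the rule for deciding that an item is semantically free are assumptions made for illustration, and the sketch ignores an item's location in the DOM tree as well as all visual cues.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical tag-to-type table; the actual atomic types are not listed here.
ATOMIC_TYPES = {"a": "LINK", "img": "IMAGE", "p": "TEXT", "span": "TEXT",
                "h1": "HEADING", "h2": "HEADING"}
MEDIA_TAGS = {"img"}  # assumed to carry content even without text


@dataclass
class Item:
    tag: str
    text: str = ""
    children: List["Item"] = field(default_factory=list)


def atomic_type(item: Item) -> str:
    """Bind an item to an atomic type based on its tag (location is ignored)."""
    return ATOMIC_TYPES.get(item.tag, "CONTAINER")


def is_semantically_free(item: Item) -> bool:
    """True for items with no text, no media, and only semantically free children."""
    if item.tag in MEDIA_TAGS or item.text.strip():
        return False
    return all(is_semantically_free(child) for child in item.children)


def composite_type(block: Item) -> Tuple[str, ...]:
    """Sequence of atomic types of the block's children, skipping free items."""
    return tuple(atomic_type(c) for c in block.children
                 if not is_semantically_free(c))


if __name__ == "__main__":
    news1 = Item("div", children=[Item("a", "Story one"), Item("img"),
                                  Item("p", "Summary")])
    # The blank table is filtered out, so both blocks get the same type.
    news2 = Item("div", children=[Item("a", "Story two"), Item("table"),
                                  Item("img"), Item("p", "More")])
    print(composite_type(news1), composite_type(news1) == composite_type(news2))
```

Under these assumptions, two news blocks that differ only by a blank table receive the same composite type, which is what lets a later pattern discovery step treat them as instances of the same kind of block.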

Given an HTML document, we obtain its DOM tree and then parse it into a semantic structural tree through a
