chapter 3 comparative study and feasibility analysis of...

Chapter 3

Comparative Study and Feasibility Analysis of Existing Systems and

Customization of Flash Web pages into HTML Structures

3.1 Preamble

In the previous chapter, we have discussed enhancement techniques to create

datasets and analyzed the issues encountered while dataset creation. In this

chapter, we describe about the working procedure of existing systems such as

VIPS (Vision Based Web page Segmentation), Boilerpipe, Gestalt Theory etc., on

HTML Web page structures. Further, we discuss the working procedure of

existing systems on our XML and Flash datasets and we discuss about

performance of each system. After realizing that the existing systems are not

viable for XML and Flash kind of Web structures, we describe their semantic

Some parts of the material in this chapter have appeared in the following research article (a) Krishna Murthy. A and Suresha, “Comparative Study on Browsing on SSD's”, in International Journal of Machine Intelligence (IJMI), Volume 3, pp 354-358, 2011.

(b) Krishna Murthy. A, Suresha and Anil Kumar K M. “Challenges and Issues in Adapting Web contents on Small Screen Devices”, in International Journal of Information Processing (IJIP), vol. 8 issue 1 of IJIP, Banglore-2013.

(c) Krishna Murthy. A, Suresha, Anil Kumar K M. “Analysis and Issues of Adapting Web pages on Small Screen Devices”, in International Conference on Information Processing (ICIP) page 75-85, Elsevier

publications, Bangalore - 2013.

Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures

60

structure analysis. In addition to this, we present a novel scheme to customize

the Flash Web pages using Tansym Optical Character Recognizer (T-OCR) tool

by applying the Internet information Service (IIS) server concepts. The results

obtained are tested on various small screen devices. Further, the accessing time

of customized Web page results are compared with accessing time of original

Web pages using time analysis factor tools.

The organization of the chapter is as follows: after giving the overview in Section

3.1, we present the comparative performance analysis of existing systems in

Section 3.2. In Section 3.3, we present the analysis of working methods of VIPS

and Boilerpipe system on our datasets. In Section 3.4, we propose a novel

algorithm to customize the Flash Web page contents and finally the important

results are discussed in Section 3.5. We present the summary of this chapter in

Section 3.6

3.2 Comparative Analysis of Existing Systems

Web contents are currently designed for the Large Screen Devices (LSD) and rich

memory resources. LSD users can use convenient input devices such as a mouse,

keyboard to retrieve any Web page from any Website. Downloading time is

rarely a problem as the Personal Computers (PCs) are usually connected to the

Internet through high capacity lines and the large screen allows much irrelevant

(noise) information such as advertisements to be placed on the screen without

distracting the user. At present, experiencing the internet on Small Screen

terminals such as Mobile, Personal Digital Assistants (PDA) etc., is becoming

very popular. The current Web pages are intended for LSDs are not suitable for

Small Screen Devices (SSDs).

The straight forward solution for browsing on SSDs is to re-design the Web

pages using specific markup languages such as WML, XHTML. Some notable

sites, which have already implemented this include Yahoo, CNN, and Google

among others. Nonetheless, the vast majority of sites on the Web have not


61

customized Web pages for SSDs, because it is a time consuming process and not

economical [32]. Compared to LSDs, SSDs are not ideal platforms for surfing the

Web. Because in SSDs, wireless bandwidth is quite limited, it is very expensive

and screen size varies for different devices such as Mobiles, PDA’s etc., and

devices such as Mobile phones have limited memory capabilities. Normally, the

content of a single Web page will be larger than what a Mobile phone can hold.

Therefore, methods to reconstruct LSDs optimized Web pages for Small Screen

terminals are essential.

In this section, the experimental results of Web page Segmentation, Noise

Removal and re-arranging retrieved information to fit on SSDs are discussed in

detail. Table 3.1, [33] gives the performance comparison of query expansion

using different page segmentation methods. The average retrieval precision can

be improved after partitioning pages into blocks, no matter which segmentation

algorithm is used. In the case of FULLDOC, the maximal average precision is

19.10% when the top 10 documents are used to expand the query. DOMPS

obtains 19.67% when the top 50 blocks are used, a little more than FULLDOC.

VIPS gets the best result 20.98% when the top 20 blocks are used and achieves

26.77% improvement [33]. It achieves best precision when no. of segments less

than 30. Figure 3.1 and Figure 3.2(a) show the processing time of Gestalt Theory

method [34]. About 87% of pages are processed in less than 2 seconds. The time

is almost in proportion to the size of the layout tree, which is about one third of

the size of the DOM tree. Author has used recall to evaluate the performance of

Gestalt method with VIPS [33] and PAV [18]. Here recall is the fraction of

correctly recognized blocks over the standard blocks marked manually. N is the

number of result blocks. As N grows, a page is broken into more and more

smaller blocks. It is observed that Gestalt method always outperforms than PAV.

But VIPS achieves the best results Figure 3.2(b) when N is small because

proximity plays the key role at that time [33].


62

Three learning methods are used [38] to learn the models such as SVM, non-

linear SVM with RBF kernel and a RBF network. The performances obtained by

these methods are reported in Table 2. SVM with RBF kernel achieved the best

performance with Micro-F1 80.2% and Micro-Acc 86.8%. The linear SVM

performed worse than both SVM with RBF kernel and RBF network.

The results indicate that a nonlinear combination of the features is better than a

linear combination. In WPC [39] experiment, Naïve Baye’s text classification is

applied on three different data sets, cleaned using three approaches of Not

Cleaned (NC), Template (TPL) [29] and Web Page Cleaner (WPC) to check the

performance. Table 3.3 shows average classification (Naïve Baye’s text

classification) accuracy and standard error on four-fold cross validation for

method NC (79.68, 4.07), for method TPL [29] (96.23, 1.18) and for method WPC

(98.12, 0.29).

Table 3.1: Performance comparison using different page segmentation methods [33]

Number of

Segments

Baseline

(%)

FULL DOC (%) DOMPS

(%)

VIPS (%)

3

16.55

17.56 (+6.10)

17.94 (+8.40)

18.01 (+8.82)

5 17.46 (+5.50)

18.15 (+9.67)

19.39 (+17.16)

10 19.10 (+15.41)

18.05 (+9.06)

19.92 (+20.36)

20 17.89 (+8.10)

19.24 (+16.25)

20.98 (+26.77)

30 17.40 (+5.14)

19.32 (+16.74)

19.68 (+18.91)

40 15.50 (6.34)

19.57 (+18.25)

17.24 (+4.17)

50 13.82 (-16.50)

19.67 (+18.85)

16.63 (+0.48)

60 14.40 (-12.99)

18.58 (+12.27)

16.37 (-1.09)


63

Table 3.2: Comparison of learning methods [38]

Methods Level 1 Level 2 Level 3 Micro-F1 Micro-Acc

SVM

(RBF)

0.787 (P)

0.813 (R)

0.807 (P)

0.808 R)

0.837 (P)

0.754 (R)

0.802 0.868

SVM

linear

0.691 (P)

0.740 (R)

0.745 (P)

0.737 (R)

0.823 (P)

0.675 (R)

0.731 0.821

RBF

Network

0.727 (P)

0.717(R)

0.752 (P)

0.755(R)

0.777 (P)

0.799 (R)

0.746

0.830

Table 3.3: WPC average accuracy and standard error [39]

Test Case

Train Test Methods Avg Accuracy (%)

Standard Error

1-1 25

(5 per class)

2475 NC 79.41 2013

TPL 88.63 1.41

WPC 91.10 0.69

1-2 50 (10 per class)

2450 NC 90.52 1.01

TPL 92.42 0.96

WPC 95.44 0.40

1-3 75

(15 per class)

2425 NC 95.44 0.42

TPL 94.19 0.70

WPC 97.05 0.20

1-4 100

(20 per class)

2400 NC 94.89 0.43

TPL 94.45 0.33

WPC 97.11 0.20

1-5 250

(50 per class)

2250 NC 97.40 0.37

TPL 97.33 0.21

WPC 98.64 0.12

1-6 500

(100 per class)

2000 NC 97.97 0.27

TPL 98.09 0.09

WPC 99.00 0.06


64

Figure 3.1: Average precision vs. Number of segments [33]

The result [42] in Figure 3.3(a) indicates that, for “Bottom” target content

elements, the system is four times more efficient than the Google Wireless Trans-

coder, and twice as efficient for “Middle” elements. These results prove that the

proposed method significantly improves the usability of Web browsing on the

mobile phone.

Segmentation results of E-Gestalt are compared [43] with VIPS [33] and Gestalt

[21]. As illustrated in Figure 3.3(b), E-Gestalt produces the most PERFECT blocks

and the least ERROR blocks.

The count of NOT-BAD blocks is much higher in other two methods, mainly

because: large blocks tend to be split into sub-blocks much smaller than screen

size by VIPS and Gestalt. This reversely demonstrates the effectiveness of the

fine-tuned dividing process for generating mobile-fitted blocks.


65

Figure 3.2(a): Time taken for page segmentation by GESTALT [21]

Figure 3.2(b): Recall of Gestalt, PAV and VIPS [21]


66

Figure 3.3(a): Usability Evaluation Results [42]

Figure 3.3(b): Block content of E-Gestalt, Gestalt and VIPS [43]


67

3.3 Working with Existing Systems

In this section, we describe the issues on adaptation of the existing systems on

SSDs. And in the next section, we propose a method to translate Flash Web pages

to be made browsable on SSDs after analyzing the semantic structure.

Normally, the content of a single Web page will be much larger than contents

that can be accommodated in a mobile screen. Therefore, methods to translate

LSDs optimized Web pages for Small Screen terminals are essential. Several

attempts have been made to enable Web pages to be browsable on SSDs. All the

methods were designed based on HTML tags as there is a vast availability of

HTML pages on Web [26][29].

In order to study the feasibility of the existing approaches on Web pages created

using XML and Flash, we began the process of creating data set for XML and

Flash Web page, since there are no such data sets. To check the feasibility of the

existing systems, we set up few existing approaches which predominantly

dealing with the Web page segmentation to display the Web contents on SSDs.

For this, we have considered two such popular approaches namely, VIPS [33]

and Boilerpipe [27] systems on our Datasets.

3.3.1 Testing VIPS system on our Data sets

VIPS is a Web content structure analyzer/Web page analyzer based on visual

representation algorithm. Many Web applications such as information retrieval,

information extraction and automatic page adaptation can benefit from this

structure [33]. It discusses about an automatic top-down, tag-tree independent

approach to detect Web content structure and simulates user understanding of

Web layout structure based on visual perception. After setting-up VIPS system,

we tested system feasibility on our dataset’s, but the pages does not get

segmented. Figure 3.4 depicts successful segmentation of normal HTML Web

page (ex: http://www.uni-mysore.ac.in) after applying PDoC value (Permitted


68

Degree of Coherence – which controls the partition granularity); here the left part

shows the segmented blocks (different visual blocks). The block marked with red

rectangle is the selected content structure VB1-1-3 (Figure 3.4).

Figure 3.4: VIPS page analyzer on HTML Web page

Figure 3.5: VIPS page analyzer on Flash Web page

The upper right part shows the PDoC value and vision based content structure

(hierarchical structure of different blocks – VIPS tree) of Web page, bottom right


69

part gives the features of selected content structure. In different applications, we

can control the partition granularity by setting PDoC value, whereas VIPS

system fails to segment Flash Web pages (ex: http://www.isim.ac.in/mlw) after

entering PDoC value. Figure 3.5 depicts failure of segmentation of Flash Web

pages after applying PDoC value.

It segments entire Web page as a single block, but it fails to segment as

individual blocks. The reason behind this is Flash Web pages are semantically

different from HTML pages. VIPS is developed based on pre-defined HTML

tags.

Figure 3.6: VIPS page analyzer on XML Web page

Similarly, we have tested on XML Dataset (ex:

http://shrubbery.mynetgear.net/c/disply/ W/Web.xml). VIPS system fails to

display the Web contents in display region. Figure 3.6 depicts the pictorial view

of XML Web pages on VIPS system. The reason for failure is that the semantic

structure of XML Web pages is different from HTML Web pages. Hence, we have

concluded that the existing VIPS model fails to segment the Flash Webpage.


70

Table 3.4: Feasibility analysis of VIPS system on our Data sets

Figure 3.7: Feasibility analysis on VIPS systems on HTML, Flash and XML Web pages


71

Table 3.4 and Figure 3.7 depict the feasibility analysis of VIPS systems on various

kinds of Web pages (HTML, Flash and XML respectively).

3.3.2 Testing Boilerpipe system on our Data sets

Boilerpipe provides algorithms to segment and remove the surplus “clutter”

(boilerplate, templates) around the main textual content of a Web page [27]. We

have tested the Boilerpipe model on our datasets, for all cross combinations of

extractor and output mode.

Figure 3.8: Boiler Pipe page analysis on HTML Web page

Figure 3.9: Boiler Pipe page analysis on Flash Web page


72

They are employed to segment and retrieve informative contents from Web

pages. The model works fine for HTML Web pages, but fails to segment and

extract informative contents from Flash and XML Web pages. The model shows

the result as blank screen or returns ‘code error’ message. Figure 3.8 depicts

successful segmentation and information extraction of HTML Web page.

Figure 3.9 depicts the failure of segmentation and information extraction of flash

Web pages. Figure 3.10 depicts the failure on XML Web pages after applying

Extractor and Output mode options of Boilerpipe model.

Figure 3.10: Boiler Pipe page analysis on XML Web page

From this study, we have realized that the Boilerpipe model also fails to segment

and extract informative contents from Flash and XML Web pages. Therefore,

from the analysis of the existing systems, we have shown that these systems i.e.,

segmentation, noise removal and mobile browsing work well on Web pages

using pre-defined HTML tags.

Because of this failure, informative contents from Flash and XML Web pages face

a few limitations in real time applications such as inability to display these

contents on SSDs, inability to share contents on social Web sites, inability to

index these Web pages by Web search engines resulting in lack of its use in


73

information retrieval, information extraction etc., and citation rate of these Web

pages are also reduced due to afore mentioned limitations.

In order to address these limitations, we worked on semantic structure of XML

and Flash Web pages to explore the segmentation technique. After analyzing the

structure of XML and Flash, we proposed a system to translate Flash Web pages

into HTML Structure format which is suitable and viable on SSDs. We have

analyzed the browsing performance using two metrics namely content coverage

and response time on different mobile phones.

3.4 Translation of Flash into HTML Structure

In this approach, after downloading the Flash Web pages in the format of .mht

(MHTML), each tab page is converted into images using snipping tool (this is an

inbuilt tool in windows 7 OS). The images are saved in bitmap format and text

contents of each image extracted by using the Transym Optical Character

Recognition (TOCR) tool. Then the corresponding HTML Web pages are

constructed similar to original Flash Web pages. Figure 3.11 shows the

architecture of proposed system.

Figure 3.11: Architecture of Proposed System

After the translation of HTML pages, it is hosted using reverse proxy concepts

with the help of Internet Information Service (IIS) Web server. A very common


74

reverse proxy scenario is to make available several internal Web applications

over the Internet. An Internet-accessible Web server is used as a reverse-proxy

server that receives Web requests and then forwards them to several intranet

applications for processing.

We set up a Reverse Proxy server using the Application Request Routing (ARR)

and the URL Rewrite Module 2.0. After setting-up the reverse proxy concepts,

translated Web domains are hosted by assigning an individual port number to

respective domain names for easier access. Testing was done on different SSDs

based on content coverage and response time.

3.5 Experimental Results

Web pages are hosted after translation into HTML format (traditional Web

pages). The results have been analyzed based on accessing the Conventional

Flash Web pages (C-Web pages) as well as Traditional Web pages (T-Web pages).

Here, system performance analyzed by comparing the time factors of

downloading time on SSDs (based on system specification) and content

coverage.

We have performed the downloading time analysis on three different type of

hand held devices such as Sony Xperia X10, Sony Ericsson WT13i and LG

Optimus Net based on Wi-Fi connection. We obtain a better performance in

accessing T-Web pages compared to its corresponding C-Web pages. Table 3.5

and Table 3.6 provide results of response time and content coverage respectively

on our approach on Flash data set.

Similar processes were carried out on LG-Optimus hand held device. Two Web

sites namely www.noleath.com and www.marvismint.com are get downloaded

completely. It was also observed that very rarely Home Page (HP) of a few sites

were getting downloaded and the rest required Flash Player (FP) to display the

contents. In our approach, T-Web pages were displayed between 2.61 to 6.21


75

seconds as shown in Table 3.5. Figures 3.12, 3.13 and 3.14 are depicts pictorial

view of difference between response time of C and T Web pages (Sony Xperia

X10, Sony Ericsson WT13i and LG Optimus Net respectively).

Table 3.5: Response Time analysis on various SSDs

Sl.No

Websites

Sony Xperia X10

Sony Ericsson WT13i

LG Optimus Net

T-Web pages

(Sec)

C-Web pages

(Sec)

T-Web pages

(Sec)

C-Web pages

(Sec)

T-Web pages

(Sec)

C-Web pages

(Sec)

1 www.isim.ac.in/mlw 1.06 32.47 14.59 ND 6.21 FP

2 www.comunicacion.com

1.3 30.77 59.56 HP+FP

2.95 HP+FP

3 www.urbansurvivours.org

1.22 170 7.57 ND 3.14 FP

4 www.asual.com/swfa

ddress/ samples/flash

1.28 27.87 4.77 JS+FP 3.32 HP+FP

5 www.noleath.com 2 100 5.47 ND 2.61 23.12

6 www.sensisoft.com 1.28 389.82 5.96 FP 3.24 FP

7 www.marvismint.com 1.27 107.88 5.96 68.58 2.95 4.71

Figure 3.12: Performance Analysis on SonyXPeria10

0

100

200

300

400

1 2 3 4 5 6 7

Tim

e(S

eco

nd

s)

Flash Websites

HTTPS

HTTP

http://www.urbansurvivours.org/


http://www.asual.com/swfaddress/samples/flash



http://www.noleath.com/

http://www.sensisoft.com/

http://www.marvismint.com/


76

Figure 3.13: Performance Analysis on Sony Ericson WT13i

Figure 3.14: Performance Analysis on LG Optimus Net

After experimental analysis, we have performed the content coverage analysis

based on human perception as against the system view. The algorithm adopts

kappa[23] statistics to quantitatively measure the degree of the agreement as to

how Web pages share the similarity terms. Kappa result ranges from 0 to 1. The

higher the value of Kappa, the stronger the similarity. Kappa more than 0.7

typically indicates that similarity of two Web pages is strong. Kappa values

greater than 0.9 are considered excellent. Boolean values are used to represent

the content coverage (0100% content loss, 0.550% content loss, 10% content

0

20

40

60

80

1 2 3 4 5 6 7

Tim

e (S

eco

nd

s)

Flash Websites

HTTP

HTTPS

0

10

20

30

1 2 3 4 5 6 7

Tim

e(S

eco

nd

s)

Flash Websites

HTTPS

HTTP


77

loss). Table 3.6 depicts the clear view of kappa value relation between human

perception (Human View- HV) and system view (SV).

Pr(a) Pr(e)K =

1 Pr(e)

Where Pr(a) is the relative observed agreement among raters, and Pr(e) is the

hypothetical probability of chance agreement, using the observed data to

calculate the probabilities of each observer randomly saying each category. If the

raters are in complete agreement then k = 1 (If there is no agreement among the

raters other than what would be expected by chance as defined by Pr(e)), κ = 0).

Table 3.6: Content coverage analysis on various SSDs

S.No

Websites

Sony Xperia X10

Sony Ericsson WT13i

LG Optimus Net

T-Web pages (HV)

C-Web pages (SV)

T-Web pages (HV)

C-Web pages (SV)

T-Web pages (HV)

C-Web pages (SV)

1 www.isim.ac.in/mlw 1 1 1 0 1 0.1

2 www.comunicacion.com

1 1 1 0.2 1 0.2

3 www.urbansurvivours.org

1 1 1 0 1 0.1

4 www.asual.com/swfad

dress/ samples/flash

1 1 1 0.2 1 0.2

5 www.noleath.com 1 1 1 0 1 1

6 www.sensisoft.com 1 1 1 0.2 1 0.2

7 www.marvismint.com 1 1 1 1 1 1

3.6 Summary

The results obtained by the existing systems are compared with one and other

existing systems with respect to segmentation ratio. After choosing the best

performance algorithms such as VIPS and Boilerpipe, they are used to test the

feasibility analysis on our Data sets (Flash and XML). We observed that both the






http://www.noleath.com/

http://www.sensisoft.com/

http://www.marvismint.com/


78

predominantly leading algorithms fail to segment and retrieve contents from our

Data sets. Result tables clearly show that the existing systems are not viable for

XML and Flash kind of Web pages. Therefore, to propose the system to segment

and retrieve the contents, we performed the semantic structure analysis on both

the Data sets. After analyzing the semantic structure, Translation system has

been proposed to customize the Flash contents into HTML structure to make

them viable on SSDs. Here, the results obtained are tested on various hand held

devices such as Sony Xperia X10, Sony Ericsson WT13i and LG Optimus Net

based on various specification details. After achieving the better response time

for translated Web pages, obtained resultant vectors are compared with the

response time of respective conventional Web pages and Kappa statistical

analysis also has been performed to analyze the content coverage on SSDs.

chapter 3 comparative study and feasibility analysis of...

Documents