“An Approach to Identify Duplicated Web Pages”

G. Lucca, M. Penta, A. Fasolino

COMPSAC’02, pp. 481-486

Today presented by Kenny Kwok

Why do we need to do that?

Web pages are loosely organized
They are usually coded incrementally
Code from existing pages is reused to write new pages (copy & paste)
Inline documentation is usually lacking

Why do we need to do that?

With techniques to identify duplicated web pages:
– Testing becomes feasible
– Web page maintenance becomes more efficient
– Possible plagiarism can be detected

Duplicated code => clones
Two or more pages are considered clones if:
– They have the same, or a very similar, structure, or
– They are characterized by the same values of the defined metrics

Types of Web Pages

Server pages
– Pages stored on the web server
– May contain server-side scripts

Client pages
– Static pages: saved in a file with permanent content
– Dynamic pages: built by the server at run time

The paper covers only static pages and server-side scripts. Since the results on server-side scripts are not conclusive, we discuss only the former.

How to detect duplicated Web Pages?

Two proposed approaches:

Levenshtein distance (Edit distance)

Occurrence frequency

Levenshtein distance

A.k.a. edit distance
The minimal transformation distance between two strings
Requires O(n^2) computation time, where n is the size of the longer string

For example, take the strings u and v:
– u = ABCDEFG
– v = ADEG

The Levenshtein distance between u and v is D(u, v) = 3 (delete B, C, and F from u)
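As a minimal sketch of the computation behind this slide, the function below uses the standard dynamic-programming formulation with unit costs for insertion, deletion, and substitution. The function name `levenshtein` is chosen here for illustration; the paper's own implementation is not shown in these slides.

```python
def levenshtein(u: str, v: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    # prev[j] is the distance between the processed prefix of u and v[:j].
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, start=1):
        curr = [i]  # distance from u[:i] to the empty string
        for j, cv in enumerate(v, start=1):
            cost = 0 if cu == cv else 1
            curr.append(min(
                prev[j] + 1,         # delete cu from u
                curr[j - 1] + 1,     # insert cv into u
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(levenshtein("ABCDEFG", "ADEG"))  # 3
```

The two nested loops are where the quadratic running time comes from. Only two rows of the table are kept, so memory stays linear in the length of v.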

Levenshtein distance of Web Pages

Alphabet symbols:
– HTML tags (/div, /td, td, img, div, etc.)
– Extract those tags and replace each one with a symbol of the alphabet (e.g. /div -> a, /td -> b, ...)

Translate the web page into an “HTML-string” composed of those symbols

The Levenshtein distance between two pages is then the distance between their corresponding HTML-strings

Levenshtein distance (example)

With the HTML alphabet table (not reproduced here), the two fragments below translate into the following HTML-strings:

<td width="18%"><img src="../images/Nuovo.jpg" width="92" height="27"></td>
HTML-string u = hifgieb

<td width="35%"><div align="right"> <img src="../pic1.jpg" width="92" height="27"> </div> </td>
HTML-string v = hidcfgieab

Levenshtein distance (example)

The optimal alignment of u and v is:

u = h i - - f g i e - b
v = h i d c f g i e a b

The Levenshtein distance is D(u, v) = 3 (three insertions: d, c, and a)
Pages are considered duplicated (similar) if their distance is small
But the paper does not quantitatively define what is meant by “small”.
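The translation step in this example can be sketched with Python's standard `html.parser`. The alphabet below is an assumption: it is inferred from the example strings u and v (one symbol per tag name and per attribute name, assigned in alphabetical order), not copied from the paper's table. The class and function names are likewise chosen here for illustration.

```python
from html.parser import HTMLParser

# ASSUMPTION: alphabet inferred from the example strings u and v,
# not copied from the paper's table.
ALPHABET = {
    "/div": "a", "/td": "b", "align": "c", "div": "d", "height": "e",
    "img": "f", "src": "g", "td": "h", "width": "i",
}


class SymbolCollector(HTMLParser):
    """Collects one symbol per known tag name and attribute name."""

    def __init__(self, alphabet):
        super().__init__()
        self.alphabet = alphabet
        self.symbols = []

    def _emit(self, name):
        # Names outside the alphabet are silently skipped.
        if name in self.alphabet:
            self.symbols.append(self.alphabet[name])

    def handle_starttag(self, tag, attrs):
        self._emit(tag)
        for name, _value in attrs:
            self._emit(name)

    def handle_endtag(self, tag):
        self._emit("/" + tag)


def html_string(html, alphabet=ALPHABET):
    """Translate an HTML fragment into its HTML-string."""
    collector = SymbolCollector(alphabet)
    collector.feed(html)
    collector.close()
    return "".join(collector.symbols)


u = html_string('<td width="18%"><img src="../images/Nuovo.jpg" '
                'width="92" height="27"></td>')
v = html_string('<td width="35%"><div align="right"> '
                '<img src="../pic1.jpg" width="92" height="27"> </div> </td>')
print(u, v)  # hifgieb hidcfgieab
```

Text content and attribute values are ignored, so two pages with different wording but the same markup yield the same HTML-string.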

Problems and possible improvements

May detect misleading similarities
– Due to the sequence of HTML attributes
– False positives: different pages have a small distance value

Suggestion:
– Substitute each composite tag in alphabet A with its equivalent tag in a new alphabet A’
– But the paper gives no further details about the alphabet A’

Problems and possible improvements

May not detect meaningful similarities
– Due to different tags of a similar nature, e.g. formatting tags (H1, H2, H3)

Suggestion:
– Define an alphabet A’’ of formatting tags
– Eliminate from the HTML-strings the symbols that belong to A’’
– Again, the paper gives no further details about the alphabet A’’
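One way to read the second suggestion is as a filtering pass that drops formatting tags before the strings are compared. The sketch below works on lists of tag names for readability. The contents of `FORMATTING_TAGS` are hypothetical, since the slides do not give the paper's A’’ set.

```python
# HYPOTHETICAL A'' set: the actual set used in the paper is not given.
FORMATTING_TAGS = {"h1", "h2", "h3", "b", "i", "font"}


def drop_formatting(tags):
    """Remove opening and closing formatting tags from a tag sequence."""
    return [t for t in tags if t.lstrip("/") not in FORMATTING_TAGS]


page_1 = ["h1", "/h1", "p", "/p"]
page_2 = ["h2", "/h2", "p", "/p"]
print(drop_formatting(page_1) == drop_formatting(page_2))  # True
```

After filtering, two pages that differ only in heading level produce identical sequences, so their distance drops to zero.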

Occurrence frequency

Makes use of the HTML-array

Compare pages by the Euclidean distance between their HTML-arrays, e.g. ED(u, v) = 1.732

Much faster to compute
May identify all clones found by the previous method
More likely to report false-positive clones

The paper, again, does not describe the clone criteria or the threshold value of ED. There is no clue as to how “small” it should be
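The slides do not define the HTML-array. The sketch below assumes it is the vector of occurrence counts, one entry per alphabet symbol. Under that assumption the example strings u = hifgieb and v = hidcfgieab differ by one occurrence each of a, c, and d, which gives sqrt(3), consistent with the 1.732 quoted above. The function names and the nine-symbol alphabet are chosen here for illustration.

```python
from collections import Counter
from math import sqrt


def html_array(html_string, alphabet):
    """Occurrence count of each alphabet symbol (assumed HTML-array)."""
    counts = Counter(html_string)
    return [counts[symbol] for symbol in alphabet]


def euclidean_distance(u, v, alphabet):
    """Euclidean distance between the HTML-arrays of u and v."""
    a = html_array(u, alphabet)
    b = html_array(v, alphabet)
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


ed = euclidean_distance("hifgieb", "hidcfgieab", "abcdefghi")
print(round(ed, 3))  # 1.732
```

Counting symbols takes time linear in the string length, which is why this measure is cheaper than the quadratic edit distance. It also ignores symbol order, so reordered pages look identical, which is one source of false positives.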

Experiment Result

Levenshtein:
– Accurate
– Slow

Frequency measure:
– Introduces false positives
– Much faster

Suggestion:
– Use the frequency measure to extract candidates, then use the Levenshtein distance to verify the result
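The suggested combination can be sketched as a two-stage filter: the cheap frequency measure selects candidate pairs, and the Levenshtein distance confirms them. The thresholds `ed_max` and `ld_max`, the page names, and the function names are all hypothetical, since the paper does not state what counts as a small distance.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def frequency_distance(u, v):
    """Euclidean distance between symbol-count vectors of u and v."""
    cu, cv = Counter(u), Counter(v)
    return sqrt(sum((cu[s] - cv[s]) ** 2 for s in set(cu) | set(cv)))


def levenshtein(u, v):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, start=1):
        curr = [i]
        for j, cv in enumerate(v, start=1):
            cost = 0 if cu == cv else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def find_clones(pages, ed_max, ld_max):
    """Return page-name pairs that pass both stages.

    pages maps a page name to its HTML-string; ed_max and ld_max are
    hypothetical thresholds, not values from the paper.
    """
    clones = []
    for (name_u, u), (name_v, v) in combinations(pages.items(), 2):
        if frequency_distance(u, v) > ed_max:
            continue  # stage 1: cheap filter rejects the pair
        if levenshtein(u, v) <= ld_max:
            clones.append((name_u, name_v))  # stage 2: verified
    return clones


pages = {  # hypothetical page names
    "p1.html": "hifgieb",
    "p2.html": "hidcfgieab",
    "p3.html": "bbbbhhhh",
}
print(find_clones(pages, ed_max=2.0, ld_max=3))  # [('p1.html', 'p2.html')]
```

Each single edit changes at most two symbol counts by one, so the frequency distance never exceeds twice the Levenshtein distance. Choosing `ed_max` at least `2 * ld_max` therefore keeps the first stage from discarding a pair the second stage would accept.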

Conclusion

Two web page clone detection methods are proposed and evaluated

Each has its strengths and weaknesses, but they can be combined into a refinement process

Clone detection techniques are useful to:
– Identify cases of plagiarism
– Highlight reuse of patterns of HTML tags
– Facilitate Web maintenance
– Facilitate the testing process of web applications

Final Note

The paper does not describe the translation alphabet table or how to obtain it correctly

The paper does not mention the distance similarity criteria for the experiment

The experiment does not cover the detection of plagiarism although it may be possible

Q&A

Thank You
