“An Approach to Identify Duplicated Web Pages”
G. Lucca, M. Penta, A. Fasolino
Compsac’02 pp.481-486
Today presented by Kenny Kwok
Why do we need to do this?
Web pages are loosely organized
They are usually coded incrementally
Code from existing pages is reused to write new pages (copy & paste)
Inline documentation is usually lacking
Why do we need to do this?
With techniques to identify duplicated web pages:
– It becomes feasible to carry out testing
– Web page maintenance becomes more efficient
– Possible plagiarism can be detected
Duplicated code => clones
Two or more pages are considered clones if:
– They have the same, or a very similar, structure, or
– They are characterized by the same values of the defined metrics
Types of Web Pages
Server pages
– Stored on the web server
– May contain server-side scripts
Client pages
– Static pages: saved in a file with permanent content
– Dynamic pages: built by the server at run time
The paper covers only static pages and server-side scripts
Since the results on server-side scripts are not conclusive, we discuss only static pages here.
How to detect duplicated Web Pages?
Two proposed approaches:
Levenshtein distance (Edit distance)
Occurrence frequency
Levenshtein distance
A.k.a. edit distance
The minimal transformation distance between two strings (the fewest single-symbol insertions, deletions, or substitutions that turn one into the other)
Requires O(n^2) computation time, where n is the size of the longer string
For example, given the strings
– u = ABCDEFG
– v = ADEG
the Levenshtein distance between u and v is D(u, v) = 3 (delete B, C and F from u)
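The example above can be checked with a short script. This is a minimal sketch of the standard dynamic-programming formulation, not code from the paper; the function name `levenshtein` is an assumption made for illustration.

```python
def levenshtein(u, v):
    """Edit distance between two sequences, computed row by row."""
    # prev[j] is the distance between the symbols of u seen so far
    # (excluding the current one) and the first j symbols of v.
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        curr = [i]
        for j, b in enumerate(v, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a from u
                            curr[j - 1] + 1,      # insert b into u
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(levenshtein("ABCDEFG", "ADEG"))  # prints 3
```

The two nested loops visit every pair of positions once, which is where the O(n^2) cost comes from.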
Levenshtein distance of Web Pages
Alphabet symbols: HTML tags (/div, /td, td, img, div, etc.)
Extract those tags and replace each with an alphabet symbol (e.g. /div -> a, /td -> b, …)
Translate the web page into an “HTML-string” composed of those symbols
The Levenshtein distance of two pages is then the distance between their corresponding HTML-strings
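The translation step might look like the sketch below. The alphabet is hypothetical apart from the two mappings quoted above (/div -> a, /td -> b), and skipping tags that are not in the alphabet is an assumption rather than something the paper specifies.

```python
from html.parser import HTMLParser

# Hypothetical alphabet: only "/div" and "/td" follow the mappings quoted
# in the slides; the remaining entries are made up for illustration.
TAG_ALPHABET = {"/div": "a", "/td": "b", "td": "c", "img": "d", "div": "e"}

class TagExtractor(HTMLParser):
    """Collects one alphabet symbol per opening or closing tag."""

    def __init__(self):
        super().__init__()
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        self._emit(tag)

    def handle_endtag(self, tag):
        self._emit("/" + tag)

    def _emit(self, name):
        symbol = TAG_ALPHABET.get(name)
        if symbol is not None:  # assumption: unknown tags are ignored
            self.symbols.append(symbol)

def html_string(page):
    """Translate an HTML page into its HTML-string."""
    parser = TagExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.symbols)

print(html_string('<td width="18%"><img src="x.jpg"></td>'))  # prints cdb
```

Attributes and text content are discarded, so only the tag structure of the page contributes to the distance.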
Levenshtein distance (example)
With the following HTML alphabet table:
HTML-string u = hifgieb
HTML-string v = hidcfgieab
<td width="18%"><img src="../images/Nuovo.jpg" width="92" height="27"></td>
<td width="35%"><div align="right"><img src="../pic1.jpg" width="92" height="27"></div></td>
Levenshtein distance (example)
The optimal alignment of u and v is:
u = h i - - f g i e - b
v = h i d c f g i e a b
The Levenshtein distance is D(u, v) = 3 (three insertions: d, c and a)
Pages are considered duplicated (similar) if their distance is small
But the paper does not quantitatively define what is meant by “small”.
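A sketch of how the clone decision could be made once a threshold is chosen. The default `max_distance` is purely hypothetical, since the paper gives no value, and `is_clone` is a name introduced here for illustration.

```python
def levenshtein(u, v):
    """Edit distance between two HTML-strings (dynamic programming)."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        curr = [i]
        for j, b in enumerate(v, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (a != b)))   # match or substitution
        prev = curr
    return prev[-1]

def is_clone(u, v, max_distance=3):
    """Hypothetical criterion: clones if the distance is at most max_distance."""
    return levenshtein(u, v) <= max_distance

u = "hifgieb"
v = "hidcfgieab"
print(levenshtein(u, v))  # prints 3
print(is_clone(u, v))     # prints True with the assumed threshold of 3
```

An absolute threshold treats long and short pages alike; dividing the distance by the length of the longer HTML-string would be one alternative, but that too is an assumption and not something the paper proposes.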
Problems and possible improvements
May detect misleading similarities
– Due to the sequence of HTML attributes
– False positives: different pages have a small distance value
Suggestion: substitute each composite tag in alphabet A with its equivalent tag in a new alphabet A’
– But the paper does not say anything further about the A’ alphabet
Problems and possible improvements
May not detect meaningful similarities
– Due to different tags of a similar nature, e.g. formatting tags (H1, H2, H3)
Suggestion: define an alphabet A’’ of formatting tags, and eliminate from the HTML-strings the symbols that belong to A’’
– Again, the paper does not say anything further about the A’’ alphabet
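One way to read the second suggestion is as a filtering step applied before the distance is computed. The sketch below works on tag names rather than symbols, and the contents of `FORMATTING_TAGS` are entirely hypothetical because the paper does not list A’’.

```python
# Hypothetical stand-in for A'': the paper does not say which tags it contains.
FORMATTING_TAGS = {"h1", "h2", "h3", "b", "i", "font"}

def strip_formatting(tags):
    """Drop opening and closing formatting tags from a tag sequence."""
    return [t for t in tags if t.lstrip("/") not in FORMATTING_TAGS]

page_1 = ["td", "h1", "/h1", "img", "/td"]
page_2 = ["td", "h2", "/h2", "img", "/td"]

# The pages differ only in heading level, so they match after filtering.
print(strip_formatting(page_1) == strip_formatting(page_2))  # prints True
```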
Occurrence frequency
Make use of the HTML-array, which records how often each tag occurs in a page
Compare pages by the Euclidean distance between their HTML-arrays
ED(u, v) = 1.732
Much faster to compute
May identify all clones found by the previous method
More likely to report false-positive clones
Again, the paper does not describe the clone criterion or the ED threshold: there is no clue of how “small” it should be
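The frequency measure can be sketched as follows, assuming the HTML-array is a per-symbol count and that the ED value on the slide refers to the same u and v as the Levenshtein example. Under those assumptions the script reproduces 1.732; the function names are made up for illustration.

```python
from collections import Counter
from math import sqrt

def html_array(html_string):
    """Occurrence count of each symbol in an HTML-string."""
    return Counter(html_string)

def euclidean_distance(u, v):
    """Euclidean distance between the HTML-arrays of two HTML-strings."""
    cu, cv = html_array(u), html_array(v)
    symbols = set(cu) | set(cv)  # a Counter returns 0 for missing symbols
    return sqrt(sum((cu[s] - cv[s]) ** 2 for s in symbols))

# v has one extra d, c and a, so the distance is sqrt(1 + 1 + 1).
print(round(euclidean_distance("hifgieb", "hidcfgieab"), 3))  # prints 1.732
```

Counting takes time linear in the page size, which explains the speed advantage. The order of tags is discarded, so two pages with the same tags in a different order get distance 0, which is one source of false positives.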
Experiment Result
Levenshtein:
– Accurate
– Slow
Frequency measure:
– Introduces false positives
– Much faster
Suggestion:
– Use the frequency measure to extract candidates, then use the Levenshtein distance to verify the result
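The suggested combination can be sketched as a two-stage filter. Both thresholds (`max_ed`, `max_ld`) and the sample pages are hypothetical; the paper does not give values for either.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def euclidean(u, v):
    """Cheap measure: distance between symbol-frequency vectors."""
    cu, cv = Counter(u), Counter(v)
    return sqrt(sum((cu[s] - cv[s]) ** 2 for s in set(cu) | set(cv)))

def levenshtein(u, v):
    """Expensive measure: edit distance via dynamic programming."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        curr = [i]
        for j, b in enumerate(v, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]

def find_clones(pages, max_ed, max_ld):
    """pages maps a page name to its HTML-string; thresholds are assumptions."""
    clones = []
    for (name_a, a), (name_b, b) in combinations(pages.items(), 2):
        if euclidean(a, b) > max_ed:     # stage 1: discard unlikely pairs
            continue
        if levenshtein(a, b) <= max_ld:  # stage 2: verify the candidates
            clones.append((name_a, name_b))
    return clones

# p3 has the same symbols as p1 in reverse order: a frequency false positive.
pages = {"p1": "hifgieb", "p2": "hidcfgieab", "p3": "beigfih"}
print(find_clones(pages, max_ed=2.0, max_ld=3))  # prints [('p1', 'p2')]
```

Here p1 and p3 pass the frequency stage with distance 0 but are rejected by the Levenshtein stage, so the slow measure only runs on pairs that the fast one could not rule out.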
Conclusion
Two web page clone detection methods are proposed and evaluated
Each has its strengths and weaknesses, but they can be combined into a refinement process
Clone detection techniques are useful to:
– Identify cases of plagiarism
– Highlight reuse of patterns of HTML tags
– Facilitate Web maintenance
– Facilitate the testing process of web applications
Final Note
The paper does not give the translation alphabet table or explain how to obtain it correctly
The paper does not state the distance similarity criteria used in the experiment
The experiment does not cover the detection of plagiarism, although it may be possible
Q&A
Thank You