Journal of Information Science and Engineering, Vol. 27 No. 6, pp. 1787-1822 (November 2011)

An Experimental Approach to Detect Similar Web Pages Based on 3-Levels of Similarity Clues*

1Software Capability Development Center
LG Electronics
Seoul, 137-130 Korea
2School of Computer Science and Engineering
Seoul National University
Seoul, 151-742 Korea
+School of Computer Science and Engineering
Kyungpook National University
Daegu, 702-701 Korea

Web applications are hard to maintain because they change rapidly and combine a growing variety of techniques. Several approaches, such as clustering or refactoring web applications, have been suggested to improve their maintainability, and a similarity measure is one of the principal criteria in these approaches. However, existing studies on web similarity have focused on semantic or contextual similarity, and most existing clone detection techniques target general applications rather than web applications. In this paper, we propose WSIM, a measure of similarity in web applications based on the degree to which similarity clues are used and on two linking directions. The similarity clues include page relations, source and target entities, and parameters. WSIM is classified into three levels and two directions. Six kinds of WSIM are defined, each with its own purpose. Finally, several experiments were conducted on simulated data and real open-source software to validate the proposed WSIM.

Keywords: web application, similarity, page clone, clues, maintainability


Received December 3, 2010; revised May 6, 2011; accepted June 15, 2011.
Communicated by Chih-Ping Chu.
* This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2011-0005632, NRF-2007-331-D00407).
+ Corresponding author.