Previous [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20]

@

Journal of Information Science and Engineering, Vol. 25 No. 2, pp. 435-464 (March 2009)

Fast Extraction of Maximal Frequent Subtrees Using Bits Representation*

Juryon Paik, Junghyun Nam+, Dongho Won and Ung Mo Kim
Department of Computer Engineering
Sungkyunkwan University
Gyeonggi-do, 440-746 Korea
+Department of Computer Science
Konkuk University
Seoul, 143-701 Korea

With the continuous growth in XML data sources over the Internet, the discovery of useful information from a collection of XML documents is currently one of the main research areas occupying the data mining community. The most commonly adopted approach to this task is to extract frequently occurring subtree patterns from XML trees. But, the number of frequent subtrees usually grows exponentially with the size of trees, and therefore, mining all frequent subtrees becomes infeasible for large size trees. A more practical and scalable alternative is to use maximal frequent subtrees, the number of which is much smaller than that of frequent subtrees. Handling the maximal frequent subtrees is an interesting challenge, though, and represents the core of this paper. We present a novel, conceptually simple, yet effective algorithm, called EXiT-B, that significantly simplifies the process of mining maximal frequent subtrees. This is achieved by two distinct features. First, EXiT-B represents all of string node labels of trees by some specified length of bits. Through fast bitwise operations, the process of deciding on which paths of trees contain a given node is accelerated. Second, EXiT-B avoids time-consuming tree join operations by using a specially devised data structure called PairSet. To the best of our knowledge, EXiT-B is the first algorithm that discovers maximal frequent subtrees adopting bits representation. We also demonstrate the performance of our algorithm through extensive experiments using synthetic datasets which were generated artificially by a randomized tree-structure generator.

Keywords: XML mining, tree mining, mining methods and algorithms, bits representation, maximal frequent subtree

Full Text () Retrieve PDF document (200903_07.pdf)

Received August 2, 2007; revised November 28, 2007; accepted February 22, 2008.
Communicated by Chin-Teng Lin.
* This research was supported in part by the Ubiquitous Autonomic Computing and Network Project, 21st Century Frontier R&D Program and by the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2008-C1090-0801-0028), both funded by the MKE (Ministry of Knowledge Economy), Korea.