| Previous | [ 1] | [ 2] | [ 3] | [ 4] | [ 5] | [ 6] | [ 7] | [ 8] | [ 9] | [ 10] | [ 11] | [ 12] | [ 13] | [ 14] | [ 15] | [ 16] | [ 17] | [ 18] | [ 19] | [ 20] |
¡@
Juryon Paik, Junghyun Nam+, Dongho Won and Ung Mo Kim
Department of Computer Engineering
Sungkyunkwan University
Gyeonggi-do, 440-746 Korea
+Department of Computer Science
Konkuk University
Seoul, 143-701 Korea
With the continuous growth in XML data sources over the Internet, the discovery of
useful information from a collection of XML documents is currently one of the main research
areas occupying the data mining community. The most commonly adopted approach
to this task is to extract frequently occurring subtree patterns from XML trees. But, the
number of frequent subtrees usually grows exponentially with the size of trees, and therefore,
mining all frequent subtrees becomes infeasible for large size trees. A more practical and
scalable alternative is to use maximal frequent subtrees, the number of which is much
smaller than that of frequent subtrees. Handling the maximal frequent subtrees is an interesting
challenge, though, and represents the core of this paper. We present a novel, conceptually
simple, yet effective algorithm, called EXiT-B, that significantly simplifies the process
of mining maximal frequent subtrees. This is achieved by two distinct features. First,
EXiT-B represents all of string node labels of trees by some specified length of bits. Through
fast bitwise operations, the process of deciding on which paths of trees contain a given node
is accelerated. Second, EXiT-B avoids time-consuming tree join operations by using a specially
devised data structure called PairSet. To the best of our knowledge, EXiT-B is the first
algorithm that discovers maximal frequent subtrees adopting bits representation. We also
demonstrate the performance of our algorithm through extensive experiments using synthetic
datasets which were generated artificially by a randomized tree-structure generator.
Received August 2, 2007; revised November 28, 2007; accepted February 22, 2008.
Communicated by Chin-Teng Lin.
* This research was supported in part by the Ubiquitous Autonomic Computing and Network Project, 21st
Century Frontier R&D Program and by the ITRC (Information Technology Research Center) support program
supervised by the IITA (Institute of Information Technology Advancement) (IITA-2008-C1090-0801-0028), both funded by the MKE (Ministry of Knowledge Economy), Korea.