Previous [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12]

@

Journal of Information Science and Engineering, Vol. 21 No. 1, pp. 109-128 (January 2005)

Fast Discovery of Sequential Patterns through
Memory Indexing and Database Partitioning

Ming-Yen Lin and Suh-Yin Lee*
*Department of Computer Science and Information Engineering
National Chiao Tung University
Hsinchu, 300 Taiwan
E-mail: sylee@csie.nctu.edu.tw
Department of Information Engineering and Computer Science
Feng Chia University
Taichung, 407 Taiwan
E-mail: linmy@fcu.edu.tw

Sequential pattern mining is a challenging issue because of the high complexity of temporal pattern discovering from numerous sequences. Current mining approaches either require frequent database scanning or the generation of several intermediate databases. As databases may fit into the ever-increasing main memory, efficient memory-based discovery of sequential patterns is becoming possible. In this paper, we propose a memory indexing approach for fast sequential pattern mining, named MEMISP. During the whole process, MEMISP scans the sequence database only once to read data sequences into memory. The find-then-index technique is recursively used to find the items that constitute a frequent sequence and constructs a compact index set which indicates the set of data sequences for further exploration. As a result of effective index advancing, fewer and shorter data sequences need to be processed in MEMISP as the discovered patterns get longer. Moreover, we can estimate the maximum size of the total memory required, which is independent of the minimum support threshold, in MEMISP. Experimental results indicate that MEMISP outperforms both GSP and PrefixSpan (general version) without the need for either candidate generation or database projection. When the database is too large to fit into memory in a batch, we partition the database, mine patterns in each partition, and validate the true patterns in the second pass of database scanning. Experiments performed on extra-large databases demonstrate the good performance and scalability of MEMISP, even with very low minimum support. Therefore, MEMISP can efficiently mine sequence databases of any size, for any minimum support values.

Keywords: data mining, sequential patterns, memory indexing, find-then-index, database partitioning

Full Text () Retrieve PDF document (200501_06.pdf)

Received December 12, 2002; revised June 25 & December 24, 2003; accepted April 27, 2004.
Communicated by Ming-Syan Chen.