Project : Frequent Itemset Mining
 
 
 
Frequent Itemset Mining, or called FIM, is the most important task in association mining proposed by [1].
Given a set of transactions, each transaction contains several items. A transaction is defined as a set of items have occurred together.
For example, in the supermarker, when a custom buys milk, beef and beer, the transaction generated by the custom contains three item (milk, beef and beer). The goal of FIM is try to generate all the frequent-item whose freqeuncy are more than the predefined threshold. The predefined threahold is given by the ratio of the number of transaction.
Agrawal[1] propose a novel idea that the frequent-items at stage K are generated from the freqeunt-items at stage K-1. The frequent-items at stage K represents the number of items K. For example, {milk, beef} is a 2 stage frequent-items. The intuition of Agrawal proposed idea is that a k-stage frequent-items indicates all of its (k-1)-stage items are frequent. Based on this intuition, lots of reduntant items are pruned to reduce extra computation cost. In [1], the candidate items (or called Ck) at stage K are generated from the frequent items (or called Lk-1) at stage at K-1; then the Ck are "checked" to be Lk. The term "checked" means the frequency of Ck is evaluated by re-scanning the database, and Ck with less frequence than the predefined threshold are pruned, and the remain Ck are called Lk. For more detail example, please refer to [2].
The main costs of apriori's approach have two points: One is the cost on the candidate generation and the other is on the cost of the re-scanning the database. In the candidate generation, each frequent items at K-1 have to check each other to generate candidate itemset at K which needs O(N2) where N is the number of freqeunt items at K-1. However, the cost of generating candidate can be reduced by hashing technique. On the other side, the cost of re-scanning database, especially x-TB dataset, can be reduced by bitmap-based[3] technique. In the following source code, we release two version apriori based approach: One is general apriori approach and the other is bitmap-based approach.



Both of these two version are built on Visual Studio project. Feel free to use and report errors for me, thanks.
 
 
 
FAQ on the source code
how to use???? in command line: xxxx.exe input_dataset minisup
how to rebuild????? just open the project file by visual studio and rebuild it.
What is the format of input file???? each line is treated as a transaction, and each item in a transaction is separated by space.For 983 example
What is the output??? each line lists the number of frequent items. For example
general apriori
bitmap-based apriori
challenge dataset1
challenge dataset1
 
 
If you need more efficient approaches, such as hmine, fp*, op or GPU-based approach, please mail to me, thanks.
 
 
Reference:
[1] Agrawal R, Imielinski T, Swami AN. "Mining Association Rules between Sets of Items in Large Databases." SIGMOD. June 1993, 22(2):207-16
[2] Apriori_algorithm. http://en.wikipedia.org/wiki/Apriori_algorithm.
[3] Liu Yang, Mei Qiao, "A Bitmap Compression Algorithm for Vertical Association Rules Mining," iscsct, vol. 2, pp.101-104, 2008 International Symposium on Computer Science and Computational Technology, 2008