Journal of Information Science and Engineering, Vol. 29 No. 2, pp. 209-225 (March 2013)

Inverse-Category-Frequency Based Supervised Term Weighting Schemes for Text Categorization*

State Key Laboratory of Software Development Environment
Beihang University
Beijing, 100191 P.R. China
E-mail: {dqwang; hzhang}

Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifiers, and SVMs. The widely used term weighting scheme in text categorization, tf.idf, originates from the information retrieval (IR) field. The intuition behind idf seems less reasonable for text categorization than for IR. In this paper, we introduce inverse category frequency (icf) into term weighting and propose two novel approaches: tf.icf and an icf-based supervised term weighting scheme. The tf.icf scheme substitutes icf for the idf factor and favors terms occurring in fewer categories rather than in fewer documents. The icf-based approach combines icf with relevance frequency (rf) to weight terms in a supervised way. Our cross-classifier and cross-corpus experiments show that the proposed approaches are superior or comparable to six supervised term weighting schemes and three traditional schemes in terms of macro-F1 and micro-F1.
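To illustrate the idea behind tf.icf, the following is a minimal sketch, not the paper's exact formulation. The abstract does not state the icf formula, so the sketch assumes icf is defined by direct analogy with idf as log(|C| / cf(t)), where |C| is the number of categories and cf(t) is the number of categories containing term t. The function name `tf_icf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_icf(docs, labels):
    """Weight terms by tf * icf.

    docs:   list of token lists, one per document
    labels: category label of each document
    Assumption: icf(t) = log(|C| / cf(t)), chosen by analogy with idf;
    the source abstract does not give the exact formula.
    """
    # Collect the set of categories in which each term occurs at least once.
    term_cats = defaultdict(set)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_cats[term].add(label)

    n_categories = len(set(labels))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within the document
        weights.append({
            term: freq * math.log(n_categories / len(term_cats[term]))
            for term, freq in tf.items()
        })
    return weights


# Toy corpus (hypothetical): three documents in two categories.
docs = [["goal", "match", "team"], ["team", "election"], ["election", "vote", "vote"]]
labels = ["sport", "politics", "politics"]
w = tf_icf(docs, labels)
```

In this toy corpus, "election" appears in two of the three documents but in only one category, so it keeps a positive icf weight, whereas "team" occurs in both categories and receives a weight of zero. This is the sense in which the scheme favors terms occurring in fewer categories rather than in fewer documents.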

Keywords: unsupervised term weighting schemes, supervised term weighting schemes, inverse category frequency, text categorization, term weighting

Received December 28, 2011; revised February 24, 2012; accepted June 5, 2012.
Communicated by Hsin-Min Wang.
* This work was supported by the 863 High-Tech Program under Grant No. 2007AA010403.