Journal of Information Science and Engineering, Vol.11 No.1, pp.35-49 (March 1995)
Approximating False Hits of Disyllabic Terms
in a Chinese Signature File

Tyne Liang, Suh-Yin Lee and Wei-Pang Yang*
Institute of Computer Science and Information Engineering
National Chiao Tung University
Hsinchu, Taiwan 300, R.O.C.
*Institute of Computer and Information Science
National Chiao Tung University
Hsinchu, Taiwan 300, R.O.C.

The signature access method is a well-proven technique in text retrieval systems. However, a drawback of signature files is the false hits inherent in the filtering process. In this paper, we discuss the problem of false hits for Chinese disyllabic queries. We identify two kinds of false hits. The first, which we call random false hits, is caused by the accidental setting of signature bits. The second, which we call adjacency false hits, is due to the lack of character sequence information in signature files. Since many Chinese query terms are disyllabic (composed of two characters), we formulate the false hit probability specifically for disyllabic queries, based on statistical theory. Our theoretical model has been tested in experiments using a real corpus. Satisfactory agreement between the predictions and the experimental results has been obtained for both kinds of false hits.

Keywords: Chinese text retrieval, signature file, false hits, adjacency false hit, disyllabic terms

Received August 18, 1994; revised April 15, 1995.
Communicated by Hsi-Jian Lee.