Journal of Information Science and Engineering, Vol. 17, No. 5, pp. 805-824 (September 2001)

Extracting Chinese Frequent Strings Without a Dictionary
From a Chinese Corpus and its Applications

Yih-Jeng Lin and Ming-Shing Yu*
Department of Applied Mathematics
National Chung-Hsing University
Taichung, 402 Taiwan

This paper describes how to extract Chinese frequent strings without using a dictionary. We generalize the notions of words and unknown words to that of frequent strings. The Chinese frequent strings (CFSs) we define include words, unknown words, and other strings that are frequently used. Some examples of CFSs are “只得將 (can only let)”, “分分秒秒 (every minute and every second)”, “為對方著想 (bearing in mind the interest of each other)”, and “並沒有人 (and nobody)”. CFSs are very useful in Chinese natural language processing and its related applications. We show their application to the following three tasks: Chinese phoneme-to-character conversion, Chinese character-to-phoneme conversion, and the determination of prosodic segments in a Chinese sentence for text-to-speech output. We have also developed a simple method to extract CFSs from a corpus. The proposed method detects such strings automatically, without any lexicon and without word segmentation. It can also extract unknown words in a corpus that consist of three or more words. Such words (e.g., 網際網路) usually cannot be extracted by a concatenation approach.

Keywords: CFS, normalized perplexity, phoneme-to-character, character-to-phoneme, prosodic segment

Received April 18, 2000; revised September 13, 2000; accepted October 30, 2000.
Communicated by Wen-Lian Hsu.