
Journal of Information Science and Engineering, Vol. 26 No. 4, pp. 1491-1507 (July 2010)

Development of a Mandarin-English Bilingual Speech Recognition System with Unified Acoustic Models*

QING-QING ZHANG, JIE-LIN PAN AND YONG-HONG YAN
ThinkIT Speech Laboratory
Institute of Acoustics
Chinese Academy of Sciences
Beijing, 100190 China

This paper presents our recent work on the development of a grammar-constrained Mandarin-English bilingual Speech Recognition System (MESRS) for real-world music retrieval. We tackle two of the main difficulties in bilingual speech recognition for real-world applications: the first is balancing the performance and the complexity of the bilingual speech recognition system; the second is dealing effectively with matrix-language accents in the embedded language. To address the first problem, we develop a unified bilingual acoustic model derived by a novel Two-pass phone-clustering method based on the Confusion Matrix (TCM). To address the second problem, we investigate several nonnative model modification approaches on the unified acoustic models. Compared with the existing log-likelihood phone-clustering method, the proposed TCM method, combined with effective incorporation of limited amounts of nonnative adaptation data and adaptive modification, achieves a 10.9% relative reduction in Phrase Error Rate (PER) for nonnative English phrases. The PER on Mandarin phrases also decreases, and the PER on bilingual code-mixing phrases is reduced by 8.9% relative.
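The reductions quoted above are relative, not absolute. As a minimal sketch of that arithmetic, the Python snippet below computes a relative PER reduction from a baseline and an improved error rate. The function name `relative_reduction` and the two PER values are hypothetical, chosen only to illustrate the formula; they are not figures reported in the paper.

```python
def relative_reduction(baseline_per: float, new_per: float) -> float:
    """Return the relative error-rate reduction, in percent.

    relative reduction = 100 * (baseline - new) / baseline
    """
    if baseline_per <= 0:
        raise ValueError("baseline_per must be positive")
    return 100.0 * (baseline_per - new_per) / baseline_per


# Hypothetical PER values (in percent), for illustration only;
# these are NOT results taken from the paper.
baseline = 20.0   # assumed PER of a baseline system
improved = 17.82  # assumed PER of an improved system

print(f"relative PER reduction: {relative_reduction(baseline, improved):.1f}%")
```

Under these assumed values, an absolute drop of 2.18 points corresponds to a relative reduction of roughly 10.9%, which shows why a relative figure can look much larger than the absolute change in PER.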

Keywords: bilingual speech recognition, two-pass phone clustering, confusion matrix, nonnative adaptation, model retraining


Received September 26, 2008; revised June 2 & July 16 & September 8, 2009; accepted January 5, 2010.
Communicated by Suh-Yin Lee.
* This work was partially supported by the National Science and Technology Pillar Program (2008BAI50B03) and the National Natural Science Foundation of China (Nos. 10925419, 90920302, 10874203, 60875014), and was presented at ICASSP (International Conference on Acoustics, Speech, and Signal Processing), March 30-April 4, 2008, Las Vegas, Nevada, U.S.A.