Institute of Information Science, Academia Sinica
Topic: The Concept of Spoken Language Processing and Problems of Automatic Speech Recognition
Speaker: Prof. Hiroya Fujisaki (The University of Tokyo)
Date: 2012-06-26 (Tue) 10:30 – 12:00
Location: Auditorium 106, new IIS Building
Host: Hsin-Min Wang


This talk consists of two parts.

In the first part, I will talk about the historical background, definition, problems and prospects of the field of Spoken Language Processing (SLP). The concept of SLP was originally introduced by the present author in 1986 in a Japanese National Project with the title “Advanced Human-Machine Interface through Spoken Language”. Prior to that time, it was generally accepted that tasks such as ‘text-to-speech synthesis’ and ‘automatic speech recognition’ could be accomplished by combining technologies of ‘speech signal processing’ and ‘natural language processing’ (NLP, which actually deals only with written language). It is the author’s belief that speech is not merely an acoustic signal, but is also a form of language (hence the term ‘Spoken Language’) that contains linguistic information often missing in the written language, so that we need a new concept to deal with both aspects of speech – as a signal and as a linguistic code system. The term was almost immediately adopted by the DARPA community, which re-defined one of its research targets to be ‘Spoken Language Systems’. It was then quickly accepted worldwide: many projects, laboratories, and institutes were founded or re-named with Spoken Language or SLP in their titles, and a large number of papers, tutorials and books were published. In spite of this worldwide acceptance, however, the author’s original idea seems not to be well understood, and most people still rely on the combination of speech signal processing and NLP in trying to solve their problems. Here I will try to clarify the concept, and give my personal view on the future of the field.

In the second part, I will discuss the problems that are inherent in the prevailing methods of automatic speech recognition (ASR). Since the mid-1970s, technologies for ASR are said to have made great progress through the introduction of the so-called statistical methods based on a communication channel model (F. Jelinek). It is, however, also true that the performance of ASR still falls well short of human performance in similar tasks, so that the problems of ASR are not yet solved. I will take a critical look at the widely accepted model, indicate its shortcomings that are generally overlooked, and propose a new approach. In the first place, let us look into the language model. At first sight, it looks like a sound formulation of the process of producing linguistic messages. However, the only usable model is the word trigram model, which is a very primitive and insufficient way of modeling the actual process of message generation. Since it is almost impossible to obtain higher-level statistics beyond word trigrams, this sets a limit on the performance of the conventional statistical approach.

Let us look more critically still at the task of the so-called ASR. In reality, it is the task of converting a message (or a series of messages) of a spoken language into a message (or a series of messages) of a written language, namely deriving an orthographic representation of the spoken message(s). If we examine the process by which a human being performs the same task, it is apparent that one can transcribe speech only if one understands it. Even a well-trained phonetician cannot write down a spoken message in orthographic (not phonetic) notation if he/she does not know the language and hence does not understand the message. Thus it is evident that speech understanding must precede speech transcription (i.e., speech recognition) by humans, and it is also true that automatic speech understanding (ASU) is the prerequisite to ASR. This is contrary to what is believed by most experts in ASR. I will explain my idea of how this could be accomplished, however paradoxical it may seem.
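
For readers unfamiliar with the communication channel model and the word trigram model mentioned above, the usual textbook formulation is sketched below. The notation is an assumption made for illustration and is not taken from the talk: A stands for the observed acoustic signal, W = w_1 ... w_n for a candidate word sequence, P(A | W) for the acoustic model, and P(W) for the language model.

```latex
% Channel-model decoding: choose the word sequence that is most probable
% given the acoustics. P(A) does not depend on W, so it drops out of the argmax.
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)

% Language model: the exact chain rule, followed by the trigram approximation,
% which conditions each word on only the two words preceding it.
P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
     \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})
```

In this form the limitation discussed above is visible directly: any dependency that reaches back further than two words is simply absent from the approximated P(W).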