Institute of Information Science
 
Language and Knowledge Processing
Principal Investigators:
Keh-Jiann Chen (Chair), Fu Chang, Lee-Feng Chien, Chun-Nan Hsu, Wen-Lian Hsu, Hsin-Min Wang, Der-Ming Juang

We focus on problems in knowledge-based information processing. This research is strongly motivated by the flood of information on the WWW, for which effective and autonomous processing tools are still lacking. To achieve high-level intelligent information processing, many challenging research problems in knowledge acquisition, knowledge representation, and knowledge utilization have to be addressed.

(1) Knowledge Acquisition
For the task of acquiring linguistic and common sense knowledge, we focus on strategies and methodologies for automating the knowledge acquisition process. We expect that in the future knowledge bases will be enhanced automatically, using established and yet-to-be-developed processing technologies to extract new knowledge from the WWW and from text sources such as XML documents and tagged corpora.
  • Construction of ontology, linguistic and common sense knowledge bases
  • Machine learning and data mining
    In our basic research on machine learning, we focus on the theory and applications of learning from sparse data, a key issue in emerging applications of machine learning and data mining. In particular, we study how to accelerate the EM algorithm. We have developed a triple jump extrapolation approach to accelerating EM and other bound optimization methods for learning from sparse data. We show that the more slowly EM converges, the longer the triple jump becomes, so the approach yields a higher speedup on exactly those data sets where EM converges slowly. Experimental results show that the triple jump approach significantly outperforms EM and other EM acceleration methods for a variety of probabilistic models.
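The link between slow convergence and a longer jump can be illustrated with a generic Aitken-style extrapolation of a fixed-point iteration. This is only a sketch under assumptions: the text above does not give the triple jump formula, so the function `extrapolate`, the rate estimate `gamma`, and the toy update `toy_update` are hypothetical stand-ins rather than the published method.

```python
import math


def extrapolate(theta0, update):
    """Aitken-style extrapolation from three successive iterates.

    `update` plays the role of one EM step (theta -> next theta).
    This is a generic illustration, not the published triple jump method.
    """
    theta1 = update(theta0)
    theta2 = update(theta1)
    step1 = [b - a for a, b in zip(theta0, theta1)]
    step2 = [b - a for a, b in zip(theta1, theta2)]
    norm1 = math.sqrt(sum(s * s for s in step1))
    norm2 = math.sqrt(sum(s * s for s in step2))
    if norm1 == 0.0 or norm2 >= norm1:
        # Converged already, or not contracting: keep the plain iterate.
        return theta2
    gamma = norm2 / norm1        # estimated linear convergence rate
    jump = 1.0 / (1.0 - gamma)   # grows as gamma approaches 1 (slow convergence)
    return [t1 + jump * s for t1, s in zip(theta1, step2)]


def toy_update(theta):
    """Hypothetical linear update with rate 0.9 and fixed point 10."""
    return [0.9 * theta[0] + 1.0]


print(extrapolate([0.0], toy_update))
```

For this linear toy update, a single extrapolation lands close to the fixed point 10, whereas plain iteration would need many steps. With a real EM step one would also check that the likelihood has not decreased before accepting the extrapolated parameters.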

(2) Knowledge Utilization
We have designed GOING, a Chinese input system that automatically translates a phonetic (or Pinyin) sequence into characters with a hit ratio close to 96% and is widely used in Taiwan. It received the Distinguished Chinese Information Product Award (中文傑出資訊產品獎) in 1993. GOING has been downloaded about one million times from the PC Home software download area, where it is one of two domestically developed programs among the top 20 downloads. Our knowledge representation kernel, InfoMap, has been applied to a wide variety of application systems in natural language processing, biological knowledge bases, automation of pipeline experiments, and e-learning. Our model for concept understanding can utilize heterogeneous knowledge representation systems and has been successfully applied to an educational tutoring system that automatically detects errors in primary school (grade 3) mathematics problems. In the future, we will design a knowledge-based parsing system, which is the key technology for language understanding and also a major building block of our learning system. The parsing system will use our Knowledge Web to understand Chinese. We will also develop basic technologies for processing spoken language and supporting various applications. The major research topics include knowledge-based language processing; information extraction and retrieval from text, audio, and video; intelligent search; cross-language information retrieval; computer processing of Taiwanese; and intelligent tutoring.
  • Knowledge-based (Chinese language) processing
    We will focus on conceptual processing of Chinese documents. Our knowledge-based language processing systems will use the statistical, linguistic, and common sense knowledge provided by our evolving Knowledge Web to parse the conceptual structures of sentences and interpret their meanings. These systems and the knowledge bases together form a learning system: the language processing systems gain processing power as the knowledge bases are enhanced, and the knowledge bases in turn evolve through the automatic knowledge extraction performed by the language processing systems.
  • User-oriented intelligent search
    This research exploits intelligent techniques to learn more about what users search for, and uses that information to develop high-performance retrieval techniques. The research goals include: (1) developing techniques to organize users' query vocabularies into a well-formed topic hierarchy and (2) developing user-oriented information retrieval techniques that give more accurate retrieval results. Our previous research has established an initial base for the study, including approaches for automatic generation of query taxonomies [ACM TOIS'04], query log mining [JASIS'02], and Web-based text classifiers [WWW'04].
  • Cross-language information retrieval using Web mining
    This research explores cross-language Web search techniques that mine Web resources dynamically instead of relying on dictionary lookup. The research goals include: (1) developing query translation techniques that cope with proper names, whose translations are not included in common dictionaries, (2) constructing a Livetrans query translation server with learning ability to translate unknown cross-lingual queries for connected content clients, and (3) building a high-performance system that provides real English-Chinese cross-language Web search services. In our previous research we developed several query translation approaches based on Web mining techniques, including anchor text mining [ACM TOIS'04] and search result mining [ACM SIGIR'04].
  • Audio (speech / music / song) processing & retrieval
    We aim to develop methods for analyzing, extracting, recognizing, indexing, and retrieving information from the audio of multimedia content. Our research has focused on speech recognition and speech information retrieval for many years. Text-to-speech synthesis and speaker identification/verification are also ongoing research topics. We have proposed several new ideas for improving recognition accuracy and retrieval performance, and have successfully implemented several prototype systems, such as a TV news retrieval system and a Mandarin text-to-speech system. We have published papers in IEEE TSAP, Speech Communication, ACM TALIP, etc. More recently, we have extended our studies to music/song processing and retrieval, focusing mainly on query by singing/humming, melody extraction, solo vocal modeling, etc. We have successfully implemented several prototype systems, such as a music retrieval system and a singer identification system, and published papers in IEEE TASLP, Computer Music Journal, JCDL'05, etc.
  • Visual and auditory textual information retrieval & skimming
    The goal of this project is to develop methods for analyzing, recognizing, and retrieving information from symbolic or natural objects in multimedia content. We have developed methods for enhancing the binarization quality of document images with non-uniform luminance, methods for segmenting textual and non-textual objects in document images using learning-induced rules, and various decomposition methods that accelerate vector matching for large-scale pattern recognition tasks. Ongoing work aims to lay down the theoretical foundation for many of these methods and to extend their applications to other fields such as data mining, knowledge discovery, and biological computing.
  • Chinese question answering system
    Current search engines are based on keyword extraction and only return ranked documents. The user needs to manually filter out irrelevant documents and sentences. In a natural language question answering system, the user can ask a question in an ordinary fashion, such as "Who is the President of the United States?" Once the system understands the question, it can answer concisely, "The President of the United States is Bush." Such a system would greatly enhance search efficiency. Chinese question answering (QA) is a very challenging research topic. Our lab integrates several Chinese NLP techniques, such as question type classification, passage retrieval, named entity recognition, and answer ranking, to construct a Chinese QA system. Our system won first place in the CLQA contest of NTCIR 2005, held in Tokyo, with an accuracy of 44.5%. We have also applied this technology to construct a ChatBot and a TutorBot in the MSN instant messaging system.
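The four QA stages named in the last item (question type classification, passage retrieval, named entity recognition, and answer ranking) can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the function names, the keyword rules, the tiny gazetteer, and the English sample passages are hypothetical and far simpler than a real Chinese QA system.

```python
import re
from collections import Counter

# Hypothetical mapping from question word to expected answer type.
QUESTION_TYPES = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}


def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))


def classify_question(question):
    """Question type classification by the leading question word."""
    words = question.lower().split()
    return QUESTION_TYPES.get(words[0], "OTHER") if words else "OTHER"


def retrieve_passages(question, passages, top_k=2):
    """Passage retrieval by word overlap with the question."""
    q_terms = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_terms & tokenize(p)), reverse=True)
    return ranked[:top_k]


def extract_candidates(passage, answer_type, gazetteer):
    """Named entity recognition reduced to a gazetteer lookup."""
    return [name for name, kind in gazetteer.items()
            if kind == answer_type and name in passage]


def answer(question, passages, gazetteer):
    """Answer ranking by how often a candidate occurs in the top passages."""
    answer_type = classify_question(question)
    votes = Counter()
    for passage in retrieve_passages(question, passages):
        votes.update(extract_candidates(passage, answer_type, gazetteer))
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical sample data.
PASSAGES = [
    "Alice Chen is the director of the institute.",
    "The institute is located in Taipei.",
    "Alice Chen founded the institute in 1982.",
]
GAZETTEER = {"Alice Chen": "PERSON", "Taipei": "LOCATION"}

print(answer("Who is the director of the institute?", PASSAGES, GAZETTEER))
print(answer("Where is the institute located?", PASSAGES, GAZETTEER))
```

Each stage is a separate function so that any one of them could be replaced by a statistical or knowledge-based component without touching the others.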
(3) Knowledge Representation
For the task of knowledge representation, we study the logical foundation of ontology as well as fine-grained semantic representation. We study near-synonyms to identify the fine-grained differences between synonyms. This work gives us a better understanding of meaning representation and meaning composition. We will remodel the current ontology structures of WordNet, HowNet, and FrameNet to achieve a better and more unified representation. We will also study modal logics, integrate modal logic systems into a unified framework, and develop automated inference and theorem proving methods based on that framework.
  • Knowledge web
    Over the last five years, we have developed an ontology called InfoMap, which is designed to unify linguistic knowledge, common sense knowledge, and knowledge from various domains. InfoMap can be used to perform complex and fuzzy structural matching. It has been successfully used to construct a question answering system, an intelligent tutoring agent system, and an English grammar checker. In the future, we will remodel the current ontology structures of InfoMap, WordNet, HowNet, and FrameNet to form a new Knowledge Web, called E-HowNet, with the aim of achieving a better and unified representation that supports knowledge-based extraction, search, parsing, and logical inference under a single framework.
  • Resolving the encoding problem of Chinese characters
    This research applies knowledge from philology to resolve the encoding problem of Chinese characters. Missing-character problems persist because existing Chinese character coding systems assume that the set of Chinese characters is closed and finite, like an alphabet, and ignore the fact that each Chinese character is composed of a limited set of basic meaningful components. In our proposed encoding system, the glyphs and operators for Chinese characters are coded, and the problem of missing characters is resolved by encoding the glyph structures of those characters.
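The idea of encoding a missing character by its glyph structure can be sketched with a small tree of components and composition operators. As an assumption, the sketch borrows two Unicode Ideographic Description Characters (⿰ left-right, ⿱ top-bottom) as the operators; the project's own glyph and operator codes are not given above, and the names `Compose`, `encode`, and `decode` are hypothetical.

```python
from dataclasses import dataclass
from typing import Union

# Assumed operators: Unicode Ideographic Description Characters.
LEFT_RIGHT = "\u2ff0"  # ⿰
TOP_BOTTOM = "\u2ff1"  # ⿱
OPERATORS = (LEFT_RIGHT, TOP_BOTTOM)


@dataclass(frozen=True)
class Compose:
    """A glyph built from two sub-glyphs by a binary layout operator."""
    operator: str
    first: "Glyph"
    second: "Glyph"


Glyph = Union[str, Compose]  # a basic component or a composition


def encode(glyph: Glyph) -> str:
    """Serialize a glyph structure in prefix order."""
    if isinstance(glyph, str):
        return glyph
    return glyph.operator + encode(glyph.first) + encode(glyph.second)


def decode(sequence: str) -> Glyph:
    """Parse a prefix-order sequence back into a glyph structure."""
    def parse(i):
        ch = sequence[i]
        if ch in OPERATORS:
            first, i = parse(i + 1)
            second, i = parse(i)
            return Compose(ch, first, second), i
        return ch, i + 1

    glyph, end = parse(0)
    if end != len(sequence):
        raise ValueError("trailing characters after glyph structure")
    return glyph


# 萌 described as 艹 above (日 beside 月), without needing its own code point.
structure = Compose(TOP_BOTTOM, "艹", Compose(LEFT_RIGHT, "日", "月"))
print(encode(structure))
```

Because the structure is built only from coded components and operators, a character that has no code point of its own can still be written down, compared, and exchanged.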
(4) Long-term Plan
We shall construct an ontological knowledge base that can use the knowledge in InfoMap, WordNet, HowNet, and FrameNet, together with other statistical knowledge, in a coherent fashion to perform fine-grained semantic inference. Knowledge-based parsing and tagging tools will be developed to enable semi-automatic construction of ontologies in various knowledge domains. We shall also develop more advanced approaches to organizing information in the user's space and to understanding natural language content from the user's perspective.

Academia Sinica Institute of Information Science