
Language and Knowledge Processing
Principal investigators: Keh-Jiann Chen (Chair), Fu Chang, Lee-Feng Chien, Chun-Nan Hsu, Wen-Lian Hsu, Der-Ming Juang, Hsin-Min Wang.

We focus on problems concerning knowledge-based information processing. This area of research is strongly motivated by the flood of information on the WWW, for which effective and autonomous information processing tools are still lacking. In order to achieve high-level intelligent information processing, many of the most challenging research problems in the areas of knowledge acquisition, knowledge representation, and knowledge utilization have to be addressed.

1. Knowledge Acquisition

For the task of acquiring linguistic and common sense knowledge, we will focus on strategies and methodologies for automating knowledge acquisition processes. We expect that in the future, enhancement of knowledge bases will be carried out automatically by using established and yet-to-be-developed processing technologies to extract new knowledge from the WWW and from various text sources such as XML documents and tagged corpora.

a) Construction of ontology, linguistic and common sense knowledge bases

The construction of ontology and common sense knowledge bases is very time consuming. Over the past twenty-some years, we have developed an infrastructure for Chinese language processing which includes part-of-speech tagged corpora, tree-banks, a Chinese lexical database, Chinese grammars, InfoMap, word identification systems, sentence parsers, etc. We have also developed an auto-map system that can help construct ontology semi-automatically. In the future we plan to utilize the developed infrastructure to extract linguistic and domain knowledge from various corpora and texts on the Web and to enhance the current knowledge bases. The target knowledge bases include general domain ontology, special domain ontology, and lexical, syntactic, and semantic knowledge bases. The various knowledge bases will be inter-connected to form a Knowledge Web, which will be utilized for language processing and logical inference.

b) Machine learning and data mining

In our basic research on machine learning, we have been focusing on the theories and applications of learning from sparse data, which is a key issue in the emerging applications of machine learning and data mining. In particular, we study how to accelerate the EM algorithm. We have developed a triple jump extrapolation approach to accelerating the EM algorithm and other bound optimization methods for learning from sparse data. We show that when the convergence of EM becomes slower, the distance of the triple jump becomes longer, and thus it produces a higher speedup for data sets where EM converges slowly. Experimental results show that the triple jump approach significantly outperforms EM and other acceleration methods for EM for a variety of probabilistic models.

2. Knowledge Utilization

We have designed a Chinese input system, GOING, which automatically translates a phonetic (or Pinyin) sequence into characters with a hit ratio close to 96% and is widely used in Taiwan. It received the Distinguished Chinese Information Product Award in 1993. In the PC Home software download area, GOING has been downloaded about one million times. It is one of only two domestically developed programs among the top 20 downloads. Our knowledge representation kernel, InfoMap, has been applied to a wide variety of application systems in natural language processing, biological knowledge bases, automation of pipeline experiments, and e-learning. Our model for concept understanding can utilize heterogeneous knowledge representation systems, and it has been successfully applied to an educational tutoring system that can automatically detect errors in primary school (grade 3) mathematics problems.

In the future, we will design a knowledge-based parsing system, which is the key technology for language understanding and also acts as a major building block of our learning system. The parsing system will utilize our Knowledge Web to understand the Chinese language. We will also develop basic technologies for processing spoken languages and support various applications. The major research topics include: knowledge-based language processing, information extraction and retrieval from text, audio, and video, intelligent search, cross-language information retrieval, computer processing for Taiwanese, and intelligent tutoring, etc.

a) Knowledge-based (Chinese language) processing

We will focus our attention on conceptual processing of Chinese documents. The design of knowledge-based language processing systems will utilize the statistical, linguistic, and common sense knowledge provided by our evolving Knowledge Web to parse the conceptual structures of sentences and to interpret their meanings. The knowledge-based language processing systems incorporate the knowledge bases to form a learning system, such that the language processing systems increase their processing power as the knowledge bases are enhanced. In turn, the knowledge bases evolve through the automatic knowledge extraction performed by the language processing systems.

b) User-oriented intelligent search

The research focuses on the exploitation of intelligent techniques to learn more about what users search for, and to make use of users’ information in the development of high-performance retrieval techniques. The research goals include: (1) developing techniques to organize users’ query vocabularies into a well-formed topic hierarchy and (2) developing user-oriented information retrieval techniques to provide more accurate retrieval performance. Our previous research has established an initial base for the study, including approaches developed for auto-generation of query taxonomy [ACM TOIS’04], query log mining [JASIS’02], and Web-based text classifiers [WWW’04].

c) Cross-language information retrieval using Web mining

The research focuses on the exploitation of cross-language Web search techniques through dynamic mining of Web resources, without relying on dictionary lookup. The research goals include: (1) developing query translation techniques to cope with the difficulties of proper name translations, which are not included in common dictionaries, (2) constructing a Livetrans query translation server with learning ability to process the translation of unknown cross-lingual queries for connected content clients, and (3) building up a high-performance system to provide real English-Chinese cross-language Web search services. In our previous research we have developed several query translation approaches using Web mining techniques, including anchor text mining [ACM TOIS’04] and search result mining [ACM SIGIR’04].

d) Audio (speech / music / song) processing & retrieval

We aim to develop methods for analyzing, extracting, recognizing, indexing, and retrieving information from the audio of multimedia content. Our research has been focused on speech recognition and speech information retrieval for many years. Text-to-speech synthesis and speaker identification/verification are ongoing research topics, too. We have proposed several new ideas for improving recognition accuracy and retrieval performance and successfully implemented several prototype systems such as a TV news retrieval system and a Mandarin text-to-speech system. We have published papers in IEEE TSAP, Speech Communication, ACM TALIP, etc. More recently, we have also extended our studies to music/song processing and retrieval. Our research has been focused mainly on query by singing/humming, melody extraction, solo vocal modeling, etc. We have successfully implemented several prototype systems such as a music retrieval system and a singer identification system and published papers in IEEE TASLP, Computer Music Journal, JCDL’05, etc.

e) Visual and auditory textual information retrieval & skimming

The goal of this project is to develop methods for analyzing, recognizing and retrieving information from the symbolic or natural objects in multimedia content. We have developed methods for enhancing the binary quality of document images with non-uniform luminance, methods for segmenting textual and non-textual objects from document images using learning-induced rules, and various decomposition methods to accelerate vector matching for large-scale pattern recognition tasks. The ongoing work is to further lay down the theoretical foundation for many of these methods and also to extend their applications to other fields such as data mining, knowledge discovery, biological computing, etc.

f) Chinese question answering system

Current search engines are based on keyword extraction and only return ranked documents. The user needs to manually filter out irrelevant documents and sentences. In a natural language question answering system, the user can ask a question in an ordinary fashion, such as "Who is the President of the United States?" Once the system understands the question, it could answer concisely, "The President of the United States is Bush." Such a system would greatly enhance search efficiency. Chinese question answering (QA) is a very challenging research topic. Our lab integrates several Chinese NLP techniques, such as question type classification, passage retrieval, named entity recognition and answer ranking, to construct a Chinese QA system. Our system won first place in the CLQA contest of NTCIR 2005, held in Tokyo, with an accuracy of 44.5%. We have also applied this technology to construct a ChatBot and a TutorBot in the MSN instant messaging system.

3. Knowledge Representation

For the task of knowledge representation we study the logical foundation of ontology as well as fine-grained semantic representation. We study near-synonyms to identify the fine-grained differences between them. This work gives us a better understanding of meaning representation and meaning composition. We will remodel the current ontology structures of WordNet, HowNet, and FrameNet to achieve a better and more unified representation. We will study modal logics, integrate modal logic systems into a unified framework, and develop automated inference and theorem proving methods based on the logical framework.

a) Knowledge web

In the last five years, we have developed an ontology called InfoMap, which is designed to unify linguistic knowledge, common sense knowledge and knowledge from various domains. InfoMap can be used to perform complex and fuzzy structural matching. It has been successfully used to construct a question answering system, an intelligent tutoring agent system and an English grammar checker. In the future, we will remodel the current ontology structures of InfoMap, WordNet, HowNet, and FrameNet to form a new Knowledge Web, called E-HowNet, in the hope of achieving a better and unified representation for performing knowledge-based extraction, search, parsing and logical inference under a unified framework.

b) Resolving the encoding problem of Chinese characters

The research focuses on exploiting knowledge of philology to resolve the encoding problem of Chinese characters. Missing-character problems persist because the existing Chinese character coding systems assume that the set of Chinese characters is closed and finite, just like an alphabet, and ignore the fact that each Chinese character is composed of a limited set of basic meaningful components. In our proposed encoding system, the set of glyphs and operators for Chinese characters is coded, and the problem of missing characters is resolved by encoding the glyph structures of the missing characters.

4. Long-term Plan

We shall construct an ontological knowledge base capable of utilizing the knowledge in InfoMap, WordNet, HowNet and FrameNet, together with other statistical knowledge, in a coherent fashion to perform fine-grained semantic inferences. Knowledge-based parsing and tagging tools will be developed to enable semi-automatic construction of ontology in various knowledge domains. We shall develop more advanced approaches to organize information in the users’ space and to understand the content of natural language from the users’ perspective.

