
Research Groups

Language and Knowledge Processing

Principal investigators: Keh-Jiann Chen (Chair), Fu Chang, Lee-Feng Chien, Chun-Nan Hsu, Wen-Lian Hsu, Der-Ming Juang, Hsin-Min Wang.

We focus on problems concerning knowledge-based information processing. This area of research is strongly motivated by the flood of information on the WWW, for which effective and autonomous information processing tools are still lacking. To achieve high-level intelligent information processing, many of the most challenging research problems in the areas of knowledge acquisition, knowledge representation, and knowledge utilization have to be addressed.

1. Knowledge Acquisition

For the task of acquiring linguistic and common sense knowledge, we will focus on strategies and methodologies for automating knowledge acquisition processes. We expect that in the future, knowledge bases will be enhanced automatically, using established and yet-to-be-developed processing technologies to extract new knowledge from the WWW and from various text sources such as XML documents and tagged corpora.

a) Construction of ontology, linguistic and common sense knowledge bases

The construction of ontology and common sense knowledge bases is very time consuming. Over the past twenty-some years, we have developed an infrastructure for Chinese language processing which includes part-of-speech tagged corpora, treebanks, a Chinese lexical database, Chinese grammars, InfoMap, word identification systems, sentence parsers, etc. We have also developed an auto-map system that helps construct ontologies semi-automatically. In the future we plan to use this infrastructure to extract linguistic and domain knowledge from various corpora and texts on the web and to enhance the current knowledge bases. The target knowledge bases include general domain ontology, special domain ontology, and lexical, syntactic, and semantic knowledge bases. The various knowledge bases will be inter-connected to form a Knowledge Web, which will be utilized for language processing and logical inference.

b) Machine learning and data mining

In our basic research on machine learning, we have been focusing on the theories and applications of learning from sparse data, which is a key issue in the emerging applications of machine learning and data mining. In particular, we study how to accelerate the EM algorithm. We have developed a triple jump extrapolation approach to accelerating the EM algorithm and other bound optimization methods for learning from sparse data. We show that as the convergence of EM becomes slower, the distance of the triple jump becomes longer, which produces a higher speedup on data sets where EM converges slowly. Experimental results show that the triple jump approach significantly outperforms EM and other acceleration methods for EM for a variety of probabilistic models.

2. Knowledge Utilization

We have designed a Chinese input system, GOING, which automatically translates a phonetic (or Pinyin) sequence into characters with a hit ratio close to 96% and is widely used in Taiwan. It received the Distinguished Chinese Information Product Award in 1993. In the PC Home software download area, GOING has been downloaded about one million times. It is one of only two domestically developed programs among the top 20 downloads. Our knowledge representation kernel, InfoMap, has been applied to a wide variety of application systems in natural language processing, biological knowledge bases, automation of pipeline experiments, and e-learning. Our model for concept understanding can utilize heterogeneous knowledge representation systems. It has been successfully applied to an educational tutoring system that can automatically detect errors in primary school (grade 3) mathematics problems.

In the future, we will design a knowledge-based parsing system, which is the key technology for language understanding and also acts as a major building block of our learning system. The parsing system will utilize our Knowledge Web to understand the Chinese language. We will also develop basic technologies for processing spoken languages and supporting various applications. The major research topics include knowledge-based language processing; information extraction and retrieval from text, audio, and video; intelligent search; cross-language information retrieval; computer processing for Taiwanese; and intelligent tutoring.

a) Knowledge-based (Chinese language) processing

We will focus our attention on conceptual processing of Chinese documents. The design of knowledge-based language processing systems will utilize the statistical, linguistic, and common sense knowledge provided by our evolving Knowledge Web to parse the conceptual structures of sentences and to interpret their meanings. The knowledge-based language processing systems incorporate the knowledge bases to form a learning system, so that the processing power of the language processing systems increases as the knowledge bases are enhanced. Conversely, the knowledge bases evolve through the automatic knowledge extraction performed by the language processing systems.

b) User-oriented intelligent search

The research focuses on the exploitation of intelligent techniques to learn more about what users search for, and to make use of users’ information in the development of high-performance retrieval techniques. The research goals include: (1) developing techniques to organize users’ query vocabularies into a well-formed topic hierarchy and (2) developing user-oriented information retrieval techniques to provide more accurate retrieval performance. Our previous research has established an initial base for the study, including approaches developed for auto-generation of query taxonomy [ACM TOIS’04], query log mining [JASIS’02], and Web-based text classifiers [WWW’04].

c) Cross-language information retrieval using Web mining

The research focuses on the exploitation of cross-language Web search techniques through dynamic mining of Web resources, without relying on dictionary lookup. The research goals include: (1) developing query translation techniques to cope with the difficulties of translating proper names, which are not included in common dictionaries, (2) constructing a Livetrans query translation server with the learning ability to translate unknown cross-lingual queries for connected content clients, and (3) building a high-performance system to provide real English-Chinese cross-language Web search services. In our previous research we have developed several query translation approaches using Web mining techniques, including anchor text mining [ACM TOIS’04] and search result mining [ACM SIGIR’04].

d) Audio (speech / music / song) processing & retrieval

We aim to develop methods for analyzing, extracting, recognizing, indexing, and retrieving information from the audio of multimedia contents. Our research has been focused on speech recognition and speech information retrieval for many years. Text-to-speech synthesis and speaker identification/verification are ongoing research topics, too. We have proposed several new ideas for improving recognition accuracy and retrieval performance and successfully implemented several prototype systems, such as a TV news retrieval system and a Mandarin text-to-speech system. We have published papers in IEEE TSAP, Speech Communication, ACM TALIP, etc. More recently, we have also extended our studies to music/song processing and retrieval. Our research has been focused mainly on query by singing/humming, melody extraction, solo vocal modeling, etc. We have successfully implemented several prototype systems, such as a music retrieval system and a singer identification system, and published papers in IEEE TASLP, Computer Music Journal, JCDL’05, etc.

e) Visual and auditory textual information retrieval & skimming

The goal of this project is to develop methods for analyzing, recognizing, and retrieving information from the symbolic or natural objects in multimedia contents. We have developed methods for enhancing the binary quality of document images with non-uniform luminance, methods for segmenting textual and non-textual objects from document images using learning-induced rules, and various decomposition methods to accelerate vector matching for large-scale pattern recognition tasks. The ongoing work is to further lay down the theoretical foundation for many of these methods and to extend their applications to other fields such as data mining, knowledge discovery, and biological computing.

f) Chinese question answering system

Current search engines are based on keyword extraction and only return ranked documents. The user needs to manually filter out irrelevant documents and sentences. In a natural language question answering system, the user can ask a question in an ordinary fashion, such as "Who is the President of the United States?" Once the system understands the question, it could answer concisely, "The President of the United States is Bush." Such a system would greatly enhance search efficiency. Chinese question answering (QA) is a very challenging research topic. Our lab integrates several Chinese NLP techniques, such as question type classification, passage retrieval, named entity recognition, and answer ranking, to construct a Chinese QA system. Our system won first place in the CLQA contest of NTCIR 2005 held in Tokyo, with an accuracy of 44.5%. We have also applied this technology to construct a ChatBot and a TutorBot in the MSN instant messaging system.

3. Knowledge Representation

For the task of knowledge representation we study the logical foundation of ontology as well as fine-grain semantic representation. We study near-synonyms to identify the fine-grain differences between synonyms. These studies give us a better understanding of meaning representation and meaning composition. We will remodel the current ontology structures of WordNet, HowNet, and FrameNet to achieve a better and more unified representation. We will study modal logics, integrate modal logic systems into a unified framework, and develop automated inference and theorem proving methods based on that logical framework.

a) Knowledge web

In the last five years, we have developed an ontology called InfoMap, which is designed to unify linguistic knowledge, common sense knowledge, and knowledge of various domains. InfoMap can be used to perform complex and fuzzy structural matching. It has been successfully adopted to construct a question answering system, an intelligent tutoring agent system, and an English grammar checker. In the future, we will remodel the current ontology structures of InfoMap, WordNet, HowNet, and FrameNet to form a new Knowledge Web, called E-HowNet, in the hope of achieving a better and unified representation for performing knowledge-based extraction, search, parsing, and logical inference under a unified framework.

b) Resolving the encoding problem of Chinese characters

The research focuses on exploiting knowledge of philology to resolve the encoding problem of Chinese characters. Missing character problems persist because the existing Chinese character coding systems assume that the set of Chinese characters is closed and finite, just like an alphabet, and ignore the fact that each Chinese character is composed of a limited set of basic meaningful components. In our proposed encoding system, the set of glyphs and operators for Chinese characters is coded, and the problem of missing characters is resolved by encoding the glyph structures of the missing characters.

4. Long-term Plan

We shall construct an ontological knowledge base capable of utilizing the knowledge in InfoMap, WordNet, HowNet, and FrameNet, together with other statistical knowledge, in a coherent fashion to perform fine-grain semantic inferences. Knowledge-based parsing and tagging tools will be developed to enable semi-automatic construction of ontologies in various knowledge domains. We shall develop more advanced approaches to organize information in users’ space and to understand the content of natural languages from users’ perspectives.
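
As an illustration of the character encoding idea described under Knowledge Representation (coding glyphs and operators, then encoding a character by its glyph structure), the following is a minimal sketch. It is not the encoding system proposed above, whose glyph set and operators are not specified here. It assumes the Unicode Ideographic Description Characters as stand-in operators, and the names Compound, encode, and decode are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Stand-in operators: Unicode Ideographic Description Characters.
# Assumption: the source does not say which operators its system defines.
LEFT_RIGHT = "\u2FF0"  # left-to-right composition
TOP_BOTTOM = "\u2FF1"  # top-to-bottom composition
BINARY_OPERATORS = (LEFT_RIGHT, TOP_BOTTOM)


@dataclass(frozen=True)
class Compound:
    """A glyph built from two component glyphs joined by an operator."""
    operator: str
    parts: Tuple["Glyph", ...]


# A glyph is either a basic component (one character) or a compound.
Glyph = Union[str, Compound]


def encode(glyph: Glyph) -> str:
    """Serialize a glyph structure in prefix order: operator, then parts."""
    if isinstance(glyph, str):
        return glyph
    return glyph.operator + "".join(encode(p) for p in glyph.parts)


def _parse(seq: str):
    """Parse one glyph from the front of seq; return (glyph, remainder)."""
    if not seq:
        raise ValueError("unexpected end of glyph sequence")
    head, rest = seq[0], seq[1:]
    if head in BINARY_OPERATORS:
        parts = []
        for _ in range(2):  # both stand-in operators take two components
            part, rest = _parse(rest)
            parts.append(part)
        return Compound(head, tuple(parts)), rest
    return head, rest


def decode(seq: str) -> Glyph:
    """Rebuild the glyph structure from a prefix-order sequence."""
    glyph, rest = _parse(seq)
    if rest:
        raise ValueError("trailing data after glyph structure")
    return glyph


# A character described as top-bottom of (left-right of 木 and 目) and 心.
nested = Compound(TOP_BOTTOM, (Compound(LEFT_RIGHT, ("木", "目")), "心"))
sequence = encode(nested)
```

Because a character is stored as a structure over a small, fixed set of components, a character that has no code point of its own can still be written down and round-tripped through encode and decode, which is the property the missing-character argument relies on.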

