Language and Knowledge Processing
Research Groups

Principal investigators: Keh-Jiann Chen (Chair), Fu Chang, Lee-Feng Chien, Chun-Nan Hsu, Wen-Lian Hsu, Der-Ming Juang, Hsin-Min Wang.

We focus on problems concerning knowledge-based information processing. This area of research is strongly motivated by the flood of information on the WWW, for which effective and autonomous information processing tools are still lacking. To achieve high-level intelligent information processing, many of the most challenging research problems in the areas of knowledge acquisition, knowledge representation, and knowledge utilization have to be addressed.

1. Knowledge Acquisition

For the task of acquiring linguistic and common sense knowledge, we will focus on strategies and methodologies for automating knowledge acquisition processes. We expect that in the future, enhancement of knowledge bases will be carried out automatically by using established and yet-to-be-developed processing technologies to extract new knowledge from the WWW and from various text sources such as XML documents and tagged corpora.

a) Construction of ontology, linguistic and common sense knowledge bases

The construction of ontology and common sense knowledge bases is very time consuming. Over the past twenty-some years, we have developed an infrastructure for Chinese language processing, which includes part-of-speech tagged corpora, treebanks, a Chinese lexical database, Chinese grammars, InfoMap, word identification systems, sentence parsers, etc. We have also developed an auto-map system that can help construct ontologies semi-automatically. In the future, we plan to use this infrastructure to extract linguistic and domain knowledge from various corpora and texts on the Web and to enhance the current knowledge bases. The target knowledge bases include general domain ontology, special domain ontology, and lexical, syntactic, and semantic knowledge bases. The various knowledge bases will be interconnected to form a Knowledge Web, which will be utilized for language processing and logical inference.

b) Machine learning and data mining

Regarding basic research in machine learning, we have been focusing on the theories and applications of learning from sparse data, which is a key issue in the emerging applications of machine learning and data mining. In particular, we study how to accelerate the EM algorithm. We have developed a triple jump extrapolation approach to accelerating the EM algorithm and other bound optimization methods for learning from sparse data. We show that when the convergence of EM becomes slower, the distance of the triple jump becomes longer, which produces a higher speedup for data sets where EM converges slowly. Experimental results show that the triple jump approach significantly outperforms EM and other acceleration methods for EM for a variety of probabilistic models.

2. Knowledge Utilization

We have designed a Chinese input system, GOING, which automatically translates a phonetic (or Pinyin) sequence into characters with a hit ratio close to 96% and is widely used in Taiwan. It received the Distinguished Chinese Information Product Award in 1993. In the PC Home software download area, GOING has been downloaded about one million times. It is one of only two domestically developed software packages among the top 20 downloads. Our knowledge representation kernel, InfoMap, has been applied to a wide variety of application systems in natural language processing, biological knowledge bases, automation of pipeline experiments, and e-learning. Our model for concept understanding can utilize heterogeneous knowledge representation systems, and has been successfully applied to an educational tutoring system that can automatically detect errors in primary school (grade 3) mathematics problems.

In the future, we will design a knowledge-based parsing system, which is the key technology for language understanding and also acts as a major building block of our learning system. The parsing system will utilize our Knowledge Web to understand the Chinese language. We will also develop basic technologies for processing spoken languages and supporting various applications. The major research topics include knowledge-based language processing, information extraction and retrieval from text, audio, and video, intelligent search, cross-language information retrieval, computer processing for Taiwanese, and intelligent tutoring.

a) Knowledge-based (Chinese language) processing

We will focus our attention on conceptual processing of Chinese documents. The design of knowledge-based language processing systems will utilize the statistical, linguistic, and common sense knowledge provided by our evolving Knowledge Web to parse the conceptual structures of sentences and to interpret their meanings. The knowledge-based language processing systems incorporate the knowledge bases to form a learning system, so that the language processing systems increase their processing power as the knowledge bases are enhanced. Conversely, the knowledge bases evolve through the automatic knowledge extraction performed by the language processing systems.

b) User-oriented intelligent search

The research focuses on the exploitation of intelligent techniques to learn more about what users search for, and to make use of users’ information in the development of high-performance retrieval techniques. The research goals include: (1) developing techniques to organize users’ query vocabularies into a well-formed topic hierarchy and (2) developing user-oriented information retrieval techniques to provide more accurate retrieval performance. Our previous research has established an initial base for the study, including approaches developed for auto-generation of query taxonomy [ACM TOIS’04], query log mining [JASIS’02], and Web-based text classifiers [WWW’04].

c) Cross-language information retrieval using Web mining

The research focuses on the exploitation of cross-language Web search techniques through dynamic mining of Web resources, without relying on dictionary lookup. The research goals include: (1) developing query translation techniques to cope with the difficulties of proper name translations, which are not included in common dictionaries, (2) constructing the Livetrans query translation server, with learning ability to process the translation of unknown cross-lingual queries for connected content clients, and (3) building a high-performance system to provide real English-Chinese cross-language Web search services. In our previous research we have developed several query translation approaches using Web mining techniques, including anchor text mining [ACM TOIS’04] and search result mining [ACM SIGIR’04].

d) Audio (speech / music / song) processing & retrieval

We aim to develop methods for analyzing, extracting, recognizing, indexing, and retrieving information from the audio of multimedia contents. Our research has been focused on speech recognition and speech information retrieval for many years. Text-to-speech synthesis and speaker identification/verification are ongoing research topics, too. We have proposed several new ideas for improving recognition accuracy and retrieval performance and successfully implemented several prototype systems, such as a TV news retrieval system and a Mandarin text-to-speech system. We have published papers in IEEE TSAP, Speech Communication, ACM TALIP, etc. More recently, we have also extended our studies to music/song processing and retrieval. Our research has been focused mainly on query by singing/humming, melody extraction, solo vocal modeling, etc. We have successfully implemented several prototype systems, such as a music retrieval system and a singer identification system, and published papers in IEEE TASLP, Computer Music Journal, JCDL’05, etc.

e) Visual and auditory textual information retrieval & skimming

The goal of this project is to develop methods for analyzing, recognizing and retrieving information from the symbolic or natural objects in multimedia contents. We have developed methods for enhancing the binary quality of document images with non-uniform luminance, methods for segmenting textual and non-textual objects from document images using learning-induced rules, and various decomposition methods to accelerate vector matching for large-scale pattern recognition tasks. The ongoing work is to further lay down the theoretical foundation for many of these methods and also to extend their applications to other fields such as data mining, knowledge discovery, and biological computing.

f) Chinese question answering system

Current search engines are based on keyword extraction and only return ranked documents. The user needs to manually filter out irrelevant documents and sentences. In a natural language question answering system, the user can ask a question in an ordinary fashion, such as "Who is the President of the United States?" Once the system understands the question, it can answer concisely, "The President of the United States is Bush." Such a system would greatly enhance search efficiency. Chinese question answering (QA) is a very challenging research topic. Our lab integrates several Chinese NLP techniques, such as question type classification, passage retrieval, named entity recognition and answer ranking, to construct a Chinese QA system. Our system won first place in the CLQA contest of NTCIR 2005, held in Tokyo, with an accuracy of 44.5%. We have also applied this technology to construct a ChatBot and a TutorBot in the MSN instant messaging system.

3. Knowledge Representation

For the task of knowledge representation, we study the logical foundation of ontology as well as fine-grain semantic representation. We study near-synonyms to identify fine-grain differences between synonyms. This work helps us better understand meaning representation and meaning composition. We will remodel the current ontology structures of WordNet, HowNet, and FrameNet to achieve a better and more unified representation. We will study modal logics, integrate modal logic systems into a unified framework, and develop automated inference and theorem proving methods based on the logical framework.

a) Knowledge web

In the last five years, we have developed an ontology called InfoMap, which is designed to unify linguistic knowledge, common sense knowledge and knowledge of various domains. InfoMap can be used to perform complex and fuzzy structural matching. It has been successfully adopted to construct a question answering system, an intelligent tutoring agent system and an English grammar checker. In the future, we will remodel the current ontology structures of InfoMap, WordNet, HowNet, and FrameNet to form a new Knowledge Web, called E-HowNet, in the hope of achieving a better and unified representation for performing knowledge-based extraction, search, parsing and logical inference under a unified framework.

b) Resolving the encoding problem of Chinese characters

The research focuses on exploiting knowledge of philology to resolve the encoding problem of Chinese characters. Missing character problems persist because the existing Chinese character coding systems assume that the set of Chinese characters is closed and finite, just like an alphabet, and ignore the fact that each Chinese character is composed of a limited set of basic meaningful components. In our proposed encoding system, the set of glyphs and operators for Chinese characters is coded, and the problem of missing characters is resolved by encoding the glyph structures of the missing characters.

4. Long-term Plan

We shall construct an ontological knowledge base capable of utilizing the knowledge in InfoMap, WordNet, HowNet and FrameNet, together with other statistical knowledge, in a coherent fashion to perform fine-grain semantic inferences. Knowledge-based parsing and tagging tools will be developed to enable semi-automatic construction of ontologies in various knowledge domains. We shall develop more advanced approaches to organize information in users’ space and to understand the content of natural languages from users’ perspectives.