Language and Knowledge Processing
We focus on problems in knowledge-based information processing.
This research is strongly motivated by the flood of information
on the WWW, for which effective and autonomous information processing tools
are still lacking. To achieve high-level intelligent information
processing, many of the most challenging research problems in knowledge
acquisition, knowledge representation, and knowledge utilization must
be addressed.
(1) Knowledge Acquisition
- Construction of ontology, linguistic and common sense knowledge bases
To acquire linguistic and common sense knowledge, we will focus on
strategies and methodologies for automating the knowledge acquisition
process. We expect that, in the future, knowledge bases will be enhanced
automatically, using established and yet-to-be-developed processing
technologies to extract new knowledge from the WWW and from various text
sources such as XML documents and tagged corpora.
- Machine learning and data mining
In our basic research on machine learning, we have been focusing on the
theory and applications of learning from sparse data, a key issue in
emerging applications of machine learning and data mining. In particular,
we study how to accelerate the EM algorithm. We have developed a triple
jump extrapolation approach to accelerating the EM algorithm and other
bound optimization methods for learning from sparse data. We show that
the more slowly EM converges, the longer the triple jump becomes, so the
approach yields a higher speedup on data sets where EM converges slowly.
Experimental results show that the triple jump approach significantly
outperforms EM and other EM acceleration methods for a variety of
probabilistic models.
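The idea of extrapolating past two EM steps can be illustrated with a minimal sketch. It is not the published triple jump method: it uses a generic Aitken-style extrapolation on a toy problem (estimating only the mixing weight of a two-component Gaussian mixture), and the names `MU`, `SIGMA`, `plain_em`, and `extrapolated_em`, the data set, and the likelihood safeguard are all assumptions made for illustration.

```python
import math
import random

MU = (0.0, 1.0)  # known component means (assumed for this toy example)
SIGMA = 1.0      # known shared standard deviation (assumed)

def _pdf(x, m):
    # Unnormalized Gaussian density; the shared constant cancels in the E-step.
    return math.exp(-0.5 * ((x - m) / SIGMA) ** 2)

def log_lik(pi, xs):
    """Log-likelihood of the mixing weight pi, up to an additive constant."""
    return sum(math.log(pi * _pdf(x, MU[0]) + (1 - pi) * _pdf(x, MU[1]))
               for x in xs)

def em_step(pi, xs):
    """One EM update of pi: average responsibility of component 0."""
    resp = [pi * _pdf(x, MU[0]) /
            (pi * _pdf(x, MU[0]) + (1 - pi) * _pdf(x, MU[1])) for x in xs]
    return sum(resp) / len(xs)

def plain_em(pi, xs, tol=1e-10, max_steps=10000):
    """Iterate EM until the update is smaller than tol; return (pi, steps)."""
    for step in range(1, max_steps + 1):
        new_pi = em_step(pi, xs)
        if abs(new_pi - pi) < tol:
            return new_pi, step
        pi = new_pi
    return pi, max_steps

def extrapolated_em(pi0, xs, tol=1e-10, max_steps=10000):
    """Take two EM steps, then try an Aitken-style jump; return (pi, steps)."""
    steps = 0
    while steps < max_steps:
        pi1 = em_step(pi0, xs)
        pi2 = em_step(pi1, xs)
        steps += 2
        d1, d2 = abs(pi1 - pi0), abs(pi2 - pi1)
        if d2 < tol:
            return pi2, steps
        nxt = pi2
        gamma = d2 / d1  # estimated convergence rate; near 1 when EM is slow
        if gamma < 1.0:
            # The jump length grows as gamma approaches 1 (slow convergence).
            jump = pi1 + (pi2 - pi1) / (1.0 - gamma)
            jump = min(max(jump, 1e-6), 1 - 1e-6)  # keep pi inside (0, 1)
            # Safeguard (an assumption): accept only if likelihood does not drop.
            if log_lik(jump, xs) >= log_lik(pi2, xs):
                nxt = jump
        pi0 = nxt
    return pi0, steps

random.seed(0)
xs = [random.gauss(MU[0], SIGMA) if random.random() < 0.3
      else random.gauss(MU[1], SIGMA) for _ in range(500)]
pi_em, n_em = plain_em(0.5, xs)
pi_tj, n_tj = extrapolated_em(0.5, xs)
print(pi_em, n_em, pi_tj, n_tj)
```

Both routines should reach the same estimate; the returned step counts let the number of EM updates be compared. The factor `1 / (1 - gamma)` makes the jump longer as the estimated convergence rate approaches 1, which mirrors the behavior described above.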
(2) Knowledge Utilization
We have designed a Chinese input system, GOING, which automatically translates a phonetic (or Pinyin) sequence into characters with a hit ratio close to 96% and is widely used in Taiwan. It received the Distinguished Chinese Information Product Award (中文傑出資訊產品獎) in 1993. GOING has been downloaded about one million times from the PC Home software download area, where it is one of only two domestically developed programs among the top 20 downloads.
Our knowledge representation kernel, InfoMap, has been applied to a wide variety of application systems in natural language processing, biological knowledge bases, automation of pipeline experiments, and e-learning. Our model for concept understanding can utilize heterogeneous knowledge representation systems and has been successfully applied to an educational tutoring system that automatically detects errors in primary school (grade 3) mathematics problems.
In the future, we will design a knowledge-based parsing system, which is the key technology for language understanding and a major building block of our learning system. The parsing system will utilize our Knowledge Web to understand the Chinese language. We will also develop basic technologies for processing spoken languages and supporting various applications. The major research topics include knowledge-based language processing; information extraction and retrieval from text, audio, and video; intelligent search; cross-language information retrieval; computer processing of Taiwanese; and intelligent tutoring.
- Knowledge-based (Chinese language) processing
We will focus our attention on the conceptual processing of Chinese
documents. Knowledge-based language processing systems will be designed
to utilize the statistical, linguistic, and common sense knowledge
provided by our evolving Knowledge Web to parse the conceptual structures
of sentences and interpret their meanings. These systems incorporate the
knowledge bases to form a learning system: the language processing
systems gain processing power as the knowledge bases are enhanced, and
the knowledge bases in turn evolve through the automatic knowledge
extraction performed by the language processing systems.
- User-oriented intelligent search
This research exploits intelligent techniques to learn more about what
users search for, and uses this information about users to develop
high-performance retrieval techniques. The research goals include:
(1) developing techniques to organize users' query vocabularies into a
well-formed topic hierarchy and (2) developing user-oriented information
retrieval techniques that provide more accurate retrieval. Our previous
research has established an initial base for this study, including
approaches for automatic generation of query taxonomies [ACM TOIS'04],
query log mining [JASIS'02], and Web-based text classifiers [WWW'04].
- Cross-language information retrieval using Web mining
This research explores cross-language Web search techniques that
dynamically mine Web resources rather than relying on dictionary lookup.
The research goals include: (1) developing query translation techniques
to cope with the difficulty of translating proper names, which are not
included in common dictionaries, (2) constructing the Livetrans query
translation server, which has the learning ability to translate unknown
cross-lingual queries for connected content clients, and (3) building a
high-performance system that provides real English-Chinese cross-language
Web search services. In our previous research, we developed several query
translation approaches using Web mining techniques, including anchor text
mining [ACM TOIS'04] and search result mining [ACM SIGIR'04].
- Audio (speech / music / song) processing & retrieval
We aim to develop methods for analyzing, extracting, recognizing,
indexing, and retrieving information from the audio of multimedia
content. Our research has focused on speech recognition and speech
information retrieval for many years. Text-to-speech synthesis and
speaker identification/verification are also ongoing research topics.
We have proposed several new ideas for improving recognition accuracy
and retrieval performance, and have successfully implemented several
prototype systems, such as a TV news retrieval system and a Mandarin
text-to-speech system. We have published papers in IEEE TSAP, Speech
Communication, ACM TALIP, etc. More recently, we have extended our
studies to music/song processing and retrieval, focusing mainly on query
by singing/humming, melody extraction, solo vocal modeling, etc. We have
successfully implemented several prototype systems, such as a music
retrieval system and a singer identification system, and published
papers in IEEE TASLP, Computer Music Journal, JCDL'05, etc.
- Visual and auditory textual information retrieval & skimming
The goal of this project is to develop methods for analyzing,
recognizing, and retrieving information from symbolic or natural objects
in multimedia content. We have developed methods for enhancing the
binarization quality of document images with non-uniform luminance,
methods for segmenting textual and non-textual objects in document
images using learning-induced rules, and various decomposition methods
to accelerate vector matching for large-scale pattern recognition tasks.
Ongoing work aims to lay down the theoretical foundation for many of
these methods and to extend their applications to other fields such as
data mining, knowledge discovery, and biological computing.
- Chinese question answering system
Current search engines are based on keyword extraction and return only
ranked documents. The user must manually filter out irrelevant documents
and sentences. In a natural language question answering system, the user
can ask a question in an ordinary fashion, such as "Who is the President
of the United States?" Once the system understands the question, it can
answer concisely, "The President of the United States is Bush." Such a
system would greatly enhance search efficiency. Chinese question
answering (QA) is a very challenging research topic. Our lab integrates
several Chinese NLP techniques, such as question type classification,
passage retrieval, named entity recognition, and answer ranking, to
construct a Chinese QA system. Our system won first place in the CLQA
contest of NTCIR 2005, held in Tokyo, with an accuracy of 44.5%. We have
also applied this technology to construct a ChatBot and a TutorBot for
the MSN instant messaging system.
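The four stages named above (question type classification, passage retrieval, named entity recognition, and answer ranking) can be sketched as a toy pipeline. This is only an illustration of how the stages fit together, not the lab's system: it works on English text for readability, every function name is hypothetical, and each stage is replaced by a deliberately crude heuristic (leading wh-word, term overlap, capitalized words, and frequency counts).

```python
import re
from collections import Counter

STOPWORDS = {"who", "what", "where", "when", "is", "the", "of", "a", "an", "in"}

def classify_question(question):
    """Toy question type classifier based on the leading wh-word."""
    first = question.strip().split()[0].lower()
    return {"who": "PERSON", "where": "LOCATION", "when": "DATE"}.get(first, "OTHER")

def tokens(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [t for t in re.findall(r"[A-Za-z]+", text.lower()) if t not in STOPWORDS]

def retrieve_passages(question, passages, k=2):
    """Return the k passages sharing the most terms with the question."""
    q_terms = set(tokens(question))
    ranked = sorted(passages,
                    key=lambda p: len(q_terms & set(tokens(p))),
                    reverse=True)
    return ranked[:k]

def extract_candidates(passage, question):
    """Crude stand-in for NER: capitalized words not present in the question."""
    q_words = set(re.findall(r"[A-Za-z]+", question))
    return [w for w in re.findall(r"\b[A-Z][a-z]+\b", passage) if w not in q_words]

def answer(question, passages):
    """Run all stages and rank candidates by how often they occur."""
    qtype = classify_question(question)
    counts = Counter()
    for passage in retrieve_passages(question, passages):
        counts.update(extract_candidates(passage, question))
    best = counts.most_common(1)
    return qtype, (best[0][0] if best else None)

passages = [
    "Hamlet was written by Shakespeare.",
    "Shakespeare wrote Hamlet around 1600.",
    "The weather in Taipei is humid.",
]
result = answer("Who wrote Hamlet?", passages)
print(result)
```

A real Chinese QA system would need word segmentation, a trained named entity recognizer, and a ranking model that uses the predicted question type; here the type is only returned alongside the answer.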
(3) Knowledge Representation
For knowledge representation, we study the logical foundations of
ontology as well as fine-grained semantic representation. We study
near-synonyms to identify fine-grained differences between synonyms.
This work gives us a better understanding of meaning representation and
meaning composition. We will remodel the current ontology structures of
WordNet, HowNet, and FrameNet to achieve a better and more unified
representation. We will also study modal logics, integrate modal logic
systems into a unified framework, and develop automated inference and
theorem proving methods based on that logical framework.
- Knowledge web
Over the last five years, we have developed an ontology called InfoMap,
which is designed to unify linguistic knowledge, common sense knowledge,
and knowledge of various domains. InfoMap can be used to perform complex
and fuzzy structural matching. It has been successfully used to
construct a question answering system, an intelligent tutoring agent
system, and an English grammar checker. In the future, we will remodel
the current ontology structures of InfoMap, WordNet, HowNet, and
FrameNet to form a new Knowledge Web, called E-HowNet, in the hope of
achieving a better, unified representation that supports knowledge-based
extraction, search, parsing, and logical inference under a single
framework.
- Resolving the encoding problem of Chinese characters
This research exploits knowledge of philology to resolve the encoding
problem of Chinese characters. Missing character problems persist
because existing Chinese character coding systems assume that the set of
Chinese characters is closed and finite, like an alphabet, and ignore
the fact that each Chinese character is composed of a limited set of
basic meaningful components. In our proposed encoding system, the set of
glyphs and the operators for composing Chinese characters are coded, and
the problem of missing characters is resolved by encoding the glyph
structures of the missing characters.
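Encoding a character by its glyph structure can be sketched as a small expression tree of components and composition operators. The text does not enumerate the operators of the proposed system, so this sketch borrows four Unicode Ideographic Description Characters as a stand-in; the function names `parse`, `components`, and `encode` are hypothetical.

```python
# Arity of each composition operator. These are Unicode Ideographic
# Description Characters, used here only as a stand-in for the operators
# of the proposed encoding, which the text does not list.
OPERATORS = {
    "\u2ff0": 2,  # left-to-right composition
    "\u2ff1": 2,  # top-to-bottom composition
    "\u2ff2": 3,  # left-middle-right composition
    "\u2ff3": 3,  # top-middle-bottom composition
}

def parse(seq):
    """Parse a prefix-notation glyph expression into a nested tuple tree."""
    def walk(i):
        if i >= len(seq):
            raise ValueError("expression ended early")
        ch = seq[i]
        if ch not in OPERATORS:
            return ch, i + 1  # a basic component (leaf)
        children = []
        i += 1
        for _ in range(OPERATORS[ch]):
            child, i = walk(i)
            children.append(child)
        return (ch, *children), i

    tree, end = walk(0)
    if end != len(seq):
        raise ValueError("trailing characters after expression")
    return tree

def components(tree):
    """List the basic components of a glyph tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [c for child in tree[1:] for c in components(child)]

def encode(tree):
    """Serialize a glyph tree back into its prefix-notation string."""
    if isinstance(tree, str):
        return tree
    return tree[0] + "".join(encode(child) for child in tree[1:])

# 森 described as 木 above a left-right pair of 木.
sen = "\u2ff1木\u2ff0木木"
tree = parse(sen)
print(tree, components(tree), encode(tree) == sen)
```

Because a character is stored as a structure over a small set of components, a character absent from a fixed code table can still be written down and compared, which is the property the paragraph above relies on.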
(4) Long-term Plan
We shall construct an ontological knowledge base capable of utilizing
the knowledge in InfoMap, WordNet, HowNet, and FrameNet, together with
other statistical knowledge, in a coherent fashion to perform
fine-grained semantic inference. Knowledge-based parsing and tagging
tools will be developed to enable semi-automatic construction of
ontologies in various knowledge domains. We shall develop more advanced
approaches to organizing information in the user's space and
understanding natural language content from the user's perspective.
