Institute of Information Science
Natural Language and Knowledge Processing Laboratory
Principal Investigators:
Wen-Lian Hsu (Chair), Keh-Jiann Chen, Hsin-Min Wang, Lun-Wei Ku, Keh-Yih Su

[Group Profile]
We focus on problems in knowledge-based information processing, a line of research strongly motivated by the flood of information on the Internet. Our work covers knowledge acquisition, utilization, and representation.
1. Knowledge Acquisition
Our focus is on strategies and methodologies for automating knowledge acquisition processes.

a) Construction of linguistic knowledge bases
Over the past twenty-some years, we have developed an infrastructure for Chinese language processing that includes part-of-speech tagged corpora, treebanks, Chinese lexical databases, Chinese grammars, InfoMap, word identification systems, sentence parsers, etc. We have also developed basic techniques for knowledge extraction, such as named entity recognition (NER), semantic role labeling, and relation extraction, for both Chinese text and biological literature. We plan to extract linguistic and domain knowledge from the web with the help of crowdsourcing.

b) Machine learning and pattern classification
We have proposed an extremely efficient tree decomposition approach for training non-linear support vector machines, with speedups of hundreds, and sometimes thousands, of times while still achieving comparable test accuracy. We are also pioneering a new method for ranking and selecting features using multiple feature subsets. This method shows advantages in computing speed, test accuracy, the number of essential features ranked above all irrelevant features, and the number of essential features among the selected features.
c) Pattern-based information extraction (IE)
Most pattern-based IE approaches start from manually provided seed instances. We have proposed two mechanisms to remove the human effort required at this initial stage. First, we applied a semi-supervised method that can take a large quantity of seed instances of varying quality. Second, we proposed a weakly supervised approach for extracting instances of semantic classes, which uses a compression model to assess the contextual evidence for its extractions.
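As a rough illustration of the seed-based starting point described above, the following is a minimal sketch of generic pattern bootstrapping: contexts around known seed instances become patterns, and patterns with enough support are matched against the text to propose new instances. It is not the semi-supervised or compression-based method mentioned above. The function names, the toy sentences, and the `window` and `min_support` parameters are all assumptions made for this example.

```python
from collections import Counter


def learn_patterns(sentences, seeds, window=2):
    """Count (left context, right context) pairs around seed mentions.

    Hypothetical helper: the left context is up to `window` preceding
    tokens, the right context is the single following token (if any).
    """
    patterns = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token in seeds:
                left = tuple(tokens[max(0, i - window):i])
                right = tuple(tokens[i + 1:i + 2])
                patterns[(left, right)] += 1
    return patterns


def apply_patterns(sentences, patterns, min_support=2):
    """Return every token whose context matches a well-supported pattern."""
    strong = [p for p, count in patterns.items() if count >= min_support]
    found = set()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            for left, right in strong:
                left_ok = tuple(tokens[max(0, i - len(left)):i]) == left
                right_ok = tuple(tokens[i + 1:i + 1 + len(right)]) == right
                if left_ok and right_ok:
                    found.add(token)
    return found


# Toy data, invented for illustration only.
sentences = [
    "cities such as Taipei are crowded",
    "cities such as Tokyo are crowded",
    "cities such as Osaka are crowded",
    "he visited Kyoto last year",
]
seeds = {"Taipei", "Tokyo"}

patterns = learn_patterns(sentences, seeds)
new_instances = apply_patterns(sentences, patterns) - seeds
print(new_instances)
```

In this toy run, the only pattern seen with both seeds is "such as _ are", so the sketch proposes "Osaka" and leaves "Kyoto" alone. A practical system would iterate this loop and score patterns and instances, which is where the quality of the seeds starts to matter.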
2. Knowledge Utilization
Our Chinese input system, GOING, is used by over one million people in Taiwan. Our knowledge representation kernel, InfoMap, has been applied to a wide variety of application systems. In the future, we will design event frames as a major building block of our learning system. We will also develop basic technologies for processing spoken language and music to support various applications.

a) Knowledge-based Chinese language processing
We will focus on the conceptual processing of Chinese documents. Our system will use the statistical, linguistic, and common-sense knowledge derived from our evolving Knowledge Web and E-HowNet to parse the conceptual structures of sentences and interpret their meanings.
b) Audio (speech / music / song) processing & retrieval
We focus on speech recognition, speaker recognition/segmentation/clustering, and spoken document retrieval/summarization. Our speaker verification system was ranked 2nd in the 2006 International Symposium on Chinese Spoken Language Processing. We have developed a prototype TV news retrieval system. With regard to music, our research focuses on vocal melody extraction, query by singing/humming, music tag annotation, and tag-based music retrieval. Our audio-tagging system was ranked 1st in the 2009 Music Information Retrieval Evaluation eXchange (MIREX 2009). We have also developed a novel query-by-multi-tags music search system.

c) Chinese question answering system
We integrated several Chinese NLP techniques to construct a Chinese factoid QA system, which won first place in NTCIR-5 and NTCIR-6. In the future, we will extend the system to answer “how” and “what” types of questions.

d) Named entity recognition (NER)
Identifying person, location, and organization names in documents is very important for natural language understanding. In the past, we developed a machine-learning-based NER system, which won second place in the 2006 SIGHAN competition and first place in the 2009 BioCreative II.5 gene name normalization shared task. In recent years, we have focused on using semantic rules and language patterns for NER with Markov Logic Networks, which provide more flexibility in NER.

e) Chinese Textual Entailment (TE)
TE is the task of identifying inferences between sentences. We have integrated several NLP tools and resources, focusing on deeper semantic and syntactic analysis, to construct a Chinese TE recognition system, which achieved good performance in the 2011 NTCIR-9 TE shared task.
3. Knowledge Representation
We will remodel the current ontology structures of WordNet, HowNet, and FrameNet to achieve a more unified representation. We designed a universal concept representation mechanism called E-HowNet, which is a frame-based entity-relation model. E-HowNet has semantic composition and decomposition capabilities, which are intended to derive near-canonical sense representations of sentences through the semantic composition of lexical senses.


Institute of Information Science, Academia Sinica