Institute of Information Science
Natural Language and Knowledge Processing Laboratory
Principal Investigators: Wen-Lian Hsu (Chair), Lun-Wei Ku, Wei-Yun Ma, Keh-Yih Su, Hsin-Min Wang

[Group Profile]
We focus on problems in knowledge-based information processing, a field strongly motivated by the deluge of information available on the Internet. We are currently engaged in research on knowledge acquisition, representation, and utilization, with special emphasis on Chinese language processing.

1. Knowledge Acquisition
Our focus is on strategies and methodologies for automating knowledge acquisition processes.
a) Construction of linguistic knowledge bases
Over the past two decades, we have developed an infrastructure for Chinese language processing that includes a part-of-speech tagged corpus, treebanks, Chinese lexical databases, Chinese grammatical elements, InfoMap, word identification systems, and sentence parsers, among other components. We plan to use these tools to extract linguistic and domain knowledge from the web through crowdsourcing.

b) Pattern-based information extraction (IE)
Most pattern-based IE approaches are initiated by manually providing seed instances. We have proposed a semi-supervised method that can manage a large quantity of seed instances of varying quality. Our strategy provides flexible frame-based pattern matching and summarization.
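
The general idea of seed-driven pattern extraction can be illustrated with a minimal sketch. This is a generic bootstrapping toy, not our semi-supervised method: the function names, the English example sentences, and the `min_support` threshold are all assumptions made for illustration. Patterns are induced from the text between the two arguments of each seed pair, and only patterns matched by enough seeds are trusted to extract new instances, which is one simple way of coping with noisy seeds.

```python
import re
from collections import Counter

def induce_patterns(sentences, seeds):
    """Count the text found between the two arguments of each seed pair."""
    patterns = Counter()
    for sentence in sentences:
        for x, y in seeds:
            match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if match:
                patterns[match.group(1)] += 1
    return patterns

def apply_patterns(sentences, patterns, min_support=2):
    """Extract (x, y) pairs using patterns matched by at least min_support seeds."""
    found = set()
    for middle, support in patterns.items():
        if support < min_support:
            continue  # weakly supported patterns are treated as unreliable
        regex = re.compile(r"(\w+)" + re.escape(middle) + r"(\w+)")
        for sentence in sentences:
            for x, y in regex.findall(sentence):
                found.add((x, y))
    return found

# Hypothetical toy data; the last sentence yields a noisy, low-support pattern.
sentences = [
    "Paris is the capital of France.",
    "Tokyo is the capital of Japan.",
    "Berlin is the capital of Germany.",
    "Paris hosted a conference on France.",
]
seeds = [("Paris", "France"), ("Tokyo", "Japan")]

patterns = induce_patterns(sentences, seeds)
new_pairs = apply_patterns(sentences, patterns) - set(seeds)
print(new_pairs)
```

Here the pattern " is the capital of " is supported by both seeds and recovers the unseen pair (Berlin, Germany), while the pattern from the conference sentence is supported only once and is discarded.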

2. Knowledge Representation
We will remodel the current ontology structures of WordNet, HowNet, and FrameNet to achieve a more unified representation. We have designed a universal concept representation mechanism called E-HowNet, a frame-based entity-relation model. E-HowNet has semantic composition and decomposition capabilities, which are intended to enable it to derive near-canonical sense representations of sentences through the semantic composition of lexical senses.
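
As a rough illustration of what frame-based semantic composition means, the sketch below composes toy lexical frames into a sentence-level frame. Everything in it is hypothetical: the `Frame` class, the `compose` function, and the concepts and role names are illustrative stand-ins and are not actual E-HowNet definitions or its API. Roles are rendered in sorted order, so the same meaning yields the same string regardless of the order in which the parts were composed, which hints at what a near-canonical representation provides.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A toy frame: a head concept plus role slots filled by other frames."""
    concept: str
    roles: dict = field(default_factory=dict)

    def render(self):
        # Sort roles so that composition order does not affect the output.
        if not self.roles:
            return self.concept
        inner = ", ".join(
            f"{role}={filler.render()}" for role, filler in sorted(self.roles.items())
        )
        return f"{self.concept}({inner})"

def compose(head, role, dependent):
    """Return a new frame in which dependent fills the given role of head."""
    merged = dict(head.roles)
    merged[role] = dependent
    return Frame(head.concept, merged)

# Hypothetical lexical senses for "The child eats an apple."
eat = Frame("eat")
child = Frame("human", {"age": Frame("young")})
apple = Frame("fruit")

sentence = compose(compose(eat, "agent", child), "theme", apple)
reordered = compose(compose(eat, "theme", apple), "agent", child)
print(sentence.render())
```

Decomposition would run in the opposite direction, expanding a concept such as the toy "child" frame into its more primitive parts.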

3. Knowledge Utilization
Our Chinese input system, GOING, is used by over one million people in Taiwan. Our knowledge representation kernel, InfoMap, has been applied to a wide variety of application systems. In the future, we will design event frames as a major building block of our learning system. We will also develop basic technologies for processing spoken language and music to support various applications.

a) Knowledge-based Chinese language processing
We will focus on the conceptual processing of Chinese documents. Our system will utilize the statistical, linguistic, and common sense knowledge derived by our evolving Knowledge Web and E-HowNet to parse the conceptual structures of sentences and interpret the meanings of sentences.

b) Audio (speech/music/song) processing & retrieval
Our goal is to develop methods for analyzing, extracting, recognizing, indexing, and retrieving information from audio data, with special emphasis on speech and music. For speech, our research focuses on speaker recognition, spoken language recognition, voice conversion, and spoken document retrieval/summarization. For music, our ongoing research topics include vocal melody extraction, automatic music tagging, music emotion recognition, and music search. Our audio tagging system was ranked 1st in the 2009 Music Information Retrieval Evaluation eXchange (MIREX 2009). Our work on acoustic-visual emotion Gaussians modeling for automatic music video generation won the ACM Multimedia 2012 Grand Challenge First Prize.

c) Chinese question answering system
We integrated several Chinese NLP techniques to construct a Chinese factoid QA system, which won first place in NTCIR-5 and NTCIR-6. In the future, we will extend the system to answer “how” and “what” types of questions.

d) Named entity recognition (NER)
Identifying person, location, and organization names in documents is very important for natural language understanding. In the past, we developed a machine-learning-based NER system, which won second place in the 2006 SIGHAN competition and first place in the 2009 BioCreative II.5 gene name normalization shared task. In recent years, we have focused on using semantic rules and language patterns for NER, adopting Markov Logic Networks, which provide more flexibility.

e) Chinese Textual Entailment (TE)
TE is the process of identifying inferences between sentences. We have integrated several NLP tools and resources, focusing on deeper semantic and syntactic analysis to construct a Chinese TE recognition system, which performed well in the 2011 NTCIR-9 TE shared task.

f) Sentiment Analysis and Opinion Mining
Processing subjective information requires a deep understanding of the subject matter. We have studied opinions, sentiments, subjectivities, effects, emotions, and views in texts such as news articles, blogs, forums, reviews, comments, and dialogs, and developed related analysis techniques for Chinese and English. Using these sentiment analysis techniques, we built Feelit, a web-post emotion visualization system, and RESOLVE, a writing system for ESL learners, to help users understand and learn to express their emotions. Given the promising results of these systems, we will continue to improve performance using the Sinica parser, semantic role labels, and E-HowNet.

g) Semantic-Oriented Machine Translation
We adopt a deep syntactic structure with a lexicon sense for each token and a case label at each node. An integrated statistical model is used to search for the most likely combination of parse tree, lexicon senses, and node case labels (i.e., the best path). After the desired source semantic normal form is obtained, the corresponding target semantic normal form and the target string are generated according to patterns and parameters learned automatically from the selected paths. For each unreachable sentence, a surrogate path is created by searching, within the search beam, for the path that maximizes a specified function of the associated sentence-level BLEU score and likelihood value.
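
The surrogate path selection can be sketched as follows. The scoring function is not specified above, so the linear mix of BLEU and min-max normalized log-likelihood used here is purely an assumption, as are the `Path` class, the `select_surrogate` function, the `weight` parameter, and the example numbers.

```python
from dataclasses import dataclass

@dataclass
class Path:
    name: str
    bleu: float            # sentence-level BLEU against the reference, in [0, 1]
    log_likelihood: float  # model log-likelihood of this path

def select_surrogate(beam, weight=0.5):
    """Pick the path in the beam that maximizes an assumed BLEU/likelihood mix."""
    lo = min(p.log_likelihood for p in beam)
    hi = max(p.log_likelihood for p in beam)
    span = (hi - lo) or 1.0  # avoid division by zero if all likelihoods are equal

    def score(p):
        # Assumed function: weighted sum of BLEU and normalized log-likelihood.
        normalized = (p.log_likelihood - lo) / span
        return weight * p.bleu + (1.0 - weight) * normalized

    return max(beam, key=score)

# Hypothetical candidate paths that survived the search beam.
beam = [
    Path("A", bleu=0.62, log_likelihood=-14.0),
    Path("B", bleu=0.70, log_likelihood=-20.0),
    Path("C", bleu=0.20, log_likelihood=-10.0),
]

best = select_surrogate(beam)
print(best.name)
```

With equal weighting, path A is chosen because it balances translation quality and model likelihood; setting `weight=1.0` would prefer the highest-BLEU path B, and `weight=0.0` the most likely path C.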

h) Chinese Natural Language Understanding
We will build a Chinese natural language understanding system based on various analysis modules (e.g., word segmenter, parser, semantic role labeler, logic form transformer) that we have previously constructed. We plan to start this long-term research project with a Chinese machine reading program that can be evaluated with reading comprehension tests. The project is expected to start with elementary school texts, then gradually move to high school texts and finally to real domain-oriented applications (e.g., intelligent Q&A).

[Figure: Fast input software Déjà vu on a cell phone]


Institute of Information Science, Academia Sinica