Group Profile
We focus on problems in knowledge-based information processing, motivated by the flood of information on the Internet. We are currently studying knowledge acquisition, representation, and utilization, with a special emphasis on Chinese language processing.
1. Knowledge Base
Our focus is on strategies and methodologies for automating knowledge acquisition processes.
a) Construction of linguistic knowledge bases
Over the past twenty-some years, we have developed an infrastructure for Chinese language processing that includes part-of-speech tagged corpora, treebanks, Chinese lexical databases, a Chinese grammar, InfoMap, word identification systems, sentence parsers, and more. We plan to use these tools, in combination with crowd-sourcing, to extract linguistic and domain knowledge from the web. Various knowledge bases, including general and domain-specific ontologies as well as lexical items and named entities from Wikipedia, are connected to form a complete concept net.
2. Natural Language Understanding
We will remodel the current ontology structures of WordNet, HowNet, and FrameNet to achieve a more unified representation. We have designed a universal concept representation mechanism called E-HowNet, which is a frame-based entity-relation model. E-HowNet has semantic composition and decomposition capabilities, which can derive near-canonical representations of sentences through semantic composition of lexical senses. To connect with other well-developed ontology structures, senses in E-HowNet were manually connected to the corresponding synsets in WordNet, and lexicons in E-HowNet are automatically linked to synsets in WordNet.
a) Knowledge-based Chinese language processing
We will focus on the conceptual processing of Chinese documents. Our system will use statistical, linguistic, and common-sense knowledge, derived from our evolving Knowledge Web and E-HowNet, to parse the conceptual structures of sentences and interpret sentence meanings.
b) Statistical Principle-based Model
Most pattern-based information extraction (IE) approaches are initiated by manually providing seed instances. We have proposed a semi-supervised method that can take a large quantity of seed instances of varying quality. Our strategy provides flexible frame-based pattern matching and summarization.
c) Chinese question answering system
We integrated several Chinese NLP techniques to construct a Chinese factoid QA system, which won first prize in NTCIR-5 and NTCIR-6. In the future, we will extend the system to answer “how” and “what” questions.
d) Named entity recognition (NER)
Identifying person, location, and organization names in documents is extremely important for natural language understanding. We have developed a machine learning-based NER system, which won second prize in the 2006 SIGHAN competition and first prize in the 2009 BioCreative II.5 gene name normalization shared task.
e) Chinese Textual Entailment (TE)
TE is the task of identifying inferences between sentences. We have integrated several NLP tools and resources, focusing on deeper semantic and syntactic analysis to construct a Chinese TE recognition system, which performed well in the 2011 NTCIR-9 TE shared task.
f) Distributional Word Representation
Recently, distributional word representations have become widely used in NLP tasks. Compared with traditional symbolic representations of word meaning, distributional word representations are trained from a corpus and represent word meanings as vectors, which provides additional computational power and better generalization. However, such representations lack explanatory power. Making full use of the strengths of each kind of representation is therefore crucial in solving practical NLP tasks. For instance, we developed a lexical sentiment analyzer that uses both distributional word representations and E-HowNet to predict the sentiment of a given word. This system won first prize in Valence at an international contest held by IALP in 2016. We have also studied how to infuse information from knowledge bases into distributional word representations. These results were published at EACL 2017.
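As a rough illustration of combining the two kinds of representation, the sketch below blends a vector-based estimate of a word's valence with an estimate taken from knowledge-base relations. It is not the analyzer described above: the function name `predict_valence`, the toy vectors, the 1-to-9 valence scale, the `kb_related` mapping (standing in for relations that a resource such as E-HowNet could supply), and the linear blending weight `kb_weight` are all assumptions made for this example.

```python
import math
from typing import Dict, List, Optional


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_valence(
    word: str,
    vectors: Dict[str, List[float]],      # hypothetical word embeddings
    seed_valence: Dict[str, float],       # hypothetical labeled seed words
    kb_related: Dict[str, List[str]],     # hypothetical knowledge-base relations
    k: int = 2,
    kb_weight: float = 0.5,
) -> Optional[float]:
    """Blend an embedding-based and a knowledge-based valence estimate."""
    # Embedding estimate: similarity-weighted mean over the k nearest seed words.
    emb_est = None
    if word in vectors:
        sims = sorted(
            (
                (cosine(vectors[word], vectors[s]), s)
                for s in seed_valence
                if s != word and s in vectors
            ),
            reverse=True,
        )[:k]
        positive = [(sim, s) for sim, s in sims if sim > 0]
        if positive:
            total = sum(sim for sim, _ in positive)
            emb_est = sum(sim * seed_valence[s] for sim, s in positive) / total

    # Knowledge-based estimate: mean valence of related seed words.
    related = [seed_valence[s] for s in kb_related.get(word, []) if s in seed_valence]
    kb_est = sum(related) / len(related) if related else None

    # Fall back to whichever estimate is available; None if neither is.
    if emb_est is None:
        return kb_est
    if kb_est is None:
        return emb_est
    return kb_weight * kb_est + (1.0 - kb_weight) * emb_est


# Toy data, invented for the example.
vectors = {
    "joyful": [1.0, 0.0],
    "happy": [1.0, 0.1],
    "glad": [0.9, 0.2],
    "sad": [-1.0, 0.0],
}
seed_valence = {"happy": 8.0, "glad": 7.0, "sad": 2.0}
kb_related = {"joyful": ["happy"]}

score = predict_valence("joyful", vectors, seed_valence, kb_related)
```

In this toy setup the vector side contributes coverage for words with no knowledge-base entry, while the knowledge-base side keeps the prediction traceable to named relations, which is the trade-off the paragraph above describes.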
3. Natural Language Applications
a) Sentiment Analysis and Opinion Mining
Processing subjective information requires deep understanding. We have studied opinions, sentiments, subjectivities, affects, emotions and views in texts, such as news articles, blogs, forums, reviews, comments, dialogs and short messages. From this information we have developed sentiment analysis techniques for both Chinese and English. We built one of the most popular Chinese sentiment analysis toolkits, CSentiPackage, which includes sentiment dictionaries, scoring tools, and the deep neural network module, UTCNN. Using sentiment analysis techniques, we have built Feelit and WordForce (web-post emotion and opinion visualization systems), EmotionPush (Android app for Facebook short message emotion detection), RESOLVE (writing system for ESL learners to help them read and express emotions), and GiveMeExample (example sentence suggestion system for near-synonyms). Based on the excellent performance of these systems, we will continue to improve and develop the newest technology to enable emotion sensing in applications.
b) Semantic-Oriented Machine Translation
We use deep syntactic structures with lexicon senses and case labels at each node. An integrated statistical model is then used to discover the most likely combination of parse tree, lexicon senses, and node case labels (the best path). After the desired source semantic normal form is obtained, the corresponding target semantic normal form and the target string are generated according to the patterns and parameters automatically learned from those selected paths. For each unreachable sentence, a surrogate path is created by searching, within the search beam, for the path that maximizes a specified function of the associated sentence-level BLEU score and likelihood value.
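The surrogate-path selection can be sketched as follows. The text does not specify the scoring function, so the linear interpolation between sentence-level BLEU and log-likelihood used here, the weight `alpha`, and the names `CandidatePath` and `select_surrogate_path` are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidatePath:
    """One path kept in the search beam (hypothetical structure)."""
    path_id: str
    sentence_bleu: float    # sentence-level BLEU against the reference, in [0, 1]
    log_likelihood: float   # model log-probability of the path (<= 0)


def surrogate_score(path: CandidatePath, alpha: float = 0.5) -> float:
    """Assumed scoring function: weighted sum of BLEU and log-likelihood."""
    return alpha * path.sentence_bleu + (1.0 - alpha) * path.log_likelihood


def select_surrogate_path(beam: List[CandidatePath], alpha: float = 0.5) -> CandidatePath:
    """Return the path in the beam with the highest surrogate score."""
    if not beam:
        raise ValueError("the search beam is empty")
    return max(beam, key=lambda p: surrogate_score(p, alpha))


# Toy beam, invented for the example.
beam = [
    CandidatePath("A", sentence_bleu=0.2, log_likelihood=-1.0),
    CandidatePath("B", sentence_bleu=0.6, log_likelihood=-1.2),
    CandidatePath("C", sentence_bleu=0.5, log_likelihood=-3.0),
]
best = select_surrogate_path(beam)
```

With `alpha=0.5`, path B is chosen: it is slightly less likely than A but much closer to the reference translation. Setting `alpha=0.0` reduces the selection to the most likely path in the beam.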
c) Machine Reading
We will build a Chinese natural language understanding system based on various analysis modules (word segmenter, parser, semantic role labeler, logic form transformer, etc.) that we have previously built. We plan to start this long-term research project with a Chinese machine reading program, which can be evaluated by reading comprehension tests. This project is expected to begin by reading elementary school texts, then gradually move to high school-level texts, and finally to real domain-oriented applications (e.g., smart Q&A).
d) Spoken language processing
Our research topics include speaker recognition, spoken language recognition, voice conversion, and spoken document retrieval/summarization. Recent achievements include locally linear embedding-based approaches for voice conversion and postfiltering, discriminative autoencoders for speech/speaker recognition, and novel paragraph embedding methods for spoken document retrieval/summarization. Our group member, Dr. Kuan-Yu Chen, received the 2016 Postdoctoral Academic Publication Award from the Ministry of Science and Technology of Taiwan for a paper on spoken document summarization published at COLING 2016.