We focus on problems in knowledge-based information
processing, a field strongly motivated by the flood of
information on the Internet. We are currently studying
knowledge acquisition, representation, and utilization,
with a special emphasis on Chinese language processing.
1. Knowledge Base
Our focus is on strategies and methodologies for
automating knowledge acquisition processes.
a) Construction of linguistic knowledge bases
Over the past twenty-some years, we have developed an
infrastructure for Chinese language processing that includes
part-of-speech tagged corpora, treebanks, Chinese lexical
databases, a Chinese grammar, InfoMap, word identification
systems, sentence parsers, and more. We plan to use
these tools, in combination with crowdsourcing, to extract
linguistic and domain knowledge from the web. Various
knowledge bases, including general and domain-specific
ontologies as well as lexical items and named entities from
Wikipedia, are connected to form a complete concept net.
2. Natural Language Understanding
We will remodel the current ontology structures of
WordNet, HowNet, and FrameNet to achieve a more
unified representation. To this end, we designed a universal
concept representation mechanism called E-HowNet, which is
a frame-based entity-relation model. E-HowNet supports
semantic composition and decomposition, so that
near-canonical representations of sentences can be derived
by composing lexical senses. To connect with
other well-developed ontology structures, senses in
E-HowNet were manually connected to the corresponding
synsets in WordNet, and lexical entries in E-HowNet are
automatically linked to synsets in WordNet.
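As a rough illustration of frame-based semantic composition, the sketch below builds a sentence representation by filling the role slots of a head concept with other concepts. It is a toy model and not the E-HowNet formalism itself: the `Frame` class, the `compose` and `render` methods, and the concept and role names (`eat`, `agent`, `patient`) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Frame:
    """A concept (head) plus named relations (roles) to other frames."""
    head: str
    roles: Dict[str, "Frame"] = field(default_factory=dict)

    def compose(self, role: str, filler: "Frame") -> "Frame":
        """Return a new frame with `filler` attached under `role`."""
        new_roles = dict(self.roles)
        new_roles[role] = filler
        return Frame(self.head, new_roles)

    def render(self) -> str:
        """Serialize with roles in sorted order, so the output does not
        depend on the order in which the roles were filled."""
        if not self.roles:
            return "{" + self.head + "}"
        inner = ",".join(
            f"{role}={filler.render()}"
            for role, filler in sorted(self.roles.items())
        )
        return "{" + self.head + ":" + inner + "}"


# Hypothetical lexical senses composed into one sentence-level frame.
sentence = (
    Frame("eat")
    .compose("agent", Frame("child"))
    .compose("patient", Frame("apple"))
)
print(sentence.render())  # {eat:agent={child},patient={apple}}
```

Sorting the roles at serialization time is one simple way to approximate a near-canonical form: two sentences that fill the same roles in a different order yield the same string.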
a) Knowledge-based Chinese language processing
We will focus on the conceptual processing of Chinese
documents. Our system will use statistical, linguistic,
and common-sense knowledge derived from our evolving
Knowledge Web and E-HowNet to parse the conceptual
structures of sentences and interpret sentence meanings.
b) Statistical Principle-based Model
Most pattern-based information extraction (IE) approaches
are initiated by manually providing seed instances. We have
proposed a semi-supervised method that can take a large
quantity of seed instances of varying quality. Our strategy
provides flexible frame-based pattern matching and
summarization.
c) Chinese question answering system
We integrated several Chinese NLP techniques to construct
a Chinese factoid QA system, which won first prize in
NTCIR-5 and NTCIR-6. In the future, we will extend the
system to answer “how” and “what” questions.
d) Named entity recognition (NER)
Identifying person, location, and organization names in
documents is extremely important for natural language
understanding. In the past, we have developed a
machine-learning-based NER system, which won second prize
in the 2006 SIGHAN competition and first prize in the 2009
BioCreative II.5 gene name normalization shared task.
e) Chinese Textual Entailment (TE)
TE is the task of identifying inferences between sentences.
We have integrated several NLP tools and resources,
focusing on deeper semantic and syntactic analysis
to construct a Chinese TE recognition system, which
performed well in the 2011 NTCIR-9 TE shared task.
f) Distributional Word Representation
Recently, distributional word representations have become
widely used in NLP tasks. Unlike traditional symbolic
representations of word meaning, distributional word
representations are trained from a corpus and represent
word meanings as vectors, thus providing additional
computational power and the advantage of generalization.
However, such representations lack explanatory power.
Making full use of the strengths of each kind of
representation is therefore crucial in solving practical NLP tasks.
For instance, we developed a lexical sentiment analyzer
using both distributional word representation and E-HowNet
to predict the sentiment of a given word. This system won
first prize in the valence category of an international contest
held by IALP in 2016. We have also studied how to infuse
information from knowledge bases into distributional word
representations. These results were published in EACL,
3. Natural Language Applications
a) Sentiment Analysis and Opinion Mining
Processing subjective information requires deep
understanding. We have studied opinions, sentiments,
subjectivities, affects, emotions and views in texts, such as
news articles, blogs, forums, reviews, comments, dialogs
and short messages. From these studies, we have developed sentiment analysis
techniques for both Chinese and English. We built one of the most popular Chinese
sentiment analysis toolkits, CSentiPackage, which includes sentiment dictionaries,
scoring tools, and the deep neural network module, UTCNN. Using sentiment
analysis techniques, we have built Feelit and WordForce (web-post emotion and
opinion visualization systems), EmotionPush (Android app for Facebook short
message emotion detection), RESOLVE (writing system for ESL learners to help them
read and express emotions), and GiveMeExample (example sentence suggestion
system for near-synonyms). Building on the strong performance of these systems,
we will continue to improve them and to develop new technology that enables
emotion sensing in applications.
b) Semantic-Oriented Machine Translation
We use deep syntactic structures with a lexical sense and a case label at each node.
An integrated statistical model is then used to find the most likely combination
of parse tree, lexical senses, and node case labels (the best path). After the desired
source semantic normal form is obtained, the corresponding target semantic normal
form and the target string are generated according to the patterns and parameters
automatically learned from the selected paths. For each unreachable sentence, a
surrogate path is created by searching, within the search beam, for the path that
maximizes a specified function of the associated sentence-level BLEU score and
likelihood value.
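The surrogate-path selection can be sketched as a maximization over the candidates in the beam. The text does not specify the scoring function, so the version below assumes a simple weighted sum of sentence-level BLEU and log-likelihood; the `Path` class, the `weight` parameter, and the function names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Path:
    """One candidate analysis/translation path kept in the search beam."""
    translation: str
    log_likelihood: float  # model score of the path (higher is better)
    bleu: float            # sentence-level BLEU against the reference, in [0, 1]


def surrogate_score(path: Path, weight: float = 0.1) -> float:
    """Assumed scoring function: BLEU plus a weighted log-likelihood."""
    return path.bleu + weight * path.log_likelihood


def pick_surrogate(beam: List[Path], weight: float = 0.1) -> Path:
    """Return the path in the beam with the highest surrogate score."""
    if not beam:
        raise ValueError("beam is empty")
    return max(beam, key=lambda p: surrogate_score(p, weight))


# Toy beam for a sentence whose reference translation is unreachable.
beam = [
    Path("candidate A", log_likelihood=-2.0, bleu=0.90),
    Path("candidate B", log_likelihood=-1.0, bleu=0.50),
    Path("candidate C", log_likelihood=-8.0, bleu=0.95),
]
print(pick_surrogate(beam).translation)  # candidate A
```

With `weight=0` the choice depends on BLEU alone, and larger weights favor paths the model itself finds likely; the appropriate trade-off would have to be tuned for the actual system.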
c) Machine Reading
We will build a Chinese natural language understanding system based on various
analysis modules (word segmenter, parser, semantic role labeler, logic form
transformer, etc.) that we have previously built. We plan to start this long-term
research project with a Chinese machine reading program, which can be evaluated
with reading comprehension tests. The project is expected to begin with
elementary school texts, then gradually move to high school-level texts and finally
to real-world, domain-oriented applications (e.g., smart Q&A).
d) Spoken language processing
Our research topics include speaker recognition, spoken language recognition, voice
conversion, and spoken document retrieval/summarization. Recent achievements
include locally linear embedding-based approaches for voice conversion and
post-filtering, discriminative autoencoders for speech/speaker recognition, and novel
paragraph embedding methods for spoken document retrieval/summarization.
Our group member, Dr. Kuan-Yu Chen, received the 2016 Postdoctoral Award from the
Ministry of Science and Technology of Taiwan with a paper on spoken document
in COLING, 2016.