Institute of Information Science
Recent Research Results
Current Research Results
"Image Feature Extraction in Encrypted Domain with Privacy-Preserving SIFT," IEEE Trans. on Image Processing, November 2012.
Authors: Chao-Yung Hsu, Chun-Shien Lu, and Soo-Chang Pei

Abstract:
Privacy has received considerable attention but is still largely ignored in the multimedia community.
Consider a cloud computing scenario where the server is resource-abundant and is capable of finishing the designated tasks.
It is envisioned that secure media applications with privacy preservation will receive serious attention.
In view of the fact that scale-invariant feature transform (SIFT) has been widely adopted in various fields, this paper is the first to target the importance of privacy-preserving SIFT (PPSIFT) and to address the problem of secure SIFT feature extraction and representation in the encrypted domain.
As all of the operations in SIFT must be moved to the encrypted domain, we propose a privacy-preserving realization of the SIFT method based on homomorphic encryption.
We show, through a security analysis based on the discrete logarithm problem and RSA, that PPSIFT is secure against ciphertext-only and known-plaintext attacks. Experimental results obtained from different case studies demonstrate that the proposed homomorphic encryption-based privacy-preserving SIFT performs comparably to the original SIFT and that our method is useful in SIFT-based privacy-preserving applications.
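Since the scheme builds on homomorphic encryption, a toy sketch of the Paillier cryptosystem, a standard additively homomorphic scheme, may help orient readers. It is illustrative background with insecurely small parameters, not the paper's exact construction.

```python
# Toy sketch of additively homomorphic (Paillier) encryption, the kind of
# primitive privacy-preserving signal processing builds on. Tiny hard-coded
# primes for illustration only; real deployments need >=2048-bit moduli.
import math
import random

p, q = 293, 433                      # demo primes (insecure sizes)
n = p * q
n2 = n * n
g = n + 1                            # standard choice of generator
lam = math.lcm(p - 1, q - 1)         # Carmichael's lambda(n)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # inverse of L(g^lam mod n^2)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Homomorphic property: E(a) * E(b) mod n^2 decrypts to a + b, so a server
# can sum encrypted pixel values without ever seeing the plaintexts.
a, b = 1234, 5678
assert decrypt(encrypt(a) * encrypt(b) % n2) == (a + b) % n
```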
Current Research Results
"U-Skyline: A New Skyline Query for Uncertain Databases," IEEE Trans. on Knowledge and Data Engineering, To Appear.
Authors: X. Liu, D.-N. Yang, M. Ye, and W.-C. Lee

Abstract:
The skyline query, aiming at identifying a set of skyline tuples that are not dominated by any other tuple, is particularly useful for multi-criteria data analysis and decision-making. For uncertain databases, a probabilistic skyline query, called P-Skyline, has been developed to return skyline tuples by specifying a probability threshold. However, the answer obtained via a P-Skyline query usually includes skyline tuples that undesirably dominate each other when a small threshold is specified, or it may contain far fewer skyline tuples if a larger threshold is employed. To address this concern, we propose a new uncertain skyline query, called the U-Skyline query, in this paper. Instead of setting a probabilistic threshold to qualify each skyline tuple independently, the U-Skyline query searches for a set of tuples that has the highest probability (aggregated from all possible scenarios) of being the skyline answer. In order to answer U-Skyline queries efficiently, we propose a number of optimization techniques for query processing, including 1) computational simplification of the U-Skyline probability, 2) pruning of unqualified candidate skylines and early termination of query processing, 3) reduction of the input dataset, and 4) partitioning and conquering of the reduced dataset. We perform a comprehensive performance evaluation of our algorithm and an alternative approach that formulates the U-Skyline processing problem by integer programming.
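To make the dominance relation concrete, here is a minimal sketch of an ordinary skyline over certain (non-probabilistic) tuples; it illustrates only the dominance test the query builds on, not the paper's aggregation of skyline probability over possible worlds. Smaller values are assumed better in every dimension.

```python
# Minimal skyline over certain tuples: t dominates s if t is no worse in
# every dimension and strictly better in at least one (smaller = better).
def dominates(t, s):
    return all(a <= b for a, b in zip(t, s)) and any(a < b for a, b in zip(t, s))

def skyline(points):
    # dominates(p, p) is always False, so no self-exclusion is needed
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Example: (price, distance) of hotels; the skyline is the trade-off frontier.
hotels = [(100, 5), (80, 9), (120, 2), (150, 4), (90, 8)]
print(skyline(hotels))   # [(100, 5), (80, 9), (120, 2), (90, 8)]
```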
Current Research Results
"The Left and Right Context of a Word: Overlapping Chinese Syllable Word Segmentation with Minimal Context," Transactions on Asian Language Information Processing (TALIP), To Appear.
Authors: Mike Tian-Jian Jiang, Tsung-Hsien Lee, Wen-Lian Hsu

Abstract:
Since a Chinese syllable can correspond to many characters (homophones), the syllable-to-character conversion task is quite challenging for Chinese phonetic input methods (CPIM). There are usually two stages in a CPIM: 1. segment the syllable sequence into syllable words; 2. select the most likely character words for each syllable word. A CPIM usually assumes that the input is a complete sentence, and evaluates the performance based on a well-formed corpus. In practice, however, most Pinyin users prefer progressive text entry in several short chunks, mainly of one or two words each (most Chinese words consist of two or more characters). Short chunks do not provide enough context to perform the best possible syllable-to-character conversion, especially when a chunk consists of overlapping syllable words. In such cases, a conversion system often selects the boundary of the word with the highest frequency. Short chunk input is even more popular on platforms with limited computing power, such as mobile phones. Based on the observation that the relative strength of a word can be quite different when calculated leftwards or rightwards, we propose a simple division of the word context into the left context and the right context. Furthermore, we design a double ranking strategy for each word to reduce the number of errors in Stage 1. Our strategy is modeled as the minimum feedback arc set problem on a bipartite tournament, with approximate solutions derived from a genetic algorithm. Experiments show that, compared to the frequency-based method (FBM) (low memory and fast) and the conditional random fields (CRF) model (larger memory and slower), our double ranking strategy requires less memory and less power while achieving competitive performance. We believe a similar strategy could also be adopted to disambiguate conflicting linguistic patterns effectively.
"On Socio-Spatial Group Query for Location-Based Social Networks," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM KDD), August 2012.
Authors: D.-N. Yang, C.-Y. Shen, W.-C. Lee, and M.-S. Chen

Abstract:
Challenges faced in organizing impromptu activities are the requirements of making timely invitations in accordance with the locations of candidate attendees and the social relationships among them. It is desirable to find a group of attendees close to a rally point and ensure that the selected attendees have good social relationships, to create a good atmosphere in the activity. Therefore, this paper proposes the Socio-Spatial Group Query (SSGQ) to select a group of nearby attendees with tight social relationships. Efficient processing of SSGQ is very challenging due to the tradeoff between the spatial and social domains. We prove that the problem is NP-hard and design an efficient algorithm, SSGSelect, which includes effective pruning techniques to reduce the running time for finding the optimal solution. We also propose a new index structure, the Social R-Tree, to further improve efficiency. User study and experimental results demonstrate that SSGSelect significantly outperforms manual coordination in both solution quality and efficiency.
Current Research Results
"Object recognition using discriminative parts," Computer Vision and Image Understanding, July 2012.
Authors: Y.-H. Liu, A. J. T. Lee, and F. Chang

Abstract:
The existing object recognition methods can be classified into two categories: interest-point-based and discriminative-part-based. The interest-point-based methods do not perform well if the interest points cannot be selected very carefully. The performance of the discriminative-part-based methods is not stable when viewpoints change, because they select discriminative parts from the interest points. In addition, the discriminative-part-based methods often do not provide an incremental learning ability. To address these problems, we propose a novel method that consists of three phases. First, we use sliding windows of different scales to retrieve a number of local parts from each model object and extract a feature vector for each local part retrieved. Next, we construct prototypes for the model objects by using the feature vectors obtained in the first phase; each prototype represents a discriminative part of a model object. Then, we establish the correspondence between the local parts of a test object and those of the model objects, and compute the similarity between the test object and each model object based on the correspondence established. The test object is recognized as the model object that has the highest similarity with the test object. The experimental results show that our proposed method outperforms or is comparable with the compared methods in terms of recognition rates on the COIL-100, Oxford buildings, and ETH-80 datasets, and recognizes all query images of the ZuBuD dataset. It is robust to distortion, occlusion, rotation, and changes in viewpoint and illumination. In addition, we accelerate the recognition process using the C4.5 decision tree technique, and the proposed method has the ability to build prototypes incrementally.
"Joint Management of RAM and Flash Memory with Access Pattern Considerations," ACM/IEEE Design Automation Conference (DAC), June 2012.
Authors: Po-Chun Huang, Yuan-Hao Chang, and Tei-Wei Kuo

Abstract:
The popularity of flash memory has triggered the emergence of a variety of products with flash memory as the storage medium. More advanced architectures with better hardware resources are now explored by vendors to fit different market needs. Unlike past work, this paper proposes to consider RAM as a storage medium together with flash memory, so as to take advantage of the characteristics of both. In particular, an adaptive management strategy is proposed that takes access patterns into consideration to improve both system performance and system endurance. The capability of the proposed approach is evaluated by a series of experiments, for which we have very encouraging results.
Current Research Results
Authors: Ling-Jyh Chen, Yu-Song Syu, Hung-Chia Chen, and Wang-Chien Lee

Abstract:
Geospatial tagging (geotagging) is an emerging and very promising application that can help users find a wide variety of location-specific information, and thereby facilitate the development of advanced location-based services. Conventional geotagging systems share some limitations, such as the use of a two-phase operating model and the tendency to tag popular objects with simple contexts. To address these problems, a number of geotagging systems based on the concept of ‘Games with a Purpose’ (GWAP) have been developed recently. In this study, we investigate these new systems analytically. Based on our analysis results, we design three metrics to evaluate system performance, and develop five task assignment algorithms for GWAP-based systems. Using a comprehensive set of simulations under both synthetic and realistic mobility scenarios, we find that the Least-Throughput First Assignment algorithm (LTFA) is the most effective approach because it can achieve competitive system utility while its computational complexity remains moderate. We also find that, to improve system utility, it is better to assign as many tasks as possible in each round. However, because players may feel annoyed if too many tasks are assigned at the same time, it is recommended that multiple tasks be assigned one by one in each round in order to achieve higher system utility.
"De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs," in Proceedings 5th International Conference on Cloud Computing, IEEE CLOUD 2012, June 2012.
Authors: Yu-Jung Chang, Chien-Chih Chen, Chuen-Liang Chen, and Jan-Ming Ho

Abstract:
The next-generation sequencing technologies dramatically accelerate the throughput of DNA sequencing at a much faster rate than the growth rate of computer speed predicted by Moore's Law. Even loading and running these sequencing data in memory is a problem. There is an urgent need for de novo assemblers to efficiently handle the huge amount of sequencing data using scalable commodity servers in the clouds. In this paper, we present CloudBrush, a parallel algorithm that runs on the MapReduce framework of cloud computing for de novo assembly of high-throughput sequencing data. The algorithm uses Myers's bi-directed string graphs as its basis and consists of two main stages: graph construction and graph simplification. First, a vertex is defined for each non-redundant sequence read. We present a prefix-and-extend algorithm to identify overlaps between pairs of reads and to reduce transitive edges. The graph is further simplified by conventional operations including path compression, tip removal, and bubble removal. We also present a new operation, Similar Neighbour Edge Adjustment, to remove erroneous topological structures in string graphs, and we disconnect repeat regions by revised A-statistics. The goal is to partition the string graph so that all paths in each connected subgraph correspond to similar subsequences of the underlying genome. We then traverse each connected subgraph to find a long path supported by a sufficient number of reads to represent the subgraph. Preliminary results show that the CloudBrush assembler, compared with Contrail and Edena on the sequencing data of E. coli genomes, may yield longer contigs.
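As a single-pair illustration of the overlap relation behind string-graph edges (the paper's prefix-and-extend algorithm computes such overlaps at scale in MapReduce), consider this naive sketch:

```python
# Naive suffix-prefix overlap check between two reads -- the basic relation
# that induces string-graph edges. This is only a single-pair illustration,
# not the scalable MapReduce algorithm the paper presents.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

reads = ["GATTACA", "TACAGGT", "AGGTCCC"]
edges = []
for i, a in enumerate(reads):
    for j, b in enumerate(reads):
        if i != j:
            k = overlap(a, b)
            if k:
                edges.append((i, j, k))   # directed edge read_i -> read_j
print(edges)   # [(0, 1, 4), (1, 2, 4)]
```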
"Distributed Graph Database for Large Scale Social Computing," IEEE International Conference on Cloud Computing (IEEE CLOUD, top conference in cloud computing, acceptance rate 17%), June 2012.
Authors: Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu

Abstract:
Recently, NoSQL databases have become the focus of large-scale data processing. NoSQL databases are classified into two major categories – key-value and column-family. Examples of NoSQL databases include Dynamo from Amazon and BigTable from Google. NoSQL databases are highly scalable because of their simple data schema and weaker consistency model compared with relational databases, and they are fault-tolerant because they mostly run in large-scale cluster environments where failure is the norm.
However, NoSQL databases are not well suited to social computing, for the following three reasons. First, data records stored in a NoSQL database are assumed to be independent of each other, whereas most data in social computing are closely related. For example, a tagged picture can relate to multiple users, and possibly their other data as well. Second, structural queries are very common in social computing, but it is very difficult to express them concisely in a NoSQL database. Social network application developers often have to use complex SQL queries to explore data relationships. These queries usually involve joining multiple tables, which may degrade performance severely. For example, most open source implementations of column-family databases only support indexing on the primary key, so joining multiple tables on a secondary key causes extensive data scans and is therefore inefficient. Finally, it is very common for a social application to traverse a graph of persons based on their relationships. However, the current query interface of NoSQL databases is not adequate for graph traversal, because NoSQL only supports very primitive operations such as get and set, and it is not intuitive to express graph traversal with these primitive operations.
Graph databases provide better support for social computing. Data relations are first-class citizens in a graph database, because a graph database is optimized for structural queries. In order to support a large-scale social computing application, we need a large-scale graph database, which in turn requires a distributed graph data store and a distributed data processing system. To the best of our knowledge, there has not been any open source, cloud-ready distributed graph database available for social computing. This lack of support for large-scale social computing motivates us to develop our distributed graph database. Our system consists of a distributed graph data store and a parallel graph processing system. The graph data store provides indexing on nodes and edges for processing efficiency, and a user-friendly data manipulation interface to facilitate graph data processing. The parallel graph processing system provides the capability to process extremely large social networks.
We conduct experiments to demonstrate the efficiency of our system by running representative applications on real-world large-scale social networks, including YouTube, Flickr, LiveJournal, and Orkut. Experimental results indicate that our system outperforms the Hadoop file system in subgraph computation on social networks.
Current Research Results
"Inter-Layer Bit-Allocation for Scalable Video Coding," IEEE Transactions on Image Processing, May 2012.
Authors: Guan-Ju Peng, Wen-Liang Hwang, and Sao-Jie Chen

Abstract:
In this paper, we present a theoretical analysis of the distortion in multilayer coding structures. Specifically, we analyze the prediction structure used to achieve temporal, spatial, and quality scalability in scalable video coding (SVC) and show that the average peak signal-to-noise ratio (PSNR) of SVC is a weighted combination of the bit rates assigned to all the streams. Our analysis utilizes the end user's preference for certain resolutions. We also propose a rate-distortion (R-D) optimization algorithm and compare its performance with that of a state-of-the-art scalable bit allocation algorithm. The reported experimental results demonstrate that the R-D algorithm significantly outperforms the compared approach in terms of average PSNR.
"Random Error Reduction in Similarity Search on Time Series: A Statistical Approach," The 28th IEEE International Conference on Data Engineering (ICDE-2012), April 2012.
Authors: Wush Chi-Hsuan Wu, Mi-Yen Yeh, and Jian Pei

Abstract:
Errors in measurement can be categorized into two types: systematic errors that are predictable, and random errors that are inherently unpredictable and have null expected value. Random error is always present in a measurement. More often than not, readings in time series may contain inherent random errors due to causes like dynamic error, drift, noise, hysteresis, digitalization error and limited sampling frequency. Random errors may affect the quality of time series analysis substantially. Unfortunately, most of the existing time series mining and analysis methods, such as similarity search, clustering, and classification tasks, do not address random errors, possibly because random error in a time series, which can be modeled as a random variable of unknown distribution, is hard to handle. In this paper, we tackle this challenging problem. Taking similarity search as an example, which is an essential task in time series analysis, we develop MISQ, a statistical approach for random error reduction in time series analysis. The major intuition in our method is to use only the readings at different time instants in a time series to reduce random errors. We achieve a highly desirable property in MISQ: it can ensure that the recall is above a user-specified threshold. An extensive empirical study on 20 benchmark real data sets clearly shows that our method can lead to better performance than the baseline method without random error reduction in real applications such as classification. Moreover, MISQ achieves good quality in similarity search.
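As a minimal simulation of the statistical premise only (zero-mean random error averages out), not of the MISQ method itself:

```python
# Tiny simulation of the premise behind random-error reduction: zero-mean
# random error inflates the apparent distance from the true series, and
# aggregating k readings shrinks it (noise variance falls as 1/k). This
# illustrates the statistical premise only, not the MISQ algorithm.
import numpy as np

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 6.28, 200))
noisy = lambda k: truth + rng.normal(0, 0.5, (k, 200)).mean(axis=0)

for k in (1, 4, 16):                      # k readings averaged per instant
    print(k, np.linalg.norm(noisy(k) - truth))
```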
Current Research Results
"The impact of normalization and phylogenetic information on estimating the distance metrics for metagenomes," IEEE Transactions on Computational Biology and Bioinformatics, April 2012.
Authors: Su, C.H., Wang, T.Y., Hsu, M.T., Weng, F., Kao, C.Y., Wang, D., and Tsai, H.K.

Abstract:
Metagenomics enables the direct study of unculturable microorganisms in different environments. Discriminating between the compositional differences of metagenomes is an important and challenging problem. Several distance functions have been proposed to estimate the differences based on functional profiles or taxonomic distributions; however, the strengths and limitations of such functions are still unclear. Initially, we analyzed three well-known distance functions and found very little difference between them in the clustering of samples. This motivated us to incorporate suitable normalizations and phylogenetic information into the functions so that we could cluster samples from both real and synthetic data sets. The results indicate that sample clustering derived by rank-based normalization with phylogenetic information improves significantly, regardless of whether the samples are from real or synthetic microbiomes. Furthermore, our findings suggest that considering suitable normalizations and phylogenetic information is essential when designing distance functions for estimating the differences between metagenomes. We conclude that incorporating rank-based normalization with phylogenetic information into the distance functions helps achieve reliable clustering results.
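A minimal sketch of the rank-based normalization idea, applied before a distance computation; the phylogenetic weighting the paper also studies is omitted, and the abundance profiles are invented:

```python
# Rank-based normalization before computing a distance between two
# metagenome profiles. The paper additionally incorporates phylogenetic
# information; that part is omitted here.
import math

def ranks(profile):
    """Map each taxon's abundance to its rank (average rank for ties)."""
    order = sorted(range(len(profile)), key=lambda i: profile[i])
    r = [0.0] * len(profile)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and profile[order[j + 1]] == profile[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # 1-based average rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def rank_distance(p, q):
    return math.dist(ranks(p), ranks(q))   # Euclidean distance on ranks

sample_a = [120, 3, 3, 40, 0]   # raw taxon abundances (toy values)
sample_b = [90, 10, 2, 55, 1]
print(rank_distance(sample_a, sample_b))
```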
Current Research Results
Authors: Jhih-Siang Lai, Cheng-Wei Cheng, Ting-Yi Sung*, Wen-Lian Hsu*

Abstract:

Secretome analysis is important in pathogen studies. A fundamental and convenient way to identify secreted proteins is to first predict signal peptides, which are essential for protein secretion. However, signal peptides are highly complex functional sequences that are easily confused with transmembrane domains. Such confusion would obviously affect the discovery of secreted proteins. Transmembrane proteins are important drug targets, but very few transmembrane protein structures have been determined experimentally; hence, prediction of the structures is essential. In the field of structure prediction, researchers do not make assumptions about organisms, so there is a need for a general signal peptide predictor. To improve signal peptide prediction without prior knowledge of the associated organisms, we present a machine-learning method, called SVMSignal, which uses biochemical properties as features, as well as features acquired from a novel encoding, to capture biochemical profile patterns for learning the structures of signal peptides directly.
We tested SVMSignal and five popular methods on two benchmark datasets from the SPdb and UniProt/Swiss-Prot databases, respectively. Although SVMSignal was trained on an old dataset, it performed well, and the results demonstrate that learning the structures of signal peptides directly is a promising approach. We also utilized SVMSignal to analyze proteomes in the entire HAMAP microbial database. Finally, we conducted a comparative study of secretome analysis on seven tuberculosis-related strains selected from the HAMAP database. We identified ten potential secreted proteins, two of which are drug resistant and four are potential transmembrane proteins. SVMSignal is publicly available at http://bio-cluster.iis.sinica.edu.tw/SVMSignal. It provides user-friendly interfaces and visualizations, and the prediction results are available for download.
 

"Cluster-Based Multi-version B+-Tree in Flash-Based Embedded Database Systems," Work-in-Progress (WiP) Session of IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), April 2012.
Authors: Jian-Tao Wang, Kam-Yiu Lam, Yuan-Hao Chang, Jen-Wei Hsieh, Song Han, Yuang-Hung Kuan, and Aloysius K. Mok

Abstract:
Due to the unique performance characteristics of flash memory as compared with disk storage systems, conventional multi-version B+-tree index structures are inefficient and can incur heavy update cost for databases residing in flash memory. In this paper, we propose a novel B+-tree index structure, called the Cluster-Based Multi-version B+-tree (CMVBT), for indexing multiple versions of data items in flash-based embedded database systems. In CMVBT, successive versions of the same data item are allocated consecutively in a version block to reduce the size of the index structure and the overhead of updating it when processing insert and update operations from applications. In addition, the reduced index structure of CMVBT helps significantly reduce the data searching cost when processing queries that require access to a range of versions of a data item.
"Closed-Form Mortgage Pricing Formula with Outstanding Principal as Prepayment Value," proceedings IEEE Computational Intelligence for Financial Engineering and Economics, March 2012.
Authors: Yi-Cheng Tsai, Zheng-Hui Chen, Jan-Ming Ho, Ming-Yang Kao and Szu-Lang Liao

Abstract:
This study considers all the possible actions a borrower may take during the mortgage horizon, i.e., to default, to prepay, or to maintain the mortgage. We provide an effective and accurate pricing formula that not only considers the effect of default on the mortgage value, but also more accurately explores the impact of prepayment risk. In our model, we define the prepayment value of the mortgage as the amount of outstanding principal, whereas previous literature defines the prepayment value as a constant proportion of the maintaining value of the mortgage. Finally, based on the closed-form pricing formula, we analyze the yield, duration, and convexity of risky mortgage loans, providing a better framework for risk management.
Current Research Results
Authors: Xiyuan Hu, Silong Peng, and Wen-Liang Hwang

Abstract:
Empirical mode decomposition (EMD) is an adaptive and data-driven approach for analyzing multicomponent nonlinear and nonstationary signals. The stop criterion, envelope technique, and mode-mixing problem are the most important topics that need to be addressed in order to improve the EMD algorithm. In this paper, we study the envelope technique and the mode-mixing problem caused by separating multicomponent AM-FM signals with the EMD algorithm. We present a new necessary condition on the envelope that questions the current assumption that the envelope passes through the extreme points of an intrinsic mode function (IMF). Then, we present a solution to the mode-mixing problem that occurs when multicomponent AM-FM signals are separated. We experiment on several signals, including simulated signals and real-life signals, to demonstrate the efficacy of the proposed method in resolving the mode-mixing problem.
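For orientation, one sifting iteration of classical EMD can be sketched as follows; this is the textbook extrema-interpolation envelope whose underlying assumption the paper questions, not the authors' proposed method. It assumes NumPy and SciPy are available.

```python
# One sifting iteration of classical EMD: upper/lower envelopes are cubic
# splines through local maxima/minima, and their mean is subtracted. The
# paper argues this extrema-interpolation envelope assumption is itself
# questionable for multicomponent AM-FM signals.
import numpy as np
from scipy.interpolate import CubicSpline

def sift_once(t, x):
    d = np.diff(x)
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    if len(maxima) < 2 or len(minima) < 2:
        return x                       # too few extrema to build envelopes
    upper = CubicSpline(t[maxima], x[maxima])(t)
    lower = CubicSpline(t[minima], x[minima])(t)
    return x - (upper + lower) / 2     # candidate IMF after one iteration

t = np.linspace(0, 1, 1000)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
imf_candidate = sift_once(t, signal)
```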
Current Research Results
"An Adaptive File-System-Oriented FTL Mechanism for Flash-Memory Storage Systems," ACM Transactions on Embedded Computing Systems, March 2012.
Authors: Yuan-Hao Chang, Po-Liang Wu, Tei-Wei Kuo, and Shih-Hao Hung

Abstract:
As flash memory becomes popular across various platforms, there is a strong demand to address the performance degradation problem caused by the special characteristics of flash memory. This research proposes the design of a file-system-oriented flash translation layer, in which a filter mechanism separates the access requests of file-system metadata and file contents for better performance. A recovery scheme is then proposed for maintaining the integrity of a file system. The proposed flash translation layer is implemented as a Linux device driver and evaluated with respect to the ext2 and ext3 file systems. Experiments were also conducted over NTFS with a series of realistic traces. The experimental results show significant performance improvement over the ext2, ext3, and NTFS file systems with limited system overheads.
Current Research Results
Authors: Yu-Bin Wang, Shu-Hwa Chen, Chun-Yen Lin and Jr-Kai Yu

Abstract:
Cephalochordates, commonly called amphioxus or lancelets, are now considered the most basal chordate group, and their study therefore offers important insights into various levels of evolutionary biology. In the past two decades, the investigation of amphioxus developmental biology has provided key knowledge for understanding the basic patterning mechanisms of chordates. Comparative genome studies of vertebrates and amphioxus have uncovered clear evidence for the two-round whole genome duplication hypothesis during early vertebrate evolution and have shed light on the evolution of morphological novelties in the complex vertebrate body plan. Complementary to the amphioxus genome sequencing project, a large collection of expressed sequence tags (ESTs) has been generated for amphioxus in recent years, representing a rich resource for gene discovery, expression profiling, and molecular developmental studies. Here, we review previous EST analyses and available cDNA resources in amphioxus and discuss their value for evolutionary and developmental studies. We also discuss potential opportunities and advantages for applying high-throughput, next-generation sequencing (NGS) technologies in amphioxus research.
"Counter-Example Guided Fence Insertion under TSO," Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2012), March 2012.
Authors: Parosh Abdulla, Mohamed Faouzi Atig, Yu-Fang Chen, Carl Leonardsson and Ahmed Rezine

Abstract:
We give a sound and complete procedure for fence insertion for concurrent finite-state programs running under the classical TSO memory model. This model allows "write to read" relaxation corresponding to the addition of an unbounded store buffer between each processor and the main memory. We introduce a novel machine model, called the Single-Buffer (SB) semantics, and show that the reachability problem for a program under TSO can be reduced to the reachability problem under SB. We present a simple and effective backward reachability analysis algorithm for the latter, and propose a counter-example guided fence insertion procedure. The procedure is augmented by a placement constraint, that allows the user to choose the places inside the program where fences may be inserted. For a given placement constraint, the method infers automatically all minimal sets of fences that ensure correctness of the program. We have implemented a prototype and run it successfully on all standard benchmarks, together with several challenging examples that are beyond the applicability of existing methods.
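A toy simulation of the store-buffer relaxation may help: under TSO, both threads of the classic Dekker idiom can read the other's flag as 0, an outcome sequential consistency forbids, and a fence flushes the buffer to restore the expected behavior. This merely motivates the memory model; the paper's SB semantics and reachability analysis are far beyond this sketch.

```python
# Tiny illustration of the TSO "write to read" relaxation: each thread's
# writes sit in a FIFO store buffer before reaching shared memory.
memory = {"x": 0, "y": 0}
buf = {1: [], 2: []}                 # per-thread FIFO store buffers

def write(tid, var, val):
    buf[tid].append((var, val))      # buffered, not yet visible to others

def read(tid, var):
    for v, val in reversed(buf[tid]):  # a thread sees its own latest write
        if v == var:
            return val
    return memory[var]

def fence(tid):                      # flush the store buffer to memory
    for var, val in buf[tid]:
        memory[var] = val
    buf[tid].clear()

write(1, "x", 1)                     # thread 1: x = 1; r1 = y
write(2, "y", 1)                     # thread 2: y = 1; r2 = x
print(read(1, "y"), read(2, "x"))    # 0 0: the relaxed outcome under TSO
fence(1); fence(2)                   # with the buffers flushed...
print(read(1, "y"), read(2, "x"))    # ...the reads see 1 1
```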
"Pricing Discrete Asian Barrier Options on Lattices," Proceedings IEEE Computational Intelligence for Financial Engineering and Economics, March 2012.
Authors: William W.Y. Hsu, Cheng-Yu Lu, Ming-Yang Kao, and Jan-Ming Ho

Abstract:
Asian barrier options are barrier options whose trigger is based on an average underlying price. They provide the advantages of both Asian options and barrier options. This paper introduces the first quadratic-time lattice algorithm to price European-style Asian barrier options. It is by far the most efficient lattice algorithm with convergence guarantees.
The algorithm relies on Lagrange multipliers to optimally distribute the number of states for each node of the multinomial lattice. We also present experimental results that demonstrate the effectiveness and efficiency of our algorithm in comparison with Monte Carlo simulations.
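For background on what pricing on a lattice means, here is a plain CRR binomial lattice for a European up-and-out call; the paper's contribution, handling the average-price trigger in quadratic time via Lagrange multipliers, is not reproduced here, and all parameters are illustrative.

```python
# Plain CRR binomial lattice for a European up-and-out call -- background
# only. The Asian-barrier case needs extra per-node state for the running
# average, which this sketch does not implement.
import math

def up_and_out_call(S0, K, H, r, sigma, T, steps):
    dt = T / steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1 / u
    p = (math.exp(r * dt) - d) / (u - d)       # risk-neutral up probability
    disc = math.exp(-r * dt)
    # Terminal payoffs; a node at or above the barrier H is knocked out.
    prices = [S0 * u**j * d**(steps - j) for j in range(steps + 1)]
    values = [max(s - K, 0.0) if s < H else 0.0 for s in prices]
    for i in range(steps - 1, -1, -1):         # backward induction
        for j in range(i + 1):
            s = S0 * u**j * d**(i - j)
            values[j] = 0.0 if s >= H else disc * (
                p * values[j + 1] + (1 - p) * values[j])
    return values[0]

print(up_and_out_call(S0=100, K=100, H=120, r=0.05, sigma=0.2, T=1.0, steps=500))
```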
Current Research Results
Authors: Chung-Ming Yu, Hung-Pin Peng, Ing-Chien Chen, Yu-Ching Lee, Jun-Bo Chen, Keng-Chang Tsai, Ching-Tai Chen, Jeng-Yih Chang, Ei-Wen Yang, Po-Chiang Hsu, Jhih-Wei Jian, Hung-Ju Hsu, Hung-Ju Chang, Wen-Lian Hsu, Kai-Fa Huang, Alex Che Ma and An-Suei Yang

Abstract:
Protein-protein interactions are critical determinants in biological systems. Engineered proteins binding to specific areas on protein surfaces could lead to therapeutics or diagnostics for treating diseases in humans. However, designing epitope-specific protein-protein interactions with computational atomistic interaction free energy remains a difficult challenge. Here we show that, with the antibody-VEGF (vascular endothelial growth factor) interaction as a model system, the experimentally observed amino acid preferences in the antibody-antigen interface can be rationalized with 3-dimensional distributions of interacting atoms derived from the database of protein structures. Machine learning models established on the rationalization can be generalized to design amino acid preferences in antibody-antigen interfaces, for which the experimental validations are tractable with current high-throughput synthetic antibody display technologies. Leave-one-out cross validation on the benchmark system yielded accuracy, precision, recall (sensitivity), and specificity of the overall binary predictions of 0.69, 0.45, 0.63, and 0.71, respectively, and the overall Matthews correlation coefficient of the 20 amino acid types in the 24 interface CDR positions was 0.312. The structure-based computational antibody design methodology was further tested with other antibodies binding to VEGF. The results indicate that the methodology could provide alternatives to the current antibody technologies based on animal immune systems in engineering therapeutic and diagnostic antibodies against predetermined antigen epitopes.
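For readers unfamiliar with the reported measures, the standard binary-classification definitions are given below; the confusion-matrix counts are invented for illustration (chosen merely to land near the reported rates) and are not the paper's data.

```python
# Standard definitions of the reported metrics, computed from a binary
# confusion matrix. The counts here are made up, not the paper's data.
import math

def metrics(tp, fp, tn, fn):
    acc  = (tp + tn) / (tp + fp + tn + fn)
    prec = tp / (tp + fp)
    rec  = tp / (tp + fn)                      # recall = sensitivity
    spec = tn / (tn + fp)
    mcc  = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, prec, rec, spec, mcc

print(metrics(tp=63, fp=77, tn=188, fn=37))   # approx. 0.69, 0.45, 0.63, 0.71
```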
"The Development of a Real-time Valuation Service of Financial Derivatives," proceedings IEEE Computational Intelligence for Financial Engineering and Economics, March 2012.
Authors: Hsin-Tsung Peng, Chi-Fang Chang, Szu-Lang Liao, Ming-Yang Kao, Feipei Lai, and Jan-Ming Ho

Abstract:
Financial derivative valuation is key to the adoption of the International Financial Reporting Standards (IFRS), which are based on fair value accounting. When derivatives do not have an active market, the inputs and methods for estimating their fair value are more subjective, and the valuation is therefore less reliable. The goal of this research is to design a derivative valuation service that helps accounting professionals meet the reliability requirement of the IFRS which, in the derivative setting, requires that valuations be objective and free from errors. First, we incorporate the various valuation models and their risk factors into the service as the basis for providing objective valuation results. Second, we provide a user interface that simplifies the management of parameters and provisions of derivative term sheets, to avoid user errors in parsing the term sheets. Third, when users hold a large number of derivatives, the valuation must meet real-time constraints in financial reporting; we therefore developed the service in a parallel computing environment to reduce the computational time of the valuation process. Empirical results of derivative valuation are also presented.
Current Research Results
"A Caching-Oriented Management Design for the Performance Enhancement of Solid-State Drives," ACM Transactions on Storage, February 2012.
Authors: Yuan-Hao Chang, Cheng-Kang Hsieh, Po-Chun Huang, and Pi-Cheng Hsiu

Abstract:
While solid-state drives are excellent alternatives to hard disks in mobile devices, a number of performance and reliability issues need to be addressed. In this work, we design an efficient flash management scheme to improve the performance of low-cost MLC flash memory devices. Specifically, we target multi-chipped flash memory devices with cache support, and develop a two-level address translation mechanism with an adaptive caching policy. We evaluated the approach on real workloads. The results demonstrate that it can improve the performance of multi-chipped solid-state drives through logical-to-physical mappings and concurrent accesses to flash chips.
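A minimal sketch of two-level logical-to-physical translation with a demand-loaded mapping cache, in the spirit of the mechanism described; the paper's adaptive caching policy and multi-chip concurrency are omitted, and all names and sizes are illustrative assumptions.

```python
# Two-level L2P translation sketch: a first-level directory locates the
# mapping page, a second-level entry inside it holds the physical page.
# Mapping pages are cached in RAM with LRU eviction. Sizes are toy values.
from collections import OrderedDict

ENTRIES_PER_MAP_PAGE = 512          # L2P entries stored in one mapping page
CACHE_CAPACITY = 4                  # cached mapping pages (tiny, for demo)

map_pages = {}                      # "on-flash" store: map-page id -> {lpn: ppn}
cache = OrderedDict()               # in-RAM LRU cache of mapping pages

def load_map_page(mp):
    """Fetch a mapping page into the LRU cache, evicting if full."""
    if mp in cache:
        cache.move_to_end(mp)                 # refresh recency
    else:
        if len(cache) >= CACHE_CAPACITY:
            victim, page = cache.popitem(last=False)
            map_pages[victim] = page          # write back the evicted page
        cache[mp] = map_pages.setdefault(mp, {})
    return cache[mp]

def translate(lpn):
    """First level: which mapping page; second level: entry within it."""
    return load_map_page(lpn // ENTRIES_PER_MAP_PAGE).get(lpn)

def update(lpn, ppn):
    load_map_page(lpn // ENTRIES_PER_MAP_PAGE)[lpn] = ppn

update(1000, 7)
print(translate(1000))   # 7
```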
Current Research Results
"Optimal Multiresolution Blending of Confocal Microscope Images," IEEE Transaction on Biomedical Engineering, February 2012.
Authors: Hao-Chiang Shao, Wen-Liang Hwang, and Yung-Chang Chen

Abstract:
Typical mosaicing schemes assume that to-be-combined images are equally informative; thus, the images are processed in a similar manner. However, the new imaging technique for confocal fluorescence images has revealed a problem when two asymmetrically informative biological images are stitched during microscope image mosaicing, a process widely used in biological studies to generate a higher resolution image by combining multiple images taken at different times and angles. To resolve this problem, we propose a multiresolution optimization approach that evaluates the blending coefficients based on the relative importance of the overlapping regions of the to-be-combined image pair. The blending coefficients are the optimal solution obtained by a quadratic programming algorithm with constraints that are enforced by the biological requirements. We demonstrate the efficacy of the proposed approach on several confocal microscope fluorescence images and compare the results with those derived by other methods.
Current Research Results
Authors: Chin-I Chang, Pei-Hsin Hung, Chia-Che Wu, Ta Chih Cheng, Jyh-Ming Tsai, King-Jung Lin and Chung-Yen Lin

Abstract:
We coupled 16S rDNA PCR and DNA hybridization technology to construct a microarray for simultaneous detection and discrimination of eight fish pathogens (Aeromonas hydrophila, Edwardsiella tarda, Flavobacterium columnare, Lactococcus garvieae, Photobacterium damselae, Pseudomonas anguilliseptica, Streptococcus iniae and Vibrio anguillarum) commonly encountered in aquaculture. The array comprised short oligonucleotide probes (30 mer) complementary to the polymorphic regions of 16S rRNA genes for the target pathogens. Targets annealed to the microarray probes were reacted with streptavidin-conjugated alkaline phosphatase and nitro blue tetrazolium/5-bromo-4-chloro-3′-indolylphosphate, p-toluidine salt (NBT/BCIP), resulting in blue spots that are easily visualized by the naked eye. Testing was performed against a total of 168 bacterial strains, i.e., 26 representative collection strains, 81 isolates of target fish pathogens, and 61 ecologically or phylogenetically related strains. The results showed that each probe consistently identified its corresponding target strain with 100% specificity. The detection limit of the microarray was estimated to be in the range of 1 pg for genomic DNA and 10^3 CFU/mL for pure pathogen cultures. These high specificity and sensitivity results demonstrate the feasibility of using DNA microarrays in the diagnostic detection of fish pathogens.
Keywords: fish pathogen detection; 16S rDNA; naked-eye reading microarray
Current Research Results
Authors: Chen, M.J., Chou, L.C., Hsieh, T.T., Lee, D.D., Liu, K.W., Yu, C.Y., Oyang, Y.J., Tsai, H.K.* and Chen, C.Y.*

Abstract:

Motivation: Gene regulation involves complicated mechanisms such as cooperativity between a set of transcription factors (TFs). Previous studies have used target genes shared by two TFs as a clue to infer TF–TF interactions. However, this task remains challenging because the target genes with low binding affinity are frequently omitted by experimental data, especially when a single strict threshold is employed. This article aims at improving the accuracy of inferring TF–TF interactions by incorporating motif discovery as a fundamental step when detecting overlapping targets of TFs based on ChIP-chip data.

Results: The proposed method, simTFBS, outperforms three naïve methods that adopt fixed thresholds when inferring TF–TF interactions based on ChIP-chip data. In addition, simTFBS is compared with two advanced methods and demonstrates its advantages in predicting TF–TF interactions. By comparing simTFBS with predictions based on the set of available annotated yeast TF binding motifs, we demonstrate that the good performance of simTFBS is indeed coming from the additional motifs found by the proposed procedures.

"CloudBrush: A String Graph Approach of De Novo Assembly for High-Throughput Sequencing Data with Cloud Computing," Proceedings the 10th Asia Pacific Bioinformatics Conference, January 2012.
Authors: Yu-Jung Chang, Chien-Chih Chen, Chuen-Liang Chen and Jan-Ming Ho

Abstract:
Since 2005, next-generation sequencing technologies have dramatically accelerated the throughput of DNA sequencing at a much faster rate than the growth rate of computer speed predicted by Moore's Law. The number of reads per genome has also increased dramatically due to the high-throughput sequencing technology. Even loading and running these sequencing data in memory is a problem. There is an urgent need for de novo assemblers to efficiently handle the huge amount of sequencing data using scalable commodity servers in the clouds.

In this paper, we present CloudBrush, a parallel algorithm that runs on the MapReduce framework of cloud computing for de novo assembly of high-throughput sequencing data. The algorithm uses Myers's bi-directed string graphs as its basis and consists of two main stages: graph construction and graph simplification. First, a vertex is defined for each non-redundant sequence read. We present a prefix-and-extend algorithm to identify overlaps between pairs of reads and to reduce transitive edges. The graph is further simplified by conventional operations including linear path compression, dead-end tip removal, and bubble removal. We also present a new operation, similar neighbour detection and edge adjustment, abbreviated as SNEE, to detect and simplify braid structures in the string graph. In addition, we prune edges from one side of a node if at least one of these edges is not similar to the others. Note that, after doing so, all paths in a remaining connected subgraph correspond to similar subsequences of the underlying genome. We then traverse each connected subgraph to find a long path supported by a sufficient number of reads to represent the subgraph.

Preliminary results show that the CloudBrush assembler, compared with Contrail and Edena on the sequencing data of E. coli genomes, may yield longer contigs.
Current Research Results
Authors: Yen-Chen Chen, Yun-Ching Chen, Wen-Dar Lin, Chung-Der Hsiao, Hung-Wen Chiu, and Jan-Ming Ho

Abstract:
In this postgenomic era, a huge volume of information derived from expressed sequence tags (ESTs) has been constructed for functional description of gene expression profiles. Comparative studies have become more and more important to researchers of biology. In order to facilitate these comparative studies, we have constructed a user-friendly EST annotation pipeline with comparison tools on an integrated EST service website, Bio301. Bio301 includes regular EST preprocessing, BLAST similarity search, gene ontology (GO) annotation, statistics reporting, a graphical GO browsing interface, and microarray probe selection tools. In addition, Bio301 is equipped with statistical library comparison functions using multiple EST libraries based on GO annotations for mining meaningful biological information.
Current Research Results
"BibPro: A Citation Parser Based on Sequence Alignment," IEEE Transactions on Knowledge and Data Engineering, January 2012.
Authors: Chien-Chih Chen, Kai-Hsiang Yang, Chuen-Liang Chen and Jan-Ming Ho

Abstract:
The dramatic increase in the number of academic publications has led to growing demand for efficient organization of the resources to meet researchers' needs. As a result, a number of network services have compiled databases from the public resources scattered over the Internet. However, publications by different conferences and journals adopt different citation styles. Accurately extracting metadata from a citation string formatted in one of thousands of different styles is an interesting problem that has attracted a great deal of research attention in recent years. In this paper, based on the notion of sequence alignment, we present a citation parser called BibPro that extracts the components of a citation string. To demonstrate the efficacy of BibPro, we conducted experiments on three benchmark data sets. The results show that BibPro achieved over 90 percent accuracy on each benchmark. Even with citations and associated metadata retrieved from the web as training data, our experiments show that BibPro still achieves reasonable performance.
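The underlying notion of sequence alignment can be illustrated with textbook Needleman-Wunsch dynamic programming on token sequences; BibPro's actual pipeline (encoding citations into protein-like sequences and matching against style templates) is more elaborate, and the template tags below are invented for illustration.

```python
# Textbook Needleman-Wunsch global alignment on token sequences -- the core
# dynamic program behind "sequence alignment," not BibPro's full pipeline.
def align(a, b, match=2, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,     # (mis)match
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[n][m]

# Tokenized citation vs. a tokenized style template (hypothetical tags).
citation = ["Chen", ",", "C.-C.", ",", "2012", ".", "BibPro", "."]
template = ["<AUTHOR>", ",", "<INITIALS>", ",", "<YEAR>", ".", "<TITLE>", "."]
print(align(citation, template))   # punctuation anchors drive the alignment
```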
Current Research Results
"TSCAN: A Content Anatomy Approach to Temporal Topic Summarization," IEEE Transactions on Knowledge and Data Engineering, January 2012.
Authors: Chien Chin Chen, Meng Chang Chen

Abstract:
A topic is defined as a seminal event or activity along with all directly related events and activities. It is represented by a chronological sequence of documents published by different authors on the Internet. In this study, we define a task called topic anatomy, which summarizes and associates the core parts of a topic temporally so that readers can understand the content easily. The proposed topic anatomy model, called TSCAN, derives the major themes of a topic from the eigenvectors of a temporal block association matrix. Then, the significant events of the themes and their summaries are extracted by examining the constitution of the eigenvectors. Finally, the extracted events are associated through their temporal closeness and context similarity to form an evolution graph of the topic. Experiments based on the official TDT4 corpus demonstrate that the generated temporal summaries present the storylines of topics in a comprehensible form. Moreover, in terms of content coverage, coherence, and consistency, the summaries are superior to those derived by existing summarization methods based on human-composed reference summaries.
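A rough sketch of the spectral step, deriving theme vectors as leading eigenvectors of a block association matrix; the paper's exact matrix construction may differ, and the block-term counts here are toy values.

```python
# Theme vectors as leading eigenvectors of a block association matrix, in
# the spirit of TSCAN's first step. Blocks are bag-of-words vectors and
# association is their inner product; the paper's construction may differ.
import numpy as np

blocks = np.array([   # rows: temporal blocks, cols: term frequencies (toy)
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 4, 2, 1],
    [0, 3, 3, 1],
], dtype=float)

assoc = blocks @ blocks.T                 # block-to-block association matrix
eigvals, eigvecs = np.linalg.eigh(assoc)  # symmetric, so eigh is appropriate
themes = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # top-2 theme vectors
print(themes.round(2))   # each column weights the blocks of one theme
```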
Current Research Results
Authors: Chi-Jen Wu, Jan-Ming Ho, and Ming-Syan Chen

Abstract:
Social network applications are becoming increasingly popular on mobile devices. A mobile presence service is an essential component of a social network application because it maintains each mobile user's presence information, such as the current status (online/offline), GPS location and network address, and also updates the user's online friends with the information continually. If presence updates occur frequently, the enormous number of messages distributed by presence servers may lead to a scalability problem in a large-scale mobile presence service. To address the problem, we propose an efficient and scalable server architecture, called PresenceCloud, which enables mobile presence services to support large-scale social network applications. PresenceCloud organizes presence servers into a quorum-based server-to-server architecture for efficient presence searching. It also leverages a directed search algorithm and a one-hop caching strategy to achieve small constant search latency. We analyze the performance of PresenceCloud in terms of the search cost and search satisfaction level. The search cost is defined as the total number of messages generated by the presence server when a user arrives; and search satisfaction level is defined as the time it takes to search for the arriving user's friend list. The results of simulations demonstrate that PresenceCloud achieves performance gains in the search cost without compromising search satisfaction.
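To make the quorum idea concrete, here is the textbook grid-quorum construction, in which any publish quorum (a row) and search quorum (a column) intersect in exactly one server; the abstract does not specify PresenceCloud's exact quorum shape, so this is a generic illustration.

```python
# Generic grid quorum: n servers in a sqrt(n) x sqrt(n) grid, row-major.
# A publish quorum is the server's row; a search quorum is its column; any
# row and column intersect in exactly one server, bounding search cost.
import math

def side(n):
    s = int(math.isqrt(n))
    assert s * s == n, "demo assumes a perfect-square server count"
    return s

def publish_quorum(server, n):
    s = side(n)
    row = server // s
    return {row * s + c for c in range(s)}

def search_quorum(server, n):
    s = side(n)
    col = server % s
    return {r * s + col for r in range(s)}

n = 16
assert all(len(publish_quorum(a, n) & search_quorum(b, n)) == 1
           for a in range(n) for b in range(n))
print(publish_quorum(5, n), search_quorum(10, n))   # intersect at server 6
```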
Current Research Results
Authors: Hsin-Nan Lin, Cédric Notredame, Jia-Ming Chang, Ting-Yi Sung, Wen-Lian Hsu

Abstract:

Most sequence alignment tools can successfully align protein sequences with high levels of sequence identity. The accuracy of the corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized so as to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in the similarity measurement between aligned residues, using substitution matrices, Fourier transforms, sophisticated profile-profile functions, or, more recently, consistency-based approaches.

In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words found across a sizeable fraction of the considered dataset and supported by evolutionary analysis. These words are then used to define a position-specific substitution matrix that better reflects the biological significance of local similarity. The experimental results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non-similar proteins.

We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity. SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.

Current Research Results
"Power Domination in Circular-arc Graphs," Algorithmica, November 2011.
Authors: Chung-Shou Liao and D. T. Lee

Abstract:

A set S ⊆ V is a power dominating set (PDS) of a graph G = (V, E) if every vertex and every edge in G can be observed based on the observation rules of power system monitoring. The power domination problem involves minimizing the cardinality of a PDS of a graph. We consider this combinatorial optimization problem and present a linear time algorithm for finding the minimum PDS of an interval graph if the interval ordering of the graph is provided. In addition, we show that the algorithm, which runs in Θ(n log n) time, where n is the number of intervals, is asymptotically optimal if the interval ordering is not given. We also show that the results hold for the class of circular-arc graphs.
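Under the usual simplified observation rules for graphs, whether a given set S is a PDS can be checked by straightforward propagation, as in the sketch below; the paper's contribution is the linear-time algorithm on interval and circular-arc graphs, which is not reproduced here.

```python
# Checks whether S is a power dominating set under the simplified rules:
# (1) S and all its neighbors start observed; (2) repeatedly, an observed
# vertex with exactly one unobserved neighbor observes that neighbor
# (Kirchhoff-law propagation). Brute-force check for general graphs only.
def is_power_dominating(adj, S):
    observed = set(S)
    for v in S:
        observed |= adj[v]                       # rule 1: domination step
    changed = True
    while changed:                               # rule 2: propagation step
        changed = False
        for v in list(observed):
            unobserved = adj[v] - observed
            if len(unobserved) == 1:
                observed |= unobserved
                changed = True
    return len(observed) == len(adj)

# A path on 5 vertices is power dominated by the single vertex {2}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(is_power_dominating(path, {2}))    # True
print(is_power_dominating(path, set()))  # False
```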


"Solving large-scale multi-label SVM problems with a tree decomposition approach," The 3rd Asian Conference on Machine Learning, November 2011.
Authors: Fu Chang and Chan-Cheng Liu

Abstract:
We propose a tree decomposition approach for solving large-scale multi-label classification problems. In this approach, we follow a convention of first transforming the problem into a number of one-against-others classification problems. Then, to solve each transformed problem, we use a decision tree to decompose the corresponding data space and train local SVMs on the decomposed regions. The resultant classifier is called the decision tree support vector machine (DTSVM). The approach has the following advantages. First, when a nonlinear SVM is used as the learning machine, DTSVM requires much shorter training time than a global SVM (gSVM) while achieving comparable test accuracy. Second, when a linear SVM is used as the learning machine, DTSVM often achieves higher test accuracy than gSVM. Third, in textual applications, linear DTSVM plays the leading role among linear gSVM, non-linear gSVM, and non-linear DTSVM for the following reason: linear DTSVM achieves higher test accuracy than linear gSVM; moreover, it achieves comparable test accuracy to, but requires much less training time than, non-linear gSVM and non-linear DTSVM.
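A sketch of the decompose-then-local-SVM idea using scikit-learn; the tree depth, kernel, and data are illustrative, not the paper's experimental setup.

```python
# DTSVM-style sketch: a shallow decision tree partitions the data space, one
# local SVM is trained per impure leaf, and a test point is routed through
# the tree to its leaf's local model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)

leaf_models = {}
leaf_of = tree.apply(X)                      # leaf id of each training point
for leaf in np.unique(leaf_of):
    Xi, yi = X[leaf_of == leaf], y[leaf_of == leaf]
    # Pure leaves need no SVM; mixed leaves get a local nonlinear SVM.
    leaf_models[leaf] = (SVC(kernel="rbf").fit(Xi, yi)
                         if len(np.unique(yi)) > 1 else int(yi[0]))

def predict(x):
    model = leaf_models[tree.apply(x.reshape(1, -1))[0]]
    return model.predict(x.reshape(1, -1))[0] if isinstance(model, SVC) else model

print(predict(X[0]), y[0])                   # local prediction vs. true label
```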
Current Research Results
Authors: Wang, T.Y., Su, C.H. and Tsai, H.K.*

Abstract:

Motivation: Metagenomics involves sampling and studying the genetic materials in microbial communities. Several statistical methods have been proposed for comparative analysis of microbial community compositions. Most of the methods are based on the estimated abundances of taxonomic units or functional groups from metagenomic samples. However, such estimated abundances might deviate from the true abundances in habitats due to sampling biases and other systematic artifacts in metagenomic data processing.

Results: We developed the MetaRank scheme to convert abundances into ranks. MetaRank employs a series of statistical hypothesis tests to compare abundances within a microbial community and determine their ranks. We applied MetaRank to synthetic samples and real metagenomes. The results confirm that MetaRank can reduce the effects of sampling biases and clarify the characteristics of metagenomes in comparative studies of microbial communities. Therefore, MetaRank provides a useful rank-based approach to analyzing microbiomes.

"Text Patterns and Compression Models for Semantic Class Learning," Proceedings of the 5th International Joint Conference on Natural Language Processing, November 2011.
Authors: Chung-Yao Chuang, Yi-Hsun Lee, and Wen-Lian Hsu

Abstract:
This paper proposes a weakly-supervised approach for extracting instances of semantic classes. This method constructs simple wrappers automatically based on specified seed instances and uses a compression model to assess the contextual evidence of its extraction. By adopting this compression model, our approach can better avoid erroneous extractions in a noisy corpus such as the Web. The empirical results show that our system performs quite consistently even when operating on a noisy text with a lot of possibly irrelevant documents.
"Mining Fuzzy Domain Ontology Based on Concept Vector from Wikipedia Category Network," Proceedings Web Intelligence Conference, August 2011.
Authors: Cheng-Yu Lu, Shou-Wei Ho, Jen-Ming Chung, Hahn-Ming Lee, and Jan-Ming Ho

Abstract:
Ontologies are essential in the formalization of domain knowledge for effective human-computer interactions (e.g., expert finding). Many researchers have proposed approaches to measure the similarity between concepts by accessing a fuzzy domain ontology. However, constructing domain ontologies turns out to be labor-intensive and tedious. In this paper, we propose an approach to mine domain concepts from the Wikipedia Category Network, and to generate fuzzy relations based on a concept vector extraction method that measures the relatedness between a single term and a concept. Our methodology conceptualizes domain knowledge by mining the Wikipedia Category Network. An empirical experiment on a TREC dataset evaluates the robustness of the approach. The results show that the fuzzy domain ontology derived by the proposed approach achieves satisfactory accuracy in information retrieval tasks.
"Using Web-Mining Approach for Scholar Measurement and Recommendation," Proceedings International Conference on Web Intelligence, August 2011.
Authors: Jen-Ming Chung, Chi-Jen Wu, Cheng-Yu Lu and Jan-Ming Ho

Abstract:
Scholars usually spend a great deal of time searching for and reading the papers of key researchers. However, objectively determining the key researchers of a topic relies on several measurements, such as publications, citations, and recent academic activities. In this paper, we propose a prototype scholar search and recommendation system based on a web mining approach to expert finding. The system ranks scholars and returns the top-k of them. A new ranking measure, namely the p-index, is designed to reveal the scholar ranking of a certain field. We use a real-world dataset to test the robustness of the system; the experimental results show that our approach outperforms other existing approaches and that users are highly interested in using the system again.
Current Research Results
Authors: Jiann-Horng Leu, Shu-Hwa Chen, Yu-Bin Wang, Yen-Chen Chen, Sheng-Yao Su, Chung-Yen Lin*, Jan-Ming Ho, Chu-Fang Lo*

Abstract:
The Penaeus Genome Database (PAGE) is a genome database with integrated analysis tools, oriented toward the Penaeus genome and covering over 200,000 expressed sequence tags (ESTs). In PAGE, we provide sequences and tentative functional annotations for each assembled contig and EST. Users can easily search these ESTs and contigs by keyword or sequence, within a specific species or across species.

