Current Research
Hubba: Hub Objects Analyzer : A Framework of Interactome Hubs Identification for Network Biology
Authors: Chung-Yen Lin, Chia-Hao Chin, Hsin-Hung Wu, Shu-Hwa Chen, Chin-Wen Ho, and Ming-Tat Ko
Ming-Tat Ko
Chung-Yen Lin
Abstract: One major task in the post-genome era is to reconstruct proteomic and genomic interacting networks using high-throughput experiment data. To identify essential nodes/hubs in these interactomes is a way to decipher the critical keys inside biochemical pathways or complex networks. These essential nodes/hubs may serve as potential drug-targets for developing novel therapy of human diseases, such as cancer or infectious disease caused by emerging pathogens. Hub Objects Analyzer (Hubba) is a web-based service for exploring important nodes in an interactome network generated from specific small- or large-scale experimental methods based on graph theory. Two characteristic analysis algorithms, Maximum Neighborhood Component (MNC) and Density of Maximum Neighborhood Component (DMNC) are developed for exploring and identifying hubs/essential nodes from interactome networks. Users can submit their own interaction data in PSI format (Proteomics Standards Initiative, version 2.5 and 1.0), tab format and tab with weight values. User will get an email notification of the calculation complete in minutes or hours, depending on the size of submitted dataset. Hubba result includes a rank given by a composite index, a manifest graph of network to show the relationship amid these hubs, and links for retrieving output files. This proposed method (DMNC || MNC) can be applied to discover some unrecognized hubs from previous dataset. For example, most of the Hubba high-ranked hubs (80% in top 10 hub list, and >70% in top 40 hub list) from the yeast protein interactome data (Y2H experiment) are reported as essential proteins. Since the analysis methods of Hubba are based on topology, it can also be used on other kinds of networks to explore the essential nodes, like networks in yeast, rat, mouse and human. The website of Hubba is freely available at http://hub.iis.sinica.edu.tw/Hubba.
Evolutionary conservation of DNA-contact residues in DNA-binding domains
Authors: Yao-Lin Chang , Huai-Kuang Tsai , Cheng-Yan Kao , Yung-Chian Chen , Yuh-Jyh Hu and Jinn-Moon Yang
Huai-Kuang Tsai
Abstract:
Background
DNA-binding proteins are of utmost importance to gene regulation. The identification of DNA-binding domains is useful for understanding the regulation mechanisms of DNA-binding proteins. In this study, we proposed a method to determine whether a domain or a protein can has DNA binding capability by considering evolutionary conservation of DNA-binding residues.
Results
Our method achieves high precision and recall for 66 families of DNA-binding domains, with a false positive rate less than 5% for 250 non-DNA-binding proteins. In addition, experimental results show that our method is able to identify the different DNA-binding behaviors of proteins in the same SCOP family based on the use of evolutionary conservation of DNA-contact residues.
Conclusion
This study shows the conservation of DNA-contact residues in DNA-binding domains. We conclude that the members in the same subfamily bind DNA specifically and the members in different subfamilies often recognize different DNA targets. Additionally, we observe the co-evolution of DNA-contact residues and interacting DNA base-pairs.
Using Incremental PLSI for Threshold-Resilient Online Event Analysis
Authors: Tzu-Chuan Chou, Meng Chang Chen
Meng Chang Chen
Abstract: The goal of on-line event analysis is to detect events and their associated documents in real-time from a continuous stream of documents generated by multiple information sources. Existing approaches (e.g., window-based, decay function, and adaptive threshold methods) incorporate the temporal relations of documents into traditional text categorization methods for event analysis. However, these methods suffer from the threshold dependence problem, i.e., their performance is only acceptable for a narrow range of thresholds; thus, it is difficult to designate an appropriate threshold in advance. In this paper, we propose a threshold resilient algorithm, called Incremental Probabilistic Latent Semantic Indexing (IPLSI), which can capture the storyline development of an event without the threshold dependence problem. The IPLSI algorithm is theoretically sound and more efficient than naive PLSI approaches. The results of the performance evaluation based on the TDT 4 corpus show that the proposed algorithm reduces the error tradeoff cost of event detection by as much as 14.51% and increases the threshold range for acceptable performance by 300% - 800%
Algebra of programming using dependent types
Authors: Shin-Cheng Mu, Hsiang-Shang Ko, and Patrik Jansson
Shin-Cheng Mu
Abstract: Dependent type theory is rich enough to express that a program satis es an input/output relational speci cation, but it could be hard to construct the proof term. On the other hand, squiggolists know very well how to show that one relation is included in another by algebraic reasoning. We demonstrate how to encode functional and relational derivations in a dependently typed programming language. A program is coupled with an algebraic derivation from a speci cation, whose correctness is guaranteed by the type system.
Discovering gapped binding sites of yeast transcription factors
Authors: Chien-Yu Chen, Huai-Kuang Tsai, Chen-Ming Hsu, Mei-Ju May Chen, Hao-Geng Hung, Grace Tzu-Wei Huang, and Wen-Hsiung Li
Huai-Kuang Tsai
Abstract: A gapped transcription factor-binding site (TFBS) contains one or more highly degenerate positions. Discovering gapped motifs is difficult, because allowing highly degenerate positions in a motif greatly enlarges the search space and complicates the discovery process. Here, we propose a method for discovering TFBSs, especially gapped motifs. We use ChIP-chip data to judge the binding strength of a TF to a putative target promoter and use orthologous sequences from related species to judge the degree of evolutionary conservation of a predicted TFBS. Candidate motifs are constructed by growing compact motif blocks and by concatenating two candidate blocks, allowing 0–15 degenerate positions in between. The resultant patterns are statistically evaluated for their ability to distinguish between target and nontarget genes. Then, a position-based ranking procedure is proposed to enhance the signals of true motifs by collecting position concurrences. Empirical tests on 32 known yeast TFBSs show that the method is highly accurate in identifying gapped motifs, outperforming current methods, and it also works well on ungapped motifs. Predictions on additional 54 TFs successfully discover 11 gapped and 38 ungapped motifs supported by literature. Our method achieves high sensitivity and specificity for predicting experimentally verified TFBSs.
A Content-Centric Framework for Effective Data Dissemination in Opportunistic Networks
Authors: Ling-Jyh Chen, Chen-Hung Yu, Cheng-Long Tseng, Hao-hua Chu, Cheng-Fu Chou
PHOTO
Ling-Jyh Chen
Abstract: In this paper, we address the challenges of content transfer in opportunistic networks, and propose techniques to better facilitate data dissemination based on the characteristics of the content. To investigate this problem from its origins, we propose three message scheduling algorithms: Sequential Forwarding (SF), Full Interleaving (FI), and Block-based Interleaving (BI). Each algorithm is embedded in a specially tailored data dissemination technique to evaluate the benefits of applying it to different types of content and data dissemination methods. Three types of content (file, video and web) are considered and evaluated, and the dissemination methods considered are Layered Multiple Description Coding (LMDC) based and file-based. Using simulations as well as both synthetic and realistic network scenarios, we evaluate the proposed schemes in terms of latency and user perceived quality, and demonstrate how the schemes can achieve much better latency performance for file transfers. Furthermore, we show that using LMDC-based techniques leads to higher user perceived quality, since the end user is allowed to “preview” video file or web content, even before the data has been completely transferred. The effectiveness and robustness of our message scheduling algorithms and their corresponding content dissemination techniques make them ideal solutions that can go a long way toward effective data dissemination in opportunistic networks.
Techniques for Solving the Large-Scale Classification Problem in Chinese Handwriting Recognition
Authors: Fu Chang
PHOTO
Fu Chang
Abstract: Given the large number of categories, or class types, in the Chinese language, the challenge offered by character recognition involves dealing with such a large-scale problem in both training and testing phases. This paper ad-dresses three techniques, the combination of which has been found to be effec-tive in solving the problem. The techniques are: 1) a prototype learn-ing/matching method that determines the number and location of prototypes in the learning phase, and chooses the candidates for each character in the testing phase; 2) support vector machines (SVM) that post-process the top-ranked can-didates obtained during the prototype learning or matching process; and 3) fast feature-vector matching techniques to accelerate prototype matching via deci-sion trees and sub-vector matching. The techniques are applied to Chinese handwritten characters, expressed as feature vectors derived by extraction op-erations, such as nonlinear normalization, directional feature extraction, and feature blurring.
Maximum segment sum is back: deriving algorithms for two segment problems with bounded lengths
Authors: Shin-Cheng Mu
PHOTO
Shin-Cheng Mu
Abstract: It may be surprising that variations of the maximum segment sum (MSS) problem, a textbook example for the squiggolists, are still active topics for algorithm designers. In this paper we examine the new developments from the view of relational program calculation. It turns out that, while the classical MSS problem is solved by the Greedy Theorem, by applying the Thinning Theorem, we get a linear-time algorithm for MSS with upper bound on length. To derive a linear-time algorithm for the em maximum segment density problem, on the other hand, we purpose a variation of thinning based on an extended notion of monotonicity. The concepts of left-negative and right-screw segments emerge from the search for monotonicity conditions. The efficiency of the resulting algorithms crucially relies on exploiting properties of the set of partial solutions and design efficient data structures for them.
The unique probe selector: a comprehensive web service for probe design and oligonucleotide arrays
Authors: Shu-Hwa Chen, Chen-Zen Lo, Ming-Chi Tsai, Chao A Hsiung and Chung-Yen Lin
PHOTO
Chung-Yen Lin
Abstract:
Background: Nucleic acid hybridization, a fundamental technique in molecular biology, can be modified into very effective and sensitive methods for detecting particular targets mixed with millions of non-target sequences. Therefore, avoiding cross-hybridization is the most crucial issue for developing diagnostic methods based on hybridization.
Results: To develop a probe with a high discriminating power, this study constructed a web service, the Unique Probe Selector (UPS), for customized probe design. The UPS service integrates a probe design mechanism and a scoring system for evaluating the performance of probe annealing and the uniqueness of a probe in a user-defined genetic background. Starting from an intuitive web interface, the UPS accepts a query with single or multiple sequences in fasta format. The best probe(s) for each sequence can be downloaded from result pages in a fasta or .csv format with a summary of probe characteristics. The option "Unique probe within group" selects the most unique probe for each target sequence with low probability to hybridize to the other sequences in the same submitted query. The option "Unique probe in the specific organism" devises probes for each submitted sequence to identify its target among selected genetic backgrounds based on Unigene.
Conclusion: The UPS evaluates probe-to-target hybridization under a user-defined condition in silico to ensure high-performance hybridization and minimizes the possibility of non-specific reactions. UPS has been applied to design human arrays for gene expression studies and to develop several small arrays of gene families that were inferred as molecular signatures of cancer typing/ staging or pathogen signatures. Notably, UPS is freely accessible at http://array.iis.sinica.edu.tw/ups/.
Discovering gapped binding sites of yeast transcription factors
Authors: Chien-Yu Chen, Huai-Kuang Tsai, Chen-Ming Hsu, Mei-Ju May Chen, Hao-Geng Hung, Grace Tzu-Wei Huang,and Wen-Hsiung Li
PHOTO
H.K. Tsai
Abstract: A gapped transcription factor-binding site (TFBS) contains one or more highly degenerate positions. Discovering gapped motifs is difficult, because allowing highly degenerate positions in a motif greatly enlarges the search space and complicates the discovery process. Here, we propose a method for discovering TFBSs, especially gapped motifs. We use ChIP-chip data to judge the binding strength of a TF to a putative target promoter and use orthologous sequences from related species to judge the degree of evolutionary conservation of a predicted TFBS. Candidate motifs are constructed by growing compact motif blocks and by concatenating two candidate blocks, allowing 0V15 degenerate positions in between. The resultant patterns are statistically evaluated for their ability to distinguish between target and nontarget genes. Then, a position-based ranking procedure is proposed to enhance the signals of true motifs by collecting position concurrences. Empirical tests on 32 known yeast TFBSs show that the method is highly accurate in identifying gapped motifs, outperforming current methods, and it also works well on ungapped motifs. Predictions on additional 54 TFs successfully discover 11 gapped and 38 ungapped motifs supported by literature. Our method achieves high sensitivity and specificity for predicting experimentally verified TFBSs.
Protein subcellular localization prediction based on compartment-specific features and structure conservation
Authors: Emily Chia-Yu Su, Hua-Sheng Chiu, Allan Lo, Jenn-Kang Hwang, Ting-Yi Sung, and Wen-Lian Hsu
PHOTO
Wen-Lian Hsu
PHOTO
Ting-Yi Sung
Abstract: Background
Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction have led to the development of several methods including composition-based and homology-based methods. However, their performance might be significantly degraded if homologous sequences are not detected. Moreover, methods that integrate various features could suffer from the problem of low coverage in high-throughput proteomic analyses due to the lack of information to characterize unknown proteins.
Results
We propose a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one support vector machines (SVM) model and a structural homology approach. The SVM model comprises a number of binary classifiers, in which biological features derived from Gram-negative bacteria translocation pathways are incorporated. In the structural homology approach, we employ secondary structure alignment for structural similarity comparison and assign the known localization of the top-ranked protein as the predicted localization of a query protein. The hybrid method achieves overall accuracy of 93.7% and 93.2% using ten-fold cross-validation on the benchmark data sets. In the assessment of the evaluation data sets, our method also attains accurate prediction accuracy of 84.0%, especially when testing on sequences with a low level of homology to the training data. A three-way data split procedure is also incorporated to prevent overestimation of the predictive performance. In addition, we show that the prediction accuracy should be approximately 85% for non-redundant data sets of sequence identity less than 30%.
Conclusion
Our results demonstrate that biological features derived from Gram-negative bacteria translocation pathways yield a significant improvement. The biological features are interpretable and can be applied in advanced analyses and experimental designs. Moreover, the overall accuracy of combining the structural homology approach is further improved, which suggests that structural conservation could be a useful indicator for inferring localization in addition to sequence homology. The proposed method can be used in large-scale analyses of proteomes.
BIOSMILE: A semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features
Authors: Richard Tzong-Han Tsai, Wen-Chi Chou, Yu-Chun Lin, Ying-Shan Su, Cheng-Lung Sung, Hong-Jie Dai, Irene Tzu-Hsuan Yeh, Wei Ku, Ting-Yi Sung and Wen-Lian Hsu
PHOTO
Wen-Lian Hsu
PHOTO
Ting-Yi Sung
Abstract: Background
Bioinformatics tools for automatic processing of biomedical literature are invaluable for both the design and interpretation of large-scale experiments. Many information extraction (IE) systems that incorporate natural language processing (NLP) techniques have thus been developed for use in the biomedical field. A key IE task in this field is the extraction of biomedical relations, such as protein-protein and gene-disease interactions. However, most biomedical relation extraction systems usually ignore adverbial and prepositional phrases and words identifying location, manner, timing, and condition, which are essential for describing biomedical relations. Semantic role labeling (SRL) is a natural language processing technique that identifies the semantic roles of these words or phrases in sentences and expresses them as predicate-argument structures. We construct a biomedical SRL system called BIOSMILE that uses a maximum entropy (ME) machine-learning model to extract biomedical relations. BIOSMILE is trained on BioProp, our semi-automatic, annotated biomedical proposition bank. Currently, we are focusing on 30 biomedical verbs that are frequently used or considered important for describing molecular events.
Results
To evaluate the performance of BIOSMILE, we conducted two experiments to (1) compare the performance of SRL systems trained on newswire and biomedical corpora; and (2) examine the effects of using biomedical-specific features. The experimental results show that using BioProp improves the F-score of the SRL system by 21.45% over an SRL system that uses a newswire corpus. It is noteworthy that adding automatically generated template features improves the overall F-score by a further 0.52%. Specifically, ArgM-LOC, ArgM-MNR, and Arg2 achieve statistically significant performance improvements of 3.33%, 2.27%, and 1.44%, respectively.
Conclusion
We demonstrate the necessity of using a biomedical proposition bank for training SRL systems in the biomedical domain. Besides the different characteristics of biomedical and newswire sentences, factors such as cross-domain framesets and verb usage variations also influence the performance of SRL systems. For argument classification, we find that NE (named entity) features indicating if the target node matches with NEs are not effective, since NEs may match with a node of the parsing tree that does not have semantic role labels in the training set. We therefore incorporate templates composed of specific words, NE types, and POS tags into the SRL system. As a result, the classification accuracy for adjunct arguments, which is especially important for biomedical SRL, is improved significantly.
Enhanced Membrane Protein Topology Prediction Using a Hierarchical Classification Method and a New Scoring Function
Authors: Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung, Ping-Chiang Lyu, and Wen-Lian Hsu
PHOTO
Wen-Lian Hsu
PHOTO
Ting-Yi Sung
Abstract: The prediction of transmembrane (TM) helix and topology provides important information about the structure and function of a membrane protein. Due to the experimental difficulties in obtaining a high-resolution model, computational methods are highly desirable. In this paper, we present a hierarchical classification method using support vector machines (SVMs) that integrates selected features by capturing the sequence-to-structure relationship and developing a new scoring function based on membrane protein folding. The proposed approach is evaluated on low- and high-resolution data sets with cross-validation, and the topology (sidedness) prediction accuracy reaches as high as 90%. Our method is also found to correctly predict both the location of TM helices and the topology for 69% of the low-resolution benchmark set. We also test our method for discrimination between soluble and membrane proteins and achieve very low overall false positive (0.5%) and false negative rates (0~1.2%). Lastly, the analysis of the scoring function suggests that the topogeneses of single-spanning and multispanning TM proteins have different levels of complexity, and the consideration of interloop topogenic interactions for the latter is the key to achieving better predictions. This method can facilitate the annotation of membrane proteomes to extract useful structural and functional information. It is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/SVMtop
Improve Parsing Performance by Self-Learning
Authors: Hsieh, Yu-Ming, Duen-Chi Yang, Keh-Jiann Chen
PHOTO
Keh-Jiann Chen
Abstract: There are many methods to improve performance of statistical parsers. Resolving structural ambiguities is a major task of these methods. In the proposed approach, the parser produces a set of n-best trees based on a feature-extended PCFG grammar and then selects the best tree structure based on association strengths of dependency word-pairs. However, there is no sufficiently large Treebank producing reliable statistical distributions of all word-pairs. This paper aims to provide a self-learning method to resolve the problems. The word association strengths were automatically extracted and learned by parsing a giga-word corpus. Although the automatically learned word associations were not perfect, the constructed structure evaluation model improved the bracketed f-score from 83.09% to 86.59%. We believe that the above iterative learning processes can improve parsing performances automatically by learning word-dependence information continuously from web.
On 3-Steiner Root Problem
Authors: Maw-Shang Chang, Ming-Tat Ko
PHOTO
Ming-Tat Ko

Abstract: For a graph G and a positive integer k, the k-power of G is the graph Gk with V (G) as its vertex set and {(u,v)|u,v V (G),dG(u,v) k} as its edge set where dG(u,v) is the distance between u and v in graph G. The k-Steiner root problem on a graph G asks for a tree T with V (G) V (T ) and G is the subgraph of T k induced by V (G). If such a tree T exists, we call it a k-Steiner root of G. This paper gives a linear time algorithm for the 3-Steiner root problem. Consider an un-rooted tree T with leaves one-to-one labeled by the elements of a set V . The k-leaf power of T is a graph, denoted TLk, with TLk =(V,E), where E = {(u,v) u,v Vand dT (u,v) k}. We call T a k-leaf root of TLk . The k-leaf power recognition problem is to decide whether a graph has such a k-leaf root. The complexity of this problem is still open for k 5 [6]. It can be solved in polynomial time if the (k 2)-Steiner root problem can be solved in polynomial time [6]. Our result implies that the k-leaf power recognition problem can be solved in linear time for k = 5.

An Optimal Scheduling Algorithm for an Agent-based Multicast Strategy on Irregular Networks
Authors: P. Liu, Y.-F. Lin, Jan-Jan Wu
PHOTO
Jan-Jan Wu
Abstract: This paper describes an agent-based approach for scheduling multiple multicast on wormhole switch-based networks with irregular topologies. Multicast/broadcast is an important communication pattern, with applications in collective communication operations such as barrier synchronization and global combining. Our approach assigns an agent to each subtree of switches such that the agents can exchange information efficiently and independently. The entire multicast problem is then recursively solved with each agent sending message to those switches that it is responsible for. In this way, communication is localized by the assignment of agents to subtrees. This idea can be easily generalized to multiple multicast since the order of message passing among agents can be interleaved for different multicasts. The key to the performance of this agent-based approach is the message-passing scheduling between agents and the destination processors. We propose an optimal scheduling algorithm, called ForwardInSwitch to solve this problem. We conduct extensive experiments to demonstrate the efficiency of our approach by comparing our results with SPCCO, a highly efficient multicast algorithm reported in literature. We found that SPCCO suffers link contention when the number of simultaneous multiple multicast becomes large. On the other hand, our agent-based approach achieves better performance in large cases.
Boosting Multiclass Learning with Repeating Codes and Weak Detectors for Protein Subcellular Localization
Authors: Chung-Chih Lin, Yuh-Show Tsai , Yu-Shi Lin , Tai-Yu Chiu , Chia-Cheng Hsiung , May-I Lee , Jeremy C. Simpson and Chun-Nan Hsu
PHOTO
Chun-Nan Hsu
Abstract:
Motivation: Determining locations of protein expression is essential to understand protein function. Advances in green fluorescence protein (GFP) fusion proteins and automated fluorescence microscopy allow for rapid acquisition of large collections of protein localization images. Recognition of these cell images requires an automated image analysis system. Approaches taken by previous work concentrated on designing a set of optimal features and then applying standard machine learning algorithms. In fact, trends of recent advances in machine learning and computer vision can be applied to improve the performance. One trend is the advances in multiclass learning with error-correcting output codes (ECOC). Another trend is the use of a large number of weak detectors with boosting for detecting objects in images of real-world scenes.
Results: We take advantage of these advances to propose a new learning algorithm, AdaBoost.ERC, coupled with weak and strong detectors, to improve the performance of automatic recognition of protein subcellular locations in cell images. We prepared two image data sets of CHO and Vero cells and downloaded a HeLa cell image data set in the public domain to evaluate our new method. We show that AdaBoost.ERC outperforms other AdaBoost extensions. We demonstrate the benefit of weak detectors by showing significant performance improvements over classifiers using only strong detectors. We also empirically test our methods capability of generalizing to heterogeneous image collections. Compared with previous work, our method performs reasonably well for the HeLa cell images.
Availability: CHO and Vero cell images, their corresponding feature sets (SSLF and WSLF), our new learning algorithm, AdaBoost.ERC, and supplementary data are available at http://aiia.iis.sinica.edu.tw/.
Contact: chunnan@iis.sinica.edu.tw
Co-expression of adjacent genes in yeast cannot be simply attributed to shared regulatory system
Authors: Huai-Kuang Tsai, Cindy PC Su, Mei-Yeh J Lu , Ching-Hua Shih and Daryi Wang
PHOTO
H.K. Tsai
Abstract: Background
Adjacent gene pairs in the yeast genome have a tendency to express concurrently. Sharing of regulatory elements within the intergenic region of those adjacent gene pairs was often considered the major mechanism responsible for such co-expression. However, it is still in debate to what extent that common transcription factors (TFs) contribute to the co-expression of adjacent genes. In order to resolve the evolutionary aspect of this issue, we investigated the conservation of adjacent pairs in five yeast species. By using the information for TF binding sites in promoter regions available from the MYBS database http://cg1.iis.sinica.edu.tw/~mybs/ , the ratios of TF-sharing pairs among all the adjacent pairs in yeast genomes were analyzed. The levels of co-expression in different adjacent patterns were also compared.
Results
Our analyses showed that the proportion of adjacent pairs conserved in five yeast species is relatively low compared to that in the mammalian lineage. The proportion was also low for adjacent gene pairs with shared TFs. Particularly, the statistical analysis suggested that co-expression of adjacent gene pairs was not noticeably associated with the sharing of TFs in these pairs. We further proposed a case of the PAC (polymerase A and C) and RRPE (rRNA processing element) motifs which co-regulate divergent/bidirectional pairs, and found that the shared TFs were not significantly relevant to co-expression of divergent promoters among adjacent genes.
Conclusion
Our findings suggested that the commonly shared cis-regulatory system does not solely contribute to the co-expression of adjacent gene pairs in yeast genome. Therefore we believe that during evolution yeasts have developed a sophisticated regulatory system that integrates both TF-based and non-TF based mechanisms(s) for concurrent regulation of neighboring genes in response to various environmental changes.
Accelerating Feature-Vector Matching Using Multiple-Tree and Sub-Vector Methods
Authors: Ying-Ho Liu, Chin-Chin Lin, Wen-Hsiung Lin and Fu Chang
PHOTO
Fu Chang
Abstract: We propose two methods to accelerate the matching of an unknown object with known objects, all of which are expressed as feature vectors. The acceleration becomes necessary when the population of known objects is large and a great deal of time would be required to match all of them. Our proposed methods are multiple decision trees and sub-vector matching, both of which use a learning procedure to estimate the optimal values of certain parameters. Online matching with a combination of the two methods is then performed, whereby candidates are matched rapidly without sacrificing the test accuracy. The process is demonstrated by experiments in which we apply the proposed methods to handwriting recognition and language identification. The speed-up factor of our approach is dramatic compared with an alternative approach that eliminates candidates in a deterministic fashion.
User Identification based on Game-Play Activity Patterns
Authors: Kuan-Ta Chen, Li-Wen Hong
PHOTO
Kuan-Ta Chen
Abstract: Account hijacking is considered one of the most serious security problems in online games. A hijacker normally takes away valuable virtual items from the stolen accounts, and trades those items for real money. Even though account hijacking is not uncommon, there is currently no general solutions to determine whether an account has been hijacked. The game company is not aware of a hijack event unless it is reported by the victim. However, it is usually too lateXusually a hijacker already took away anything valuable when a user finds that his/her account has been stolen. In this paper, we propose a new biometric for human identification based on users' game-play activities. Our main summary are two-fold: 1) we show that the idle time distribution is a representative feature of game players; 2) we propose the RET (Relative-Entropy Test) scheme, which is based on the Kullback-Leibler divergence between idle time distributions, for user identification. Our evaluations shows that the RET scheme achieves higher than 90% accuracy with 20-minute detection time given a 200-minute history size.
Secure Multicast in Dynamic Environments
Authors: Chun-Ying Huang, Yun-Peng Chiu, Kuan-Ta Chen, and Chin-Laung Lei
PHOTO
Kuan-Ta Chen
Abstract: A secure multicast framework should only allow authorized members of a group to decrypt received messages; usually, one ''group key'' is shared by all approved members. However, this raises the problem of ''one affects all'', whereby the actions of one member affect the whole group. Many researchers have solved the problem by dividing a group into several subgroups, but most current solutions require key distribution centers to coordinate secure data communications between subgroups. We believe this is a constraint on network scalability. In this paper, we propose a novel framework to solve key management problems in multicast networks. Our contribution is threefold: (1) We exploit the ElGamal cryptosystem and propose a technique of key composition. (2) Using key composition with proxy cryptography, the key distribution centers used in secure multicast frameworks are eliminated. (3) For key composition, the framework is designed to resist node failures and support topology reconstruction, which makes it suitable for dynamic network environments. Without reducing the security or performance of proxy cryptography, we successfully eliminate the need for key distribution centers. Our analysis shows that the proposed framework is secure, and comparison with other similar frameworks demonstrates that it is efficient in terms of time and space complexity. In addition, the costs of most protocol operations are bounded by constants, regardless of a group's size and the number of branches of transit nodes.
JitterPath: Probing Noise Resilient One-Way Delay Jitter-based Available Bandwidth Estimation
Authors: Yu-Chen Huang, Chun-Shien Lu, and Hsiao-Kuang Wu
Abstract: Measurement of end-to-end available bandwidth has received considerable attention due to its potential use in improving QoS. Available bandwidth enables the sending rate to adapt to network conditions, so that packet loss, caused by congestion, can be significantly reduced before error control mechanisms are finally employed. To this end, we propose a probing noise resilient available bandwidth estimation scheme, called JitterPath, which is adaptive to both the fluid and bursty traffic models. Two key factors, one-way delay jitter and accumulated queuing delay, are both exploited to predict the type of queuing region for each packet pair. Then, the bottleneck utilization information included in the joint queuing regions is estimated and used to quantify the captured traffic ratio, which indicates the relationship between the probing rate and available bandwidth. The contributions of our method are as follows: 1) JitterPath can work without being restricted to fluid traffic models; 2) since JitterPath does not directly use the bottleneck link capacity to calculate the available bandwidth, it is feasible for use in a multihop environment with a single bottleneck; and 3) JitterPath inherently reduces the impact of probing noises under the bursty cross traffic model. Extensive simulations, Internet experiments, and comparisons with other methods were conducted to verify the effectiveness of our method under both single-hop and multihop environments
MYBS: a comprehensive web server for mining transcription factor binding sites in yeast
Authors: Huai-Kuang Tsai, Meng-Yuan Chou, Ching-Hua Shih, Grace Tzu-Wei Huang, Tien-Hsien Chang, and Wen-Hsiung Li
PHOTO
H.K. Tsai
Abstract: Correct interactions between transcription factors (TFs) and their binding sites (TFBSs) are of central importance to gene regulation. Recently developed chromatin-immunoprecipitation DNA chip (ChIP-chip) techniques and the phylogenetic footprinting method provide ways to identify TFBSs with high precision. In this study, we constructed a user-friendly interactive platform for dynamic binding site mapping using ChIP-chip data and phylogenetic footprinting as two filters. MYBS (Mining Yeast Binding Sites) is a comprehensive web server that integrates an array of both experimentally verified and predicted position weight matrixes (PWMs) from eleven databases, including 481 binding motif consensus sequences and 71 PWMs that correspond to 183 TFs. MYBS users can search within this platform for motif occurrences (possible binding sites) in the promoters of genes of interest via simple motif or gene queries in conjunction with the above two filters. In addition, MYBS enables users to visualize in parallel the potential regulators for a given set of genes, a feature useful for finding potential regulatory associations between TFs. MYBS also allows users to identify target gene sets of each TF pair, which could be used as a starting point for further explorations of TF combinatorial regulation. MYBS is available at http://cg1.iis.sinica.edu.tw/~mybs/.
Finding Self-Similarities in Opportunistic People Networks
Authors: Ling-Jyh Chen, Yung-Chih Chen, Tony Sun, Paruveli Sreedevi, Kuan-Ta Chen, Chen-Hung Yu, and Hao-hua Chu
Abstract: Opportunistic network is a type of Delay Tolerant Networks (DTN) where network communication opportunities appear opportunistic. In this study, we investigate opportunistic network scenarios based on public network traces, and our contributions are the following: First, we identify the censorship issue in network traces that usually leads to strongly skewed distribution of the measurements. Based on this knowledge, we then apply the Kaplan-Meier Estimator to calculate the survivorship of network measurements, which is used in designing our proposed censorship removal algorithm (CRA) that is used to recover censored data. Second, we perform a rich set of analysis illustrating that UCSD and Dartmouth network traces show strong self-similarity, and can be modeled as such. Third, we pointed out the importance of these newly revealed characteristics in future development and evaluation of opportunistic networks.
Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation
Authors: Wei-Ho Tsai, Shih-sian Cheng, and Hsin-min Wang
IMG Abstract: This paper investigates the problem of automatically grouping unknown speech utterances based on their associated speakers. In attempts to determine which utterances should be grouped together, it is necessary to measure the voice similarities between utterances. Since most existing methods measure the inter-utterance similarities based directly on the spectrum-based features, the resulting clusters may not be well-related to speakers, but to various acoustic classes instead. This study remedies this shortcoming by projecting utterances onto a reference space trained to cover the generic voice characteristics underlying the whole utterance collection. The resultant projection vectors naturally reflect the relationships of voice similarities among all the utterances, and hence are more robust against interference from nonspeaker factors. Then, a clustering method based on maximum purity estimation is proposed, with the aim of maximizing the similarities between utterances within all the clusters. This method employs a genetic algorithm to determine the cluster to which each utterance should be assigned, which overcomes the limitation of conventional hierarchical clustering that the final result can only reach the local optimum. In addition, the proposed clustering method adapts a Bayesian information criterion to determine how many clusters should be created
Reconstruction of human protein interolog network using evolutionary conserved network
Authors: Tao-Wei Huang, Chung-Yen Lin and Cheng-Yan Kao
PHOTO
T.W. Huang
PHOTO
C.Y. Lin
PHOTO
C.Y. Kao
Abstract: Background
The recent increase in the use of high-throughput two-hybrid analysis has generated large quantities of data on protein interactions. Specifically, the availability of information about experimental protein-protein interactions and other protein features on the Internet enables human protein-protein interactions to be computationally predicted from co-evolution events (interolog). This study also considers other protein interaction features, including sub-cellular localization, tissue-specificity, the cell-cycle stage and domain-domain combination. Computational methods need to be developed to integrate these heterogeneous biological data to facilitate the maximum accuracy of the human protein interaction prediction.
Results
This study proposes a relative conservation score by finding maximal quasi-cliques in protein interaction networks, and considering other interaction features to formulate a scoring method. The scoring method can be adopted to discover which protein pairs are the most likely to interact among multiple protein pairs. The predicted human protein-protein interactions associated with confidence scores are derived from six eukaryotic organisms V rat, mouse, fly, worm, thale cress and baker's yeast.
Conclusion
Evaluation results of the proposed method using functional keyword and Gene Ontology (GO) annotations indicate that some confidence is justified in the accuracy of the predicted interactions. Comparisons among existing methods also reveal that the proposed method predicts human protein-protein interactions more accurately than other interolog-based methods.
Simultaneous amino acid substitutions at antigenic sites drive influenza A hemagglutinin evolution
Authors: Arthur Chun-Chieh Shih, Tzu-Chang Hsiao, Mei-Shang Ho, and Wen-Hsiung Li
PHOTO
Arthur Shih
PHOTO
T.C. Hsia
PHOTO
M.S. Ho
PHOTO
W.H. Li
Abstract: The HA1 domain of HA, the major antigenic protein of influenza A viruses, contains all of the antigenic sites of HA and is under continual immune-driven selection. To resolve controversies on whether only a few or many residue sites of HA1 have undergone positive selection, whether positive selection at HA1 is continual or punctuated, and whether antigenic change is punctuated, we introduce an approach to analyze 2,248 HA1 sequences collected from 1968 to 2005. We identify 95 substitutions at 63 sites from 1968 to 2005 and show that each substitution occurred very rapidly. The rapid substitution and the fact that 57 of the 63 sites are antigenic sites indicate that hitchhiking plays a minor role and that most of these sites, many more than previously found, have undergone positive selection. Strikingly, 88 of the 95 substitutions occurred in groups, and multiple mutations at antigenic sites sped up the fixation process. Our results suggest that positive selection has been ongoing most of the time, not sporadic, and that multiple mutations at antigenic sites cumulatively enhance antigenic drift, indicating that antigenic change is less punctuated than recently proposed.
A new per-class flow fixed proportional differentiated service for multi-service wireless LAN
Authors: Meng Chang Chen, Li-Ping Tung, Yeali Sun, and Wei-Kuan Shih
PHOTO
M.C. Chen
Abstract: In this paper, we propose a new per-CLAss Flow fixed proportional differentiated service model (CLAF) and a companioning medium access control scheme for multi-service wireless LANs (WLANs), based on the IEEE 802.11 framework. Different from conventional relative differentiated service, in CLAF, a fixed bandwidth quality ratio is guaranteed on per-class per-flow basis regardless of the traffic load of each service class. Specifically, each service class is assigned a number of separate coordination periods, proportional to the policy-based bandwidth quality ratio for class isolation. Each class is associated with its own contention window size which is dynamically adjusted in accordance with the number of flows in the class in such a way to minimize collision probability between flows of the same class. Simulations results of the CLAF performance as well as a comparison with IEEE 802.11e EDCA including the support of QoS-sensitive VoIP applications are presented. The results show that the proposed scheme outperforms EDCA and achieves better resource utilization efficiency. CLAF can provide users a more predictive, affirmative service guarantees than conventional relative differentiated service like IEEE 802.11e EDCA.
Estimation of Skew Angles for Scanned Documents Based on Piecewise Covering by Parallelograms
Authors: Chien-Hsing Chou, Shih-Yu Chu and Fu Chang
PHOTO
Fu Chang
Abstract: We propose a fast and robust skew estimation method for scanned documents that estimates skew angles based on piecewise covering of objects, such as textlines, figures, forms, or tables. The method first divides a document image into a number of non-overlapping slabs in which each object is covered by parallelograms. It then estimates the skew angle based on these parallelograms or, equivalently, their complementary regions. Putting our method to a systematic test and comparing it with some alternatives, we find that it yields favorable results in terms of accuracy, sensitivity to non-textual objects, effectiveness in dealing with documents of unspecified reading order, and computational efficiency. Some work is also conducted to find an effective way to further shorten its computation time at the expense of an extremely small loss of accuracy.
Visual Salience-Guided Mesh Decomposition
Authors: Hsueh-Yi Sean Lin, Hong-Yuan Mark Liao, and Ja-Chen Lin
PHOTO
Mark Liao
Abstract: In this paper, we propose a novel mesh-decomposition scheme called "visual salience-guided mesh decomposition". The concept of "part salience", which originated in cognitive psychology, asserts that the salience of a part can be determined by (at least) three factors: the protrusion, the boundary strength, and the relative size of the part. We try to convert these conceptual rules into real computational processes, and use them to guide a three-dimensional (3D) mesh decomposition process in such a way that the significant components can be precisely identified and efficiently extracted from a given 3D mesh. The proposed decomposition scheme not only identifies the parts' boundaries defined by the minima rule, but also labels each part with a quantitative degree of visual salience during the mesh decomposition process. The experimental results show that the proposed scheme is indeed effective and powerful in decomposing a 3D mesh into its significant components
Current Research