
Protein subcellular localization prediction

The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. We proposed two PSL prediction methods, PSL101 and PSLDoc, which were published in BMC Bioinformatics (2007) and Proteins: Structure, Function, and Bioinformatics (2008), respectively.

PSL101 combines a structural homology approach with a support vector machine model that incorporates biological features derived from bacterial translocation pathways. PSLDoc applies probabilistic latent semantic analysis to gapped-dipeptides of various distances, using evolutionary information from position-specific scoring matrices. Our methods achieve 79% to 94% overall accuracy for several prokaryotic and eukaryotic species, and improve on the respective state-of-the-art results by 7.4% for sequences with low homology to the training set. Moreover, the proposed biological features and gapped-dipeptide signatures are interpretable and can be applied in advanced studies and experimental designs.
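As a rough illustration of the gapped-dipeptide representation mentioned above, the following sketch counts residue pairs separated by a fixed number of intervening positions. It is a simplification under stated assumptions: it uses raw counts from the sequence alone, whereas the text describes weighting by a position-specific scoring matrix followed by probabilistic latent semantic analysis, neither of which is shown here. The function name and the `max_gap` parameter are hypothetical.

```python
from collections import Counter


def gapped_dipeptides(sequence, max_gap):
    """Count gapped dipeptides in a protein sequence.

    A gapped dipeptide is a triple (a, gap, b): residue a followed by
    residue b with `gap` residues in between, for 0 <= gap <= max_gap.
    Raw counts only; PSSM weighting is deliberately left out.
    """
    counts = Counter()
    for gap in range(max_gap + 1):
        step = gap + 1  # distance between the two residues of the pair
        for i in range(len(sequence) - step):
            counts[(sequence[i], gap, sequence[i + step])] += 1
    return counts


features = gapped_dipeptides("MKKL", max_gap=1)
```

Each protein then becomes a sparse vector over these triples, which is the kind of "document" representation a latent semantic method can take as input.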

(1) PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis (Jia-Ming Chang)

(2) Protein subcellular localization prediction based on compartment-specific features and structure conservation (Emily Chia-Yu Su)

(3) Protein subcellular localization prediction based on compartment-specific biological features (Emily Chia-Yu Su)

(4) KnowPredSite: A web server for predicting single and multiple subcellular localization sites

Knowledge of protein subcellular localization (PSL) is important for elucidating protein functions, because proteins that cooperate towards a common function are found in the same subcellular compartment. Determining the localization sites of a protein through experiments can be time-consuming and labor-intensive. With the large number of sequences that continue to emerge from genome sequencing projects, computational methods for PSL prediction at a proteome scale are becoming increasingly important. Most PSL prediction systems are designed for single-localized proteins. A significant number of eukaryotic proteins are, however, known to be localized in multiple subcellular organelles. In addition, the majority of existing computational methods have the following disadvantages: 1) they predict only a limited number of locations; 2) they are limited to subsets of proteomes that contain signal peptide sequences or have prior structural/functional information; 3) the datasets used for training are for specific species and are not sufficiently robust to represent entire proteomes.

To overcome these problems, we have proposed a knowledge-based approach called KnowPredsite [1] to predict the localization site(s) of both single-localized and multi-localized proteins. Based on local similarity, we identify the "related sequences" used for prediction. We construct a knowledge base that records the possible sequence variations for protein sequences. To predict the localization annotation of a query protein, we search against the knowledge base and use a scoring mechanism to determine the predicted sites.
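The lookup-and-score idea can be sketched as follows. This is only an illustration under strong assumptions: the knowledge base here is a plain table from fixed-length peptide fragments to the localization sites of the proteins they came from, and the score is a normalized vote. The actual knowledge base contents and scoring mechanism of KnowPredsite are not specified in this text, and all names and parameters below (`build_knowledge_base`, `predict_sites`, `k`, `threshold`) are hypothetical.

```python
from collections import defaultdict


def build_knowledge_base(annotated, k=3):
    """Map every length-k fragment to counts of the sites it was seen with."""
    kb = defaultdict(lambda: defaultdict(int))
    for sequence, sites in annotated:
        for i in range(len(sequence) - k + 1):
            for site in sites:
                kb[sequence[i:i + k]][site] += 1
    return kb


def predict_sites(query, kb, k=3, threshold=0.3):
    """Return every site whose share of the fragment votes reaches threshold."""
    votes = defaultdict(float)
    for i in range(len(query) - k + 1):
        hits = kb.get(query[i:i + k])
        if not hits:
            continue  # fragment unknown to the knowledge base
        total = sum(hits.values())
        for site, n in hits.items():
            votes[site] += n / total  # each fragment contributes one vote in total
    total_votes = sum(votes.values())
    if total_votes == 0:
        return []
    return sorted(s for s, v in votes.items() if v / total_votes >= threshold)


annotated = [
    ("MKKLAV", ["cytoplasm"]),
    ("MKKLGG", ["cytoplasm", "nucleus"]),
    ("GGSTPQ", ["nucleus"]),
]
kb = build_knowledge_base(annotated)
single = predict_sites("MKKLAV", kb)
multiple = predict_sites("MKKLGG", kb)
```

Returning every site above a threshold, rather than only the top-scoring one, is what lets a scheme of this kind report multi-localized proteins.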

Protease Substrate Site Prediction

(1) Factor Xa.

Regulatory proteases modulate biological systems by catalyzing hydrolysis of designated peptide bonds, which in turn tips the proteomic balance and results in rapid and substantial changes to the system. The regulatory power of a protease is known to originate from its substrate specificity. However, for a large part of the protease families, the specificity spectrum and the in vivo substrates are poorly understood. A viable approach to elucidating the proteomic role of a regulatory protease is to construct computational systems that model the rules governing its complete substrate specificity and then scan the proteome for potential in vivo substrates. The predicted substrate sites could provide important clues to the protease's regulatory network, on which verification experiments could then be focused. We have constructed computational models for scanning potential substrate sites in natural protein sequences by integrating multi-level substrate phage display experiments with bootstrap aggregation machine learning algorithms. Factor Xa, a key regulatory serine protease of the blood coagulation system, was used as a model protease to demonstrate that the systematically coupled experimental and computational procedures can produce computational systems capable of finding substrate sites of significant biological relevance in natural protein sequences. The protocol for experimentally sampling and computationally learning the rules governing substrate specificity can be generalized to any protease of interest whose active form is available for in vitro experiments.
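Bootstrap aggregation (bagging) trains several models on resampled copies of the training data and lets them vote. The sketch below shows only that general idea on toy peptide data. The base learner (a nearest-neighbour rule under Hamming distance), the training peptides, and all function names are assumptions made for illustration; the models and phage display data actually used are not described in this text.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length peptides."""
    return sum(x != y for x, y in zip(a, b))


def fit_nearest_neighbour(sample):
    """Toy base learner: label a peptide like its closest training peptide."""
    def predict(peptide):
        return min(sample, key=lambda item: hamming(item[0], peptide))[1]
    return predict


def bagging_predict(train, peptide, n_models=25, seed=0):
    """Majority vote over base learners fitted on bootstrap resamples."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # Bootstrap: draw len(train) examples with replacement.
        sample = [rng.choice(train) for _ in train]
        votes[fit_nearest_neighbour(sample)(peptide)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical labels: 1 = cleaved, 0 = not cleaved.
train = [("IEGR", 1), ("IDGR", 1), ("AAAA", 0), ("GGGG", 0)]
label = bagging_predict(train, "IEGR", seed=1)
```

Scanning a protein would then amount to sliding a window along the sequence and calling `bagging_predict` on each window.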

Membrane protein structure prediction

I) Membrane protein topology prediction: Membrane proteins account for roughly 30% of the proteins encoded in sequenced genomes, yet their structures are scarce due to experimental difficulties. They are important for a diverse array of biological functions and are prominent pharmaceutical targets. To gain insights into their secondary structure (topology), we developed a method named SVMtop for alpha-helical membrane proteins. The method is based on support vector machines (SVM) in a hierarchical framework in which helix prediction is performed in the first stage, followed by topology prediction in the second. Standard benchmarks showed that SVMtop is one of the top-performing methods, correctly predicting the topology of 70% of the proteins, with a false positive rate below 1% when identifying soluble proteins.

Pubmed link: http://www.ncbi.nlm.nih.gov/pubmed/18081245
Web server: http://bio-cluster.iis.sinica.edu.tw/SVMtop

II) Membrane protein helix-helix interactions prediction: Interactions between TM helices are important for the structure assembly, stability and function of membrane proteins. The molecular interactions are mediated by residue contacts. We developed a novel two-level method to predict helix-helix interactions based on contact residues. In the first level, single contact residues are predicted from sequence; in the second level, their pairing relationships are predicted. The two-level approach consistently improves on the non-hierarchical approach in prediction accuracy, with up to a 95% reduction in input. Our method also outperforms previous methods based on correlated mutation by 14%. Our results demonstrate that a hierarchical framework can be applied in contact prediction to eliminate false positives while reducing computational complexity. Together with the statistical analysis of contact propensities, this method can be used to gain insights into helix packing in membrane proteins.

Pubmed link: http://www.ncbi.nlm.nih.gov/pubmed/19244388
Web server: http://bio-cluster.iis.sinica.edu.tw/TMhit
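The input reduction achieved by a two-level scheme can be seen in a small sketch. It assumes that a first-level predictor has already assigned each residue a contact score, and that the second level only examines pairs of residues that passed the first level. The scores, the cutoff, and the function names are hypothetical and do not come from the method's implementation.

```python
from itertools import product


def candidate_pairs(helices):
    """All residue pairs drawn from two different helices."""
    pairs = []
    for a in range(len(helices)):
        for b in range(a + 1, len(helices)):
            pairs.extend(product(helices[a], helices[b]))
    return pairs


def two_level_candidates(helices, contact_score, cutoff=0.5):
    """Level 1: keep residues scored as likely contacts.
    Level 2 input: pairs among the surviving residues only."""
    filtered = [[r for r in helix if contact_score[r] >= cutoff]
                for helix in helices]
    return candidate_pairs(filtered)


# Two helices given as residue numbers, with hypothetical level-1 scores.
helices = [[1, 2, 3, 4], [10, 11, 12, 13]]
scores = {1: 0.1, 2: 0.9, 3: 0.8, 4: 0.1, 10: 0.1, 11: 0.7, 12: 0.1, 13: 0.1}

flat = candidate_pairs(helices)
reduced = two_level_candidates(helices, scores)
reduction = 1 - len(reduced) / len(flat)
```

In this toy case the second level receives 2 candidate pairs instead of 16, which is the sense in which a hierarchical design reduces both the input size and the room for false positives.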

III) Lipid accessibility and rotational preference of transmembrane helices: Membrane protein structures are difficult to determine experimentally, so computational methods are in demand for closing the gap between their sequence and structure space. In addition, knowledge of helix-lipid interactions is required to gain insights into membrane protein folding and to reconstruct transmembrane (TM) helical bundles for structure prediction. Therefore, it is important to develop sequence-based methods for predicting the lipid exposure of TM residues. We present a new method based on random forests (RFs) for predicting both the burial status and the real-value lipid exposure surface of TM domains. We calculate a knowledge-based propensity scale that captures important information on the lipid exposure of TM domains, and we integrate this scale with evolutionary profile and sequence conservation as input features for constructing the RF models. We further extend our method to infer the rotational preference of TM helices. The propensity scale and the prediction method presented here can be used to gain insights into the lipid exposure and rotational orientation of TM helices.
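One common way to build a knowledge-based propensity scale is to compare how often an amino acid occurs among lipid-exposed residues with how often it occurs overall. The sketch below uses that ratio as an assumption; the exact definition of the scale described above is not given in this text, and the function name and toy data are hypothetical.

```python
from collections import Counter


def exposure_propensity(residues):
    """Propensity of each amino acid to be lipid exposed.

    residues: iterable of (amino_acid, is_exposed) pairs.
    Propensity = (share of the amino acid among exposed residues)
               / (share of the amino acid among all residues).
    Values above 1 indicate a preference for exposed positions.
    """
    total = Counter()
    exposed = Counter()
    for amino_acid, is_exposed in residues:
        total[amino_acid] += 1
        if is_exposed:
            exposed[amino_acid] += 1
    n_total = sum(total.values())
    n_exposed = sum(exposed.values())
    if n_exposed == 0:
        raise ValueError("no exposed residues in the data")
    return {aa: (exposed[aa] / n_exposed) / (total[aa] / n_total)
            for aa in total}


# Toy data: leucine exposed in 3 of 4 cases, glycine in 1 of 4.
data = [("L", True)] * 3 + [("L", False)] + [("G", True)] + [("G", False)] * 3
scale = exposure_propensity(data)
```

A per-residue value from such a scale can be used directly as one input feature alongside profile and conservation features.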

NMR Backbone resonance assignment

NMR data from different experiments often contain errors; thus, automated backbone resonance assignment is a very challenging problem. We developed an iterative relaxation algorithm, called RIBRA, for NMR protein backbone assignment. RIBRA applies nearest neighbor and weighted maximum independent set algorithms to solve the problem. We tested RIBRA on two real NMR datasets, hbSBD and hbLBD, on perfect BMRB data (902 proteins), and on four synthetic BMRB datasets that simulate four kinds of errors. The accuracies of RIBRA on hbSBD and hbLBD are 91.4% and 83.6%, respectively. The average accuracy of RIBRA on the perfect BMRB dataset is 98.28%, and its accuracies on the four kinds of synthetic datasets are 98.28%, 95.61%, 98.16% and 96.28%, respectively.
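The weighted maximum independent set problem mentioned above asks for a set of mutually compatible items with the largest total weight. The sketch below solves it by exhaustive search on a tiny conflict graph, purely to show the problem statement. It assumes that nodes stand for candidate assignments and edges for conflicts between them, which is an illustrative reading; how RIBRA builds its graph and how it solves the problem at scale are not described in this text.

```python
from itertools import combinations


def max_weight_independent_set(weights, conflicts):
    """Exact search for the heaviest conflict-free subset of nodes.

    weights:   dict mapping node -> positive weight
    conflicts: iterable of (node, node) pairs that cannot both be chosen
    Exponential time, so suitable for small illustrative inputs only.
    """
    nodes = list(weights)
    conflict_set = {frozenset(edge) for edge in conflicts}
    best, best_weight = (), 0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if any(frozenset(pair) in conflict_set
                   for pair in combinations(subset, 2)):
                continue  # subset contains two conflicting nodes
            weight = sum(weights[n] for n in subset)
            if weight > best_weight:
                best, best_weight = subset, weight
    return set(best), best_weight


weights = {"a": 3, "b": 2, "c": 2, "d": 1}
conflicts = [("a", "b"), ("a", "c")]
chosen, total = max_weight_independent_set(weights, conflicts)
```

Here the heaviest single node `a` is not part of the optimum, because choosing it rules out both `b` and `c`.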

We also present a method called GANA that uses a genetic algorithm to perform backbone resonance assignment automatically with a high degree of precision and recall. GANA takes spin systems as input data, and almost all spin systems can be mapped correctly onto a target protein, even if the data are noisy. The average recall rates of GANA on BMRB and the four simulated test cases are 99.26%, 99.19%, 98.85%, 98.87% and 97.78%, respectively. The precision and recall rates of GANA on hbSBD are 95.12% and 92.86%, respectively, and those on hbLBD are 100% and 97.40%, respectively.

(1) RIBRA-an Error-Tolerant Algorithm for the NMR Backbone Assignment Problem (Kuen-Pin Wu, Jia-Ming Chang)

(2) GANA - A Genetic Algorithm for NMR Backbone Resonance Assignment (Hsin-Nan Lin)

(3) An Iterative Relaxation Technique for the NMR Backbone Assignment Problem (J. M. Chang)

1. Wu, K.P., Chang, J.M., Chou, W.C., Chen, J.B., Sung, T.Y., Chang, C.F., Wu, W.J., Huang, T.H. and Hsu, W.L. (2006) RIBRA-an Error-Tolerant Algorithm for the NMR Backbone Assignment Problem. Journal of Computational Biology, 13, 229-244.
2. Lin, H.N., Wu, K.P., Chang, J.M., Sung, T.Y. and Hsu, W.L. (2005) GANA-a genetic algorithm for NMR backbone resonance assignment. Nucleic Acids Research, 33, 4593-4601.
3. Wu, K.P., Chang, J.M., Chou, W.C., Chen, J.B., Sung, T.Y., Chang, C.F., Wu, W.J., Huang, T.H. and Hsu, W.L. (2005) RIBRA-an Error-Tolerant Algorithm for the NMR Backbone Assignment Problem. The Ninth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2005), 103-117.

Predicting RNA-binding sites of proteins using support vector machines and evolutionary information

Pubmed Link: http://www.ncbi.nlm.nih.gov/pubmed/19091029

English Abstract
RNA-protein interaction plays an essential role in several biological processes, such as protein synthesis, gene expression, post-transcriptional regulation, and antiviral drug discovery. Identification of RNA-binding sites in proteins can provide valuable insights for biologists. However, experimental determination of RNA-protein interaction remains time-consuming and labor-intensive. Thus, computational approaches for the prediction of RNA-binding sites from protein sequences have become highly desirable. In this paper, we propose a method, RNAProB, to predict RNA-binding sites based on support vector machines and a new encoding scheme for the smoothed position-specific scoring matrix (PSSM). Evaluated by five-fold cross-validation, our method achieves Matthews correlation coefficient (MCC) values of 0.68, 0.58, and 0.42, compared to 0.45, 0.35, and 0.32 for the state-of-the-art systems on three benchmark data sets, respectively. Moreover, to avoid overfitting, we use a three-way data split procedure to estimate our predictive performance, and our approach obtains MCC values of 0.67, 0.56, and 0.40, respectively. In conclusion, our method significantly improves the predictive performance of RNA-binding site prediction. The proposed encoding scheme for smoothed PSSM can be used in other research problems, such as DNA-protein interaction, protein-protein interaction, and prediction of post-translational modification.
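Two ingredients of the abstract can be made concrete in a short sketch. The MCC function follows the standard definition from confusion-matrix counts. The smoothing function is one plausible reading of a "smoothed PSSM", in which each row is replaced by the sum of the rows inside a window centred on it; that windowed sum, the zero treatment at the sequence ends, and both function names are assumptions, since the encoding scheme itself is not spelled out in this text.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denominator


def smooth_pssm(pssm, window=3):
    """Replace each PSSM row by the sum of rows in a centred window.

    pssm: list of equal-length rows, one row per residue.
    Positions beyond either end of the sequence contribute nothing.
    """
    half = window // 2
    n = len(pssm)
    width = len(pssm[0]) if pssm else 0
    smoothed = []
    for i in range(n):
        row = [0] * width
        for j in range(max(0, i - half), min(n, i + half + 1)):
            for c in range(width):
                row[c] += pssm[j][c]
        smoothed.append(row)
    return smoothed


toy_pssm = [[1, 0], [2, 1], [3, 2]]  # 3 residues, 2 columns for brevity
smoothed = smooth_pssm(toy_pssm, window=3)
```

Smoothing of this kind lets the score at one residue reflect its sequence neighbours, which suits binding sites that span several adjacent positions.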

Current research

1. Transmembrane helix-helix interactions database: The TransMembrane helix-helix interactions database (TMhitDB) is a comprehensive repository of helical interactions from experimentally derived membrane protein structures. In particular, TMhitDB provides pre-calculated geometric descriptors of helix-helix interactions at the helix-packing interface. TMhitDB also includes the topology information, lipid accessibility, ligands and binding sites of each transmembrane protein. Each record also contains an overview of the protein, such as its sequence, name, experimental details, function, and cross references. TMhitDB provides structural classification and allows extensive queries, browsing, and visualization of data.

2. Re-entrant region of membrane proteins and signal peptide prediction: We are developing two major enhancements to our topology prediction method, SVMtop, by integrating prediction methods that deal with potentially problematic sequences containing re-entrant structures and signal peptides. As membrane protein structures often contain these two types of sequences, they must be clearly distinguished when predicting membrane protein topology. This work will complement the existing method and strengthen the usability of SVMtop.

3. Transmembrane helix-helix crossing angles prediction: We are investigating the relationship between sequence and structure for helix-helix interactions found in membrane proteins. In particular, the sequence determinants of helix-helix crossing angles remain unclear. Currently, we are developing computational models for predicting helix-packing geometries from sequence and other structural information. This work will ultimately aid our constraint-based prediction of membrane protein structures.


Wen-Lian Hsu
Professor, IEEE Fellow
Research Fellow
Institute of Information Science,
Academia Sinica, Taipei,
Taiwan, R. O. C.
886-2-27883799 ext.1804


Ting-Yi Sung
Research Fellow
Institute of Information Science,
Academia Sinica, Taipei,
Taiwan, R. O. C.
886-2-27883799 ext.1711


Intelligent Agent Systems Lab., Institute of Information Science, Academia Sinica.
128 Academia Road, Sec.2, Nankang, Taipei, Taiwan, ROC
Tel: +886-2-2788-3799, Fax: 886-2-2782-4814, 886-2-2651-8660