Protein subcellular localization prediction
The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. We proposed two PSL prediction methods, PSL101 and PSLDoc, which were published in BMC Bioinformatics 2007 and Proteins: Structure, Function, and Bioinformatics 2008, respectively.
PSL101 combines a structural homology approach with a support vector machine model that incorporates biological features derived from bacterial translocation pathways. PSLDoc applies probabilistic latent semantic analysis to gapped-dipeptides of various distances, using evolutionary information from position-specific scoring matrices. Our methods achieve 79% to 94% overall accuracy for several prokaryotic and eukaryotic species, and outperform the respective state-of-the-art results by 7.4% on sequences with low homology to the training set. Moreover, the proposed biological features and gapped-dipeptide signatures are interpretable and can be applied in advanced studies and experimental designs.
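As a rough illustration of the gapped-dipeptide representation, the sketch below counts residue pairs separated by 0 to max_gap intervening positions. It is a simplified assumption, not the PSLDoc implementation: it uses raw counts from the sequence instead of position-specific scoring matrix weights, it omits the probabilistic latent semantic analysis step, and the function name and the "X<gap>Y" key format are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gapped_dipeptide_counts(sequence, max_gap):
    """Count gapped dipeptides 'X<gap>Y' (hypothetical key format).

    'X<gap>Y' means residue X followed by residue Y with `gap`
    residues between them; gap 0 is an ordinary dipeptide.
    """
    counts = Counter()
    for gap in range(max_gap + 1):
        for i in range(len(sequence) - gap - 1):
            first, second = sequence[i], sequence[i + gap + 1]
            # Skip non-standard residue codes such as 'X' or 'B'.
            if first in AMINO_ACIDS and second in AMINO_ACIDS:
                counts[f"{first}{gap}{second}"] += 1
    return counts


# Toy sequence, for illustration only.
counts = gapped_dipeptide_counts("MKKLLA", max_gap=2)
```

Each protein then becomes a vector over these gapped-dipeptide "terms", which is the kind of document-term representation that latent semantic methods operate on.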
(1) PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis (Jia-Ming Chang)
(2) Protein subcellular localization prediction based on compartment-specific features and structure conservation (Emily Chia-Yu Su)
(3) Protein subcellular localization prediction based on compartment-specific biological features (Emily Chia-Yu Su)
(4) KnowPredsite: A web server for predicting single and multiple subcellular localization sites
Protein subcellular localization (PSL) is
important to elucidate protein functions as
proteins cooperate towards a common function
in the same subcellular compartment.
Determining the localization sites of a
protein through experiments can be
time-consuming and labor-intensive. With the
large number of sequences that continue to
emerge from the genome sequencing projects,
computational methods for PSL prediction at proteome
scale are becoming increasingly important. Most
PSL prediction systems are designed
specifically for single-localized proteins.
A significant number of
eukaryotic proteins are, however, known to
be localized into multiple subcellular
organelles. In addition, the majority of
existing computational methods have the
following disadvantages: 1) they only
predict a limited number of locations; 2)
they are limited to subsets of proteomes
which contain signal peptide sequences or
with prior structural/functional
information; 3) their training datasets cover
only specific species and are not sufficiently
robust to represent entire proteomes.
To overcome these problems, we have proposed
a knowledge-based approach called
KnowPredsite [1] to predict the localization
site(s) of both single-localized and
multi-localized proteins. Based on local
similarity, we identify the "related
sequences" for prediction. We construct a
knowledge base to record the possible
sequence variations for protein sequences.
When predicting the localization annotation
of a query protein, we search against the
knowledge base and use a scoring mechanism
to determine the predicted sites.
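
The general idea of a knowledge base plus a scoring mechanism can be sketched as follows. This is only an assumed toy design, not the published KnowPredsite scoring scheme: the knowledge base here maps exact k-residue fragments to localization counts, each matching fragment of the query casts a normalized vote, and every site scoring at least `ratio` times the best score is reported so that multi-localized proteins can receive several sites. All names, the fragment length, and the threshold are hypothetical.

```python
from collections import defaultdict


def build_knowledge_base(annotated, k=3):
    """Map each k-residue fragment to counts of the sites it was seen with."""
    kb = defaultdict(lambda: defaultdict(int))
    for sequence, sites in annotated:
        for i in range(len(sequence) - k + 1):
            for site in sites:
                kb[sequence[i:i + k]][site] += 1
    return kb


def predict_sites(query, kb, k=3, ratio=0.5):
    """Score sites by fragment votes; keep sites within `ratio` of the best."""
    scores = defaultdict(float)
    for i in range(len(query) - k + 1):
        votes = kb.get(query[i:i + k])
        if not votes:
            continue  # fragment not present in the knowledge base
        total = sum(votes.values())
        for site, n in votes.items():
            scores[site] += n / total
    if not scores:
        return []
    best = max(scores.values())
    return sorted(site for site, s in scores.items() if s >= ratio * best)


# Toy training data, invented for illustration.
annotated = [
    ("MKTLLA", ["nucleus"]),
    ("GGSPLV", ["cytoplasm"]),
    ("MKTGGS", ["nucleus", "cytoplasm"]),
]
kb = build_knowledge_base(annotated)
single = predict_sites("MKTLL", kb)
multiple = predict_sites("MKTGGSP", kb)
```

With this toy data, the first query receives one site and the second receives two, which is the behavior a multi-localization predictor needs.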

Protease Substrate Site Prediction
(1) Factor Xa
Regulatory proteases modulate biological systems
through catalyzing hydrolysis reactions on
designated peptide bonds, which in turn tip
proteomic balance, resulting in rapid and
substantial change of the systems. The
regulatory power of the proteases is known to
originate from the corresponding substrate
specificities of the proteases. However, for a
large part of the protease families, the
specificity spectrum and the in vivo substrates
remain poorly understood. A viable approach
for elucidating the proteomic role of a
regulatory protease is to construct
computational systems capable of modeling the
rules governing the complete substrate
specificities of the protease to scan for
potential in vivo substrates in the proteome.
The predicted substrate sites could provide
important clues for the protease's regulatory
network, on which verifying experiments could
then be focused. We have constructed
computational models for scanning potential
substrate sites in natural protein sequences by
integrating multi-level substrate phage display
experiments with bootstrap aggregation machine
learning algorithms. Factor Xa, a key regulatory
serine protease of the blood coagulation system,
was used as a model protease to demonstrate that
the systematically coupled experimental and
computational procedures together were able to
produce computational systems capable of
scanning for substrate sites of significant
biological relevance in natural protein
sequences. The protocol for experimentally
sampling and computationally learning on the
rules governing the substrate specificity can be
generalized to any protease of interest for
which the active form is available for the in
vitro experiments.
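
A minimal sketch of how bootstrap aggregation can be combined with a sliding-window scan is given below. It is an assumption for illustration and not the model described above: the base learner is a simple position-specific frequency profile, the training peptides are invented stand-ins for phage display selections, and all function names and parameters are hypothetical.

```python
import math
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def fit_profile(peptides, pseudocount=1.0):
    """Position-specific residue frequencies with additive smoothing."""
    total = len(peptides) + pseudocount * len(ALPHABET)
    profile = []
    for pos in range(len(peptides[0])):
        counts = Counter(p[pos] for p in peptides)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return profile


def fit_bagged_profiles(peptides, n_models=25, seed=0):
    """Fit one profile per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [
        fit_profile([rng.choice(peptides) for _ in peptides])
        for _ in range(n_models)
    ]


def score_window(window, models):
    """Average log-likelihood of the window over all bagged profiles."""
    per_model = [
        sum(math.log(profile[i][aa]) for i, aa in enumerate(window))
        for profile in models
    ]
    return sum(per_model) / len(per_model)


def scan(protein, models, width):
    """Score every window; the protein must use the 20 standard residues."""
    return [
        (start, score_window(protein[start:start + width], models))
        for start in range(len(protein) - width + 1)
    ]


# Toy peptides resembling the Ile-Glu/Asp-Gly-Arg motif; not experimental data.
peptides = ["IEGR", "IDGR", "IEGR", "LEGR"]
models = fit_bagged_profiles(peptides)
hits = scan("AAIEGRTT", models, width=4)
best_start = max(hits, key=lambda hit: hit[1])[0]
```

Averaging over resampled models makes the window scores less sensitive to any single training peptide, which is the usual motivation for bagging on small experimental datasets.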

Membrane protein structure prediction
I) Membrane protein topology prediction:
Membrane proteins account for about 30% of the genes in sequenced genomes, yet their structures are scarce due to experimental difficulties. They are important for a diverse array of biological functions and are prominent pharmaceutical targets. To gain insights into their secondary structure (topology), we developed a method named SVMtop for alpha-helical membrane proteins. The method is based on support vector machines (SVM) in a hierarchical framework in which helix prediction is performed in the first stage, followed by topology prediction in the second. Standard benchmarks showed that SVMtop is one of the top-performing methods, correctly predicting the topology of 70% of proteins with a false positive rate of less than 1% in identifying soluble proteins.
Pubmed link:
http://www.ncbi.nlm.nih.gov/pubmed/18081245
 Web server:
http://bio-cluster.iis.sinica.edu.tw/SVMtop
II) Membrane protein helix-helix interaction prediction:
Interactions between transmembrane (TM) helices are important for the structure assembly, stability and function of membrane proteins. These molecular interactions are mediated by residue contacts. We developed a novel two-level method to predict helix-helix interactions based on contact residues. In the first level, single contact residues are predicted from sequence; in the second level, their pairing relationships are predicted. The two-level approach consistently improves on the non-hierarchical approach in prediction accuracy, with up to a 95% reduction in input. Our method also outperforms previous methods based on correlated mutation by 14%. Our results demonstrate that a hierarchical framework can be applied in contact prediction to eliminate false positives while reducing computational complexity. Together with the statistical analysis of contact propensities, this method can be used to gain insights into helix packing in membrane proteins.
Pubmed link:
http://www.ncbi.nlm.nih.gov/pubmed/19244388
 Web server:
http://bio-cluster.iis.sinica.edu.tw/TMhit
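
The input reduction achieved by a two-level scheme can be illustrated with a small sketch. It assumes a hypothetical first-level predictor, represented here by a fixed set of residue positions, and compares the number of candidate residue pairs with and without that filter. The helix positions and predicted contacts are toy values and are unrelated to the figures reported above.

```python
from itertools import product


def all_pairs(helix_a, helix_b):
    """Non-hierarchical input: every residue pair between two helices."""
    return list(product(helix_a, helix_b))


def two_level_pairs(helix_a, helix_b, is_contact):
    """Second-level input: only pairs of first-level contact residues."""
    contacts_a = [r for r in helix_a if is_contact(r)]
    contacts_b = [r for r in helix_b if is_contact(r)]
    return list(product(contacts_a, contacts_b))


# Two toy helices of 20 residues each (residue numbers are invented).
helix_a = range(10, 30)
helix_b = range(50, 70)

# Stand-in for first-level predictions of single contact residues.
predicted_contacts = {12, 16, 19, 23, 26, 52, 55, 59, 62, 66}

full = all_pairs(helix_a, helix_b)
reduced = two_level_pairs(helix_a, helix_b, predicted_contacts.__contains__)
reduction = 1 - len(reduced) / len(full)
```

Because the second level only sees pairs of residues already predicted to be in contact, both the number of examples to classify and the opportunity for false-positive pairs shrink.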
III) Lipid accessibility and rotational preference of transmembrane helices:
Membrane protein structures are difficult to determine experimentally; thus, computational methods are in demand for closing the gap between their sequence and structure space. In addition, knowledge of helix-lipid interactions is required to gain insights into membrane protein folding and to reconstruct transmembrane (TM) helical bundles for structure prediction. Therefore, it is important to develop sequence-based methods for predicting the lipid exposure of TM residues. We present a new method for predicting both the burial status and the real-value lipid exposure surface of TM domains based on random forests (RFs). A knowledge-based propensity scale is calculated to capture important information on the lipid exposure of TM domains. In addition, we integrate this scale with evolutionary profiles and sequence conservation as input features for constructing the RF models. We further extend our method to infer the rotational preference of TM helices. The propensity scale and the prediction method presented herein can be used to gain insights into the lipid exposure and rotational orientation of TM helices.
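
One common way to turn per-residue exposure values into a rotational preference is a helical moment: each residue is placed on a helical wheel (an ideal alpha-helix advances about 100 degrees per residue) and the exposure values are summed as vectors. The sketch below assumes this approach for illustration only; it is not necessarily the procedure used in the method above, and the function name and inputs are hypothetical.

```python
import math


def rotational_preference(exposure, degrees_per_residue=100.0):
    """Return (angle in degrees, magnitude) of the exposure moment.

    `exposure` holds one predicted lipid-exposure value per residue,
    ordered along the helix. The angle is measured on the helical wheel
    relative to the first residue.
    """
    x = y = 0.0
    for i, value in enumerate(exposure):
        angle = math.radians(i * degrees_per_residue)
        x += value * math.cos(angle)
        y += value * math.sin(angle)
    return math.degrees(math.atan2(y, x)) % 360.0, math.hypot(x, y)


# Toy input: only the second residue is lipid-exposed.
angle, magnitude = rotational_preference([0.0, 1.0, 0.0, 0.0])
```

The angle indicates which face of the helix points toward the lipid, and the magnitude indicates how strongly the exposure is concentrated on that face.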

NMR Backbone resonance assignment
NMR data from different experiments often contain errors, which makes automated backbone resonance assignment very challenging. We developed an iterative relaxation algorithm, called RIBRA, for NMR protein backbone assignment. RIBRA applies nearest neighbor and weighted maximum independent set algorithms to solve the problem. We tested RIBRA on two real NMR datasets, hbSBD and hbLBD, on perfect BMRB data (902 proteins), and on four synthetic BMRB datasets that simulate four kinds of errors. The accuracy of RIBRA on hbSBD and hbLBD is 91.4% and 83.6%, respectively. The average accuracy of RIBRA is 98.28% on the perfect BMRB datasets, and 98.28%, 95.61%, 98.16% and 96.28% on the four kinds of synthetic datasets, respectively.
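To make the weighted maximum independent set formulation concrete, the sketch below solves a tiny instance by exhaustive search. It assumes a hypothetical setup in which each node is a candidate assignment with a confidence weight and each edge joins two candidates that cannot both be correct. The brute-force search is only suitable for toy inputs and is not the algorithm used in RIBRA.

```python
from itertools import combinations


def max_weight_independent_set(weights, conflicts):
    """Exhaustively find the heaviest set of mutually compatible nodes.

    weights:   dict mapping node -> weight
    conflicts: iterable of (node, node) pairs that cannot both be chosen
    """
    nodes = sorted(weights)
    conflict_set = {frozenset(pair) for pair in conflicts}
    best, best_weight = (), 0.0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            # Reject subsets that contain any conflicting pair.
            if any(frozenset(pair) in conflict_set
                   for pair in combinations(subset, 2)):
                continue
            weight = sum(weights[node] for node in subset)
            if weight > best_weight:
                best, best_weight = subset, weight
    return set(best), best_weight


# Toy candidates and conflicts, invented for illustration.
weights = {"a": 3.0, "b": 2.0, "c": 2.5, "d": 1.0}
conflicts = [("a", "b"), ("a", "c"), ("b", "d")]
chosen, total = max_weight_independent_set(weights, conflicts)
```

In this example the single heaviest candidate "a" is not selected, because the compatible pair "b" and "c" together carries more weight.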
We also present a method called GANA, which uses a genetic algorithm to perform backbone resonance assignment automatically with a high degree of precision and recall. GANA takes spin systems as input data, and almost all spin systems can be mapped correctly onto a target protein, even if the data are noisy. The average recall rates of GANA on BMRB and the four simulated test cases are 99.26%, 99.19%, 98.85%, 98.87% and 97.78%, respectively. The precision and recall rates of GANA on hbSBD are 95.12% and 92.86%, respectively, and those on hbLBD are 100% and 97.40%, respectively.
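For readers unfamiliar with the two measures, the sketch below computes precision and recall for an assignment under one plausible set of definitions: precision is the fraction of reported assignments that are correct, and recall is the fraction of reference assignments that were recovered. These definitions, the dictionary layout, and the names are assumptions and may differ from those used in the evaluation above.

```python
def precision_recall(predicted, reference):
    """Compare two mappings of spin system -> residue index.

    predicted: assignments reported by the method
    reference: assignments taken as ground truth
    """
    correct = sum(
        1 for spin_system, residue in predicted.items()
        if reference.get(spin_system) == residue
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Toy assignments, invented for illustration.
predicted = {"s1": 1, "s2": 2, "s3": 5}
reference = {"s1": 1, "s2": 2, "s3": 3, "s4": 4}
precision, recall = precision_recall(predicted, reference)
```

A method that leaves uncertain spin systems unassigned can reach high precision at the cost of recall, which is why both values are reported.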
(1) RIBRA - An Error-Tolerant Algorithm for the NMR Backbone Assignment Problem (Kuen-Pin Wu, Jia-Ming Chang)
(2) GANA - A Genetic Algorithm for NMR Backbone Resonance Assignment (Hsin-Nan Lin)
(3) An Iterative Relaxation Technique for the NMR Backbone Assignment Problem (J. M. Chang)
References:
1. Wu, K.P., Chang, J.M., Chou, W.C., Chen, J.B., Sung, T.Y., Chang, C.F., Wu, W.J., Huang, T.H. and Hsu, W.L. (2006) RIBRA - An Error-Tolerant Algorithm for the NMR Backbone Assignment Problem. Journal of Computational Biology, 13, 229-244.
2. Lin, H.N., Wu, K.P., Chang, J.M., Sung, T.Y. and Hsu, W.L. (2005) GANA - a genetic algorithm for NMR backbone resonance assignment. Nucleic Acids Research, 33, 4593-4601.
3. Wu, K.P., Chang, J.M., Chou, W.C., Chen, J.B., Sung, T.Y., Chang, C.F., Wu, W.J., Huang, T.H. and Hsu, W.L. (2005) RIBRA - An Error-Tolerant Algorithm for the NMR Backbone Assignment Problem. The Ninth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2005), 103-117.

Predicting RNA-binding sites of proteins using support vector machines and evolutionary information
Pubmed Link:
http://www.ncbi.nlm.nih.gov/pubmed/19091029
English Abstract
RNA-protein interaction plays an essential role
in several biological processes, such as protein
synthesis, gene expression, post-transcriptional
regulation, and antiviral drug discovery.
Identification of RNA-binding sites in proteins
can provide valuable insights for biologists.
However, experimental determination of RNA-protein
interactions remains time-consuming and
labor-intensive. Thus, computational approaches
for the prediction of RNA-binding sites from
protein sequences have become highly desirable.
In this paper, we propose a method, RNAProB, to
predict RNA-binding sites based on support
vector machines and a new encoding scheme for
smoothed position-specific scoring matrix.
Evaluated by five-fold cross-validation, our
method achieves Matthews correlation
coefficient (MCC) values of 0.68, 0.58, and 0.42
compared to 0.45, 0.35, and 0.32 by the
state-of-the-art systems for three benchmark
data sets, respectively. Moreover, to avoid data
overfitting, we use a three-way data split
procedure to estimate our predictive
performance, and our approach obtains MCC values
of 0.67, 0.56, and 0.40, respectively. In
conclusion, our method significantly improves
the predictive performance of RNA-binding site
prediction. The proposed encoding scheme for
smoothed PSSM can be used in other research
problems, such as DNA-protein interaction,
protein-protein interaction, and the prediction of
post-translational modifications.
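
Two pieces of the abstract lend themselves to a short sketch: PSSM smoothing and the Matthews correlation coefficient. The smoothing shown here is an assumption, namely that each row is replaced by the sum of the rows inside a window centered on it; the exact encoding used by RNAProB may differ. The MCC function follows the standard definition, with 0 returned when the denominator is zero as a convention. Function names are hypothetical.

```python
import math


def smooth_pssm(pssm, window=3):
    """Replace each PSSM row by the sum of rows in a centered window.

    Windows are truncated at the sequence ends (an assumed convention).
    """
    half = window // 2
    smoothed = []
    for i in range(len(pssm)):
        lo, hi = max(0, i - half), min(len(pssm), i + half + 1)
        smoothed.append([sum(column) for column in zip(*pssm[lo:hi])])
    return smoothed


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy 3-residue, 2-column matrix; real PSSMs have 20 columns per residue.
smoothed = smooth_pssm([[1, 0], [2, 1], [3, 5]], window=3)
```

Smoothing lets the score at one position reflect its sequence neighbors, and MCC is well suited to binding-site prediction because binding residues are usually a small minority of all residues.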

Current research
1. Transmembrane helix-helix interactions database
The TransMembrane helix-helix interactions database (TMhitDB) is a comprehensive repository of helical interactions from experimentally derived membrane protein structures. In particular, TMhitDB provides pre-calculated geometric descriptors of helix-helix interactions at the helix-packing interface. TMhitDB also includes topology information, lipid accessibility, and the ligands and binding sites of each transmembrane protein. Each record also contains an overview of the protein, such as its sequence, name, experimental details, function, and cross references. TMhitDB provides structural classification and allows extensive queries, browsing, and visualization of data.
2. Re-entrant region of membrane proteins and signal peptide prediction
We are developing two major enhancements to our topology prediction method, SVMtop, by integrating prediction methods that deal with potentially problematic sequences containing re-entrant structures and signal peptides. As membrane protein structures often contain these two types of sequences, they must be clearly distinguished when predicting membrane protein topology. This work will complement the existing method and strengthen the usability of SVMtop.
3. Transmembrane helix-helix crossing angles prediction
We are investigating the relationship between sequence and structure for helix-helix interactions found in membrane proteins. In particular, the sequence determinants of helix-helix crossing angles remain unclear. Currently, we are developing computational models for predicting helix-packing geometries from sequence and other structural information. This work will ultimately aid our constraint-based structure prediction of membrane proteins.
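
For context, the quantity being predicted can be computed from a known structure roughly as follows. The sketch assumes a simple approximation in which a helix axis is the vector between the centroids of the first and last four C-alpha atoms, and it returns only the unsigned angle between two axes; sign conventions for crossing angles are omitted. The function names and the toy coordinates are hypothetical.

```python
import math


def helix_axis(ca_coords, n_end=4):
    """Approximate a helix axis from C-alpha coordinates.

    Uses the vector from the centroid of the first `n_end` atoms to the
    centroid of the last `n_end` atoms (a simplifying assumption).
    """
    def centroid(points):
        return tuple(sum(axis) / len(points) for axis in zip(*points))

    start = centroid(ca_coords[:n_end])
    end = centroid(ca_coords[-n_end:])
    return tuple(e - s for s, e in zip(start, end))


def crossing_angle(axis_a, axis_b):
    """Unsigned angle in degrees between two axis vectors."""
    dot = sum(a * b for a, b in zip(axis_a, axis_b))
    norm_a = math.sqrt(sum(a * a for a in axis_a))
    norm_b = math.sqrt(sum(b * b for b in axis_b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


# Toy coordinates: two straight "helices" along the z and x axes.
helix_1 = [(0.0, 0.0, float(i)) for i in range(8)]
helix_2 = [(float(i), 5.0, 0.0) for i in range(8)]
angle = crossing_angle(helix_axis(helix_1), helix_axis(helix_2))
```

Angles computed in this way from experimental structures could serve as training targets for a sequence-based model.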

Wen-Lian Hsu
Professor, IEEE Fellow
Research Fellow
Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
Phone: 886-2-27883799 ext. 1804
Fax: 886-2-27824814
E-mail: hsu@iis.sinica.edu.tw

Ting-Yi Sung
Research Fellow
Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
Phone: 886-2-27883799 ext. 1711
Fax: 886-2-27824814
E-mail: tsung iis.sinica.edu.tw