Bio-informatics
Principal Investigators:
Our research can be divided into the following areas: 1. Genomics;
2. Proteomics; 3. Structural bioinformatics; 4. Biological literature mining;
and 5. Systems biology. Since 2003, IIS has been recruiting
Ph.D. students in Bioinformatics through the Taiwan International Graduate
Program (TIGP). There are currently 15 students. The number of applicants
this year exceeds the total number of applicants over the past three
years.
(1) Genomics
- EST annotation pipeline system
Large-scale DNA sequencing has become
a methodology used worldwide. EST sequencing is a feasible way to study
functional genomics with limited resources, so there are many EST
sequencing projects around the world. The essential step of
an EST sequencing project, however, is the bioinformatics analysis. A good annotation
pipeline gives biologists the most complete information about their
EST data and helps them with further studies. Over the last few years, in
collaboration with biologists at the Institute of Biomedical Sciences
(IBMS) of Academia Sinica, we have provided the EST annotation service systems
Bio101 and Bio301 for several functional genomics projects, and have also developed
a fast Gene Ontology annotator for EST annotation, called ESTFastAnnotator.
- An efficient synteny mapping and annotation system for vertebrate genomes
Synteny mapping, the detection of regions that are
orthologous between genomes, is a key step in comparative genomics.
Synteny mapping of all model vertebrate genomes will clarify the global view
of genome evolution, extend the functional decoding of genomes
beyond genes alone, and serve as a foundation
for comparative vertebrate genomics. Building on our successful experience with
human-mouse synteny mapping using the UniMarker synteny method, we plan to develop
an efficient synteny mapping and annotation system for vertebrate genomes.
- Large-scale and large-number sequence analysis and its applications to phyloinformatics
In the post-genomic era, the number of sequences available
for reconstructing the evolutionary history of genes and species has increased
20-fold in the past decade, but most tools available to date are limited
to processing small data sets. In particular, the cost of constructing
the tree of a large group of organisms is very high when using an optimal
character-based method such as maximum likelihood. New computational
tools for building very large phylogenetic trees and for comparing and visualizing
huge data sets are therefore urgently needed. In this project, we focus on the study and processing
of alignments of large numbers of sequences and their phylogenies. We plan to build
a problem-solving environment, develop efficient algorithms for constructing
large-scale phylogenetic trees, such as supertrees, and develop effective
tools for visualizing supertrees. We have developed a versatile
alignment visualization system, SinicView (Sequence-aligning INnovative
and Interactive Comparison VIEWer), which enables users to efficiently compare
and evaluate alignment results obtained by different tools. This
work appeared in BMC Bioinformatics 2006. More information can be found
on the website of the Computational Genomics Lab in the Genomics Research Center
at http://biocomp.iis.sinica.edu.tw/.
- Influenza virus study and vaccine strain selection
In recent years, the study of molecular phylogenies
of influenza viruses has become increasingly important for epidemiologists
and evolutionary biologists. Yearly flu epidemics often occur because
the population lacks immunity against newly emerging strains. The World Health Organization
(WHO) annually recommends three influenza strains currently in worldwide
circulation for vaccine manufacturers to mass-produce as vaccines.
Each year Taiwan needs four million doses of flu vaccine, which are imported from
the USA or other countries. To avoid a shortage of drugs and vaccines in case of
a global flu outbreak, the Department of Health, Executive Yuan, has planned
to establish a domestic GMP plant to produce flu vaccines locally. Beginning
in 2006, in collaboration with IBMS, we are participating in a three-year CDC (Center
for Disease Control, Taiwan) project on influenza vaccine development.
We plan to study the evolutionary mechanisms of influenza viruses from
sequence data and to establish a computational model that can predict
strains likely to emerge in coming epidemic seasons. One of our goals
is to provide a quantitative model for vaccine strain selection.
- Agent-based information integration for bioinformatics
More than 600 Web-based bioinformatics resources, including databases, links, and analysis tools, are available for biological knowledge discovery, but integrating them requires tremendous effort. We have been working on an intelligent agent-based approach to information integration and have built a tool for rapidly training intelligent agents. This tool gives bioinformatics researchers a scalable, flexible and extensible way to query these resources through a single uniform interface. With this tool, users can train an agent to learn a specialized session of Web browsing and extraction without programming, using click-and-drag in a programming-by-example manner. As a result, a large number of agents can be created to integrate a large number of Web sites in a matter of days to answer biologists' queries. We have successfully built and deployed many agent-based information integration systems for biologists. One of these systems supports an SNP project of the National Genotyping Center (NGC) at IBMS. This system, FastSNP, allows the researchers at NGC to efficiently prioritize risk-causing SNPs for genotyping, and has enabled them to identify a novel promoter polymorphism that contributes to Adverse Drug Responses (ADRs). Further findings will be announced in the near future. One of the agent-based systems won the best paper award at a local bioinformatics conference in 2003. An article about FastSNP will appear in the 2006 Web Server Special Issue of Nucleic Acids Research.
- Development of Gene Ontology browsing/manipulating utility
Because of the growing need to functionally annotate gene products with Gene Ontology (GO), more and more pipelines are producing GO-related biological data. To help biologists browse and manipulate their data, we designed a Java-based tool called GOBU. The underlying architecture of GOBU supports arbitrary data description, data integration and an extensible user interface, and is therefore suitable for browsing annotations from various pipelines. We have applied this software to our EST annotation pipeline and to microarray experiments.
(2) Proteomics
- Quantitative proteomics based on high-throughput mass spectrometry data
Most current quantitative experiments aim to determine
relative protein expression levels in different cell states or in cells
grown under different conditions. Various stable isotope labeling techniques,
e.g., ICAT and iTRAQ, followed by liquid chromatography-tandem mass
spectrometry (LC-MS/MS), are frequently used to quantify protein expression.
To expedite the analysis of the vast amounts of spectral data generated, we
have developed a fully automated tool for multiplexed quantitation using
iTRAQ labeling, called Multi-Q. This tool is designed as a generic platform
that can accommodate various input data formats from different mass spectrometers
and search engines. This work is in collaboration with the Institute of
Chemistry. According to our collaborating chemists, it is the most advanced
tool of its kind currently available. In addition, we are developing a tool for ICAT
labeling quantitation. In the future, we will adapt our tools to two-plex
or multiplexed quantitative analysis using other isotopic labeling strategies.
Tools for visualization that assist in the biological interpretation
of the data will also be developed.
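As a rough illustration of relative quantitation with iTRAQ reporter ions, the sketch below divides each reporter channel by a reference channel for every spectrum and summarizes each protein by the median ratio. It is not the Multi-Q algorithm: the function name `protein_ratios`, the input layout, the choice of channel 114 as reference, and the use of the median are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


def protein_ratios(spectra, reference="114"):
    """Summarize relative protein abundance from reporter-ion intensities.

    spectra: iterable of (protein_id, {channel: intensity}) pairs, one per
    identified MS/MS spectrum (hypothetical input layout).
    Returns {protein_id: {channel: median ratio to the reference channel}}.
    """
    per_protein = defaultdict(lambda: defaultdict(list))
    for protein_id, intensities in spectra:
        ref = intensities.get(reference, 0.0)
        if ref <= 0:
            continue  # no usable reference signal: skip this spectrum
        for channel, value in intensities.items():
            if channel != reference:
                per_protein[protein_id][channel].append(value / ref)
    return {
        protein: {ch: median(vals) for ch, vals in channels.items()}
        for protein, channels in per_protein.items()
    }


# Made-up intensities for the four 4-plex reporter channels (114 to 117).
example = [
    ("P1", {"114": 100.0, "115": 200.0, "116": 50.0, "117": 100.0}),
    ("P1", {"114": 200.0, "115": 420.0, "116": 100.0, "117": 180.0}),
    ("P1", {"114": 50.0, "115": 90.0, "116": 30.0, "117": 50.0}),
    ("P2", {"114": 0.0, "115": 10.0, "116": 10.0, "117": 10.0}),
]
ratios = protein_ratios(example)
print(ratios)
```

The median is used here only because it is less sensitive to a single outlying spectrum than the mean; a production tool would also need normalization and filtering steps that this sketch omits.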
(3) Structural bioinformatics
- Protein structure prediction
We have developed a hybrid knowledge-based protein secondary
structure prediction algorithm, called HYPROSP II, which combines an existing
machine learning approach, PSIPRED, with a new peptide knowledge-based approach
to prediction. The average prediction accuracy of HYPROSP II is around 82%,
which is higher than that of either PSIPRED or the knowledge-based approach alone.
As more protein structures are determined, the knowledge base is expected
to grow and the prediction accuracy is expected to increase accordingly. Related
papers appeared in Nucleic Acids Research 2004 and Bioinformatics 2005.
We have also applied more biological domain knowledge and machine learning
techniques to related structure prediction problems, such as local structure,
β-turn and transmembrane helix prediction. Once protein secondary structures
can be predicted with improved accuracy, we will turn to predicting tertiary
structures, with emphasis on the protein fold recognition problem.
- Protein 3D structure prediction by fragment assembly
We propose to predict the protein backbone conformation based solely on sequence information. We will achieve this objective using our previously established fragment library and by developing new algorithms to dissect the sequence-structure relationship. We have successfully demonstrated that our fragment library provides a good basis set of building blocks for reconstructing and predicting whole protein structures. However, the exact nature of the relationship between a protein's sequence and its structure remains one of the open challenges in computational biology, and uncovering it is an important goal of this work.
- NMR backbone resonance assignment and NOE experiment
NMR spectroscopy is one of the most widely used experimental techniques for determining
protein structure. An important stage of protein structure determination
by NMR is protein backbone resonance assignment, which is tedious
and time-consuming when done manually. We have developed an iterative relaxation
technique for automatic backbone assignment that can tolerate a large amount
of noise in the data. Our paper was accepted at RECOMB 2005 and was invited
for publication in the Journal of Computational Biology. It is the very first
paper from Taiwan accepted by RECOMB since the conference's inception nine years ago.
A related result based on a genetic algorithm appeared in Nucleic Acids
Research 2005. To extract geometric constraints for the structure calculations
from the NMR spectra, we need to consider NOEs and coupling constants, which
are transformed into distance and dihedral angle constraints. We will develop
an efficient algorithm for NOE data analysis and use the results of this analysis
to improve backbone assignment. This research is in collaboration
with IBMS.
(4) Biomedical literature mining
- Biological term and relation extraction
The Intelligent Agent Systems Lab (IASL) has developed a system
for recognizing biological named entities in biomedical literature.
We use the Maximum Entropy (ME) model and Conditional Random Fields (CRF)
as the underlying machine learning methods, and apply dictionary-based
and rule-based methods as post-processing of the ME output to enhance performance.
Once named entities can be recognized, we aim to recognize relations
between them. We collaborate with biologists on the problems
of recognizing protein-protein interaction relations and gene-disease relations.
A related paper appeared in BMC Bioinformatics 2006.
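The idea of dictionary-based post-processing can be sketched as follows: labels produced by a statistical tagger are overwritten wherever a token span exactly matches a dictionary entry. This is only an illustration under stated assumptions, not the lab's system. The function `apply_dictionary`, the BIO label scheme, the `PROTEIN` entity type, and the longest-match rule are all hypothetical choices for this example.

```python
def apply_dictionary(tokens, labels, dictionary, entity_type="PROTEIN"):
    """Relabel token spans that exactly match a dictionary entry.

    tokens: list of str.
    labels: list of BIO tags from a statistical tagger (assumed input).
    dictionary: set of tuples of lower-cased tokens.
    At each position the longest matching entry wins.
    """
    labels = list(labels)  # work on a copy, leave the input untouched
    max_len = max((len(entry) for entry in dictionary), default=0)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            labels[i] = "B-" + entity_type
            for j in range(i + 1, i + matched):
                labels[j] = "I-" + entity_type
            i += matched
        else:
            i += 1
    return labels


tokens = ["The", "tumor", "necrosis", "factor", "binds", "TNFR1", "."]
tagger_labels = ["O", "O", "O", "B-PROTEIN", "O", "B-PROTEIN", "O"]
dictionary = {("tumor", "necrosis", "factor")}
corrected = apply_dictionary(tokens, tagger_labels, dictionary)
print(corrected)
```

In this made-up sentence the tagger found only the last word of a three-word name, and the dictionary pass extends the entity to the full span while leaving the tagger's other decisions unchanged.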
- Genomic information retrieval
The Intelligent Agent Systems Lab (IASL) participated in the TREC 2005 Genomics Track Ad-hoc Retrieval Contest and placed 6th out of 32 teams. The genomic information retrieval contest combines natural language queries and table search. Because of the many variants of biological terms and the large number of unknown medical words, the retrieval task is particularly difficult. The lab has accumulated many years of experience in developing information extraction, retrieval, natural language processing and question answering systems, and achieved an accuracy of 24.53% (the best team achieved 28%). This performance is very close to that of the top five teams: York University, IBM, University of Waterloo, UCCI and the National Library of Medicine (NLM). In this first year, IASL employed only keyword expansion. In the future the lab will incorporate more biological knowledge to enhance system performance.
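Keyword expansion can be illustrated with a minimal sketch in which each query term is followed by its known synonyms, so that documents using a variant name can still be retrieved. The function `expand_query` and the small synonym table are hypothetical and stand in for whatever resources the actual system used.

```python
def expand_query(terms, synonyms):
    """Return the query terms followed by their synonyms, without duplicates.

    terms: list of query keywords.
    synonyms: dict mapping a lower-cased term to a list of variant names
    (a hypothetical lookup table for this example).
    """
    expanded, seen = [], set()
    for term in terms:
        for candidate in [term, *synonyms.get(term.lower(), [])]:
            key = candidate.lower()
            if key not in seen:  # compare case-insensitively
                seen.add(key)
                expanded.append(candidate)
    return expanded


synonyms = {
    "p53": ["TP53", "tumor protein p53"],
    "apoptosis": ["programmed cell death"],
}
expanded = expand_query(["p53", "apoptosis", "TP53"], synonyms)
print(expanded)
```

Duplicates are removed case-insensitively, so a variant that the user already typed is not added to the query a second time.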
(5) Systems biology
- Network analysis of human protein interactions for tumorigenesis and infectious diseases using systems biology
Advances in molecular biology and in analytical and computational
technologies are enabling us to systematically investigate complicated
molecular processes through the protein interaction networks underlying biological
phenotypes. In this study, we will construct a eukaryotic protein-protein
interaction network from recent high-throughput interactome studies of
various species. All the interactions will be converted into domain-domain
interactions, and conserved network motifs will then be extracted
to infer the protein interactome related to human diseases. Using this model,
we will build a tool that discovers unknown interacting protein
pairs and assigns each a probability score. Using the conserved network model
together with spatio-temporal information, we will decipher the interactions between pathogens and
humans and the progression of carcinogenesis. The critical
target proteins in those networks will be revealed by topological
analysis of the protein network. The interaction network will provide potential
candidates for developing new therapeutic strategies for human cancer and
infectious diseases. The objectives of this study are to improve our understanding
of development, carcinogenesis and infection
mechanisms, and furthermore to introduce a new paradigm for the diagnosis
and treatment of human disease that could change how medical services
are delivered.
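One simple way to picture the conversion of protein-protein interactions into candidate domain-domain interactions is to count, for every pair of domains, how many observed protein interactions it could explain. The sketch below does only that. It is an assumption-laden illustration, not the project's model: the function `domain_pair_counts`, the protein identifiers, and the domain annotations are all made up.

```python
from collections import Counter
from itertools import product


def domain_pair_counts(interactions, domains):
    """Count how many protein interactions each domain pair could explain.

    interactions: iterable of (protein_a, protein_b) pairs.
    domains: dict mapping a protein to its list of domain names.
    Domain pairs are unordered, so each is stored as a sorted tuple.
    """
    counts = Counter()
    for a, b in interactions:
        pairs = {
            tuple(sorted((da, db)))
            for da, db in product(domains.get(a, ()), domains.get(b, ()))
        }
        counts.update(pairs)  # each domain pair counted once per interaction
    return counts


# Hypothetical proteins A to D with illustrative domain annotations.
domains = {
    "A": ["SH3", "kinase"],
    "B": ["proline-rich"],
    "C": ["SH3"],
    "D": ["proline-rich", "PDZ"],
}
interactions = [("A", "B"), ("C", "D"), ("C", "B")]
counts = domain_pair_counts(interactions, domains)
print(counts.most_common())
```

Domain pairs that recur across many interactions are natural candidates for conserved interaction motifs, although raw counts like these would still need to be turned into a probability score by a proper statistical model.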
