Institute of Information Science
 
 
Bio-informatics
Principal Investigators:

[Wen-Lian Hsu] [Der-Tsai Lee] [Arthur Chun-Chieh Shih]
[Ming-Tat Ko] [Chun-Nan Hsu] [Jan-Ming Ho]
[Ting-Yi Sung] [Chung-Yen Lin] [Huai-Kuang Tsai]


[ Bio-informatics ]
Our research is organized into five areas: 1. Genomics; 2. Proteomics; 3. Structural bioinformatics; 4. Biological literature mining; and 5. Systems biology. Since 2003, IIS has recruited Ph.D. students through the Bioinformatics program of the Taiwan International Graduate Program (TIGP). Fifteen students are currently enrolled. This year's applicants outnumber the combined applicants of the past three years.

(1) Genomics
  • EST annotation pipeline system
    Large-scale DNA sequencing is now used worldwide. EST sequencing is a feasible way to study functional genomics with limited resources, so many EST sequencing projects are under way around the world. The essential step of an EST sequencing project, however, is the bioinformatics analysis. A good annotation pipeline gives biologists the most complete information about their EST data and helps them with further studies. Over the last few years, in collaboration with biologists at the Institute of Biomedical Sciences (IBMS) of Academia Sinica, we have provided the EST annotation service systems Bio101 and Bio301 for several functional genomics projects, and we have developed a fast Gene Ontology annotator for EST annotation, called ESTFastAnnotator.
  • An efficient synteny mapping and annotation system for vertebrate genomes
    Synteny mapping, or detecting regions that are orthologous between genomes, is a key step in comparative genomics. Synteny mapping of all model vertebrate genomes will clarify the global view of genome evolution, extend the functional decoding of genomes beyond genes, and serve as a foundation for comparative vertebrate genomics. Building on our successful human-mouse synteny mapping with the UniMarker-synteny method, we plan to develop an efficient synteny mapping and annotation system for vertebrate genomes.
  • Large-scale and large-number sequence analysis and its applications to phyloinformatics
    In the post-genomic era, the number of sequences available for reconstructing the evolutionary history of genes and species has increased 20-fold over the past decade, yet most tools available to date can process only small data sets. In particular, constructing the tree of a large group of organisms is very costly with an optimal character-based method such as maximum likelihood. New computational tools for building very large phylogenetic trees, and for comparing and visualizing huge data sets, are therefore urgently needed. In this project, we focus on the study and processing of alignments of large numbers of sequences and their phylogenies. We plan to build a problem-solving environment, develop efficient algorithms for constructing large-scale phylogenetic trees such as supertrees, and develop effective tools for visualizing supertrees. We have developed a versatile alignment visualization system, SinicView (Sequence-aligning INnovative and Interactive Comparison VIEWer), which enables users to efficiently compare and evaluate alignment results obtained by different tools. This work appeared in BMC Bioinformatics 2006. More information can be found on the website of the Computational Genomics Lab in the Genomics Research Center: http://biocomp.iis.sinica.edu.tw/
  • Influenza virus study and vaccine strain selection
    In recent years, the study of molecular phylogenies of influenza viruses has become increasingly important for epidemiologists and evolutionary biologists. Yearly flu epidemics often occur because the population lacks immunity against newly emerging strains. Each year the World Health Organization (WHO) recommends three influenza strains currently in worldwide circulation so that vaccine manufacturers can mass-produce vaccines. Taiwan, however, needs four million doses of flu vaccine each year, all imported from the USA or other countries. To avoid a shortage of drugs and vaccines in the event of a global flu outbreak, the Department of Health, Executive Yuan, plans to establish a domestic GMP plant to produce flu vaccines locally. Beginning in 2006, in collaboration with IBMS, we are participating in a three-year CDC (Center for Disease Control, Taiwan) project on influenza vaccine development. We plan to study the evolutionary mechanisms of influenza viruses from sequence data and to establish a computational model that can predict strains likely to emerge in coming epidemic seasons. One of our goals is to provide a quantitative model for vaccine strain selection.
  • Agent-based information integration for bioinformatics
    More than 600 Web-based bioinformatics resources, including databases, links, and analysis tools, are available for biological knowledge discovery, but integrating them requires tremendous effort. We have been working on an intelligent agent-based approach to information integration and have built a tool for rapidly training intelligent agents. This tool gives bioinformatics researchers a scalable, flexible, and extensible way to query these resources through a single uniform interface. Users can train an agent to learn a specialized session of Web browsing and extraction without programming: they train it by click-and-drag, in a programming-by-example manner. As a result, a large number of agents can be created to integrate a large number of Web sites in a matter of days to answer biologists' queries. We have successfully built and deployed many agent-based information integration systems for biologists. One of them, FastSNP, supports an SNP project of the National Genotyping Center (NGC) at IBMS. FastSNP allows NGC researchers to efficiently prioritize risk-causing SNPs for genotyping, and it enabled them to identify a novel promoter polymorphism that contributes to Adverse Drug Responses (ADRs). Further findings will be announced in the near future. One of the agent-based systems won the best paper award at a local bioinformatics conference in 2003. An article about FastSNP will appear in the 2006 Web Server Special Issue of Nucleic Acids Research.
  • Development of a Gene Ontology browsing/manipulating utility
    As the demand for functionally annotating gene products with Gene Ontology (GO) grows, more and more pipelines are producing GO-related biological data. To help biologists browse and manipulate their data, we designed a Java-based software tool called GOBU. The underlying architecture of GOBU supports arbitrary data description, data integration, and an extensible user interface, and is therefore suitable for browsing annotations from various pipelines. We have applied this software to our EST annotation pipeline and to microarray experiments.
(2) Proteomics
  • Quantitative proteomics based on high-throughput mass spectrometry data
    Most current quantitative experiments aim to determine relative protein expression levels in different cell states or in cells grown under different conditions. Various stable isotope labeling techniques, e.g., ICAT and iTRAQ, followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS), are frequently used to quantify protein expression. To expedite the analysis of the vast amounts of spectral data generated, we have developed a fully automated tool for multiplexed quantitation using iTRAQ labeling, called Multi-Q. The tool is designed as a generic platform that accommodates various input data formats from different mass spectrometers and search engines. This work is in collaboration with the Institute of Chemistry. According to our collaborating chemists, it is the most advanced tool of its kind available in the world. In addition, we are developing a tool for ICAT labeling quantitation. In the future, we will adapt our tools to two-plex or multiplexed quantitative analysis using other isotopic labeling strategies. We will also develop visualization tools to assist in the biological interpretation of the data.
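    The idea of relative quantitation from iTRAQ reporter ions can be illustrated with a minimal sketch. It is not Multi-Q's algorithm, which the text does not describe. It assumes 4-plex labeling with reporter channels named "114" to "117", divides each channel by a reference channel per spectrum, and summarizes a protein by the median of its spectrum-level ratios. The intensities and protein names are made up for illustration.

```python
from statistics import median

# Assumed 4-plex iTRAQ reporter channels (labels are illustrative).
CHANNELS = ("114", "115", "116", "117")

# Hypothetical reporter-ion intensities, one dict per identified spectrum.
spectra = [
    {"protein": "P1", "114": 1000.0, "115": 2000.0, "116": 500.0, "117": 1000.0},
    {"protein": "P1", "114": 800.0, "115": 1760.0, "116": 400.0, "117": 800.0},
    {"protein": "P2", "114": 300.0, "115": 300.0, "116": 600.0, "117": 150.0},
]


def spectrum_ratios(spectrum, reference="114"):
    """Return each channel's intensity relative to the reference channel."""
    ref = spectrum[reference]
    if ref <= 0:
        return None  # no usable reference signal; skip this spectrum
    return {ch: spectrum[ch] / ref for ch in CHANNELS}


def protein_ratios(spectra, reference="114"):
    """Summarize each protein by the median of its spectrum-level ratios."""
    per_protein = {}
    for s in spectra:
        ratios = spectrum_ratios(s, reference)
        if ratios is not None:
            per_protein.setdefault(s["protein"], []).append(ratios)
    return {
        protein: {ch: median(r[ch] for r in rs) for ch in CHANNELS}
        for protein, rs in per_protein.items()
    }


if __name__ == "__main__":
    for protein, ratios in protein_ratios(spectra).items():
        print(protein, ratios)
```

    The median is used here only because it is robust to a single outlying spectrum; a production tool would also need normalization and noise filtering, which this sketch omits.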
(3) Structural bioinformatics
  • Protein structure prediction
    We have developed a hybrid knowledge-based protein secondary structure prediction algorithm, called HYPROSP II, which combines an existing machine learning approach, PSIPRED, with a new peptide knowledge-based approach. The average prediction accuracy of HYPROSP II is around 82%, which is better than that of either PSIPRED or the knowledge-based approach alone. As more protein structures are determined, the knowledge base is expected to grow and the prediction accuracy to increase. Related papers appeared in Nucleic Acids Research 2004 and Bioinformatics 2005. We have also applied additional biological domain knowledge and machine learning techniques to related structure prediction problems, such as local structure, β-turn, and transmembrane helix prediction. Once protein secondary structures can be predicted with improved accuracy, we will turn to tertiary structure prediction, with emphasis on the protein fold recognition problem.
  • Protein 3D structure prediction by fragment assembly
    We propose to predict the protein backbone conformation based solely on sequence information. We will achieve this objective by using our previously established fragment library and by developing new algorithms to dissect the sequence-structure relationship. We have successfully demonstrated that our fragment library provides a good basis set of building blocks for reconstructing and predicting whole protein structures. However, the exact nature of the relationship between a protein's sequence and its structure remains one of the open challenges in computational biology, and uncovering that relationship is an important goal of this work.
  • NMR backbone resonance assignment and NOE experiment
    NMR spectroscopy is one of the popular experimental methods for determining protein structure. An important stage of NMR-based structure determination is protein backbone resonance assignment, which is tedious and time-consuming when done manually. We have developed an iterative relaxation technique for automatic backbone assignment that can tolerate a large amount of noise in the data. Our paper was accepted at RECOMB 2005 and was invited for publication in the Journal of Computational Biology. It is the very first paper from Taiwan accepted by RECOMB since its inception nine years ago. A related result based on a genetic algorithm appeared in Nucleic Acids Research 2005. To extract geometric constraints for structure calculation from NMR spectra, we need to consider NOEs and coupling constants, which are transformed into distance and dihedral angle constraints. We will develop an efficient algorithm for NOE data analysis and use its results to improve backbone assignment. This research is in collaboration with IBMS.
(4) Biomedical literature mining
  • Biological term and relation extraction
    The Intelligent Agent Systems Lab has developed a system for biological named entity recognition in biomedical literature. We use the Maximum Entropy (ME) model and Conditional Random Fields (CRF) as the underlying machine learning methods, and incorporate dictionary-based and rule-based methods as post-processing of the ME output to enhance performance. Once named entities can be recognized, we aim to recognize relations between them. We collaborate with biologists on recognizing protein-protein interaction relations and gene-disease relations. A related paper appeared in BMC Bioinformatics 2006.
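    Dictionary-based post-processing of a statistical tagger's output can be sketched as follows. This is an illustrative assumption, not the lab's system: the tagger output is assumed to be BIO labels, the dictionary is a small hypothetical set of lowercase entity names, and the only rule is that an untagged span matching a dictionary entry is relabeled, with the longest match tried first.

```python
def dictionary_postprocess(tokens, labels, dictionary,
                           entity_type="PROTEIN", max_len=5):
    """Relabel untagged token spans that match a dictionary entry.

    tokens     -- list of token strings
    labels     -- BIO labels produced by a statistical tagger (assumed format)
    dictionary -- set of lowercase entity names, tokens joined by one space
    """
    labels = list(labels)  # work on a copy; leave the tagger output untouched
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            untagged = all(label == "O" for label in labels[i:i + n])
            if span in dictionary and untagged:
                labels[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_type
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


if __name__ == "__main__":
    tokens = ["The", "tumor", "necrosis", "factor", "binds", "p53", "."]
    tagger_labels = ["O", "O", "O", "O", "O", "B-PROTEIN", "O"]
    protein_names = {"tumor necrosis factor", "p53"}  # hypothetical dictionary
    print(dictionary_postprocess(tokens, tagger_labels, protein_names))
```

    Spans the tagger has already labeled are left alone, so the dictionary only adds entities the statistical model missed.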
  • Genomic information retrieval
    The Intelligent Agent Systems Lab (IASL) participated in the TREC 2005 Genomics Track Ad-hoc Retrieval Contest and placed 6th out of 32 teams. The genomic information retrieval contest combines natural language queries and table search. Because biological terms vary widely and many medical words are unknown, the retrieval task is particularly difficult. Drawing on many years of experience in developing information extraction, retrieval, natural language processing, and question answering systems, the lab obtained an accuracy of 24.53% (the best team obtained 28%). This performance is very close to that of the top five teams: York University, IBM, University of Waterloo, UCCI, and the National Library of Medicine (NLM). In its first year of work, IASL employed only keyword expansion. In the future, the lab will incorporate more biological knowledge to enhance system performance.
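    Keyword expansion, the technique mentioned above, can be illustrated with a small sketch. It is a toy under stated assumptions, not the IASL system: the synonym table is hypothetical, matching is a crude case-insensitive substring test, and documents are ranked by the number of expanded terms they contain.

```python
# Hypothetical synonym table; a real system would draw on curated
# biomedical vocabularies.
SYNONYMS = {
    "p53": {"tp53", "tumor protein 53"},
    "cancer": {"tumor", "carcinoma", "neoplasm"},
}


def expand_query(terms, synonyms=SYNONYMS):
    """Return the lowercase query terms plus any known synonyms."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        expanded |= synonyms.get(term, set())
    return expanded


def score(document, expanded_terms):
    """Count expanded terms occurring in the document (substring match)."""
    text = document.lower()
    return sum(1 for term in expanded_terms if term in text)


def retrieve(documents, terms):
    """Rank documents that match at least one expanded term, best first."""
    expanded = expand_query(terms)
    scored = [(score(doc, expanded), doc) for doc in documents]
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: s[0], reverse=True)
    return [doc for _, doc in ranked]


if __name__ == "__main__":
    docs = [
        "Hepatocellular carcinoma cohort study",
        "Yeast cell cycle regulation",
        "p53 loss drives tumor growth in cancer models",
    ]
    for doc in retrieve(docs, ["p53", "cancer"]):
        print(doc)
```

    In this example the first document contains neither query word and is retrieved only because "carcinoma" is listed as a synonym of "cancer", which is the effect expansion is meant to have when term usage varies.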
(5) Systems biology
  • Network analysis of human protein interactions for tumorigenesis and infectious diseases using systems biology
    Advances in molecular biology and in analytical and computational technologies enable us to investigate systematically, through protein interaction networks, the complicated molecular processes that underlie biological phenotypes. In this study, we will construct a eukaryotic protein-protein interaction network from recent high-throughput interactome studies of various species. All interactions will be converted into domain-domain interactions, and the conserved network motifs will then be extracted to infer the protein interactome related to human diseases. Using this model, we will build a tool that discovers unknown interacting protein pairs and assigns each a probability score. Using the conserved network model together with spatio-temporal information, we will decipher the interactions between pathogens and humans and the progression of carcinogenesis. Topological analysis of the protein network will reveal the critical target proteins in those networks. The interaction network will provide potential candidates for developing new therapeutic strategies for human cancer and infectious diseases. The objectives of this study are to improve our understanding of development, carcinogenesis, and infection mechanisms, and to introduce a new paradigm for the diagnosis and treatment of human disease that could transform current medical services.
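    Converting protein interactions into domain-domain interactions and scoring unknown protein pairs can be sketched as follows. The text does not specify the scoring model, so this sketch assumes a simple association-style estimate: the probability of a domain pair is the fraction of protein pairs containing it that are known to interact, and a protein pair's score is one minus the product of (1 - p) over its domain pairs. All protein names, domain names, and interactions are hypothetical.

```python
from itertools import combinations, product

# Hypothetical domain annotations and known protein-protein interactions.
protein_domains = {
    "A": {"d1"},
    "B": {"d2"},
    "C": {"d1", "d3"},
    "D": {"d2"},
    "E": {"d3"},
}
known_interactions = {frozenset(("A", "B")), frozenset(("C", "D"))}


def domain_pairs(p, q, domains):
    """Unordered domain pairs formed between proteins p and q."""
    return {frozenset((x, y)) for x, y in product(domains[p], domains[q])}


def domain_pair_probabilities(domains, interactions):
    """Estimate P(domain pair interacts) as interacting / observed protein pairs."""
    seen, interacting = {}, {}
    for p, q in combinations(sorted(domains), 2):
        is_known = frozenset((p, q)) in interactions
        for dp in domain_pairs(p, q, domains):
            seen[dp] = seen.get(dp, 0) + 1
            if is_known:
                interacting[dp] = interacting.get(dp, 0) + 1
    return {dp: interacting.get(dp, 0) / n for dp, n in seen.items()}


def interaction_score(p, q, domains, probs):
    """Probability that at least one domain pair between p and q interacts."""
    none_interact = 1.0
    for dp in domain_pairs(p, q, domains):
        none_interact *= 1.0 - probs.get(dp, 0.0)
    return 1.0 - none_interact


if __name__ == "__main__":
    probs = domain_pair_probabilities(protein_domains, known_interactions)
    for p, q in combinations(sorted(protein_domains), 2):
        if frozenset((p, q)) not in known_interactions:
            print(p, q, round(interaction_score(p, q, protein_domains, probs), 3))
```

    The score treats domain pairs as independent, which is a strong simplification; it is used here only to show how domain-level evidence can be turned into a protein-level probability.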

Academia Sinica Institute of Information Science