Institute of Information Science
Bioinformatics Laboratory
Principal Investigators:
Arthur Chun-Chieh Shih (Chair), Jan-Ming Ho, Wen-Lian Hsu,
Chung-Yen Lin, Ting-Yi Sung, Huai-Kuang Tsai

Postdoctoral Fellows:
Yu-Jung Chang, Chia-Ying Cheng, Te-Chin Chu,
Chan-Hsien Lin, Hsin-Nan Lin, Ke-Shiuan Lynn,
Zing Tsung-Yeh Tsai, Yu-Wei Tsay

[ Group Profile ] Our current research focuses on bioinformatics for “omics” studies in two main areas: (i) genomics and transcriptomics, and (ii) proteomics and metabolomics. These areas are described below.
1. Genomics and Transcriptomics Studies
With the rise of next-generation sequencing (NGS) as the predominant technology for genome and transcriptome studies, we have devoted ourselves to developing new methodologies and tools for analyzing NGS data. First, we have proposed computational methods to assemble high-throughput short-read sequences; to this end, we have developed a de novo assembler called JR-Assembler. This tool can assemble a giga-base-pair genome from Illumina short reads and is efficient in both memory usage and CPU time. Second, we have proposed an automated metagenomic data-processing pipeline called MetaABC, which integrates several binning tools with methods for removing artifacts, analyzing unassigned reads, and controlling sampling biases to achieve less biased analysis. Third, to uncover the information hidden in the massive collection of omics data from model and non-model organisms, we have integrated several open-source software packages with our own tools, enabling NGS and other omics data to be combined and analyzed on our web server, the Multi-Omics Online Analysis System (available at …). Fourth, we are working on read alignment: NGS reads are getting longer, but most existing short-read aligners were developed and optimized for reads of 100 bp or shorter. We have developed a new alignment algorithm called Kart, which can efficiently produce reliable, longer alignments with a low error rate and can handle PacBio reads with high accuracy. Fifth, to address the growing computing power required for biological research, we have collaborated with colleagues in our institute to implement a user-friendly tool for biologists called CloudDOE (http://clouddoe.); this tool involves a Hadoop cloud and can substantially reduce the complexity and cost of deploying, executing, enhancing, and managing computing resources.
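To give a feel for what read alignment involves, the following minimal Python sketch maps a read to a reference by k-mer seeding and voting. It is a generic textbook-style illustration and not the Kart algorithm; the function names, the toy reference, and the choice of k are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, index: Dict[str, List[int]], k: int) -> Optional[int]:
    """Return the reference start position supported by the most seeds, or None."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            # A seed at read offset `offset` that hits the reference at
            # `ref_pos` implies the read starts at `ref_pos - offset`.
            votes[ref_pos - offset] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): the read matches the reference at position 6
# except for a single mismatch, so only some seeds hit.
reference = "ACGTTGCATGCCGATAGCTTAGGCTA"
index = build_kmer_index(reference, k=5)
read_start = map_read("CATGCCGTTAGC", index, k=5)
print(read_start)
```

Seeds that overlap the mismatch simply fail to hit, while the remaining seeds still agree on one start position; a real aligner would then verify and extend the candidate region.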

We have been using the aforementioned methods and tools to tackle various biological problems, in particular: (a) gene duplication in C4 plant leaf evolution, (b) reconstruction of regulatory networks of maize leaf development, (c) integration of transcription factors, miRNAs, and epigenetic information to study gene regulation, (d) reconstruction of miRNA-gene regulatory networks in cardiac hypertrophy and B cell differentiation, (e) identification of structural variations in the autism genome, (f) functional analysis of non-coding RNAs in humans, (g) identification of druggable oncogene fusions and the underlying mechanisms, and (h) viral genome recombination and genotyping.
2. Proteomics and Metabolomics
Mass Spectrometry (MS)-based proteomics and metabolomics. MS has become the predominant technology for proteomics research, and protein identification and quantitation are the two main purposes of mass spectral analysis. Previously, we focused on developing bioinformatics systems for quantitation analysis and created three tools, Multi-Q, MaXIC-Q, and IDEAL-Q, for various experimental quantitation approaches. Currently, we are working on improving protein identification. First, we have recently proposed novel methods for glycoprotein identification (including an automated tool called MAGIC), since glycosylation is considered to be the most important post-translational modification (PTM) and the analysis of MS/MS data acquired from glycoproteomic experiments is challenging. Second, we have proposed a method to utilize SWATH-MS data for protein identification. SWATH is a data-independent acquisition method developed in recent years that has attracted considerable attention. Since the high-throughput data generated by SWATH-MS are mainly used for targeted proteomics analysis, we proposed a method called ProDIA to generate in silico MS/MS spectra from SWATH-MS datasets; these spectra can then be searched with sequence database search tools, e.g., MASCOT, for protein identification. Third, we are developing a method to generate peak list files from raw data that yields peaks consistent with the raw data and provides charge information for each peak. Our glycoproteomics research has shown that the majority of currently available converters for generating peak list files from raw spectral data suffer from serious limitations, including failure to provide the charge state or intensity of each peak in a spectrum, and inconsistency between the m/z or intensity of some peaks and the raw data.
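As an illustration of why per-peak charge information matters, the Python sketch below infers a charge state from the spacing between adjacent isotopic peaks and converts an m/z value to a neutral mass. It relies only on two physical constants and is not the peak-list method described above; the function names, tolerance, and example m/z values are assumptions made for this example.

```python
from typing import Optional, Sequence

C13_C12_DELTA = 1.0033548   # mass difference between 13C and 12C, in Da
PROTON_MASS = 1.007276466   # mass of a proton, in Da


def infer_charge(isotope_mz: Sequence[float],
                 max_charge: int = 6,
                 tol: float = 0.01) -> Optional[int]:
    """Guess the charge from the spacing of the first two isotopic peaks.

    For charge z, neighbouring isotopic peaks are about 1.00335 / z apart.
    Returns None if no charge up to `max_charge` fits within `tol`.
    """
    spacing = isotope_mz[1] - isotope_mz[0]
    for z in range(1, max_charge + 1):
        if abs(spacing - C13_C12_DELTA / z) <= tol:
            return z
    return None


def neutral_mass(mz: float, charge: int) -> float:
    """Convert the m/z of a protonated ion to its neutral mass."""
    return charge * (mz - PROTON_MASS)


# Hypothetical isotopic envelope with peaks about 0.5017 apart.
charge = infer_charge([500.0, 500.5017])
mass = neutral_mass(500.0, charge) if charge else None
print(charge, mass)
```

Without the charge state, the same m/z value could correspond to several different neutral masses, which is why a peak list that omits it limits downstream identification.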
Figure 1. The web framework for integrated omics data, used to reveal hidden biological regulation and pathways.

In addition to proteomics, we have also worked on MS-based metabolomics. Since few tools are available for metabolite quantitation, we have developed an automated and highly accurate metabolite quantitation tool. Moreover, we have proposed a computational method for metabolite identification, which involves an effective clustering technique to group a metabolite with its fragments, and then enables searches against different metabolite databases. The proposed method can lead to identification with high sensitivity and accuracy.
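The idea of grouping a metabolite with its fragments can be sketched as follows. The text does not specify the clustering technique, so this Python example assumes a simple criterion: ions that co-elute (nearly identical retention times) are placed in the same group. The function name, the tolerance, and the peak values are hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, retention time in minutes)


def group_by_retention_time(peaks: List[Peak], rt_tol: float = 0.05) -> List[List[Peak]]:
    """Chain peaks into groups whenever consecutive retention times
    differ by at most `rt_tol` minutes."""
    groups: List[List[Peak]] = []
    for peak in sorted(peaks, key=lambda p: p[1]):
        if groups and peak[1] - groups[-1][-1][1] <= rt_tol:
            groups[-1].append(peak)   # co-elutes with the previous peak
        else:
            groups.append([peak])     # starts a new candidate group
    return groups


# Hypothetical peak list: two sets of co-eluting ions.
peaks = [(180.06, 5.01), (86.10, 7.42), (162.05, 5.02),
         (132.10, 7.40), (144.04, 5.03)]
groups = group_by_retention_time(peaks)
for group in groups:
    print([mz for mz, _ in group])
```

Each resulting group could then be searched against a metabolite database as a unit, rather than treating every fragment ion as a separate compound.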

Protein structure and subcellular localization predictions. We work on structure prediction specifically for transmembrane (TM) proteins, since (i) membrane proteins are prominent drug targets, and (ii) TM proteins are a major type of such proteins. We have developed methods for predicting TM topology, helix-helix interaction and contacts, and lipid exposure of each TM residue. Furthermore, we have developed a method for predicting signal peptides, as these can be mistakenly predicted as TM helices. Since determination of protein subcellular localization (PSL) sites through wet-lab experiments is labor intensive and time consuming, we have developed a computational approach, called UniLoc, as a universal predictor for proteins, regardless of organism. UniLoc uses natural language processing techniques to define protein synonyms. A protein synonym is a peptide of n amino acids that indicates a possible sequence variation in the evolution of a protein. UniLoc is built on a proteome-scale database and includes localization sites in prokaryotic and eukaryotic organisms. It can efficiently distinguish between single- and multi-localized proteins and predict localizations with high precision and recall, outperforming most existing predictors. Furthermore, UniLoc can also be used to interpret a prediction with identified template sequences in the database.
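The notion of a peptide of n amino acids can be made concrete with a small Python sketch that enumerates the overlapping n-residue peptides of a sequence and ranks database entries by how many they share with a query. This is a simplified, exact-match illustration and not UniLoc itself, which accounts for sequence variation; the function names, sequences, and protein labels are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def peptide_ngrams(sequence: str, n: int) -> Set[str]:
    """Return the set of all overlapping peptides of length n."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def rank_templates(query: str, database: Dict[str, str], n: int = 3) -> List[Tuple[str, int]]:
    """Rank database proteins by the number of n-residue peptides shared with the query."""
    query_peptides = peptide_ngrams(query, n)
    scores = {name: len(query_peptides & peptide_ngrams(seq, n))
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database of two short sequences.
database = {"protA": "MKTLLIAG", "protB": "GGSSPPQQ"}
ranking = rank_templates("MKTLLVAG", database, n=3)
print(ranking)
```

The top-ranked entries play the role of template sequences: their known localization sites suggest a prediction for the query, and the shared peptides show which parts of the sequence support it.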

Disease-centric human proteome database. The ultimate goal of our MS-based proteomics research is biomarker discovery. To achieve this goal, we have constructed a human proteome database, which contains comprehensive information on the human membrane proteome. Using this database, we have joined other researchers in Taiwan to work on human chromosome 4 in the Chromosome-centric Human Proteome Project (c-HPP), an international project orchestrated by the Human Proteome Organization. At the current stage, the main goal of c-HPP is to detect missing proteins. Using our bioinformatics expertise, we have compiled a list of missing proteins on chromosome 4 for experimental detection by our collaborators.

Collaborators: Since bioinformatics is an interdisciplinary research area, we have been collaborating with principal investigators from the Biodiversity Research Center (BRC), the Genomics Research Center (GRC), the Institute of Biomedical Science (IBMS), the Institute of Chemistry (IC), the Institute of Plant Science and Microbiology (IPSM), and the Institute of Cellular and Organismic Biology (ICOB) at Academia Sinica; the National Health Research Institute; the School of Life Sciences at Yang Ming University; and the College of Bioresources and Agriculture and the College of Life Science at National Taiwan University. We have also been working with physicians from National Taiwan University Hospital. In addition, we have ongoing collaborative research projects with investigators in the Department of Plant Biology and Medical School, Michigan State University, David Geffen School of Medicine at UCLA, and Microsoft Inc.

Figure 2. Omics Database for Model and non-Model Organisms

