Institute of Information Science, Academia Sinica
Topic: TIGP -- On the Use of GWAS Data
Speaker: Dr. Cathy SJ Fann (Institute of Biomedical Sciences, Academia Sinica)
Date: 2012-10-18 (Thu) 14:00 – 15:30
Location: Auditorium 106 at the new IIS Building
Host: TIGP Bioinformatics Program


Current genotyping technology has produced an abundance of data, creating a golden opportunity for genomic data-mining research. Although dissecting the etiology of complex diseases now seems possible, many challenges remain. For example, GWAS data explain only a small portion of heritability, structural genetic variants are not given enough consideration, and phenotypes are usually defined as a broad spectrum, which makes replication results inconsistent.

Previously, we proposed a constrained two-way model (CTWM) to search for expression quantitative trait loci (eQTL) using data from two ethnic populations. The study involved genome-wide gene expression and SNP data.

Because most odds ratios obtained from GWAS range from 1.1 to 1.5, it has been suggested that gene-gene interaction is a prevalent phenomenon in the etiology of common diseases. We explored haplotype-based approaches because they may have greater power than single-locus analyses when SNPs are in strong linkage disequilibrium with the risk locus. We evaluated two data-mining approaches, multifactor dimensionality reduction (MDR) and classification and regression trees (CART), both extended to use haplotypes while accounting for haplotype uncertainty.

High-density genotyping arrays can now screen more than five million genetic markers. As a result, multiple comparisons have become an important issue. We recently proposed a two-stage maximal segmental score (MSS) procedure, which uses region-specific empirical p-values to identify the genomic segments most likely to harbor the disease gene. Our simulation results indicate that MSS has greater power to detect genetic associations than conventional methods.

Common diseases are likely caused by a complex interplay between many genes and environmental factors. Patients with the same diagnosis may differ greatly in the number and severity of their symptoms, suggesting heterogeneity in causal pathways. One way to circumvent this heterogeneity is to study association using endophenotypes. We proposed an analytical procedure for identifying endophenotypes that uses non-negative matrix factorization (NMF) to explore the potential molecular dissimilarities of a complex disease based on microarray data; the adjusted Rand index was also used to select informative transcripts for each molecular subtype. We conducted a simulation study, using gene expression data sets to which genotype information was added, to compare the performance of our proposed method with that of principal component analysis followed by k-means clustering (PCA-K). The results showed that the proposed procedure provides higher power than PCA-K in the different scenarios examined.
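
For readers unfamiliar with the adjusted Rand index mentioned above, the following is a minimal sketch of its standard definition: a chance-corrected measure of agreement between two groupings of the same samples. It is not the speaker's procedure for selecting transcripts, which the abstract does not describe. The function name `adjusted_rand_index` and the example labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples.

    Returns 1.0 for identical partitions (up to renaming of clusters)
    and values near 0.0 for agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two samples")

    # Contingency table: how many samples fall in each (cluster A, cluster B) cell.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Number of sample pairs placed together, per cell and per marginal.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # largest attainable agreement
    if max_index == expected:
        # Degenerate case: both labelings are trivial (one cluster, or all singletons).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical subtype assignments for six samples (illustrative data only).
reference_subtypes = ["A", "A", "A", "B", "B", "B"]
candidate_subtypes = [1, 1, 2, 2, 2, 2]
print(adjusted_rand_index(reference_subtypes, candidate_subtypes))
```

Because the index depends only on which samples are grouped together, the cluster names themselves do not matter, so labelings produced by different methods (for example NMF-based and PCA-K-based assignments) can be scored against the same reference.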