Institute of Information Science, Academia Sinica

Seminar

Computational Method for Improved Detection of Genomic Indels from Next-Gen Genome Sequencing Data

  • Lecturer: Prof. Paul Horton (National Institute for Advanced Industrial Science & Technology (AIST), Japan)
    Host: Chung-Yen Lin
  • Time: 2016-01-21 (Thu.) 10:30 ~ 12:30
  • Location: Auditorium 106 at IIS new Building
Abstract

BACKGROUND: Next-Gen Sequencing has made whole-exome, and even whole-genome, sequencing affordable enough to become commonplace in leading cancer centers around the world. The goal is to compile an accurate list of the genetic differences between the sequenced sample (normal or tumor) and the reference genome. Since compiling the many millions of sequencing reads into such a list is not trivial, many computational methods (e.g. GATK) have been developed for this task.

These tools have proved helpful and do quite well in detecting single-nucleotide differences (SNVs), but paradoxically have much more difficulty in detecting medium-sized (say, 30 bp) insertions and deletions (indels). We hypothesize that the difficulty in accurately detecting indels is not caused by the quantity or quality of the data, but is rather an artifact of computational methods that are overly reliant on inferred read alignment: they work by first fixing the read alignment (sometimes called "the stack") and then performing statistical and heuristic tests given that alignment. Unfortunately, indels often make it difficult or impossible to compute an unambiguously correct alignment, and therefore lead to false calls.
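One familiar source of this ambiguity is repetitive sequence: the same deletion can be placed at several positions and still produce an identical sequence. The toy sketch below illustrates only that point. The function name and the reference string are made up for illustration, and the 2 bp deletion is far shorter than the roughly 30 bp indels discussed above.

```python
def equivalent_deletion_starts(ref, start, length):
    """Return every start position at which deleting `length` bases from
    `ref` yields the same sequence as deleting them at `start`."""
    target = ref[:start] + ref[start + length:]
    return [
        pos
        for pos in range(len(ref) - length + 1)
        if ref[:pos] + ref[pos + length:] == target
    ]


# Hypothetical reference containing a short CA repeat (illustration only).
ref = "GATTACACACAGT"
print(equivalent_deletion_starts(ref, 5, 2))  # [4, 5, 6, 7, 8, 9]
```

Six different placements of the same 2 bp deletion give the same sequence, so an aligner has to pick one of them by convention rather than from evidence in the reads.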

METHOD: We propose a probabilistic model to check the plausibility of candidate called indels. The model is ambitious in that it involves computing the probability of the entire set of reads under two competing hypotheses: the indel call is correct or incorrect.
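One common way to summarize such a comparison is a log-likelihood ratio. The formula below is only a sketch: the symbols R (the set of reads), H_1 (the indel call is correct), H_0 (the indel call is incorrect) and S (the score) are notation introduced here, and the abstract does not state which statistic the model actually uses.

```latex
\[
  S(R) \;=\; \log \frac{P(R \mid H_1)}{P(R \mid H_0)}
\]
```

Under this reading, a larger S(R) means the reads are better explained by the presence of the indel than by its absence.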

Fortunately, we show that by using suffix arrays in an unusual way (indexing on reads instead of the genome), we can in principle efficiently compute good approximations to the desired probabilities.
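As a rough illustration of what indexing on reads could look like, the sketch below builds a suffix array over the concatenated reads and counts how many times a query string occurs in them. It is a minimal sketch, not the speaker's algorithm: the function names, the `$` separator, the toy reads and the naive sort-based construction are all assumptions made here, and a real implementation would need a linear-time or otherwise scalable construction.

```python
def build_read_suffix_array(reads):
    """Concatenate reads with '$' separators and return (text, suffix_array).

    Assumes '$' never occurs inside a read, so no match can span two reads.
    """
    text = "$".join(reads) + "$"
    # Naive construction: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of `pattern` in the reads by binary search."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


# Toy reads, made up for illustration.
reads = ["ACGTAC", "GTACGT", "TTAC"]
text, suffix_array = build_read_suffix_array(reads)
print(count_occurrences(text, suffix_array, "TAC"))  # 3
print(count_occurrences(text, suffix_array, "GGG"))  # 0
```

With such an index, the number of reads supporting a given candidate sequence (for example, the sequence spanning a proposed indel breakpoint) can be looked up without first fixing an alignment of each read to the genome.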

RESULTS: This talk presents work in progress, and we have only partially implemented our algorithm. Nevertheless, we have been able to show that our prototype can significantly reduce false positives when used to rank indel candidates generated by GATK. I will show these preliminary results and discuss future directions.