TR-IIS-05-024

An Adaptive Prototype Classification Method with Applications to Genetic Marker Selection

Ke-Shiuan Lynn, Chin-Chin Lin, Wen-Harn Pan, and Fu Chang


Motivation: Ethnic origin is a complex trait that can be affected by multiple genetic factors. The traditional approach, which studies one gene or a few genes at a time, is not effective in profiling such a complex trait. Owing to advances in high-throughput genotyping, massive amounts of polymorphism (marker) information have become available. Polymorphisms contain information on individuals’ inherited traits, including disease susceptibility, physical appearance, and ethnic origin. However, typing multiple genetic markers can still be costly, and constructing an appropriate ethnic classifier may involve heavy computation. To cope with these problems, we propose a new method that accomplishes two things at a low computational cost: finding a minimum set of genetic markers and constructing an ethnic classifier based on this minimum set.
Results: We present the following three types of results: (1) Tests on artificial datasets with specified degrees of separation suggest that, when population groups have distinct ethnic origins, the number of prototypes and the test accuracy of the classifier constructed by the proposed adaptive prototype learning (APL) method are nearly constant with respect to n, as long as n exceeds a threshold. On the other hand, when the groups are highly admixed, both the number of prototypes and the test accuracy of the constructed classifier become unstable. (2) APL has a much lower training cost than two other methods, STRUCTURE and Support Vector Machines (SVM), while achieving comparable test accuracy. (3) In the largest dataset, consisting of 661 individuals, we achieve 98.8% accuracy with the top 36 markers chosen from 431 STRP markers, and 99.4% accuracy with the top 48 markers chosen from the same set of 431 markers. This compares favorably with two earlier studies that achieved lower accuracy rates using a larger number of markers.

Availability: The algorithm presented in this paper has been implemented in C. Source code is freely available for download at: