TR-IIS-05-024
An Adaptive Prototype Classification Method with Applications to Genetic Marker Selection
Ke-Shiuan Lynn, Chin-Chin Lin, Wen-Harn Pan, and Fu Chang
Abstract
Motivation: Ethnic origin is a complex
trait that can be affected by multiple genetic factors. The traditional method,
based on studying one gene or a few genes at a time, is not effective in profiling
such a complex trait. Owing to advances in high-throughput genotyping,
massive polymorphism (marker) information has become available. Polymorphisms contain
information on individuals’ inherited traits including disease susceptibility,
physical appearance, ethnic origins, etc. However, typing multiple genetic markers
can still be costly, and constructing an appropriate ethnic classifier may involve
heavy computation. To cope with these problems, we propose a new method that
can accomplish two things at a low computational cost: finding a minimum number
of genetic markers and constructing an ethnic classifier based on this minimum
set of markers.
Results: We present the following three types of results: (1)
Tests on artificial datasets with specified degrees of separation
suggest that, when population groups have distinct ethnic origins,
the number of prototypes and the test accuracy of the classifier constructed
by APL are nearly constant with respect to n, as long as n exceeds a threshold.
On the other hand, when the groups are highly admixed, both the number of
prototypes and the test accuracy of the constructed classifier become unstable.
(2) The proposed adaptive prototype learning (APL) method has a much lower training
cost than, and test accuracy comparable to, two other methods: STRUCTURE and Support
Vector Machines (SVM). (3) In the largest dataset, consisting of 661 individuals,
we achieve 98.8% accuracy with the top 36 markers chosen from 431 STRP
markers, and 99.4% accuracy with the top 48 markers chosen from the same set of 431 markers.
This compares favorably with two earlier studies that
achieved lower accuracy rates with larger numbers of markers.
Availability: The algorithm presented
in this paper has been implemented in C. Source code is freely available for
download at:
http://dar.iis.sinica.edu.tw/Download%20area/apl.htm.
Contact: fchang@iis.sinica.edu.tw