Journal of Information Science and Engineering, Vol. 26 No. 6, pp. 1941-1956 (November 2010)

Using Random Forest for Protein Fold Prediction Problem: An Empirical Study

ABDOLLAH DEHZANGI, SOMNUK PHON-AMNUAISUK AND OMID DEHZANGI*
Center of Artificial Intelligence and Intelligent Computing
Faculty of Information Technology
Multimedia University
Cyberjaya, Selangor, 63100 Malaysia
E-mail: {abdollah.dehzangi07; somnuk.amnuaisuk}@mmu.edu.my
*School of Computer Engineering
Nanyang Technological University
Nanyang Avenue, 639798 Singapore
E-mail: omid0002@ntu.edu.sg

The functioning of a protein in biological reactions crucially depends on its three-dimensional structure. Prediction of the three-dimensional structure of a protein (tertiary structure) from its amino acid sequence (primary structure) is considered a challenging task in bioinformatics and molecular biology. Recently, due to tremendous advances in the pattern recognition field, there has been growing interest in applying classification approaches to the protein fold prediction problem. In this paper, Random Forest, a kind of ensemble method, is employed to address this problem. Random Forest is a recently introduced method, based on the bagging algorithm, that trains a group of base classifiers on randomly selected sets of features and then combines the results obtained from the base classifiers by majority voting. To investigate how the number of base learners affects the performance of the Random Forest, twelve different numbers of base classifiers (between 30 and 600) are tried for this classifier. To study the performance of the Random Forest and compare its results with previously reported results, the dataset produced by Ding and Dubchak is used. Our experimental results show that the Random Forest enhances the prediction accuracy (using the same set of features proposed by Dubchak et al.) and reduces the time consumption of the protein fold prediction task, compared to previous works found in the literature.
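The ensemble idea summarized in the abstract (bootstrap sampling, random feature selection, and majority voting) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-split decision stumps as base classifiers instead of full decision trees, draws one random feature subset per base classifier, and uses a made-up two-class toy dataset rather than the Ding and Dubchak data. The names `train_stump`, `train_forest`, and `predict` are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Return a one-split classifier: the (feature, threshold) pair with
    the fewest training errors among the allowed features."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left label.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda r: left_label if r[f] <= t else right_label


def train_forest(rows, labels, n_learners=30, n_features=1, seed=0):
    """Train n_learners stumps, each on a bootstrap sample of the rows
    and a random subset of n_features features (assumed simplification)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feats = rng.sample(range(d), n_features)     # random feature subset
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Combine the base classifiers by majority voting."""
    votes = Counter(stump(row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy data (invented for illustration): class "a" has small feature
# values, class "b" has large ones.
X = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9], [9, 9]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

forest = train_forest(X, y, n_learners=30)
print(predict(forest, [0, 0]), predict(forest, [10, 10]))
```

Changing `n_learners` in this sketch mirrors the kind of sweep over the number of base classifiers that the abstract describes, although the values and data here are illustrative only.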

Keywords: protein fold prediction problem, classifier ensemble, random forest, bootstrap sampling, weak learner, feature selection, random sampling, bagging, prediction performance

Received November 16, 2009; revised February 4, 2010; accepted May 6, 2010.
Communicated by Jorng-Tzong Horng.