Classification of genes related to infectious disease: a new hybrid method for highly imbalanced data sets

Sima Soltani,1,* Javad Sadri,2 Mehrdad Jalali,3

1. Department of Computer Engineering, Islamic Azad University, Mashhad Branch
2. Department of Computer Science & Software Engineering, Faculty of Engineering and Computer Science, Concordia University
3. Department of Computer Engineering, Islamic Azad University, Mashhad Branch

Abstract


Introduction

Exploring gene function is of great importance in health science research. Developing gene classifiers that predict accurately is therefore a crucial and desirable research goal.

Methods

In this paper, we introduce a hybrid model for classifying 24 genes related to infectious disease among many unrelated genes. This is a "highly imbalanced data set", in which one class has far fewer instances than the other. When a data set is imbalanced, minority-class samples tend to be misclassified because the classifier learns the true class boundaries incorrectly. Our model therefore applies clustering to undersample the negative genes and the SMOTE oversampling method to increase the number of positive gene samples. We select a decision tree model for classification and use an ensemble of several classifiers, combined by majority voting, for gene classification.
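The three ingredients described above (cluster-based undersampling of the majority class, SMOTE-style oversampling of the minority class, and majority voting) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice of plain k-means, the rule of keeping the sample nearest each centroid, the toy 2-D data, and the threshold rules standing in for trained decision trees are all hypothetical.

```python
import math
import random
from collections import Counter


def kmeans_undersample(points, k, iters=20, seed=0):
    """Cluster the majority class with plain k-means and keep, for each
    non-empty cluster, the real sample closest to its centroid.
    (One possible cluster-based undersampling rule; an assumption.)"""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return [min(members, key=lambda p: math.dist(p, centroids[i]))
            for i, members in enumerate(clusters) if members]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def majority_vote(classifiers, x):
    """Return the label predicted by most classifiers for sample x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Toy 2-D data (illustrative only, not the paper's gene features).
data_rng = random.Random(42)
negatives = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
positives = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(6)]

reduced_negatives = kmeans_undersample(negatives, k=12)     # at most 12 samples
augmented_positives = positives + smote(positives, n_new=6)  # 6 real + 6 synthetic

# Stand-ins for trained classifiers: three hypothetical threshold rules.
rules = [
    lambda x: int(x[0] > 1.5),
    lambda x: int(x[1] > 1.5),
    lambda x: int(x[0] + x[1] > 3),
]
label = majority_vote(rules, (3.0, 3.0))  # all three rules vote 1
```

In a real pipeline the rebalanced sets would be used to train the base classifiers, and `majority_vote` would combine their predictions on unseen genes.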

Results

We succeeded in building a classifier that classified a large, highly imbalanced data set with 81.12% accuracy, 79% sensitivity and 89% specificity. Our model could also be applied to similar data sets.
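For reference, the three reported measures are conventionally derived from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the function name and the counts in the example are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: positive samples classified correctly / incorrectly
    tn/fp: negative samples classified correctly / incorrectly
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,     # share of all samples classified correctly
        "sensitivity": tp / (tp + fn),     # share of positives that are detected
        "specificity": tn / (tn + fp),     # share of negatives that are rejected
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fn=2, tn=90, fp=10)
```

On imbalanced data, sensitivity and specificity are more informative than accuracy alone, because a classifier that always predicts the majority class can still score a high accuracy.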

Conclusion

Our simulation study shows that the proposed approach improves classification performance compared with similar approaches in the literature. Furthermore, the results indicate that the SMOTE method is suitable for reducing the error rate.

Keywords

Gene classification, imbalanced data set, cluster-based undersampling, SMOTE, ensemble, decision tree