CERP: Classification by Ensembles from Random Partitions

A robust classification procedure is developed based on ensembles of classifiers, with each classifier constructed from a different set of predictors determined by a random partition of the entire set of predictors. CERP combines the results of multiple classifiers to achieve a substantially improved prediction compared to the optimal single classifier. This approach is designed specifically for high-dimensional data sets for which a classifier is sought. By combining classifiers built from each subspace of the predictors, CERP achieves a large computational advantage in tackling the growing problem of dimensionality. Huge data sets need not be handled as a whole; the subspaces of the feature space created through partitioning may be treated independently and separately until after the classifiers are developed. For each subspace of the predictors, we build a classification tree or logistic regression tree whose optimal size is the one that yields the least cost in terms of misclassification errors. Because the predictors are randomly partitioned, logistic regression can be used without variable selection, and without loss of ensemble accuracy, for data with a huge number of available predictor variables and a relatively small number of observations. Our study shows that the overall accuracy of our methods is consistently good compared to other classification methods. For unbalanced data, our approach maintains the balance between sensitivity and specificity more adequately than many other classification methods. A primary area of application is the classification of subjects into cancer-risk or cancer-type categories based on high-dimensional biomedical data. It is anticipated that the proposed methods can be used to improve class prediction in many other areas of application involving high-dimensional prediction sets.
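
The partition-then-combine idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: a nearest-centroid rule stands in for the classification trees and logistic regression trees, the ensemble is combined by simple majority vote (the abstract does not specify the combination rule), and all function names (`random_partition`, `cerp_like_fit`, `cerp_like_predict`) and the synthetic data are hypothetical.

```python
import random
from collections import Counter

def random_partition(n_features, n_subsets, rng):
    """Shuffle predictor indices and deal them into mutually exclusive subsets."""
    idx = list(range(n_features))
    rng.shuffle(idx)
    return [idx[i::n_subsets] for i in range(n_subsets)]

def fit_centroids(X, y, cols):
    """Stand-in base classifier (assumption): per-class means on the given columns."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    return centroids

def predict_centroids(centroids, x, cols):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[label]))
    return min(sorted(centroids), key=dist)

def cerp_like_fit(X, y, n_subsets, seed=0):
    """Fit one base classifier per subset of a random partition of the predictors."""
    rng = random.Random(seed)
    subsets = random_partition(len(X[0]), n_subsets, rng)
    # Each subset is handled independently of the others until prediction time.
    return [(cols, fit_centroids(X, y, cols)) for cols in subsets]

def cerp_like_predict(model, x):
    """Combine the base classifiers by majority vote (assumed combination rule)."""
    votes = Counter(predict_centroids(c, x, cols) for cols, c in model)
    return votes.most_common(1)[0][0]

# Synthetic two-class data (hypothetical): 40 observations, 30 predictors,
# with class 1 shifted by 3 standard deviations on every predictor.
rng = random.Random(42)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(3.0 * label, 1.0) for _ in range(30)] for label in y]

# An odd number of subsets avoids tied votes in the two-class case.
model = cerp_like_fit(X, y, n_subsets=5)
accuracy = sum(cerp_like_predict(model, x) == t for x, t in zip(X, y)) / len(y)
print(f"training accuracy: {accuracy:.2f}")
```

Because each base classifier sees only its own subset of columns, the per-subset fits in `cerp_like_fit` could be run separately or in parallel, which is the computational point the abstract makes about not handling the data set as a whole.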