Open Access Review Article

Scalable Unsupervised Feature Selection for Quantitative Biological Data Using Mixture Models

Marcela Cespedes1*, Amy Chan2, James Doecke1 and for the Alzheimer’s Disease Neuroimaging Initiative3

1*CSIRO Health & Biosecurity/ Australian e-Health Research Centre, Herston, Queensland, Australia

2Polymathian, Brisbane, Queensland, Australia

3Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf


Received Date: December 19, 2023; Published Date: January 03, 2024

Abstract

Supervised feature selection methodologies for quantitative biological data traditionally select only the top few biomarkers, forcing the comparison into two or more groups and discarding many interesting correlated features that may provide more information on the disease process. Here, we present an unsupervised feature selection and prediction algorithm (FSPmix), which investigates the univariate mixture distributions of quantitative data in order to identify potential disease group classifications and rank the selected features by order of importance. Built into the FSPmix algorithm is a parallelized workflow that enables analyses of small to large scale data. When validated on 20 simulated features (sample size N = 200) while accounting for underlying confounding covariates, our algorithm selected features in a similar order of importance to three supervised feature selection alternatives: Random Forests, LASSO and generalized boosted regression models. Applying this method to our motivating data set (72 human brain regions of interest, PET MR from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study, N = 850), we found 46 regions that supported two hidden groups, and the selected features were similar to those chosen by the supervised alternatives. Furthermore, the FSPmix predictions had similar predictive accuracy to unsupervised k-means clustering. This novel algorithm was able to detect underlying groups in both simulated and real data scenarios. FSPmix showed predictive capability comparable to an unsupervised clustering alternative, as well as feature selection performance comparable to three supervised classification algorithms, making it an ideal and scalable exploratory tool for binary response data.
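The abstract does not spell out the internals of FSPmix, so the following Python sketch only illustrates the general idea of screening features through univariate mixture distributions. It rests on several assumptions that are not taken from the article: components are Gaussian, a feature "supports two hidden groups" when a two-component mixture has a lower BIC than a single Gaussian, and the BIC difference serves as the importance score. The bootstrap, covariate adjustment and parallelized workflow are omitted, and all function names (normal_pdf, two_component_loglik, bic_gain) are hypothetical.

```python
import math
import random
import statistics


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def one_component_loglik(values):
    """Log-likelihood of a single Gaussian fitted by maximum likelihood."""
    mean = statistics.fmean(values)
    sd = max(statistics.pstdev(values), 1e-6)
    return sum(math.log(max(normal_pdf(x, mean, sd), 1e-300)) for x in values)


def two_component_loglik(values, n_iter=200):
    """Log-likelihood of a two-component Gaussian mixture fitted by EM."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Assumed initialisation: means of the lower and upper halves of the data.
    means = [statistics.fmean(ordered[:half]), statistics.fmean(ordered[half:])]
    sds = [max(statistics.pstdev(values), 1e-6)] * 2
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for each observation.
        resp = []
        for x in values:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = p[0] + p[1]
            resp.append(p[0] / total if total > 0 else 0.5)
        # M-step: update weight, mean and standard deviation of each component.
        for k, r_k in enumerate((resp, [1.0 - r for r in resp])):
            n_k = max(sum(r_k), 1e-9)
            weights[k] = n_k / len(values)
            means[k] = sum(r * x for r, x in zip(r_k, values)) / n_k
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(r_k, values)) / n_k
            sds[k] = max(math.sqrt(var), 1e-3)
    return sum(
        math.log(max(sum(w * normal_pdf(x, m, s)
                         for w, m, s in zip(weights, means, sds)), 1e-300))
        for x in values
    )


def bic_gain(values):
    """BIC(one Gaussian) minus BIC(two-component mixture); positive favours two groups."""
    n = len(values)
    bic_one = 2 * math.log(n) - 2 * one_component_loglik(values)
    bic_two = 5 * math.log(n) - 2 * two_component_loglik(values)
    return bic_one - bic_two


# Toy data (simulated here, not from the article): one bimodal and one unimodal feature.
rng = random.Random(0)
features = {
    "bimodal": [rng.gauss(0, 1) for _ in range(100)]
    + [rng.gauss(5, 1) for _ in range(100)],
    "unimodal": [rng.gauss(0, 1) for _ in range(200)],
}
scores = {name: bic_gain(vals) for name, vals in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this sketch each feature is scored independently of the others, which is what would make a per-feature parallel workflow straightforward; how FSPmix itself distributes the work is not described in the abstract.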

Keywords: Classification & prediction algorithm; bootstrap; feature selection; parallelised computing; importance feature ranking
