Date of Award
2026
Degree Name
Mathematics
College
College of Science
Type of Degree
M.S.
Document Type
Thesis
First Advisor
Dr. Raid Al-Aqtash
Second Advisor
Dr. Alfred Akinsete
Third Advisor
Dr. Laura Adkins
Abstract
Accurate prediction of disease outcomes is crucial for improving clinical decision-making and enabling early intervention. This study compares the performance of various statistical and machine learning models for clinical risk prediction using two healthcare datasets: diabetic retinopathy and heart disease. The models assessed include Logistic Regression, LASSO, k-Nearest Neighbors (KNN), Support Vector Machines (SVM), Neural Networks, Random Forests, Gradient Boosting Machines (GBM), and a stacked ensemble model. Prior to modeling, each dataset was split into training and test sets. Numeric features were standardized and categorical features were one-hot encoded; the same transformations were then applied to the test set. For the diabetic retinopathy dataset, Principal Component Analysis (PCA) was used to address multicollinearity among the microaneurysm and exudate features before model training. Ten-fold cross-validation was performed on the training set, with the area under the receiver operating characteristic curve (ROC-AUC) used as the criterion for selecting hyperparameters. Models were evaluated on accuracy, sensitivity, specificity, precision, F1-score, and ROC-AUC. The results indicate that ensemble methods performed best on the diabetic retinopathy dataset. Gradient Boosting achieved the highest ROC-AUC (0.836), while Random Forest had the highest accuracy (77.0%). The stacking model showed the highest sensitivity (0.771), indicating improved detection of positive cases of diabetic retinopathy. Conversely, simpler models performed better on the heart disease dataset. Logistic Regression had the highest overall accuracy (79.6%) with balanced sensitivity (70.8%) and specificity (86.7%), while LASSO exhibited the strongest discrimination with a ROC-AUC of 0.85. Ensemble models such as Gradient Boosting and Random Forest remained competitive, with AUC values of 0.835 and 0.822, respectively. In conclusion, the study suggests that predictive performance is influenced by dataset characteristics. Ensemble models proved advantageous for the higher-dimensional diabetic retinopathy data, while simpler linear and regularized models were more effective for the structured clinical variables in the heart disease dataset. These findings underscore the importance of selecting appropriate prediction models based on the statistical properties of the data when developing clinical risk prediction tools.
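The evaluation metrics named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the thesis code: the function names (`confusion_counts`, `classification_metrics`, `roc_auc`) are hypothetical, labels are assumed to be coded 0/1 with 1 as the positive class, and ROC-AUC is computed from its pairwise-ranking definition rather than with whatever software the author used.

```python
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded 0/1 (assumed coding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    """Accuracy, sensitivity, specificity, precision and F1 from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    # Each ratio falls back to 0.0 when its denominator is empty.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This O(n_pos * n_neg) form is meant for illustration, not for large data.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative case")
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up toy data for illustration only; not taken from either dataset.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cutoff is an assumption
    print(classification_metrics(y_true, y_pred))
    print(roc_auc(y_true, scores))
```

The sketch separates threshold-dependent metrics (accuracy, sensitivity, specificity, precision, F1), which need hard class predictions, from ROC-AUC, which depends only on how the predicted scores rank positive cases against negative ones. That distinction is one reason a model can lead on ROC-AUC without leading on accuracy, as in the results reported above.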
Subject(s)
Mathematics.
Statistics.
Medical sciences.
Medical sciences -- Statistics.
Probabilities.
Diabetic retinopathy.
Heart -- Diseases.
Machine learning.
Statistics -- Models.
Recommended Citation
Agbley, Mercy Mawusi, "Comparative machine learning models for disease risk prediction" (2026). Theses, Dissertations and Capstones. 2061.
https://mds.marshall.edu/etd/2061
Included in
Applied Mathematics Commons, Applied Statistics Commons, Mathematics Commons, Medicine and Health Sciences Commons, Statistical Methodology Commons, Statistical Models Commons, Statistical Theory Commons, Vital and Health Statistics Commons
