Date of Award

2026

Degree Name

Mathematics

College

College of Science

Type of Degree

M.S.

Document Type

Thesis

First Advisor

Dr. Raid Al-Aqtash

Second Advisor

Dr. Alfred Akinsete

Third Advisor

Dr. Laura Adkins

Abstract

Accurate prediction of disease outcomes is crucial for improving clinical decision-making and enabling early intervention. This study compares the performance of several statistical and machine learning models for clinical risk prediction on two healthcare datasets: diabetic retinopathy and heart disease. The models assessed are Logistic Regression, LASSO, k-Nearest Neighbors (KNN), Support Vector Machines (SVM), Neural Networks, Random Forests, Gradient Boosting Machines (GBM), and a stacked ensemble model. Before modeling, each dataset was split into training and test sets. Numeric features were standardized and categorical features were one-hot encoded, and the same transformations were then applied to the test set. For the diabetic retinopathy dataset, Principal Component Analysis (PCA) was used to address multicollinearity among the microaneurysm and exudate features before model training. Ten-fold cross-validation was performed on the training set, with the area under the receiver operating characteristic curve (ROC-AUC) as the criterion for selecting hyperparameters. Models were evaluated on accuracy, sensitivity, specificity, precision, F1-score, and ROC-AUC. The results indicate that ensemble methods performed best on the diabetic retinopathy dataset. Gradient Boosting achieved the highest ROC-AUC (0.836), while Random Forest had the highest accuracy (77.0%). The stacking model showed the highest sensitivity (0.771), indicating improved detection of positive cases of diabetic retinopathy. Conversely, simpler models performed better on the heart disease dataset. Logistic Regression had the highest overall accuracy (79.6%) with balanced sensitivity (70.8%) and specificity (86.7%), while LASSO exhibited the strongest discrimination with a ROC-AUC of 0.85. Ensemble models such as Gradient Boosting and Random Forests remained competitive, with ROC-AUC values of 0.835 and 0.822, respectively. In conclusion, the study suggests that predictive performance is influenced by dataset characteristics. Ensemble models proved advantageous for the higher-dimensional diabetic retinopathy data, while simpler linear and regularized models were more effective for the structured clinical variables in the heart disease dataset. These findings underscore the importance of selecting prediction models based on the statistical properties of the data when developing clinical risk prediction tools.
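
The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code: the function names, the 0.5 decision threshold, and the ten toy labels and scores are assumptions made up for illustration and are unrelated to either dataset. ROC-AUC is computed here through its rank interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
from typing import Sequence


def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely; returning 0.0 for an empty denominator is a convention assumed here."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Threshold-based metrics for binary labels, where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    sensitivity = _ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = _ratio(tn, tn + fp)   # true negative rate
    precision = _ratio(tp, tp + fp)     # positive predictive value
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


# Hypothetical toy data: 4 positive and 6 negative cases with predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, sensitivity, specificity, precision, and F1 all depend on the chosen threshold, whereas ROC-AUC depends only on how the scores rank the cases. This is one reason the model with the highest accuracy and the model with the highest ROC-AUC need not coincide.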

Subject(s)

Mathematics.

Statistics.

Medical sciences.

Medical sciences -- Statistics.

Probabilities.

Diabetic retinopathy.

Heart -- Diseases.

Machine learning.

Statistics -- Models.
