Doctor of Philosophy (Ph.D.), Bowling Green State University, 2024, Data Science
Statistical and machine learning methods are used extensively in the collection, evaluation,
and presentation of biological data. This reflects a need for precise quantitative
assessment of the different types of challenges encountered in the field of healthcare. However,
the sparse nature of medical data makes hidden patterns hard to find and, as a result, makes
prediction a complex task.
This dissertation research discusses several biostatistical methods, including sample size
determination in a balanced clinical trial, estimating cohort risk from case-control information,
the odds ratio, and the Cochran-Mantel-Haenszel odds ratio, along with examples and an analysis
of a real-life dataset to further solidify the concepts.
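As a rough illustration of two of the quantities named above, the following sketch computes the cross-product odds ratio for a single 2x2 table and the Mantel-Haenszel pooled odds ratio across strata. The function names, the table layout (a, b, c, d) and the counts are assumptions made for this example only; they are not taken from the dissertation or its dataset.

```python
from typing import List, Tuple

# One 2x2 table as (a, b, c, d) -- a layout assumed for this sketch:
#   a = exposed cases,   b = exposed controls
#   c = unexposed cases, d = unexposed controls
Table = Tuple[float, float, float, float]


def odds_ratio(table: Table) -> float:
    """Cross-product odds ratio (a*d)/(b*c) for a single 2x2 table."""
    a, b, c, d = table
    return (a * d) / (b * c)


def cmh_odds_ratio(strata: List[Table]) -> float:
    """Mantel-Haenszel pooled odds ratio: sum(a*d/n) / sum(b*c/n) over strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical counts for two strata (for instance, two age groups).
strata: List[Table] = [(10, 20, 5, 40), (15, 10, 10, 20)]

# Crude table: cell-wise sum over strata, ignoring the stratifying variable.
crude: Table = tuple(sum(cells) for cells in zip(*strata))

for i, table in enumerate(strata, start=1):
    print(f"stratum {i} odds ratio: {odds_ratio(table):.3f}")
print(f"crude odds ratio:  {odds_ratio(crude):.3f}")
print(f"pooled (MH) ratio: {cmh_odds_ratio(strata):.3f}")
```

With these made-up counts the stratum-specific odds ratios are 4.0 and 3.0, and the pooled estimate falls between them, which is the behavior one would expect from a weighted combination of the strata.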
Moreover, several classification models, namely Random Forest, Gradient Boosting, Support
Vector Machine (SVM), Naive Bayes, K-Nearest Neighbors (KNN), Decision Tree (DT), Logistic
Regression, and Artificial Neural Network (ANN), are applied to the Wisconsin Breast
Cancer (diagnostic and original) datasets, and their performance is compared. Later,
these classification models are also used in conjunction with ensemble learning methods, since
ensemble methods significantly improve the predictive outcomes of the classification models.
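The abstract does not say which ensemble methods are used, so the sketch below assumes one common choice, hard (majority) voting, purely for illustration. The three "models" are hypothetical lists of predicted labels rather than fitted classifiers, and the labels are invented.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by hard (majority) voting.

    predictions[m][i] is the label model m assigns to sample i.
    With an odd number of binary classifiers no ties can occur.
    """
    combined = []
    for votes in zip(*predictions):
        # most_common(1) yields the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels (1 = malignant, 0 = benign) and model outputs.
y_true = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 0, 1, 0, 0]  # wrong on sample 2
model_b = [1, 1, 1, 1, 0, 0]  # wrong on sample 1
model_c = [1, 0, 1, 0, 0, 0]  # wrong on sample 3

ensemble = majority_vote([model_a, model_b, model_c])
for name, pred in [("A", model_a), ("B", model_b), ("C", model_c),
                   ("vote", ensemble)]:
    print(f"{name}: accuracy = {accuracy(y_true, pred):.3f}")
```

The toy predictions are constructed so that the three models err on different samples, which is the situation in which voting helps most; when base models make correlated errors the gain can be small or absent.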
The classification models are evaluated using accuracy, AUC score,
precision, and recall. Among the tree-based classification models, Random Forest (alone and in
conjunction with ensemble learning) gives the highest accuracy, whereas in the later chapter
the Artificial Neural Network gives the highest accuracy.
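For readers unfamiliar with the four metrics, here is a minimal sketch of how they can be computed for a binary classifier. It assumes labels coded as 1 (positive) and 0 (negative), a 0.5 decision threshold, and invented scores; none of these values come from the dissertation. AUC is computed here through its pairwise-ranking interpretation rather than by integrating a ROC curve.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def auc_score(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative; tied scores count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
auc = auc_score(y_true, scores)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} auc={auc:.4f}")
```

Accuracy, precision, and recall depend on the chosen threshold, while AUC depends only on how the scores rank positives against negatives, which is one reason the metrics are usually reported together.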
Committee: John Chen Ph.D. (Committee Chair); Mohammadali Zolfagharian Ph.D. (Other); Umar Islambekov Ph.D. (Committee Member); Qing Tian Ph.D. (Committee Member)
Subjects: Biostatistics; Statistics