Files
ETD_Data7990Revised7_13_2023V6_SubmittedOhioLink.pdf (10.37 MB)
Digital Accessibility Report
ETD_Data7990Revised7_13_2023V6_SubmittedOhioLink.pdf.accreport.html (7.97 KB)
Cost-Aware Machine Learning and Deep Learning for Extremely Imbalanced Data
Author: Ahmed, Jishan
ORCID® Identifier: http://orcid.org/0000-0002-8814-4265
Permalink: http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1688685109278097
Year and Degree
2023, Doctor of Philosophy (Ph.D.), Bowling Green State University, Data Science.
Abstract
Many real-world datasets, such as those used for failure and anomaly detection, are severely imbalanced, with a relatively small number of failed instances compared to the number of normal instances. This imbalance often biases learning towards the majority class, and mitigating that bias is a serious challenge. To address these issues, this dissertation leverages the Backblaze HDD data and makes several contributions to hard drive failure prediction. It begins with an evaluation of current state-of-the-art techniques and the identification of their shortcomings. Multiple facets of machine learning (ML) and deep learning (DL) approaches to these challenges are explored. The synthetic minority over-sampling technique (SMOTE) is investigated by evaluating its performance with different distance metrics and nearest neighbor search algorithms, and a novel approach that integrates SMOTE with Gaussian mixture models (GMM), called GMM SMOTE, is proposed to address various issues. Subsequently, a comprehensive analysis of different cost-aware ML techniques applied to disk failure prediction is provided, emphasizing the challenges in current implementations. The research also explores a variety of cost-aware DL models, from 1D convolutional neural networks (CNN) and long short-term memory (LSTM) models to a hybrid model combining 1D CNN and bidirectional LSTM (BLSTM) approaches, to exploit the sequential nature of hard drive sensor data. A modified focal loss function is introduced to address the class imbalance prevalent in the hard drive dataset. The DL models are compared to traditional ML algorithms, such as random forest (RF) and logistic regression (LR), and demonstrate superior results, suggesting the potential effectiveness of the proposed focal loss function.
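For readers unfamiliar with focal loss, the following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). It is not the modified loss introduced in the dissertation, whose exact form is not given in this abstract. The function name, the default values of `gamma` and `alpha`, and the convention that label 1 marks a failed drive are all assumptions made for illustration.

```python
import math

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.5, eps=1e-7):
    """Mean standard binary focal loss over a batch (illustrative only).

    y_true: iterable of 0/1 labels (assumed: 1 = failed drive, minority class).
    p_pred: iterable of predicted probabilities for class 1.
    gamma:  focusing parameter; gamma = 0 reduces to weighted cross-entropy.
    alpha:  weight on the positive class; negatives get 1 - alpha.
    """
    total = 0.0
    count = 0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)            # clip to avoid log(0)
        p_t = p if y == 1 else 1.0 - p             # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma shrinks the loss of well-classified examples
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
        count += 1
    return total / count

# Hypothetical batch: one failed drive among mostly healthy ones.
labels = [0, 0, 0, 0, 1]
probs = [0.05, 0.10, 0.02, 0.20, 0.30]
loss = binary_focal_loss(labels, probs)
```

With `gamma > 0`, confidently correct predictions on the abundant healthy drives contribute very little, so the rare failures carry more of the gradient signal. Raising `alpha` above 0.5 would additionally up-weight the minority class.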
In addition to these efforts, this dissertation aims to provide a comprehensive understanding of hard drive longevity and the critical factors contributing to eventual drive failure through survival analysis. It employs survival analysis to enhance sampling effectiveness, preferentially including observations associated with higher hazards. Techniques like permutation feature importance, Shapley values, and Cox regression are used to identify the key factors influencing drive failure. This work also lays the groundwork for future research on efficient strategies for handling imbalanced data and predictive maintenance in a big data framework.
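Of the techniques named above, permutation feature importance is the simplest to illustrate: shuffle one feature column, re-score the model, and record how much the score drops. The sketch below is a generic version under stated assumptions, not the dissertation's implementation. The helper names, the toy data, and the use of plain accuracy as the score are hypothetical; on severely imbalanced data a different metric would usually be swapped in.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled.

    predict: callable mapping a list of rows to a list of labels (assumed).
    X:       list of rows, each row a list of feature values.
    y:       list of true labels.
    """
    rng = random.Random(seed)
    baseline = accuracy(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy example: the label depends only on feature 0; feature 1 is noise.
X_toy = [[i, (i * 7) % 5] for i in range(20)]
y_toy = [1 if row[0] >= 10 else 0 for row in X_toy]

def toy_predict(rows):
    # Hypothetical stand-in for a fitted model.
    return [1 if row[0] >= 10 else 0 for row in rows]

scores = permutation_importance(toy_predict, X_toy, y_toy)
```

A feature the model ignores gets an importance of zero, because shuffling it cannot change any prediction, while a feature the model relies on shows a positive drop.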
Committee
Robert C. Green II, Ph.D. (Committee Chair)
Liuling Liu, Ph.D. (Other)
Umar D Islambekov, Ph.D. (Committee Member)
Junfeng Shang, Ph.D. (Committee Member)
Pages
160 p.
Subject Headings
Computer Science; Statistics
Keywords
Machine learning; Cost-sensitive learning; Resampling techniques; Failure prediction; Class imbalance; Deep learning; Focal loss; LSTM; BLSTM; 1D CNN; Survival analysis; Permutation importance; Feature selection; SHAP; PySpark
Recommended Citations
APA Style (7th edition): Ahmed, J. (2023). Cost-Aware Machine Learning and Deep Learning for Extremely Imbalanced Data [Doctoral dissertation, Bowling Green State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1688685109278097
MLA Style (8th edition): Ahmed, Jishan. Cost-Aware Machine Learning and Deep Learning for Extremely Imbalanced Data. 2023. Bowling Green State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1688685109278097.
Chicago Manual of Style (17th edition): Ahmed, Jishan. "Cost-Aware Machine Learning and Deep Learning for Extremely Imbalanced Data." Doctoral dissertation, Bowling Green State University, 2023. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1688685109278097
Document number: bgsu1688685109278097
Download Count: 295
Copyright Info
© 2023, some rights reserved.
Cost-Aware Machine Learning and Deep Learning for Extremely Imbalanced Data by Jishan Ahmed is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at etd.ohiolink.edu.
This open access ETD is published by Bowling Green State University and OhioLINK.