Robust and Efficient Feature Selection for High-Dimensional Datasets

Mo, Dengyao

ucin1299010108.pdf (2.68 MB)

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=ucin1299010108

Year and Degree

2011, PhD, University of Cincinnati, Engineering and Applied Science: Mechanical Engineering.

Abstract

Feature selection is an active research topic in the machine learning and knowledge discovery in databases (KDD) communities. It makes data mining models more comprehensible to domain experts, improves their prediction performance and robustness, and reduces the cost of model training. This dissertation addresses three issues that many current feature selection studies overlook: feature interaction, data imbalance, and multiple feature subsets. Most extant filter feature selection methods are pairwise comparison methods: they test one predictor variable against the response variable at a time and produce a correlation measure for each feature. Such methods cannot take feature interactions into account. Data imbalance is the second issue: if it is ignored, the selected features will be biased towards the majority class. Third, in high-dimensional datasets with sparse data samples, many different feature sets may be highly correlated with the output, and domain experts usually expect to be given multiple feature sets so that they can evaluate them using their domain knowledge. This dissertation addresses all three issues with a criterion called minimum expected cost of misclassification (MECM). MECM is a model-independent evaluation measure that evaluates the classification power of the tested feature subset as a whole, and it has adjustable weights to deal with imbalanced datasets. A number of case studies showed that MECM has favorable properties for searching for a compact subset of interacting features. In addition, an algorithm and a corresponding data structure were developed to produce multiple feature subsets.
If successful, this research will have broad applications in fields ranging from engineering and business to bioinformatics, such as credit card fraud detection, spam filtering for email, and gene selection for disease diagnosis.
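
The abstract does not state the MECM formula, so the following Python sketch is only an illustration of the general idea, not the dissertation's method. It assumes discrete features and a textbook-style empirical expected cost of misclassification: within each combination of feature values, the class with the lowest weighted error is predicted. The function name min_expected_cost, the cost dictionary, and the toy datasets are all hypothetical. The XOR example shows why a subset has to be scored as a whole, and the second example shows how a class weight shifts the decision towards the minority class.

```python
from collections import Counter, defaultdict
from itertools import product


def min_expected_cost(rows, labels, subset, cost):
    """Empirical minimum expected misclassification cost of a feature subset.

    Illustrative assumption, not the dissertation's definition:
    cost[c] is the penalty for misclassifying one sample whose true class is c.
    """
    # Group the samples by their joint values on the chosen features.
    cells = defaultdict(Counter)
    for row, y in zip(rows, labels):
        cells[tuple(row[i] for i in subset)][y] += 1

    classes = set(labels)
    total = 0.0
    for counts in cells.values():
        # Predicting class p in a cell misclassifies every other class there;
        # keep the cheapest prediction for this cell.
        total += min(
            sum(cost[c] * counts[c] for c in classes if c != p)
            for p in classes
        )
    return total / len(labels)


# Toy data 1: the label is the XOR of two binary features (pure interaction).
xor_rows = list(product([0, 1], repeat=2)) * 25
xor_labels = [a ^ b for a, b in xor_rows]
unit_cost = {0: 1.0, 1: 1.0}

cost_first = min_expected_cost(xor_rows, xor_labels, (0,), unit_cost)
cost_second = min_expected_cost(xor_rows, xor_labels, (1,), unit_cost)
cost_both = min_expected_cost(xor_rows, xor_labels, (0, 1), unit_cost)

# Toy data 2: an imbalanced problem with one binary feature.
# x = 0 -> 80 majority samples; x = 1 -> 12 majority and 8 minority samples.
imb_rows = [(0,)] * 80 + [(1,)] * 20
imb_labels = [0] * 80 + [0] * 12 + [1] * 8

cost_unweighted = min_expected_cost(imb_rows, imb_labels, (0,), unit_cost)
cost_weighted = min_expected_cost(imb_rows, imb_labels, (0,), {0: 1.0, 1: 3.0})

print(cost_first, cost_second, cost_both)
print(cost_unweighted, cost_weighted)
```

Under these assumptions, each XOR feature on its own scores no better than guessing (cost 0.5), while the pair scores 0, which a one-feature-at-a-time filter would miss. In the imbalanced example, tripling the minority-class weight makes it cheaper to predict the minority class when x = 1, so the minority samples are no longer written off as errors.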

Committee

Hongdao Huang, PhD (Committee Chair)
Sundararaman Anand, PhD (Committee Member)
Jaroslaw Meller, PhD (Committee Member)
David Thompson, PhD (Committee Member)
Michael Wagner, PhD (Committee Member)

Pages

129 p.

Subject Headings

Information Systems

Keywords

Feature Selection; Data Mining; Machine Learning; Statistical Modeling; Knowledge Discovery in Database

Recommended Citations

APA Style (7th edition)
Mo, D. (2011). Robust and Efficient Feature Selection for High-Dimensional Datasets [Doctoral dissertation, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1299010108

MLA Style (8th edition)
Mo, Dengyao. Robust and Efficient Feature Selection for High-Dimensional Datasets. 2011. University of Cincinnati, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1299010108.

Chicago Manual of Style (17th edition)
Mo, Dengyao. "Robust and Efficient Feature Selection for High-Dimensional Datasets." Doctoral dissertation, University of Cincinnati, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1299010108

Document number:

ucin1299010108

Download Count:

783

Copyright Info

Robust and Efficient Feature Selection for High-Dimensional Datasets by Dengyao Mo is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at etd.ohiolink.edu.
This open access ETD is published by University of Cincinnati and OhioLINK.
