Selecting Small, Diverse Training Sets for Effective QSAR Models

Wiegand, Emily Marie

2024, Master of Science, Ohio State University, Chemical Engineering.
A quantitative structure-activity relationship (QSAR) model relates the structure and/or physical properties of molecules to their biological activity. A good QSAR model can predict the environmental fate of molecules and support rapid, low-cost screening of chemical compounds for potential toxicity, reducing or eliminating the need for experimental studies. The dataset used to train a QSAR model affects its performance. Training sets are often small because experimental data are limited, which raises concern over the reliability of a model’s predictions. In other cases, very large training sets may increase computation time and cost. In response to these issues, this work proposes that a smaller, diverse training set of molecules can be used to build a model that performs as well as one built on a larger training set. A structurally diverse training set is spread relatively uniformly throughout a given chemical space, increasing the chance that a new molecule is reliably predicted by the model. The dataset used to test this proposal consists of 1603 organic molecules, with ready biodegradability as the endpoint. ToxPrints were used as features to represent the compounds. The MaxMin algorithm, paired with either the complement of the Tanimoto coefficient or the Modified Tanimoto coefficient, was used to select a smaller, diverse set of molecules from a randomly determined whole training set. The performance of models built on these smaller, diverse sets was compared with that of models built on the whole training set and on smaller, randomly selected training sets. Various sizes of the diverse and random training sets were examined. Diverse training sets that were at least 60% of the size of a whole training set yielded model performance similar to that of the whole training sets. Randomly selected training sets consistently resulted in lower model performance across all sizes.
The Tanimoto and Modified Tanimoto coefficients produced similarly diverse sets with a similar proportion of features present per compound. Compared with the random training sets, both diverse sets spanned a broader chemical space and, during feature selection, had more selected features in common with the whole training set. Overall, this work demonstrates that diversity selection algorithms can rationally select smaller training sets that yield QSAR model performance comparable to that of a larger training set.
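
As an illustration of the selection step described in the abstract, the following is a minimal sketch of MaxMin diversity selection using the complement of the Tanimoto coefficient as the distance. It is not the thesis's implementation: the representation of each compound as a set of "on" feature indices, the function names `tanimoto_distance` and `maxmin_select`, the choice of the first compound as the seed, and the toy fingerprints are all assumptions made for this example. The Modified Tanimoto coefficient is not covered here.

```python
from typing import FrozenSet, List, Sequence

# Assumption: each compound is a set of indices of the binary features
# (e.g. structural keys) that are present in it.
Fingerprint = FrozenSet[int]


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Complement of the Tanimoto coefficient: 1 - |a & b| / |a | b|."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints are treated as identical (assumption).
        return 0.0
    return 1.0 - len(a & b) / union


def maxmin_select(fps: Sequence[Fingerprint], k: int, seed_index: int = 0) -> List[int]:
    """Pick k indices so that each new pick maximizes its minimum
    distance to the compounds already picked (MaxMin)."""
    if not 0 < k <= len(fps):
        raise ValueError("k must be between 1 and len(fps)")
    selected = [seed_index]
    chosen = {seed_index}
    # min_dist[i] = distance from compound i to its nearest selected compound.
    min_dist = [tanimoto_distance(fp, fps[seed_index]) for fp in fps]
    while len(selected) < k:
        # Candidate that is farthest from everything selected so far.
        best = max(
            (i for i in range(len(fps)) if i not in chosen),
            key=lambda i: min_dist[i],
        )
        selected.append(best)
        chosen.add(best)
        # Update nearest-selected distances with the new pick.
        for i, fp in enumerate(fps):
            d = tanimoto_distance(fp, fps[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy fingerprints (made up for illustration, not data from the thesis).
toy_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
    frozenset({7, 8}),
]
picked = maxmin_select(toy_fps, k=3)  # -> [0, 2, 1]
```

In this toy example the second pick is a compound sharing no features with the seed, and the third pick is the remaining compound that is least similar to both. In practice the seed is often chosen at random, and cheminformatics toolkits provide optimized pickers that avoid the full distance update shown here.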
James Rathman (Advisor)
Isamu Kusaka (Committee Member)
66 p.

Recommended Citations


  • Wiegand, E. M. (2024). Selecting Small, Diverse Training Sets for Effective QSAR Models [Master's thesis, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1713345925307914

    APA Style (7th edition)

  • Wiegand, Emily. Selecting Small, Diverse Training Sets for Effective QSAR Models. 2024. Ohio State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1713345925307914.

    MLA Style (8th edition)

  • Wiegand, Emily. "Selecting Small, Diverse Training Sets for Effective QSAR Models." Master's thesis, Ohio State University, 2024. http://rave.ohiolink.edu/etdc/view?acc_num=osu1713345925307914

    Chicago Manual of Style (17th edition)