Search Results

(Total results 16)


  • 1. Cunningham, James Efficient, Parameter-Free Online Clustering

    Master of Science, The Ohio State University, 2020, Computer Science and Engineering

    As the number of data sources and dynamic datasets grows, so does the need for efficient online algorithms to process them. Clustering is an important data exploration tool used in fields from data science to machine learning: it finds structure, without supervision, in otherwise indecipherable collections of data. Online clustering is the process of grouping samples in a data stream into meaningful collections as they appear over time. As more data is collected in these streams, online clustering becomes more important than ever. Unfortunately, few efficient online clustering algorithms are available, because their design is difficult and often requires trading significant clustering performance for efficiency. Those that do exist require expert domain knowledge of the data space to set hyperparameters such as the desired number of clusters. This domain knowledge is often unavailable, so resources must be spent tuning hyperparameters to reach acceptable performance. In this thesis we present Stream FINCH (S-FINCH), an online modification of FINCH, the recent state-of-the-art parameter-free clustering algorithm by Sarfraz et al. We examine the stages of the FINCH algorithm and leverage key insights to produce an algorithm that reduces the online update complexity of FINCH. We then compare the performance of S-FINCH and FINCH over several toy and real-world datasets. We show theoretically and empirically that our S-FINCH algorithm is more efficient than FINCH in the online domain and has reasonable real-time update performance. We also present several alternative cluster representatives that can be used to build different agglomerative cluster hierarchies using the S-FINCH algorithm, and we compare their cluster quality and clustering time against the original FINCH algorithm.
The S-FINCH algorithm presented in this thesis allows for fast and efficient online (open full item for complete abstract)

    Committee: James Davis PhD. (Advisor); Juan Vasquez PhD. (Committee Member); Kyle Tarplee PhD. (Committee Member); Wei-Lun Chao PhD. (Committee Member) Subjects: Computer Science; Information Science; Information Technology
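FINCH's key step, grouping each point with its first (nearest) neighbor, can be sketched in a few lines. This is an illustrative reimplementation of the published first-partition rule (points linked to their first neighbor, or sharing one, end up in the same component), not code from the thesis:

```python
import numpy as np

def finch_first_partition(X):
    """First FINCH partition: link every point to its nearest neighbor and
    take the weakly connected components of that graph as clusters."""
    n = len(X)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    nn = d.argmin(1)                     # index of each point's first neighbor
    parent = np.arange(n)                # union-find over the links i -- nn[i]
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i in range(n):
        parent[find(i)] = find(nn[i])
    labels = np.array([find(i) for i in range(n)])
    _, labels = np.unique(labels, return_inverse=True)   # relabel to 0..k-1
    return labels

# two well-separated synthetic blobs: first-neighbor links never cross blobs
X = np.vstack([np.random.RandomState(0).randn(20, 2),
               np.random.RandomState(1).randn(20, 2) + 10.0])
labels = finch_first_partition(X)
```

Repeating this rule on the cluster means yields the agglomerative hierarchy; the online S-FINCH variant updates these links incrementally as samples arrive.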
  • 2. Eldridge, Justin Clustering Consistently

    Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering

    Clustering is the task of organizing data into natural groups, or clusters. A central goal in developing a theory of clustering is the derivation of correctness guarantees which ensure that clustering methods produce the right results. In this dissertation, we analyze the setting in which the data are sampled from some underlying probability distribution. In this case, an algorithm is "correct" (or consistent) if, given larger and larger data sets, its output converges in some sense to the ideal cluster structure of the distribution. In the first part, we study the setting in which data are drawn from a probability density supported on a subset of a Euclidean space. The natural cluster structure of the density is captured by the so-called high density cluster tree, which is due to Hartigan (1981). Hartigan introduced a notion of convergence to the density cluster tree, and recent work by Chaudhuri and Dasgupta (2010) and Kpotufe and von Luxburg (2011) has constructed algorithms which are consistent in this sense. We will show that Hartigan's notion of consistency is in fact not strong enough to ensure that an algorithm recovers the density cluster tree as we would intuitively expect. We identify the precise deficiency which allows this, and introduce a new, stronger notion of convergence which we call consistency in merge distortion. Consistency in merge distortion implies Hartigan's consistency, and we prove that the algorithm of Chaudhuri and Dasgupta (2010) satisfies our new notion. In the sequel, we consider the clustering of graphs sampled from a very general, non-parametric random graph model called a graphon. Unlike in the density setting, clustering in the graphon model is not well-studied. We therefore rigorously analyze the cluster structure of a graphon and formally define the graphon cluster tree. We adapt our notion of consistency in merge distortion to the graphon setting and identify efficient, consistent algorithms.

    Committee: Mikhail Belkin PhD (Advisor); Yusu Wang PhD (Advisor); Facundo Mémoli PhD (Committee Member); Vincent Vu PhD (Committee Member) Subjects: Artificial Intelligence; Computer Science; Statistics
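The consistency notion at issue in entry 2 can be stated compactly: the merge height of two points is the highest density level at which they remain connected, and merge distortion measures the worst-case error in these heights. The notation below is a paraphrase from memory of the published definitions, not a quotation of the dissertation:

```latex
% merge height of points a, b under density f
m_f(a,b) = \sup\{\lambda \ge 0 \;:\; a \text{ and } b \text{ lie in the same
  connected component of } \{x : f(x) \ge \lambda\}\}

% merge distortion between an estimated hierarchy \hat{m} and the true tree
d(\hat{m}, m_f) = \sup_{a,b}\, \lvert \hat{m}(a,b) - m_f(a,b) \rvert
```

An algorithm is consistent in merge distortion when $d(\hat{m}, m_f) \to 0$ as the sample size grows; as the abstract notes, this implies Hartigan's consistency.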
  • 3. CAO, BAOQIANG ON APPLICATIONS OF STATISTICAL LEARNING TO BIOPHYSICS

    PhD, University of Cincinnati, 2007, Arts and Sciences : Physics

    In this dissertation, we develop statistical and machine learning methods for problems in biological systems and processes. In particular, we are interested in two problems: predicting structural properties of membrane proteins, and clustering genes based on microarray experiments. For the membrane protein problem, we introduce a compact representation for amino acids and build a neural network predictor based on it to identify transmembrane domains in membrane proteins. Membrane proteins are divided into two classes based on the secondary structure of the parts spanning the lipid bilayer: alpha-helical and beta-barrel membrane proteins. We further build a support vector regression model to predict the lipid-exposed levels of the amino acids within the transmembrane domains of alpha-helical membrane proteins. We also develop methods to predict pore-forming residues in beta-barrel membrane proteins. For the second problem, we apply a context-specific Bayesian clustering model to cluster genes based on their expression levels and cDNA copy numbers. This dissertation is organized as follows. Chapter 1 introduces the most relevant biology and the statistical and machine learning methods. Chapters 2 and 3 focus on the prediction of transmembrane domains for the alpha-helical and beta-barrel cases, respectively. Chapter 4 discusses the prediction of relative lipid accessibility, a different structural property of membrane proteins. The final chapter addresses the gene clustering approach.

    Committee: Dr. Mark Jarrell (Advisor) Subjects: Physics, Molecular
  • 4. Janowski, Alexandra Quantifying Right Ventricular Function in Pulmonary Hypertension: Advanced Hemodynamics, Cardiac MRI Strain Analysis, and Novel Phenotyping

    Doctor of Philosophy, The Ohio State University, 2024, Biomedical Engineering

    Right ventricular (RV) function is strongly associated with mortality in patients with pulmonary hypertension (PH); survival is associated with RV function rather than pulmonary vascular resistance or pulmonary arterial pressure. Optimization of RV assessment is an identified priority in PH clinical research, alongside phenotyping and identifying markers of increased risk. Echocardiography and cardiac magnetic resonance (CMR) imaging are both used to quantify RV function, but CMR can reliably obtain quantitative information about RV shape, size, and remodeling. A current unmet clinical need is the identification of clinically meaningful imaging features that clearly distinguish between normal and abnormal RV structure and function. Right ventricular global longitudinal strain has been shown to be associated with outcomes in PH and is typically measured via echocardiography. Measures of strain provide insight into the biomechanics of the RV (diastolic stiffness) and its function, and changes in ventricular dynamics during ejection and filling may be associated with RV biomechanics. We hypothesized that RV strain would differ in advanced RV dysfunction, and the aim of the study was to evaluate RV strain and instantaneous strain rates throughout the cardiac cycle from CMR images in patients with pulmonary hypertension. We utilized machine learning methods to evaluate advanced hemodynamic variables obtained from right heart catheterization and cardiac magnetic resonance imaging. Unsupervised machine learning algorithms allow us to reduce high-dimensional data into manageable groupings while preserving the contribution of each dimension. By assessing RV remodeling and function through hemodynamic, clinical, and imaging data, we identified multiple unique functional phenotypes that reflect states of RV dysfunction. Working with advanced hemodynamics, we identified five subphenotypes of RV dysfunction.
We found a subtype of particip (open full item for complete abstract)

    Committee: Rebecca Vanderpool (Advisor); Seth Weinberg (Committee Member); Scott Visovatti (Committee Member); Rizwan Ahmad (Committee Member) Subjects: Biomedical Engineering
  • 5. Mariam-Smith, Arshiya Identification and Prediction of Clinical Analogue Cohorts in Electronic Health Records

    Doctor of Philosophy, Case Western Reserve University, 2024, Biomedical and Health Informatics

    Leveraging longitudinal clinical data can provide additional information for improvements in precision medicine. While subtypes of various diseases (e.g., Alzheimer's disease) have distinct longitudinal clinical manifestations, this information is seldom used, presenting a missed opportunity for disease characterization. Here, we broadly investigate the relevance of longitudinal phenotypes in clinical settings by: i) showing robust identification and prediction of new longitudinal phenotypes in a clinical trial setting. We used a temporal matching algorithm (dynamic time warping) to cluster HbA1c measurements and identified four subtypes of glycemic response in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial, which investigated diabetes management in patients with high cardiovascular disease (CVD) risk. One subtype, C4, had treatment-mediated reduced CVD risk and could be predicted with high accuracy (AUC=0.98). ii) Identifying robust temporal matching algorithms in real-world, simulated, longitudinal clinical data. Longitudinal clinical data are distinct from the audio and video signal data used to develop these algorithms. We identified robust algorithms through systematic evaluation in simulated data and applied them to identify five distinct body mass index patterns with modified risk of metabolic syndrome in a large pediatric cohort (N>43,000). iii) Developing clinical analogue cohorts (CACs) to set the stage for multi-disease prediction. Approximately 33% of adults and 75% of older adults (>65 years old) in developed countries are affected by multiple chronic conditions (CCs), which are linked to increased medication use, specialist care, and emergency services. Using diagnosis trajectories retrieved from the UK Biobank, we identified eight stable CACs across the life cycle in both females and males. These CACs had distinct CC risk profiles and genetic predispositions.
For example, CAC-10 (males) had increased risk of prostate (open full item for complete abstract)

    Committee: David Kaelber (Committee Member); Xiaofeng Zhu (Committee Member); Jessica Cooke Bailey (Committee Member); William Bush (Committee Chair); Daniel Rotroff (Advisor) Subjects: Biomedical Research
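The temporal matching at the heart of entry 5 — dynamic time warping — scores two trajectories by the cheapest monotone alignment rather than point-by-point. A minimal sketch on synthetic series (not the ACCORD data):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(nm) dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# two trajectories with the same shape but shifted in time, loosely mimicking
# measurement curves that respond on different schedules (synthetic data)
t = np.linspace(0.0, 2.0 * np.pi, 50)
s1 = np.sin(t)
s2 = np.sin(t - 0.5)
d_warped = dtw_distance(s1, s2)
d_lockstep = float(np.abs(s1 - s2).sum())   # point-by-point comparison
```

Because DTW can slide the alignment, the shifted copy scores much closer than a lock-step comparison suggests; feeding such pairwise distances into an ordinary clustering routine gives trajectory subtypes.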
  • 6. Jami, Prithvi Bayesian Network and Structural Equation Modeling Approaches for Analysis of Asthma Heterogeneity

    MS, University of Cincinnati, 2024, Engineering and Applied Science: Computer Science

    Asthma is a widespread chronic respiratory disease that remains poorly understood despite significant research into its etiology. Some of the difficulty is due to its heterogeneous nature and the wide variety of factors involved. A predictive model for asthma risk would help us better model and predict asthma and develop better treatments and protocols for its management. Bayesian networks are a useful tool for learning relationships embedded in an observed dataset, but causality cannot be inferred from relationships learned in this way. Bayesian networks are also sensitive to the composition of the given dataset, and relationships present in only subsets of the data may not be identified. We propose a methodology that integrates structural equation modeling with Bayesian networks to facilitate the development of predictive models of asthma that incorporate causal information based on domain knowledge. We also utilize clustering to identify and model unique subsets of the population with potentially unique relationships among the factors relevant to asthma. We have applied this approach to NHANES data to model the risk of asthma in subpopulations of residents of the United States, and we have validated several steps in our methodology to demonstrate its effectiveness for developing predictive models of asthma risk in specific subpopulations. We believe this methodology can be extended to many real-world complex processes with heterogeneous phenotypes to create robust models and find new insights.

    Committee: Tesfaye Mersha Ph.D. (Committee Chair); Vikram Ravindra Ph.D. (Committee Member); Raj Bhatnagar Ph.D. (Committee Member) Subjects: Computer Science
  • 7. Groeger, Alexander Texture-Driven Image Clustering in Laser Powder Bed Fusion

    Master of Science (MS), Wright State University, 2021, Computer Science

    The additive manufacturing (AM) field is striving to identify anomalies in laser powder bed fusion (LPBF) using multi-sensor in-process monitoring paired with machine learning (ML). In-process monitoring can reveal the presence of anomalies, but creating an ML classifier requires labeled data. The present work approaches this problem by printing hundreds of Inconel-718 coupons with different processing parameters to capture a wide range of process monitoring imagery with multiple sensor types. The process monitoring images are then encoded into feature vectors and clustered to isolate groups in each sensor modality. Four texture representations were learned by training two convolutional neural network texture classifiers on two general texture datasets for clustering comparison. The results demonstrate that unsupervised texture-driven clustering can isolate roughness categories and process anomalies in each sensor modality. These groups can be labeled by a field expert and potentially used for defect characterization in process monitoring.

    Committee: Tanvi Banerjee Ph.D. (Advisor); Thomas Wischgoll Ph.D. (Committee Member); John Middendorf Ph.D. (Committee Member) Subjects: Computer Science; Materials Science
  • 8. Salem, Iman PSORIATIC FUNGAL AND BACTERIAL MICROBIOMES IDENTIFY PATIENT ENDOTYPES

    Master of Sciences, Case Western Reserve University, 2021, Pathology

    Psoriasis is a chronic inflammatory disease, characterized by immune-mediated keratinocyte hyperproliferation, that affects approximately 3% of the population. The aim of our study was to characterize entero- and cutaneotypes that incorporate the micro- and myco-biomes using high-throughput DNA sequencing, and to examine potential correlations with other metabolic and physiologic markers. Fecal and skin samples were obtained from 67 psoriatic patients and 12 healthy controls. Amplification of the bacterial 16S and fungal ITS1-region genes was performed using 16S 515f-804r and ITS1f and ITS2r primers, respectively. The targeted amplicons were sequenced using an Ion Torrent S5 system, and cutaneo- and enterotypes were identified using unsupervised hierarchical clustering. We report for the first time that distinct patterns of laboratory and clinical measures are associated with the respective enterotypes and cutaneotypes, with significant differences in clinical and laboratory features among subjects grouped by fungal cutaneotype.

    Committee: Mahmoud Ghannoum (Committee Chair); Thomas McCormick (Committee Member); Pamela Wearsch (Committee Member) Subjects: Medicine; Molecular Biology
  • 9. Hussein, Abdul Aziz Identifying Crime Hotspot: Evaluating the suitability of Supervised and Unsupervised Machine learning

    MS, University of Cincinnati, 2021, Education, Criminal Justice, and Human Services: Information Technology

    Identifying crime hotspot locations is an important endeavor for public safety. Being able to identify these locations effectively and accurately provides useful information that helps law enforcement minimize criminal activity. Considering the limited resources available to law enforcement, a prudent approach is to deploy those resources at places that record considerably higher crime rates. We depart from the traditional “higher than average” thresholds and instead rely on a more pragmatic approach in the analysis. We analyze five years of crime data from the Cincinnati Police Department using clustering algorithms such as K-means, DBSCAN, and hierarchical clustering, and classification algorithms such as Random Forest, SVM, Logistic Regression, KNN, and Naive Bayes, on the same dataset. The clustering methods are used as a standalone means of identifying crime hotspots rather than as a data preprocessing step, as done in prior experiments. The results from both approaches are compared using their respective evaluation metrics. We find that classification performed better than clustering on our dataset; the best-performing algorithm is the Random Forest with 30 trees. We also find considerable crime concentration along the hotspot street segments identified in the dataset.

    Committee: M. Murat Ozer Ph.D. (Committee Chair); Nelly Elsayed Ph.D. (Committee Member) Subjects: Information Technology
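The standalone clustering approach in entry 9 — letting density, not a fixed threshold, define a hotspot — can be sketched with DBSCAN on synthetic incident coordinates (the thesis uses real Cincinnati data; everything below is made up for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical incident coordinates: two dense hotspots plus scattered incidents
rng = np.random.RandomState(42)
hotspot_a = rng.normal([0.0, 0.0], 0.1, size=(60, 2))
hotspot_b = rng.normal([5.0, 5.0], 0.1, size=(60, 2))
scattered = rng.uniform(-2.0, 7.0, size=(20, 2))
incidents = np.vstack([hotspot_a, hotspot_b, scattered])

# density-based clustering marks sparse points as noise (label -1), so
# hotspots fall out directly instead of via a "higher than average" cutoff
db = DBSCAN(eps=0.3, min_samples=10).fit(incidents)
n_hotspots = len(set(db.labels_) - {-1})
n_noise = int((db.labels_ == -1).sum())
```

In practice `eps` and `min_samples` would be set from the spatial units of the data (e.g., street-segment scale), which is the judgment call the thesis's comparison against supervised classifiers speaks to.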
  • 10. Kondapalli, Swetha An Approach To Cluster And Benchmark Regional Emergency Medical Service Agencies

    Master of Science in Industrial and Human Factors Engineering (MSIHE) , Wright State University, 2020, Industrial and Human Factors Engineering

    Emergency Medical Service (EMS) providers are the first responders to an injured patient in the field. Their assessment of patient injuries and determination of an appropriate hospital play a critical role in patient outcomes. A majority of US states have established a state-level governing body (e.g., an EMS Division) responsible for developing and maintaining a robust EMS system throughout the state. Such divisions develop standards, accredit EMS agencies, oversee the trauma system, and support new initiatives through grants and training. To do so, these divisions require data that first lets them understand the similarities between existing EMS agencies in the state in terms of their resources and activities; benchmarking agencies against similar peer groups could then reveal best practices among top performers in terms of patient outcomes. While limited qualitative data exists in the literature based on surveys of EMS personnel related to their working environment, training, and stress, what is lacking is a quantitative approach that can compare and contrast EMS agencies across a comprehensive set of factors and enable benchmarking. Our study fills this gap by proposing a data-driven approach to cluster EMS agencies (aggregated at the county level) and subsequently benchmark them against their peers using two patient safety performance measures: under-triage (UT) and over-triage (OT). The study was conducted in three phases: data collection, clustering, and benchmarking. We first obtained data related to trauma-specific capabilities, volume, and performance improvement activities, collected by our collaborating team of health services researchers through a survey of over 300 EMS agencies in the state of Ohio. To estimate UT and OT, we used 6,002 de-identified patient records from 2012 made available by Ohio's EMS Division. All data were aggregated at the county level.
We then used several clustering methods to group counties us (open full item for complete abstract)

    Committee: Pratik J. Parikh Ph.D. (Advisor); Subhashini Ganapathy Ph.D. (Committee Member); Corrine Mowrey Ph.D. (Committee Member) Subjects: Computer Science; Industrial Engineering; Statistics
  • 11. Campbell, Benjamin Supervised and Unsupervised Machine Learning Strategies for Modeling Military Alliances

    Doctor of Philosophy, The Ohio State University, 2019, Political Science

    When modeling interstate military alliances, scholars make simplifying assumptions, though most recognize that these often-invoked assumptions are overly simplistic. This dissertation leverages developments in supervised and unsupervised machine learning to assess the validity of these assumptions and examine how they influence our understanding of alliance politics. I uncover a series of findings that help us better understand the causes and consequences of alliances. The first assumption examined holds that states, when confronted by a common external security threat, form alliances to aggregate their military capabilities in an effort to increase their security and ensure their survival. Many within diplomatic history and security studies criticize this widely accepted "Capability Aggregation Model", noting that countries have various motives for forming alliances. In the first of three articles, I introduce an unsupervised machine learning algorithm designed to detect variation in how actors form relationships in longitudinal networks. This allows me, in the second article, to assess the heterogeneous motives countries have for forming alliances. I find that states form alliances to achieve foreign policy objectives beyond capability aggregation, including the consolidation of non-security ties and the pursuit of domestic reform. The second assumption is invoked when scholars model the relationship between alliances and conflict, routinely assuming that the formation of an alliance is exogenous to the probability that one of the allies is attacked. This stands in stark contrast to the Capability Aggregation Model's expectations, which indicate that an external threat and an ally's expectation of attack by an aggressor influence the decision to form an alliance. In the final article, I examine this assumption and the causal relationship between alliances and conflict.
Specifically, I endogenize alliances on the causal path to conflict using supe (open full item for complete abstract)

    Committee: Skyler Cranmer (Committee Chair); Janet Box-Steffensmeier (Committee Member); Bear Braumoeller (Committee Member); Christopher Gelpi (Committee Member) Subjects: Artificial Intelligence; Behavioral Sciences; Computer Science; International Relations; Military History; Peace Studies; Political Science; Statistics; World History
  • 12. Awodokun, Olugbenga Classification of Patterns in Streaming Data Using Clustering Signatures

    MS, University of Cincinnati, 2017, Engineering and Applied Science: Electrical Engineering

    Streaming datasets often pose a myriad of challenges for machine learning algorithms, including insufficient storage and changes in the underlying distribution of the data across time intervals. This thesis proposes a hierarchical-clustering-based (unsupervised) method for determining signatures of the data in a time window, and thus building a classifier based on the match between the observed clusters and known patterns of clustering. When new clusters are observed, they are added to a global list of clusters used to generate a signature for the data in each time window. Dendrograms are created from each time window, and their clusters are compared to the global list; the global list is updated only if none of its existing clusters can model the data points in a later time window. The global clusters are then used in the testing phase to classify novel data chunks according to their Tanimoto similarities. Although the training samples were taken from only 20% of the KDD Cup 99 dataset, we validated our approach using test data from different regions of the dataset at multiple intervals, and the classifier performance achieved was comparable to methods that used the entire dataset for training.

    Committee: Raj Bhatnagar Ph.D. (Committee Chair); Gowtham Atluri (Committee Member); Nan Niu Ph.D. (Committee Member) Subjects: Computer Science
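The signature-matching step in entry 12 — classify a window by the Tanimoto similarity between its observed clusters and known patterns — reduces to set overlap. A minimal sketch with hypothetical global-cluster IDs (not the thesis's KDD Cup 99 signatures):

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two cluster-signature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def classify_window(signature, known_patterns):
    """Assign a window's signature to the known pattern it most resembles."""
    return max(known_patterns, key=lambda name: tanimoto(signature, known_patterns[name]))

# hypothetical global-cluster IDs observed in labeled training windows
known = {"normal": {0, 1, 2}, "attack": {3, 4}}
label = classify_window({0, 2, 5}, known)   # shares two clusters with "normal"
```

Here the test window contains one novel cluster (ID 5); under the thesis's scheme that cluster would be appended to the global list before the next window is scored.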
  • 13. Davis, Casey Using Self-Organizing Maps to Cluster Products for Storage Assignment in a Distribution Center

    Master of Science (MS), Ohio University, 2017, Industrial and Systems Engineering (Engineering and Technology)

    This thesis provides a methodology for using self-organizing maps (SOMs) to cluster stock keeping units (SKUs) based on historical order data, in order to effectively slot a forward area in a distribution center. The methodology relies on creating zones that contain SKUs that are commonly ordered together. Several of the techniques tested improve on the benchmark method, including a reduction of up to 11% in the total time to complete all orders under a given zone configuration. Results are discussed, along with possible future work that could improve the methodology.

    Committee: Dale Masel Ph.D (Advisor); Gary Weckman Ph.D (Committee Member); Dianna Schwerha Ph.D (Committee Member); William Young Ph.D (Committee Member) Subjects: Industrial Engineering
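The SOM mechanism behind entry 13 — competitive units on a grid that drift toward the samples they win — can be sketched with a tiny 1-D map. The "order-history" vectors below are invented stand-ins, not the thesis data:

```python
import numpy as np

def train_som(X, n_units=2, epochs=200, lr=0.5, sigma=1.0, seed=0):
    """Minimal 1-D self-organizing map: units on a line compete for each
    sample; the winner and its grid neighbors move toward the sample."""
    rng = np.random.RandomState(seed)
    W = rng.randn(n_units, X.shape[1])
    grid = np.arange(n_units)
    for epoch in range(epochs):
        frac = epoch / epochs
        cur_lr = lr * (1.0 - frac)                # decaying learning rate
        cur_sigma = sigma * (1.0 - frac) + 1e-3   # shrinking neighborhood
        for x in X[rng.permutation(len(X))]:
            bmu = int(((W - x) ** 2).sum(1).argmin())   # best-matching unit
            h = np.exp(-((grid - bmu) ** 2) / (2 * cur_sigma ** 2))
            W += cur_lr * h[:, None] * (x - W)
    return W

# hypothetical SKU order-history vectors: two groups of SKUs ordered together
rs1, rs2 = np.random.RandomState(1), np.random.RandomState(2)
X = np.vstack([[1.0, 0.0] + 0.05 * rs1.randn(10, 2),
               [0.0, 1.0] + 0.05 * rs2.randn(10, 2)])
W = train_som(X)
assign = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1).argmin(1)  # SKU -> unit
qe = float(((X - W[assign]) ** 2).sum(1).mean())   # quantization error
```

Each unit's final membership becomes a candidate zone; in the slotting application the input vectors would encode co-ordering frequencies rather than raw coordinates.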
  • 14. Mirzaei, Golrokh Data Fusion of Infrared, Radar, and Acoustics Based Monitoring System

    Doctor of Philosophy, University of Toledo, 2014, Engineering

    Many bird and bat fatalities have been reported in the vicinity of wind farms. An acoustic, infrared camera, and marine radar based system was developed to monitor the nocturnal migration of birds and bats, and was deployed and tested in an area of potential wind farm development that is also a stopover for migrating birds and bats. Multi-sensor data fusion was developed based on acoustics, an infrared (IR) camera, and radar; the diversity of the sensor technologies complicated its development, and different signal processing techniques were developed for the various types of data. Data fusion was then implemented across the three diverse sensors in order to make inferences about the targets. This approach reduces uncertainty and provides a desired level of confidence and detailed information about the patterns. This work is a unique, multifidelity, and multidisciplinary approach based on pattern recognition, machine learning, signal processing, bio-inspired computing, probabilistic methods, and fuzzy reasoning. Sensors were located in the western basin of Lake Erie in Ohio and were used to collect data over the migration periods of 2011 and 2012. Acoustic data were collected using acoustic detectors (SM2 and SM2BAT) and preprocessed to convert the recorded files to standard wave format. Acoustic processing was performed in two steps: feature extraction and classification. Acoustic features of bat echolocation calls were extracted using three different techniques: Short-Time Fourier Transform (STFT), Mel Frequency Cepstrum Coefficients (MFCC), and the Discrete Wavelet Transform (DWT). These features were fed into an Evolutionary Neural Network (ENN) for classification at the species level. Results from the different feature extraction techniques were compared based on classification accuracy.
The technique can identify bats and will contribute towards developing mitigation procedures for reducing bat fata (open full item for complete abstract)

    Committee: Mohsin Jamali Dr. (Committee Chair); Jackson Carvalho Dr. (Committee Member); Mohammed Niamat Dr. (Committee Member); Richard Molyet Dr. (Committee Member); Mehdi Pourazady Dr. (Committee Member) Subjects: Biology; Computer Engineering; Computer Science; Ecology; Electrical Engineering; Energy; Engineering
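The STFT front end named in entry 14 is frame-window-FFT; a minimal sketch on a synthetic two-tone "call" (a crude stand-in for a frequency-modulated echolocation sweep, not recorded data):

```python
import numpy as np

def stft_features(signal, frame_len=256, hop=128):
    """Frame the signal, apply a Hann window, and take magnitude FFTs --
    the STFT representation that downstream feature extraction works from."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)          # shape: (n_frames, frame_len // 2 + 1)

# synthetic "call": a tone whose frequency steps halfway through
fs = 8000
t = np.arange(fs) / fs
sig = np.concatenate([np.sin(2 * np.pi * 500 * t[:fs // 2]),
                      np.sin(2 * np.pi * 1500 * t[:fs // 2])])
S = stft_features(sig)
peak_bins = S.argmax(axis=1)         # dominant frequency bin per frame
```

The per-frame peak trajectory (bin 16 early, bin 48 late here) is the kind of time-frequency contour that MFCC or wavelet features then summarize for the species classifier.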
  • 15. Hu, Ke Speech Segregation in Background Noise and Competing Speech

    Doctor of Philosophy, The Ohio State University, 2012, Computer Science and Engineering

    In real-world listening environments, speech reaching our ears is often accompanied by acoustic interference such as environmental sounds, music, or another voice. Noise distorts speech and poses a substantial difficulty to many applications, including hearing aid design and automatic speech recognition. Monaural speech segregation refers to the problem of separating speech based on only one recording and is widely regarded as a challenging problem. In recent decades, significant progress has been made on this problem, but the challenge remains. This dissertation addresses monaural speech segregation from different types of interference. First, we study the problem of unvoiced speech segregation, which is less studied than voiced speech segregation, probably due to its difficulty. We propose to utilize segregated voiced speech to assist unvoiced speech segregation. Specifically, we remove all periodic signals, including voiced speech, from the noisy input and then estimate noise energy in unvoiced intervals using noise-dominant time-frequency units in neighboring voiced intervals. The estimated interference is used by a subtraction stage to extract unvoiced segments, which are then grouped by either simple thresholding or classification. We demonstrate that the proposed system performs substantially better than speech enhancement methods. Interference can be nonspeech signals or other voices; cochannel speech refers to a mixture of two speech signals. Cochannel speech separation is often addressed by model-based methods, which assume known speaker identities and pretrained speaker models. To address this speaker-dependency limitation, we propose an unsupervised approach to cochannel speech separation. We employ a tandem algorithm to perform simultaneous grouping of speech and develop an unsupervised clustering method to group simultaneous streams across time.
The proposed objective function for clustering measures the speaker difference of each hypothesized grouping and incorporates pitch (open full item for complete abstract)

    Committee: DeLiang Wang (Committee Chair); Eric Fosler-Lussier (Committee Member); Mikhail Belkin (Committee Member) Subjects: Computer Science
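The subtraction stage described in entry 15 — estimate the noise spectrum from speech-free material, then subtract it frame by frame — can be sketched with textbook spectral subtraction. This is a generic illustration of the subtraction idea, not the dissertation's system, and the signals are synthetic:

```python
import numpy as np

def spectral_subtract(noisy, noise_mag, frame_len=256, hop=128):
    """Subtract an estimated noise magnitude spectrum from each frame's
    magnitude (floored at zero) and resynthesize by overlap-add,
    reusing the noisy phase."""
    out = np.zeros(len(noisy))
    win = np.hanning(frame_len)
    for start in range(0, len(noisy) - frame_len + 1, hop):
        spec = np.fft.rfft(noisy[start:start + frame_len] * win)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        out[start:start + frame_len] += np.fft.irfft(
            mag * np.exp(1j * np.angle(spec)), frame_len)
    return out

rng = np.random.RandomState(0)
fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)        # stand-in for a speech component
noisy = clean + 0.3 * rng.randn(fs)
# noise spectrum estimated from a known noise-only stretch, loosely mirroring
# the thesis's use of noise-dominant units in neighboring intervals
noise_mag = np.abs(np.fft.rfft(0.3 * rng.randn(256) * np.hanning(256)))
enhanced = spectral_subtract(noisy, noise_mag)
sl = slice(256, fs - 256)                  # interior fully covered by overlap-add
err_enh = float(np.mean((enhanced[sl] - clean[sl]) ** 2))
err_noisy = float(np.mean((noisy[sl] - clean[sl]) ** 2))
```

The dissertation's contribution is in *where* the noise estimate comes from (neighboring voiced intervals) and how the subtracted segments are grouped afterward; the subtraction machinery itself is this simple.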
  • 16. Mohiddin, Syed Development of novel unsupervised and supervised informatics methods for drug discovery applications

    Doctor of Philosophy, The Ohio State University, 2006, Chemical Engineering

    As of 2002, the cost of discovering a new drug was nearly $802 million, with a timeline of nearly 13.6 years. Despite the large investments of time and money, drugs that were successfully introduced to the market have had to be withdrawn later for efficacy (38%) and safety (20%) reasons. Improving the success rate in drug discovery is linked to two key steps in the process. First, in order to improve efficacy, there is a need for improved understanding of the genetic biomarkers (targets for drug action) that characterize a given disease. Second, drug safety can be improved by predicting the activity/toxicity of potential drug candidates at an early stage, prior to the initiation of expensive clinical trials. In this work, we develop a novel unsupervised informatics methodology that addresses the characterization of both biological and chemical samples and the identification of the underlying key non-redundant features responsible for that characterization. Biological samples are characterized into different groups (e.g., cancer types) based on gene expression profiling, and the genetic biomarkers most responsible for the characterization are identified. Similarly, chemical compounds are characterized into groups with varying activity/toxicity based on structural, physical, and chemical property data. The methodology developed in this work relies largely on the multivariate aspects of principal component analysis and the application of the k-means clustering algorithm in a hierarchically recursive manner to achieve unsupervised multi-class classification. The principal components are replaced by the corresponding partial least squares (PLS) components in the supervised scenario. Selection of influential components (principal components in the unsupervised case and PLS components in the supervised case) for the purpose of classification is demonstrated and is one of the key steps in the success of this methodology.
Hierarchical k-means is ap (open full item for complete abstract)

    Committee: James Rathman (Advisor) Subjects: Engineering, Chemical
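The "PCA plus hierarchically recursive k-means" idea in entry 16 can be sketched end to end: project onto principal components, split 2-ways, and recurse on each side. This is a simplified illustration on synthetic data (the thesis also covers component selection and the supervised PLS variant, which are omitted here):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project mean-centered data onto its top principal components (via SVD)."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans2(X, iters=50, seed=0):
    """Plain 2-means: the binary split applied at each level of the recursion."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        lab = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(2):
            if (lab == k).any():
                centers[k] = X[lab == k].mean(0)
    return lab

def hierarchical_kmeans(X, depth):
    """Recursively 2-way split the data in PCA space: 2**depth leaf clusters."""
    if depth == 0 or len(X) < 2:
        return np.zeros(len(X), dtype=int)
    lab = kmeans2(pca_scores(X))
    out = np.zeros(len(X), dtype=int)
    for k in range(2):
        sub = hierarchical_kmeans(X[lab == k], depth - 1)
        out[lab == k] = k * (1 << (depth - 1)) + sub
    return out

# four synthetic sample groups, nested pairwise, plus a weak third dimension
rng = np.random.RandomState(3)
group_centers = np.array([[0.0, 0.0], [2.0, 0.0], [100.0, 0.0], [102.0, 0.0]])
X2 = np.vstack([c + 0.1 * rng.randn(15, 2) for c in group_centers])
X = np.hstack([X2, 0.1 * rng.randn(60, 1)])
labels = hierarchical_kmeans(X, depth=2)
```

The top split separates the two distant super-groups and the second split resolves each pair, mirroring how the recursive scheme peels apart sample classes (e.g., cancer types) level by level.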