Search Results (1 - 25 of 107 Results)


Evans, Daniel T. A SNP Microarray Analysis Pipeline Using Machine Learning Techniques
Master of Science (MS), Ohio University, 2010, Computer Science (Engineering and Technology)

A software pipeline has been developed to aid in SNP microarray analysis in case/control genome-wide association (GWA) studies. The pipeline uses data taken from previous GWA studies from the NCBI Gene Expression Omnibus website and analyzes the SNP information from these studies to create predictive classifiers. These classifiers attempt to accurately predict if individuals have a particular phenotype based on their genotypes. Two different methods were used to create these predictive models. One makes use of a popular machine learning technique, support vector machines, and the other is a simpler method that uses genotype total differences between cases and controls. One major benefit of using the support vector machine method is the ability to integrate and consider many combinations of SNPs in a computationally inexpensive manner.

The GSE13117 dataset, which consists of mentally retarded children and their parents, and the GSE9222 dataset, which consists of autistic patients and their parents, were used to test the software pipeline. A Bayesian confidence interval was used in reporting classifier performance in addition to 5-repeated 10-fold cross-validation (5r-10cv). For the GSE9222 dataset, the top performing model achieved a balanced accuracy of 70.8% and a normal accuracy of 71.7% using 5r-10cv. The model that had the distribution with the highest upper bound had a 95% confidence balanced accuracy interval of 62.1% to 75.3%. For the GSE13117 dataset, the top performing classifier achieved a balanced accuracy of 56.2% and a normal accuracy of 65.7% using 5r-10cv. The model that had the distribution with the highest upper bound for the GSE13117 dataset had a 95% confidence balanced accuracy interval of 49.6% to 68.3%. Such classifiers will eventually lead to new insights into disease and allow for simpler and more accurate diagnoses in the future.
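
As a rough illustration of the evaluation protocol described above (not the thesis pipeline itself), the sketch below runs 5-repeated 10-fold cross-validation of a linear-kernel SVM on a hypothetical case/control genotype matrix and reports balanced accuracy; the data shapes and 0/1/2 SNP coding are assumptions.

```python
# Minimal sketch: 5r-10cv of an SVM genotype classifier with scikit-learn.
# X (individuals x SNPs) and y (case/control labels) are synthetic stand-ins.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50))   # 200 individuals, 50 SNP genotypes
y = rng.integers(0, 2, size=200)         # case/control phenotype labels

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(SVC(kernel="linear"), X, y,
                         scoring="balanced_accuracy", cv=cv)
print(f"5r-10cv balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```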

The work in this thesis contains ideas and work that is a continuation of previously published abstracts and poster presentations [1, 2], unpublished class reports [3, 4], and unpublished project reports from personal correspondence [5].

Committee:

Lonnie Welch, Dr. (Committee Chair); Razvan Bunescu, Dr. (Committee Member); Jundong Liu, Dr. (Committee Member); John Kopchick, Dr. (Committee Member)

Subjects:

Bioinformatics; Biology; Computer Science; Genetics

Keywords:

genome-wide association; machine learning; predictive model; support vector machines; single nucleotide polymorphisms

Nepal, Srijan. Linguistic Approach to Information Extraction and Sentiment Analysis on Twitter
MS, University of Cincinnati, 2012, Engineering and Applied Science: Computer Science

Social media sites are among the most popular destinations in today’s online world. Millions of users visit social networking sites like Facebook, YouTube, and Twitter every day to share the social content at their disposal, from simple textual updates about what they are doing at any moment, to opinions regarding products, people, events, and movies, to videos and music; these sites have become massive sources of user-generated content. In this work we focus on one such social networking site, Twitter, for the task of information extraction and sentiment analysis.

This work presents a linguistic framework that first performs syntactic normalization of tweets on top of traditional data cleaning, extracts assertions from each tweet in the form of binary relations, and creates a contextualized knowledge base (KB). We then present a Language Model (LM) based classifier, trained on a small manually tagged corpus, to perform sentence-level sentiment analysis on the collected assertions and eventually create a KB that is backed by sentiment values. We use this approach to implement a contextualized sentiment-based yes/no question answering system.
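
A toy sketch of the general idea behind a language-model-based sentiment classifier follows: one unigram model per class with add-one smoothing, scored by log-likelihood. The thesis's actual LM classifier and training corpus may differ; the examples here are invented.

```python
# Toy unigram-LM sentiment classifier: one smoothed language model per class,
# classification by maximum log-likelihood. Training sentences are invented.
import math
from collections import Counter

train = [("i love this movie", "pos"), ("great fun", "pos"),
         ("i hate this", "neg"), ("terrible boring movie", "neg")]

counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(text.split())
vocab = set(w for c in counts.values() for w in c)

def log_prob(text, label):
    c, total = counts[label], sum(counts[label].values())
    # add-one smoothed unigram log-likelihood of the sentence
    return sum(math.log((c[w] + 1) / (total + len(vocab)))
               for w in text.split())

def classify(text):
    return max(counts, key=lambda lab: log_prob(text, lab))

print(classify("i love fun"))   # -> pos
```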

Committee:

Kenneth Berman, PhD (Committee Chair); Fred Annexstein, PhD (Committee Member); Anca Ralescu, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

Sentiment Analysis; Twitter; Information Extraction; Language Model; Machine Learning; Sentiment Classification

Jain, Rachana. Novel Computational Methods for Mass Spectrometry based Protein Identification
PhD, University of Cincinnati, 2010, Engineering: Biomedical Engineering

Mass spectrometry (MS) is used routinely to identify proteins in biological samples. Peptide Mass Fingerprinting (PMF) uses peptide masses and a pre-specified search database to identify proteins. It is often used as a complementary method along with Peptide Fragment Fingerprinting (PFF) or de novo sequencing to increase the confidence and coverage of protein identification during mass spectrometric analysis. At the core of a PMF database search algorithm lies a similarity measure, or quality statistic, that is used to gauge the level to which an experimentally obtained peaklist agrees with a list of theoretically observable mass-to-charge ratios for a protein in a database. In this dissertation, we use publicly available gold standard data sets to show that the selection of search criteria such as mass tolerance and missed cleavages significantly affects the identification results. We propose, implement, and evaluate a statistical (Kolmogorov-Smirnov-based) test which is computed for a large mass error threshold, thus sparing the user the choice of an appropriate mass tolerance. We use the mass tolerance identified by the Kolmogorov-Smirnov test for computing other quality measures. The results from our careful and extensive benchmarks suggest that the new method of computing the quality statistic without requiring the end-user to select a mass tolerance is competitive. We investigate the similarity measures in terms of their information content and conclude that they are complementary and can be combined into a scoring function to possibly improve the overall accuracy of PMF-based identification methods.
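
The sketch below illustrates the statistical idea on synthetic peak lists (not the dissertation's exact statistic): within a deliberately large mass-error window, errors from random matches should look roughly uniform, while a true protein match concentrates near zero, so a Kolmogorov-Smirnov test against the uniform distribution can flag a genuine match without the user picking a tight mass tolerance.

```python
# Hedged sketch: KS test of peptide mass errors inside a large window.
# Peak lists are synthetic; a real PMF search works from a protein database.
import numpy as np
from scipy.stats import kstest

large_tol = 0.5  # Da, intentionally generous window
theoretical = np.sort(np.random.default_rng(1).uniform(800, 3000, 40))
experimental = theoretical[:15] + np.random.default_rng(2).normal(0, 0.02, 15)

# signed error to the nearest theoretical mass, kept if inside the window
idx = np.clip(np.searchsorted(theoretical, experimental),
              1, len(theoretical) - 1)
nearest = np.where(np.abs(theoretical[idx] - experimental) <
                   np.abs(theoretical[idx - 1] - experimental),
                   theoretical[idx], theoretical[idx - 1])
errors = experimental - nearest
errors = errors[np.abs(errors) < large_tol]

# KS test of the rescaled errors against a uniform distribution on [0, 1]
stat, p = kstest((errors + large_tol) / (2 * large_tol), "uniform")
print(f"KS statistic {stat:.3f}, p-value {p:.2e}")  # tiny p => real match
```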

We describe a new database search tool, PRIMAL, for protein identification using PMF. The novelty behind PRIMAL is two-fold. First, we comprehensively analyze methods for measuring the degree of similarity between experimental and theoretical peaklists. Second, we employ machine learning as a means of combining the individual similarity measures into a scoring function. Finally, we systematically test the efficacy of PRIMAL in identifying proteins using highly curated and publicly available data. Our results suggest that PRIMAL is competitive if not better than some of the tools extensively used by the mass spectrometry community. A web server with an implementation of the scoring function is available at http://bmi.cchmc.org/primal.

We also note that the methodology is directly extensible to MS/MS based protein identification problem. We detail how to extend our approaches to the more complex MS/MS data.

Committee:

Michael Wagner, PhD (Committee Chair); Patrick Limbach, PhD (Committee Member); Jaroslaw Meller, PhD (Committee Member)

Subjects:

Bioinformatics

Keywords:

Protein Identification; Peptide Mass Fingerprinting; Machine Learning

Mo, Dengyao. Robust and Efficient Feature Selection for High-Dimensional Datasets
PhD, University of Cincinnati, 2011, Engineering and Applied Science: Mechanical Engineering
Feature selection is an active research topic in the machine learning and knowledge discovery in databases (KDD) communities. It contributes to making the data mining model more comprehensible to domain experts, improving the prediction performance and robustness of the model, and reducing model training time. This dissertation aims to provide solutions to three issues that are overlooked by many current feature selection researchers: feature interaction, data imbalance, and multiple subsets of features. Most extant filter feature selection methods are pair-wise comparison methods that test each pair of variables, i.e., one predictor variable and the response variable, and provide a correlation measure for each feature associated with the response variable. Such methods cannot take feature interactions into account. Data imbalance is another issue in feature selection: without considering data imbalance, the features selected will be biased towards the majority class. In high-dimensional datasets with sparse data samples, there will be many different feature sets that are highly correlated with the output, and domain experts usually expect us to identify multiple feature sets for them so that they can evaluate them based on their domain knowledge. This dissertation addresses these three issues based on a criterion called minimum expected cost of misclassification (MECM). MECM is a model-independent evaluation measure that evaluates the classification power of the tested feature subset as a whole, and it has adjustable weights to deal with imbalanced datasets. A number of case studies showed that MECM had favorable properties for searching for a compact subset of interacting features. In addition, an algorithm and corresponding data structure were developed to produce multiple feature subsets. The success of this research will have broad applications ranging from engineering and business to bioinformatics, such as credit card fraud detection, email filtering for spam classification, and gene selection for disease diagnosis.
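
A minimal sketch of an expected-cost-of-misclassification score of the assumed general form follows: a candidate feature subset is evaluated as a whole via cross-validated predictions, with adjustable costs protecting the minority class. The classifier, costs, and data are placeholders, not the dissertation's exact MECM.

```python
# Hedged sketch: cost-weighted misclassification score for a feature subset.
# The subset is scored as a whole, so interacting features can be rewarded.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def expected_cost(X, y, subset, c_fn=5.0, c_fp=1.0):
    pred = cross_val_predict(KNeighborsClassifier(3), X[:, subset], y, cv=5)
    fn = np.mean((y == 1) & (pred == 0))   # missed minority-class samples
    fp = np.mean((y == 0) & (pred == 1))   # false alarms
    return c_fn * fn + c_fp * fp           # lower is better

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 2] + X[:, 7] > 1.2).astype(int)  # features 2 and 7 interact
print(expected_cost(X, y, [2, 7]), expected_cost(X, y, [0, 1]))
```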

Committee:

Hongdao Huang, PhD (Committee Chair); Sundararaman Anand, PhD (Committee Member); Jaroslaw Meller, PhD (Committee Member); David Thompson, PhD (Committee Member); Michael Wagner, PhD (Committee Member)

Subjects:

Information Systems

Keywords:

Feature Selection; Data Mining; Machine Learning; Statistical Modeling; Knowledge Discovery in Databases

Geyer, Joseph Michael. Identification of Candidate Concepts in a Learning-Based Approach to Reverse Engineering
Master of Computer Science, Miami University, 2010, Computer Science and Systems Analysis
Software reverse engineering is the process of extracting knowledge from a software system and then creating high-level abstractions to communicate that knowledge. This is vital to supporting long-term maintenance of the system. One such abstraction, or view, is to split the classes of the system into two sets: domain concept classes and peripheral classes. That is, it separates the classes that relate to the domain of the system from those that merely support its operation and functioning. Supervised machine learning can be used to label domain concept classes and peripheral classes given a training set. However, manually creating a training set is inefficient. The goal of this research is to present a method and tool to semi-automate the creation of a training set for using supervised learning to classify domain concept classes and peripheral classes in a software system.

Committee:

Gerald Gannod, PhD (Advisor); James Kiper, PhD (Committee Member); Michael Zmuda, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

machine learning; reverse engineering; concept identification; automatic term recognition; active learning

Warren, Emily Amanda. Machine Learning for Road Following by Autonomous Mobile Robots
Master of Sciences (Engineering), Case Western Reserve University, 2008, EECS - Computer Engineering
This thesis explores the use of machine learning in the context of autonomous mobile robots driving on roads, with a focus on improving the robot's internal map. Early chapters cover the mapping efforts of DEXTER, Team Case's entry in the 2007 DARPA Urban Challenge. Competent driving may include the use of a priori information, such as road maps, and online sensory information, including vehicle position and orientation estimates in absolute coordinates as well as error coordinates relative to a sensed road. An algorithm may select the best of these typically flawed sources or, more robustly, use all flawed sources to improve an uncertain world map, both globally in terms of registration corrections and locally in terms of improving knowledge of obscured roads. It is shown how unsupervised learning can be used to learn the credibility of each sensor in a manner applicable to optimal data fusion.
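
As a hedged sketch of the fusion idea (not DEXTER's code), the snippet below combines flawed estimates of a road's lateral offset by weighting each source inversely to its error variance, so a low-credibility sensor contributes less to the map update; the estimates and variances are invented.

```python
# Inverse-variance (optimal linear) fusion of flawed offset estimates.
import numpy as np

estimates = np.array([1.8, 2.3, 2.0])     # map prior, GPS/INS, road sensor
variances = np.array([0.50, 0.20, 0.05])  # credibilities learned offline

weights = (1.0 / variances) / np.sum(1.0 / variances)
fused = np.sum(weights * estimates)       # dominated by the trusted sensor
fused_var = 1.0 / np.sum(1.0 / variances)
print(f"fused offset {fused:.3f} m, variance {fused_var:.3f}")
```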

Committee:

Wyatt Newman, PhD (Advisor); M. Cenk Cavusoglu, PhD (Committee Member); Francis Merat, PhD (Committee Member)

Subjects:

Computer Science; Engineering; Robots

Keywords:

machine learning; autonomous robot; driving; Urban Challenge; sensor fusion; unsupervised learning; global map; road following

Sui, Zhenhuan. Hierarchical Text Topic Modeling with Applications in Social Media-Enabled Cyber Maintenance Decision Analysis and Quality Hypothesis Generation
Doctor of Philosophy, The Ohio State University, 2017, Industrial and Systems Engineering
Many decision problems are set in changing environments. For example, determining the optimal investment in cyber maintenance depends on whether there is evidence of an unusual vulnerability, such as “Heartbleed”, that is causing an especially high rate of incidents. This gives rise to the need for timely information to update decision models so that optimal policies can be generated for each decision period. Social media provides a streaming source of relevant information, but that information needs to be efficiently transformed into numbers to enable the needed updates. This dissertation first explores the use of social media as an observation source for timely decision-making. To efficiently generate the observations for Bayesian updates, the dissertation proposes a novel computational method to fit an existing clustering model, called K-means Latent Dirichlet Allocation (KLDA). The method is illustrated using a cyber security problem related to changing maintenance policies during periods of elevated risk. The dissertation also studies four text corpora with 100 replications and shows that KLDA is associated with significantly reduced computational times and more consistent model accuracy compared with collapsed Gibbs sampling. Because social media is becoming more popular, researchers have begun applying text analytics models and tools to extract information from these platforms. Many of the text analytics models are based on Latent Dirichlet Allocation (LDA), but these models are often poor estimators of topic proportions for emerging topics. Therefore, the second part of the dissertation proposes a visual summarizing technique based on topic models, a point system, and Twitter feeds to support passive summarizing and sensemaking. The associated “importance score” point system is intended to mitigate the weakness of topic models. The proposed method is called the TWitter Importance Score Topic (TWIST) summarizing method. TWIST employs the topic proportion outputs of tweets and assigns importance points to present trending topics, generating a chart of the important and trending topics discussed over a given time period. The dissertation illustrates the methodology using two cyber-security field case study examples. Finally, the dissertation proposes a general framework to teach engineers and practitioners how to work with text data. As an extension of Exploratory Data Analysis (EDA) in quality improvement problems, Exploratory Text Data Analysis (ETDA) takes text as the input data, with the goal of extracting useful information from the text for exploration of potential problems and causal effects. This part of the dissertation presents a practical framework for ETDA in quality improvement projects, with four major steps: pre-processing text data, text data processing and display, salient feature identification, and salient feature interpretation. Various case studies are presented alongside these steps, illustrating them with the visualization techniques available in ETDA.
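
As a rough sketch of the kind of pipeline step involved (the dissertation's KLDA fitting procedure is more involved than this), the snippet below clusters TF-IDF vectors of tweets with K-means and reads the cluster shares in a time window as topic-proportion observations for a decision-model update; the tweets are invented.

```python
# Hedged stand-in: K-means over TF-IDF vectors as a fast topic-assignment
# step; cluster shares per window feed the Bayesian decision-model update.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

tweets = ["patch the openssl heartbleed bug now",
          "heartbleed vulnerability found on our servers",
          "great game last night", "what a game, amazing finish",
          "apply the security patch to openssl today"]

X = TfidfVectorizer().fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# cluster shares act as topic-proportion observations for this window
shares = np.bincount(labels, minlength=2) / len(labels)
print(dict(enumerate(shares)))   # e.g. an elevated security-topic share
```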

Committee:

Theodore Allen (Advisor); Steven MacEachern (Committee Member); Cathy Xia (Committee Member); Nena Couch (Other)

Subjects:

Finance; Industrial Engineering; Operations Research; Statistics; Systems Science

Keywords:

Natural Language Processing, NLP, Machine Learning, Bayesian Statistics, Hierarchical Text Topic Modeling, Text Analytics, Cyber Maintenance, Decision Analysis, Quality Hypothesis Generation, Latent Dirichlet Allocation, Financial Engineering

Madaris, Aaron T. Characterization of Peripheral Lung Lesions by Statistical Image Processing of Endobronchial Ultrasound Images
Master of Science in Biomedical Engineering (MSBME), Wright State University, 2016, Biomedical Engineering
This thesis introduces the concept of implementing greyscale analysis, also known as intensity analysis, on endobronchial ultrasound (EBUS) images for the purpose of diagnosing peripheral lung tumors. The statistical methodology of greyscale and histogram analysis allows the characterization of lung tissue in EBUS images. Regions of interest (ROI) are analyzed in MATLAB, and a feature vector of first-order, second-order, and histogram greyscale features is created and used for the classification of malignant vs. benign peripheral lung tumors. The tools implemented were MedCalc, for the initial statistical analysis of receiver operating characteristic (ROC) curves and multiple regression, and MATLAB, for the machine learning and ROI collection. Feature analysis, multiple regression, and machine learning methods were used to better classify the malignant and benign EBUS images. The classification is assessed with a confusion matrix, ROC curve, accuracy, sensitivity, and specificity. It was found that minimum pixel value, contrast, and energy are the best determining factors for discriminating between benign and malignant EBUS images.
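
A brief sketch of the kinds of features described above, computed on a synthetic ROI with scikit-image (0.19+ API) rather than MATLAB: a first-order minimum pixel value from the histogram, and contrast and energy from a grey-level co-occurrence matrix (GLCM).

```python
# First-order and GLCM (second-order) greyscale features on a synthetic ROI.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

roi = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)

first_order = {"min": int(roi.min()), "mean": float(roi.mean()),
               "std": float(roi.std())}

glcm = graycomatrix(roi, distances=[1], angles=[0], levels=256,
                    symmetric=True, normed=True)
second_order = {"contrast": graycoprops(glcm, "contrast")[0, 0],
                "energy": graycoprops(glcm, "energy")[0, 0]}
print(first_order, second_order)   # entries of a candidate feature vector
```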

Committee:

Ulas Sunar, Ph.D. (Advisor); Jason Parker, Ph.D. (Committee Member); Jaime Ramirez-Vick, Ph.D. (Committee Member)

Subjects:

Biomedical Engineering; Biomedical Research; Biostatistics; Computer Engineering; Engineering; Health Care; Medical Imaging

Keywords:

Endobronchial Ultrasound; Medical Imaging; Image Analysis; Statistical Analysis; Machine learning; MATLAB; Histogram; Texture; Multiple Regression; Feature analysis

Adams, William A. Analysis of Robustness in Lane Detection using Machine Learning Models
Master of Science (MS), Ohio University, 2015, Electrical Engineering (Engineering and Technology)
An appropriate approach to incorporating robustness into lane detection algorithms benefits autonomous vehicle applications and other problems relying on fusion methods. While rigorous empirical methods have traditionally been developed for mitigating lane detection error, an evidence-based, model-driven approach yields robust results using multispectral video as input to various machine learning models. Branching beyond the few network structures considered for image understanding applications, deep networks with unique optimization functions are demonstrably more robust while making fewer assumptions. This work adopts a simple framework for data collection: image patches are retrieved for comparison via regression through a learning model. Along a horizontal scanline, the most probable sample is selected to retrain the network. Models include simple regressors, various autoencoders, and a few specialized deep networks. Samples are compared by robustness, and the results favor deep and highly specialized network structures.

Committee:

Mehmet Celenk (Advisor); Jeffrey Dill (Committee Member); Maarten Uijt de Haag (Committee Member); Rida Benhaddou (Committee Member)

Subjects:

Artificial Intelligence; Automotive Engineering; Computer Science; Engineering

Keywords:

Machine Learning; ADAS; Lane Detection; Autoencoder; Regressor; Deep Network; Deep Learning

John, Zubin R. Predicting Day-Zero Review Ratings: A Social Web Mining Approach
Master of Science, The Ohio State University, 2015, Computer Science and Engineering
Social web mining is a term closely associated with modern-day use of the Internet. With large Internet companies such as Google, Apple, and IBM moving towards integrating intelligence into their product ecosystems, a large number of different applications have appeared in the social sphere, and with the aid of machine learning techniques there is no dearth of learning possible from endless streams of user-generated content. One task in this domain that has seen relatively little research is predicting review scores prospectively, i.e., prior to the release of the entity in question (a movie, electronic product, game, or book). It is easy to locate this chatter on social streams such as Twitter; what is difficult is extracting relevant information and facts about these entities, and more difficult still is predicting the day-zero review rating scores that provide insightful information about these products prior to their release. In this thesis, we propose just such a framework: a setup capable of extracting facts about reviewable entities. Populating a list of potential objects for a year, we follow an approach similar to bootstrapping to learn relevant facts about these prospective entities, all geared towards the task of learning to predict scores in a machine learning setting. Towards the end goal of predicting review scores for potential products, our system supports alternative strategies which perform competitively on the task. All the predictions from the learning framework output scores comparable to human judgment within a certain allowable error margin. The results bode well for potential large-scale predictive tasks on real-time data streams; in addition, this framework proposes alternative feature spaces which, in aggregation, describe a multi-method approach to achieving higher accuracy on tasks which have previously seen lackluster results.

Committee:

Alan Ritter (Advisor); Eric Fosler-Lussier (Committee Member)

Subjects:

Artificial Intelligence; Computer Science

Keywords:

twitter, social web mining, information extraction, applied machine learning

Plis, Kevin A. The Effects of Novel Feature Vectors on Metagenomic Classification
Master of Science (MS), Ohio University, 2014, Computer Science (Engineering and Technology)
Metagenomics plays a crucial role in our understanding of the world around us. Machine learning and bioinformatics methods have struggled to accurately identify the organisms present in metagenomic samples. By using improved feature vectors, higher classification accuracy can be achieved when using the machine learning classification approach to identify the organisms present in a metagenomic sample. This research is a pilot study that explores novel feature vectors and their effect on metagenomic classification. A synthetic data set was created using the genomes of 32 organisms from the Archaea and Bacteria domains, with 450 fragments of varying length per organism used to train the classification models. Using a novel feature vector one tenth the size of the currently used feature vectors, improvements of 6.34%, 21.91%, and 15.07% in species-level accuracy were found on 100, 300, and 500 bp fragments, respectively, for this data set. The results of this study also show that using more features does not always translate to higher classification accuracy, and that higher classification accuracy can be achieved through feature selection.
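
For context, a common baseline feature vector in fragment classification is the k-mer frequency profile sketched below; the thesis's novel, smaller vectors are a different construction.

```python
# Baseline 3-mer frequency profile of a DNA fragment (4**k features).
from itertools import product
from collections import Counter

def kmer_vector(seq, k=3):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts[km] / total for km in kmers]

frag = "ACGTACGGTACCGTTAGCAT"
vec = kmer_vector(frag)
print(len(vec), round(sum(vec), 6))   # 64 features summing to ~1
```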

Committee:

Lonnie Welch, PhD (Advisor)

Subjects:

Artificial Intelligence; Bioinformatics; Computer Science

Keywords:

Metagenomics; Classification; Machine Learning; SVM; Support Vector Machine; Feature Vector; Feature Selection; Bioinformatics

Doran, Gary Brian. Multiple-Instance Learning from Distributions
Doctor of Philosophy, Case Western Reserve University, 2015, EECS - Computer and Information Sciences
I propose a new theoretical framework for analyzing the multiple-instance learning (MIL) setting. In MIL, training examples are provided to a learning algorithm in the form of labeled sets, or "bags," of instances. Applications of MIL include 3-D quantitative structure-activity relationship prediction for drug discovery and content-based image retrieval for web search. The goal of an algorithm is to learn a function that correctly labels new bags or a function that correctly labels new instances. I propose that bags should be treated as latent distributions from which samples are observed. I show that it is possible to learn accurate instance- and bag-labeling functions in this setting, as well as functions that correctly rank bags or instances, under weak assumptions. Additionally, my theoretical results suggest that it is possible to learn to rank efficiently using traditional, well-studied "supervised" learning approaches. These results also indicate that supervised approaches for learning from distributions can be used to directly learn bag-labeling functions efficiently. I perform an extensive empirical evaluation that supports the theoretical predictions entailed by the new framework. In addition to showing how supervised approaches can be applied to MIL, I prove new hardness results on using MI-specific algorithms to learn hyperplane labeling functions for instances. Finally, I propose a new resampling approach for MIL, analyze it under the new theoretical framework, and show that it can improve the performance of MI classifiers when training set sizes are small. In summary, the proposed theoretical framework leads to a better understanding of the relationship between the MI and standard supervised learning settings, and it provides new methods for learning from MI data that are more accurate, more efficient, and have better understood theoretical properties than existing MI-specific algorithms.
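
A minimal sketch of the "bags as latent distributions" view follows: each bag is summarized by the empirical mean of its instances (a crude stand-in for a kernel mean embedding) and passed to an ordinary supervised SVM; the bag generator below is synthetic and invented.

```python
# Bags -> mean embeddings -> ordinary supervised SVM on bag labels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_bag(label):
    # positive bags sample from a shifted latent distribution
    center = np.array([2.0, 2.0]) if label else np.zeros(2)
    return rng.normal(center, 1.0, size=(rng.integers(5, 15), 2))

labels = rng.integers(0, 2, 60)
bags = [make_bag(l) for l in labels]
X = np.array([b.mean(axis=0) for b in bags])   # one vector per bag

clf = SVC().fit(X[:40], labels[:40])
print("held-out bag accuracy:", clf.score(X[40:], labels[40:]))
```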

Committee:

Soumya Ray (Advisor); Harold Connamacher (Committee Member); Michael Lewicki (Committee Member); Stanislaw Szarek (Committee Member); Kiri Wagstaff (Committee Member)

Subjects:

Artificial Intelligence; Computer Science

Keywords:

machine learning; multiple-instance learning; kernel methods; learning theory; classification

Struble, Nigel. Measuring Glycemic Variability and Predicting Blood Glucose Levels Using Machine Learning Regression Models
Master of Science (MS), Ohio University, 2013, Computer Science (Engineering and Technology)
This thesis presents research in machine learning for diabetes management. There are two major contributions: (1) development of a metric for measuring glycemic variability, a serious problem for patients with diabetes; and (2) prediction of patient blood glucose levels, in order to preemptively detect and avoid potential health problems. The glycemic variability metric uses machine learning trained on multiple statistical and domain-specific features to match physician consensus on glycemic variability. The metric performs similarly to an individual physician's ability to match the consensus, and when used as a screen for detecting excessive glycemic variability, it outperforms the baseline metrics. The blood glucose prediction model uses machine learning to integrate a general physiological model and life events to make patient-specific predictions 30 and 60 minutes into the future. The prediction model was evaluated in several situations, such as near a meal or during exercise. It outperformed the baseline prediction models and performed similarly to, and in some cases outperformed, expert physicians who were given the same prediction problems.

Committee:

Cynthia Marling (Advisor); Razvan Bunescu (Committee Member); Frank Schwartz (Committee Member); Jundong Liu (Committee Member)

Subjects:

Computer Science

Keywords:

Machine Learning; Glycemic Variability; BGL Prediction; Diabetes

Aranibar, Luis Alfonso Quiroga. Learning fuzzy logic from examples
Master of Science (MS), Ohio University, 1994, Industrial and Manufacturing Systems Engineering (Engineering)

Traditional manufacturing schemes, hereafter referred to as manufacturing planning, have not been able to provide tools with enough flexibility to adapt to the many constraints of everyday production. These approaches are quantitative in nature, and their application to control problems fails to perform satisfactorily when the information required to answer a query is non-linear or ill-defined. With the advent of high quality requirements, conventional manufacturing systems have been forced to explore innovative alternatives to cope with the pressure of global competitiveness. Neural Networks and Machine Learning, with their ability to learn from examples, were proposed early on for solving non-linear control problems adaptively. More recently, Fuzzy Logic has emerged as a promising approach for controlling processes.

This work studies the application of Fuzzy Logic through a methodology proposed by Wang and Mendel from the University of Southern California. This methodology is of relevant interest because it provides fuzzy model builders with a unique tool to generate rules from fuzzy sets which combine both numerical and linguistic information.

The methodology is explained in detail and is applied to two uncomplicated case studies. The first case study is a typical control problem, and it relates to robot motion control. A simulation is developed to obtain data describing the movement of a two-degree-of-freedom robot arm. The collected pairs are used for training the fuzzy controller, which is then tested with new data.

Although Fuzzy Logic is now quite popular in commercial control applications, there is little reported in the area of manufacturing planning and control. In the second case study, the Wang-Mendel methodology is applied to a basic job shop scheduling problem. Performance of the fuzzy machine is compared against a traditional backpropagation neural network and Quinlan's ID3 machine learning technique.
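
A condensed sketch of the Wang-Mendel steps follows: triangular fuzzy regions over each variable, one candidate rule per training pair (each value mapped to its maximum-membership region), and conflict resolution keeping the highest-degree rule. The region count and data pairs below are assumptions.

```python
# Wang-Mendel rule generation from numerical data (condensed sketch).
import numpy as np

centers = np.linspace(0.0, 1.0, 5)        # five fuzzy regions per variable

def membership(x):
    # triangular memberships over the shared [0, 1] universe
    width = centers[1] - centers[0]
    return np.maximum(0.0, 1.0 - np.abs(x - centers) / width)

def wang_mendel(pairs):
    rules = {}                            # antecedents -> (consequent, degree)
    for x1, x2, y in pairs:
        m1, m2, my = membership(x1), membership(x2), membership(y)
        degree = m1.max() * m2.max() * my.max()
        key = (int(m1.argmax()), int(m2.argmax()))
        if key not in rules or rules[key][1] < degree:
            rules[key] = (int(my.argmax()), degree)   # keep best rule
    return rules

data = [(0.1, 0.2, 0.15), (0.8, 0.9, 0.85), (0.12, 0.22, 0.2)]
for ant, (con, deg) in wang_mendel(data).items():
    print(f"IF x1 is R{ant[0]} AND x2 is R{ant[1]} THEN y is R{con} ({deg:.2f})")
```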

Committee:

Luis Rabelo (Advisor)

Subjects:

Engineering, Industrial

Keywords:

Learning; Fuzzy logic; backpropagation; Quinlan's ID3 machine learning technique

Jha, Prakash Teknarayan. Optimizing Paired Kidney Transplant by Applying Machine Learning
Master of Science in Engineering, University of Toledo, 2011, College of Engineering
In this research, a tree-based machine learning algorithm has been used to build a robust mechanism that optimizes paired kidney transplants in a proficient manner. The system predicts how good or bad a specific kidney transplant match is, and it successfully classifies and predicts donor quality. Potential donors were classified into linguistic categories such as Good, Average, and Bad based on generated numeric attributes. To choose the data mining technique to be employed by the system, several algorithms were considered: J48 (a tree-based machine learning algorithm), JRip (a rule-based machine learning algorithm), and SMO (sequential minimal optimization). J48 was found to be the best choice of the three. The donor classification was based on donor parameters such as age, HLA A, HLA B, HLA DR, center, CMV, EBV, and blood group. The system is built using Java, SQL, and Weka (a machine learning suite). The system also provides a visual mode of communication for doctors and surgeons to consider key factors like donor quality and donor blood group before carrying out a transplant. This feature also facilitates the doctors' decision-making as to where a chain of transplants should be broken, so as to ensure better and desired results as well as to provide more leads to feasible transplant chains in the future. The system has an accuracy of 97.18% (correctly classified instances by J48) and a promising kappa statistic of 0.9567, a measure of the decision-making capability of the system in comparison to a real physician. The predicted donor quality was then incorporated into a matching system where possible matches are visually displayed as optimal matches and other top matches, each showing the donor's blood group and donor quality, which are important in making transplant decisions.
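
As a hedged Python analogue of the evaluation described above (Weka's J48 is a C4.5 learner, so scikit-learn's CART tree is only a stand-in, and the donor features below are synthetic placeholders, not transplant data):

```python
# Decision-tree classification with accuracy and Cohen's kappa, on fake data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import cohen_kappa_score, accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))       # stand-ins for age, HLA codes, etc.
y = rng.integers(0, 3, 500)         # Good / Average / Bad donor class
y[X[:, 0] > 0.5] = 2                # inject some learnable signal

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
pred = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).predict(Xte)
print("accuracy:", accuracy_score(yte, pred),
      "kappa:", round(cohen_kappa_score(yte, pred), 4))
```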

Committee:

Devinder Kaur, PhD (Committee Chair); Mansoor Alam, PhD (Committee Member); Henry Ledgard, PhD (Committee Member)

Subjects:

Artificial Intelligence; Computer Science; Technology

Keywords:

paired kidney transplant; kidney; transplant; prakash; jha; teknarayan; university of toledo; thesis; rees; kaur; selman; ut; paired; optimization; algorithm; machine learning; prakash teknarayan jha; computer science; ms; masters

Venkatesan, Vaidehi. Cuisines as Complex Networks
MS, University of Cincinnati, 2011, Engineering and Applied Science: Computer Science
Cuisines are among the richest and most distinctive artifacts of human culture, reflecting social structures, religious constraints, ecological choices, economic factors, historical events and, of course, gastronomical preferences. Cuisines are exemplars of human thinking and the ideation process, evolved under a number of civilizational constraints. While cuisines all over the world use many ingredients in common, they are distinguished by their patterns of ingredient usage. Every cuisine can be seen as a network whose nodes are the ingredients used in its recipes, with links indicating which ingredients are used together. In this thesis, several cuisine networks are constructed incorporating semantic relevance between nodes using data mining and natural language processing techniques. These networks are compared in terms of their structural and statistical characteristics using techniques from graph theory and multivariate analysis. This includes calculating features such as degree distributions, clustering, diameters, and assortativity, as well as building correlation and co-occurrence matrices of ingredients. The features obtained from this analysis are used to organize the cuisines into a cluster hierarchy, thus clarifying the relationships between them. A set of algorithms based on machine learning and graph theoretic techniques is developed to build a culinary evaluation and recommendation system called Critic, which can check the degree to which a given set of ingredients is consistent within a given cuisine for a culinary recipe. Such a system can serve as the core of a generic ideation system for generating new (recipe) ideas and for classifying given (recipe) ideas systematically into specific (cuisine) contexts.
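
A small sketch of the network construction described above, on toy recipes: nodes are ingredients, and an edge links two ingredients whenever they co-occur in a recipe, weighted by the co-occurrence count.

```python
# Ingredient co-occurrence network from a toy recipe list.
from itertools import combinations
import networkx as nx

recipes = [["tomato", "basil", "olive oil", "garlic"],
           ["tomato", "mozzarella", "basil"],
           ["garlic", "olive oil", "pasta"]]

G = nx.Graph()
for recipe in recipes:
    for a, b in combinations(sorted(set(recipe)), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)   # increment co-occurrence count

print(G.number_of_nodes(), "ingredients,", G.number_of_edges(), "links")
print("degree of tomato:", G.degree["tomato"])
```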

Committee:

Ali Minai, PhD (Committee Chair); Raj Bhatnagar, PhD (Committee Member); Yizong Cheng, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

cuisines; complex networks; semantic analysis; machine learning; ideation; information retrieval

Fang, Chunsheng. Novel Frameworks for Mining Heterogeneous and Dynamic Networks
PhD, University of Cincinnati, 2011, Engineering and Applied Science: Computer Science and Engineering
Graphs serve as an important tool for discrete data representation. Recently, graph representations have made possible very powerful machine learning algorithms, such as manifold learning, kernel methods, and semi-supervised learning. With the advent of large-scale real world networks, such as biological networks (disease networks, drug target networks, etc.) and social networks (the DBLP co-authorship network, Facebook friendships, etc.), machine learning and data mining algorithms have found new application areas and have contributed to advancing our understanding of the properties and phenomena governing real world networks. When dealing with real world data represented as networks, two problems arise quite naturally: I) How can we integrate and align the knowledge encoded in multiple heterogeneous networks? For instance, how do we find the similar genes in co-disease and protein-protein interaction networks? II) How can we model and predict the evolution of a dynamic network? A real world example: given N years of snapshots of an evolving social network, how do we build a model that can capture the temporal evolution and make reliable predictions? In this dissertation, we present an innovative graph embedding framework, which identifies the key components of modeling the evolution in time of a dynamic graph. Unlike many state-of-the-art graph link prediction and modeling algorithms, it formulates the link prediction problem from a geometric perspective that can capture the dynamics of the intrinsic continuous graph manifold evolution. It is attractive due to its simplicity and its potential to relax the mining problem into a feasible domain in which standard machine learning and regression models can utilize historical graph time series data. To address the first problem, we first propose a novel probability-based similarity measure, which has led to promising applications in content-based image retrieval and image annotation, followed by a manifold alignment framework to align multiple heterogeneous networks, which demonstrates its power in mining biological networks. Finally, the dynamic graph mining framework generalizes most current graph embedding dynamic link prediction algorithms. Comprehensive experimental results on both synthesized and real-world datasets demonstrate that our proposed algorithmic frameworks for multiple heterogeneous networks and dynamic networks can lead to better and more insightful understanding of real world networks. Scalability of our algorithms is also considered by employing a MapReduce cloud computing architecture.

Committee:

Anca Ralescu, PhD (Committee Chair); Anil Jegga, DVMMRes (Committee Member); Fred Annexstein, PhD (Committee Member); Kenneth Berman, PhD (Committee Member); Yizong Cheng, PhD (Committee Member); Dan Ralescu, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

machine learning; social network; data mining; manifold learning; graph embedding; dynamic graph

Goeringer, Tyler. Massively Parallel Reinforcement Learning With an Application to Video Games
Master of Sciences, Case Western Reserve University, 2013, EECS - Computer and Information Sciences
We propose a framework for periodic policy updates of computer controlled agents in an interactive scenario. We use the graphics processing unit (GPU) to accelerate an offline reinforcement learning algorithm which periodically updates an online agent's policy. The main contributions of this work are the use of GPU acceleration combined with a periodic update model to provide reinforcement learning in a performance constrained environment. We show empirically that given certain environment properties, the GPU accelerated implementation provides better performance than a traditional implementation utilizing the central processing unit (CPU). In addition, we show that while an online machine learning algorithm can function in some performance constrained environments, an offline algorithm reduces the performance constraints allowing for application to a wider variety of environments. Finally, we demonstrate combining these techniques to control an agent in the world of Quake III Arena, resulting in a computer controlled agent capable of adapting to different opponents.
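
A compact sketch of the least-squares policy iteration (LSPI) update at the core of such an offline learner follows: solve A w = b over a batch of transitions, where phi is a state-action feature map. The feature map, policy, and samples are toy placeholders; the thesis's contribution is accelerating these accumulations on the GPU.

```python
# LSTD-Q step of LSPI: accumulate A and b over transitions, solve for weights.
import numpy as np

def lstdq(samples, phi, policy, gamma=0.9, k=4):
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s2 in samples:
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s2, policy(s2)))
        b += f * r
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)  # small ridge for safety

phi = lambda s, a: np.array([1.0, s, a, s * a])      # toy feature map
policy = lambda s: 1 if s > 0 else 0
samples = [(0.2, 1, 1.0, 0.4), (-0.3, 0, 0.0, -0.1),
           (0.5, 1, 1.0, 0.7), (-0.6, 0, 0.0, -0.4)]
print("policy weights:", np.round(lstdq(samples, phi, policy), 3))
```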

Committee:

Soumya Ray (Advisor); Swarup Bhunia (Committee Member); Michael Lewicki (Committee Member); Frank Merat (Committee Member)

Subjects:

Computer Science

Keywords:

gpu; artificial intelligence; parallel programming; reinforcement learning; machine learning; least-squares policy iteration; lspi; quake iii

Jiang, Ke. Small-Variance Asymptotics for Bayesian Models
Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering
Bayesian models have been used extensively in various machine learning tasks, often resulting in improved prediction performance through the utilization of (layers of) latent variables when modeling the generative process of the observed data. Extending the parameter space from finite to infinite-dimensional, Bayesian nonparametric models can infer the model complexity directly from the data and thus also adapt with the amount of the observed data. This is especially appealing in the age of big data. However, such benefits come at a price: parameter training and prediction are notoriously difficult even for parametric models. Sampling and variational inference techniques are two standard methods for inference in Bayesian models, but for many problems, neither approach scales effectively to large-scale data. Currently, there is significant ongoing research trying to scale these methods using ideas from stochastic differential equations and stochastic optimization. A recent thread of research has considered small-variance asymptotics of latent-variable models as a way to capture the benefits of rich probabilistic models while also providing a framework for designing more scalable combinatorial optimization algorithms. Such models are often motivated by the well-known connection between mixtures of Gaussians and K-means: as the variances of the Gaussians tend to zero, the mixture of Gaussians model approaches K-means, both in terms of objectives and algorithms. In this dissertation, we study small-variance asymptotics of Bayesian models, yielding new formulations and algorithms which may provide more efficient solutions to various unsupervised learning problems. Firstly, we consider clustering problems, exploring small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. Secondly, we consider sequential models, exploring the small-variance asymptotic analysis of infinite hidden Markov models and yielding a combinatorial objective function for discrete-data sequence observations with a non-fixed number of states. This involves a k-means-like term along with penalties based on state transitions and the number of states. We also present a simple, scalable, and flexible algorithm to optimize it. Lastly, we consider topic modeling problems, which have emerged as fundamental tools in unsupervised machine learning. We approach them via combinatorial optimization and take a small-variance limit of the latent Dirichlet allocation model to derive a new objective function. We minimize this objective by using ideas from combinatorial optimization, obtaining a new, fast, and high-quality topic modeling algorithm. In particular, we show that our results are not only significantly better than traditional small-variance asymptotic based algorithms, but also truly competitive with popular probabilistic approaches.
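
As a flavor of the algorithms such limits yield, the sketch below implements DP-means, the small-variance asymptotic analogue of a Dirichlet process Gaussian mixture: it behaves like K-means but opens a new cluster whenever a point lies farther than a penalty lambda from every existing center. The data and lambda here are invented.

```python
# DP-means: K-means-like updates plus a cluster-creation penalty lambda.
import numpy as np

def dp_means(X, lam, n_iter=10):
    centers = [X[0]]
    for _ in range(n_iter):
        assign = []
        for x in X:
            d = [np.sum((x - c) ** 2) for c in centers]
            if min(d) > lam:                   # too far: spawn a new cluster
                centers.append(x.copy())
                assign.append(len(centers) - 1)
            else:
                assign.append(int(np.argmin(d)))
        assign = np.array(assign)
        centers = [X[assign == j].mean(axis=0) for j in range(len(centers))]
    return np.array(centers), assign

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, .3, (50, 2)), rng.normal(3, .3, (50, 2))])
centers, assign = dp_means(X, lam=2.0)
print(len(centers), "clusters found")          # discovers both blobs
```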

Committee:

Mikhail Belkin (Advisor); Brian Kulis (Advisor); Alan Ritter (Committee Member)

Subjects:

Computer Science

Keywords:

Bayesian models, machine learning, Small-Variance Asymptotics

Prasanna, Prateek. Novel Radiomics for Spatially Interrogating Tumor Habitat: Applications in Predicting Treatment Response and Survival in Brain Tumors
Doctor of Philosophy, Case Western Reserve University, 2017, Biomedical Engineering
Cancer is not a bounded, self-organized system. Most malignant tumors have heterogeneous growth, leading to disorderly proliferation well beyond the surgical margins. In fact, the impact of certain tumors is observed not just within the visible tumor, but also in the immediate peritumoral, as well as in seemingly normal-appearing adjacent field. Visual inspection is often not a reliable instrument in cancer diagnosis, providing only qualitative analysis of an image, thereby missing subtle disease signatures. These, and other imaging limitations can lead to unnecessary surgical interventions. Computerized image analysis has shown promise in comprehending disease heterogeneity through quantification and detection of sub-visual patterns. In this work, we present novel radiomic tools to identify subtle radiologic cues (radiomic descriptors) and address clinical challenges in cancer diagnosis, prognosis, and treatment-evaluation. The developed tools and techniques are modality- and domain-agnostic. They can be applied in a pan-cancer setting to mine information from radiographic images and discover associations with underlying molecular (radio-genomics) or histological (radio-pathomics) characteristics to provide a holistic characterization of disease. We have demonstrated their efficacy in addressing problems in prognosis and treatment management of brain tumors. The challenges we target specifically include (1) inability to estimate survival at a pre-treatment stage and (2) inability to avoid highly-invasive surgeries in patients with radiation-induced treatment changes that mimic tumor recurrence. Underlying heterogeneity is linked to poor prognosis and tumor recurrence. Cellular level differences associated with the distinct physiological pathways might also manifest at the radiographic (i.e. MRI) length scale. We present two radiomic descriptors, Co-occurrence of Local Anisotropic Gradient Orientations (CoLlAGe) and radiographic-Deformation and Textural Heterogeneity (r-DepTH), which attempt to capture voxel-level textural and structural heterogeneity associated with brain tumors on MRI. These radiomic features are extracted not only from the solid tumor regions, but also from the adjacent tumor habitat and the healthy parenchyma. Subsequently, they are used in a machine learning setting to predict survival on treatment-naive imaging and characterize radiation-induced effects on post-treatment MRI. Further, via human-machine comparison experiments, we demonstrate the utility of radiomic-based frameworks as a second read decision support in cancer management.

Committee:

Anant Madabhushi (Advisor); Pallavi Tiwari (Committee Chair); David Wilson (Committee Member); Lisa Rogers (Committee Member); Charles Lanzieri (Committee Member)

Subjects:

Biomedical Engineering; Biomedical Research

Keywords:

Radiomics; Texture; Brain; Necrosis; Recurrence; Machine Learning; Radiogenomics; GBM; CoLlAGe; Cancer; Image analysis; Habitat; Survival; Prognosis; Diagnosis; treatment evaluation

Yu, Andrew Seohwan. NBA On-Ball Screens: Automatic Identification and Analysis of Basketball Plays
Master of Computer and Information Science, Cleveland State University, 2017, Washkewicz College of Engineering
The on-ball screen is a fundamental offensive play in basketball; it is often used to trigger a chain reaction of player and ball movement to obtain an effective shot. All teams in the National Basketball Association (NBA) employ the on-ball screen on offense. On the other hand, a defense can mitigate its effectiveness by anticipating the on-ball screen and its goals. In the past, it was difficult to measure a defender's ability to disrupt the on-ball screen, and it was often described using abstract words like instincts, experience, and communication. In recent years, player motion-tracking data in NBA games has become available through the development of sophisticated data collection tools. This thesis presents methods to construct a framework which can extract, transform, and analyze the motion-tracking data to automatically identify the presence of on-ball screens. The framework also helps NBA players and coaches adjust their game plans regarding the on-ball screen using trends from past games. With the help of support vector machines, the framework identifies on-ball screens with an accuracy of 85%, a considerable improvement over the current published results in the existing literature.

Committee:

Sunnie Chung, Ph.D. (Committee Chair); Yongjian Fu, Ph.D. (Committee Member); Nigamanth Sridhar, Ph.D. (Committee Member)

Subjects:

Artificial Intelligence; Computer Science

Keywords:

NBA; Basketball; Basketball Analytics; NBA Analytics; Data Mining; Web Scraping; Machine Learning; Support Vector Machine; Classification

Wood, Nicholas Linder. A Novel Kernel-Based Classification Method using the Pythagorean Theorem
Master of Science, The Ohio State University, 2016, Chemical Engineering
The drug discovery process is a long and expensive one, and therefore methods which can reduce the required time and money are desirable. In this work, we develop a novel classification model, resembling other kernel-based methods in machine learning, which can be used as a sieve to select the drug candidates most likely to reach a desired endpoint. We argue for the efficacy of this model by giving it the task of predicting the outcome of the Ames mutagenicity test, a biological assay indicating whether a molecule is likely carcinogenic. The model was run twice using two sets of molecular fingerprints, Molecular Access (MACCS) keys and Toxprints. The model yielded a concordance, specificity, and sensitivity of 0.79, 0.75, and 0.81, respectively, for the MACCS keys, and 0.78, 0.72, and 0.82, respectively, for the Toxprints. Additionally, the area under the ROC curve (AUC) was calculated as 0.82 and 0.81 for the MACCS keys and Toxprints, respectively. These results are comparable to other commercially and publicly available models, even though little attention was paid to the choice of fingerprints. We conclude that the model shows promise, especially when the underlying mechanism causing the endpoint is unknown, and we propose further ways in which the model, and the method underlying it, could be applied.

Committee:

James Rathman (Advisor)

Subjects:

Chemical Engineering; Information Science

Keywords:

classification, machine learning, kernel, chemoinformatics, Ames mutagenicity, Pythagorean theorem

Cai, Xiaoshu. Development of Computational Approach for Drug Discovery
Master of Sciences, Case Western Reserve University, 2016, EECS - Computer and Information Sciences
Computational drug discovery is still in its early stage. Effective approaches are needed to understand disease mechanisms and link diseases to potential drug treatments through large-scale data. In this thesis, I present two data mining approaches: one to identify candidate drug therapies for inflammatory bowel disease (IBD) and one to understand the genetic basis of colorectal cancer (CRC). In the first study, I developed a computational pipeline to systematically screen millions of genomic signatures of FDA-approved drugs and IBD, and to prioritize potential candidate drugs for IBD. The prediction precision improved by 67% for clinical trial drugs and 43% for off-target drugs compared with a state-of-the-art approach. In the second study, I developed a supervised machine learning approach to predict CRC tumor progression using patient-level genomic features. Pathway analysis and network visualization were also applied to significantly expressed cytokine genes, demonstrating the significant role of cytokines in CRC metastasis.

Committee:

Rong Xu (Advisor); Xiang Zhang (Committee Member); Xiaofeng Zhu (Committee Member)

Subjects:

Bioinformatics; Computer Science

Keywords:

Drug repositioning; Machine learning; LINCS; Cytokines

Rawashdeh, Ahmad. Semantic Similarity of Node Profiles in Social Networks
PhD, University of Cincinnati, 2015, Engineering and Applied Science: Computer Science and Engineering
It can be said, without exaggeration, that social networks have taken a large segment of the population by storm. Regardless of geographical location or socio-economic status, as long as access to an Internet-connected computer is available, a person has access to the whole world and to a multitude of social networks. By being able to share, comment, and post on various social network sites, a user of social networks becomes a "citizen of the world", ensuring presence across boundaries (be they geographic or socio-economic). At the same time, social networks have brought forward many issues interesting from a computing point of view. One of these issues is that of evaluating similarity between nodes/profiles in a social network. Such evaluation is not only interesting but important, as similarity underlies the formation of communities and the acquisition of friends, both in real life and on the web. In this thesis, several methods for finding similarity, including semantic similarity, are investigated, and a new approach, Wordnet-Cosine similarity, is proposed. The Wordnet-Cosine similarity (and associated distance measure) combines a lexical database, Wordnet, with cosine similarity (from information retrieval) to find possibly similar profiles in a network. To assess the performance of the Wordnet-Cosine similarity measure, two experiments were conducted. The first experiment illustrates the use of Wordnet-Cosine similarity in community formation, where communities are considered to be clusters of profiles; the results of using Wordnet-Cosine are compared with those of four other similarity measures (also described in this thesis). In the second set of experiments, Wordnet-Cosine was applied to the problem of link prediction; its performance in predicting links in a random social graph was compared with a random link predictor and was found to achieve better accuracy.
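
A hedged sketch of the ingredients of Wordnet-Cosine similarity follows (the thesis's exact combination rule may differ): cosine similarity over profile term sets, softened by WordNet relatedness so that near-synonyms count as partial matches. It assumes nltk with the wordnet corpus installed.

```python
# Soft-cosine profile similarity using WordNet path similarity for terms.
from math import sqrt
from nltk.corpus import wordnet as wn

def term_sim(a, b):
    if a == b:
        return 1.0
    sims = [x.path_similarity(y)
            for x in wn.synsets(a) for y in wn.synsets(b)]
    sims = [s for s in sims if s is not None]   # drop cross-POS Nones
    return max(sims, default=0.0)

def wordnet_cosine(p1, p2):
    # soft term matches replace exact matches in the cosine numerator
    num = sum(max(term_sim(a, b) for b in p2) for a in p1)
    return num / (sqrt(len(p1)) * sqrt(len(p2)))

print(wordnet_cosine(["movie", "guitar"], ["film", "music"]))
```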

Committee:

Anca Ralescu, Ph.D. (Committee Chair); Irene Diaz, Ph.D. (Committee Member); Rehab M. Duwairi, Ph.D. (Committee Member); Kenneth Berman, Ph.D. (Committee Member); Chia Han, Ph.D. (Committee Member); Dan Ralescu, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Social Networks; Wordnet; Semantic; Machine Learning; Link Prediction

Natale, James. A Strategy for Reducing Congestive Heart Failure Readmissions Through the Use of Interventions Targeted by Machine Learning
Doctor of Philosophy, University of Akron, 2015, Mechanical Engineering
Hospitals face a financial penalty when patients are readmitted within thirty days of discharge, yet preventing readmissions is difficult. A wide variety of interventions aim at improving patient outcomes and preventing readmissions, but these interventions are expensive both in financial outlay and in staff time; applied indiscriminately, they would cost more than they would save through readmissions prevention. We describe an all-encompassing strategy that allows hospitals to reduce their readmissions in a cost-effective manner, applying an analytical approach to all aspects of the problem. By creating a predictive model with machine learning methods on hospital records, we can determine each patient's risk of being readmitted. We detail how the literature on intervention strategies can be condensed and utilized to determine prospective strategies of interest. Utilizing the risk predicted by the model as well as the published literature on interventions, we determine the optimal assignment of interventions to patients through a genetic algorithm heuristic search. Only by combining these three aspects can we formulate an analytics-driven approach to reducing readmissions in a cost-effective manner.
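
A toy sketch of the optimization step follows: a genetic algorithm searching over binary intervention assignments, trading each patient's predicted risk reduction against intervention cost. The risks, costs, and effect size are invented placeholders, not the dissertation's parameters.

```python
# Genetic algorithm over binary intervention assignments (toy parameters).
import random

random.seed(0)
risk = [random.random() for _ in range(30)]     # model-predicted risks
COST, PENALTY, EFFECT = 50.0, 300.0, 0.4        # per intervention / readmission

def fitness(assign):
    exp_cost = 0.0
    for r, a in zip(risk, assign):
        r_eff = r * (1 - EFFECT) if a else r    # intervention lowers risk
        exp_cost += a * COST + r_eff * PENALTY
    return -exp_cost                            # lower expected cost is fitter

pop = [[random.randint(0, 1) for _ in risk] for _ in range(40)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                          # elitist selection
    children = []
    for _ in range(30):
        a, b = random.sample(parents, 2)
        cut = random.randrange(len(risk))
        child = a[:cut] + b[cut:]               # one-point crossover
        if random.random() < 0.2:               # mutation
            i = random.randrange(len(risk))
            child[i] ^= 1
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print("treat", sum(best), "of", len(risk), "patients; cost", -fitness(best))
```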

Committee:

Shengyong Wang, Dr. (Advisor); Chien-Chung Chan, Dr. (Committee Member); Richard Einsporn, Dr. (Committee Member); Ping Yi, Dr. (Committee Member); Jiang Zhe, Dr. (Committee Member)

Subjects:

Artificial Intelligence; Industrial Engineering; Statistics

Keywords:

Systems Engineering, Machine Learning, Optimization
