Search Results (1–25 of 88)

Myneni, Greeshma. A System for Managing Experiments in Data Mining
Master of Science, University of Akron, 2010, Computer Science
Data Mining is the process of extracting patterns from data. There are many methods in Data Mining, but our research focuses mainly on classification methods. We survey the existing data mining systems and the features missing from them. An experiment in our research refers to a data mining task. We present a system that manages data mining tasks and discuss the various advantages of managing them. The system dealt with in our research is the “Rule-based Data Mining System”. We present all the existing features of the Rule-based Data Mining System and show how they were redesigned to manage data mining tasks. Some of the new features are managing the datasets according to the data mining task, recording the details of every experiment, giving a consolidated view of the experiments performed, and providing a way to retrieve any experiment associated with a data mining task. We then discuss the design and implementation of the system in detail, present the results obtained by using the system and the advantages of the new features, and finally demonstrate all the features with a suitable example. The main contribution of this thesis is to provide a management feature for a data mining system.

Committee:

Chien-Chung Chan, Dr. (Advisor)

Subjects:

Computer Science; Mining

Keywords:

datasets; Test File; Data Mining; File; Learn and Test; data mining task

Mo, Dengyao. Robust and Efficient Feature Selection for High-Dimensional Datasets
PhD, University of Cincinnati, 2011, Engineering and Applied Science: Mechanical Engineering
Feature selection is an active research topic in the machine learning and knowledge discovery in databases (KDD) communities. It contributes to making the data mining model more comprehensible to domain experts, improving the prediction performance and robustness of the model, and reducing model training time. This dissertation aims to provide solutions to three issues that are overlooked by many current feature selection researchers: feature interaction, data imbalance, and multiple subsets of features. Most extant filter feature selection methods are pair-wise comparison methods: they test each predictor variable against the response variable and provide a correlation measure for each feature. Such methods cannot take feature interactions into account. Data imbalance is another issue in feature selection; without considering it, the features selected will be biased towards the majority class. In high-dimensional datasets with sparse data samples, there will be many different feature sets that are highly correlated with the output. Domain experts usually expect us to identify multiple feature sets for them so that they can evaluate the sets based on their domain knowledge. This dissertation addresses these three issues with a criterion called minimum expected cost of misclassification (MECM). MECM is a model-independent evaluation measure that evaluates the classification power of the tested feature subset as a whole and has adjustable weights to deal with imbalanced datasets. A number of case studies showed that MECM has favorable properties for searching a compact subset of interacting features. In addition, an algorithm and corresponding data structure were developed to produce multiple feature subsets. The success of this research will have broad applications ranging from engineering and business to bioinformatics, such as credit card fraud detection, email filtering for spam classification, and gene selection for disease diagnosis.
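
A minimal sketch of the idea behind a cost-of-misclassification subset score, assuming a simple k-nearest-neighbor reference classifier and hypothetical cost weights `c_fn`/`c_fp`; the dissertation's actual MECM criterion is more elaborate:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

def expected_misclassification_cost(X, y, subset, c_fn=5.0, c_fp=1.0):
    """Score a candidate feature subset as a whole: the average cost of
    the errors a simple reference classifier makes using only those
    features (y assumed binary 0/1). Setting c_fn > c_fp up-weights
    errors on the minority class, which is how an adjustable-weight
    criterion can handle imbalanced data."""
    preds = cross_val_predict(KNeighborsClassifier(n_neighbors=3),
                              X[:, list(subset)], y, cv=5)
    cost = np.where((y == 1) & (preds == 0), c_fn,
                    np.where((y == 0) & (preds == 1), c_fp, 0.0))
    return cost.mean()  # lower is better; search for a compact minimizer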

Committee:

Hongdao Huang, PhD (Committee Chair); Sundararaman Anand, PhD (Committee Member); Jaroslaw Meller, PhD (Committee Member); David Thompson, PhD (Committee Member); Michael Wagner, PhD (Committee Member)

Subjects:

Information Systems

Keywords:

Feature Selection; Data Mining; Machine Learning; Statistical Modeling; Knowledge Discovery in Database

Agarwal, Khushbu. A partition based approach to approximate tree mining: a memory hierarchy perspective
Master of Science, The Ohio State University, 2007, Computer and Information Science
In the last decade or so, the database community has entered a new arena with the generation of petabytes of data. In addition to the storage challenges associated with this data, data mining has become an imperative discipline for extracting the relevant information from it. Tree mining has also received attention in the last couple of years, with applicability to web mining, bio-informatics, etc. Up to now, a majority of tree mining algorithms have focused primarily on developing efficient algorithms through compact tree representations. However, these approaches may not be beneficial for datasets that contain large trees. In this thesis, we focus on an approximation approach based on partitioning individual trees. We design root-based, edge-based, size-based and node-based partitioning schemes. Our evaluations show that balanced partitioning approaches yield significant improvements over existing algorithms, and we observe an interesting trade-off between the degree of approximation and the execution time.

Committee:

Srinivasan Parthasarathy (Advisor)

Subjects:

Computer Science

Keywords:

Data Mining; Tree Partitioning; Memory Based Approximation; Data Preprocessing

Shiao, Grace. Design and Implementation of Data Analysis Components
Master of Science, University of Akron, 2006, Computer Science
This thesis describes the design and implementation of data analysis components. Many features of modern database systems facilitate the decision-making process. Recently, Online Analytical Processing (OLAP) and data mining have been used in an increasingly wide range of applications. OLAP allows users to analyze data from a wide variety of viewpoints. Data mining is the process of selecting, exploring, and modeling large amounts of data to discover previously unknown patterns for business advantage. Microsoft® SQL Server™ 2000 Analysis Services provides a rich set of tools to create and maintain OLAP and data mining objects. To use these tools, users need to fully understand the underlying architectures and specialized technological terms, which are not related to the data analysis itself. These development complexities prevent data analysts from using the tools effectively. In this work, we developed several components that can serve as the foundation of analytical applications. Using these components in software applications can hide the technical complexities and provide tools to build OLAP and mining models and to access information from them. Developers can also reuse these components without coding from scratch. This reusability enhances application reliability and reduces development costs and time.

Committee:

Chien-Chung Chan (Advisor)

Subjects:

Computer Science

Keywords:

OLAP; Data Mining; Analysis Services

Cederquist, Aaron. Frequent Pattern Mining among Weighted and Directed Graphs
Master of Sciences, Case Western Reserve University, 2009, EECS - Computer and Information Sciences
Mining frequent graph patterns has great practical implications, since data in numerous application domains, such as biology, sociology, and finance, can be represented as graphs. While a large amount of research has been performed in this area, most of it assumes that the input data is accurate. In many applications, however, the data is noisy, and an edge can only be assumed to exist with a certain probability. This study presents a match model which measures the “expected” occurrences of a pattern in a graph. In addition, many relationships are not symmetric, and directed graphs are necessary to model more complex networks. While weight and directionality are separate properties, they often occur together. Due to the sophisticated nature of the match model, it is necessary to develop a new algorithm (WHIM) to mine patterns in weighted directed graphs. Additionally, a novel canonical form is devised for faster mining of directed graphs.
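
A toy illustration of an "expected occurrences" count, assuming independent edge-existence probabilities and pre-computed embeddings; the thesis's match model and the WHIM algorithm are not reproduced here:

```python
from math import prod

def expected_occurrences(embeddings, edge_prob):
    # By linearity of expectation, the expected number of matches is the
    # sum, over embeddings of the pattern, of the probability that all
    # edges used by that embedding exist (independence assumed).
    return sum(prod(edge_prob[e] for e in emb) for emb in embeddings)

# two embeddings of a 2-edge directed pattern in an uncertain graph
edge_prob = {("a", "b"): 0.9, ("b", "c"): 0.5,
             ("a", "d"): 0.8, ("d", "c"): 0.25}
embeddings = [[("a", "b"), ("b", "c")], [("a", "d"), ("d", "c")]]
print(expected_occurrences(embeddings, edge_prob))  # 0.45 + 0.20 = 0.65
```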

Committee:

Jiong Yang (Committee Chair); Mehmet Koyuturk (Committee Member); Michael Rabinovich (Committee Member)

Subjects:

Computer Science

Keywords:

data mining; graph; canonical form; frequent pattern mining

Liu, Peng. Adaptive Mixture Estimation and Subsampling PCA
Doctor of Philosophy, Case Western Reserve University, 2009, Sciences
Data mining is important in scientific research, knowledge discovery and decision making. A typical challenge in data mining is that a data set may be too large to be loaded all together, at one time, into computer memory for analyses. Even if it can be loaded all at once for an analysis, too many nuisance features may mask important information in the data. In this dissertation, two new methodologies for analyzing large data are studied. The first methodology is concerned with adaptive estimation of mixture parameters in heterogeneous populations of large-n data. Our adaptive estimation procedures, the partial EM (PEM) and its Bayesian variants (BMAP and BPEM) work well for large or streaming data. They can also handle the situation in which later stage data may contain extra components (a.k.a. "contaminations" or "intrusions") and hence have applications in network traffic analysis and intrusion detection. Furthermore, the partial EM estimate is consistent and efficient. It compares well with a full EM estimate when a full EM procedure is feasible. The second methodology is about subsampling large-p data for selecting important features under the principal component analysis (PCA) framework. Our new method is called subsampling PCA (SPCA). Diagnostic tools for choosing parameter values, such as subsample size and iteration number, in our SPCA procedure are developed. It is shown through analysis and simulation that the SPCA can overcome the masking effect of nuisance features and pick up the important variables and major components. Its application to gene expression data analysis is also demonstrated.
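
A rough sketch of the chunk-wise flavor of such estimation: one EM iteration for a one-dimensional Gaussian mixture applied to a data chunk, which a partial-EM style procedure could apply chunk by chunk on streaming data (the dissertation's actual PEM/BMAP/BPEM updates are not reproduced here):

```python
import numpy as np
from scipy.stats import norm

def em_step_on_chunk(x, w, mu, sigma):
    """One EM iteration for a 1-D Gaussian mixture on a data chunk x,
    given current weights w, means mu, and std devs sigma (each length K)."""
    # E-step: posterior responsibility of each component for each point
    dens = np.stack([wk * norm.pdf(x, mk, sk)
                     for wk, mk, sk in zip(w, mu, sigma)])
    resp = dens / dens.sum(axis=0, keepdims=True)
    # M-step: re-estimate weights, means, std devs from responsibilities
    nk = resp.sum(axis=1)
    w = nk / nk.sum()
    mu = (resp @ x) / nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return w, mu, sigma
```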

Committee:

Jiayang Sun, PhD (Advisor); Joe Sedransk, PhD (Committee Member); Guoqiang Zhang, PhD (Committee Member); Mark Schluchter, PhD (Committee Member); Patricia Williamson, PhD (Committee Member)

Subjects:

Statistics

Keywords:

large data; data mining; mixture models; Gaussian mixtures; parameter estimation; adaptive procedure; partial EM; high-dimensional data; large p small n; dimension reduction; feature selection; subsampling

Weng, Zhouyang. Application of Data Mining Techniques in Human Population Genetic Structure Analysis
MS, University of Cincinnati, 2017, Medicine: Biostatistics (Environmental Health)
The success of genome-wide association studies (GWAS) depends on genotyping a large number of SNPs and determining which of these SNPs are significantly associated with disease outcomes. In testing these associations, it is important to take into account the effects caused by differences among ethnicities and population groups. The study of human population genetic structure focuses on analyzing human genetic variation between different populations and on assigning individuals to subpopulations based on the degree of that variation. Currently, the leading statistical method for uncovering population structure in GWAS is Principal Component Analysis (PCA). However, one major problem of using PCA on SNP data is that the principal components do not correspond to actual SNP variables; we need ways to map the principal components back onto actual SNP variables to measure their importance in terms of ancestry information. To overcome these limitations, Sparse Principal Component Analysis (SPCA) has been proposed to identify a small set of structure-informative markers more efficiently by modifying the alternating regression formulation of PCA to include a penalty term during optimization that encourages SNPs with negligible loadings to vanish. Yet the computational cost of selecting a small subset of ancestry-informative SNP variables via SPCA can still be high, especially where a large number of non-zero loadings across multiple principal components is required for structure analysis. Given these limitations, it is desirable to find methods which not only achieve population classification but also reduce the number of explicitly used variables and can select actual SNP variables that are ancestry-informative markers in a cost-effective manner. The goals of this study are not only to make inferences about the application of major data mining methods in human population genetic structure analysis but also to introduce a two-stage approach which combines two popular methods to improve efficiency and accuracy in population classification and variable selection. Specifically, the first step of the proposed two-stage method is to identify a subset of SNP markers that capture major genetic variation between the population groups using SPCA; the second step is to estimate population structure based on the selected SNP markers and conduct variable selection of ancestry-informative markers using Random Forest (RF). Our two-step SPCA-RF approach was tested using empirical and simulated datasets. The empirical dataset came from the simulated next-generation sequence data provided for the Genetic Analysis Workshop (GAW) 17, based on real exome sequence data from the 1000 Genomes Project. Results from the two-step SPCA-RF algorithm suggested that higher population prediction accuracy with relatively fewer markers is possible. In comparison with existing methods, the proposed SPCA-RF approach consistently gave similar or lower error rates and retained all important variables that are ancestry informative. Moreover, all methods were implemented in the open source R software, which provides future researchers with source code to replicate the research for further investigation.
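
A compact sketch of a two-stage pipeline in this spirit, using scikit-learn's `SparsePCA` and `RandomForestClassifier` as stand-ins for the study's R implementation; parameter values are illustrative, not the study's:

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.ensemble import RandomForestClassifier

def spca_rf(X_snps, pop_labels, n_components=4, n_trees=500):
    """Stage 1: sparse PCA keeps only SNPs with a non-zero loading on
    some component. Stage 2: a random forest on those SNPs classifies
    population and ranks ancestry-informative markers by importance."""
    spca = SparsePCA(n_components=n_components, alpha=1.0).fit(X_snps)
    keep = np.flatnonzero(np.abs(spca.components_).sum(axis=0) > 0)
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True)
    rf.fit(X_snps[:, keep], pop_labels)
    ranking = keep[np.argsort(rf.feature_importances_)[::-1]]
    return rf, ranking  # out-of-bag accuracy via rf.oob_score_
```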

Committee:

Marepalli Rao, Ph.D. (Committee Chair); Tesfaye Mersha, Ph.D. (Committee Member); Changchun Xie, Ph.D. (Committee Member)

Subjects:

Biostatistics

Keywords:

Data Mining; Population Genetic Structure Analysis; Variable Selection; Population Classification

Satish, Sneha. A Mechanism Design Approach for Mining 3-clusters across Datasets from Multiple Domains
MS, University of Cincinnati, 2016, Engineering and Applied Science: Computer Science
Cross-domain data analysis has been an ongoing field of research in Data Mining in the recent past, aimed at studying relationships between distinct domains and uncovering interesting patterns across data. The notion of 3-clusters has emerged as a novel approach in situations where distinct datasets describing the same set of objects need to be mined concurrently. A 3-cluster is, in essence, a set of clusters exhibiting common similarity patterns across two or more datasets. Most 3-cluster mining algorithms in the literature are search based and hence computationally expensive. We introduce a mechanism design perspective on the problem of mining 3-clusters from a pair of related datasets. Our algorithm designs a set of strategies and incentives with the goal of maximizing an objective measure to produce quality 3-clusters. The objective measure we choose takes into account two kinds of similarity patterns of data objects: one within the local cluster of one dataset, and the other shared across clusters of the other dataset. Unlike centralized approaches to clustering, our scheme uses the concept of agents to preserve the privacy of interrelated datasets by preventing unnecessary transfer of private information. Our results show that our negotiation-based algorithm achieves convergence at a much faster rate than search-based algorithms, is robust to variation in parameters, and produces good quality clusters.

Committee:

Raj Bhatnagar, Ph.D. (Committee Chair); Nan Niu, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Cross-domain data analysis; Data Mining; 3-clusters; similarity; mechanism design; incentives

Faisal, S M. Towards Energy Efficient Data Mining & Graph Processing
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Ever increasing energy cost is one of the most critical concerns for large scale deployments of data centers. As the demand for large scale data processing increases, it is paramount that energy efficiency be taken into account in designing both architectures and algorithms for large scale data processing. While cost is a critical issue, it is not the only point of interest; increased energy consumption also has a severe impact on the environment. Hence, it is important to pay close attention to energy efficient data mining and graph processing algorithms that leverage architectural as well as algorithmic features to reduce energy consumption while serving their respective purposes with a reduced carbon footprint. In this work, we take a close look at energy efficiency in the broad area of data mining and graph processing and approach the problem from multiple fronts. First, we take a pure software centric approach, focusing on frameworks that provide faster solutions to otherwise expensive problems and thereby save energy, following the race-to-halt principle. Our proposed framework allows space efficient representation, scalable distributed processing and ease of programming for large, power law graphs. We also develop parallel, distributed implementations of a popular graph clustering algorithm, Regularized Markov Clustering (RMCL), on various distributed memory programming frameworks. Next we analyze commonly used data mining, multimedia and graph clustering algorithms to explore their energy profile and tolerance to random bit errors induced by low voltage computation. At the core of any research on energy efficient, low voltage computing are reliable error models for functional units at low voltage. We find that existing models lack sufficient detail and fail to capture the behavior in a realistic manner. Driven by this necessity, we propose a set of accurate, robust and realistic models for functional units' behavior at low voltage. Finally, we take a hardware-software co-design approach where a combination of energy efficient hardware and energy conscious software cooperate to execute jobs efficiently with respect to energy consumption. We propose a novel framework for energy efficient graph processing that identifies important edges in a graph and applies energy efficient computing across the edges that are not important. In this dissertation we propose various solutions for improving the energy efficiency of large scale data mining and graph processing applications. Our parallel framework provides scalability and efficiency while processing large graphs. Our error models provide estimates within 1-3% of comprehensive analog simulations and 5-17x higher accuracy compared to existing error models. Our energy efficient graph processing framework allows processing of large, modern graphs while saving 3-30% in power consumption. All of our proposed techniques provide high quality output for various data mining and graph processing algorithms while saving a significant amount of energy.

Committee:

Srinivasan Parthasarathy (Advisor); P. Sadayappan (Committee Member); Radu Teodorescu (Committee Member)

Subjects:

Computer Science

Keywords:

Energy Efficiency; Data Mining; Graph Processing; Energy Efficient Computing

Joshi, Vineet. Unsupervised Anomaly Detection in Numerical Datasets
PhD, University of Cincinnati, 2015, Engineering and Applied Science: Computer Science and Engineering
Anomaly detection is an important problem in data mining with diverse applications such as financial fraud detection, medical diagnosis and computer systems intrusion detection. Anomalies are data points that are substantially different from the rest of the population. They generally represent valuable information about the system, and the analyst is interested in detecting them accurately and efficiently and then taking appropriate action in response. There are scenarios where tremendous impact can be made by detecting anomalies in a timely and accurate manner; for example, early detection of spurious credit card transactions can prevent financial damage to the card holder as well as the issuing bank. Similarly, abnormal readings from a sensor monitoring an industrial plant can help detect system faults and avert damage. All these applications have led to an interest in efficient methods for detecting anomalies, and anomaly detection continues to be an active research area within data mining. In this dissertation we investigate various aspects of the anomaly detection problem. To determine anomalies in a dataset, a concrete definition of anomalous behavior is required. There is no single universally applicable definition, because each definition captures a perspective on anomalous behavior that may not apply across diverse datasets. In this work we investigate a new definition of anomalous behavior, compare it with an existing definition of outlier-ness, and demonstrate its effectiveness. We further present a refinement of this metric of outlier-ness: we discovered that the metric initially proposed can be altered to yield a new metric that accentuates the difference between the outlier-ness scores of strong outliers and those of non-anomalous data points. We compare this updated metric with the metric we first presented, and also with an established metric of outlier-ness. As the number of attributes increases, the distances between the nearest and the farthest data points tend to converge, resulting in distance concentration. Thus the anomalies reported by most definitions of anomalous behavior tend to lose meaning with increasing numbers of attributes. It has been suggested that in such datasets the anomalies are located in smaller subspaces of attributes; hence, anomalies should be sought in subspaces of the attributes instead of the complete attribute space. However, the number of possible subspaces for a given set of attributes is combinatorial and grows very rapidly with the number of attributes, which makes an exhaustive search through all possible subspaces infeasible. In this dissertation, after presenting a novel definition of anomalous behavior, we present an efficient method of exploring the possible subspaces arising from the attributes of a dataset. The subspaces of attributes in any dataset can be arranged in a lattice, and the anomalous behavior of data points as we traverse this lattice conveys meaningful information about the structure of the data. In the fourth problem that we address, we present a method that investigates the different subspaces in the lattice in which the same data point displays anomalous behavior, and that also computes the contiguous regions of the subspace lattice where that point remains anomalous.
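
For intuition, a brute-force baseline that scores points in every axis-parallel subspace up to a small dimension using the k-th nearest-neighbor distance, one common outlier-ness metric (not the dissertation's); the combinatorial blow-up at larger dimensions is exactly what the dissertation's lattice traversal avoids:

```python
import numpy as np
from itertools import combinations
from sklearn.neighbors import NearestNeighbors

def subspace_outlier_scores(X, max_dim=2, k=5):
    """Score every point in every axis-parallel subspace of up to
    max_dim attributes by its distance to its k-th nearest neighbor.
    Exhaustive enumeration is feasible only for small max_dim."""
    scores = {}
    for d in range(1, max_dim + 1):
        for subspace in combinations(range(X.shape[1]), d):
            nn = NearestNeighbors(n_neighbors=k + 1).fit(X[:, subspace])
            dist, _ = nn.kneighbors(X[:, subspace])
            scores[subspace] = dist[:, -1]  # k-th NN distance per point
    return scores
```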

Committee:

Raj Bhatnagar, Ph.D. (Committee Chair); Prabir Bhattacharya, Ph.D. (Committee Member); Karen Davis, Ph.D. (Committee Member); Anil Jegga, D.V.M. M.Res. (Committee Member); Mario Medvedovic, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Data Mining; Anomaly Detection; Outlier Detection; Subspaces

Quan, Aaron. Batch Sequencing Methods for Computer Experiments
Doctor of Philosophy, The Ohio State University, 2014, Statistics
In early research, computer experiments were assumed to be performed on a single powerful, but expensive, computer. With the availability of cheaper, powerful computers, it is now more common to run simulations simultaneously on several machines, and we consider the possibility of using multiple computers. In such cases, computer experiments need to be run in batches, one run per computer, with the batch size determined by the number of computers. Various methods to perform these simulations in batches have been proposed. Here, we investigate batch methods that combine expected improvement and space-filling criteria to construct efficient experimental designs for sequentially adaptive computer experiment problems. These methods select the site with the highest expected improvement, with the rest of the batch selected through various methods that balance high expected improvement against space-filling. The methods are tested on various problem types, and the results are compared and contrasted with each other, as well as with one-at-a-time methods and other batch methods in the literature, to determine their effectiveness for sampling in batches. We conclude that some of the proposed batch methods have advantages over existing methods.
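
The expected improvement criterion such batch methods build on has a standard closed form under a Gaussian predictive distribution; a short sketch for minimization, with `mu` and `sigma` assumed to come from, e.g., a kriging surrogate:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_min):
    """Standard EI for minimization: EI(x) = (f_min - mu) * Phi(z)
    + sigma * phi(z), with z = (f_min - mu) / sigma. A batch method
    would take the EI maximizer first, then fill the remaining runs
    with sites that trade EI against space-filling."""
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive variance
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```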

Committee:

William Notz (Advisor); Christopher Hans (Committee Member); Matthew Pratola (Committee Member)

Subjects:

Statistics

Keywords:

statistics; computer experiments; batches; batch sequencing; sampling; data mining; batch sampling; design of experiments; experimental design;

Goyder, Matthew. Knowledge Accelerated Algorithms and the Knowledge Cache
Master of Science, The Ohio State University, 2012, Computer Science and Engineering

Knowledge discovery through data mining is the process of automatically extracting actionable information from data; that is, the information or knowledge found within data that provides insight beyond what may be found by observing the cardinal state of the data itself. This process is human driven; there is always a human at the core.

Knowledge discovery is inherently iterative: a human discovers information by posing questions to a data mining system, which in turn provides answers. New questions are developed upon receipt of these answers, and the cycle repeats. Clearly these answers need to be provided in as timely a fashion as possible in order for the human at the core to form ideas and solidify hypotheses. Unfortunately, many questions take too long to answer to be useful. Is there anything we can do to speed up the response to a question whose answer is based in part upon answers previously provided?

When a query (question) is submitted (asked) to a data mining system, we can store the result (answer), as well as information about the result, in a cache and then re-use this information to help respond to the next query in a more timely fashion. If a query partially contains a result that was found in the past, we can combine this information with new information to provide the result much faster than if we were to re-run the query incorporating no prior information.

This thesis explores this idea by introducing a high performance information cache called a Knowledge Cache with remote access capabilities, as well as a programming model and API for clients to store, query, share and retrieve knowledge objects from within it. These knowledge objects can then be used in conjunction with a modified data mining algorithm to reduce query processing time for new queries where prior information is useful. We explain the usage model of the Knowledge Cache and API, and demonstrate performance gains from using the Knowledge Cache in the context of two classic data mining algorithms: k-means clustering and frequent itemset mining.
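
A toy, single-process illustration of the caching idea for k-means, assuming scikit-learn and an in-memory dict standing in for the thesis's remote Knowledge Cache and API:

```python
from sklearn.cluster import KMeans

class ToyKnowledgeCache:
    """Store the centroids a past k-means query produced and reuse them
    to warm-start a related follow-up query (a sketch, not the thesis's
    Knowledge Cache design)."""
    def __init__(self):
        self._store = {}

    def kmeans(self, key, X, k):
        prior = self._store.get(key)
        if prior is not None and prior.shape == (k, X.shape[1]):
            # warm start from cached knowledge: skip re-seeding entirely
            km = KMeans(n_clusters=k, init=prior, n_init=1).fit(X)
        else:
            km = KMeans(n_clusters=k, n_init=10).fit(X)
        self._store[key] = km.cluster_centers_  # cache for the next query
        return km
```

Warm-starting from cached centroids typically cuts the number of Lloyd iterations on a follow-up query whose underlying data has only partially changed, which is the kind of gain the thesis demonstrates.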

Committee:

Srinivasan Parthasarathy, PhD (Advisor); Gagan Agrawal, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

data mining; knowledge caching; clustering; frequent itemset mining

SHENCOTTAH K.N., KALYANKUMAR. FINDING CLUSTERS IN SPATIAL DATA
MS, University of Cincinnati, 2007, Engineering : Computer Science
Spatial data mining is the discovery of patterns in spatial databases. The driving factor for research in spatial data mining is the increase in the collection of spatial data through business and geographical database systems. Spatial data collected include remotely sensed images, geographical information with spatial attributes such as location, digital sky survey data, mobile phone usage data, and medical data. Spatial data mining takes into account both spatial and non-spatial attributes of these data. The goal of our research is to discover and merge cluster regions which contain spatial data values in a given standard deviation range. Our approach involves visualizing the data in a 2-D grid and mining it using a quad-tree, a spatial data structure. The quad-tree stores the entire 2-D grid in its leaf nodes at level k. Cluster information such as standard deviation, mean, x-coordinate, y-coordinate, and number of nodes is calculated at level k-1 and synthesized up the quad-tree. Based on the input standard deviation range, we discover clusters, determine their adjacency, and merge interesting clusters. By increasing or decreasing the input standard deviation range, we observe that the cluster boundary changes, and we discover clusters of different shapes such as rectangles, triangles, rhombuses and concave regions. Our algorithm can be useful in identifying patterns such as an increase or decrease in crime rate or the spread of disease in a given region. The 2-D grid refers to the physical space, and the values in the cells refer to non-spatial attribute values such as crime rate, temperature, etc. Spatial abstractions such as region and location are a result of clustering non-spatial attribute values.
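
The upward synthesis of cluster statistics can be done without revisiting raw grid cells by combining per-quadrant summaries; a sketch using (count, mean, M2) triples, where M2 is the sum of squared deviations (the exact statistics synthesized in the thesis may differ):

```python
def combine_stats(children):
    """Merge (count, mean, M2) summaries of child quadrants into the
    parent's summary using the standard pairwise-combination identity;
    the region's std dev is then sqrt(M2 / n)."""
    n, mean, m2 = 0, 0.0, 0.0
    for cn, cmean, cm2 in children:
        delta = cmean - mean
        total = n + cn
        m2 += cm2 + delta * delta * n * cn / total
        mean += delta * cn / total
        n = total
    return n, mean, m2

# four child quadrants, each summarized as (count, mean, M2)
quadrants = [(4, 2.0, 1.0), (4, 3.0, 2.0), (4, 2.5, 0.5), (4, 4.0, 1.0)]
print(combine_stats(quadrants))
```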

Committee:

Dr. Raj Bhatnagar (Advisor)

Subjects:

Computer Science

Keywords:

clusters; Spatial data mining; Quad-Tree; Spatial clustering

Fang, Chunsheng. Novel Frameworks for Mining Heterogeneous and Dynamic Networks
PhD, University of Cincinnati, 2011, Engineering and Applied Science: Computer Science and Engineering
Graphs serve as an important tool for discrete data representation. Recently, graph representations have made possible very powerful machine learning algorithms, such as manifold learning, kernel methods, and semi-supervised learning. With the advent of large-scale real world networks, such as biological networks (disease networks, drug target networks, etc.) and social networks (the DBLP co-authorship network, Facebook friendship, etc.), machine learning and data mining algorithms have found new application areas and have contributed to advancing our understanding of the properties and phenomena governing real world networks. When dealing with real world data represented as networks, two problems arise quite naturally: I) How to integrate and align the knowledge encoded in multiple heterogeneous networks? For instance, how to find similar genes in co-disease and protein-protein interaction networks? II) How to model and predict the evolution of a dynamic network? A real world example: given N yearly snapshots of an evolving social network, how to build a model that can capture the temporal evolution and make reliable predictions? In this dissertation, we present an innovative graph embedding framework which identifies the key components of modeling the evolution of a dynamic graph in time. Unlike many state-of-the-art graph link prediction and modeling algorithms, it formulates the link prediction problem from a geometric perspective that can capture the dynamics of the intrinsic continuous graph manifold's evolution. It is attractive due to its simplicity and its potential to relax the mining problem into a feasible domain in which standard machine learning and regression models can utilize historical graph time series data. To address the first problem, we first propose a novel probability-based similarity measure which has led to promising applications in content based image retrieval and image annotation, followed by a manifold alignment framework to align multiple heterogeneous networks, which demonstrates its power in mining biological networks. Finally, the dynamic graph mining framework generalizes most current graph embedding dynamic link prediction algorithms. Comprehensive experimental results on both synthesized and real-world datasets demonstrate that our proposed algorithmic frameworks for multiple heterogeneous networks and dynamic networks can lead to a better and more insightful understanding of real world networks. Scalability of our algorithms is also addressed by employing the MapReduce cloud computing architecture.

Committee:

Anca Ralescu, PhD (Committee Chair); Anil Jegga, DVMMRes (Committee Member); Fred Annexstein, PhD (Committee Member); Kenneth Berman, PhD (Committee Member); Yizong Cheng, PhD (Committee Member); Dan Ralescu, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

machine learning; social network; data mining; manifold learning; graph embedding; dynamic graph

Burji, Supreeth Jagadish. Reverse Engineering of a Malware: Eyeing the Future of Computer Security
Master of Science, University of Akron, 2009, Computer Science

Reverse engineering malware has been an integral part of the world of security. Until now, it has at best been employed for logging malware signatures. Since the evolution of new age technologies, it is now being researched as a robust methodology that can lead to more reactive and proactive solutions to modern security threats, which are growing stronger and more sophisticated. This research in its entirety has been an attempt to understand the ins and outs of reverse engineering as it pertains to malware analysis, with an eye to future trends in security.

Reverse engineering of malware was performed with the Nugache P2P malware as the target, showing that signature-based malware identification is ineffective. Developing a proactive approach to quickly identifying malware was the objective that guided this research. Innovative malware analysis techniques combining data mining and rough sets methodologies have been employed in the quest for a proactive and feasible security solution.

Committee:

Kathy J. Liszka, PhD (Advisor)

Subjects:

Computer Science; Engineering; Experiments; Systems Design

Keywords:

malware; reverse engineering; data mining; rough sets; rogue malwares; lifecycle of a malware; P2P malware; computer security

Ghareeb, Ahmed. Data mining for University of Dayton campus buildings to predict future demand
Master of Science (M.S.), University of Dayton, 2017, Mechanical Engineering
The ability to forecast demand for large facilities will be increasingly important as real-time power pricing scenarios become more prevalent. Accurate prediction will inform data-driven power shedding to reduce energy costs most effectively with minimal sacrifice of comfort. A number of previous studies have addressed this topic with varying degrees of success. This study looks to forecast demand for a university complex of buildings, subject to the unique occupancy variation of such institutions. Specifically addressed is the use of academic institutional data associated with temporal enrollment and the academic calendar. It also uses demand data from all buildings in an effort to more accurately predict the aggregate demand of the university. A data mining approach based upon a Random Forest regression tree algorithm is used to develop the forecast model. The mean absolute percentage error (MAPE) of the model applied to a validation data set is on the order of 2.21% based upon actual weather data. Using forecasted weather data, the MAPE increases to approximately 6.65% in predicted day-ahead demand.
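
A minimal sketch of the modeling step, assuming scikit-learn's `RandomForestRegressor` and placeholder feature matrices; the study's actual predictors include weather, enrollment, and academic-calendar variables:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_demand_model(X_train, y_train, X_val, y_val):
    """Fit a random-forest regression for day-ahead campus demand and
    report MAPE on held-out data (the study reports ~2.21% with actual
    weather inputs and ~6.65% with forecasted weather)."""
    rf = RandomForestRegressor(n_estimators=500).fit(X_train, y_train)
    pred = rf.predict(X_val)
    mape = 100.0 * np.mean(np.abs((y_val - pred) / y_val))
    return rf, mape
```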

Committee:

Kevin Hallinan (Committee Chair); Andrew Chiasson (Committee Member); Zhongmei Yao (Committee Member)

Subjects:

Artificial Intelligence; Climate Change; Energy; Engineering; Environmental Engineering; Mechanical Engineering; Statistics

Keywords:

Data mining; energy prediction; energy demand; energy demand forecasting; energy; prediction; forecasting; modeling;

Yu, Andrew Seohwan. NBA ON-BALL SCREENS: AUTOMATIC IDENTIFICATION AND ANALYSIS OF BASKETBALL PLAYS
Master of Computer and Information Science, Cleveland State University, 2017, Washkewicz College of Engineering
The on-ball screen is a fundamental offensive play in basketball; it is often used to trigger a chain reaction of player and ball movement to obtain an effective shot. All teams in the National Basketball Association (NBA) employ the on-ball screen on offense. On the other hand, a defense can mitigate its effectiveness by anticipating the on-ball screen and its goals. In the past, it was difficult to measure a defender's ability to disrupt the on-ball screen, and it was often described using abstract words like instincts, experience, and communication. In recent years, player motion-tracking data in NBA games has become available through the development of sophisticated data collection tools. This thesis presents methods to construct a framework which can extract, transform, and analyze the motion-tracking data to automatically identify the presence of on-ball screens. The framework also helps NBA players and coaches adjust their game plans regarding the on-ball screen using trends from past games. With the help of support vector machines, the framework identifies on-ball screens with an accuracy of 85%, a considerable improvement over the current published results in the existing literature.
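
A minimal sketch of the classification step, assuming scikit-learn and hypothetical hand-built features derived from the motion-tracking feed; this is not the thesis's exact feature set or tuning:

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one row per candidate moment, e.g. distances between ball-handler,
# screener, and their defenders (illustrative features); y: 1 if that
# moment is an on-ball screen, else 0.
def screen_classifier(X, y):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    return clf.fit(X, y), acc  # the thesis reports ~85% accuracy
```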

Committee:

Sunnie Chung, Ph.D. (Committee Chair); Yongjian Fu, Ph.D. (Committee Member); Nigamanth Sridhar, Ph.D. (Committee Member)

Subjects:

Artificial Intelligence; Computer Science

Keywords:

NBA; Basketball; Basketball Analytics; NBA Analytics; Data Mining; Web Scraping; Machine Learning; Support Vector Machine; Classification;

Atahary, Tanvir. Acceleration of Cognitive Domain Ontologies
Doctor of Philosophy (Ph.D.), University of Dayton, 2016, Electrical Engineering
This thesis examined several efforts to accelerate knowledge mining from Cognitive Domain Ontologies (CDOs), the knowledge repository in the Cognitively Enhanced Complex Event Processing (CECEP) architecture. The CECEP architecture, developed at the US Air Force Research Laboratory, is an autonomous decision support tool that reasons and learns like a human and enables enhanced agent-based decision-making; it has applications in both military and civilian domains. Real time agents require massively linked knowledge databases to be searched under a large set of constraints to generate intelligent decisions at run time. One of the most computationally challenging aspects of CECEP is mining the domain knowledge captured in CDOs. The CDO mining process employed in the CECEP architecture is cast as a constraint-satisfaction problem (CSP). It falls into the category of NP-complete problems, which are very likely to require massive computing to solve; even a small instance of an NP-complete problem can in some cases take years of computing. Searching is the typical procedure for solving CSPs, but sometimes constraint consistency is enough to find a valid solution without performing a search. This thesis explored several CSP algorithms and deployed two of them on heterogeneous hardware platforms in order to mine CDOs. We initially examined the exhaustive depth-first search (EDFS) algorithm on a cluster of GPGPUs and Intel Xeon Phi co-processors, achieving around a 100 times speedup on a GPGPU compared to a single CPU. Since the search space grows exponentially with the EDFS algorithm, this study then explored an intelligent search algorithm that can prune the search space according to the constraints. We modified the conventional Forward Checking (FC) algorithm, introduced a novel path-based forward checking algorithm to mine CDOs, and compared it with a commonly utilized CSP solver. The conventional single-step forward checking algorithm is highly serial in nature; this thesis developed a three-step forward checking algorithm that enables parallelism. The serial version of the proposed path-based forward checking algorithm provides a million times speedup compared to a conventional CSP solver. Furthermore, we parallelized this novel algorithm in a cluster environment and achieved approximately 200 times speedup over a serial implementation. The path-based forward checking algorithm was deliberately designed to list solutions in a highly compact and efficient manner. This compact representation does not generate actual solutions but represents them as a set; in real-world applications, exact solutions are required to make decisions. Typically CDOs generate all equally weighted solutions, and it is impossible to find the best one from this solution set without imposing the relative importance of certain criteria. This thesis implemented several optimization conditions through objective functions, processed as Multiple Constraint Optimization Problems (MCOPs) using a modified branch-and-bound algorithm to rank solutions. Additionally, we parallelized this optimization procedure on a GPGPU, achieving an 85-100 times speedup. This amount of speedup would allow autonomous agents to make decisions in real time based on massive knowledge bases.
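
For reference, a sketch of conventional single-step forward checking on a binary CSP, the serial baseline; the thesis's path-based, three-step variant restructures this to expose parallelism and is not reproduced here:

```python
def forward_check(assignment, domains, constraints, variables):
    """Backtracking search with forward checking: after each assignment,
    prune inconsistent values from every unassigned variable's domain and
    backtrack as soon as any domain is wiped out. constraints(v1, x1, v2, x2)
    returns True when the pair of assignments is mutually consistent."""
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        # one-step lookahead: filter each unassigned variable's domain
        pruned = {other: [w for w in domains[other]
                          if constraints(var, value, other, w)]
                  for other in variables
                  if other != var and other not in assignment}
        if all(pruned.values()):  # no domain wipe-out
            result = forward_check({**assignment, var: value},
                                   {**domains, var: [value], **pruned},
                                   constraints, variables)
            if result is not None:
                return result
    return None  # triggers backtracking in the caller
```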

Committee:

Tarek Taha, Dr. (Committee Chair); Vijayan Asari, Dr. (Committee Member); Eric Balster, Dr. (Committee Member); Scott Douglass, Dr. (Committee Member)

Subjects:

Cognitive Psychology; Computer Engineering; Computer Science; Electrical Engineering

Keywords:

Acceleration of constraint satisfaction problem; parallel CSP; Parallel forward checking algorithm; parallel DFS; cognitive agent; knowledge mining; data mining

Zhang, Yi. Application of Hyper-geometric Hypothesis-based Quantification and Markov Blanket Feature Selection Methods to Generate Signals for Adverse Drug Reaction Detection
MS, University of Cincinnati, 2012, Engineering and Applied Science: Mechanical Engineering
Pharmacovigilance is the science relating to all concerns about drug safety, especially managing the risk associated with medications. It serves as a complementary approach to clinical trials. Spontaneous Reporting Systems (SRS) have been constructed worldwide to facilitate tracking the risk of post-marketing drugs, and data mining algorithms have been used for years to detect possible adverse effects of drugs by analyzing the large amount of data in SRS. This study consists of two parts. The first proposes a statistically sound bivariate analysis method, with the objective of providing a method with a sound theoretical base and the ability to configure itself for different demanded performances. The bivariate analysis method proposed in the study is termed the Hyper-geometric Hypothesis-based method. This new method is inspired by statistical acceptance sampling techniques used in quality control. It is proposed as an alternative to conventional disproportionality analysis methods such as the reporting odds ratio (ROR) and proportional reporting ratio (PRR). The second part investigates the effectiveness of a feature selection approach to reduce false alarms through the identification of confounding drugs, which are one of the major sources of false signals generated by established methods. The feature selection method is based on the concept of a Markov blanket, which removes features that make no unique contribution to distinguishing the target concept. It is proposed as an alternative to the emerging Bayesian logistic regression method for detecting adverse drug reactions. Experiments were conducted using the Adverse Event Reporting System (AERS) maintained by the US Food and Drug Administration. The results showed that the performance of the Hyper-geometric Hypothesis-based quantification method was comparable to that of ROR and PRR when adopting the threshold P-value = 0.0409, which had been trained on the experiment data. The feature selection approach was able to partially detect confounding drugs but left a number of dangerous drugs unalarmed. In contrast, the Bayesian logistic regression method failed to return any results on which to alarm drugs.
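
The core of a hypergeometric signal test on an SRS 2x2 contingency table can be sketched with SciPy; the 0.0409 threshold is the one the study reports training from its experiment data, and the example counts are illustrative:

```python
from scipy.stats import hypergeom

def adr_signal(n_total, n_drug, n_event, n_both, alpha=0.0409):
    """Hypergeometric test on the SRS 2x2 table: of n_total reports,
    n_drug mention the drug, n_event mention the adverse event, and
    n_both mention both. Flag a signal when observing >= n_both
    co-reports is improbable under independence."""
    p_value = hypergeom.sf(n_both - 1, n_total, n_drug, n_event)
    return p_value, p_value <= alpha

print(adr_signal(100000, 400, 900, 15))  # (p_value, signal?)
```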

Committee:

Hongdao Huang, PhD (Committee Chair); Alex Lin, PhD (Committee Member); David Thompson, PhD (Committee Member)

Subjects:

Mechanical Engineering

Keywords:

Pharmacovigilance; Data Mining; Feature Selection

Asur, Sitaram. A Framework for the Static and Dynamic Analysis of Interaction Graphs
Doctor of Philosophy, The Ohio State University, 2009, Computer Science and Engineering

Data originating from many different real-world domains can be represented meaningfully as interaction networks. Examples abound, ranging from gene expression networks to social networks, and from the World Wide Web to protein-protein interaction networks. The study of these complex networks can result in the discovery of meaningful patterns and can potentially afford insight into the structure, properties and behavior of these networks. Hence, there is a need to design suitable algorithms to extract or infer meaningful information from these networks. However, the challenges involved are daunting.

First, most of these real-world networks have specific topological constraints that make the task of extracting useful patterns using traditional data mining techniques difficult. Additionally, these networks can be noisy (containing unreliable interactions), which makes the process of knowledge discovery difficult. Second, these networks are usually dynamic in nature. Identifying the portions of the network that are changing, characterizing and modeling the evolution, and inferring or predicting future trends are critical challenges that need to be addressed in the context of understanding the evolutionary behavior of such networks.

To address these challenges, we propose a framework of algorithms designed to detect, analyze and reason about the structure, behavior and evolution of real-world interaction networks. The proposed framework can be divided into three components:

1. A static analysis component where we propose efficient, noise-resistant algorithms taking advantage of specific topological features of these networks to extract useful functional modules and motifs from interaction graphs.

2. An event detection component where we propose algorithms to detect and characterize critical events and behavior for evolving interaction graphs.

3. A temporal reasoning component where we propose approaches wherein one can make useful inferences on events, communities, individuals and their interactions over time.

For each component, we propose either new algorithms, or suggest ways to apply existing techniques in a previously-unused manner. Where appropriate, we compare against traditional or accepted standards. We evaluate the proposed framework on real datasets drawn from clinical, biological and social domains.

Committee:

Srinivasan Parthasarathy, PhD (Advisor); Gagan Agrawal, PhD (Committee Member); P Sadayappan, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

data mining; interaction graphs

ZHU, YAOYAO. UNSUPERVISED DATABASE DISCOVERY BASED ON ARTIFICIAL INTELLIGENCE TECHNIQUES
MS, University of Cincinnati, 2002, Engineering : Computer Engineering
Competitive business pressures and a desire to leverage existing information technology investments have led people to explore the benefits of data mining technology. This technology is designed to help businesses discover hidden patterns in their data - patterns that can help them understand the purchasing behavior of their key customers, detect likely credit card or insurance claim fraud, predict probable changes in financial markets, etc. One approach to data mining is to use some form of supervised discovery. However, supervised discovery limits results, as it is necessary to determine in advance what is of interest; this runs counter to the broadest goal of finding unexpected, interesting things. In this thesis, we make an experimental investigation into autonomous, or unsupervised, discovery. It is based on the novel paradigm proposed by Dr. L. J. Mazlack [1996]. The testable approach is that increasing coherence increases conceptual information, and this in turn reveals previously unrecognized, useful, implicit information. This can be done by recursive partitioning. To refine the partitioning, we use artificial intelligence techniques and propose algorithms for clustering and generalizing over both scalar and non-scalar data. The algorithms are tested on several data sets and the results are discussed.

Committee:

Dr. Lawrence J. Mazlack (Advisor)

Subjects:

Computer Science

Keywords:

data mining; artificial intelligence; database discovery; mountain method

Beese, Elizabeth Brott. A vision of the curriculum as student self-creation: A philosophy and a system to manage, record, and guide the process
Master of Arts, The Ohio State University, 2012, EDU Policy and Leadership
This thesis draws upon the interrelated philosophies of constructivism, individualism, self-creation, and narrative identity, to propose a radically liberated and individualized vision of the curriculum. The curriculum is re-framed, here, not as a culturally-prescribed canon of important knowledge and skills, but as a process of aided student self-creation toward their own projected professional and social identities. Finally, a system – with applications of emerging technologies and descriptions of interfaces – is tentatively suggested, towards the aim of recording, managing, and guiding such a profoundly individualized curriculum.

Committee:

Bryan Warnick, PhD (Advisor); Rick Voithofer, PhD (Committee Member)

Subjects:

Education Philosophy

Keywords:

learning pathways; self-directed learning; self-creation; narrative identity; aspirational identities; student-set goals; student agency; individualizing the curriculum; learning management systems; data-mining in education; computerized student guidance

Subburayalu, Sakthi Kumaran. Application of machine learning for soil survey updates: A case study in southeastern Ohio
Doctor of Philosophy, The Ohio State University, 2008, Soil Science
Machine learning techniques were used to build predictive soil-landscape models for two counties (Monroe and Noble) in southeastern Ohio. Twenty-five different environmental correlates, including 10 m resolution raster coverages of terrain and its derivatives, climate, geology, and historic vegetation, were used as predictor variables for soil class. Points sampled randomly in proportion to the area of the different soil classes in the published soil survey of Monroe County (SSURGO) were used to train the soil-landscape model. Since map units can contain more than one component soil series, each sample point within a map unit can belong to any one of them; hence there is ambiguity in labeling the training instances with the appropriate soil series. A kNN-based heuristic approach was used to disambiguate the training set labels. The training sets were further preprocessed to remove outliers and to select fewer attributes. Modeling was performed using two learning algorithms, the J48 classification tree and Random Forest (RF). The map models were then evaluated for prediction quality using two prediction rate measures and two landscape fragmentation statistics. Generally, Random Forest recorded a higher prediction rate and greater contiguity than J48. However, Random Forest over-predicted soils such as the Gilpin, Guernsey, Zanesville and Captina series, which occupy large areas, at the cost of prediction accuracy for soils occurring in smaller proportions. The results showed that the highest prediction rate based on the dominant soil series (> 0.5) and the highest values of the contiguity index (0.83) and aggregation index (84.2) for RF were observed in the model built using the training set preprocessed for disambiguation, suggesting an improvement in the quality of the predicted maps as a result of disambiguation. The model predictions were helpful in locating many individual component series in soil consociations and associations. The maps were useful in identifying areas of uncertainty, such as misplaced polygon boundaries, incorrect labeling and disparities along county edges, which could serve as a guide for further field investigations. The predicted models also provided valuable information for rationalizing the mapping intensity for adjacent SSURGO maps.
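
A rough sketch of a kNN-style disambiguation heuristic in this spirit, assuming each training point comes with the candidate soil series of its map unit; this is an illustration, not the study's exact rule:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def disambiguate(X, candidates, k=7):
    """Relabel each training point with whichever of its map unit's
    candidate soil series is most common among its k nearest neighbors
    in predictor (terrain, climate, geology) space."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    labels = [cands[0] for cands in candidates]  # provisional labels
    for i, cands in enumerate(candidates):
        # vote among neighbors (excluding self) whose label is a candidate
        votes = [labels[j] for j in idx[i, 1:] if labels[j] in cands]
        if votes:
            labels[i] = max(set(votes), key=votes.count)
    return labels
```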

Committee:

Brian Slater (Advisor)

Subjects:

Agriculture, Soil Science

Keywords:

machine learning; data mining; soil survey; SSURGO updates; soil-landscape modeling; predictive soil modeling; Random Forest

Liu, Larry Young. Interplay Between Traumatic Brain Injury and Intimate Partner Violence: A Data-Driven Approach Utilizing Electronic Health Records
Master of Sciences, Case Western Reserve University, 2017, Systems Biology and Bioinformatics
Intimate partner violence (IPV) is a prevalent issue that results in overwhelming physical and mental health consequences. It is also known that the majority of victims suffer blunt force trauma to the head, neck and face. Injuries to the head and neck are among the causes of traumatic brain injury (TBI), which is often linked to neurological conditions and permanent behavioral disorders. In this study, we aim to characterize the key associations between IPV and TBI by mining de-identified electronic health records (EHR) data from the Explorys platform. We formulate a novel, data-driven, three-step analytical method to find key health associations by comparing prevalent health conditions among IPV, TBI, and six control cohorts. Our analysis suggests that health effects attributed to substance- and alcohol-related liver damage are highly significant contributors to the IPV and TBI interplay. Our results would greatly assist in improving existing screening, diagnostic, and treatment procedures for victims of IPV-induced TBI, especially given the increased risk correlated with substance and alcohol abuse.

Committee:

Mehmet Koyuturk, Ph.D (Advisor); Gunnur Karakurt, Ph.D (Committee Member); William Bush, Ph.D (Committee Member)

Subjects:

Bioinformatics; Clinical Psychology; Public Health

Keywords:

TBI; IPV; EHR; electronic health records; data analysis; intimate partner violence; traumatic brain injuries; sexual violence; domestic violence; sexual abuse; domestic abuse; data mining;

Wei, Ran. On Estimation Problems in Network Sampling
Doctor of Philosophy, The Ohio State University, 2016, Statistics
With the popularity of online social networks such as Facebook, Twitter and LinkedIn, the scale of network data has become enormous, and how to take samples that are representative of the full network has become a major research focus. For example, link-tracing sampling methods are effective for obtaining samples from hard-to-reach populations; however, they often result in substantially biased samples. In this dissertation we study different link-tracing sampling methods and network models. We compare these methods with simple random sampling in terms of sampling mechanism and of the bias and variance incurred in estimating parameters and attributes of networks. We explore the root cause of the large biases and variances of simple average estimators, and investigate the interplay among network structure, sampling method and the variable of interest for both simulated data and real-world social network data. We propose new estimation methods to correct bias and improve estimation performance. Judgement Post-Stratification (JPS) is a data analysis method based on ideas similar to those in ranked set sampling. Besides serving a variance reduction role, as in traditional sampling schemes, post-stratification also helps reduce bias in a size-biased sampling scheme by down-weighting units that are more likely to be selected and up-weighting units that are less likely to be sampled. In this dissertation, we discuss applications of JPS to improving estimation performance when using link-tracing sampling methods. We compare the JPS estimator with traditional size-bias compensation methods such as Horvitz-Thompson estimators (HTE) and Volz-Heckathorn estimators (VHE). We use machine learning methods to build and improve ranking functions and compare machine-learning-based JPS with traditional machine learning methods without post-stratification. Extending the ideas of JPS and VHE, we propose a new method, JVE, which is a combination of both. JPS and JVE are demonstrated to provide a flexible framework that can incorporate various information to better rank observed units. We conduct analyses on both simulated data and real-world social network data.
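
For concreteness, Hajek-normalized sketches of the two classic size-bias corrections the dissertation compares against; the JPS and JVE estimators themselves are not reproduced here:

```python
import numpy as np

def horvitz_thompson_mean(y, pi):
    """HT-style mean: weight each sampled unit's outcome y_i by 1/pi_i,
    its inclusion probability, to undo size bias (Hajek-normalized)."""
    w = 1.0 / np.asarray(pi, dtype=float)
    return np.sum(w * y) / np.sum(w)

def volz_heckathorn_mean(y, degree):
    """VH-style mean for link-tracing samples: inclusion probability is
    taken as proportional to network degree, so weights are 1/degree."""
    w = 1.0 / np.asarray(degree, dtype=float)
    return np.sum(w * y) / np.sum(w)
```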

Committee:

David Sivakoff, Dr. (Advisor); Elizabeth Stasny, Dr. (Advisor); Catherine Calder, Dr. (Committee Member)

Subjects:

Statistics

Keywords:

social networks; network sampling; sampling bias; estimation; machine learning; judgement post-stratification; data mining
