Search Results

(Total results 189)

Search Report

  • 1. Crossman, Nathaniel Stream Clustering And Visualization Of Geotagged Text Data For Crisis Management

    Master of Science (MS), Wright State University, 2020, Computer Science

    In the last decade, the advent of social media and microblogging services has inevitably changed our world. These services produce vast amounts of streaming data, and one of the most important ways of analyzing and discovering interesting trends in such data is through clustering. In clustering streaming data, it is desirable to perform a single pass over incoming data, so that old data need not be processed again, and the clustering model should evolve over time without losing important feature statistics of the data. In this research, we have developed a new clustering system that clusters social media data based on their textual content and displays the clusters and their locations on the map. Our system takes advantage of a text stream clustering algorithm that uses a two-phase clustering process. The online micro-clustering phase incrementally creates micro-clusters, called text droplets, that retain enough information about the topics occurring in the text stream. The off-line macro-clustering phase clusters the micro-clusters for a user-specified time interval and can change macro-clustering algorithms dynamically. Our experiments demonstrated that the performance of our system is scalable, and that it can easily be used by first responders and crisis management personnel to quickly determine whether a crisis is happening, where it is concentrated, and what resources are best to deploy to the situation.

    Committee: Soon M. Chung Ph.D. (Advisor); Nikolaos Bourbakis Ph.D. (Committee Member); Vincent A. Schmidt Ph.D. (Committee Member) Subjects: Computer Science
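A minimal sketch of the two-phase design described in entry 1 above, assuming a simple bag-of-words representation and an illustrative similarity threshold (this is not the thesis implementation): in the online phase each incoming text is absorbed into the most similar micro-cluster when cosine similarity exceeds the threshold, and otherwise seeds a new micro-cluster; any batch algorithm can later group the micro-cluster summaries in an off-line macro-clustering pass.

```python
# Sketch of an online micro-clustering pass over a text stream (illustrative only).
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MicroCluster:
    def __init__(self, tokens):
        self.tf = Counter(tokens)   # summed term frequencies of absorbed documents
        self.n = 1                  # number of absorbed documents

    def absorb(self, tokens):
        self.tf.update(tokens)
        self.n += 1

def online_phase(stream, sim_threshold=0.3):   # sim_threshold is an assumed value
    micro = []
    for text in stream:                        # single pass over incoming data
        tokens = text.lower().split()
        doc = Counter(tokens)
        best, best_sim = None, 0.0
        for mc in micro:
            s = cosine(doc, mc.tf)
            if s > best_sim:
                best, best_sim = mc, s
        if best is not None and best_sim >= sim_threshold:
            best.absorb(tokens)
        else:
            micro.append(MicroCluster(tokens))
    return micro

micro = online_phase(["flood near the bridge", "bridge flood rising water", "concert tickets tonight"])
print(len(micro), [mc.n for mc in micro])      # expect two micro-clusters: flood-related and other
```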
  • 2. Madhamsetty, Lakshmi Poojitha Approximate N-Clustering on Heterogeneous Information Networks with Star Schema

    MS, University of Cincinnati, 2023, Engineering and Applied Science: Computer Science

    Clustering techniques are increasingly needed in today's world, where data is accumulated on a large scale. Given a set of objects, clustering divides these objects into groups called clusters, where the objects in one cluster exhibit similarities while objects from different clusters are dissimilar. Clustering analysis is essential in data mining for finding underlying patterns and information. Many real-world complex systems are formed by objects of multiple data types and the interactions between them, and such systems can be modeled as Heterogeneous Information Networks (HINs). A heterogeneous information network (HIN) is a network that consists of nodes of different object types and links representing relations between the nodes. Cluster analysis of heterogeneous information networks helps reveal the underlying information within these complex systems. Most real-world applications that handle big data, including social networks, medical information systems, online e-commerce systems, and most movie database systems (such as IMDB, Netflix, etc.), can be structured into heterogeneous information networks. Therefore, effective clustering analysis of large-scale heterogeneous information networks poses an interesting challenge. In this research, we have developed an 'approximate N-Clustering' model, which is based on the A* (pronounced A-star) search algorithm, with the Chernoff Upper Bound used as the approximation limiting criterion (the heuristic function). Here 'N' represents the number of databases/dimensions/object types. In our thesis, we have used a star distribution pattern (or star schema) for clustering on HINs. In a star network schema, there is one central object type and all other object types are connected to this central object type. The approximate n-clusters generated from our algorithm are the most informative occurrences (i.e., the probability of occurrence of any new n-cluster with higher priorities will (open full item for complete abstract)

    Committee: Raj Bhatnagar Ph.D. (Committee Chair); Chong Yu Ph.D. (Committee Member); Vikram Ravindra Ph.D. (Committee Member) Subjects: Computer Science
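Entry 2 above builds on an A* search in which an admissible heuristic (there, a Chernoff upper bound) prunes the enumeration of candidate clusters. A generic A* skeleton with a pluggable heuristic is sketched below; the `expand`, `cost`, and `heuristic` callables are placeholders rather than the thesis's actual functions, and states are assumed to be hashable.

```python
# Generic best-first (A*) search skeleton with a pluggable heuristic (illustrative only).
import heapq

def a_star(start, expand, cost, heuristic, is_goal):
    """expand(state) -> successor states; cost(s, t) -> step cost;
    heuristic(state) -> optimistic estimate of remaining cost."""
    frontier = [(heuristic(start), 0.0, 0, start)]   # entries: (f = g + h, g, tie-breaker, state)
    tie = 1
    best_g = {start: 0.0}
    while frontier:
        _, g, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state, g
        for succ in expand(state):
            g2 = g + cost(state, succ)
            if g2 < best_g.get(succ, float("inf")):  # keep only the cheapest known path
                best_g[succ] = g2
                heapq.heappush(frontier, (g2 + heuristic(succ), g2, tie, succ))
                tie += 1
    return None, float("inf")
```

With an admissible (optimistic) heuristic, the first goal state popped from the frontier is optimal, which is what allows a bound-based heuristic to prune the search safely.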
  • 3. Cunningham, James Efficient, Parameter-Free Online Clustering

    Master of Science, The Ohio State University, 2020, Computer Science and Engineering

    As the number of data sources and dynamic datasets grows, so does the need for efficient online algorithms to process them. Clustering is a very important data exploration tool used in various fields, from data science to machine learning. Clustering finds structure, without supervision, in otherwise indecipherable collections of data. Online clustering is the process of grouping samples in a data stream into meaningful collections as they appear over time. As more data is collected in these streams, online clustering becomes more important than ever. Unfortunately, the number of available efficient online clustering algorithms is limited, because their design is difficult and often requires trading clustering performance for efficiency. Those that do exist require expert domain knowledge of the data space to set hyperparameters such as the desired number of clusters. This domain knowledge is often unavailable, so resources must be spent tuning hyperparameters to get acceptable performance. In this thesis we present Stream FINCH (S-FINCH), an online modification of FINCH, the recent state-of-the-art parameter-free clustering algorithm by Sarfraz et al. We examine the stages of the FINCH algorithm and leverage key insights to produce an algorithm which reduces the online update complexity of FINCH. We then compare the performance of S-FINCH and FINCH over several toy and real-world datasets. We show theoretically and empirically that our S-FINCH algorithm is more efficient than the FINCH algorithm in the online domain and has reasonable real-time update performance. We also present several alternative cluster representatives which can be used to build different agglomerative cluster hierarchies using the S-FINCH algorithm. We compare the cluster quality and clustering time performance of these new representatives with the original FINCH algorithm. The S-FINCH algorithm presented in this thesis allows for fast and efficient online (open full item for complete abstract)

    Committee: James Davis PhD. (Advisor); Juan Vasquez PhD. (Committee Member); Kyle Tarplee PhD. (Committee Member); Wei-Lun Chao PhD. (Committee Member) Subjects: Computer Science; Information Science; Information Technology
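FINCH, which the S-FINCH work in entry 3 extends to the online setting, builds its first partition from nearest-neighbor links. The sketch below shows a simplified version of that grouping step, assuming Euclidean distance and omitting FINCH's shared-first-neighbor rule; it is not the S-FINCH code.

```python
# Simplified FINCH-style first partition: link each point to its nearest neighbor
# and take connected components of the (symmetrized) link graph.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def first_neighbor_partition(X):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                        # each point's first neighbor
    n = len(X)
    adj = coo_matrix((np.ones(n), (np.arange(n), nn)), shape=(n, n))
    adj = adj + adj.T                            # make the links undirected
    _, labels = connected_components(adj, directed=False)
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(first_neighbor_partition(X))               # two groups, e.g. [0 0 1 1]
```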
  • 4. Sardana, Divya Analysis of Meso-scale Structures in Weighted Graphs

    PhD, University of Cincinnati, 2017, Engineering and Applied Science: Computer Science and Engineering

    Many real-world systems across multiple disciplines, such as social, biological, and information networks, can be described as complex networks, i.e., assemblies of nodes and edges having nontrivial topological properties. Along with network topology, most real-world networks provide useful connectivity strength information between nodes in the form of edge weights. For example, in a co-authorship network, edge weights could represent the number of publications shared by two authors. This edge-weight information can be leveraged along with the network's topological properties to uncover many hidden and interesting properties inherent in these networks. Meso-scale structures can help in analysing network properties which are not very evident at either the global level (e.g., network diameter) or at the local level (e.g., node degree). In this dissertation, we develop novel methodologies for the analysis of weighted networks or graphs at the meso-scale. Specifically, we develop algorithms to find two types of meso-scale structures in graphs, namely, community structure and core-periphery structure. The first problem that we solve involves finding density-based, disjoint clusters or communities in a weighted graph. We develop a novel graph clustering algorithm called G-MKNN based upon a node affinity measure called Mutual K-nearest neighbors. Our algorithm is based upon a new definition of density called SE-density, which tries to achieve a balance between structural density and edge-weight homogeneity in the final clustering. We compare our algorithm with other state-of-the-art weighted and un-weighted graph clustering algorithms using both synthetic and real-world protein-protein interaction (PPI) datasets. The second problem that we solve involves extracting core-periphery structures in weighted graphs. In this work, first, we formalize the definition of core-periphery structures for weighted networks. Next, we build two algorithms to extract core-p (open full item for complete abstract)

    Committee: Raj Bhatnagar PhD (Committee Chair); Yizong Cheng PhD (Committee Member); Anil Jegga PhD (Committee Member); Ali Minai PhD (Committee Member); Tomas Stepinski PhD (Committee Member) Subjects: Computer Science
  • 5. Gurram, Abhinav Multi-Domain Clustering using the A* Search

    MS, University of Cincinnati, 2016, Engineering and Applied Science: Computer Science

    Identification of interesting bi-clusters in real-valued datasets is a computationally hard problem, and it does not scale easily with increasing dataset size. Most of the emerging and interesting data mining problems encounter datasets of increasingly larger size and complexity. To find interesting bi-clusters in a dataset, we need to examine all possible subsets of rows and columns and determine the bi-clusters that meet some interestingness criteria. When there exist monotonic or anti-monotonic properties of bi-clusters that increase or decrease with the sizes of the subsets forming the bi-clusters, Apriori-style pruning can be used to speed up the search for interesting bi-clusters. But such monotonic properties are not easily available for most mining tasks. Another useful avenue for pruning the search is the requirement that the bi-clusters be consistent with the data residing in a second dataset. As the hypotheses for bi-clusters are examined by a search algorithm, their merit value is influenced by their consistency with the data in the second dataset. We have developed and tested one such heuristic-search-based 3-clustering algorithm, and its details are presented in this thesis. We have successfully demonstrated that a number of different heuristics can be used to identify clusters having different types of properties, especially when these properties are derived from two datasets storing different types of information about the same sets of row objects. The performance of the algorithm and the quality of the discovered clusters have been studied in detail, and the results are enumerated. Our conclusion is that search-based algorithms are applicable for the identification of interesting bi-clusters and 3-clusters in situations with multiple related datasets.

    Committee: Raj Bhatnagar (Committee Chair); Yizong Cheng (Committee Member); Paul Talaga (Committee Member) Subjects: Engineering
  • 6. Brown, Kyle Topological Hierarchies and Decomposition: From Clustering to Persistence

    Doctor of Philosophy (PhD), Wright State University, 2022, Computer Science and Engineering PhD

    Hierarchical clustering is a class of algorithms commonly used in exploratory data analysis (EDA) and supervised learning. However, these algorithms suffer from some drawbacks, including the difficulty of interpreting the resulting dendrogram, arbitrariness in the choice of cut to obtain a flat clustering, and the lack of an obvious way of comparing individual clusters. In this dissertation, we develop the notion of a topological hierarchy on recursively defined subsets of a metric space. We look to the field of topological data analysis (TDA) for the mathematical background to associate topological structures such as simplicial complexes and maps of covers to clusters in a hierarchy. Our main results include the definition of a novel hierarchical algorithm for constructing a topological hierarchy, and an implementation of the MAPPER algorithm and our topological hierarchies in pure Python code as well as a web app dashboard for exploratory data analysis. We show that the algorithm scales well to high-dimensional data due to the use of dimensionality reduction in most TDA methods, and we analyze the worst-case time complexity of MAPPER and our hierarchical decomposition algorithm. Finally, we give a use case for exploratory data analysis with our techniques.

    Committee: Derek Doran Ph.D. (Advisor); Michael Raymer Ph.D. (Committee Member); Vincent Schmidt Ph.D. (Committee Member); Nikolaos Bourbakis Ph.D. (Committee Member); Thomas Wischgoll Ph.D. (Committee Member) Subjects: Computer Science
  • 7. KC, Rabi Study of Some Biologically Relevant Dynamical System Models: (In)stability Regions of Cyclic Solutions in Cell Cycle Population Structure Model Under Negative Feedback and Random Connectivities in Multitype Neuronal Network Models

    Doctor of Philosophy (PhD), Ohio University, 2020, Mathematics (Arts and Sciences)

    We study some dynamical system models where clustering of elements influences the dynamics of the model. First, we study the stability of periodic 'cyclic' solutions of a population model of the cell cycle under negative feedback. We build on previous results showing that the stability of cyclic solutions is determined by the values of model parameters s and r, 0 ≤ s ≤ r ≤ 1, and by which of two possible orderings of certain events the cyclic solution follows. The parameter triangle ∆ = {0 ≤ s ≤ r ≤ 1} is subdivided into sub-triangles on which the stability of all cyclic solutions is the same. Earlier work completely characterized the stability for sub-triangles on the boundary of ∆ in terms of number-theoretic relationships between the number of clusters k and certain indices of the sub-triangles. In the present work, we focus on interior sub-triangles. A previous study has shown that for the order of events called sr1, solutions corresponding to all interior sub-triangles are unstable. We show that when k is prime, the cyclic solutions are unstable for all interior sub-triangles with the other order of events, rs1. When k is even, we show that there always exists a small number of sub-triangles on which cyclic solutions are at least neutrally stable. For k odd and composite, we show that there are stable sub-triangles when k = 9 and k = 15 and no others. Next, we study a class of discrete-time neuronal network dynamical systems that exhibit the phenomenon of dynamic clustering. In particular, we analyze the network dynamics in a class of neuronal network models where the connectivities are specified by directed multitype Erdős-Rényi random graphs (NERThmer models) that are randomly drawn from a distribution. Our models generalize models that have been extensively studied by Ahn et al. [6, 37, 44, 46], where the connectivities are specified by networks that are directed Erdős-Rényi random graphs (NERTher models), by introducing a new variable that specifies types (open full item for complete abstract)

    Committee: Todd Young (Advisor); Winfried Just (Committee Member); Vardes Melkonian (Committee Member); Mitchell Day (Committee Member) Subjects: Applied Mathematics; Mathematics
  • 8. Sutharzan, Sreeskandarajan CLUSTERING AND VISUALIZATION OF GENOMIC DATA

    Doctor of Philosophy, Miami University, 2019, Botany

    Applications of clustering and visualization approaches are essential in uncovering biological insights from large and complex genomic datasets. The ability to efficiently cluster large sets of nucleotide sequences can aid in performing many genomics tasks, such as the taxonomic assignment of metagenomics reads, identification of sequencing errors, and exploration of virus genome variations. Effective visualization approaches are essential in interpreting the complex biological processes associated with the differentially expressed genes obtained from transcriptomics studies. In this dissertation, a novel prime-number-based feature extraction approach was proposed, with applications in nucleotide sequence clustering. The feasibility of the proposed approach was explored by incorporating the approach as a filter into the nucleotide clustering tool PEACE (Parallel Environment for Assembly and Clustering of Gene Expression) and testing it on sequencing reads and virus genomes. The filter was effective in accelerating the clustering of Influenza A virus segment 4 and Dengue virus genomic sequences. The utility of the prime-number-based feature extraction approach was further explored by using it to develop a self-organizing-map-based tool for clustering Influenza A virus segment 4 sequences. Additionally, network-based visualization methods were utilized to uncover the biological processes associated with retinal pigment epithelium (RPE) reprogramming during chicken retina regeneration, using transcriptomic data. The findings of this study will aid in better understanding the clustering and visualization of genomic data. Chapter 1 of this dissertation provides an introduction to the use of clustering and visualization approaches in genomics. Chapter 2 provides the details of the study performed to investigate the feasibility of the proposed filter in accelerating PEACE clustering. Chapter 3 gives details of the network-based visualization approaches (open full item for complete abstract)

    Committee: Chun Liang (Advisor); Bruce Cochrane (Committee Member); Richard Moore (Committee Member); Meixia Zhao (Committee Member); Dhananjai Rao (Committee Member) Subjects: Bioinformatics; Biology; Botany
  • 9. Bhusal, Prem Scalable Clustering for Immune Repertoire Sequence Analysis

    Master of Science (MS), Wright State University, 2019, Computer Science

    The development of next-generation sequencing technology has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecule level. Large sequence datasets (e.g., millions of sequences) are being collected to comprehensively understand how the immune system of a patient evolves over different stages of disease development. A recent study has shown that the hierarchical clustering (HC) algorithm gives the best results for B-cell clone analysis, an important type of immune repertoire sequencing (IR-Seq) analysis. However, due to its inherent complexity, the classical hierarchical clustering algorithm does not scale well to large sequence datasets. Surprisingly, no algorithms have been developed to address this scalability issue for immunology research. In this thesis, we study two different strategies, aiming at finding the best scalable methods that can preserve the quality of the hierarchical clustering structure. The two strategies are (1) non-Euclidean indexing methods for speeding up the classical hierarchical clustering (HC), and (2) a new tree-based sequence summarization approach, SCT, that scans the large sequence dataset once and generates summaries for hierarchical clusters. We also experimented with the Spark-based minimum-spanning-tree algorithm (SparkMST), which generates results equivalent to single-linkage hierarchical clustering (SLINK), for comparative analysis. We have implemented all these algorithms and experimented with real sequence datasets for B-cell clone analysis. The results show that (1) the indexing-enhanced HC (e.g., using the Vantage-Point tree for indexing) preserves the clustering quality very well, while also significantly reducing the time complexity of the original HC; (2) SCT with HC is the fastest approximate HC method, with slightly sacrificed quality; and (3) SparkMST scales out satisfactorily and gives significant performa (open full item for complete abstract)

    Committee: Keke Chen Ph.D. (Advisor); Krishnaprasad Thirunarayan Ph.D. (Committee Member); Tanvi Banerjee Ph.D. (Committee Member) Subjects: Computer Science
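The SparkMST comparison in entry 9 relies on the fact that a single-linkage dendrogram can be recovered from a minimum spanning tree of the pairwise-distance graph, which is what makes an MST-based distributed computation equivalent to SLINK. A small in-memory sketch of that equivalence follows (not the thesis code; the random dataset and cut level are illustrative).

```python
# Single-linkage merge heights coincide with the sorted edge weights of a minimum
# spanning tree over the pairwise distances (illustrative check on toy data).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 5))

Z = linkage(pdist(X), method="single")            # classical SLINK dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")   # flat clustering from the dendrogram

mst = minimum_spanning_tree(squareform(pdist(X)))
print(np.allclose(np.sort(mst.data), Z[:, 2]))    # True: same merge distances
```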
  • 10. Loganathan, Satish Kumar Distributed Hierarchical Clustering

    MS, University of Cincinnati, 2018, Engineering and Applied Science: Computer Science

    Hierarchical clustering is an inherently sequential algorithm designed for datasets that can fit in the memory of a single stand-alone system. In this thesis, we extend the agglomerative hierarchical clustering algorithm to distributed data environments where both the storage and computational resources are decentralized. We specifically target environments where the data is horizontally partitioned.

    Committee: Raj Bhatnagar Ph.D. (Committee Chair); Gowtham Atluri Ph.D. (Committee Member); Ali Minai Ph.D. (Committee Member) Subjects: Computer Science
  • 11. Piekenbrock, Matthew Discovering Intrinsic Points of Interest from Spatial Trajectory Data Sources

    Master of Science (MS), Wright State University, 2018, Computer Science

    This paper presents a framework for intrinsic point of interest discovery from trajectory databases. Intrinsic points of interest are regions of a geospatial area that are innately derivable by the spatial and temporal aspects of trajectory data. In contrast with other definitions of a point of interest, which often require a knowledge base or external location data, intrinsic points of interest are completely data-driven. The framework unifies recent developments from the field of density level-set estimation, applied density-based clustering techniques, and common practices in spatial point pattern analysis, offering a more theoretically grounded framework towards how a point of interest may be defined. Experiments are performed comparing the results across several modern approaches to POI discovery under thousands of parameter settings, and a marked improvement in fidelity by the proposed approach is shown in both synthetic and real world data sets.

    Committee: Derek Doran Ph.D. (Advisor); Michael Raymer Ph.D. (Committee Member); William Romine Ph.D. (Committee Member); Krishnaprasad Thirunarayan Ph.D. (Committee Member) Subjects: Computer Science
  • 12. Yallamelli, Pavankalyan A Power Iteration Based Co-Training Approach to Achieve Convergence for Multi-View Clustering

    Master of Science (MS), Wright State University, 2017, Computer Science

    Collecting diversified opinions is the key to achieving "the Wisdom of Crowd". In this work, we propose to use a novel multi-view clustering method to group the crowd so that diversified opinions can be effectively sampled from different groups of people. Clustering is the process of dividing input data into possible subsets, where every element (entity) in each subset is considered to be related by some similarity measure. For example, a set of social media users can be clustered using their locations or common interests. However, real-world data is often best represented by multiple views/dimensions. For example, a set of social media users have a friend/follower network as well as a conversation network (different from a follower network). Multiple views enable a better understanding of data by improving knowledge accuracy through cross-verification across different views, and integrating multiple views also improves performance. Multi-view clustering enables this. Clustering quality, clustering agreement (consensus), and scalability are the three essential qualities for achieving higher correspondence between the clusters and the real underlying groups in multi-view clustering. Existing algorithms either lack scalability or achieve cluster convergence (consistent clusters across the views) very slowly. Most of the existing and recent multi-view clustering algorithms make use of spectral clustering. Spectral clustering, which ensures higher accuracy, is computationally costly because of eigenvector computation. To address this gap, in this paper we propose a clustering mechanism based on a co-training approach that achieves the three qualities. The two main contributions of our work are as follows: (1) a learning method using power-iteration clustering for clustering a single data view, and (2) an efficient and scalable update method that uses the cluster label information for updating other data views iteratively to achieve convergence (clustering ag (open full item for complete abstract)

    Committee: Amit Sheth Ph.D. (Advisor); Keke Chen Ph.D. (Committee Member); Brandon Minnery Ph.D. (Committee Member) Subjects: Computer Science; Social Research
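Contribution (1) in entry 12 clusters a single view with power-iteration clustering. A minimal sketch of that idea (Lin and Cohen's PIC) follows: repeated multiplication by the row-normalized affinity matrix yields a one-dimensional embedding that k-means can split. The toy affinity matrix, iteration count, and cluster count are illustrative, and the co-training update across views is not reproduced.

```python
# Power-iteration clustering on one data view (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

def pic_embedding(A, iters=15):
    W = A / A.sum(axis=1, keepdims=True)      # row-normalized affinity matrix
    v = np.random.default_rng(0).uniform(size=A.shape[0])
    v /= np.abs(v).sum()
    for _ in range(iters):                    # stop well before full convergence
        v = W @ v
        v /= np.abs(v).sum()
    return v

A = np.array([[1.0, 0.9, 0.1, 0.1],           # two obvious groups in the affinities
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
v = pic_embedding(A)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(v.reshape(-1, 1))
print(labels)                                 # the two blocks land in different clusters
```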
  • 13. Eldridge, Justin Clustering Consistently

    Doctor of Philosophy, The Ohio State University, 2017, Computer Science and Engineering

    Clustering is the task of organizing data into natural groups, or clusters. A central goal in developing a theory of clustering is the derivation of correctness guarantees which ensure that clustering methods produce the right results. In this dissertation, we analyze the setting in which the data are sampled from some underlying probability distribution. In this case, an algorithm is "correct" (or consistent) if, given larger and larger data sets, its output converges in some sense to the ideal cluster structure of the distribution. In the first part, we study the setting in which data are drawn from a probability density supported on a subset of a Euclidean space. The natural cluster structure of the density is captured by the so-called high density cluster tree, which is due to Hartigan (1981). Hartigan introduced a notion of convergence to the density cluster tree, and recent work by Chaudhuri and Dasgupta (2010) and Kpotufe and von Luxburg (2011) has constructed algorithms which are consistent in this sense. We will show that Hartigan's notion of consistency is in fact not strong enough to ensure that an algorithm recovers the density cluster tree as we would intuitively expect. We identify the precise deficiency which allows this, and introduce a new, stronger notion of convergence which we call consistency in merge distortion. Consistency in merge distortion implies Hartigan's consistency, and we prove that the algorithm of Chaudhuri and Dasgupta (2010) satisfies our new notion. In the sequel, we consider the clustering of graphs sampled from a very general, non-parametric random graph model called a graphon. Unlike in the density setting, clustering in the graphon model is not well-studied. We therefore rigorously analyze the cluster structure of a graphon and formally define the graphon cluster tree. We adapt our notion of consistency in merge distortion to the graphon setting and identify efficient, consistent algorithms.

    Committee: Mikhail Belkin PhD (Advisor); Yusu Wang PhD (Advisor); Facundo Mémoli PhD (Committee Member); Vincent Vu PhD (Committee Member) Subjects: Artificial Intelligence; Computer Science; Statistics
  • 14. Storer, Jeremy Computational Intelligence and Data Mining Techniques Using the Fire Data Set

    Master of Science (MS), Bowling Green State University, 2016, Computer Science

    Forest fires are a dangerous and devastating phenomenon. Being able to accurately predict the burned area of a forest fire could potentially limit biological damage as well as better prepare for the ensuing economic and ecological damage. A data set from the Montesinho Natural Park in Portugal provides a difficult regression task regarding the prediction of forest fire burn area, due to the limited number of data entries and the imbalanced nature of the data set. This thesis focuses on improving these results through the use of a Backpropagation-trained Artificial Neural Network, which is systematically evaluated over a variety of configurations, activation functions, and input methodologies, resulting in approximately 30% improvement in regression error rates. A Particle Swarm Optimization (PSO) trained Artificial Neural Network is also evaluated in a variety of configurations, providing approximately 75% improvement in regression error rates. Going further, the data are also clustered on both inputs and outputs using k-Means and Spectral algorithms in order to pursue the task of classification: near-perfect classification is achieved when clustering on inputs, and an accuracy of roughly 60% is achieved when clustering on output values.

    Committee: Robert Green PhD. (Advisor); Jong Kwan Lee PhD. (Committee Member); Robert Dyer PhD. (Committee Member) Subjects: Computer Science
  • 15. Dixit, Siddharth Density Based Clustering using Mutual K-Nearest Neighbors

    MS, University of Cincinnati, 2015, Engineering and Applied Science: Computer Science

    Density-based clustering is an important research problem for data scientists and has been investigated with interest in the past. Due to data proliferation, datasets of different sizes are being introduced that involve high-dimensional data with varying densities. Such datasets include data with high-density regions surrounded by sparse regions. Existing approaches to clustering are unable to handle these data situations well. We present a novel clustering algorithm that utilizes the Mutual K-nearest-neighbor relationship to overcome the shortcomings of existing approaches on density-based datasets. Our approach requires a single input parameter, works well for high-dimensional density-based datasets, and is CPU-time efficient. We experimentally demonstrate the efficacy and robustness of our algorithm on synthetic and real-world density-based datasets.

    Committee: Raj Bhatnagar Ph.D. (Committee Chair); Nan Niu Ph.D. (Committee Member); Zhe Shan Ph.D. (Committee Member) Subjects: Computer Science
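A minimal sketch of the mutual k-nearest-neighbor idea named in entry 15, assuming Euclidean distance and a toy value of the single parameter k (this is not the thesis algorithm): two points are linked only when each appears in the other's k-nearest-neighbor list, and connected components of that graph form the clusters.

```python
# Mutual k-nearest-neighbor grouping (illustrative sketch).
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def mknn_clusters(X, k=3):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]               # indices of each point's k nearest neighbors
    n = len(X)
    member = np.zeros((n, n), dtype=bool)
    member[np.arange(n)[:, None], knn] = True
    mutual = member & member.T                       # keep only reciprocal neighbor pairs
    _, labels = connected_components(csr_matrix(mutual), directed=False)
    return labels

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.2, (10, 2)),        # one dense group
               rng.normal(5.0, 0.2, (10, 2))])       # second group, well separated
print(mknn_clusters(X, k=3))                         # two connected components
```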
  • 16. Xie, Qing Yan K-Centers Dynamic Clustering Algorithms and Applications

    PhD, University of Cincinnati, 2013, Engineering and Applied Science: Computer Science and Engineering

    Every day, large and increasing amounts of unstructured information are created, putting ever more demands on retrieval methods, classification, automatic data analysis, and management. Clustering is an important and efficient way of organizing and analyzing information and data. One of the most widely used dynamic clustering algorithms is K-Means clustering. This dissertation presents our K-Centers Min-Max dynamic clustering algorithm (KCMM) and our K-Centers Mean-shift Reverse Mean-shift dynamic clustering algorithm (KCMRM). These algorithms are designed to modify K-Means in order to achieve improved performance and help with specific goals in certain domains. These two algorithms can be applied to many fields such as wireless sensor networks, server or facility location optimization, and molecular networks. Their application in wireless sensor networks is described in this dissertation. The K-Centers Min-Max clustering algorithm uses a smallest enclosing disk/sphere algorithm to attain a minimum of the maximum distance between a cluster node and data nodes. Our approach results in fewer iterations and shorter maximum intra-cluster distances than the standard K-Means clustering algorithm, with either a uniform or a normal distribution. Most notably, it can achieve much better performance when the size of clusters is large, or when the clusters include large numbers of member nodes in a normal distribution. The K-Centers Mean-shift Reverse Mean-shift clustering algorithm is proposed to solve the "empty cluster" problem, which is caused by random deployment. It employs a Gaussian function as a kernel function, discovers the relationship between mean shift and gradient ascent on the estimated density surface, and iteratively moves cluster nodes away from their weighted means. This results in cluster nodes which better accommodate the distribution of data nodes. The K-Centers Mean-shift Reverse Mean-shift algorithm can not only reduce the number of empty clu (open full item for complete abstract)

    Committee: Yizong Cheng Ph.D. (Committee Chair); Kenneth Berman Ph.D. (Committee Member); Wen Ben Jone Ph.D. (Committee Member); Anca Ralescu Ph.D. (Committee Member); Xuefu Zhou Ph.D. (Committee Member) Subjects: Computer Science
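The min-max objective of the KCMM algorithm in entry 16 can be illustrated with a K-Means-style loop in which the center update is an approximate minimum enclosing ball center (here a Badoiu-Clarkson iteration) instead of the mean. This is a sketch under that substitution, not the dissertation's smallest-enclosing-disk routine, and the data and parameters are illustrative.

```python
# K-Means-like loop whose center update approximates the minimax (1-center) point.
import numpy as np

def approx_minmax_center(P, iters=100):
    c = P[0].astype(float)
    for i in range(1, iters + 1):
        far = P[np.argmax(np.linalg.norm(P - c, axis=1))]
        c += (far - c) / (i + 1)                 # step toward the current farthest point
    return c

def k_centers_minmax(X, k=2, rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(rounds):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = approx_minmax_center(X[labels == j])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(6.0, 1.0, (30, 2))])
labels, centers = k_centers_minmax(X, k=2)
print(centers)                                   # roughly the minimax centers of the two blobs
```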
  • 17. Singh, Siddharth Non-parametric Clustering and Topic Modeling via Small Variance Asymptotics with Local Search

    Master of Science, The Ohio State University, 2013, Computer Science and Engineering

    Clustering of data has been a very well-studied topic in the Machine Learning community, with various methods trying to solve the same problem of grouping similar objects together. Traditional approaches have been algorithmically simpler and easier to implement, with reasonable results. More recently, algorithms derived from asymptotics on Bayesian Non-parametric Infinite Mixture Models have appeared as an alternative. These algorithms in general point to a very clear relation between probabilistic methods like Expectation Maximization and hard-assignment-based algorithms like K-Means. They provide both the flexibility of a Bayesian Non-parametric model and the scalability of hard clustering algorithms like K-Means. Asymptotics on more complex mixture models have been used to derive algorithms that resemble hierarchical clustering and hard Topic Modeling. Although these new algorithms are highly scalable and open a new dimension in modeling data based on different similarity measures, they still suffer from problems traditionally seen in any optimization-based method, such as local optima. Also, being non-parametric in nature, these algorithms do not fix parameters like the number of clusters upfront. This leads to a new problem of choosing the right set of initial values for the parameters in the model. This work primarily addresses the issue of local optima, how to reject sub-optimal solutions and obtain better ones, and the initialization of the values of the parameters in the model. We achieve this by adding certain new steps to the existing algorithm, and finally we quantitatively verify the improvements. Our focus is mostly on the K-Means-like algorithm derived from asymptotics on the Dirichlet Process Mixture Model of infinite Gaussians, and its extension to hierarchies via the Hierarchical Dirichlet Process. We will also focus on a similar mixture model more generalized by replacing Gaussians with the Exp (open full item for complete abstract)

    Committee: Brian Kulis (Advisor); Eric Fosler-Lussier (Committee Member) Subjects: Computer Science
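The K-Means-like algorithm obtained from small-variance asymptotics on the Dirichlet Process mixture, which entry 17 builds on, is DP-means (Kulis and Jordan): a point farther than a penalty lambda from every current mean spawns a new cluster, so the number of clusters is not fixed upfront. A minimal sketch of that base algorithm follows; the thesis's local-search additions are not reproduced, and lambda and the toy data are illustrative.

```python
# DP-means-style hard clustering with a cluster-creation penalty (illustrative sketch).
import numpy as np

def dp_means(X, lam, iters=20):
    means = [X.mean(axis=0)]                              # start with one global cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for i, x in enumerate(X):
            d2 = [np.sum((x - m) ** 2) for m in means]
            j = int(np.argmin(d2))
            if d2[j] > lam:                               # too far from every mean: new cluster
                means.append(x.copy())
                labels[i] = len(means) - 1
            else:
                labels[i] = j
        means = [X[labels == j].mean(axis=0)              # recompute means of non-empty clusters
                 for j in range(len(means)) if np.any(labels == j)]
        labels = np.array([int(np.argmin([np.sum((x - m) ** 2) for m in means])) for x in X])
    return labels, means

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(4.0, 0.3, (20, 2))])
labels, means = dp_means(X, lam=2.0)
print(len(means))                                         # number of clusters found, not fixed upfront
```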
  • 18. Hu, Zhen Multi-Domain Clustering on Real-Valued Datasets

    PhD, University of Cincinnati, 2011, Engineering and Applied Science: Computer Science and Engineering

    Clustering is an important research problem for knowledge discovery from databases. It focuses on finding hidden structures embedded in datasets. It is non-trivial to arrive at a clustering of a dataset such that each pair of data points within the same cluster is similar and each pair in different clusters is distinct. This is due to the multiplicity of meanings of similarity between data points and also to the criteria determining the number, shape, and boundaries of clusters. Despite a large body of published research, new clustering problems keep arising that require novel solutions. Such a situation is evolving in the field of biomedical research, which is generating a large number of interrelated and interdependent datasets, and also in many other domains of science and business. We have developed three novel methodologies for clustering to meet these newly emerging needs. The first problem we have solved relates to the grouping of data points with “similar” density in the data space into distinct clusters, using full-dimensional clustering. Based on the pair-wise similarity matrix among data points, we define a new type of relationship among them - that of the point pairs being Mutual K-Nearest Neighbors (MKNN) of each other - and design clustering algorithms based on this new notion to capture the data density. Compared with traditional Euclidean-distance-based clustering algorithms for datasets having different densities, our MKNN-based clustering algorithm allows users to form density-based clusters with significantly lower sensitivity to parameters. We have analytically and empirically demonstrated, using both synthetic and real-world datasets, the increased capability, precision, efficiency, and robustness of our algorithm. The second clustering algorithm which we have developed incorporates prior domain knowledge, provided as a pair-wise similarity matrix in one dataset, into the clustering performed for data in another dataset. (open full item for complete abstract)

    Committee: Raj Bhatnagar PhD (Committee Chair); Yizong Cheng PhD (Committee Member); Karen Davis PhD (Committee Member); Mario Medvedovic PhD (Committee Member); John Schlipf PhD (Committee Member) Subjects: Computer Science
  • 19. Alqadah, Faris Clustering of Multi-Domain Information Networks

    PhD, University of Cincinnati, 2010, Engineering : Computer Science and Engineering

    Clustering is one of the most basic mental activities used by humans to handle the huge amount of information they receive every day. As such, clustering has been extensively studied in different disciplines, including statistics, pattern recognition, machine learning, and data mining. Nevertheless, the body of knowledge concerning clustering has focused on objects represented as feature vectors stored in a single dataset. Clustering in this setting aims at grouping objects of a single type in a single table into clusters using the feature vectors. On the other hand, modern real-world applications are composed of multiple, large, interrelated datasets comprising distinct attribute sets and containing objects from many domains; typically, such data is stored in an information network. The types of patterns and knowledge desired in these applications go far beyond grouping similar homogeneous objects; rather, they involve unveiling dependency structures in the data in addition to pinpointing hidden associations across objects in multiple datasets and domains. For example, consider an information network that contains the domains of authors, papers, and conferences. Two authors a1 and a2 may work in the same research field but never publish in the same conference. Hence, clustering only the author and conference domains would fail to place a1 and a2 in the same cluster; however, considering the entire information network would reveal a hidden link via the papers domain, placing a1 and a2 in the same cluster. This form of relational clustering is essential for knowledge discovery in several applications such as bioinformatics, social networking, and recommender systems. Information-network clustering advances knowledge discovery in two ways. First, hidden associations amongst objects from differing domains are unveiled, leading to a better understanding of the hidden structure of the entire network. Second, local clusters of the objects within a domain are sharpened a (open full item for complete abstract)

    Committee: Raj Bhatnagar PhD (Committee Chair); Karen Davis PhD (Committee Member); Anil Jegga DVM, MRes (Committee Member); Ali Minai PhD (Committee Member); John Schlipf PhD (Committee Member); Yizong Cheng PhD (Committee Member) Subjects: Computer Science
  • 20. Freudenberg, Johannes Bayesian Infinite Mixture Models for Gene Clustering and Simultaneous Context Selection Using High-Throughput Gene Expression Data

    PhD, University of Cincinnati, 2009, Engineering : Biomedical Engineering

    Applying clustering algorithms to identify groups of co-expressed genes is an important step in the analysis of high-throughput genomics data in order to elucidate affected biological pathways and transcriptional regulatory mechanisms. As these data are becoming ever more abundant, the integration with both existing biological knowledge and other experimental data becomes as crucial as the ability to perform such analysis in a meaningful but virtually unsupervised fashion. Clustering analysis often relies on ad-hoc methods such as k-means or hierarchical clustering with Euclidean distance, but model-based methods such as the Bayesian Infinite Mixtures approach have been shown to produce better, more reproducible results. Further improvements have been accomplished by context-specific gene clustering algorithms designed to determine groups of co-expressed genes within a given subset of biological samples, termed a context. The complementary problem of finding differentially co-expressed genes given two or more contexts has been addressed, but it relies on the a priori definition of contexts and has not been used to facilitate the clustering of biological samples. Here we describe a new computational method using Bayesian infinite mixture models to cluster genes, simultaneously utilizing the concept of differential co-expression as a unique similarity measure to find groups of similar samples. We compute a novel per-gene differential co-expression score that is reproducible and biologically meaningful. To evaluate, annotate, and display clustering results, we present the integrated software package CLEAN, which contains functionality for performing Clustering Enrichment Analysis, a method to functionally annotate clustering results and to assign a novel gene-specific functional coherence score. We apply our method to a number of simulated datasets, comparing it to other commonly used clustering algorithms, and we re-analyze several breast cancer studies. We find that our unsuper (open full item for complete abstract)

    Committee: Mario Medvedovic PhD (Committee Chair); Bruce Aronow PhD (Committee Member); Michael Wagner PhD (Committee Member); Jaroslaw Meller PhD (Committee Member) Subjects: Bioinformatics