PhD, University of Cincinnati, 2011, Engineering and Applied Science: Computer Science and Engineering
Clustering is an important research problem for knowledge discovery from databases. It focuses on finding hidden structures embedded in datasets. It is non-trivial to arrive at a clustering of a dataset such that data points within the same cluster are similar to one another and data points in different clusters are distinct from one another. This difficulty stems from the many possible meanings of similarity between data points and from the criteria that determine the number, shape, and boundaries of clusters. Despite a large body of published research, new clustering problems that require novel solutions keep arising. Such a situation is developing in biomedical research, which is generating a large number of interrelated and interdependent datasets, and in many other domains of science and business. We have developed three novel methodologies for clustering to meet these newly emerging needs.
The first problem we have solved relates to the grouping of data points with “similar” density in the data space into distinct clusters, using full dimensional clustering. Based on the pair-wise similarity matrix among data points, we define a new type of relationship among them, that of two points being Mutual K-Nearest Neighbors (MKNN) of each other, and design clustering algorithms based on this notion to capture the data density. Compared with traditional Euclidean-distance-based clustering algorithms applied to datasets having different densities, our MKNN-based clustering algorithm allows users to form density-based clusters with significantly lower sensitivity to parameters. We have analytically and empirically demonstrated, using both synthetic and real-world datasets, the increased capability, precision, efficiency, and robustness of our algorithm.
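The mutual k-nearest-neighbor relationship can be illustrated with a minimal sketch. Two points are treated as MKNN of each other when each appears in the other's list of k nearest neighbors. Everything below is an assumption made for illustration and is not taken from the dissertation: the function names (`knn_sets`, `mutual_knn_pairs`, `mknn_components`), the use of Euclidean distance in place of a general similarity matrix, and in particular the grouping step, which simply takes connected components of the MKNN graph and is not the clustering algorithm developed in the thesis.

```python
from math import dist


def knn_sets(points, k):
    """For each point index, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort the other points by distance; ties are broken by index for determinism.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (dist(p, points[j]), j),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def mutual_knn_pairs(points, k):
    """Return pairs (i, j), i < j, where each point is among the other's k nearest."""
    nn = knn_sets(points, k)
    return {
        (i, j)
        for i in range(len(points))
        for j in nn[i]
        if i < j and i in nn[j]
    }


def mknn_components(points, k):
    """Illustrative grouping only: connected components of the MKNN graph."""
    n = len(points)
    adjacency = {i: set() for i in range(n)}
    for i, j in mutual_knn_pairs(points, k):
        adjacency[i].add(j)
        adjacency[j].add(i)

    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(component))
    return clusters


# Hypothetical data: a tight group, a looser group, and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 13), (13, 10), (30, 30)]
clusters = mknn_components(points, k=2)
print(clusters)
```

In this toy dataset the tight group and the looser group are each recovered with the same value of k even though their point spacings differ by a factor of three, because the MKNN relation depends on neighbor rank rather than on an absolute distance threshold. The isolated point lists two members of the looser group as its nearest neighbors, but neither lists it in return, so it forms no mutual pair.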
The second clustering algorithm that we have developed incorporates prior domain knowledge, provided as a pair-wise similarity matrix in one dataset, into the clustering performed for data in another dataset.
Committee: Raj Bhatnagar PhD (Committee Chair); Yizong Cheng PhD (Committee Member); Karen Davis PhD (Committee Member); Mario Medvedovic PhD (Committee Member); John Schlipf PhD (Committee Member)
Subjects: Computer Science