Approximate N-Clustering on Heterogeneous Information Networks with Star Schema

Madhamsetty, Lakshmi Poojitha


2023, MS, University of Cincinnati, Engineering and Applied Science: Computer Science.
Clustering techniques are increasingly needed in today’s world, where data is being accumulated on a large scale. Given a set of objects, clustering divides these objects into groups called clusters, such that objects in the same cluster are similar while objects from different clusters are dissimilar. Cluster analysis is essential in data mining for finding underlying patterns and information. Many complex real-world systems are formed by objects of multiple data types and the interactions between them, and such systems can be modeled as Heterogeneous Information Networks (HINs). An HIN is a network that consists of nodes of different object types and links representing relations between the nodes. Cluster analysis of HINs helps reveal the underlying information in these complex systems. Most real-world applications that handle big data, including social networks, medical information systems, online e-commerce systems, and most movie database systems (such as IMDb and Netflix), can be structured as HINs. Therefore, effective cluster analysis of large-scale HINs poses an interesting challenge. In this research, we have developed an ‘approximate N-Clustering’ model based on the A* (pronounced "A-star") search algorithm, with the Chernoff Upper Bound used as the approximation limiting criterion (the heuristic function). Here ‘N’ represents the number of databases/dimensions/object types. In this thesis, we use a star distribution pattern (or star schema) for clustering on HINs. In a star network schema, there is one central object type, and all other object types are connected to this central object type. The approximate n-clusters generated by our algorithm are the most informative occurrences (i.e., the probability of occurrence of any new n-cluster with a higher priority will not increase).
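For orientation, one standard textbook form of the Chernoff upper bound is given below. The abstract does not say which form the thesis uses or how it enters the A* heuristic, so this is background only; the symbols $X_i$, $\mu$, and $\delta$ are introduced here for illustration and are not taken from the thesis.

```latex
% Multiplicative Chernoff upper bound (standard form, shown as background).
% Assumption: X is a sum of independent Bernoulli random variables X_1..X_m.
\[
  X = \sum_{i=1}^{m} X_i, \qquad \mu = \mathbb{E}[X], \qquad
  \Pr\bigl[X \ge (1+\delta)\mu\bigr]
  \;\le\;
  \left( \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right)^{\mu}
  \quad \text{for all } \delta > 0 .
\]
```

A bound of this kind limits the probability that an observed count exceeds its expectation by a given factor, which is the general sense in which it can serve as a stopping or pruning criterion in a search.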
Our algorithm can be particularly useful in domains such as medicine, where information and patterns among genes, diseases, mutations, chemicals, etc. can be mined and analyzed as a whole. It can also be applied in other areas where the relationships between different factors can be studied as a single entity, such as streaming networks, where user age, location, gender, and genre preference can be used to suggest a movie or a series.
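The star schema described in the abstract (one central object type linked to every other object type) can be illustrated with a minimal Python sketch. It is not the thesis's implementation and does not perform the A* search or apply the Chernoff-bound heuristic. The class `StarSchemaHIN`, its methods, and the toy streaming data are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict


class StarSchemaHIN:
    """Toy heterogeneous information network with a star schema (hypothetical)."""

    def __init__(self, center_type, attribute_types):
        self.center_type = center_type
        self.attribute_types = tuple(attribute_types)
        # links[attr_type][center_id] -> set of attribute-object ids.
        # Every link touches the central type; attribute types never link
        # to each other directly, which is what makes the schema a star.
        self.links = {t: defaultdict(set) for t in self.attribute_types}

    def add_link(self, center_id, attr_type, attr_id):
        if attr_type not in self.links:
            raise ValueError(f"unknown object type: {attr_type}")
        self.links[attr_type][center_id].add(attr_id)

    def centers_supporting(self, pattern):
        """Return central objects linked to every attribute object in `pattern`.

        `pattern` maps an attribute type to a set of attribute-object ids.
        This "support" notion is an illustrative assumption, not the
        thesis's definition of an n-cluster.
        """
        centers = None
        for attr_type, wanted in pattern.items():
            matching = {
                c for c, have in self.links[attr_type].items() if wanted <= have
            }
            centers = matching if centers is None else centers & matching
        return centers if centers is not None else set()


# Toy streaming-network data (made up for this example): users are the
# central type; age group, location, and genre are the surrounding types.
hin = StarSchemaHIN("user", ["age_group", "location", "genre"])
for user, age, loc, genres in [
    ("u1", "18-25", "Ohio", ["sci-fi", "drama"]),
    ("u2", "26-35", "Ohio", ["drama"]),
    ("u3", "18-25", "Ohio", ["sci-fi"]),
]:
    hin.add_link(user, "age_group", age)
    hin.add_link(user, "location", loc)
    for g in genres:
        hin.add_link(user, "genre", g)

print(hin.centers_supporting({"genre": {"sci-fi"}, "location": {"Ohio"}}))
```

Storing one link table per surrounding object type, keyed by the central object, keeps every relation anchored at the center, so a candidate group spanning several object types can be checked by intersecting the sets of central objects it touches.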
Raj Bhatnagar, Ph.D. (Committee Chair)
Chong Yu, Ph.D. (Committee Member)
Vikram Ravindra, Ph.D. (Committee Member)
155 p.

Recommended Citations

  • Madhamsetty, L. P. (2023). Approximate N-Clustering on Heterogeneous Information Networks with Star Schema [Master's thesis, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1703170985010961

    APA Style (7th edition)

  • Madhamsetty, Lakshmi Poojitha. Approximate N-Clustering on Heterogeneous Information Networks with Star Schema. 2023. University of Cincinnati, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1703170985010961.

    MLA Style (8th edition)

  • Madhamsetty, Lakshmi Poojitha. "Approximate N-Clustering on Heterogeneous Information Networks with Star Schema." Master's thesis, University of Cincinnati, 2023. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1703170985010961

    Chicago Manual of Style (17th edition)