Clustering is one of the most basic mental activities used by humans to handle the huge amount of information they receive every day. As such, clustering has been extensively studied in different disciplines including: statistics, pattern recognition, machine learning and data mining. Nevertheless, the body of knowledge concerning clustering has focused on objects represented as feature vectors stored in a single dataset. Clustering in this setting aims at grouping objects of a single type in a single table into clusters using the feature vectors. On the other hand, modern real-world applications are composed of multiple, large interrelated datasets comprising distinct attribute sets and containing objects from many domains; typically such data is stored in an information network. The types of patterns and knowledge desired in these applications goes far beyond grouping similar homogeneous objects, but rather involves unveiling dependency structures in the data in addition to pinpointing hidden associations across objects in multiple datasets and domains. For example consider an information network that contains the domains of authors, papers and conferences. Two authors a1 and a2 may work in the same research field but never publish in the same conference. Hence clustering only the domains of authors and conference would fail to place a1 and a2 in the same cluster; however considering the entire information network would reveal a hidden link via the papers domain, placing a1 and a2 in the same cluster. This form of relational clustering is essential for knowledge discovery in several applications such as: bioinformatics, social networking, and recommender systems.
Information-network clustering advances knowledge discovery in two manners. First, hidden associations amongst objects from differing domains are unveiled, leading to a better understanding of the hidden structure of the entire network. Second, local clusters of the objects within a domain are sharpened and put into greater context, leading to more accurate local clustering. In order to extract knowledge of this form, information network clustering algorithms must consider 1) overlapping clusters and 2) a clustering structure that relates the clusters. In this dissertation we develop a framework and several algorithms for information network clustering by leveraging the above two fundamental aspects that facilitate knowledge discovery.
Current state-of-the-art information network clustering algorithms have had relative success in addressing the computational challenge of high dimensional data; however, the majority of these approaches have not addressed the fundamental aspects of overlapping clusters and clustering structure. In this dissertation, we address the information-network clustering problem from a fresh perspective and introduce a novel framework based on Formal Concept Analysis (FCA). Based on mathematical
order theory, in particular, the theory of complete lattices, FCA provides a rich theoretical basis for investigating and structuring overlapping relational clustering in a single dataset. Shortcomings of previous methods were overcome by extending FCA to information networks, yielding effective and efficient information network clustering algorithms. Several empirical evaluations performed on a large variety of real-world information networks reveal that the FCA-based algorithms work more effectively and efficiently than the current state of the art.