Data mining, also known as knowledge discovery in database (KDD), is the process to discover interesting unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract interesting and nontrivial information and knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups.
This research focuses on improving the performance of text clustering. We investigated the text clustering algorithms in four aspects: document representation, documents closeness measurement, high dimension reduction and parallelization. We propose a group of high performance text clustering algorithms, which target the unique characteristics of unstructured text database.
First, two new text clustering algorithms are proposed. Unlike the vector space model, which treats document as a bag of words, we use a document representation which keeps the sequential relationship between words in the documents. In these two algorithms, the dimension of the database is reduced by considering the frequent word (meaning) sequences, and the closeness of two documents is measured based on the sharing of frequent word (meaning) sequences.
Second, a text clustering algorithm with feature selection is proposed. This algorithm gradually reduces the high dimension of database by performing feature selection during the clustering. The new feature selection method applied is based on the well-known chi-square statistic and a new statistical data which can measure the positive and negative term-category dependence.
Third, a group of new text clustering algorithms is developed based on the k-means algorithm. Instead of using the cosine function, a new function involving global information is proposed to measure the closeness between two documents. This new function utilizes the neighbor matrix introduced in [Guha:2000]. A new method for selecting initial centroids and a new heuristic function for selecting a cluster to split are adopted in the proposed algorithms.
Last, a new parallel algorithm for bisecting k-means is proposed for the message-passing multiprocessor systems. This new algorithm, named PBKP, fully utilizes the data-parallelism of the bisecting k-means algorithm, and adopts a prediction step to balance the workloads of multiple processors to achieve a high speedup.
Comprehensive performance studies were conducted on all the proposed algorithms. In order to evaluate the performance of these algorithms, we compared them with existing text clustering algorithms, such as k-means, bisecting k-means [Steinbach:2000] and FIHC [Fung:2003]. The experimental results show that our clustering algorithms are scalable and have much better clustering accuracy than existing algorithms. For the parallel PBKP algorithm, we tested it on a 9-node Linux cluster system and analyzed its performance. The experimental results suggest that the speedup of PBKP is linear with the number of processors and data points. Moreover, PBKP scales up better than the parallel k-means with respect to the desired number of clusters.