High-dimensional data analysis involves, among many other tasks, modeling the unknown underlying data-generating process and predicting which of a few possible sources might have generated the data. If the data have an underlying pattern or structure, it is conceivable that such hidden structure provides important cues for predicting the data-generating source. For modeling purposes, on the other hand, it is important to understand the computational aspects of learning the model parameters, especially their dependence on dimension. This thesis addresses the computational aspects of such a modeling problem in high dimension and also describes a novel predictive framework that exploits the underlying structure present in data.
The first part of this thesis addresses the problem of Gaussian mixture learning. The Gaussian mixture model is a fundamental model in applied statistics and a very popular choice for many scientific and engineering modeling tasks. However, the computational aspects of learning the parameters of this model, especially in high dimension, are not well understood. The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in the theoretical computer science and learning theory communities. Despite major progress, however, the general question of polynomial learnability of Gaussian mixture distributions remained open. The result presented in this thesis resolves this question for Gaussian mixtures in high dimension with an arbitrary but fixed number of components.
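To make the learning problem concrete, the following is a minimal sketch of fitting a two-component one-dimensional Gaussian mixture by the standard EM heuristic. This is purely illustrative of what "learning the parameters" means; it is not the thesis's method, and EM carries none of the polynomial-learnability guarantees discussed above.

```python
import math
import random

def em_gmm_1d(data, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Illustrative only: a standard heuristic for recovering mixture
    weights, means, and variances from samples, with no guarantee
    of polynomial-time learnability.
    """
    # Initialize by splitting the sorted data at its median.
    xs = sorted(data)
    mid = len(xs) // 2
    mu = [sum(xs[:mid]) / mid, sum(xs[mid:]) / (len(xs) - mid)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per point.
        resp = []
        for x in data:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 for k in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
    return w, mu, var

# Synthetic sample from two well-separated components.
random.seed(1)
data = ([random.gauss(-3.0, 1.0) for _ in range(300)]
        + [random.gauss(3.0, 1.0) for _ in range(300)])
w, mu, var = em_gmm_1d(data)
```

On well-separated data like this, EM recovers the means and weights accurately; the hardness addressed in the thesis arises precisely when components overlap and the dimension grows.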
The second part of this thesis describes studies on semi-supervised learning for multi-modal data, especially data with natural clusters. Many real-world datasets in scientific and engineering applications have clusters, e.g., handwritten digit datasets or breast cancer datasets. To deal with such multi-modal data, this thesis proposes a novel framework that establishes a natural connection between the “cluster assumption” in semi-supervised learning and sparse approximation. Specifically, when labeled data are limited, the proposed method can learn a classifier that has a sparse representation in an appropriate set of basis functions and whose performance is comparable to state-of-the-art semi-supervised learning algorithms on a number of real-world datasets.
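The “cluster assumption” states that points in the same cluster tend to share a label, so abundant unlabeled data can reveal the clusters while a handful of labels suffices to name them. The toy sketch below illustrates this idea with a tiny 1-D 2-means clusterer followed by label transfer; it is an assumption-laden illustration of the cluster assumption only, not the thesis's sparse-approximation method.

```python
import random

def two_means_1d(points, n_iter=50):
    """Tiny 1-D 2-means clustering (illustration only)."""
    c = [min(points), max(points)]
    for _ in range(n_iter):
        groups = ([], [])
        for x in points:
            groups[0 if abs(x - c[0]) <= abs(x - c[1]) else 1].append(x)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return c

def cluster_then_label(unlabeled, labeled):
    """Cluster assumption in miniature: cluster all points using the
    plentiful unlabeled data, then transfer each known label to its
    whole cluster. A toy sketch, not the thesis's algorithm."""
    centers = two_means_1d(unlabeled + [x for x, _ in labeled])

    def cluster_of(x):
        return 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1

    # Map each cluster to the label of a labeled example it contains.
    cluster_label = {}
    for x, y in labeled:
        cluster_label.setdefault(cluster_of(x), y)
    return lambda x: cluster_label[cluster_of(x)]

# Two clear clusters, but only ONE labeled example per class.
random.seed(0)
unlabeled = ([random.gauss(-2.0, 0.4) for _ in range(100)]
             + [random.gauss(2.0, 0.4) for _ in range(100)])
labeled = [(-2.1, "A"), (1.9, "B")]
classify = cluster_then_label(unlabeled, labeled)
```

With only two labeled points, the classifier still labels new points correctly because the unlabeled data pins down the cluster boundary; the thesis's framework makes this intuition precise via classifiers that are sparse in an appropriate basis.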