Enhancing Exponential Family PCA: Statistical Issues and Remedies

Huang, Ruochen

Ohio_State_University_dissertation_Ruochen_submitted.pdf (4.09 MB)

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=osu1689927863478667

Year and Degree

2023, Doctor of Philosophy, Ohio State University, Statistics.

Abstract

Exponential family PCA (Collins et al., 2001) is a widely used dimension reduction tool for capturing the low-dimensional latent structure of exponential family data such as binary or count data. As an extension of principal component analysis (PCA), it imposes a low-rank structure on the natural parameter matrix, which can be factorized into two matrices: the principal component loadings matrix and the scores matrix. These loadings and scores share the same interpretation and functionality as those in PCA. Loadings enable exploration of associations among variables, scores can be used as low-dimensional data embeddings, and estimated natural parameters can impute missing data entries. Despite the popularity of exponential family PCA, we find several statistical issues associated with this method. We investigate these issues from a statistical perspective and propose remedies in this dissertation.

Our primary concern arises from the joint estimation of loadings and scores through the maximum likelihood method. As in the well-known incidental parameter problem, this formulation, which treats scores as separate parameters, may result in inconsistent estimation of loadings under the classical asymptotic setting where the data dimension is fixed. We examine the population version of this formulation and show that it lacks Fisher consistency in loadings. Additionally, estimating scores can be viewed as fitting a generalized linear model with loadings as covariates. Maximum likelihood estimation (MLE) bias naturally arises in this process but is often ignored. Having identified two major sources of bias in the estimation process, we propose a bias correction procedure to reduce their effects. First, we deal with the discrepancy between true loadings and their estimates under a limited sample size, using the iterative bootstrap method to debias loadings estimates. Then, we account for sampling errors in loadings by treating them as covariates with measurement error to improve score estimates. Moreover, MLE biases in scores are addressed through well-known MLE bias reduction methods.

While exponential family PCA is applicable to a wide range of data types, its original formulation may be unsuitable for data with salient features such as excessive zeros, overdispersion, or mixed data types that may not be fully described by an exponential family distribution. One such case is data generated from high-throughput sequencing technologies. Due to technical limitations, measurements of low expression levels in the form of counts are often recorded as zeros, resulting in an excessive number of zeros. These technical zeros may be confounded with biologically meaningful zeros. To better accommodate such zero-inflated count data, we propose a new dimension reduction method that incorporates a zero-inflated probability structure into a Poisson PCA formulation (Landgraf and Lee, 2020b), and we introduce an efficient minorization-maximization (MM) algorithm for parameter estimation. Through extensive experiments on simulated and real data, including the CAL500 dataset for binary data, the Million Song Dataset for count data, and a human bone marrow dataset for zero-inflated count data, we demonstrate the utility of the proposed remedies in identifying the latent low-rank structure of data more effectively and improving performance in downstream tasks.
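
To make the zero-inflated count model described in the abstract more concrete, the following is a minimal Python sketch of a zero-inflated Poisson log-likelihood evaluated on a low-rank natural parameter matrix. It is not the dissertation's estimator or its MM algorithm. The function names, the generic factorization of the natural parameters into scores and loadings, the log link, and the single scalar zero-inflation probability `pi` are all assumptions made for illustration, and the actual formulation in the dissertation may differ.

```python
import math
import numpy as np

def low_rank_natural_params(mu, scores, loadings):
    """Hypothetical helper: n x d natural parameters, mean vector plus rank-k term."""
    return mu[None, :] + scores @ loadings.T

def zip_loglik(x, theta, pi):
    """Log-likelihood of counts x under a zero-inflated Poisson model.

    theta: natural parameters (log Poisson means), same shape as x.
    pi:    assumed probability of a technical zero, with 0 <= pi < 1.
    """
    lam = np.exp(theta)
    log_fact = np.vectorize(math.lgamma)(x + 1.0)   # log(x!)
    pois_logpmf = x * theta - lam - log_fact
    # A zero is either a technical zero or a genuine Poisson zero.
    ll_zero = np.log(pi + (1.0 - pi) * np.exp(-lam))
    # A positive count can only come from the Poisson component.
    ll_pos = np.log1p(-pi) + pois_logpmf
    return float(np.sum(np.where(x == 0, ll_zero, ll_pos)))

# Simulated example (all sizes and settings are illustrative assumptions).
rng = np.random.default_rng(0)
n, d, k = 50, 8, 2
scores = rng.normal(size=(n, k))
loadings, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal columns
mu = np.zeros(d)
theta = low_rank_natural_params(mu, scores, loadings)
x = rng.poisson(np.exp(theta))
x[rng.random((n, d)) < 0.3] = 0                      # inject technical zeros
print(zip_loglik(x, theta, pi=0.3))
```

Setting `pi` to 0 recovers the ordinary Poisson log-likelihood, which is one way to see the zero-inflated model as an extension of a Poisson PCA-type model rather than a replacement for it.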

Committee

Yoonkyung Lee (Advisor)
Asuman Turkmen (Committee Member)
YunZhang Zhu (Committee Member)

Pages

183 p.

Subject Headings

Statistics

Keywords

Bias correction; Binary data; Exponential family PCA; Fisher consistency; Iterative bootstrap; Measurement error; Sparse count data; Zero inflation; Zero-inflated Poisson PCA

Recommended Citations

APA Style (7th edition)
Huang, R. (2023). Enhancing Exponential Family PCA: Statistical Issues and Remedies [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1689927863478667

MLA Style (8th edition)
Huang, Ruochen. Enhancing Exponential Family PCA: Statistical Issues and Remedies. 2023. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1689927863478667.

Chicago Manual of Style (17th edition)
Huang, Ruochen. "Enhancing Exponential Family PCA: Statistical Issues and Remedies." Doctoral dissertation, Ohio State University, 2023. http://rave.ohiolink.edu/etdc/view?acc_num=osu1689927863478667

Document number:

osu1689927863478667

Download Count:

198
