Files
Ohio_State_University_dissertation_Ruochen_submitted.pdf (4.09 MB)
Enhancing Exponential Family PCA: Statistical Issues and Remedies
Author Info
Huang, Ruochen
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1689927863478667
Abstract Details
Year and Degree
2023, Doctor of Philosophy, Ohio State University, Statistics.
Abstract
Exponential family PCA (Collins et al., 2001) is a widely used dimension reduction tool for capturing a low-dimensional latent structure of exponential family data such as binary data or count data. As an extension of principal component analysis (PCA), it imposes a low-rank structure on the natural parameter matrix, which can be factorized into two matrices, namely, the principal component loadings matrix and the scores matrix. These loadings and scores share the same interpretation and functionality as those in PCA. Loadings enable exploration of associations among variables, scores can be utilized as low-dimensional data embeddings, and estimated natural parameters can be used to impute missing data entries. Despite the popularity of exponential family PCA, we find several statistical issues associated with this method. We investigate these issues from a statistical perspective and propose remedies in this dissertation.
Our primary concern arises from the joint estimation of loadings and scores through the maximum likelihood method. As in the well-known incidental parameter problem, this formulation with scores as separate parameters may result in inconsistency in the estimation of loadings under the classical asymptotic setting where the data dimension is fixed. We examine the population version of this formulation and show that it lacks Fisher consistency in loadings. Additionally, estimating scores can be viewed as fitting a generalized linear model with loadings as covariates. Maximum likelihood estimation (MLE) bias is naturally involved in this process but is often ignored. Upon identifying two major sources of bias in the estimation process, we propose a bias correction procedure to reduce their effects. First, we deal with the discrepancy between true loadings and their estimates under a limited sample size. We use the iterative bootstrap method to debias loadings estimates. Then, we account for sampling errors in loadings by treating them as covariates with measurement error to improve score estimates. Moreover, MLE biases in scores are properly addressed through well-known MLE bias reduction methods.
While exponential family PCA is applicable to a wide range of data types, its original formulation may be unsuitable for data with such salient features as excessive zeros, overdispersion, or mixed data types that may not be fully described by the exponential family distribution. One such case is data generated from high-throughput sequencing technologies. Due to technical limitations, measurements of low expression levels in the form of counts are often recorded as zeros, resulting in an excessive number of zeros. These technical zeros may be confounded with biologically meaningful zeros. To better accommodate such zero-inflated count data, we propose a new dimension reduction method by incorporating a zero-inflated probability structure into a Poisson PCA formulation (Landgraf and Lee, 2020b) and introduce an efficient minorization-maximization (MM) algorithm for parameter estimation.
Through extensive experiments on simulated data and real data, including the CAL500 dataset for binary data, the Million Song Dataset for count data, and a human bone marrow dataset for zero-inflated count data, we demonstrate the utility of the remedies proposed above in identifying the latent low-rank structure of data more effectively and improving the performance in downstream tasks.
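The two modeling ideas named in the abstract, a low-rank natural parameter matrix and a zero-inflated Poisson likelihood, can be illustrated with a minimal sketch. This is not the dissertation's formulation or its MM algorithm, neither of which is given in the abstract. The function names, the main-effects vector, and the single zero-inflation probability `pi_zero` shared by all entries are assumptions made only for illustration.

```python
import math


def natural_parameters(scores, loadings, main_effects):
    """Build a low-rank natural parameter matrix (hypothetical helper).

    theta[i][j] = main_effects[j] + sum_l scores[i][l] * loadings[j][l],
    so the rank of the centered matrix is at most the number of components.
    """
    return [
        [mu_j + sum(a * b for a, b in zip(score_i, loading_j))
         for loading_j, mu_j in zip(loadings, main_effects)]
        for score_i in scores
    ]


def poisson_loglik(x, theta):
    """Poisson log-likelihood of a count x with natural parameter theta = log(mean)."""
    return x * theta - math.exp(theta) - math.lgamma(x + 1)


def zip_loglik(x, theta, pi_zero):
    """Zero-inflated Poisson log-likelihood.

    Assumption: with probability pi_zero (0 <= pi_zero < 1) the entry is a
    structural zero; otherwise it is drawn from Poisson(exp(theta)).
    """
    if x == 0:
        # A zero can come from either mixture component.
        return math.log(pi_zero + (1.0 - pi_zero) * math.exp(-math.exp(theta)))
    return math.log(1.0 - pi_zero) + poisson_loglik(x, theta)


def total_zip_loglik(counts, theta, pi_zero):
    """Sum the entrywise zero-inflated Poisson log-likelihood over a count matrix."""
    return sum(
        zip_loglik(x, t, pi_zero)
        for row_x, row_t in zip(counts, theta)
        for x, t in zip(row_x, row_t)
    )
```

Setting `pi_zero` to 0 recovers the ordinary Poisson log-likelihood, which makes the sketch a convenient check that zero inflation only adds probability mass at zero. An estimation routine would maximize `total_zip_loglik` over scores, loadings, and `pi_zero`, a step this sketch leaves out.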
Committee
Yoonkyung Lee (Advisor)
Asuman Turkmen (Committee Member)
YunZhang Zhu (Committee Member)
Pages
183 p.
Subject Headings
Statistics
Keywords
Bias correction; Binary data; Exponential family PCA; Fisher consistency; Iterative bootstrap; Measurement error; Sparse count data; Zero inflation; Zero-inflated Poisson PCA
Citations
APA Style (7th edition)
Huang, R. (2023). Enhancing Exponential Family PCA: Statistical Issues and Remedies [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1689927863478667
MLA Style (8th edition)
Huang, Ruochen. Enhancing Exponential Family PCA: Statistical Issues and Remedies. 2023. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1689927863478667.
Chicago Manual of Style (17th edition)
Huang, Ruochen. "Enhancing Exponential Family PCA: Statistical Issues and Remedies." Doctoral dissertation, Ohio State University, 2023. http://rave.ohiolink.edu/etdc/view?acc_num=osu1689927863478667
Document number: osu1689927863478667
Download Count: 165
Copyright Info
© 2023, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.