Search Results

(Total results 1072)

  • 1. Sun, Han High-dimensional Variable Selection: A Novel Ensemble-based Method and Stability Investigation

    Doctor of Philosophy, Case Western Reserve University, 2025, Epidemiology and Biostatistics

    Variable selection in high-dimensional data analysis poses substantial methodological challenges. While numerous penalized variable selection methods and machine learning approaches exist, many demonstrate instability in real-world applications. This thesis makes two primary contributions: developing a novel ensemble algorithm for variable selection in competing risks modeling and conducting a comprehensive stability analysis of established variable selection methods. The first component introduces the Random Approximate Elastic Net (RAEN), an innovative methodology that offers a stable and generalizable solution for large-p-small-n variable selection in competing risks data. RAEN's flexible framework enables its application across various time-to-event regression models, including competing risks quantile regression and accelerated failure time models. We demonstrate that our computationally-intensive algorithm substantially improves both variable selection accuracy and parameter estimation in a numerical study. We have implemented RAEN in a user-friendly R package, freely available for public use. To demonstrate its practical utility, we apply RAEN to a cancer study, successfully identifying influential genes associated with mortality and disease progression in bladder cancer patients. The second component comprises a systematic evaluation of eight variable selection methods' stability under varying conditions. Through comprehensive numerical studies, we examine how factors such as sample sizes, number of predictors, correlation levels, and signal strength influence performance. Based on these findings, we provide evidence-based recommendations for implementing variable selection methods in real-world data analysis.

    Committee: Xiaofeng Wang (Advisor); John Barnard (Committee Member); Mark Schluchter (Committee Member); William Bush (Committee Chair) Subjects: Bioinformatics; Biostatistics; Genetics; Statistics
  • 2. Zhang, Haichao Bayesian Modeling and Variable Selection in Dependent Zero-Inflated Count Data

    PhD, University of Cincinnati, 2024, Arts and Sciences: Statistics

    This dissertation explores Bayesian methods for modeling count data with excessive zeros through generalized linear mixed-effect models. It focuses on developing efficient computational algorithms, modeling shared independent and dependent random effects, and implementing effective variable selection. In Bayesian modeling of zero-inflated count data, a central challenge is the lack of closed-form posteriors and the resulting inefficiency of Markov chain Monte Carlo (MCMC) sampling. We implement a data augmentation strategy based on Polya-Gamma latent variables. Under this approach, the binomial likelihood is represented as a Gaussian mixture with respect to the Polya-Gamma latent variables, leading to closed-form posteriors for regression coefficients. This greatly facilitates the computation related to the binomial likelihood, which is needed both in modeling the zero-inflation using logistic regression and in modeling the over-dispersion using negative binomial regression. In Chapter 2, we propose several Zero-Inflated Poisson (ZIP) and Zero-Inflated Negative Binomial (ZINB) models to address challenges commonly observed in ecological studies, including overdispersion, excess zeros, missing data, and temporal dependencies. These models include shared year effects and autoregressive random components to capture both within- and between-year correlations. Simulation studies and real-world applications show that including autoregressive components in zero-inflated count models significantly improves performance in modeling temporally dependent count data, particularly in terms of predictive accuracy and parameter estimation. We illustrate the application of these models by exploring migration patterns and potential environmental drivers in a steelhead smolt migration dataset from the Santa Clara River in Southern California. 
The results show that models with autoregressive components have better performance than the other models in both model fi (open full item for complete abstract)

    Committee: Xia Wang Ph.D. (Committee Chair); Xuan Cao Ph.D. (Committee Member); Siva Sivaganesan Ph.D. (Committee Member) Subjects: Statistics
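The zero-inflated Poisson structure at the core of this abstract is easy to demonstrate on simulated data. The sketch below is a plain maximum-likelihood fit, not the author's Bayesian Polya-Gamma sampler, and all parameter values are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(0)

# Simulate zero-inflated Poisson data: with probability pi the count is a
# structural zero, otherwise it is drawn from Poisson(lam).
pi_true, lam_true, n = 0.3, 4.0, 5000
structural = rng.random(n) < pi_true
y = np.where(structural, 0, rng.poisson(lam_true, n))

def zip_nll(theta):
    """Negative log-likelihood of the ZIP model (unconstrained parameters)."""
    pi = 1 / (1 + np.exp(-theta[0]))   # logit-transformed zero-inflation prob
    lam = np.exp(theta[1])             # log-transformed Poisson rate
    ll_zero = np.log(pi + (1 - pi) * np.exp(-lam))   # P(Y = 0): two zero sources
    y_pos = y[y > 0]
    ll_pos = np.log1p(-pi) - lam + y_pos * np.log(lam) - gammaln(y_pos + 1)
    return -((y == 0).sum() * ll_zero + ll_pos.sum())

res = minimize(zip_nll, x0=[0.0, 0.0], method="Nelder-Mead")
pi_hat = 1 / (1 + np.exp(-res.x[0]))
lam_hat = np.exp(res.x[1])
```

The mixture is visible in the likelihood: P(Y = 0) combines a structural-zero term and a Poisson-zero term, which is precisely what removes the closed-form posterior and motivates the data augmentation the abstract describes.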
  • 3. Schafer, Austin Enhancing Vehicle Detection in Low-Light Imagery Using Polarimetric Data

    Master of Science (M.S.), University of Dayton, 2024, Electrical Engineering

    RGB imagery provides detail that is usually sufficient for computer vision tasks. However, images taken in low light appear vastly different from well-lit imagery because of the difference in light intensity. Polarimetric data provides additional detail that captures the orientation of the light rather than its intensity. Scaling classic RGB images using polarimetric data maintains the RGB image type while also enhancing image contrast, making transfer learning with pre-trained RGB models more feasible. Our work focuses on developing a large dataset of paired polarimetric RGB images in a highly controlled laboratory environment. Then, we perform transfer learning on a pre-trained image segmentation model with each of our image product types. Finally, we compare these results in both well-lit and low-light scenarios to see how our polarimetrically enhanced RGB images stack up against regular RGB images.

    Committee: Bradley Ratliff (Committee Chair); Amy Neidhard-Doll (Committee Member); Eric Balster (Committee Member) Subjects: Computer Engineering; Electrical Engineering; Engineering; Optics; Remote Sensing; Scientific Imaging; Statistics
  • 4. Rickman, William Surrogate Markov Models for Validation and Comparative Analysis of Proper Orthogonal Decomposition and Dynamic Mode Decomposition Reduced Order Models

    Master of Science, Miami University, 2025, Mechanical and Manufacturing Engineering

    Reduced order modeling (ROM) methods, such as those based upon Proper Orthogonal Decomposition (POD) and Dynamic Mode Decomposition (DMD), offer data-based turbulence modeling with potential applications for flow control. While these models are often cheaper than numerical approaches, their results require validation with source data. Within the literature, the metrics and standards used to validate these models are often inconsistent. Chabot (2014) produced a data-driven framework for validating these ROMs that used surrogate Markov models (SMMs) to compare how the system dynamics evolved rather than how any single metric evolved. These SMMs were constructed by clustering the flow data into different states of suitably similar flow fields, and the Markov model then mapped how likely each state was to transition into another. While this method was successful, some uncertainty remained in how the outlier states within the clustering scheme were determined. Additionally, the study only examined the application of this procedure to POD-Galerkin ROMs. This study aims to tie the outlier state determination directly to the models' parent data. The study will also apply this procedure to ROMs generated from DMD to investigate how this framework's effectiveness carries over to different classes of ROMs.

    Committee: Edgar Caraballo (Advisor); Andrew Sommers (Committee Member); Mehdi Zanjani (Committee Member) Subjects: Aerospace Engineering; Fluid Dynamics; Mathematics; Mechanical Engineering; Statistics
  • 5. Inoshita, Gabriel Improving Abundance Estimates of the Black River Population of Texas Hornshell Via Zero-Inflated Modeling

    Master of Science, Miami University, 2024, Biology

    Human-induced climate change has been impacting ecological systems around the world. These impacts include increases in temperature and changes in precipitation. Freshwater mussels are one of the taxonomic groups most affected by these changes, leading to the decline of many species of mussels. I sought to improve population estimates for the Black River population of the Texas Hornshell mussel (Popenaias popeii), in Eddy County, NM. I investigated changes in demographic parameters through a 27-year mark-and-recapture study, discovering that the total population is decreasing by about 1.67% per year with individual survival rates at 81.5% per year. In 2023-2024, I repeated a population census using methods that were also used in 2011-2012 and 2017-2018. Combining the results of the three censuses, I generated a zero-inflated Poisson model to predict abundance within the river with good fit to the testing data, which improved estimate precision by 80% compared to previous methods. I found that estimated model error doubles every 5 years into the future and that reduced sampling designs can produce good estimates with limited decrease in precision. Reduced sampling and this model combined with periodic complete surveys should allow managers to control costs while precisely assessing abundance.

    Committee: David Berg (Advisor); Tereza Jezkova (Committee Member); Kentaro Inoue (Committee Member); Hank Stevens (Committee Member) Subjects: Biology; Conservation; Ecology; Statistics
  • 6. Meng, Guanqun STATISTICAL CONSIDERATIONS IN CELL TYPE DECONVOLUTION AND CELL-TYPE-SPECIFIC DIFFERENTIAL EXPRESSION ANALYSIS

    Doctor of Philosophy, Case Western Reserve University, 2024, Epidemiology and Biostatistics

    Interpreting sequencing data precisely is often the primary task in genomic research, aiming to uncover gene expression alterations associated with various phenotypes. Biopsy or tissue samples collected in clinical and research settings are typically a mosaic of at least several pure cell types. The observed changes in gene expression could be caused by variations in cell type compositions or differentially expressed (DE) genes within specific cell types. Therefore, cellular deconvolution is a critical step before any cell-type-specific differentially expressed (csDE) gene study. Many statistical approaches have been proposed for csDE studies. However, a systematic review examining the assumptions underlying these models, and how those assumptions influence performance under different scenarios, has not yet been conducted. Additionally, there is a lack of statistical tools to assess the power of csDE studies. Furthermore, current deconvolution methods largely depend on the assumption that all subjects share an identical population-level reference panel, which ignores inter-subject heterogeneities. This may compromise the validity of results, especially in studies that involve repeated and longitudinal measurements. Moreover, while machine learning and deep learning-based deconvolution methods have been extensively developed for bulk transcriptomic data such as RNA-seq and microarrays, their application to imaging data, such as Immunohistochemistry (IHC), remains unexplored. We first benchmarked several popular statistical models for detecting csDE genes between different phenotypes of interest. Based on our comprehensive and flexible data simulation pipelines, we developed a power evaluation toolbox, cypress, to guide researchers in designing experiments for csDE studies. cypress can conduct extensive simulations using existing or provided parameters, model biological/technical variations, and provide thorough assessments by multiple metrics. 
Additio (open full item for complete abstract)

    Committee: Hao Feng (Advisor); Fredrick R. Schumacher (Committee Chair); Qian Li (Committee Member); Jenný Brynjarsdóttir (Committee Member); Lijun Zhang (Committee Member) Subjects: Bioinformatics; Biostatistics; Genetics; Public Health; Statistics
  • 7. Kim, Seonho Nonlinear Inverse Problems: Efficient and Guaranteed Algorithms

    Doctor of Philosophy, The Ohio State University, 2024, Electrical and Computer Engineering

    Nonlinear inverse problems arise in fields such as engineering, statistics, and machine learning. Unlike linear inverse problems, which can be formulated as convex programs, the main challenge in nonlinear inverse problems is the non-convex nature of the optimization involved. Solving non-convex optimization problems is NP-hard, susceptible to local minima, and often computationally intractable, making it essential to design practical algorithms with guaranteed performance. This thesis addresses two specific nonlinear inverse problems. The first problem is robust phase retrieval, which has applications in areas including X-ray crystallography, diffraction and array imaging, and optics. In this problem, the forward model is the magnitude of linear measurements, and the observations are corrupted by sparse outliers. We employ a least absolute deviation (LAD) approach to robust phase retrieval, which aims to recover a signal from its absolute measurements contaminated by sparse noise. To tackle the resulting non-convex optimization problem, we propose a robust alternating minimization (Robust-AM) approach, derived as an unconstrained Gauss-Newton method. For solving the inner optimization in each step of Robust-AM, we adopt two computationally efficient methods. We provide a non-asymptotic convergence analysis of these practical algorithms for Robust-AM under the standard Gaussian measurement assumption. With suitable initialization, these algorithms are guaranteed to converge linearly to the ground truth at an order-optimal sample complexity with high probability, assuming the noise support is arbitrarily fixed and the sparsity level does not exceed 1/4. Furthermore, comprehensive numerical experiments on synthetic and image datasets demonstrate that Robust-AM outperforms existing methods for robust phase retrieval, while offering comparable theoretical guarantees. 
The second problem is max-affine regression, where the forward model is a convex (open full item for complete abstract)

    Committee: Kiryung Lee (Advisor); Yoonkyung Lee (Committee Member); Philip Schniter (Committee Member) Subjects: Engineering; Mathematics; Statistics
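The alternating-minimization idea behind this line of work can be sketched in a few lines for the noiseless, real-valued case. This is plain AM with a sign/least-squares alternation, not the thesis's LAD-based Robust-AM, and the "suitable initialization" the theory requires is mocked here by perturbing the truth:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 400                          # signal dimension, number of measurements
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = np.abs(A @ x_true)                  # magnitude-only (phaseless) observations

# Alternating minimization: impute the lost signs, then refit by least squares.
x = x_true + 0.2 * rng.standard_normal(n)   # stand-in for a spectral initializer
for _ in range(50):
    s = np.sign(A @ x)                          # step 1: current sign estimates
    x, *_ = np.linalg.lstsq(A, s * y, rcond=None)  # step 2: least-squares refit

# Phase retrieval recovers the signal only up to a global sign flip.
rel_err = min(np.linalg.norm(x - x_true),
              np.linalg.norm(x + x_true)) / np.linalg.norm(x_true)
```

Each iteration repairs some of the signs destroyed by the magnitude measurement; with a good enough initialization and enough measurements, the iterates contract to the truth up to a global sign.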
  • 8. Leinbach, Josiah A Hidden Markov Approach to Authorship Attribution of the Pastoral Epistles

    Master of Science (MS), Bowling Green State University, 2024, Applied Statistics (ASOR)

    The New Testament contains thirteen epistles written in the name of the Apostle Paul, and from the earliest records of church history, Christian theologians received all thirteen as authentically Pauline. Since the 19th century, however, many scholars have doubted Paul's authorship of some epistles based on, among other factors, their vocabulary and writing style, which differ from undisputed Pauline epistles. In particular, three epistles called the Pastoral Epistles (1 Timothy, 2 Timothy, and Titus) have been subject to the most doubt. This thesis will use a Hidden Markov Model that analyzes the transitions between different parts of speech in the whole Pauline corpus and classifies sentences as belonging to a “Pauline” or “non-Pauline” style. Then, informed by New Testament scholarship, we will interpret these results and judge the possibility of Pauline authorship for the Pastoral Epistles.

    Committee: Shuchismita Sarkar (Committee Chair); Riddhi Ghosh (Committee Member); Christopher Rump (Committee Member) Subjects: Statistics
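The part-of-speech transition idea can be illustrated with a first-order Markov (rather than hidden Markov) toy: estimate one transition matrix per style and classify a sentence by log-likelihood. The tags, training sentences, and "styles" below are all invented for illustration:

```python
import numpy as np

TAGS = ["NOUN", "VERB", "ADJ", "CONJ"]
IDX = {t: i for i, t in enumerate(TAGS)}

def transition_matrix(sentences, alpha=1.0):
    """Estimate a POS-tag transition matrix with add-alpha smoothing."""
    counts = np.full((len(TAGS), len(TAGS)), alpha)
    for sent in sentences:
        for a, b in zip(sent, sent[1:]):
            counts[IDX[a], IDX[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_score(sent, P):
    """Log-probability of a tag sequence under transition matrix P."""
    return sum(np.log(P[IDX[a], IDX[b]]) for a, b in zip(sent, sent[1:]))

# Hypothetical training data: style A chains nouns via conjunctions,
# style B prefers verb-heavy constructions.
style_a = [["NOUN", "CONJ", "NOUN", "CONJ", "NOUN"]] * 20
style_b = [["NOUN", "VERB", "ADJ", "NOUN", "VERB"]] * 20
Pa, Pb = transition_matrix(style_a), transition_matrix(style_b)

test_sent = ["NOUN", "CONJ", "NOUN"]
label = "A" if log_score(test_sent, Pa) > log_score(test_sent, Pb) else "B"
```

The thesis's HMM goes further by treating the style itself as a hidden state that can switch within the corpus, but the transition-counting machinery is the same.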
  • 9. Millican, Patrick Statistical Assessment and Comparison of Truncation Error and Convergence Patterns in Modern Nucleon-Nucleon Potentials

    Doctor of Philosophy, The Ohio State University, 2024, Physics

    The practice of nuclear physics is divided into experiment and theory. Experimental nuclear physics makes observations of nuclear properties, such as radii and binding energies, while theoretical nuclear physics interprets the results, assimilates them into broader and more fundamental theories, and counsels the direction of future experimental efforts. In order to learn from experimental data, theoretical nuclear physicists make models to describe interactions between fields under the guidelines of quantum field theory even when there may be no closed mathematical form for those interactions. In cases where such a closed form is lacking (and, indeed, even in some cases where such a form is known) and the physics of interest is confined to a physical regime, the dominant paradigm is that of effective field theory (EFT). A cutoff or cutoffs in some physical variable(s) bound the regime where an EFT seeks to describe physical phenomena, which restricts the degrees of freedom, the constituent fields and interactions available for calculations. Additionally, an EFT preserves the symmetries of the underlying theory to create its own Lagrangian that is an infinite sum including all possible terms compliant with those symmetries. An EFT therefore needs a power-counting scheme that organizes the terms by like magnitude. With the calculation of physical observables from an infinite sum of terms being impossible under any circumstances (and putting to the side the fact that such sums are almost always asymptotic and will diverge given enough terms), EFTs truncate at some order and leave an infinite number of higher-order terms out of calculations; the contribution of these terms to theoretical predictions constitutes the truncation error. The specific instantiation of EFT on which this thesis focuses is chiral effective field theory (χEFT), which treats interactions between protons and neutrons (“nucleons,” collectively) as mediated by the exchange of pions. 
In χEFT, (open full item for complete abstract)

    Committee: Richard Furnstahl (Advisor); Thomas Humanic (Committee Member); Yuri Kovchegov (Committee Member); Louis DiMauro (Committee Member) Subjects: Nuclear Physics; Physics; Statistics
  • 10. Huang, Rui Flexible Modeling Method Based on Bayesian Regression Using Multivariate Piecewise Linear Splines

    PhD, University of Cincinnati, 2024, Arts and Sciences: Statistics

    In this dissertation, we develop a flexible methodology using a multivariate piecewise linear spline (MPLS) regression model to address challenges in large clinical trials and observational studies. Our approach is characterized by several key contributions. Firstly, we introduce a novel method to model heteroscedastic data, ensuring robust estimation of the mean regression surface E(Y|X), and demonstrating strong predictive performance using piecewise basis functions. Secondly, our methodology is tailored to causal inference settings, where we incorporate treatment effects into our model and employ a Markov chain Monte Carlo method for simultaneous estimation of both mean and treatment surfaces. Thirdly, we extend our model to accommodate multivariate outcomes, enhancing its applicability to complex clinical studies where multiple endpoints are of interest. Lastly, to address the challenge of censored data, particularly in survival analysis, we integrate the truncated normal method with our spline model under a log-normal regression framework. This approach enables accurate estimation in the presence of truncated outcomes, enhancing the model's applicability to studies where survival data is critical. By comparing our model fitting with the BART method, we demonstrate the flexibility and accuracy of our approach using both simulated data and real data obtained from a secondary data core. This methodology provides valuable insights for examining treatment outcomes and heterogeneity in response across diverse patient populations, thereby contributing significantly to the field of medical research.

    Committee: Siva Sivaganesan Ph.D. (Committee Chair); Emily Kang Ph.D. (Committee Member); Bin Huang Ph.D. (Committee Member); Hang Joon Kim Ph.D. (Committee Member); Seongho Song Ph.D. (Committee Member) Subjects: Statistics
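The piecewise linear spline basis underlying an MPLS-style model is simple to write down in one dimension: hinge functions max(x - k, 0) at chosen knots, fit by least squares. This is a frequentist toy for intuition, not the dissertation's Bayesian multivariate machinery; the knots and data are invented:

```python
import numpy as np

rng = np.random.default_rng(8)

# Univariate piecewise linear spline regression: hinge basis functions
# max(x - k, 0) at fixed knots, fit by ordinary least squares.
x = rng.uniform(0, 10, 500)
y = np.where(x < 5, x, 10 - x) + 0.2 * rng.standard_normal(500)  # kinked truth + noise

knots = np.array([2.0, 4.0, 5.0, 6.0, 8.0])

def basis(xs):
    """Design matrix: intercept, linear term, and one hinge per knot."""
    return np.column_stack([np.ones_like(xs), xs] +
                           [np.maximum(xs - k, 0) for k in knots])

coef, *_ = np.linalg.lstsq(basis(x), y, rcond=None)

def predict(xs):
    return basis(xs) @ coef

rmse = np.sqrt(np.mean((predict(x) - y) ** 2))
```

Because a knot sits at the true kink (x = 5), the basis can represent the underlying function exactly, and the residual error approaches the noise level.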
  • 11. Dignan, Stephen A Comparison of Logistic PCA and Selected Data Embedding Procedures for Binary Data with Application to Breast Cancer and Glioblastoma Data

    Master of Science, The Ohio State University, 2024, Statistics

    Principal component analysis (PCA) is a data analysis technique used to reduce the dimension of a data set while retaining key patterns of variation, by transforming the data to a lower-dimensional space defined by orthonormal basis vectors that capture the directions of maximal variation. Logistic PCA (LPCA) is a novel technique that extends the benefits of PCA to data sets containing binary variables, allowing more widespread use of these methods in fields that frequently examine binary data, such as biomedical science and healthcare. We apply the logistic PCA method to two data sets, the first comprising tissue-sample data from patients diagnosed with breast cancer and the second comprising select genetic profiles of individuals diagnosed with brain tumors. An initial simulation study examined randomly generated binary data with a known clustering structure to evaluate retention of clustering in low-dimension plots created using PCA, LPCA, and t-distributed stochastic neighbor embedding (t-SNE), another frequently utilized data analysis technique. Results revealed that LPCA consistently outperforms PCA in terms of reconstruction error when the cluster probability parameters are close to 0.5, and that LPCA and PCA perform comparably for more extreme probability parameters. LPCA and t-SNE also show comparable clustering in the two-dimensional plots. In the analysis of the cancer-related data, two-dimensional data-embedding plots were generated, and principal component loadings obtained from each data set using LPCA and PCA were used to interpret data patterns in the context of cancer-related biomedical science and healthcare. 
Analysis revealed that interpretations of LPCA loadings provide information consistent with established biomedical research findings as well as new information and that (open full item for complete abstract)

    Committee: Yoonkyung Lee (Advisor); Asuman Turkmen (Committee Member) Subjects: Applied Mathematics; Biology; Genetics; Medicine; Oncology; Statistics
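Logistic PCA can be sketched as a low-rank factorization of the logit matrix, fit by gradient ascent on the Bernoulli log-likelihood. This is a minimal version for intuition, not necessarily the exact formulation used in the thesis; dimensions, rank, and step size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary data with rank-2 structure on the logit scale.
n, d, k = 200, 30, 2
U_true = rng.standard_normal((n, k))
V_true = rng.standard_normal((d, k))
P_true = 1 / (1 + np.exp(-(U_true @ V_true.T)))
X = (rng.random((n, d)) < P_true).astype(float)

# Logistic PCA sketch: factor the logit matrix Theta = U @ V.T and fit U, V
# by gradient ascent on the Bernoulli log-likelihood.
U = 0.01 * rng.standard_normal((n, k))
V = 0.01 * rng.standard_normal((d, k))
lr = 0.005
for _ in range(3000):
    R = X - 1 / (1 + np.exp(-(U @ V.T)))   # residual: data minus fitted probability
    U, V = U + lr * R @ V, V + lr * R.T @ U

P_hat = 1 / (1 + np.exp(-(U @ V.T)))
err = np.mean((X - P_hat) ** 2)            # constant-0.5 baseline would give 0.25
```

Anything below the constant-0.5 baseline of 0.25 reflects recovered low-rank structure, mirroring the reconstruction-error comparison in the abstract.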
  • 12. Henderson, Nikolas First-Passage Percolation on the Complete Graph with Weight-Zero Edges

    Doctor of Philosophy, The Ohio State University, 2024, Mathematics

    Given any graph, one can generate a random metric on the vertices by assigning random weights to the edges, then letting the distance d(x, y) between any two vertices x and y be the total weight of the lightest path from x to y. This model started off on the integer lattice Z^n under the name first-passage percolation, calling to mind a fluid percolating through a porous medium. More recently, however, the model has transitioned to the complete graph K_n, the Erdos-Renyi graph G(n, p), and other less geometric graph models, starting with [14] in 1999. While the theory has developed over the last 25 years, it has maintained a significant blind spot, namely that of weight-zero edges. Indeed, weightless edges are what connect FPP on Z^n with classical bond percolation, and the equivalent connections have not been explored even on K_n, where the theory of G(n, p) serves as a natural bond percolation analogue. We seek to begin building this connection by investigating the effects of weight-zero edges on the first-passage model on K_n, examining the typical distance and radius of these random environments and seeing how the theory of G(n, p) can shed light on the behavior of these metrics.

    Committee: David Sivakoff (Advisor); Matthew Kahle (Committee Member); Cesar Cuenca (Committee Member); Arthur Burghes (Other) Subjects: Mathematics; Statistics
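The model is easy to simulate: draw the edge weights, then run Dijkstra from a fixed vertex. The sketch below uses illustrative sizes and an Exp(1) weight law, and shows the phenomenon the abstract studies: once the weight-zero edges alone exceed the G(n, p) connectivity threshold, every distance collapses to 0:

```python
import heapq
import numpy as np

rng = np.random.default_rng(3)

def fpp_distances(n, p):
    """First-passage distances from vertex 0 on the complete graph K_n, where
    each edge weight is 0 with probability p and Exp(1) otherwise (Dijkstra)."""
    W = rng.exponential(1.0, (n, n))
    W[rng.random((n, n)) < p] = 0.0
    W = np.triu(W, 1)
    W = W + W.T                                   # symmetric weights, zero diagonal
    dist = np.full(n, np.inf)
    dist[0] = 0.0
    heap, done = [(0.0, 0)], np.zeros(n, dtype=bool)
    while heap:
        d, u = heapq.heappop(heap)
        if done[u]:
            continue                              # lazy deletion of stale entries
        done[u] = True
        for v in range(n):
            if v != u and d + W[u, v] < dist[v]:
                dist[v] = d + W[u, v]
                heapq.heappush(heap, (dist[v], v))
    return dist

d_plain = fpp_distances(300, 0.0)   # classical FPP on K_n: all weights Exp(1)
d_zero = fpp_distances(300, 0.1)    # 10% of edges are weightless
```

With p = 0.1 and n = 300, np is far above log n, so the weight-zero subgraph is connected with high probability and every first-passage distance is exactly 0; that is the bond-percolation connection the abstract seeks to develop.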
  • 13. Zhang, Rui Likelihood-free Inference via Deep Neural Networks

    Doctor of Philosophy, The Ohio State University, 2024, Statistics

    Many application areas rely on models that can be readily simulated but lack a closed-form likelihood, or an accurate approximation under arbitrary parameter values. In this "likelihood-free" setting, inference is typically simulation-based and requires some degree of approximation. Recent work on using neural networks to reconstruct the mapping from the data space to the parameters from a set of synthetic parameter-data pairs suffers from the curse of dimensionality, resulting in inaccurate estimation as the data size grows. In this dissertation, we propose new inferential techniques to overcome these limitations, beginning with a simulation-based dimension-reduced reconstruction map (RM-DR) estimation method. This approach integrates reconstruction map estimation with dimension-reduction techniques grounded in subject-specific knowledge. We examine the properties of reconstruction map estimation with and without dimension reduction, and describe the trade-off between information loss from data reduction and approximation error due to increasing input dimension of the reconstruction function. Numerical examples illustrate that the proposed approach compares favorably with traditional reconstruction map estimation, approximate Bayesian computation (ABC), and synthetic likelihood estimation (SLE). Additionally, in settings where likelihood evaluation is possible but expensive, we propose combining the RM-DR approach with local optimization as an alternative to using expensive global optimizers for parameter estimation, achieving comparable accuracy with improved time efficiency. To further incorporate uncertainty quantification, crucial for interpretation and making informed decisions, we introduce kernel-adaptive synthetic posterior estimation (KASPE). This method employs a deep learning framework to learn a closed-form approximation to the exact posterior, combined with a kernel-based adaptive sampling mechanism to generate synthetic training data. 
We study (open full item for complete abstract)

    Committee: Oksana Chkrebtii (Advisor); Dongbin Xiu (Advisor); Yuan Zhang (Committee Member) Subjects: Statistics
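The reconstruction-map idea (learn a mapping from reduced data back to parameters using synthetic parameter-data pairs) can be shown with a linear map and hand-picked summary statistics. This is a deliberately simple stand-in for the dissertation's deep-network estimator; the Normal model and summaries are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulation-based reconstruction map with dimension reduction: regress the
# generating parameters on low-dimensional summaries of simulated datasets,
# here for the toy model y_ij ~ Normal(mu_i, sigma_i).
n_train, n_obs = 2000, 200
mu = rng.uniform(-5, 5, n_train)
sigma = rng.uniform(0.5, 3, n_train)
data = mu[:, None] + sigma[:, None] * rng.standard_normal((n_train, n_obs))

def summaries(d):
    """Dimension reduction: n_obs observations -> 2 summary statistics."""
    return np.column_stack([d.mean(axis=1), np.log(d.std(axis=1))])

S = np.column_stack([np.ones(n_train), summaries(data)])   # design matrix
targets = np.column_stack([mu, np.log(sigma)])
B, *_ = np.linalg.lstsq(S, targets, rcond=None)            # linear reconstruction map

# Apply the learned map to a fresh dataset with known parameters.
fresh = 1.5 + 2.0 * rng.standard_normal((1, n_obs))
s_new = np.concatenate([[1.0], summaries(fresh)[0]])
mu_hat, log_sigma_hat = s_new @ B
```

The dimension reduction is what keeps the regression well-posed as the dataset grows: the map's input stays two-dimensional regardless of n_obs, which is the trade-off between information loss and approximation error discussed in the abstract.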
  • 14. Qiang, Rui Accounting for Preferential Sampling in Geostatistical Inference

    Doctor of Philosophy, The Ohio State University, 2024, Statistics

    In this dissertation, we address the problem of preferential sampling in spatial statistics, where the locations of point-referenced data are related to the latent spatial process of interest. Traditional geostatistical models can lead to biased inferences and predictions under preferential sampling. We introduce an extended Bayesian hierarchical framework that models both the observation locations and the observed data jointly, using a spatial point process for the locations and a geostatistical process for the observations. This allows for various point processes and non-Gaussian observation models. We review existing literature on geostatistical processes, spatial point processes, and their contribution to the preferential sampling problem. We develop a hierarchical model using a log-Gaussian Cox process (LGCP) for the sampling locations, combined with a Gaussian process for the observations. This approach is extended to incorporate non-Gaussian observations through spatial generalized linear models (GLMs), adding flexibility to the preferential sampling framework. To reduce computational complexity, we adopt techniques like nearest neighbor approximation. We also introduce simpler methods for accounting for preferential sampling that are less computationally demanding at the expense of prediction accuracy. We validate our models through extensive simulations, demonstrating their effectiveness in correcting biases and improving prediction accuracy. Applications to real-world data, such as the Global Historical Climate Network, showcase the practical utility of our models in environmental and ecological studies.

    Committee: Peter Craigmile (Advisor); Thomas Metzger (Committee Member); Oksana Chkrebtii (Advisor) Subjects: Statistics
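The bias that preferential sampling induces can be seen in a few lines: when monitoring sites are placed preferentially where a latent field is high, the naive sample mean overstates the regional mean. This is a one-dimensional toy with invented values, not the dissertation's LGCP machinery:

```python
import numpy as np

rng = np.random.default_rng(5)

# Latent spatial process on a 1-D grid: a smooth surface over the region.
x = np.linspace(0, 1, 1000)
field = np.sin(2 * np.pi * x) + 0.5 * x          # regional mean is about 0.25

# Preferential sampling: sites are more likely where the field is high
# (e.g., monitors placed where the process of interest is strongest).
probs = np.exp(2 * field)
probs /= probs.sum()
sites = rng.choice(len(x), size=200, replace=False, p=probs)
obs = field[sites] + 0.1 * rng.standard_normal(200)

naive_mean = obs.mean()      # ignores how the locations were chosen
true_mean = field.mean()     # the target regional mean
```

The naive mean is biased upward because high-field locations are over-represented; jointly modeling the locations and the data, as the abstract proposes, is what corrects this.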
  • 15. Katugoda Gedara, Ayesha Kumari Ekanayaka Refining Climate Model Projections: Spatial Statistical Downscaling and Bayesian Model Averaging for Climate Model Integration

    PhD, University of Cincinnati, 2024, Arts and Sciences: Statistics

    In this dissertation, two innovative statistical methodologies are developed to enhance the accuracy of climate model projections. Climate models simulate future global climate conditions but are constrained by coarse resolutions due to computational limitations. Consequently, these projections must be refined to finer resolutions before they can be effectively utilized in regional studies. The first methodology introduces a novel spatial statistical model for downscaling climate model projections. This approach significantly enhances precision by incorporating spatial dependencies and stands out by providing meaningful uncertainty estimates, a feature often missing in many previous downscaling approaches. Additionally, the method achieves computational efficiency through a basis representation, making it adept at managing large datasets effectively. Furthermore, climate models originate from various research groups, each based on different understandings and assumptions about the Earth's climate. This leads to significant uncertainty in the choice of models for subsequent analysis. Since there is no definitive way to select the best model or a few reliable ones, climate scientists often seek methods to combine projections from multiple models to mitigate this uncertainty. The second method in this dissertation introduces a comprehensive approach to integrate projections from multiple climate models using Bayesian Model Averaging (BMA). The proposed method effectively tackles the challenge of implementing BMA for climate model integration in a full Bayesian framework by employing Polya-Gamma augmentation and yields combined climate projections with improved accuracy and reliable uncertainty estimates.

    Committee: Emily Kang Ph.D. (Committee Chair); Bledar Konomi Ph.D. (Committee Member); Won Chang Ph.D. (Committee Member) Subjects: Statistics
  • 16. Lu, Wei-En Causal Inference in Case-Cohort Studies Using Restricted Mean Survival Time

    Doctor of Philosophy, The Ohio State University, 2024, Biostatistics

    In large observational epidemiological studies with survival outcomes and low event rates, the stratified case-cohort design is commonly used to reduce the cost associated with covariate measurement. The goal of many of these studies is to determine whether a cause-and-effect relationship, rather than an associative relationship, exists between some treatment and an outcome. Therefore, a method for estimating the causal effect under the stratified case-cohort design is needed. In this dissertation, we propose to estimate the causal effect of treatment on survival outcome using the restricted mean survival time (RMST) difference as the causal effect measure under the stratified case-cohort design, using propensity score stratification or matching to adjust for the confounding bias present in observational studies. First, we propose a propensity score stratified RMST estimation strategy under the stratified case-cohort design. We established the asymptotic normality of the proposed estimator. Based on the simulation study, the proposed method performs well and is simple to implement in practice. We also applied the proposed method to the Atherosclerosis Risk in Communities (ARIC) Study to estimate the marginal causal effect of high sensitivity C-reactive protein level on coronary heart disease survival. As an alternative to propensity score stratification, we propose a propensity score matched RMST estimation strategy under the stratified case-cohort design. We established the asymptotic normality of this estimator and, owing to the matching design, accounted for the correlation within matched sets. Simulation studies also demonstrated that the proposed method has adequate performance and outperforms the competing methods. The proposed method was also used to estimate the marginal causal effect of high sensitivity C-reactive protein level on coronary heart disease survival in the ARIC study.

    Committee: Ai Ni (Advisor); Eben Kenah (Committee Member); Bo Lu (Committee Member) Subjects: Biostatistics; Public Health; Statistics
  • 17. Kaur, Pashmeen Statistical Methods for Generalized Integer Autoregressive Processes

    Doctor of Philosophy, The Ohio State University, 2024, Statistics

    A popular and flexible time series model for counts is the generalized integer autoregressive process of order p, GINAR(p). These Markov processes are defined using thinning operators evaluated on past values of the process along with a discretely-valued innovation process. This class includes the commonly used INAR(p) process, defined with binomial thinning and Poisson innovations. GINAR processes can be used in a variety of settings, including modeling time series with low counts, and allow for more general mean-variance relationships, capturing both over- and under-dispersion. While many thinning operators and innovation processes have been proposed in the literature, less attention has been paid to comparing statistical inference and forecasting procedures across different choices of GINAR process. We provide an extensive study of exact and approximate inference and forecasting methods that can be applied to a wide class of GINAR(p) processes with general thinning and innovation parameters. We discuss the challenges of exact estimation when p is large. We summarize and extend asymptotic results for estimators of the process parameters, and present simulations comparing small-sample performance across methods. GINAR processes assume stationarity, which may not be an appropriate assumption for many real-world applications. Hence, we introduce a process for non-stationary count time series called the time-varying generalized integer autoregressive process, TV-GINAR(p), which allows for time-varying parameters modeled via basis functions. We derive statistical properties and discuss estimation strategies and statistical inference for this class of processes. We present simulation studies for the TV-GINAR(p) process and illustrate the methodology by fitting GINAR and TV-GINAR processes to a disease surveillance series and a patient scores dataset.
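    As a concrete instance of the class, the INAR(1) process with binomial thinning and Poisson innovations is X_t = a ∘ X_{t-1} + e_t, where a ∘ X counts the survivors of X independent Bernoulli(a) trials and e_t ~ Poisson(lam); its stationary mean is lam / (1 - a). A minimal simulation sketch using only the Python standard library — the function names and the initialization at the stationary mean are illustrative choices, not the dissertation's procedure:

```python
import math
import random

def simulate_inar1(alpha, lam, n, seed=0):
    """Simulate X_t = alpha o X_{t-1} + eps_t with binomial thinning
    and Poisson(lam) innovations; stationary mean is lam / (1 - alpha)."""
    rng = random.Random(seed)

    def poisson(mu):
        # Knuth's multiplication method; adequate for small mu
        threshold, k, p = math.exp(-mu), 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                return k
            k += 1

    x = poisson(lam / (1 - alpha))  # start near the stationary mean
    path = []
    for _ in range(n):
        # binomial thinning: each of the x current "individuals"
        # survives independently with probability alpha
        thinned = sum(rng.random() < alpha for _ in range(x))
        x = thinned + poisson(lam)  # add new arrivals (innovations)
        path.append(x)
    return path
```

    Swapping the Bernoulli survival draws or the Poisson innovations for other distributions yields other members of the GINAR class, which is how the more general mean-variance relationships (over- or under-dispersion) arise.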

    Committee: Peter F. Craigmile (Advisor); Christopher Hans (Advisor); Yoonkyung Lee (Committee Member) Subjects: Statistics
  • 18. Yuan, Yiwen Lasso Method with SCAD Penalty for Estimation and Variable Selection in Sequential Models

    Doctor of Philosophy (Ph.D.), Bowling Green State University, 2024, Statistics

    The sequential linear model is widely employed to analyze dynamic data in which the response variable at each time point incorporates the lagged response from the previous time point. With the lagged dependent response variables added to the model longitudinally, the issue of multicollinearity arises. In such situations, the Lasso method proposed by Tibshirani (1996) addresses parameter estimation and variable selection simultaneously. However, with high-dimensional data and multicollinearity, the Lasso method can introduce bias in coefficient estimation and inconsistency in variable selection. To improve on the Lasso, a number of alternative penalty terms have been proposed. Among these penalized methods, selecting an appropriate estimation and variable selection procedure is challenging because it requires balancing the trade-off between achieving low bias and maintaining high prediction accuracy. One of the primary inferential goals in the sequential linear model is to predict the response variable with high accuracy and relatively small prediction error, thereby saving time and expense. To this end, we propose a penalized estimation and variable selection method for the sequential linear model based on the Smoothly Clipped Absolute Deviation (SCAD) penalty (Fan and Li, 2001). The proposed SCAD method performs effectively, yielding parameter estimates with low bias and variable selection with low prediction error. To demonstrate the effectiveness of the proposed method, we conduct simulations comparing the SCAD method with other methods, including ordinary least squares (OLS), the Lasso, and the Adaptive Lasso, in both linear regression and sequential linear models.
    Since a time series refers to a sequence of data generated at each time point, where the lagged response variable at each time point is used as a predictor in the subsequent time point's model, accounting for errors based on assumptions, we simulate the data in (open full item for complete abstract)
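    For reference, the SCAD penalty has a closed form: it matches the Lasso penalty lam * |theta| for small coefficients, transitions quadratically, and flattens to the constant (a + 1) * lam^2 / 2 beyond a * lam, which is what removes the Lasso's bias on large coefficients. A minimal sketch of the penalty function — the function name is illustrative, and a = 3.7 is the default suggested by Fan and Li (2001):

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001) evaluated at one coefficient."""
    t = abs(theta)
    if t <= lam:
        return lam * t                 # Lasso-like linear growth near zero
    if t <= a * lam:
        # quadratic transition piece, continuous at both boundaries
        return -(t * t - 2 * a * lam * t + lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2     # constant: large coefficients are not shrunk further
```

    Because the penalty is constant beyond a * lam, large coefficients incur no additional shrinkage, in contrast to the Lasso's unbounded linear penalty.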

    Committee: Junfeng Shang Ph. D. (Committee Chair); John H. Boman Ph. D. (Other); Hanfeng Chen Ph. D. (Committee Member); John Chen Ph. D. (Committee Member) Subjects: Statistics
  • 19. Sharna, Silvia Enhancing Classification on Disease Diagnosis with Deep Learning

    Doctor of Philosophy (Ph.D.), Bowling Green State University, 2024, Data Science

    The use of statistical and machine learning methods in the collection, evaluation, and presentation of biological data is extensive, reflecting the need for precise quantitative assessment of the many challenges encountered in healthcare. The sparse nature of medical data, however, makes hidden patterns hard to find and consequently makes prediction a complex task. This dissertation discusses several biostatistical methods, including sample size determination in a balanced clinical trial, estimating cohort risk from case-control information, the odds ratio, and the Cochran-Mantel-Haenszel odds ratio, along with examples and the analysis of a real-life dataset to further solidify the concepts. Moreover, several classification models, namely Random Forest, Gradient Boosting, Support Vector Machine (SVM), Naive Bayes, K-Nearest Neighbors (KNN), Decision Tree (DT), Logistic Regression, and Artificial Neural Network (ANN), are applied in the analysis of the Wisconsin Breast Cancer (diagnostic and original) datasets, and their performance is compared. These classification models are also used in conjunction with ensemble learning methods, since ensemble methods significantly improve the predictive performance of the classification models. The models are evaluated using accuracy, AUC score, precision, and recall. Among the tree-based classification models, Random Forest (alone and in conjunction with ensemble learning) gives the highest accuracy, whereas in the later chapter the Artificial Neural Network gives the highest accuracy.
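    The evaluation metrics named above reduce to counts from the binary confusion matrix: accuracy = (TP + TN) / N, precision = TP / (TP + FP), and recall = TP / (TP + FN). A minimal sketch for 0/1 labels — the function name is illustrative, and this is the textbook definition rather than the dissertation's own evaluation code:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from paired 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real positives, how many are found
    return accuracy, precision, recall
```

    AUC, by contrast, is computed from ranked prediction scores rather than hard labels, which is why it is reported alongside these threshold-based metrics.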

    Committee: John Chen Ph.D. (Committee Chair); Mohammadali Zolfagharian Ph.D. (Other); Umar Islambekov Ph.D. (Committee Member); Qing Tian Ph.D. (Committee Member) Subjects: Biostatistics; Statistics
  • 20. Owusu-Agyemang, Samuel Fare Pricing: Social Equity Conversations in Public Transportation Pricing, and The Potential of Mobile Fare Payment Technology

    Doctor of Philosophy in Urban Studies and Public Affairs, Cleveland State University, 2024, Levin College of Public Affairs and Education

    When designing fare structures, transit agencies are primarily concerned with generating revenue. They must, however, adhere to social justice goals set by transit governing bodies. Considering that travel is a derived demand, equitable access to public transportation enhances accessibility to desired destinations, especially for transit-dependent households. This three-essay dissertation addresses selected research questions in the current transit pricing literature. The first essay examines the potential implications of fare capping policies for low-income transit users. The findings indicate that the introduction of monthly fare caps reduces total monthly fare expenditure among Extremely-Low-Income (ELI) riders and increases the likelihood of ELI riders earning unlimited monthly rides. The second essay explores how distance-based fares (DBF), compared with a flat fare, potentially alter the travel expenditure of transit riders. This research finds that ELI riders experience significantly lower fare spending under a DBF system than under a flat fare structure. The third essay tests current methodologies for extracting geodemographic information from mobile fare payment data. The findings show that land use type and the concentration of employment and housing in a neighborhood are significantly associated with the accuracy with which the residential locations of transit users can be inferred from mobile fare payment data. The analyses conducted in this dissertation are based on transit user activity data and survey data from a three-year federal grant led by NEORide, in partnership with multiple agencies in Ohio and Northern Kentucky. The research findings offer valuable insights into the current landscape of transit pricing and mobile fare payment technology in the United States.

    Committee: Robert Simons Ph.D. (Advisor); Floun'say Caver Ph.D (Committee Member); Thomas Hilde Ph.D (Committee Member); William Bowen Ph.D (Committee Member) Subjects: Demographics; Economics; Geographic Information Science; Public Policy; Statistics; Transportation; Transportation Planning; Urban Planning