Search Results (1 - 25 of 26 Results)

Parsons, Michael M. Planned Missing Data Designs in Communication Research
Doctor of Philosophy (PhD), Ohio University, 2013, Communication Studies (Communication)
Prominent among the many methodological challenges communication research faces are the relative lack of longitudinal research conducted in the discipline and the threats to validity that arise from the complex instrumentation necessary for inquiry into human interaction. This dissertation presents planned missing data designs (PMDs) as solutions to these challenges because PMDs can make research less burdensome, cheaper, faster, and more valid. Three studies illustrate the use of PMDs in communication research. Study one was a controlled-enrollment PMD investigation of the relationship between students' public speaking anxiety and communication competence in a semester-long longitudinal study. By using the controlled-enrollment design, this study had five measurement waves, but each participant was measured at no more than three of them. Results indicated that the controlled-enrollment design was effective at minimizing participant loss due to attrition and reducing the risk of testing effects due to repeated measurements. Study two was an efficiency-type PMD replication of Infante and Wigley's (1986) verbal aggressiveness scale validation study, in which each participant was presented with only 95 items from the 147-item survey instrument. Through the use of an efficiency design, this study was able to replicate the results of the original study with a dramatically reduced time burden on the participants, indicating that efficiency-type PMDs are an effective tool for scale shortening. Study three was an accelerated longitudinal PMD replication of Rubin, Graham, and Mignerey's (1990) longitudinal communication competence study, which measured change in students' communication competence over the course of a college career. Through the use of an accelerated longitudinal PMD, data collection was completed in just over one calendar year, far shorter than the three years the original study took to collect data. A flaw in participant retention procedures prevented data analysis from being conducted, but this study did effectively illustrate the increased methodological complexities caused by PMDs. This dissertation concludes that PMDs can be of substantial benefit to communication research and should be adopted in the discipline. Special attention must be paid, however, to the increased design complexity added by the use of these methods.

Committee:

Amy Chadwick, PhD (Advisor)

Subjects:

Communication

Keywords:

missing data; planned missing data; longitudinal; multiple imputation; communication; communication research

Lewis, Marsha S. A Series of Sensitivity Analyses Examining the What Works Clearinghouse's Guidelines on Attrition Bias
Doctor of Philosophy (PhD), Ohio University, 2013, Educational Research and Evaluation (Education)
This dissertation addresses the following overall research question: How do the amount and type of attrition, under varying assumptions of how much a subject's likelihood of dropping out of a study is related to his or her outcome, impact randomized controlled studies by contributing to systematic bias? The study first replicates a study conducted on behalf of the U.S. Department of Education's What Works Clearinghouse. Then, by systematically changing the magnitudes of the coefficients representing how strongly a subject's likelihood of dropping out of the study is correlated with his or her outcome, and by varying the differential attrition rates, the study helps address the question: How sensitive is the measure of bias to changes in attrition rates and/or the relationship between outcome and a participant's propensity to respond in randomized controlled trials? The study also adds to the complexity of the bias modeling by addressing the question: How does varying the random error to simulate variations in the reliability of instruments used across studies impact the attrition thresholds? The methodology consisted of a series of eight Monte Carlo simulations (50,000 replications each) programmed in R. Each simulation varied one or more of the following components: the relationship (or correlation) between the outcome at follow-up for a study participant and his or her propensity to respond (or likelihood of not attriting from the study); the magnitude of differential attrition between the treatment and control groups; and the random error generated by the reliability of the outcome instruments. The sensitivity analyses indicate that the What Works Clearinghouse attrition bias model is sensitive to changes in the assumptions about the relationship between attrition and outcome. The patterns in the findings indicate that the difference in the relationship between the propensity to respond and outcome in the model is as important to the bias estimates as the overall and differential attrition. Modifying the attrition bias formula so that the relationship between the propensity to respond and the outcome carries less weight until it reaches a threshold magnitude may provide more specific guidance to reviewers of studies in which, for example, that relationship is assumed to be zero or near zero in the control group. Varying the random error term in the model to address the potential impact of reliability on the bias thresholds indicates that the WWC attrition bias thresholds may be somewhat sensitive to varying reliabilities of instruments across studies. This sensitivity may necessitate the development of more specific guidance for reviewers of certain types of studies for inclusion in the U.S. Department of Education's What Works Clearinghouse.
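
The following Python sketch illustrates one cell of this kind of attrition-bias simulation; the dissertation's simulations were programmed in R with 50,000 replications per cell, so the sample size, attrition rates, correlation value, and replication count below are illustrative assumptions only.

```python
# Minimal sketch of one cell of an attrition-bias Monte Carlo (illustrative values).
import numpy as np

rng = np.random.default_rng(2013)

def one_replication(n=2000, true_effect=0.0, rho=0.3,
                    drop_treat=0.10, drop_ctrl=0.20):
    """Return the bias of the estimated treatment effect after attrition.

    rho        : correlation between a subject's outcome and the latent
                 propensity to respond (the key sensitivity parameter).
    drop_treat : attrition rate in the treatment group.
    drop_ctrl  : attrition rate in the control group (differential attrition).
    """
    treat = rng.integers(0, 2, n)                      # random assignment
    outcome = true_effect * treat + rng.normal(size=n)
    # latent response propensity correlated with the outcome
    propensity = rho * outcome + np.sqrt(1 - rho**2) * rng.normal(size=n)
    # drop the lowest-propensity subjects at group-specific rates
    responds = np.ones(n, dtype=bool)
    for grp, rate in ((1, drop_treat), (0, drop_ctrl)):
        idx = np.where(treat == grp)[0]
        k = int(rate * idx.size)
        responds[idx[np.argsort(propensity[idx])[:k]]] = False
    est = (outcome[responds & (treat == 1)].mean()
           - outcome[responds & (treat == 0)].mean())
    return est - true_effect

bias = np.mean([one_replication() for _ in range(5000)])
print(f"mean bias of the treatment-effect estimate: {bias:.3f}")
```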

Committee:

Gordon Brooks, PhD (Advisor)

Subjects:

Education; Education Policy; Educational Evaluation; Statistics

Keywords:

attrition; bias; Monte Carlo simulation; What Works Clearinghouse; missing data

Modur, Sharada P. Missing Data Methods for Clustered Longitudinal Data
Doctor of Philosophy, The Ohio State University, 2010, Statistics

Recently, medical and public health research has focused on the development of models for longitudinal studies that aim to identify individuals at risk for obesity by tracking childhood weight gain. The National Longitudinal Surveys of Youth 79 (NLSY79), which includes a random sample of women with biometric information on their biological children collected biennially, has been considered. A mixed model with three levels of clustered random effects has been proposed for the estimation of child-specific weight trajectories while accounting for the nested structure of the dataset. Included in this model is a regression equation approach to address any remaining heterogeneity in the within-child variances. Specifically, a model has been implemented to fit the log of the within-child variances as a function of time. This allows for more flexibility in modeling residual variances that appear to be increasing over time. Using the EM algorithm with a Newton-Raphson update, all the parameters of the model are estimated simultaneously.
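
As a rough illustration of modeling the log of within-child residual variance as a function of time, the Python sketch below uses a simple two-stage fit on simulated data; the dissertation estimates all parameters simultaneously via EM with a Newton-Raphson update, which is not reproduced here, and all simulated values are arbitrary.

```python
# Rough two-stage illustration (not the dissertation's simultaneous EM fit):
# stage 1 fits a per-child growth curve, stage 2 regresses the log of the
# squared residuals on time to capture variance that grows over time.
import numpy as np

rng = np.random.default_rng(0)
n_children, n_waves = 200, 5
time = np.arange(n_waves)

log_sq_resid, times = [], []
for _ in range(n_children):
    # child-specific linear weight trajectory with time-increasing noise
    a, b = rng.normal(30, 3), rng.normal(2, 0.5)
    y = a + b * time + rng.normal(scale=np.exp(0.1 * time))
    # stage 1: per-child growth curve, keep residuals
    coef = np.polyfit(time, y, 1)
    resid = y - np.polyval(coef, time)
    log_sq_resid.extend(np.log(resid**2 + 1e-8))
    times.extend(time)

# stage 2: log within-child variance as a linear function of time
slope, intercept = np.polyfit(times, log_sq_resid, 1)
print(f"log-variance ~ {intercept:.2f} + {slope:.2f} * time")
```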

A second aspect of the research presented is the analysis of missing data. Extensive exploratory analysis revealed that intermittent missingness was prevalent in the relevant subset of the NLSY79 dataset. Starting with the assumptions of MCAR and MAR, selection models are built up to appropriately account for the missingness mechanism at play. A factorization of the multinomial distribution as a product of dependent binary observations is applied to model intermittent missingness. Logit models for dependent binary observations are used to fit selection models for missingness under the assumptions of MAR and MCAR. The NMAR case for clustered longitudinal data is discussed as an area for future research.

Committee:

Elizabeth Stasny, PhD (Advisor); Christopher Hans, PhD (Advisor); Eloise Kaizar, PhD (Committee Member); John Casterline, PhD (Committee Member)

Subjects:

Statistics

Keywords:

Longitudinal models; multilevel models; missing data analysis

Tang, Yuxiao. Inference on cross correlation with repeated measures data
Doctor of Philosophy, The Ohio State University, 2004, Statistics
We discuss the problem of estimating the correlation coefficient between two variables observed in a longitudinal study. We assume that they follow a bivariate normal distribution, and that the repeated measures taken on the same subject follow a multivariate normal model. We consider two cases: when the data are complete and when they are incomplete. First, when all the observations are available, we introduce two estimators: the marginal mean estimator and the estimator based on the mean of Fisher's z values. These two estimators are functions of the sample cross correlations computed at each time point. Asymptotic distributions of the two estimators are given. After comparing these two estimators with the MLE, we find that the performance of the estimator based on the mean of Fisher's z values is as good as that of the MLE, and the former estimator is much easier to compute. When some observations are missing with an ignorable missing-data mechanism, we propose four estimators: the group weighted mean estimator, the marginal mean estimator, the estimator based on the weighted Fisher's z values, and the weighted marginal mean estimator. In the first approach, we group the data based on the missing pattern, estimate the correlation for each group, and take the weighted average. In the other three approaches, we compute the sample correlation coefficients based on cross-sectional data and combine the marginal information in different ways. We obtain the asymptotic distributions of these estimators. Using simulation we compare them with the MLE. We find that these estimators are almost as good as the MLE while being much easier to compute, except for the group weighted mean estimator. We discuss the robustness of these estimators as the nuisance parameters associated with the multivariate normal model vary. Further, we apply our approaches to data from a dog diet study and an AIDS study separately to illustrate the advantages of the proposed approaches. We also discuss how to test the equality of correlations over time for the cases with complete and incomplete data sets from a multivariate normal model. We compare several tests and conclude that the asymptotic test based on Fisher's z transformations performs well.
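
A minimal Python sketch of the "mean of Fisher's z" idea described above: compute the cross correlation at each time point, average the z-transformed values, and back-transform. For simplicity the simulated waves are generated independently, which ignores the within-subject correlation that the dissertation models; all values are placeholders.

```python
# Mean-of-Fisher's-z combination of per-wave cross correlations (simulated data).
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_times, true_rho = 100, 4, 0.5

# simulate bivariate-normal (X, Y) pairs at each time point
cov = [[1.0, true_rho], [true_rho, 1.0]]
x = np.empty((n_subjects, n_times))
y = np.empty((n_subjects, n_times))
for t in range(n_times):
    xy = rng.multivariate_normal([0, 0], cov, size=n_subjects)
    x[:, t], y[:, t] = xy[:, 0], xy[:, 1]

# cross correlation at each time point
r_t = np.array([np.corrcoef(x[:, t], y[:, t])[0, 1] for t in range(n_times)])

z_mean = np.arctanh(r_t).mean()      # mean of Fisher's z values
rho_hat = np.tanh(z_mean)            # back-transform to the correlation scale
print(f"per-wave correlations: {np.round(r_t, 3)}, combined estimate: {rho_hat:.3f}")
```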

Committee:

Haikady Nagaraja (Advisor)

Subjects:

Statistics

Keywords:

cross correlation; repeated measurements; missing data

Chen, Tian. Judgment Post-Stratification with Machine Learning Techniques: Adjusting for Missing Data in Surveys and Data Mining
Doctor of Philosophy, The Ohio State University, 2013, Statistics
Missing data is found in every type of data collection. How to deal with missing data has long been discussed in the survey sampling literature. It has not, however, been the topic of much research involving the huge data sets common in the data mining setting. In this dissertation, we combine ideas from the survey sampling and data mining literature to develop methods for handling missing data in both contexts. Judgement Post-Stratification (JPS) is a data analysis method, motivated by ranked set sampling (RSS), that uses judgement ranking for post-stratification. This dissertation briefly introduces RSS and JPS. Then it connects the JPS method with machine learning (ML) techniques in two ways. One is to use ML techniques to build a ranking function, thereby solving the judgement ranking problem. The other is to compare the estimates from the JPS method with these well-known ML techniques and provide efficiency measurements for the JPS method. We investigate the effect of set size, the number of units ranked at one time, through simulation studies. We also consider possible extensions for JPS, such as proportional proration. To our knowledge, we provide the first systematic study of the influences of three types of missing data on various ML techniques using simulated data. Finally, two real-life examples are used to demonstrate the application of the JPS method to real-world problems.

Committee:

Elizabeth Stasny (Committee Co-Chair); Tao Shi (Committee Co-Chair); Omer Ozturk (Committee Member); Aleix Martinez (Committee Member)

Subjects:

Statistics

Keywords:

JPS, Machine Learning, Missing Data

Nwosu, Ann. Sensitivity Analyses of the Effect of Atomoxetine and Behavioral Therapy in a Randomized Control Trial
Master of Science, The Ohio State University, 2017, Public Health
Missing data is a common feature in longitudinal studies. The CHARTS study is a randomized controlled trial of the effectiveness of atomoxetine and parent management training in reducing attention deficit/hyperactivity disorder symptoms of non-compliance and inattention in a population of children diagnosed with Autism Spectrum Disorder. During the ten-week trial period, several children were lost to follow-up or were missing cognitive outcome data for various reasons. The cognitive data are analyzed both under an assumption of missing at random and under one of missing not at random. This paper analyzes the sensitivity of the data to departures from missing at random using the pattern-mixture factorization. Such an analysis strategy is structured to improve inference on the efficacy of the drug and behavioral intervention.
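
Below is a hedged Python sketch of a delta-type pattern-mixture sensitivity analysis on simulated data, not the CHARTS analysis itself: impute the missing follow-up values from a regression fit to completers, then shift the dropouts' imputed values by a range of delta values to represent departures from missing at random. All numbers are placeholders, and dropout is generated at random here purely for illustration.

```python
# Delta-adjustment pattern-mixture sensitivity sketch (illustrative values only).
import numpy as np

rng = np.random.default_rng(7)
n = 200
baseline = rng.normal(50, 10, n)
followup = baseline + rng.normal(-5, 5, n)        # true mean change of -5
observed = rng.random(n) > 0.3                    # ~30% lost to follow-up

for delta in (0.0, -2.0, -5.0):                   # sensitivity parameter
    imputed = followup.copy()
    # regression-based imputation: fit follow-up on baseline among completers
    b, a = np.polyfit(baseline[observed], followup[observed], 1)
    pred = a + b * baseline[~observed] + rng.normal(0, 5, (~observed).sum())
    imputed[~observed] = pred + delta             # MNAR shift for dropouts
    print(f"delta={delta:+.1f}: estimated mean change = {np.mean(imputed - baseline):.2f}")
```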

Committee:

Rebecca Andridge (Advisor); James Odei (Committee Member)

Subjects:

Biostatistics; Public Health

Keywords:

missing data; atomoxetine; RCT

Carmack, Tara Lynn. A Comparison of Last Observation Carried Forward and Multiple Imputation in a Longitudinal Clinical Trial
Master of Science, The Ohio State University, 2012, Public Health
In randomized clinical trials, the presence of missing data presents challenges in determining the actual treatment effect of the study. It is particularly problematic in longitudinal studies when patients followed over time withdraw from the study. Although it is important to anticipate and attempt to prevent these drop-outs in the study design, it is still likely that a significant amount of missingness will be present in the final data. It is important to have statistical methods that effectively analyze data containing missing values and produce unbiased results. This study compares several methods for handling missing data in longitudinal trials. The focus is on the single imputation method of last observation carried forward, which is compared to complete case analysis, multiple imputation, and two additional versions of multiple imputation in which everyone was imputed as if they were actually in the control group (placebo imputation). We simulated a randomized controlled trial with a treatment and placebo group and two time points. After creating the data, we imparted missingness in the follow-up time point. We considered three mechanisms for the missing data: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). The results indicated that in all situations, last observation carried forward produced extremely biased estimates of treatment effect. Both placebo imputations produced similarly biased estimates. Complete case analysis was only valid in the situation where the data were MCAR. Traditional multiple imputation using regression performed the best of all the methods.
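
A simplified Python sketch of this kind of comparison on a simulated two-arm, two-time-point trial with baseline-dependent dropout; the multiple imputation here draws from a single fitted regression (ignoring parameter uncertainty), so it illustrates the mechanics only and does not reproduce the study's settings or findings.

```python
# Complete-case vs. LOCF vs. basic regression-based MI on simulated trial data.
import numpy as np

rng = np.random.default_rng(42)
n, true_effect = 500, 2.0

treat = rng.integers(0, 2, n)
y0 = rng.normal(10, 3, n)                                    # baseline
y1 = y0 + true_effect * treat + rng.normal(0, 2, n)          # follow-up
# MAR-style dropout: missingness depends only on the observed baseline
p_miss = 0.5 / (1 + np.exp(-(y0 - 10) / 3))
miss = rng.random(n) < p_miss

def arm_diff(y, keep=None):
    keep = np.ones(n, dtype=bool) if keep is None else keep
    return y[keep & (treat == 1)].mean() - y[keep & (treat == 0)].mean()

cc = arm_diff(y1, ~miss)                  # complete-case analysis
locf = arm_diff(np.where(miss, y0, y1))   # last observation carried forward

# basic regression-based MI: fit y1 ~ y0 + treat on completers, draw 20 imputations
X = np.column_stack([np.ones(n), y0, treat])
beta, *_ = np.linalg.lstsq(X[~miss], y1[~miss], rcond=None)
sigma = (y1[~miss] - X[~miss] @ beta).std()
mi = []
for _ in range(20):
    y_imp = y1.copy()
    y_imp[miss] = X[miss] @ beta + rng.normal(0, sigma, miss.sum())
    mi.append(arm_diff(y_imp))

print(f"true {true_effect:.2f} | complete case {cc:.2f} | "
      f"LOCF {locf:.2f} | MI {np.mean(mi):.2f}")
```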

Committee:

Rebecca Andridge, PhD (Advisor); Abigail Shoben, PhD (Committee Member)

Subjects:

Biostatistics

Keywords:

LOCF; Multiple Imputation; missing data; randomized clinical trials

Kwon, Hyukje. A Monte Carlo Study of Missing Data Treatments for an Incomplete Level-2 Variable in Hierarchical Linear Models
Doctor of Philosophy, The Ohio State University, 2011, EDU Policy and Leadership

This study was designed to evaluate the performance of missing data treatments with restrictive and inclusive strategies in a two-level hierarchical linear model with missing at random (MAR) missingness, in terms of bias, root mean square error (RMSE), and the width and coverage rate of confidence intervals. The missing data treatments included in this study were listwise deletion, mean substitution, restrictive and inclusive EM, and restrictive and inclusive multiple imputation (MI). The number of level-2 predictors, proportion of missingness (PM), and sample size (N) were manipulated as study factors.

The number of level-2 predictors and sample size appeared not to have a distinct impact on the performance of missing data treatments for level-2 missing data in terms of bias. However, the proportion of missing data had a large effect on performance: with a larger proportion of missingness, the relative bias differences among missing data treatments tend to increase for most fixed effects and some random effects.

Inclusive MI and listwise deletion generally outperformed the other missing data treatments producing practically acceptable bias in most fixed effects that are highly related to missingness. Restrictive EM and inclusive EM also performed well with some exceptions with large proportion of missingness (PM=30%). Restrictive MI and mean substitution produced unacceptable bias even with smaller proportions of missingness (PM=5% or 15%). For random effects, every missing data treatment was effective except for the non-significant Tau11.

Listwise deletion tends to provide the largest RMSE on both fixed and random effects. The relative difference in the RMSE between listwise deletion and the other missing data treatments was substantially large with large proportion of missingness (PM=30%) and smaller sample sizes (N<80 or 40).

Furthermore, listwise deletion provided the largest confidence intervals for both fixed and random effects. Again, the difference in the confidence interval width between listwise deletion and the other missing data treatments was substantially large with smaller sample sizes (N<80 or 40) and large proportion of missingness (PM=30%). The confidence interval coverage rates of mean substitution and inclusive EM were problematic with short confidence intervals for fixed effects when proportions of missingness are larger (PM>15% or 30%). Listwise deletion and inclusive EM also provided poor confidence interval coverage on Tau00 when missingness is large (PM=30%) and sample size is small (N=40).

Therefore, inclusive MI and restrictive EM may be viable options for applied researchers handling MAR missingness at level 2 in HLM. However, inclusive MI is preferred when the proportion of missing data is large (PM=30%). Finally, it should be noted that no missing data treatment was effective on non-significant fixed or random effects smaller than .30.

Committee:

Richard Lomax, PhD (Advisor); Ann O'Connell, EdD (Committee Member); Dorinda Gallant, PhD (Committee Member)

Subjects:

Educational Tests and Measurements

Keywords:

missing data treatment; listwise deletion; mean substitution; EM; multiple imputation; inclusive; restrictive; bias; RMSE; confidence interval; HLM; intercepts- and slopes- as-outcomes

Hening, Dyah A. Missing Data Imputation Method Comparison in Ohio University Student Retention Database
Master of Science (MS), Ohio University, 2009, Industrial and Systems Engineering (Engineering and Technology)
Ohio University has been conducting research on first-year-student retention to prevent dropouts (OU Office of Institutional Research, First-Year Students Retention, 2008). Yet the data sets have more than 20% missing values, which can lead to bias in prediction. Missing data affects the ability to generalize results to the target population. This study categorizes the missing data in each variable into one of three types: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). After the missing data is identified, the proper method of handling it is discussed. The proposed method is validated through developed and tested models. The goal of this work is to explore methods of imputing missing data and to apply them to the Ohio University student retention dataset.

Committee:

David Koonce, PhD (Advisor); Dusan Sormaz, PhD (Committee Member); Diana Schwerha, PhD (Committee Member); Valerie Conley, PhD (Committee Member)

Subjects:

Higher Education; Industrial Engineering

Keywords:

Data Imputation; Missing data; MNAR; MCAR; MAR; student retention

Uzdavines, Alex. Stressful Events and Religious Identities: Investigating the Risk of Radical Accommodation
Master of Arts, Case Western Reserve University, Psychology
At some point in their lives, everyone will experience a stressful life event. Usually, someone can cope with and make meaning from the event. However, the body of research investigating the impact of severe and/or chronic exposure to stressful life events on the brain shows that harmful effects of stress exposure accumulate. Considering the extant literature regarding religious meaning making in light of these findings and the robust literature on spiritual transformation following stressful life events, I developed three hypotheses: 1) stressful life events increase risk of (non)religious ID change, 2) earlier events continued to impact later ID changes, and 3) risk of ID change was similar across change groups. This study analyzed a nationally representative longitudinal dataset of US children born between 1980 and 1984 (N = 8984). The final analyses used multiple imputation to account for missing data and did not find evidence supporting the hypotheses.

Committee:

Julie Exline, Ph.D. (Committee Chair); Heath Demaree, Ph.D. (Committee Member); Arin Connell, Ph.D. (Committee Member)

Subjects:

Health; Mental Health; Psychology; Religion; Spirituality

Keywords:

stressful life events; conversion; atheism; religion; spirituality; missing data analysis; multiple imputation by chained equations; longitudinal; national longitudinal survey of youth; meaning making; open science

Li, Jian. Effects of Full Information Maximum Likelihood, Expectation Maximization, Multiple Imputation, and Similar Response Pattern Imputation on Structural Equation Modeling with Incomplete and Multivariate Nonnormal Data
Doctor of Philosophy, The Ohio State University, 2010, EDU Policy and Leadership

The purpose of this study is to investigate the effects of missing data techniques in SEM under different multivariate distributional conditions. Using Monte Carlo simulations, this research examines the performance of four missing data methods in SEM: full information maximum likelihood (FIML), the expectation-maximization (EM) procedure, multiple imputation (MI), and similar response pattern imputation (SRPI), under the missing data mechanisms of missing completely at random (MCAR) and missing at random (MAR). The effects of three independent variables (sample size, missing proportion, and distribution shape) are investigated on parameter and standard error estimation, standard error coverage, and model fit statistics. An inter-correlated 3-factor CFA model is used. The findings of this study indicate that FIML is the most robust method in terms of parameter estimate bias; FIML and MI generate almost equally accurate standard error coverage; and MI is the best in terms of estimation efficiency/accuracy and model rejection rate.

The results of SRPI in this study are consistent with previous studies in the literature. Generally speaking, SRPI produces unbiased parameter estimates for factor loadings and factor correlations under MCAR. However, when there are severe missingness or nonnormality conditions in the data or when the sample size is very small, it has bias problems on error variance estimates for indicators with moderate to low factor loadings. Some of the merits regarding SRPI found in this study are that it is more efficient than FIML under MCAR when data not only have small to moderate missingness, but also are severely nonnormal; it was also found to be more efficient for factor loading estimates of those indicators with missing data in MAR when the missing percentage is high and the nonnormality condition is most severe.

Recommendations regarding when to use each of the missing data techniques are provided at the end of the study. Directions for future work building on this research are also discussed.

Committee:

Richard G. Lomax, PhD (Advisor); Ann A. O’Connell, PhD (Committee Member); Pamela M. Paxton, PhD (Committee Member)

Subjects:

Educational Psychology; Statistics

Keywords:

Structural equation modeling; missing data techniques; distributional conditions; confirmative factor analysis; Monte Carlo; simulation procedure

Gotardo, Paulo Fabiano Urnau. Modeling Smooth Time-Trajectories for Camera and Deformable Shape in Structure from Motion with Occlusion
Doctor of Philosophy, The Ohio State University, 2010, Electrical and Computer Engineering

This Ph.D. dissertation focuses on the computer vision problems of rigid and non-rigid structure from motion (SFM) with occlusion. The predominant approach to solve SFM is based on the factorization of an input data matrix, W, using singular value decomposition (SVD). In practical application of SFM, however, 2D scene points cannot be tracked over all images due to occlusion. Therefore, matrix W is often missing a large portion of its entries and standard matrix factorization techniques such as SVD cannot be used directly. We assume the columns of the input observation matrix W describe the trajectories of 2D points whose positions change only gradually over time. This is the case of point tracks obtained from video images provided by a single camera that moves smoothly (i.e., gradually) around the structure of interest. We then derive a family of efficient matrix factorization algorithms that estimate the column space of W using compact parameterizations in the Discrete Cosine Transform (DCT) domain. Our methods tolerate high percentages of missing data and incorporate new models for the smooth time-trajectories of 2D-points, affine and weak-perspective cameras, and 3D deformable shape.

We solve the rigid SFM problem by estimating the smooth time-trajectory of a single camera. By considering a weak-perspective camera model from the outset, we directly compute Euclidean 3D shape reconstructions without requiring post-processing steps. Our results on datasets with high percentages of missing data are positively compared to those in the literature.

In non-rigid SFM, we propose a novel 3D shape trajectory approach that solves for the deformable structure as the smooth time-trajectory of a single point in a linear shape space. A key result shows that, compared to state-of-the-art algorithms, our non-rigid SFM method can better model complex articulated deformation with higher frequency deformation components. We also offer an approach for the challenging problem of non-rigid SFM with missing data.
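
The sketch below is a heavily simplified Python illustration of the underlying idea, not the dissertation's algorithm: a smooth 1D point track with missing frames is represented by a few low-frequency DCT basis vectors whose coefficients are fit by least squares on the observed frames only. The basis size, noise level, and missing rate are arbitrary choices.

```python
# Fit a compact DCT parameterization of a smooth track using observed frames only,
# then evaluate the reconstruction on the frames that were missing.
import numpy as np

rng = np.random.default_rng(3)
F, K = 120, 6                                  # frames, number of DCT basis vectors

# truncated DCT-II-style basis over time (columns are low-frequency cosines)
t = np.arange(F)
B = np.column_stack([np.cos(np.pi * (t + 0.5) * k / F) for k in range(K)])

# a smooth ground-truth trajectory plus noise, with ~40% of frames unobserved
truth = 3 * np.sin(2 * np.pi * t / F) + 0.5 * np.cos(6 * np.pi * t / F)
track = truth + 0.05 * rng.normal(size=F)
observed = rng.random(F) > 0.4

# least-squares fit of the DCT coefficients using observed frames only
coef, *_ = np.linalg.lstsq(B[observed], track[observed], rcond=None)
recovered = B @ coef

err = np.abs(recovered[~observed] - truth[~observed]).mean()
print(f"mean absolute error on the missing frames: {err:.3f}")
```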

Committee:

Aleix Martinez, PhD (Advisor); Kevin Passino, PhD (Other); Yuan Zheng, PhD (Other)

Subjects:

Artificial Intelligence; Computer Science; Electrical Engineering; Mathematics; Motion Pictures; Robots

Keywords:

structure from motion; matrix factorization; missing data; camera trajectory; shape trajectory

Rapur, Niharika. Treatment of Data with Missing Elements in Process Modelling
MS, University of Cincinnati, 2003, Engineering : Industrial Engineering
Advancements in monitoring and control have led to the collection of vast amounts of information for analysis. Inherent failures or human error in machines and processes lead to a large amount of missing information in the data collected for analysis. This research pertains to improving the quality of a dataset by generating missing values based on the observed data. Many methods have been proposed for the problem of incomplete information. These methods include single imputation techniques, multiple imputation techniques, principal components analysis based methods, and neural networks based methods. But these methods have inherent limitations that restrict their usage. Two areas, the initialization method and the iteration process, were identified as the key areas where efficiency could be improved. A new technique which uses clustering as a tool for generating missing values was constructed. Algorithms for missing information based on clustering techniques such as Fuzzy C-Means and subtractive clustering were developed. In order to be applied to real-life data, these algorithms must be able to predict the missing values quickly and accurately. This study involves comparing existing methods and new techniques in terms of accuracy of prediction and computational time. Through comparative studies it was shown that the subtractive clustering method performs better and faster clustering of data. It reaches higher levels of convergence and is thus a better option than FCM clustering for the generation of missing values. As an initialization method, clustering based techniques predict initial values that are closer to the actual values in the datasets used for testing. Tests conducted on industrial datasets also show considerable improvement in accuracy and speed of predicting missing values. Tests on incomplete data collected for silicon micromachined atomizer development show that the values predicted by the clustering based algorithms follow the same distribution as the original observed data. In this study, clustering based algorithms have superior performance statistics in comparison to the existing methods.
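
As a hedged stand-in for the clustering-based imputation idea, the Python sketch below uses plain k-means from scikit-learn rather than the Fuzzy C-Means or subtractive clustering algorithms developed in the thesis: cluster the complete rows, assign each incomplete row to the nearest centroid on its observed columns, and fill the missing columns from that centroid. The data and cluster count are made up for illustration.

```python
# Cluster-based imputation sketch: k-means stands in for FCM/subtractive clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.5, (50, 3)) for m in (0.0, 3.0, 6.0)])  # 3 groups
mask = rng.random(X.shape) < 0.15                                       # ~15% missing
X_miss = np.where(mask, np.nan, X)

complete = ~np.isnan(X_miss).any(axis=1)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_miss[complete])

X_imp = X_miss.copy()
for i in np.where(~complete)[0]:
    obs = ~np.isnan(X_miss[i])
    # distance to each centroid using only the observed columns of this row
    d = np.linalg.norm(km.cluster_centers_[:, obs] - X_miss[i, obs], axis=1)
    X_imp[i, ~obs] = km.cluster_centers_[d.argmin(), ~obs]

print(f"mean absolute imputation error: {np.abs(X_imp[mask] - X[mask]).mean():.3f}")
```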

Committee:

Dr. Samuel H. Huang (Advisor)

Subjects:

Engineering, Industrial

Keywords:

missing data; data; complete; incomplete; data cleansing

Chen, Yanran. Influence of Correlation and Missing Data on Sample Size Determination in Mixed Models
Doctor of Philosophy (Ph.D.), Bowling Green State University, 2013, Statistics
Sample size determination plays an important role in clinical trials. In the early stage of a design, we have to decide the right amount of data needed to reach desired accuracy in the follow-up statistical hypothesis testing. With a stated significance level, the accuracy of a hypothesis test largely depends on the test power. The test power measures the probability of correctly detecting a significant difference, which indicates how trustworthy the statistical decision is from the test in the case of a significant difference. Test power and sample size are closely related. In general, the testing power goes up when the sample size increases. However, the exact dependency of the testing power on the sample size is rather complicated to pin down. Furthermore, missing data and certain correlation structures in the data that are particularly common in clinical trials by nature would complicate the relationship between sample size and test power. It is thus crucial to assess and to analyze the effects of missing data and correlation structures of measurements on the sample size and test power.

In this dissertation, we focus on estimating the adequate sample size for testing means for longitudinal data with missing observations. We derive formulas for determining the sample size in the settings of compound symmetry and autoregressive order one models. The longitudinal data structure is commonly observed in clinical trials, in which measurements are repeatedly recorded over time points for each subject. These measurements of different subjects are mutually independent, while the repeated measurements at different time points within the same subject are correlated. It is assumed that the missing data mechanism is missing completely at random (MCAR). We mainly study the compound symmetry and autoregressive order one correlation structures. The generalized estimating equations (GEE) method for the analysis of longitudinal data, proposed by Liang and Zeger (1986), is applied to estimate the parameters in the mixed models. The GEE methodology yields consistent estimators of the parameters and of their variances by introducing a working correlation matrix. The GEE estimators are robust to the choice of working correlation matrix under the assumption of MCAR. The sample size estimation procedure utilizes all the available data and incorporates the correlations within the repeated measurements and the missing data, while the derived formulas stay simple. To evaluate the performance of the formulas through empirical power, simulation studies were carried out. The simulation results show that the derived sample size formulas effectively reflect the influence of correlation structures and missing data on the sample size and test power.
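
For orientation only, here is a standard back-of-the-envelope Python calculation of per-arm sample size for comparing two means with repeated measures under a compound-symmetry correlation and MCAR missingness; it is not one of the formulas derived in the dissertation, and the adjustment for missingness (scaling the number of visits by the observation probability) is a rough assumption. SciPy is assumed available for the normal quantiles.

```python
# Approximate per-arm sample size under compound symmetry with MCAR missingness.
from scipy.stats import norm

def n_per_arm(delta, sigma, m, rho, p_obs=1.0, alpha=0.05, power=0.80):
    m_eff = m * p_obs                                        # expected completed visits
    var_mean = sigma**2 * (1 + (m_eff - 1) * rho) / m_eff    # CS variance of subject mean
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z**2) * var_mean / delta**2

# e.g. detect a mean difference of 0.5 SD with 4 visits, rho = 0.3, 85% retention
print(round(n_per_arm(delta=0.5, sigma=1.0, m=4, rho=0.3, p_obs=0.85)))
```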

Committee:

Junfeng Shang (Advisor); Mark Earley (Committee Member); Hanfeng Chen (Committee Member); John Chen (Committee Member)

Subjects:

Statistics

Keywords:

Missing Data; Sample Size

Dogucu, Mine. Properties of Partially Convergent Models and Effect of Re-Imputation on These Properties
Doctor of Philosophy, The Ohio State University, 2017, Educational Studies
When researchers fit models to multiply imputed datasets, they have to fit the model separately for each imputed dataset, resulting in multiple sets of model results. It is possible for some of these sets of results not to converge while others do. This study examined the occurrence of such a problem, partial convergence, and inspected four outcomes of partially convergent models: proportion of convergence, percent parameter bias, root mean square error, and confidence interval coverage rate.

Committee:

Richard Lomax (Advisor)

Subjects:

Educational Tests and Measurements; Quantitative Psychology; Statistics

Keywords:

Missing Data, Multiple Imputation, Model Convergence

Kosler, Joseph Stephen. Multiple comparisons using multiple imputation under a two-way mixed effects interaction model
Doctor of Philosophy, The Ohio State University, 2006, Statistics
Missing data is commonplace with both surveys and experiments. For this dissertation, we consider imputation methods founded in Survey Sampling, and assess their performance with experimental data. With a two-way interaction model, missing data renders Multiple Comparisons Procedures invalid; we seek a resolution to this problem through development of a Multiple Imputation Procedure. By completing an incomplete data set, we obtain a balanced data set for which multiple comparisons of treatment effects may be performed. Our procedure is RMNI: Repeated Measures Normal Imputation. This procedure is readily adapted to function with any hierarchical linear model.

Committee:

Elizabeth Stasny (Advisor)

Subjects:

Statistics

Keywords:

Multiple Comparisons; Multiple Imputation; Missing Data; Mixed Model; Interaction Effect; RMNI

Medvedeff, Alexander Mark. On the Interpolation of Missing Dependent Variable Observations
Master of Arts, University of Akron, 2008, Economics
This paper uses a Monte Carlo experiment to simulate random missing observations in a dataset in order to analyze how various techniques for compensating for missing dependent variable observations behave in time series regression analysis. Reduced Sample, Modified Zero Order, and First Order techniques are tested with authentic data and simulated population datasets. The size of the datasets and the relative percentage of observations simulated as missing are varied in order to investigate the sensitivity of results to different dataset conditions. Each combination of data type, dataset size, percentage of missing observations, and model specification is regressed 1,000 times. Results are compared to a control or "full information" regression in order to analyze the bias and efficiency of each method tested. Results indicate that the Reduced Sample method should be the preferred solution for dealing with missing dependent variables in time series regression, as it consistently produces the least biased and most efficient estimates. The results also indicate a small degree of sensitivity to model specification, dataset size, and the percentage of observations simulated as missing.
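
The Python sketch below shows the general shape of such a Monte Carlo comparison on simulated data: the reduced-sample (listwise) estimate versus a naive fill-in of missing dependent-variable values. A mean fill is used here as a simple stand-in; the Modified Zero Order and First Order variants examined in the thesis are not implemented, and the slope, sample size, and missing rate are arbitrary.

```python
# Monte Carlo comparison of reduced-sample OLS vs. naive mean fill for missing y.
import numpy as np

rng = np.random.default_rng(11)
true_beta, n, reps, miss_rate = 1.5, 200, 1000, 0.3

def slope(x, y):
    return np.polyfit(x, y, 1)[0]

reduced, filled = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(size=n)
    miss = rng.random(n) < miss_rate            # y missing completely at random
    reduced.append(slope(x[~miss], y[~miss]))   # reduced-sample (listwise) estimate
    y_fill = np.where(miss, y[~miss].mean(), y) # naive fill-in of missing y
    filled.append(slope(x, y_fill))

print(f"reduced sample: bias {np.mean(reduced) - true_beta:+.3f}, sd {np.std(reduced):.3f}")
print(f"mean fill     : bias {np.mean(filled) - true_beta:+.3f}, sd {np.std(filled):.3f}")
```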

Committee:

Steven Myers, PhD (Advisor); Gasper Garofalo, PhD (Advisor)

Subjects:

Economics; Statistics

Keywords:

Interpolation; Missing Data; Missing Observations; Econometrics; Monte Carlo

Ferguson, Meg Elizabeth. Statistical Analysis of Species Level Phylogenetic Trees
Master of Science (MS), Bowling Green State University, 2017, Applied Statistics (Math)
In this thesis, statistical methods are used to analyze the generation of species-level phylogenies. Two software packages, one phylogenetic and one statistical, are used to investigate the difference in phylogeny topology across three methods. Maximum likelihood estimation, neighbor-joining, and UPGMA methodologies are applied in this comparison to study the accuracy of each software package in correctly placing taxa relative to the true phylogeny. Four genes with variable-length sequences are compared across forty-seven squid species. In addition, missing data techniques are employed to assess the impact missing data has on phylogeny generation. Two software platforms were used to generate phylogenies for genes 16S rRNA, 18S rRNA, 28S rRNA, and the mitochondrial gene cytochrome c oxidase I (COI). The phylogenetic software platform MEGA was utilized as well as the statistical software platform R; within R, the packages ape, phangorn, and seqinr were used in tree generation. Results show discrepancies between phylogenies generated across the four single-gene trees and multiple-gene trees; only phylogenies generated using missing data in the form of partial sequences grouped all families correctly. Results from this study highlight the struggle in determining the best software package to use for phylogenetic analyses. It was discovered that, in general, MEGA generated a more accurate single-gene phylogeny from gene 18S rRNA while R generated a more accurate single-gene phylogeny from gene 28S rRNA. Results also showed that sequences with 50% missing characters could be accurately placed within generated phylogenies.

Committee:

John Chen, Dr. (Advisor); Junfeng Shang, Dr. (Committee Member); Craig Zirbel, Dr. (Committee Member)

Subjects:

Statistics

Keywords:

statistics; phylogenetics; squid; missing data

Merkle, Edgar C. Bayesian estimation of factor analysis models with incomplete data
Doctor of Philosophy, The Ohio State University, 2005, Psychology
Missing data are problematic for many statistical analyses, factor analysis included. Because factor analysis is widely used by applied social scientists, it is of interest to develop accurate, general-purpose methods for the handling of missing data in factor analysis. While a number of such missing data methods have been proposed, each individual method has its weaknesses. For example, difficulty in obtaining test statistics of overall model fit and reliance on asymptotic results for standard errors of parameter estimates are two weaknesses of previously-proposed methods. As an alternative to other general-purpose missing data methods, I develop Bayesian missing data methods specific to factor analysis. Novel to the social sciences, these Bayesian methods resolve many of the other missing data methods' weaknesses and yield accurate results in a variety of contexts. This dissertation details Bayesian factor analysis, the proposed Bayesian missing data methods, and the computation required for these methods. Data examples are also provided.

Committee:

Trisha Van Zandt (Advisor)

Keywords:

Bayesian computation; Factor analysis; Missing data; Incomplete data; Data augmentation; Multiple imputation

Jiang, Hui. Missing Data Treatments in Multilevel Latent Growth Model: A Monte Carlo Simulation Study
Doctor of Philosophy, The Ohio State University, 2014, EDU Policy and Leadership
Under the framework of structural equation modeling (SEM), longitudinal data can be analyzed using latent growth models (LGM). An extension of the simple LGM is the multilevel latent growth model, which can be used to fit clustered data. The purpose of this study is to investigate the performance of five different missing data treatments (MDTs) for handling missingness due to longitudinal attrition in a multilevel LGM. The MDTs are: (1) listwise deletion (LD), (2) FIML, (3) EM imputation, (4) multiple imputation based on regression (MI-Reg), and (5) MI based on predictive mean matching (MI-PMM). A Monte Carlo simulation study was conducted to explore the research questions. First, population parameter values for the model were estimated from a nationally representative sample of elementary school students. Datasets were then simulated based on a two-level LGM, with different growth trajectories (constant, decelerating, accelerating), and at varying levels of sample size (200, 500, 2,000, 10,000). After datasets were generated, a designated proportion of data points (5%, 10%, 20%) was deleted based on different mechanisms of missingness (MAR, MNAR), and the five missing data treatments were applied. Finally, the parameter estimates produced by each missing data treatment were compared to the true population parameter values and to each other, according to four evaluation criteria: parameter estimate bias, root mean square error, length of 95% confidence intervals (CI), and coverage rate of 95% CIs. Among the five MDTs studied, FIML is the only MDT that yields satisfactory bias levels as well as coverage rates for all parameters across all sample sizes, attrition rates, and growth trajectories under MAR. It is also the only MDT that consistently outperforms the conventional MDT, LD, in every aspect, especially when the missingness ratio increases. Under MNAR, however, estimates of the predictor effects on slopes become biased and coverage for those two parameters becomes unacceptable. Under MAR, LD produces acceptable bias levels for most of the parameters except for the predictor effects. However, LD tends to generate wider CIs, and when a high missingness proportion is combined with a small sample size, or when missingness is MNAR, the amount of bias generally increases and CI coverage deteriorates. This study found that EM imputation does not perform well under either MAR or MNAR. On average, EM tends to underestimate standard errors unless the sample size is very large. Less than half of all parameters have intervals with satisfactory coverage levels using EM imputation, and coverage for variance components is generally low. Similar to EM, MI-Reg also fails to produce satisfactory bias levels for certain slope-related parameters and level-2 measurement error even under MAR. Contrary to EM, MI-Reg tends to overestimate standard errors. Coverage is generally superior using MI-Reg than using EM. Multiple imputation based on predictive mean matching (MI-PMM) performs similarly to MI-Reg, though it tends to yield higher bias and lower coverage for certain parameters.

Committee:

Richard Lomax (Advisor); Paul Gugiu (Committee Member); Eloise Kaizar (Committee Member)

Subjects:

Education; Statistics

Keywords:

structural equation model, latent growth model, missing data treatments, EM imputation, full information maximum likelihood, multiple imputation, Monte Carlo simulation

Yajima, Ayako. Assessment of Soil Corrosion in Underground Pipelines via Statistical Inference
Doctor of Philosophy, University of Akron, 2015, Civil Engineering
In the oil industry, underground pipelines are the most preferred means of transporting a large amount of liquid product. However, a considerable number of unforeseen incidents due to corrosion failure are reported each year. Since corrosion in underground pipelines is caused by physicochemical interactions between the material (steel pipeline) and the environment (soil), the assessment of soil as a corrosive environment is indispensable. Because of the complex characteristics of soil as a corrosion precursor influencing the dissolution process, soil cannot be explained fully by conventional semi-empirical methodologies defined in controlled settings. The uncertainties inherited from the dynamic and heterogeneous underground environment should be considered. Therefore, this work presents the unification of direct assessment of soil and in-line inspection (ILI) with a probabilistic model to categorize soil corrosion. To pursue this task, we employed a model-based clustering analysis via Gaussian mixture models. The analysis was performed on data collected from southeastern Mexico. The clustering approach helps to prioritize areas to be inspected in terms of underground conditions and can improve repair decision making beyond what is offered by current assessment methodologies. This study also addresses two important issues related to in-situ data: missing data and truncated data. The typical approaches for treating missing data utilized in civil engineering are ad hoc methods. However, these conventional approaches may cause several critical problems such as biased estimates, artificially reduced variance, and loss of statistical power. Therefore, this study presents a variant of the EM algorithm called Informative EM (IEM) to perform clustering analysis without filling in missing values prior to the analysis. This model-based method introduces additional cluster-specific Bernoulli parameters to exploit the nonuniformity of the frequency of missing values across clusters. In-line inspection (ILI) tools are commonly used for pipeline defect detection and characterization with advanced technologies such as magnetic flux leakage (MFL) and ultrasonic tools (UT). Each technology has its own limitation on the minimum detectable defect size. As a result, data measured by different technologies are difficult to compare under the same modeling framework. In the present study, this problem is addressed by considering two datasets measured by MFL and UT. Moreover, a truncated generalized exponential (TGE) distribution is introduced to describe the observed data. The non-informative Jeffreys' prior is used to establish the Bayesian updating algorithm, and the Markov chain Monte Carlo (MCMC) method is adopted to estimate the posterior distribution of the model.

Committee:

Robert Liang, Dr. (Advisor); Chien-Chun Chan, Dr. (Committee Member); Junliang Tao, Dr. (Committee Member); Guo-Xiang Wang, Dr. (Committee Member); Lan Zhang, Dr. (Committee Member)

Subjects:

Civil Engineering

Keywords:

Soil corrosion; Corrosion assessment; ECDA; ILI; Reliability; Gaussian mixture models; Clustering analysis; Missing data analysis; truncated distribution; Generalized exponential distribution; Bayesian inference; MCMC

Bishop, Brenden. Examining Random-Coefficient Pattern-Mixture Models for Longitudinal Data with Informative Dropout
Doctor of Philosophy, The Ohio State University, 2017, Psychology
Missing data commonly arise during longitudinal measurements. Dropout is a particularly troublesome type of missingness because inference after the dropout occasion is effectively precluded at the level of the individual without substantial assumptions. If missingness, such as dropout, is related to the unobserved outcome variables, then parameter estimates derived from models which ignore the missingness will be biased. For example, a treatment effect may appear less substantial if poor-performing subjects tend to withdraw from the study. In a general sense, missing data lead to scenarios in which the empirical distribution of observed data lacks nominal coverage in some areas. Little (1993) proposed a general pattern-mixture model approach in which the moments of the full data distribution were estimated as a finite mixture across the various missing-data patterns. These models and their extensions are flexible and may be estimated using widely available mixed-modeling software in some special cases. The purpose of this work is to review the relevant missing-data literature and to examine the viability of random-coefficient pattern-mixture models as an option for analysts seeking unbiased inference for longitudinal data subject to pernicious dropout.
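
A bare-bones Python illustration of the pattern-mixture idea follows: stratify simulated subjects by dropout pattern, estimate a growth slope within each pattern from the waves that pattern observes, and combine the estimates weighted by the pattern proportions. Real random-coefficient pattern-mixture models add random effects and identifying restrictions for extrapolation beyond dropout, none of which are implemented here; the simulated values are arbitrary.

```python
# Stratify-by-dropout-pattern estimation in the spirit of Little (1993).
import numpy as np

rng = np.random.default_rng(8)
n, waves = 300, 5
time = np.arange(waves)

# informative dropout: subjects with smaller growth slopes leave earlier
slopes = rng.normal(1.0, 0.5, n)
y = 10 + slopes[:, None] * time + rng.normal(0, 1.0, (n, waves))
last = np.clip(np.round(1 + 2 * (slopes - slopes.min())).astype(int), 1, waves - 1)

est, weight = [], []
for k in np.unique(last):
    ids = np.where(last == k)[0]
    # within-pattern slope estimate from each subject's observed waves 0..k
    s = [np.polyfit(time[:k + 1], y[i, :k + 1], 1)[0] for i in ids]
    est.append(np.mean(s))
    weight.append(len(ids) / n)
    print(f"pattern: last wave {k}, n={len(ids)}, slope estimate {np.mean(s):.2f}")

print(f"pattern-mixture combined slope: {np.average(est, weights=weight):.2f} "
      f"(true mean slope {slopes.mean():.2f})")
```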

Committee:

Robert Cudeck (Advisor); DeBoeck Paulus (Committee Member); MacEachern Steve (Committee Member)

Subjects:

Psychology

Keywords:

Pattern-Mixture Model; Longitudinal; Dropout; Missing Data; NMAR; Nonignorable Missingness

Kohlschmidt, Jessica Kay. Ranked Set Sampling: A Look at Allocation Issues and Missing Data Complications
Doctor of Philosophy, The Ohio State University, 2009, Statistics

Ranked set sampling (RSS) is an alternative to simple random sampling that has been shown to outperform simple random sampling (SRS) in many situations. RSS outperforms SRS by reducing the variance of an estimator, thereby providing the same accuracy with a smaller sample size than is needed in simple random sampling. Ranked set sampling involves the preliminary ranking of potential sample units on the variable of interest using judgment or an auxiliary variable to aid in sample selection. Ranked set sampling prescribes the number of units from each rank order to be measured. Chapter 1 provides an overview of RSS.

Chapter 2 considers unbalanced RSS and allocations associated with it under the assumption that we can observe the measurement of interest on each unit. Balanced ranked set sampling assigns equal numbers of sample units to each rank order. Unbalanced ranked set sampling allows unequal allocation to the various ranks, but this allocation may be sensitive to the quality of the information available to do the allocation. In Chapter 2, we use a simulation study to conduct a sensitivity analysis of optimal allocation of sample units to each of the order statistics in unbalanced ranked set sampling. Our motivating example comes from the National Survey of Families and Households.

In Chapter 3, we consider the optimal allocations for unbalanced RSS when allowing the ranking method to be imperfect. We provide a general formula for the optimal allocation under any imperfect ranking procedure. Then we investigate the optimal allocation scheme for several variations of imperfect rankings and examine graphs that display the changing allocations.

In Chapter 4, we consider missing data in the balanced ranked set sampling (RSS) setting under the condition of perfect rankings. Our goal is to estimate a population proportion. We study estimations of the parameters for RSS data under three missing data models: 1) missing completely at random (MCAR); 2) missing at random (MAR); and 3) non-ignorable non-response (NINR). In the MAR case we allow the probability of missingness to depend on the known RSS ranking. In the NINR case we allow the probability of missingness to depend on whether the observation that is measured is a success or failure. We present the theoretical results for MLEs of the parameters under these three models, followed by an example with data from the National Survey of Families and Households.

Chapter 5 derives the variances associated with the estimators we find in Chapter 4. In providing some measure of the variability of our estimate, we can then use inferential statistics to analyze our estimators. Chapter 6 uses the variances from Chapter 5 to derive optimal allocation schemes for estimators.
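
As a small illustration of balanced RSS for estimating a proportion with nonresponse, the Python sketch below draws sets, ranks them on a latent auxiliary variable (perfect rankings), measures one unit per rank, imposes MCAR item nonresponse, and averages the per-rank observed proportions. The set size, number of cycles, and response rate are arbitrary choices, and the MAR and NINR cases studied in the dissertation are not implemented.

```python
# Balanced ranked set sampling for a proportion with MCAR item nonresponse.
import numpy as np

rng = np.random.default_rng(13)
set_size, cycles, true_p = 3, 200, 0.3

pop_latent = rng.normal(size=100_000)          # ranking (auxiliary) variable
cutoff = np.quantile(pop_latent, 1 - true_p)   # top 30% of units are "successes"

measured = np.zeros((set_size, cycles))
for c in range(cycles):
    for r in range(set_size):
        # draw a set, rank it on the latent variable (perfect rankings),
        # and measure only the unit holding rank r
        latent = rng.choice(pop_latent, set_size, replace=False)
        measured[r, c] = float(np.sort(latent)[r] > cutoff)

# MCAR item nonresponse: each measured unit is observed with probability 0.8
observed = rng.random(measured.shape) < 0.8
per_rank = [measured[r, observed[r]].mean() for r in range(set_size)]
print(f"per-rank proportions {np.round(per_rank, 3)}, "
      f"RSS estimate {np.mean(per_rank):.3f} (true p = {true_p})")
```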

Committee:

Elizabeth Stasny (Advisor); Douglas Wolfe (Advisor); Eloise Kaizar (Committee Member); Deb Rumsey (Committee Member)

Subjects:

Statistics

Keywords:

Ranked Set Sampling (RSS); Missing Data

Sarkar, Saurabh. Feature Selection with Missing Data
PhD, University of Cincinnati, 2013, Engineering and Applied Science: Industrial Engineering
In the modern world, information has become the new power. Increasing effort is being made to gather data, resources are being allocated, time is being invested, and tools are being developed. Data collection is no longer a myth; however, it remains a great challenge to create value out of the enormous data that is being collected. Data modeling is one of the ways in which data is being utilized. When we try to model a process or a system, it is crucial to have the right features, and thus feature selection has become an essential part of data modeling. Yet often we have missing data, and in a worse scenario, the important features themselves may have considerable data missing. The challenge is to pick out the best features and yet accommodate the missing data. To address this problem, this dissertation introduces a cluster-based feature selection process which is quite robust in handling missing data. The research extends the Minimum Expected Cost of Misclassification (MECM) based feature selection method to a very high dimensional dataset by using cluster-based sampling methods. However, even though the cluster-based sampling methods allow the MECM to scale to larger datasets, determining the optimal cluster size is still a challenge. This is the first issue that the dissertation aims to solve. The second area that the dissertation tries to address is the issue of handling missing data while doing feature selection by the MECM-based method. This area has not been studied as extensively as feature selection itself, though missing data is witnessed quite often. The dissertation discusses an algorithm which enables the MECM to handle missing data. This is a probabilistic approach based on the distribution of the most similar instances. The algorithm determines the probability of having the instance in the sampling cluster and then does a fractional count while evaluating the MECM. One of the challenges of this approach is to correctly estimate the probability of a missing point lying within the sampling cluster. The key lies in picking the correct number of similar instances to calculate the probability. The dissertation also seeks to address this problem. The last part of the research contains a benchmark study to determine the effectiveness of the algorithm. A wrapper-based feature selection method using a Naive Bayesian classifier and another method using the MECM without the missing data algorithm are used as benchmarks. The MECM missing data algorithm showed a significant improvement over the other two. Solving these problems is of great practical significance to data modeling. Instances with missing data might carry critical information; ignoring missing data during feature selection can have a cascading effect downstream when the final model is built. This research will enable us to choose better features, which would in turn improve the accuracy of existing models. It will impact a broad range of applications, from gene-based medicine, fraud detection models, engineering, and business to any field which uses feature selection as one of the components in the model-building process.

Committee:

Hongdao Huang, Ph.D. (Committee Chair); Manish Kumar, Ph.D. (Committee Member); Sundararaman Anand, Ph.D. (Committee Member); David Thompson, Ph.D. (Committee Member)

Subjects:

Engineering

Keywords:

Feature selection; missing data; MECM; Minimum Expected Cost of Misclassification

Wang, Wei. Three Essays on Spatial Econometric Models with Missing Data
Doctor of Philosophy, The Ohio State University, 2010, Economics
This dissertation is composed of three essays on spatial econometric models with missing data. Spatial models, which have a long history in regional science and geography, have received substantial attention in various areas of economics recently. Applications of spatial econometric models prevail in urban, developmental, and labor economics, among others. In practice, an issue that researchers often face is the missing data problem. Although many solutions such as list-wise deletion and the EM algorithm can be found in the literature, most of them are either not suited for spatial models or hard to apply due to technical difficulties. My research focuses on the estimation of spatial econometric models in the presence of missing data problems. The first chapter develops a GMM method based on linear moments for the estimation of mixed regressive, spatial autoregressive (MRSAR) models with missing observations in the dependent variables. The estimation method uses the expectation of the missing data, as a function of the observed independent variables and the parameters to be estimated, to replace the missing data themselves in the estimation. The proposed GMM estimators are shown to be consistent and asymptotically normal. A feasible optimal weighting matrix for the GMM estimation is given. We extend our estimation method to MRSAR models with heteroskedastic disturbances, high order MRSAR models, and unbalanced spatial panel data models with random effects as well. From these extensions, we see that the proposed GMM method is more broadly applicable than the conventional EM algorithm. The second chapter considers a group interaction model first proposed by Lee (2006); this model is a special case of the spatial autoregressive (SAR) models. It is a first attempt to estimate the model in a more general random sample setting, i.e., a framework in which only a random sample rather than the whole population in a group is available. We incorporate group heteroskedasticity along with the endogenous, exogenous, and group fixed effects in the model. We prove that, under some basic assumptions and certain identification conditions, the quasi maximum likelihood (QML) estimators are consistent and asymptotically normal when the functional form of the group heteroskedasticity is known. Two types of misspecification are considered, and, under each, the estimators are inconsistent. We also propose IV estimation in the case that the group heteroskedasticity is unknown. An LM test of group heteroskedasticity is given at the end. The third chapter considers the same group interaction model as that in the second chapter, but focuses on the large group interaction case and uses a random effects setting for the group specific characters. A GMM estimation framework using moment conditions from both within and between equations is applied to the model. We prove that under some basic assumptions and certain identification conditions, the GMM estimators are consistent and asymptotically normal, and the convergence rates of the estimators are higher than those of the estimators derived from the within equations only. Feasible optimal GMM estimators are proposed.

Committee:

Lung-fei Lee (Advisor); Stephen Cosslett (Committee Member); Lucia Dunn (Committee Member)

Subjects:

Economics

Keywords:

spatial econometric models; missing data
