The focus of this dissertation is uncertainty quantification for model selection. In recent years, model selection methods for high-dimensional data have achieved many exciting results in terms of efficient algorithms and theoretical developments. Penalization methods for variable selection in regression can provide a sparse representation of the data even when the number of predictors is much larger than the sample size. However, quantifying the uncertainty of model selection remains a pressing task. In this dissertation, we propose model selection deviation to quantify model selection uncertainty for linear regression and confidence graphs to analyze model selection uncertainty for graphical models.
In the first part of this dissertation, we introduce several graphical tools, such as the G-plot and H-plot, to visualize the distribution of the selected model. We then propose the concept of model selection deviation (MSD) to quantify this uncertainty. Just as the standard error measures the variability of an estimator, model selection deviation measures the stability of the model chosen by a selection procedure. We discuss a bootstrap procedure for estimating this measure and demonstrate its desirable performance through simulation studies and real data analysis.
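The bootstrap idea behind this estimate can be sketched as follows. This is a minimal illustration, not the dissertation's exact procedure: the selection method here is a basic coordinate-descent lasso, and the deviation measure (average symmetric-difference size from the most frequently selected model) is an illustrative stand-in for the MSD definition.

```python
# Illustrative sketch: bootstrap assessment of model-selection stability.
# The lasso solver and the deviation summary below are generic stand-ins,
# not the MSD procedure proposed in the dissertation.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_support(X, y, lam, n_sweeps=50):
    """Coordinate-descent lasso; returns the set of selected predictors."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual excluding predictor j
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, n * lam) / col_ss[j]
    return frozenset(np.flatnonzero(np.abs(beta) > 1e-8))

def bootstrap_stability(X, y, lam, B=50):
    """Resample cases, rerun selection, summarize model variability."""
    n = X.shape[0]
    supports = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # case resampling
        supports.append(lasso_support(X[idx], y[idx], lam))
    modal = Counter(supports).most_common(1)[0][0]  # most frequent model
    # deviation: mean number of variables by which a bootstrap model
    # differs from the modal model (0 means perfectly stable selection)
    dev = float(np.mean([len(s ^ modal) for s in supports]))
    return modal, dev
```

A small deviation indicates that the selection procedure returns essentially the same model across resamples, which is the kind of stability MSD is designed to capture.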
In the second part, we introduce the concept of confidence graphs (CG) for graphical model selection. CG first identifies two nested graphical models, called the small and large confidence graphs (SCG and LCG), that trap the true graphical model between them at a given confidence level, just as the endpoints of a traditional confidence interval capture the population parameter. SCG and LCG thus reveal the simplest and most complex forms of dependence structure the true model can plausibly take, and the difference between them offers a measure of model selection uncertainty. In addition, rather than relying on a single selected model, CG provides a group of candidate graphical models lying between SCG and LCG. The proposed method can be coupled with many popular model selection methods, making it an ideal tool for comparing model selection uncertainty as well as measuring reproducibility. We also propose a new residual bootstrap procedure for the graphical model setting to approximate the sampling distribution of the selected models and to obtain CG.
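To make the nesting idea concrete, one simple frequency-based construction of a small and a large graph from bootstrap edge selections is sketched below. This rule (keep an edge in the small graph if it is selected in nearly every replicate, and in the large graph if it is selected in more than an alpha fraction) is purely illustrative; the CG procedure in the dissertation is based on the sampling distribution of whole models, not marginal edge frequencies.

```python
# Illustrative sketch: nested small/large graphs from bootstrap edge sets.
# Assumes each bootstrap replicate yields a set of selected edges, e.g. from
# any graphical model selection method. The frequency thresholds are a
# hypothetical stand-in for the SCG/LCG construction.
from collections import Counter

def small_large_graphs(edge_sets, alpha=0.05):
    """Return (small, large) edge sets; small is nested in large."""
    B = len(edge_sets)
    freq = Counter(e for s in edge_sets for e in s)
    small = {e for e, c in freq.items() if c / B >= 1 - alpha}
    large = {e for e, c in freq.items() if c / B > alpha}
    return small, large  # small is a subset of large whenever alpha < 0.5
```

By construction the small graph contains only edges that appear in almost all replicates, while the large graph excludes only edges that almost never appear, so the pair brackets the plausible dependence structures.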
Numerical studies further illustrate the advantages of the proposed method.
Last but not least, we build an R package, VDSM, to visualize the distribution of the selected model for generic variable selection methods.