Search Results

(Total results 16)

  • 1. Yadav, Anurag Memory Efficient Construction of Delaunay Triangulation in High Dimensions

    MS, University of Cincinnati, 2024, Engineering and Applied Science: Computer Engineering

    Topological Data Analysis (TDA) is a sub-field of data mining that studies topological invariants present in data. Generally, TDA methods treat discrete data as a sampling of a manifold and form a filtration of the data through a series of sub-complexes. These sub-complexes are constructed over an increasing set of connectivity distances (ranging from disconnected to fully connected). By examining the filtration, TDA methods identify topological features that are invariant for the dataset. The two most common tools in computational TDA are Persistent Homology (PH) and Euler Characteristic Curves (ECC). PH studies features that remain invariant under continuous deformation of an object and is built upon sub-complexes that represent a continuous span over the input dataset. ECC plots, as a curve, the sum of the Betti numbers found at each sub-complex of the filtration. While several types of complexes can be formed from data, the Vietoris-Rips (VR) and Alpha complexes are the most prevalent. VR complexes are formed by combinatorial methods, which results in fast construction; however, their space complexity is large and sensitive to dimension. In contrast, the Alpha complex is substantially smaller than the corresponding VR complex, but it requires computation of the Delaunay Triangulation (DT) in the data dimension. Unfortunately, computing the DT requires a substantial amount of runtime. Since much of the work with TDA tools is done on low-dimensional data in R^3, a VR complex is typically used to construct the filtration. However, when data resides in higher dimensions, the VR complex becomes impractical and alternative complexes must be pursued. The Alpha complex, for instance, can be formed for dimensions R^3-R^9 using conventional construction methods for the underlying DT; in higher dimensions, however, these construction methods fail. This thesis introduces a new algorithm for computing the Delaunay Triangulation, called the Helix Algorithm, that provides the ability to process higher dim (open full item for complete abstract)

    Committee: Philip Wilsey Ph.D. (Committee Chair); Badri Vellambi Ravisankar Ph.D. (Committee Member); Vikram Ravindra Ph.D. (Committee Member) Subjects: Computer Engineering
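
    A minimal illustration of the dimension sensitivity described in the abstract above, using scipy.spatial.Delaunay (a Qhull wrapper) on a small random point set. This is a generic sketch of conventional DT construction only, not the Helix Algorithm from the thesis.

        # Conventional Delaunay construction via Qhull: simplex counts and runtime
        # grow sharply with dimension, which is the limitation the thesis addresses.
        import time
        import numpy as np
        from scipy.spatial import Delaunay

        rng = np.random.default_rng(0)
        n_points = 120

        for dim in (2, 3, 4, 5, 6):
            pts = rng.random((n_points, dim))
            start = time.perf_counter()
            tri = Delaunay(pts)                 # conventional DT construction
            elapsed = time.perf_counter() - start
            print(f"dim={dim}: {len(tri.simplices):>8d} simplices, {elapsed:.2f} s")
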
  • 2. Dey, Sayantan High-performance, Parameter-free Data Stream Mining on High-Dimensional and Distributed Data

    PhD, University of Cincinnati, 2023, Engineering and Applied Science: Computer Science and Engineering

    The goal of data clustering is to partition a set of data vectors into groups called clusters such that intra-cluster similarity is maximized while inter-cluster similarity is minimized. Distributed clustering is performed on very large data distributed across machines. Algorithms must operate over the distributed data and still produce insightful clusters for the data as a whole. Distributed data-mining needs are growing as data is collected at sources across the world whose movement to, and storage at, a centralized location may be impossible for multiple reasons. Often, data is continuously generated and evolves over time (a.k.a. concept drift), forming new clusters while older clusters lose importance. Algorithms must operate on these high-dimensional Big-Data streams with limited processing power and memory. Although clustering is unsupervised learning, where prior knowledge of the data is unavailable, most algorithms require parameters, and the proper settings for these multiple parameters are challenging to determine. Ideally, a solution that operates independently of any user-defined parameters would prove useful. How can we know if something extraordinary happens or a new data trend starts occurring? When proper input parameter values are unknown, the algorithm must be run multiple times while adjusting the parameters to tune the results: it is run for each set of values from a carefully chosen parameter space, and the outputs from all runs are evaluated and analyzed using multiple chosen performance metrics, whose optimal values determine the optimal output for the algorithm. Drawbacks of this approach include establishing the testing range of values for each parameter and the amount of time and computational resources needed to perform the exploration; when it is applied to streaming, evolving data, the algorithm's execution speed must also match the data streaming speed. This dissertation develops a Parameter-Free high-performance approximat (open full item for complete abstract)

    Committee: Philip Wilsey Ph.D. (Committee Chair); Lee Albert Carraher Ph.D. (Committee Member); Carla Purdy Ph.D. (Committee Member); Wen-Ben Jone Ph.D. (Committee Member); John Gallagher Ph.D. (Committee Member) Subjects: Computer Science
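
    A hedged sketch of the parameter-sweep baseline the abstract argues against: run a parameterized clustering algorithm once per parameter setting and score each run with a chosen metric. It uses scikit-learn's DBSCAN and silhouette score on synthetic data; the dissertation's parameter-free streaming method is not shown here.

        # Parameter-sweep baseline: every grid point costs a full clustering run,
        # and the "best" setting depends on the chosen performance metric.
        import numpy as np
        from sklearn.cluster import DBSCAN
        from sklearn.datasets import make_blobs
        from sklearn.metrics import silhouette_score

        X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=0)

        results = []
        for eps in (2.0, 3.0, 4.0, 6.0):
            for min_samples in (5, 10, 20):
                labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
                n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
                if n_clusters >= 2:                  # silhouette needs >= 2 clusters
                    score = silhouette_score(X, labels)
                    results.append((score, eps, min_samples, n_clusters))

        print(max(results) if results else "no setting produced >= 2 clusters")
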
  • 3. Yang, Fang Nonlocal Priors in Generalized Linear Models and Gaussian Graphical Models

    PhD, University of Cincinnati, 2022, Arts and Sciences: Mathematical Sciences

    High-dimensional data, where the number of features or covariates is larger than the number of independent observations, are ubiquitous and are encountered on a regular basis by statistical scientists both in academia and in industry. Due to modern advancements in data storage and computational power, the high-dimensional data revolution has significantly occupied mainstream statistical research. In this thesis, we undertake the problem of variable selection in high-dimensional generalized linear models, as well as the problem of high-dimensional sparsity selection for covariance matrices in Gaussian graphical models. We first consider a hierarchical generalized linear regression model with the product moment (pMOM) nonlocal prior over the coefficients and examine its properties. Under standard regularity assumptions, we establish strong model selection consistency in a high-dimensional setting where the number of covariates is allowed to increase at a sub-exponential rate with the sample size. The Laplace approximation is implemented for computing the posterior probabilities, and the shotgun stochastic search procedure is suggested for exploring the posterior space. The proposed method is validated through simulation studies and illustrated by a real data example on functional activity analysis in an fMRI study for predicting Parkinson's disease. Moreover, we consider sparsity selection for the Cholesky factor L of the inverse covariance matrix in high-dimensional Gaussian Directed Acyclic Graph (DAG) models. The sparsity is induced over the space of L via the pMOM nonlocal prior and the hierarchical hyper-pMOM prior. We also establish model selection consistency for the Cholesky factor under more relaxed conditions than those in the literature and implement an efficient MCMC algorithm that selects the sparsity pattern for each column of L in parallel. We demonstrate the validity of our theoretical results via numerical simulations, and also use further s (open full item for complete abstract)

    Committee: Xuan Cao Ph.D. (Committee Member); Xia Wang Ph.D. (Committee Member); Seongho Song Ph.D. (Committee Member); Lili Ding Ph.D. (Committee Member) Subjects: Statistics
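
    For reference, the first-order product moment (pMOM) prior named in this abstract is usually written as below in the nonlocal-prior literature introduced by Johnson and Rossell; the hierarchical and hyper-pMOM specifications studied in the thesis differ in their hyperparameter treatment. Its defining "nonlocal" feature is that the density vanishes whenever any coefficient is exactly zero.

        % Generic first-order pMOM density (not the exact hierarchical form used in
        % the thesis): a Gaussian kernel multiplied by a quadratic penalty at zero.
        \pi\bigl(\beta \mid \tau, \sigma^{2}\bigr)
          \;=\; \prod_{j=1}^{p} \frac{\beta_j^{2}}{\tau\sigma^{2}}\,
                \mathcal{N}\bigl(\beta_j \mid 0,\; \tau\sigma^{2}\bigr),
        \qquad \text{so } \pi(\beta) = 0 \text{ whenever any } \beta_j = 0.
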
  • 4. Lopez Gomez, Daniel High Dimensional Data Methods in Industrial Organization Type Discrete Choice Models

    Doctor of Philosophy, The Ohio State University, 2022, Economics

    This dissertation is composed of three main papers. Each paper studies a classical discrete choice model setting within the realm of Industrial Organization (IO) that now has the added complexity of a high-dimensional component, which renders the traditional methods ineffective and thus requires alternative approaches. In the first paper, I study a static single-equilibrium market entry game of homogeneous firms that contains a high-dimensional set of exogenous market characteristics that could enter a firm's profit function. In this type of high-dimensional setting we are at high risk of overfitting, i.e., estimating model parameters that are tailored too closely to the available sample data and thus do not generalize well to new data. The focus of this paper is exploring the use of different regularization techniques to reduce overfitting when predicting market entry for a previously unobserved market. The second paper extends the market entry framework by examining a static multiple-equilibria market entry game of heterogeneous firms. The high-dimensional component in this setting arises from the way such a model is partially identified, namely through a set of moment inequalities that have to be met for a particular set of values of the parameters of interest to be consistent with the data. The number of moment inequalities that characterize this type of model can easily grow beyond traditional sample sizes, requiring special attention from the researcher when testing whether a vector of values for the parameters of interest is indeed accepted by the model. This paper studies different approaches to high-dimensional testing applied to this market entry model and evaluates their performance. Finally, in the third paper I consider a different but still extremely relevant model of Industrial Organization, the aggregate discrete choice model with random coefficients for dema (open full item for complete abstract)

    Committee: Jason Blevins (Advisor); Adam Dearing (Committee Member); Robert de Jong (Committee Member) Subjects: Economics
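
    A hedged sketch of the regularization idea in the first paper: with many exogenous market characteristics and a binary entry outcome, an L1-penalized logistic regression overfits less than an essentially unpenalized one. The data are synthetic and scikit-learn's logistic regression is only a stand-in for the entry model estimated in the dissertation.

        # Synthetic market-entry data: many candidate market characteristics, few of
        # which actually matter.  Compare in-sample vs held-out accuracy.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n_markets, n_chars = 300, 200
        X = rng.normal(size=(n_markets, n_chars))
        true_beta = np.zeros(n_chars)
        true_beta[:5] = (1.5, -1.0, 0.8, -0.6, 0.5)   # only a few characteristics matter
        entry = (X @ true_beta + rng.logistic(size=n_markets)) > 0

        X_tr, X_te, y_tr, y_te = train_test_split(X, entry, random_state=0)

        weak = LogisticRegression(C=1e6, max_iter=5000).fit(X_tr, y_tr)        # ~unpenalized
        lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_tr, y_tr)

        print("weak penalty  train/test:", round(weak.score(X_tr, y_tr), 3), round(weak.score(X_te, y_te), 3))
        print("L1 penalized  train/test:", round(lasso.score(X_tr, y_tr), 3), round(lasso.score(X_te, y_te), 3))
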
  • 5. Cai, Haoshu Modeling of High-Dimensional Industrial Data for Enhanced PHM using Time Series Based Integrated Fusion and Filtering Techniques

    PhD, University of Cincinnati, 2022, Engineering and Applied Science: Mechanical Engineering

    Prognostics and Health Management (PHM) has extended its frontiers to more pervasive applications for failure detection, process monitoring, and predictive maintenance in an increasingly complicated manufacturing environment. Meanwhile, as Internet of Things (IoT) technologies develop rapidly, PHM research faces non-negligible challenges in several aspects. The growth in the volume, velocity, and variety of manufacturing data demands improved analytics in PHM solutions. The sheer mass of manufacturing data demands a more efficient selection strategy to exclude incorrect and useless information. Also, in the industrial environment, high-dimensional data is usually collected from various sensor recordings subject to changes and drifts, which are fundamental properties of stream data. Advanced PHM techniques are therefore required to capture and track incoming information within the high-dimensional data continuously and adaptively. To address these challenges and research gaps, this research proposes a scalable methodology for discrete time series prediction based on industrial high-dimensional data. First, a reference-based fusion strategy is proposed and employed to combine valuable knowledge from the historical data, to reduce data dimensionality, and to exclude information that is not helpful for further analysis. Second, a state modeling strategy is designed to fuse both the reference data selected by the previous strategy and the past time series data; it also formulates an efficient and accurate function to depict the relationship between the predictor and the target. Third, a Bayesian filter is designed to deal with strong non-linearity, to propagate in high-dimensional space, and to learn new knowledge continuously from the stream data without losing the properties of the historical data. Finally, three cases from different industrial environments are implemented to justify the feasibility, ef (open full item for complete abstract)

    Committee: Jay Lee Ph.D. (Committee Member); David Siegel Ph.D. (Committee Member); Jing Shi Ph.D. (Committee Member); Jay Kim Ph.D. (Committee Member) Subjects: Mechanical Engineering
  • 6. Jiang, Jinzhu Feature Screening for High-Dimensional Variable Selection In Generalized Linear Models

    Doctor of Philosophy (Ph.D.), Bowling Green State University, 2021, Statistics

    High-dimensional data have been widely encountered over the past few decades in a great variety of areas such as bioinformatics, medicine, marketing, and finance. The curse of high-dimensionality presents a challenge in both methodological and computational aspects. Many traditional statistical modeling techniques perform well for low-dimensional data, but their performance begins to deteriorate when they are extended to high-dimensional data. Among all modeling techniques, variable selection plays a fundamental role in high-dimensional data modeling. To deal with the high-dimensionality problem, a large number of variable selection approaches based on regularization have been developed, including but not limited to the LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), and the Dantzig selector (Candes and Tao, 2007). However, as the dimensionality grows higher and higher, those regularization approaches may not perform well due to simultaneous challenges in computational expediency, statistical accuracy, and algorithmic stability (Fan et al., 2009). To address those challenges, a series of feature screening procedures have been proposed. Sure independence screening (SIS) is a well-known procedure, based on the Pearson correlation, for variable selection in linear models with high and ultrahigh dimensional data (Fan and Lv, 2008). Yet, the original SIS procedure mainly focused on linear models with a continuous response variable. Fan and Song (2010) extended this method to generalized linear models by ranking the maximum marginal likelihood estimator (MMLE) or the maximum marginal likelihood itself. In this dissertation, we consider extending the SIS procedure to high-dimensional generalized linear models with a binary response variable. We propose a two-stage feature screening procedure for generalized linear models with a binary response based on point-biserial correlation. The point-biserial correlation is an estimate of the correlation between one continuous variable and (open full item for complete abstract)

    Committee: Junfeng Shang (Committee Chair); Emily Freeman Brown (Committee Member); Hanfeng Chen (Committee Member); Wei Ning (Committee Member) Subjects: Statistics
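
    A hedged sketch of the marginal ranking step described in this abstract: compute the point-biserial correlation between the binary response and each continuous feature (scipy.stats.pointbiserialr) and keep the top d. This shows only the ranking idea, not the full two-stage procedure proposed in the dissertation.

        # Marginal screening by point-biserial correlation on synthetic p >> n data.
        import numpy as np
        from scipy.stats import pointbiserialr

        rng = np.random.default_rng(1)
        n, p, d = 200, 2000, 40                     # p >> n; keep the top d features
        X = rng.normal(size=(n, p))
        y = (X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)) > 0   # binary response

        scores = np.array([abs(pointbiserialr(y, X[:, j])[0]) for j in range(p)])
        keep = np.argsort(scores)[::-1][:d]          # indices of the d top-ranked features

        print("ten of the screened-in features:", sorted(keep[:10].tolist()))
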
  • 7. Ratnasingam, Suthakaran Sequential Change-point Detection in Linear Regression and Linear Quantile Regression Models Under High Dimensionality

    Doctor of Philosophy (Ph.D.), Bowling Green State University, 2020, Statistics

    Sequential change point analysis aims to detect structural change as quickly as possible when the process state changes. A good sequential change point detection procedure is expected to minimize the detection delay time and the risk of raising a false alarm. Existing sequential change point detection methods are not applicable to high-dimensional data because they are univariate in nature, and this presents challenges. In the first part of the dissertation, we develop a monitoring method to detect structural change in the smoothly clipped absolute deviation (SCAD) penalized regression model for high-dimensional data after a historical sample of size m. The unknown pre-change regression coefficients are replaced by the SCAD penalized estimator. The asymptotic properties of the proposed test statistics are derived. We conduct a simulation study to evaluate the performance of the proposed method. The proposed method is applied to gene expression data from the mammalian eye to detect changes sequentially. In the second part of the dissertation, we develop a sequential change point detection method to monitor structural changes in the SCAD penalized quantile regression (SPQR) model for high-dimensional data. We derive the asymptotic distributions of the test statistic under the null and alternative hypotheses. Furthermore, to improve the performance of the SPQR method, we propose the Post-SCAD penalized quantile regression estimator (P-SPQR) for high-dimensional data. Simulations are conducted under different scenarios to study the finite sample properties of the SPQR and P-SPQR methods. A real data application is provided to demonstrate the effectiveness of the method. In the third and fourth parts of the dissertation, we investigate the change point problem for the Skew-Normal distribution and the three-parameter Weibull distribution, respectively. Besides detecting and obtaining the point estimate of a change location, we propose an estimation procedure based (open full item for complete abstract)

    Committee: Wei Ning PhD (Advisor); Andy Garcia PhD (Other); Hanfeng Chen PhD (Committee Member); Junfeng Shang PhD (Committee Member) Subjects: Statistics
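
    A generic residual-CUSUM monitoring sketch in the spirit of this abstract: estimate the pre-change coefficients on a historical sample of size m with a penalized fit (scikit-learn's Lasso stands in for SCAD, which scikit-learn does not provide), then track cumulative residuals of incoming observations against an illustrative threshold. The test statistic and boundary function derived in the dissertation are not reproduced here.

        # Fit the pre-change model on the historical sample, then monitor residuals
        # of new observations with a one-sided CUSUM (illustrative allowance/threshold).
        import numpy as np
        from sklearn.linear_model import Lasso

        rng = np.random.default_rng(2)
        m, p = 150, 400                               # historical sample size, p >> m
        beta = np.zeros(p)
        beta[:3] = (2.0, -1.5, 1.0)

        X_hist = rng.normal(size=(m, p))
        y_hist = X_hist @ beta + rng.normal(size=m)
        pre_change = Lasso(alpha=0.1).fit(X_hist, y_hist)   # SCAD stand-in

        cusum, threshold, allowance = 0.0, 25.0, 0.5
        for t in range(1, 301):
            x_t = rng.normal(size=p)
            shift = 3.0 if t > 200 else 0.0           # structural change at t = 200
            y_t = float(x_t @ beta) + shift + rng.normal()
            resid = y_t - float(pre_change.predict(x_t[None, :])[0])
            cusum = max(0.0, cusum + resid - allowance)
            if cusum > threshold:
                print(f"change signalled at monitoring time t = {t}")
                break
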
  • 8. Li, Lingjun Statistical Inference for Change Points in High-Dimensional Offline and Online Data

    PHD, Kent State University, 2020, College of Arts and Sciences / Department of Mathematical Sciences

    High-dimensional offline and online time series data are characterized by a large number of measurements and complex dependence, and often involve change points. Change point detection in offline time series data improves the parameter testing and estimation by pooling homogeneous observations between two successive change points. Change point detection in online time series data provides timely snapshots of the monitored system and allows for real-time anomaly detection. Despite its importance, methods available for change point detection in high-dimensional offline and online time series data are scarce. In the first part of the thesis, we present some new statistics for change-point testing and estimation in high dimensional offline time series data. We establish their theoretical properties including asymptotic distributions and consistency under mild conditions. The developed new methods are non-parametric without imposing restrictive structural assumptions. They incorporate spatial and temporal dependence in data. Most importantly, they can detect the change point near the boundary of time series data. In the second part of the thesis, we extend these new statistics to high-dimensional online time series data and provide a new stopping rule to detect a change point as early as possible after an anomaly occurs. We study theoretical properties of the new stopping rule, and derive an explicit expression for the average run length (ARL) so that the level of threshold in the stopping rule can be easily obtained with no need to run time-consuming Monte Carlo simulations. We also establish an upper bound for the expected detection delay (EDD), which demonstrates the impact of data dependence and magnitude of structure change in data. Simulation and case studies are provided to demonstrate the empirical performance of the proposed offline and online change-point detection methods.

    Committee: Jun Li (Advisor); Mohammad Khan (Committee Member); Jing Li (Committee Member); Cheng-Chang Lu (Committee Member); Ruoming Jin (Committee Member) Subjects: Mathematics; Statistics
  • 9. Polin, Afroza Simultaneous Inference for High Dimensional and Correlated Data

    Doctor of Philosophy (Ph.D.), Bowling Green State University, 2019, Statistics

    In high-dimensional data, the number of covariates is larger than the sample size, which makes the estimation process challenging. We consider high-dimensional, longitudinal data where, at each time point, the number of covariates is much higher than the number of subjects. We consider two different settings of longitudinal data. First, we consider that the samples at different time points are generated from different populations. Second, we consider that the samples at different time points are generated from a multivariate distribution. In both cases, the number of covariates is much larger than the sample size and standard least squares methods are not applicable. In a longitudinal study, our main focus is on the changes of the mean responses over time and how these changes are related to the explanatory variables. Thus we are interested in testing the effect of the covariates over the time points simultaneously. In the first scenario, we use the lasso at each time point to regress the response on the explanatory variables. Along with estimating the regression coefficients, the lasso also performs dimension reduction. We use the de-biased lasso for inference. To adjust for the multiplicity effect in simultaneous testing, we apply the Bonferroni, Holm, Hochberg, and coherent stepwise procedures. In the second scenario, the samples at different time points are generated from a multivariate distribution and the dimension of the multivariate distribution is equal to the number of time points. We use the lasso and de-biased lasso for inference. To adjust for the multiplicity effect in simultaneous testing, we use the Bonferroni, Holm, Hochberg, and stepwise procedures. We provide theoretical details showing that the Bonferroni, Holm step-down, and coherent step-wise procedures control the family-wise error rate in the strong sense for de-biased lasso estimators, while Hochberg's procedure provides strong control of the family-wise error rate only for independent or positively correlated te (open full item for complete abstract)

    Committee: John Chen Ph.D. (Advisor); Marc Simon Ph.D. (Other); Wei Ning Ph.D. (Committee Member); Junfeng Shang Ph.D. (Committee Member) Subjects: Statistics
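
    The Bonferroni, Holm, and Hochberg adjustments named in this abstract are available in statsmodels; the sketch below applies them to a toy vector of p-values standing in for de-biased-lasso test p-values. The coherent stepwise procedure studied in the dissertation is not part of statsmodels.

        # Classical family-wise error rate adjustments on a toy p-value vector.
        import numpy as np
        from statsmodels.stats.multitest import multipletests

        pvals = np.array([0.0004, 0.0021, 0.009, 0.012, 0.041, 0.18, 0.33, 0.74])

        for method in ("bonferroni", "holm", "simes-hochberg"):
            reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
            print(f"{method:>14s}: reject = {reject.astype(int)}")
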
  • 10. Zhu, Zheng A Unified Exposure Prediction Approach for Multivariate Spatial Data: From Predictions to Health Analysis

    PhD, University of Cincinnati, 2019, Medicine: Biostatistics (Environmental Health)

    Epidemiological cohort studies of health effects often rely on spatial models to predict ambient air pollutant concentrations at participants' residential addresses. Compared with traditional linear regression models, spatial models such as kriging provide accurate predictions by taking into account spatial correlation within the data. Spatial models utilize regression covariates from the high-dimensional databases provided by geographical information systems (GIS). This modeling requires dimension reduction techniques such as partial least squares, the lasso, and the elastic net. In the first chapter of this thesis, we present a comparison of the performance of four potential spatial prediction models. The first two approaches are based on universal kriging (UK). The third and fourth approaches are based on random forests and Bayesian additive regression trees (BART), with some degree of spatial smoothing. Multivariate spatial models are often considered for point-referenced spatial data, which contain multiple measurements at each monitoring location, so correlation between measurements is anticipated. In the second chapter of the thesis, we propose a chain model for analyzing multivariate spatial data. We show that the chain model outperforms other spatial models such as universal kriging and the coregionalization model. In the third chapter, we connect our spatial analysis with epidemiological studies of the health effects of environmental chemical mixtures. Specifically, we investigate the relationship between environmental chemical mixture exposure and the cognitive and motor development of infants. We propose a framework to analyze the health effects of environmental chemical mixtures: we first perform dimension reduction of the exposure variables using principal component analysis, and in the second stage we apply best subset regression to obtain the final model.

    Committee: Roman Jandarov Ph.D. (Committee Chair); Sivaraman Balachandran Ph.D. (Committee Member); Won Chang Ph.D. (Committee Member); Marepalli Rao Ph.D. (Committee Member) Subjects: Statistics
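
    A hedged sketch of the two-stage framework described for the third chapter: PCA on the exposure variables, followed by a small best-subset search over the leading components scored by AIC. Synthetic data with scikit-learn and statsmodels as stand-ins; this is not the cohort analysis from the thesis.

        # Stage 1: reduce the exposure mixture with PCA.  Stage 2: exhaustive
        # best-subset search over the leading components, compared by AIC.
        from itertools import combinations
        import numpy as np
        import statsmodels.api as sm
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(3)
        n, n_chemicals = 250, 30
        exposures = rng.normal(size=(n, n_chemicals))
        outcome = exposures[:, :4].sum(axis=1) + rng.normal(size=n)   # toy development score

        pcs = PCA(n_components=5).fit_transform(exposures)            # stage 1

        best = None
        for k in range(1, 6):                                         # stage 2
            for subset in combinations(range(5), k):
                fit = sm.OLS(outcome, sm.add_constant(pcs[:, list(subset)])).fit()
                if best is None or fit.aic < best[0]:
                    best = (fit.aic, subset)

        print("selected principal components:", best[1], "AIC:", round(best[0], 1))
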
  • 11. Cho, Jang Ik Partial EM Procedure for Big-Data Linear Mixed Effects Model, and Generalized PPE for High-Dimensional Data in Julia

    Doctor of Philosophy, Case Western Reserve University, 2018, Epidemiology and Biostatistics

    Methodologically, this dissertation contributes to two areas in statistics: linear mixed effects models for big data and tests of equal covariance for high-dimensional data. Scientifically, this dissertation helps to comprehensively evaluate the effect of the Specialty Care Access Network-Extension for Community Healthcare Outcomes (SCAN-ECHO) training of primary care providers at outpatient clinics on treating diabetes in the VA patient population. In the first part of this dissertation, we introduce three challenges, and offer solutions to each, in examining the effect of SCAN-ECHO training on VA diabetic patients. The first challenge was data curation for longitudinal variables. As a solution, we developed an R function called "fusion", customized to our data structure, for effective data curation. The second challenge was measurement variability and heterogeneity of the population. Different types of summary measures were used to reduce the variability of the outcome, and longitudinal cluster analysis was conducted to identify similar subgroups within the heterogeneous population. The third challenge was fitting a linear mixed effects model for big data that could not be imported into R because the data exceeded memory capacity. As a solution, we proposed a new approach to the Big-data Linear Mixed Effects Model (bLMM) using a Partial EM (PEM) algorithm and data partitioning. Our PEM procedure was developed to analyze the effect of SCAN-ECHO training on diabetes treatment, but this analytic approach is of interest by itself (statistical contribution 1) because the PEM is a general procedure for fitting an LMM to big data. We evaluated the performance of bLMM PEM by comparing PEM to the following three methods for fitting an LMM: the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using the entire data, full EM using the entire data, and meta-analysis using data partitions. Finally, for implementation, we applied our PEM procedure to evaluate the effect of SCAN-ECHO train (open full item for complete abstract)

    Committee: Jiayang Sun (Advisor); Jeffrey Albert (Committee Chair); Mark Schluchter (Committee Member); David Aron (Committee Member); Yifan Xu (Committee Member) Subjects: Biomedical Research; Biostatistics; Health; Mining; Statistics
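
    A hedged sketch of the data-partitioning comparator mentioned in this abstract (meta-analysis over partitions), not the Partial EM algorithm itself: fit a linear mixed effects model on each partition with statsmodels MixedLM and combine the fixed-effect estimates by inverse-variance weighting. The variable names (patient, time, a1c) are hypothetical synthetic stand-ins, not the VA data.

        # Partition the subjects, fit an LMM per partition, and pool the fixed
        # effect of time across partitions (a crude meta-analysis-style combine).
        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(4)
        frames = []
        for g in range(60):                                # 60 patients, repeated measures
            b_g = rng.normal(scale=0.5)                    # random intercept
            t = np.arange(8)
            frames.append(pd.DataFrame({"patient": g, "time": t,
                                        "a1c": 7.0 - 0.05 * t + b_g
                                               + rng.normal(scale=0.3, size=8)}))
        data = pd.concat(frames, ignore_index=True)

        estimates, weights = [], []
        for part in np.array_split(data["patient"].unique(), 3):   # 3 partitions
            chunk = data[data["patient"].isin(part)]
            res = smf.mixedlm("a1c ~ time", chunk, groups=chunk["patient"]).fit()
            estimates.append(res.fe_params["time"])
            weights.append(1.0 / res.bse["time"] ** 2)             # inverse-variance weight

        print("combined fixed effect of time:", round(np.average(estimates, weights=weights), 4))
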
  • 12. Fuhry, David PLASMA-HD: Probing the LAttice Structure and MAkeup of High-dimensional Data

    Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering

    Making sense of, analyzing, and extracting useful information from large and complex data is a grand challenge. A user tasked with meeting this challenge is often befuddled with questions on where and how to begin to understand the relevant characteristics of such data. Recent advances in relational analytics, in particular network analytics, offer key tools for insight into connectivity structure and relationships at both local ("guilt by association") and global (clustering and pattern matching) levels. These tools form the basis of recommender systems, ranking, and learning algorithms of great importance to research and industry alike. However, complex data rarely originate in a format suitable for network analytics, and the transformation of large and typically high-dimensional non-network data to a network is rife with parameterization challenges, as an under- or over-connected network will lead to poor subsequent analysis. Additionally, both network formation and subsequent network analytics become very computationally expensive as network size increases, especially if multiple networks with different connectivity levels are formed in the previous step; scalable approximate solutions are thus a necessity. I present an interactive system called PLASMA-HD to address these challenges. PLASMA-HD builds on recent progress in the fields of locality sensitive hashing, knowledge caching, and graph visualization to provide users with the capability to probe and interrogate the intrinsic structure of data. For an arbitrary dataset (vector, structural, or mixed), and given a similarity or distance measure-of-interest, PLASMA-HD enables an end user to interactively explore the intrinsic connectivity or clusterability of a dataset under different threshold criteria. PLASMA-HD employs and enhances the recently proposed Bayesian Locality Sensitive Hashing (BayesLSH), to efficiently estimate connectivity structure among entities. Unlike previous efforts which operate at (open full item for complete abstract)

    Committee: Srinivasan Parthasarathy (Advisor); Arnab Nandi (Committee Member); P Sadayappan (Committee Member); Michael Barton (Committee Member) Subjects: Computer Science
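
    A plain signed-random-projection LSH sketch of the general idea PLASMA-HD builds on: points whose random-hyperplane bit signatures collide become candidate neighbor pairs without an all-pairs comparison. This is the textbook cosine-similarity LSH scheme written in numpy, neither BayesLSH nor PLASMA-HD itself.

        # Random-hyperplane LSH: equal bit signatures -> candidate neighbor pairs.
        from collections import defaultdict
        import numpy as np

        rng = np.random.default_rng(5)
        n, dim, n_bits = 1000, 50, 16
        X = rng.normal(size=(n, dim))

        hyperplanes = rng.normal(size=(n_bits, dim))
        signatures = (X @ hyperplanes.T) > 0                 # one 16-bit signature per point

        buckets = defaultdict(list)
        for i, sig in enumerate(signatures):
            buckets[sig.tobytes()].append(i)                 # colliding points share a bucket

        candidate_pairs = sum(len(b) * (len(b) - 1) // 2 for b in buckets.values())
        print(f"candidate pairs: {candidate_pairs} (vs {n * (n - 1) // 2} in an all-pairs scan)")
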
  • 13. Liu, Yang Improving the Accuracy of Variable Selection Using the Whole Solution Path

    Doctor of Philosophy (Ph.D.), Bowling Green State University, 2015, Statistics

    The performance of penalized least squares approaches depends profoundly on the selection of the tuning parameter; however, statisticians have not reached consensus on the criterion for choosing it. Moreover, penalized least squares estimation based on a single value of the tuning parameter suffers from several drawbacks. A tuning parameter selected by traditional criteria such as AIC, BIC, or CV tends to admit excessive variables, which results in an over-fitted model. On the contrary, many other criteria, such as the extended BIC, favor an over-sparse model and may run the risk of dropping relevant variables from the model. In this dissertation, a novel approach for feature selection based on the whole solution path is proposed, which significantly improves selection accuracy. The key idea is to partition the variables into a relevant set and an irrelevant set at each tuning parameter, and then select the variables that have been classified as relevant for at least one tuning parameter. The approach is named Selection by Partitioning the Solution Paths (SPSP). Compared with other existing feature selection approaches, the proposed SPSP algorithm allows feature selection using a wide class of penalty functions, including the Lasso, ridge, and other strictly convex penalties. Based on the proposed SPSP procedure, a new type of score is presented to rank the importance of the variables in the model. The scores, denoted Area-out-of-zero-region Importance Scores (AIS), are defined by the areas between the solution paths and the boundary of the partitions over the whole solution path. By applying the proposed scores in stepwise selection, the false positive error of the selection is remarkably reduced. The asymptotic properties of the proposed SPSP estimator have been established: it is shown that the SPSP estimator is selection consistent when the original estimator is either estimation con (open full item for complete abstract)

    Committee: Hanfeng Chen (Committee Chair); Peng Wang (Advisor); James Albert (Committee Member); Jonathan Bostic (Other) Subjects: Statistics
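
    A toy version of the path-based selection idea: compute the whole Lasso solution path with scikit-learn's lasso_path, split variables into relevant and irrelevant sets at every tuning parameter, and keep any variable classified as relevant at least once. The partition rule below (half the largest absolute coefficient at each value) is a crude placeholder, not the actual SPSP partitioning rule.

        # Selection over the whole solution path rather than at a single tuning value.
        import numpy as np
        from sklearn.linear_model import lasso_path

        rng = np.random.default_rng(6)
        n, p = 100, 50
        X = rng.normal(size=(n, p))
        beta = np.zeros(p)
        beta[:5] = (3.0, -2.0, 1.5, 1.0, -1.0)
        y = X @ beta + rng.normal(size=n)

        alphas, coefs, _ = lasso_path(X, y, n_alphas=100)    # coefs: (p, n_alphas)

        relevant_somewhere = np.zeros(p, dtype=bool)
        for k in range(len(alphas)):
            col = np.abs(coefs[:, k])
            if col.max() > 0:
                relevant_somewhere |= col >= 0.5 * col.max() # placeholder partition rule

        print("selected variables:", np.flatnonzero(relevant_somewhere))
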
  • 14. Harvey, William Understanding High-Dimensional Data Using Reeb Graphs

    Doctor of Philosophy, The Ohio State University, 2012, Computer Science and Engineering

    Scalar functions are virtually ubiquitous in scientific research. A vast amount of research has been conducted in visualization and exploration of low-dimensional data during the last few decades, but adapting these techniques to high-dimensional, topologically-complex data remains challenging. Traditional metric-preserving dimensionality reduction techniques suffer when the intrinsic dimension of data is high, as the metric cannot generally survive projection into low dimensions. The metric distortion can be arbitrarily large, and preservation of topological structure is not guaranteed, resulting in a misleading view of the data. When preservation of geometry is not possible, topological analysis provides a promising alternative. As an example, simplicial homology characterizes the structure of a topological space (i.e. a simplicial complex) via its intrinsic topological features of various dimensions. Unfortunately, this information can be abstract and difficult to comprehend. The ranks of these homology groups (the Betti numbers) offer a simpler, albeit coarse, interpretation as the number of voids of each dimension. In high dimensions, these approaches suffer from exponential time complexity, which can render them impractical for use with real data. In light of these difficulties, we turn to an alternative type of topological characterization. We investigate the Reeb graph as a visualization and analysis tool for such complex data. The Reeb graph captures the topology of the set of level sets of a scalar function, providing a simple, intuitive, and informative topological representation. We present the first sub-quadratic expected time algorithm for computing the Reeb graph of an arbitrary simplicial complex, opening up the possibility of using the Reeb graph as a tool for understanding high-dimensional data. While the Reeb graph effectively captures some topological structure, it is still somewhat terse. The Morse-Smale complex summarizes a scalar function by b (open full item for complete abstract)

    Committee: Yusu Wang PhD (Advisor); Tamal Dey PhD (Committee Member); Rephael Wenger PhD (Committee Member) Subjects: Bioinformatics; Computer Science
  • 15. Kumar, Vijay Specification, Configuration and Execution of Data-intensive Scientific Applications

    Doctor of Philosophy, The Ohio State University, 2010, Computer Science and Engineering

    Recent advances in digital sensor technology and numerical simulations of real-world phenomena are resulting in the acquisition of unprecedented amounts of raw digital data. Terms like 'data explosion' and 'data tsunami' have come to describe the uncontrolled rate at which scientific datasets are generated by automated sources ranging from digital microscopes and telescopes to in-silico models simulating the complex dynamics of physical and biological processes. Scientists in various domains now have secure, affordable access to petabyte-scale observational data gathered over time, the analysis of which is crucial to scientific discovery. The availability of commodity components has fostered the development of large distributed systems with high-performance computing resources to support the execution requirements of scientific data analysis applications. Increased levels of middleware support over the years have aimed to provide high scalability of application execution on these systems. However, the high-resolution, multi-dimensional nature of scientific datasets and the complexity of analysis requirements present challenges to efficient application execution on such systems. Traditional brute-force analysis techniques to extract useful information from scientific datasets may no longer meet desired performance levels at extreme data scales. This thesis builds on a comprehensive study involving multi-dimensional data analysis applications at large data scales, and identifies a set of advanced factors or parameters of this class of applications which can be customized in domain-specific ways to obtain substantial improvements in performance. A useful property of these applications is their ability to operate at multiple performance levels based on a set of trade-off parameters, while providing different levels of quality-of-service (QoS) specific to the application instance. To avail the performance benefits brought about by such facto (open full item for complete abstract)

    Committee: P Sadayappan PhD (Advisor); Joel Saltz MD, PhD (Committee Member); Gagan Agrawal PhD (Committee Member); Umit Catalyurek PhD (Committee Member) Subjects: Computer Science
  • 16. Liu, Peng Adaptive Mixture Estimation and Subsampling PCA

    Doctor of Philosophy, Case Western Reserve University, 2009, Sciences

    Data mining is important in scientific research, knowledge discovery, and decision making. A typical challenge in data mining is that a data set may be too large to be loaded all at once into computer memory for analysis. Even if it can be loaded all at once, too many nuisance features may mask important information in the data. In this dissertation, two new methodologies for analyzing large data are studied. The first methodology is concerned with adaptive estimation of mixture parameters in heterogeneous populations of large-n data. Our adaptive estimation procedures, the partial EM (PEM) and its Bayesian variants (BMAP and BPEM), work well for large or streaming data. They can also handle the situation in which later-stage data may contain extra components (a.k.a. "contaminations" or "intrusions") and hence have applications in network traffic analysis and intrusion detection. Furthermore, the partial EM estimate is consistent and efficient, and it compares well with a full EM estimate when a full EM procedure is feasible. The second methodology concerns subsampling large-p data to select important features under the principal component analysis (PCA) framework. Our new method is called subsampling PCA (SPCA). Diagnostic tools for choosing parameter values in our SPCA procedure, such as subsample size and iteration number, are developed. It is shown through analysis and simulation that SPCA can overcome the masking effect of nuisance features and pick up the important variables and major components. Its application to gene expression data analysis is also demonstrated.

    Committee: Jiayang Sun PhD (Advisor); Joe Sedransk PhD (Committee Member); Guoqiang Zhang PhD (Committee Member); Mark Schluchter PhD (Committee Member); Patricia Williamson PhD (Committee Member) Subjects: Statistics
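
    A hedged sketch of the feature-subsampling idea behind SPCA as described above: repeatedly run PCA on random subsets of the variables and accumulate, for each variable, an importance score from its loadings on the leading components. The scoring rule and the diagnostics of the actual SPCA procedure differ from this toy version, which uses scikit-learn's PCA on synthetic data.

        # Subsample the columns, run a small PCA each time, and average each
        # variable's largest absolute loading on the leading components.
        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(7)
        n, p, subset_size, n_iter = 200, 2000, 100, 300
        X = rng.normal(size=(n, p))
        X[:, :10] += rng.normal(size=(n, 1)) * 3.0          # 10 variables share a strong component

        importance = np.zeros(p)
        counts = np.zeros(p)
        for _ in range(n_iter):
            cols = rng.choice(p, size=subset_size, replace=False)
            pca = PCA(n_components=2).fit(X[:, cols])
            importance[cols] += np.abs(pca.components_).max(axis=0)
            counts[cols] += 1

        scores = importance / np.maximum(counts, 1)
        print("top-ranked variables:", np.argsort(scores)[::-1][:10])
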