Search Results

(Total results 88)

Search Report

  • 1. Gruenberg, Rebecca Multi-Model Snowflake Schema Creation

    Master of Computer Science, Miami University, 2022, Computer Science and Software Engineering

    Big Data's three V's--volume, velocity, and variety--have continually presented a problem for storing and querying large, diverse data efficiently. Data lakes represent a growing field of study for storing large volumes of data in a variety of formats. Multi-model star schemas support analytical processing of data stored in native formats and are an emerging area in data warehousing. Using multi-model snowflake schemas in place of star schemas gives the user a fuller picture of the data lake and the relationships within it. In this work, we extend and implement a meta-model for data lakes and provide an algorithm to semi-automatically perform mappings between the data lake and a multi-model snowflake schema for structured and semi-structured data. Our algorithm recommends candidate multi-model snowflake schemas derived from a meta-model of a data lake. The algorithm is the basis for a tool that assists analysts in understanding the contents of a data lake and in creating views that support analytical processing, enabling better business decisions when querying a large data repository. We implement this tool and demonstrate its functionality using a variety of case studies.

    Committee: Karen Davis (Advisor); Dhananjai Rao (Committee Member); Daniela Inclezan (Committee Member) Subjects: Computer Science
  • 2. Raghavendra, Aarthi Performance Evaluation of Analytical Queries on a Stand-alone and Sharded Document Store

    MS, University of Cincinnati, 2015, Engineering and Applied Science: Computer Science

    Numerous organizations perform data analytics using relational databases by executing data mining queries. These queries include complex joins and aggregate functions. However, due to an explosion of data in terms of volume, variety, veracity, velocity, and value, known as Big Data [1], many organizations such as Foursquare, Adobe, and Bosch have migrated to NoSQL databases [2] such as MongoDB [3] and Cassandra [4]. We intend to demonstrate the performance impact an organization can expect for analytical queries on a NoSQL document store. In this thesis, we benchmark the performance of MongoDB [3], a cross-platform document-oriented database, for datasets of sizes 1GB and 5GB in a stand-alone environment and a sharded environment. The stand-alone MongoDB environment is the same for all datasets, whereas the configuration of the MongoDB cluster varies with dataset size. The TPC-DS benchmark [5] is used to generate data at different scales, and selected data mining queries are executed in both environments. Our experimental results show that, along with the choice of environment, data modeling in MongoDB also has a significant impact on query execution times. MongoDB is an appropriate choice when the data has a flexible structure, and analytical query performance is best when data is stored in a denormalized fashion. When the data is sharded, the multiple query predicates in an analytical query make aggregating data from a few or all nodes an expensive process, so sharded execution performs poorly compared to executing the same query in a stand-alone environment.

    Committee: Karen Davis Ph.D. (Committee Chair); Raj Bhatnagar Ph.D. (Committee Member); Paul Talaga Ph.D. (Committee Member) Subjects: Computer Science
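The data-modeling trade-off this abstract describes can be sketched in plain Python, with dicts standing in for MongoDB collections (the collection and field names here are hypothetical, and no actual MongoDB driver is used): a normalized design needs an application-side join, analogous to `$lookup`, while a denormalized design with embedded documents answers the same analytical query in a single collection scan.

```python
# Normalized: separate "collections" linked by customer_id (hypothetical schema).
customers = [{"_id": 1, "name": "Ann"}, {"_id": 2, "name": "Bob"}]
sales = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 7.5},
]

def total_per_customer_normalized():
    # Application-side join: group the sales, then look up each customer name.
    totals = {}
    for s in sales:
        totals[s["customer_id"]] = totals.get(s["customer_id"], 0.0) + s["amount"]
    by_id = {c["_id"]: c["name"] for c in customers}
    return {by_id[cid]: amt for cid, amt in totals.items()}

# Denormalized: each document embeds its own sales, so no join is needed.
customers_denorm = [
    {"_id": 1, "name": "Ann", "sales": [10.0, 5.0]},
    {"_id": 2, "name": "Bob", "sales": [7.5]},
]

def total_per_customer_denormalized():
    return {c["name"]: sum(c["sales"]) for c in customers_denorm}

print(total_per_customer_normalized())    # {'Ann': 15.0, 'Bob': 7.5}
print(total_per_customer_denormalized())  # same answer, single collection scan
```

In a sharded deployment the join side additionally has to gather partial aggregates from multiple nodes, which is the cost the thesis measures.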
  • 3. Xiong, Hui Combining Subject Expert Experimental Data with Standard Data in Bayesian Mixture Modeling

    Doctor of Philosophy, The Ohio State University, 2011, Industrial and Systems Engineering

    Engineers face many quality-related datasets containing free-style text or images. For example, a database could include summaries of complaints filed by customers, descriptions of the causes of rework or maintenance and of the associated actions taken, or a collection of quality inspection images of welded tubes. The goal of this dissertation is to enable engineers to input a database of free-style text or image data and then obtain a set of clusters or “topics” with intuitive definitions and information about the degree of commonality that together help prioritize system improvement. The proposed methods generate Pareto charts of ranked clusters or topics, with their interpretability improved by input from the analyst or method user. The combination of subject matter expert data with standard data is the novel feature of the methods considered. Prior to the methods proposed here, analysts applied Bayesian mixture models and had limited recourse if the cluster or topic definitions failed to be interpretable or were at odds with the knowledge of subject matter experts. The associated “Subject Matter Expert Refined Topic” (SMERT) model permits ongoing knowledge elicitation and high-level human expert data integration to address two issues: (1) unsupervised topic models often produce results that are not interpretable to the user, and (2) human experts need a structured way, the “Hierarchical Analysis Designed Latency Experiment” (HANDLE), to interact with the model results. If groupings are missing key elements, so-called “boosting” of these elements is possible. If certain members of a cluster are nonsensical or nonphysical, so-called “zapping” of these elements is possible. We also describe a fast Collapsed Gibbs Sampling (CGS) algorithm for the SMERT method, which can fit SMERT models to large datasets efficiently but is associated with approximations in certain cases. We use three case studies to illustrate the proposed methods. 
The first relates to scrap text reports for a Ch (open full item for complete abstract)

    Committee: Theodore Allen PhD (Advisor); Suvrajeet Sen PhD (Committee Member); David Woods PhD (Committee Member) Subjects: Computer Science; Engineering; Industrial Engineering; Information Technology
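The collapsed Gibbs sampler this abstract accelerates can be illustrated on a toy two-topic LDA-style mixture. This is a generic textbook sampler, not the SMERT model itself (which adds expert boosting/zapping data); the corpus, vocabulary, and hyperparameters are toy assumptions.

```python
import random

random.seed(0)

docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [2, 2, 3]]  # word ids per document
V, K, alpha, beta = 4, 2, 0.1, 0.1                   # vocab size, topics, priors

# z[d][i]: topic of word i in doc d; count tables used by the collapsed sampler.
z = [[random.randrange(K) for _ in doc] for doc in docs]
ndk = [[0] * K for _ in docs]        # topic counts per document
nkw = [[0] * V for _ in range(K)]    # word counts per topic
nk = [0] * K                         # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

for _ in range(200):  # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the current assignment before sampling its replacement.
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            # Collapsed conditional: p(t) ∝ (ndk+alpha) * (nkw+beta)/(nk+V*beta)
            p = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                 for k in range(K)]
            r = random.random() * sum(p)
            t = 0
            while r > p[t]:
                r -= p[t]; t += 1
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
```

Collapsing integrates out the topic and word distributions analytically, so only the count tables are stored, which is what makes the sampler fast on large datasets.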
  • 4. Nepal, Suranjan Incorporation and Evaluation of Parametric Wind and Rainfall Models for Compound Flooding in a Discontinuous Galerkin Storm Surge Framework

    Master of Science, The Ohio State University, 2024, Civil Engineering

    Hurricanes frequently bring intense, heavy rainfall, which compounds with storm surge and contributes significantly to their overall impact. Current models treat these events as disjoint, which leads to underestimation of the resulting flooding. Our research aims to bridge this gap by integrating two parametric rainfall models — the R-CLIPER (Rainfall CLImatology and PERsistence) and IPET (Interagency Performance Evaluation Task Force Rainfall Analysis) models — into an existing Discontinuous Galerkin Shallow Water Equation Model (DG-SWEM). First, we aim to showcase the effectiveness of the R-CLIPER and IPET models in replicating observed rainfall patterns. To do this, we use Quantitative Precipitation Forecast (QPF) indices to check how well these models match observed rainfall around the storm center, in mean and distributed volumes, and at extreme precipitation values. Subsequently, we integrate these models into the DG-SWEM storm surge framework. Our hypothesis is that incorporating rainfall into an existing storm surge model will enhance its ability to predict the flooding that follows hurricanes. To test this hypothesis, we evaluate the performance of DG-SWEM with and without the inclusion of rainfall for six storms at available United States Geological Survey (USGS) high water marks. The results clearly demonstrate that the inclusion of rainfall can improve compound flood simulations.

    Committee: Ethan Kubatko (Advisor); Andy May (Committee Member); Jim Stagge (Committee Member) Subjects: Civil Engineering; Environmental Engineering
  • 5. Katugoda Gedara, Ayesha Kumari Ekanayaka Refining Climate Model Projections: Spatial Statistical Downscaling and Bayesian Model Averaging for Climate Model Integration

    PhD, University of Cincinnati, 2024, Arts and Sciences: Statistics

    In this dissertation, two innovative statistical methodologies are developed to enhance the accuracy of climate model projections. Climate models simulate future global climate conditions but are constrained by coarse resolutions due to computational limitations. Consequently, these projections must be refined to finer resolutions before they can be effectively utilized in regional studies. The first methodology introduces a novel spatial statistical model for downscaling climate model projections. This approach significantly enhances precision by incorporating spatial dependencies and stands out by providing meaningful uncertainty estimates, a feature often missing in many previous downscaling approaches. Additionally, the method achieves computational efficiency through a basis representation, making it adept at managing large datasets effectively. Furthermore, climate models originate from various research groups, each based on different understandings and assumptions about the Earth's climate. This leads to significant uncertainty in the choice of models for subsequent analysis. Since there is no definitive way to select the best model or a few reliable ones, climate scientists often seek methods to combine projections from multiple models to mitigate this uncertainty. The second method in this dissertation introduces a comprehensive approach to integrate projections from multiple climate models using Bayesian Model Averaging (BMA). The proposed method effectively tackles the challenge of implementing BMA for climate model integration in a full Bayesian framework by employing Polya-Gamma augmentation and yields combined climate projections with improved accuracy and reliable uncertainty estimates.

    Committee: Emily Kang Ph.D. (Committee Chair); Bledar Konomi Ph.D. (Committee Member); Won Chang Ph.D. (Committee Member) Subjects: Statistics
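The BMA combination step this abstract describes can be sketched with a simplified, textbook weighting scheme: each model's projection is weighted by an approximate posterior model probability. The BIC-based weights below are a common stand-in, not the dissertation's full Bayesian treatment (which uses Polya-Gamma augmentation); the projections and BIC scores are hypothetical.

```python
import math

def bma_combine(projections, bics):
    """projections: per-model projected values; bics: per-model BIC scores."""
    # w_m ∝ exp(-BIC_m / 2); subtract the minimum BIC for numerical stability.
    b0 = min(bics)
    raw = [math.exp(-(b - b0) / 2) for b in bics]
    s = sum(raw)
    weights = [r / s for r in raw]
    # The combined projection is the weight-averaged projection.
    combined = sum(w * p for w, p in zip(weights, projections))
    return combined, weights

# Three hypothetical climate models projecting a temperature anomaly (deg C).
combined, weights = bma_combine([1.8, 2.4, 3.1], bics=[100.0, 102.0, 108.0])
print(combined, weights)  # the best-scoring model dominates the average
```

The full Bayesian version additionally propagates weight uncertainty into the combined projection, which is what yields the reliable uncertainty estimates mentioned above.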
  • 6. Bhatnagar, Saumya Computer Model Emulation and Calibration using Deep Learning

    PhD, University of Cincinnati, 2022, Arts and Sciences: Mathematical Sciences

    The focus of this thesis is the use of deep learning methods for computer model calibration and uncertainty quantification. Computer model calibration is the process of combining information from computer model outputs and observation data to make inference about unknown input parameters of the computer model. The calibration framework involves an emulation step, which faces computational issues when data are high-dimensional, and a calibration step, which faces inferential issues due to the nonidentifiability between the input parameters and data-model discrepancy. The main aim of this thesis is to address these computational and inferential issues using deep learning methods. This thesis contributes to the field of computer model calibration in the following ways: 1) developing a new inverse model-based computer model calibration framework that utilizes the feature extraction ability of deep neural networks to efficiently handle high-dimensional data while filtering out the effects of data-model discrepancy; 2) formulating a computationally efficient generative deep learning model-based emulation method for large spatial data; 3) developing a Siamese neural network and approximate Bayesian computation-based calibration method that can efficiently resolve the issue of data-model discrepancy. The proposed methods have been successfully applied to calibrate important climate models such as the Weather Research and Forecasting Model (WRF-Hydro) and the University of Victoria Earth System Climate Model (UVic ESCM).

    Committee: Won Chang Ph.D. (Committee Member); Siva Sivaganesan Ph.D. (Committee Member); Bledar Konomi Ph.D. (Committee Member); Emily Kang Ph.D. (Committee Member) Subjects: Statistics
  • 7. Wen, Zhezhu Building Resilience through Supply Chain Agility: Cross-sectional and Longitudinal Studies

    Doctor of Philosophy, University of Toledo, 2022, Manufacturing and Technology Management

    The COVID-19 pandemic disrupted the business world on an unprecedented scale, with interrupted operations worldwide as its most visible effect. COVID-19 is an extreme example demonstrating the magnitude of challenges a turbulent and hard-to-predict business environment can pose to modern firms. Facing a dynamic environment, firms need to move much faster than the rate at which surprises arrive in order to stay competitive and resilient. The emergency response literature identifies two kinds of measures organizations can use to cope with unexpected emergencies. One is structural factors such as documented plans and standard operating procedures; because these can be documented and copied, they cannot be a source of competitive advantage. Nonstructural factors such as improvisation, adaptability, and creativity (Harrald, 2006), however, can become an essential source of competitive advantage since they cannot be imitated or transferred easily. This study complements and extends the supply chain agility literature by empirically testing the impact of supply chain agility using cross-sectional and longitudinal study designs. Given that very few studies have tested agility in contexts other than the Western world, the first study is designed to understand supply chain agility in different cultural clusters. Specifically, a theory-driven agility measurement is developed. Then, using the International Manufacturing Strategy Survey VI data, the impact of supply chain agility on operational performance is tested. Further, combined with GLOBE project cultural dimension data, the study explores how the agility-performance linkage works in different societal cultures. A multi-level model is formulated to test the research hypotheses. The results show that a focal firm's supply chain agility positively improves operational performance. They also support the hypotheses that a performance-based culture hinders agility and that a socially supportive culture facilitates agility. 
Project management li (open full item for complete abstract)

    Committee: Paul Hong (Committee Chair); Xinghao Yan (Committee Member); Collin Gilstrap (Committee Member); James Bland (Committee Member) Subjects: Economics; Finance; Management
  • 8. Kwon, Kihyun The Relationship between Socio-Demographic Constraints, Neighborhood Built Environment, and Travel Behavior: Three Empirical Essays

    Doctor of Philosophy, The Ohio State University, 2022, City and Regional Planning

    Socio-demographics may represent constraints that shape different travel outcomes of individuals. This has led to studies with not only different findings on travel behavior, but also mixed and inconclusive conclusions on the effects of the built environment on individuals' travel outcomes. There are gaps in many existing studies on the relationship between socio-demographics, built environment, and travel behavior that need to be filled. In addition, the existing literature has not paid much attention to the varying impacts of the neighborhood built environment on travel outcomes across different socio-demographic groups. Many indicators from the U.S. Census Bureau and the Centers for Disease Control and Prevention (CDC) show that the socio-demographics of U.S. society are undergoing significant changes. It is uncertain how these changes may affect travel behavior in the short term and the long term. In the face of this uncertainty, a key challenge for transportation planners and policymakers is to understand how socio-demographics affect individuals' travel outcomes and out-of-home activities. These major trends affecting future travel patterns will dramatically reshape transportation priorities and needs. This dissertation quantitatively examines the links between socio-demographic constraints, neighborhood built environments, and travel behavior. It comprises three essays. The first essay explores gender differences in commute behavior with a focus on two-earner households. The second essay examines the links between walkability and transit use, focusing on the differences between disabled individuals and others. The third essay explores how neighborhood walkability affects older adults' walking trips, considering different household income levels. The first essay utilizes detailed individual-level data from the 2001, 2009, and 2017 National Household Travel Surveys (NHTS). The NHTS datasets provide information on travel by U.S. 
residen (open full item for complete abstract)

    Committee: Gulsah Akar (Advisor); Zhenhua Chen (Advisor); Harvey J. Miller (Committee Member) Subjects: Transportation Planning; Urban Planning
  • 9. Zhang, Jieyan Bayesian Hierarchical Modeling for Dependent Data with Applications in Disease Mapping and Functional Data Analysis

    PhD, University of Cincinnati, 2022, Arts and Sciences: Mathematical Sciences

    Bayesian hierarchical modeling has a long history but did not receive wide attention until the past few decades. Its advantages include a flexible structure and the capability to incorporate uncertainty into inference. This dissertation develops two Bayesian hierarchical models for the following two scenarios: first, spatial data on time to disease outbreak and disease duration; second, large or high-dimensional functional data that may cause computational burden and require rank reduction. In the first case, we use data on cucurbit downy mildew, an economically important plant disease, recorded in sentinel plot systems in 23 states in the eastern United States in 2009. The joint model is established on the dependency of the spatially correlated random effects, or frailty terms. We apply a parametric Weibull distribution to the censored time-to-outbreak data and a zero-truncated Poisson distribution to the disease duration data. We consider several competing process models for the frailty terms in a simulation study. Given that the generalized multivariate conditionally autoregressive (GMCAR) model, which captures both correlation and spatial structure, provides preferred DIC and LOOIC results, we choose the GMCAR model for the real data. The proposed joint Bayesian hierarchical model indicates that states in the mid-Atlantic region tend to have a high risk of disease outbreak and, in infected cases, a long duration of cucurbit downy mildew. The second Bayesian hierarchical model smooths functional curves simultaneously and nonparametrically with improved computational efficiency. Similar to its frequentist counterpart, principal analysis by conditional expectation, the second model reduces rank through multi-resolution spline basis functions in the process model. The proposed method outperforms the commonly used B-spline basis functions by providing slightly better estimation within a much shorter computing time. 
The performanc (open full item for complete abstract)

    Committee: Emily Kang Ph.D. (Committee Member); Seongho Song Ph.D. (Committee Member); Bledar Konomi Ph.D. (Committee Member); Won Chang Ph.D. (Committee Member) Subjects: Statistics
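The two likelihood components named in the abstract above are standard distributions and can be written down directly: a Weibull density for time to disease outbreak and a zero-truncated Poisson pmf for disease duration. Parameter values here are illustrative only, not estimates from the cucurbit downy mildew data.

```python
import math

def weibull_pdf(t, shape, scale):
    # f(t) = (k/λ) (t/λ)^{k-1} exp(-(t/λ)^k)
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-((t / scale) ** shape))

def zt_poisson_pmf(y, lam):
    # P(Y = y | Y > 0) = e^{-λ} λ^y / (y! (1 - e^{-λ})), for y = 1, 2, ...
    if y < 1:
        return 0.0
    return math.exp(-lam) * lam ** y / (math.factorial(y) * (1 - math.exp(-lam)))

# The truncation renormalizes the ordinary Poisson over y >= 1, so the pmf
# sums to 1 over its support (the tail beyond y = 59 is negligible here).
total = sum(zt_poisson_pmf(y, 2.5) for y in range(1, 60))
print(total)
```

In the joint model these two components share spatially correlated frailty terms, which is what links outbreak risk and duration across states.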
  • 10. Khadka, Pravakar Three-Dimensional Hydrodynamic Modeling to Analyze Salinity Interaction of Coastal Marshland with a Lake: A Case Study of Mentor Marsh near Lake Erie, Ohio

    Master of Science in Engineering, Youngstown State University, 2020, Department of Civil/Environmental and Chemical Engineering

    Salinization is a global threat to the ecological functioning and development of coastal wetlands. Therefore, the study of salinity interaction between a wetland and the coastal estuary is crucial to determining the salinity distribution and its variation in coastal wetlands. This study was conducted primarily to investigate the distribution of salinity in the Mentor Marsh wetland using a hydrodynamic model. The marsh is a coastal estuary system located within the Ohio Lake Basin that has experienced increased levels of salinity since the early 1960s, especially after the placement of salt mine tailings near the marsh. Consequently, the increased salinity has induced drastic vegetative change throughout Mentor Marsh, leading to the rapid spread of Phragmites australis, which, when dry, is highly prone to catching fire. Ten monitoring stations were established within Mentor Marsh to record hourly salinity, water level, and stream temperature data. A graphical analysis of the observed salinity was performed at several locations within the western basin of Mentor Marsh. Furthermore, a three-dimensional hydrodynamic Environmental Fluid Dynamics Code Plus (EFDC+) model was developed for the western section of Mentor Marsh utilizing the measured data from five monitoring stations in the western basin. Most of the meteorological data needed for this model were obtained from the National Oceanic and Atmospheric Administration (NOAA). Cloud cover and precipitation data were acquired from nearby airports, while solar radiation data were obtained from the United States Department of Agriculture (USDA). Similarly, bathymetry data were prepared by integrating the shoreline GIS shapefiles of Lake Erie and Mentor Marsh with a detailed survey conducted in the marina and adjoining marsh in order to appropriately represent the bathymetry in the EFDC+ model. 
The water levels at Lake Erie and Mentor (open full item for complete abstract)

    Committee: Suresh Sharma PhD (Advisor); Shakir Husain PhD (Committee Member); Felicia Armstrong PhD (Committee Member); Thomas Mathis MS (Committee Member) Subjects: Civil Engineering; Environmental Engineering
  • 11. Zhu, Zheng A Unified Exposure Prediction Approach for Multivariate Spatial Data: From Predictions to Health Analysis

    PhD, University of Cincinnati, 2019, Medicine: Biostatistics (Environmental Health)

    Epidemiological cohort studies of health effects often rely on spatial models to predict ambient air pollutant concentrations at participants' residential addresses. Compared with traditional linear regression models, spatial models such as kriging provide accurate predictions by taking into account spatial correlations within the data. Spatial models utilize regression covariates from high-dimensional databases provided by geographic information systems (GIS). This modeling requires dimension reduction techniques such as partial least squares, lasso, and elastic net. In the first chapter of this thesis, we present a comparison of the performance of four potential spatial prediction models. The first two approaches are based on universal kriging (UK). The third and fourth approaches are based on random forests and Bayesian additive regression trees (BART), with some degree of spatial smoothing. Multivariate spatial models are often considered for point-referenced spatial data, which contain multiple measurements at each monitoring location, so correlation between measurements is anticipated. In the second chapter of the thesis, we propose a chain model for analyzing multivariate spatial data. We show that the chain model outperforms other spatial models such as universal kriging and the coregionalization model. In the third chapter, we connect our spatial analysis with epidemiological studies of the health effects of environmental chemical mixtures. Specifically, we investigate the relationship between environmental chemical mixture exposure and the cognitive and motor development of infants. We propose a framework to analyze the health effects of environmental chemical mixtures: we first perform dimension reduction of the exposure variables using principal component analysis, and in the second stage we apply best subset regression to obtain the final model.

    Committee: Roman Jandarov Ph.D. (Committee Chair); Sivaraman Balachandran Ph.D. (Committee Member); Won Chang Ph.D. (Committee Member); Marepalli Rao Ph.D. (Committee Member) Subjects: Statistics
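The two-stage mixture framework in the abstract above (PCA for dimension reduction, then regression of the health outcome on the leading components) can be sketched with NumPy alone. The data here are simulated under assumed dimensions; in the thesis the exposures come from GIS covariates and the outcomes are infant cognitive/motor scores, and the second stage is best subset regression rather than the plain least squares shown.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 200, 6
latent = rng.normal(size=(n, 2))                       # two underlying exposure sources
loadings = rng.normal(size=(2, p))
X = latent @ loadings + 0.1 * rng.normal(size=(n, p))  # correlated exposures
y = 1.5 * latent[:, 0] - 0.5 * latent[:, 1] + rng.normal(scale=0.2, size=n)

# Stage 1: PCA via SVD of the centered exposure matrix; keep two components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T

# Stage 2: ordinary least squares of the outcome on the component scores.
Z = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ coef
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(float(r2))  # two components should explain most of the outcome variance
```

Because the simulated exposures are essentially rank two, the first two principal components recover the latent sources and the regression explains nearly all of the outcome variance.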
  • 12. Jiang, Hui Missing Data Treatments in Multilevel Latent Growth Model: A Monte Carlo Simulation Study

    Doctor of Philosophy, The Ohio State University, 2014, EDU Policy and Leadership

    Under the framework of structural equation modeling (SEM), longitudinal data can be analyzed using latent growth models (LGM). An extension of the simple LGM is the multilevel latent growth model, which can be used to fit clustered data. The purpose of this study is to investigate the performance of five missing data treatments (MDTs) for handling missingness due to longitudinal attrition in a multilevel LGM. The MDTs are: (1) listwise deletion (LD), (2) full information maximum likelihood (FIML), (3) expectation-maximization (EM) imputation, (4) multiple imputation based on regression (MI-Reg), and (5) MI based on predictive mean matching (MI-PMM). A Monte Carlo simulation study was conducted to explore the research questions. First, population parameter values for the model were estimated from a nationally representative sample of elementary school students. Datasets were then simulated based on a two-level LGM, with different growth trajectories (constant, decelerating, accelerating) and at varying sample sizes (200, 500, 2000, 10000). After the datasets were generated, a designated proportion of data points (5%, 10%, 20%) was deleted under different mechanisms of missingness (MAR, MNAR), and the five missing data treatments were applied. Finally, the parameter estimates produced by each missing data treatment were compared to the true population parameter values and to each other according to four evaluation criteria: parameter estimate bias, root mean square error, length of 95% confidence intervals (CIs), and coverage rate of 95% CIs. Among the five MDTs studied, FIML is the only one that yields a satisfactory bias level as well as coverage rate for all parameters across all sample sizes, attrition rates, and growth trajectories under MAR. It is also the only MDT that consistently outperforms the conventional MDT, LD, in every respect, especially as the missingness rate increases. 
Under MNAR, however, estimates of the predictor effects on slopes become biased and coverage for those two paramet (open full item for complete abstract)

    Committee: Richard Lomax (Advisor); Paul Gugiu (Committee Member); Eloise Kaizar (Committee Member) Subjects: Education; Statistics
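The four evaluation criteria listed in the abstract above (bias, RMSE, mean 95% CI length, and 95% CI coverage) are straightforward to compute over replicated estimates. The replications below are simulated directly from a normal distribution as a stand-in; in the study they would come from fitting the multilevel LGM under each missing-data treatment.

```python
import math
import random

random.seed(1)

true_value = 0.50
# 2000 replications of (parameter estimate, standard error) for one condition.
reps = [(random.gauss(true_value, 0.05), 0.05) for _ in range(2000)]

bias = sum(est for est, _ in reps) / len(reps) - true_value
rmse = math.sqrt(sum((est - true_value) ** 2 for est, _ in reps) / len(reps))
ci_len = sum(2 * 1.96 * se for _, se in reps) / len(reps)
coverage = sum(
    1 for est, se in reps if est - 1.96 * se <= true_value <= est + 1.96 * se
) / len(reps)

print(f"bias={bias:.4f} rmse={rmse:.4f} ci_len={ci_len:.4f} coverage={coverage:.3f}")
```

With well-calibrated standard errors, as here, coverage lands near the nominal 0.95; an MDT that biases estimates or understates uncertainty shows up as inflated bias/RMSE or coverage well below nominal.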
  • 13. Codd, Casey A Review and Comparison of Models and Estimation Methods for Multivariate Longitudinal Data of Mixed Scale Type

    Doctor of Philosophy, The Ohio State University, 2014, Psychology

    Models for the joint analysis of multiple outcome variables of possibly different scale types are useful because they allow researchers to answer questions about the association between the trajectories of several variables and provide a way to evaluate the change in the association between variables over time. There are several types of models that can handle multivariate longitudinal data, but one common approach uses generalized linear mixed models with correlated random effects. This type of model is quickly growing in popularity in fields such as medicine and biostatistics, and the potential applications in the behavioral sciences are also quite broad. The limiting feature has been that estimating the model parameters involves integration over the random effects to obtain the marginal distribution of the data. In this dissertation, a review of several widely available estimation methods and models for multivariate longitudinal data is provided. To evaluate the performance of the estimation methods within the multivariate generalized linear mixed model framework, a simulation study was conducted. The estimation methods of interest were adaptive Gaussian quadrature (AGQ), Laplace approximation (LA), penalized quasi-likelihood (PQL), and marginal quasi-likelihood (MQL). Results indicated that although AGQ and LA typically produce parameter estimates with less bias, PQL and MQL tend to produce more stable estimates of the covariance matrix of the random effects. Furthermore, even though PQL performed quite poorly in many conditions, the bias of its parameter estimates tended to decrease as the correlation between the random effects increased. Data from the National Longitudinal Study of Youth are used to illustrate the applicability of the multivariate generalized linear mixed model to behavioral data. A summary of the major findings from the literature review and the numerical studies is given.

    Committee: Robert Cudeck Ph.D. (Advisor); Michael Edwards Ph.D. (Committee Member); Minjeong Jeon Ph.D. (Committee Member) Subjects: Quantitative Psychology
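The integration problem the estimation methods above approximate can be made concrete for the simplest case: the marginal likelihood of one cluster in a random-intercept logistic model, L = ∫ Π_j p(y_j | b) φ(b; 0, σ²) db, evaluated with (non-adaptive) Gauss-Hermite quadrature. The data and parameters are illustrative; AGQ additionally recenters and rescales the nodes per cluster.

```python
import math
import numpy as np

def cluster_marginal_likelihood(y, eta, sigma, n_nodes=20):
    """y: 0/1 responses in one cluster; eta: fixed-effect linear predictor."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    total = 0.0
    for x, w in zip(nodes, weights):
        b = math.sqrt(2.0) * sigma * x  # change of variables for N(0, sigma^2)
        lik = 1.0
        for yj in y:
            p = 1.0 / (1.0 + math.exp(-(eta + b)))  # logistic response probability
            lik *= p if yj == 1 else 1.0 - p
        total += w * lik
    return total / math.sqrt(math.pi)  # Gauss-Hermite normalizing constant

val = cluster_marginal_likelihood([1, 0, 1], eta=0.2, sigma=1.0)
print(val)
```

As σ → 0 the integral collapses to the plain product of Bernoulli probabilities, which gives a handy sanity check; the full model likelihood is the product of such cluster terms, explaining why the number of quadrature nodes drives the cost of AGQ.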
  • 14. Li, Qian Approaches to Find the Functionally Related Experiments Based on Enrichment Scores: Infinite Mixture Model Based Cluster Analysis for Gene Expression Data

    PhD, University of Cincinnati, 2013, Arts and Sciences: Mathematical Sciences

    DNA microarray is a widely used high-throughput technology to measure the expression levels of tens of thousands of genes simultaneously. With the increasing availability of microarray genomics data, various clustering algorithms have been explored to identify latent patterns in gene expression data as well as to discover disease subtypes. Connections that correlate differentially expressed gene evidence with other biological information are very important for developing a full picture of the biological pathways and for giving insightful suggestions for newly conducted experiments. The biological information needed to identify a disease signature is organized into functional categories. Thus, relating microarray experiments to the functional categories could lead to a better understanding of the underlying biological process and help develop targeted treatments for a specific disease. In this dissertation, we investigate several Dirichlet process mixture (DPM) model-based clustering methods that explicitly account for interactions across the functional category enrichment scores for improved sample clustering. Our clustering method represents microarray enrichment score profiles as multivariate Gaussian random variables with structured or unstructured correlation. We also demonstrate in a simulation study that when correlation exists, our algorithm outperforms clustering algorithms that assume independence. Furthermore, a factor analysis-based clustering procedure is developed to search for the correct underlying correlation pattern, and we optimize the number of factors using a model selection algorithm based on the Metropolised Carlin and Chib method. In this way, we reduce the number of parameters to be estimated in the unstructured covariance matrix model and also incorporate the unknown variance-covariance structure across different functional categories. 
The main contributions of our ap (open full item for complete abstract)

    Committee: Siva Sivaganesan Ph.D. (Committee Chair); Seongho Song Ph.D. (Committee Member); Xia Wang Ph.D. (Committee Member) Subjects: Statistics
  • 15. Wang, Chao Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data

    Doctor of Philosophy, The Ohio State University, 2008, Computer and Information Science

    This work seeks to develop a probabilistic framework for modeling, querying and analyzing large-scale structured and semi-structured data. The framework has three components: (1) Mining non-redundant local patterns from data; (2) Gluing these local patterns together by employing probabilistic models (e.g., Markov random field (MRF), Bayesian network); and (3) Reasoning over the data for solving various data analysis tasks. Our contributions are as follows: (a) We present an approach of employing probabilistic models to identify non-redundant itemset patterns from a large collection of frequent itemsets on transactional data. Our approach can effectively eliminate redundancies from a large collection of itemset patterns. (b) We propose a technique of employing local probabilistic models to glue non-redundant itemset patterns together in tackling the link prediction task in co-authorship network analysis. Our technique effectively combines topology analysis on network structure data and frequency analysis on network event log data. The main idea is to consider the co-occurrence probability of two end nodes associated with a candidate link. We propose a method of building MRFs over local data regions to compute this co-occurrence probability. Experimental results demonstrate that the co-occurrence probability inferred from the local probabilistic models is very useful for link prediction. (c) We explore employing global models, models over large data regions, to glue non-redundant itemset patterns together. We investigate learning approximate global MRFs on large transactional data and propose a divide-and-conquer style modeling approach. Empirical study shows that the models are effective in modeling the data and approximately answering queries on the data. (d) We propose a technique of identifying non-redundant tree patterns from a large collection of structural tree patterns on semi-structured XML data. 
Our approach can effectively eliminate redundancies from a larg (open full item for complete abstract)
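    As a concrete illustration of the co-occurrence idea in contribution (b), the sketch below estimates pairwise co-occurrence probabilities directly from event-log frequencies and ranks candidate links by them. This is only a frequency-counting stand-in for illustration: the dissertation infers the probability from MRFs built over local data regions, and the function names here are hypothetical.

    ```python
    from collections import Counter
    from itertools import combinations

    def cooccurrence_scores(event_log):
        """Estimate pairwise co-occurrence probabilities from an event log.

        event_log: list of transactions, each a set of node ids (e.g.,
        the author set of one paper). P(u, v co-occur) is estimated as
        count of transactions containing both / total transactions.
        """
        n = len(event_log)
        pair_counts = Counter()
        for transaction in event_log:
            for u, v in combinations(sorted(transaction), 2):
                pair_counts[(u, v)] += 1
        return {pair: c / n for pair, c in pair_counts.items()}

    def rank_candidate_links(event_log, candidates):
        """Rank candidate links by estimated co-occurrence probability."""
        scores = cooccurrence_scores(event_log)
        return sorted(candidates,
                      key=lambda p: scores.get(tuple(sorted(p)), 0.0),
                      reverse=True)
    ```

    An MRF-based version would replace the raw relative frequency with a probability inferred from a model fitted over the local region around the two end nodes.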

    Committee: Srinivasan Parthasarathy (Advisor) Subjects: Computer Science
  • 16. Jeanty, Pierre Two essays on environmental and food security

    Doctor of Philosophy, The Ohio State University, 2006, Agricultural, Environmental and Development Economics

    The first essay of this dissertation attempts to incorporate uncertainty into double bounded dichotomous choice contingent valuation. The double bounded approach, which entails asking respondents a follow-up question after they have answered a first question, has emerged as a means to increase efficiency in willingness to pay (WTP) estimates. However, several studies have found inconsistency between WTP estimates generated by the first and second questions. In this study, it is posited that this inconsistency is due to the uncertainty facing respondents when the second question is introduced. The author seeks to understand whether posing the follow-up question in a stochastic format, which allows respondents to express uncertainty, would alleviate the inconsistency problem. In a contingent valuation survey to estimate the non-market economic benefits of using more biodiesel versus petroleum diesel fuel in an airshed encompassing Southeastern and Central Ohio, it is found that the gap between WTP estimates produced by the first and second questions narrows when respondents are allowed to express uncertainty. The proposed stochastic follow-up approach yields more efficient WTP estimates than the conventional follow-up approach while maintaining its efficiency gain over the single bounded model. In the second essay, instrumental variable panel data techniques are applied to estimate the effects of civil wars and violent conflicts on food security in a sample of 73 developing countries from 1970 to 2002. The study aims to provide empirical evidence as to whether the manifest increase in the number of hungry people can be ascribed to civil unrest. From a statistical standpoint, the results convincingly pinpoint the danger of using conventional panel data estimators when endogeneity is of the conventional simultaneous equation type, i.e., with respect to the idiosyncratic error term. 
From a policy viewpoint, it is found that, in general, civil wars and conflicts are detrimental to fo (open full item for complete abstract)

    Committee: Fred Hitzhusen (Advisor) Subjects: Economics, Agricultural
  • 17. Kang, Heechan Essays on methodologies in contingent valuation and the sustainable management of common pool resources

    Doctor of Philosophy, The Ohio State University, 2006, Agricultural, Environmental and Development Economics

    The first essay compares the single bounded model with both the bivariate probit model and the interval data model in terms of bias as well as statistical efficiency. The Monte Carlo simulation and empirical estimation in this paper test the usefulness of dichotomous choice contingent valuation with a follow-up survey. In terms of efficiency, the bivariate probit model is not always better than the single bounded model, and the interval data model performs better than the single bounded model only if the true means and true variances of the first and second WTP responses are relatively close. As for bias, the parameters estimated from the bivariate probit model are very close to those of the single bounded model, but those of the interval data model show a large discrepancy that does not disappear as the sample size increases when the true means and true variances of the first and second responses are substantially different. In the second essay, I develop a new method to diagnose inconsistency in dichotomous choice contingent valuation with follow-up questions. I show that previous methods aimed at explaining behavioral inconsistency in responses have ignored statistical inconsistency (non-perfect correlation between the initial and follow-up responses) and thus have pointed in the wrong direction for explaining respondents' inconsistency patterns. In addition, applying our method, I prove that no single model can encompass all possible inconsistency patterns in responses: the behavioral inconsistency patterns differ both within and between data sets. In the third essay, I develop a theoretical model capable of explaining the existence of sustainable common pool resource equilibria in the absence of external regulation. I combine ideas from the literature on social norms in an iterative game theory framework to establish the existence of multiple sustainable common pool equilibria.
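    The Monte Carlo comparison above starts from simulated double-bounded survey responses. The sketch below shows only that data-generating step, under simple illustrative assumptions (latent WTP drawn from a normal distribution, follow-up bid doubled after a "yes" and halved after a "no"); the function name and parameters are hypothetical, and the essay's actual estimators (single bounded, bivariate probit, interval data) are not reproduced here.

    ```python
    import random

    def simulate_double_bounded(n, mu, sigma, bids, seed=0):
        """Simulate n double-bounded dichotomous-choice responses.

        Each respondent has latent WTP ~ Normal(mu, sigma) and faces a
        randomly chosen initial bid; a 'yes' doubles the bid for the
        follow-up question, a 'no' halves it. Returns a list of
        (bid1, answer1, bid2, answer2) tuples for downstream estimation.
        """
        rng = random.Random(seed)
        data = []
        for _ in range(n):
            wtp = rng.gauss(mu, sigma)
            bid1 = rng.choice(bids)
            ans1 = wtp >= bid1
            bid2 = bid1 * 2 if ans1 else bid1 / 2
            ans2 = wtp >= bid2
            data.append((bid1, ans1, bid2, ans2))
        return data
    ```

    The two responses bracket each respondent's WTP into an interval, which is exactly the information the interval data and bivariate probit models exploit differently.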

    Committee: Timothy Haab (Advisor) Subjects:
  • 18. Schneider, Chelsey Comprehensive Evaluation of a Data-Based Problem Solving Reading Model

    Specialist in Education, Miami University, 2008, School Psychology

    The purpose of this study was to provide a comprehensive evaluation of a data-based problem solving model for the assessment and intervention of reading problems. The research design was a multiple baseline across ten participants. The length of the baseline varied before the treatment phase was applied, to indicate whether the change in performance corresponded with the introduction of treatment. This design allowed the researcher to determine whether the application of treatment was truly influencing the change in reading performance. First, the study examined whether an individualized, data-based problem solving model leads to increased oral reading fluency for children at risk for poor reading outcomes. Second, the study examined whether such a model leads to generalized effects on comprehension, prosody, academic engagement, and self-efficacy. Third, the study examined whether self-efficacy is a significant predictor of response to intervention.

    Committee: Kevin Jones (Committee Chair); Katherine Wickstrom (Committee Member); Leah Wasburn-Moses (Committee Member); Tonya Watson (Committee Member) Subjects: Education; Educational Psychology; Reading Instruction
  • 19. Chantamas, Wittaya A Multiple Associative Computing Model to Support the Execution of Data Parallel Branches Using the Manager-worker Paradigm

    PHD, Kent State University, 2009, College of Arts and Sciences / Department of Computer Science

    The multiple associative computing (MASC) model is an enhanced, strictly synchronous multi-SIMD model that generalizes the associative computing (ASC) model to support multiple ASC threads, using control parallelism to substantially improve the low processor utilization often criticized in SIMDs. The MASC model combines the advantages of both SIMD and MIMD models, such as the simple description, inherently synchronous operations, and ease of programming and debugging of SIMDs, while providing the flexible control flow support of MIMDs with small thread synchronization overheads. In this research, a cycle of simulations is used to show that a MASC model with constant associative operation word length and a MASC model with log n associative operation word length are equivalent in power. Moreover, the MASC model is as powerful as the B-RMBM, S-RMBM, COMMON CRCW PRAM, and BRM models. This research presents a description of a MASC model that uses the manager and worker instruction stream paradigm. A cycle-precision software simulator, which provides the exact number of overhead and execution cycles the model requires to execute a program, is used to demonstrate the performance of this implementation of MASC on various algorithms. The simulator is in effect a software prototype of the manager-worker version of MASC, with sufficient detail to allow a computer engineer to convert it into a hardware prototype. On the example multithreaded algorithms used, when processing large-scale instances with multiple workers, the MASC Floyd-Warshall algorithm shows strong scaling with constant-time overhead and, in the average case, the MASC Quickhull algorithm shows good scaling with low overhead.
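    For reference, the computation underlying the Floyd-Warshall experiment above is the standard all-pairs shortest-path recurrence. In the sequential sketch below, the middle loop over rows is the part a manager-worker implementation would distribute across workers for each pivot k; this illustrative Python version says nothing about the MASC cycle counts themselves.

    ```python
    def floyd_warshall(dist):
        """All-pairs shortest paths on an n x n distance matrix, in place.

        dist[i][j] holds the edge weight from i to j (float('inf') if no
        edge, 0 on the diagonal). For each pivot k, every row i is
        relaxed against row k; in a MASC-style manager-worker scheme,
        each worker would own a slice of the rows in that middle loop.
        """
        n = len(dist)
        for k in range(n):
            row_k = dist[k]
            for i in range(n):          # the loop a worker pool would split
                dik = dist[i][k]
                row_i = dist[i]
                for j in range(n):
                    nd = dik + row_k[j]
                    if nd < row_i[j]:
                        row_i[j] = nd
        return dist
    ```

    Because every worker reads the same pivot row for a given k, the per-pivot synchronization cost is constant, which is consistent with the constant-time overhead reported above.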

    Committee: Johnnie Baker Ph.D. (Advisor); Robert Walker Ph.D. (Committee Member); Mikhail Nesterenko Ph.D. (Committee Member); Andrew Tonge Ph.D. (Committee Member); Mark Kershner Ph.D. (Committee Member) Subjects: Computer Science
  • 20. Choi, Ickwon Computational Modeling for Censored Time to Event Data Using Data Integration in Biomedical Research

    Doctor of Philosophy, Case Western Reserve University, 2011, EECS - Computer and Information Sciences

    Medical prognostic models are designed by clinicians to predict the future course or outcome of disease progression after diagnosis or treatment. The data used to develop these clinical models must contain a high number of events per variable (EPV) for the resulting model to be reliable. If our objective is to optimize predictive performance by some criterion, we can often achieve a reduced model that accepts a small bias in exchange for lower variance, so that its overall performance improves. To accomplish this goal, we propose a new variable selection approach that combines Stepwise Tuning in the Maximum Concordance Index (STMC) and Forward Nested Subset Selection (FNSS) in two stages. In the first stage, the proposed variable selection is employed to identify the best subset of risk factors optimized for the concordance index, using inner cross validation for optimism correction within the outer loop of cross validation, yielding potentially different final models for each of the folds. We then feed the intermediate results of the first stage into another selection method in the second stage to resolve the overfitting problem and to select a final model from the variation of predictors across the selected models. Two case studies on survival data sets of rather different sizes, as well as a simulation study, demonstrate that the proposed approach selects an improved and reduced average model, given a sufficient sample and event size, compared to other selection methods such as stepwise selection using the likelihood ratio test, the Akaike Information Criterion (AIC), and the least absolute shrinkage and selection operator (lasso). Finally, we achieve improved final models in each dataset, compared to the full models, according to most criteria. The selected models and the final models were analyzed in a systematic scheme through validation for independent performance evaluation. 
For the second part of this dissertation, we build prognos (open full item for complete abstract)
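    The selection criterion in the first stage is the concordance index; a minimal sketch of Harrell's C for right-censored data, the usual definition of that index, is shown below. This is an illustrative pairwise computation only, not the dissertation's STMC/FNSS procedure, and the function name is hypothetical.

    ```python
    def concordance_index(times, events, scores):
        """Harrell's concordance index for right-censored survival data.

        times: observed times; events: 1 if the event occurred, 0 if
        censored; scores: predicted risk (higher = earlier expected
        event). A pair (i, j) is comparable when the subject with the
        earlier time had an event; it is concordant when that subject
        also has the higher risk score. Score ties count as 0.5.
        """
        concordant = comparable = 0.0
        n = len(times)
        for i in range(n):
            for j in range(n):
                if events[i] and times[i] < times[j]:
                    comparable += 1
                    if scores[i] > scores[j]:
                        concordant += 1
                    elif scores[i] == scores[j]:
                        concordant += 0.5
        return concordant / comparable
    ```

    A value of 1.0 means risk scores perfectly order the event times, and 0.5 is chance level, which is why the index is a natural objective for stepwise tuning of a prognostic model.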

    Committee: Michael Kattan (Advisor); Mehmet Koyuturk (Committee Chair); Andy Podgurski (Committee Member); Soumya Ray (Committee Member) Subjects: Computer Science