Search Results (1 - 22 of 22 Results)


Abounia Omran, Behzad. Application of Data Mining and Big Data Analytics in the Construction Industry
Doctor of Philosophy, The Ohio State University, 2016, Food, Agricultural and Biological Engineering
In recent years, the digital world has experienced an explosion in the magnitude of data being captured and recorded in various industry fields. Accordingly, big data management has emerged to analyze and extract value out of the collected data. The traditional construction industry is also experiencing an increase in data generation and storage. However, its potential and ability for adopting big data techniques have not been adequately studied. This research investigates the trends of utilizing big data techniques in the construction research community, which eventually will impact construction practice. For this purpose, the application of 26 popular big data analysis techniques in six different construction research areas (represented by 30 prestigious construction journals) was reviewed. Trends, applications, and their associations in each of the six research areas were analyzed. Then, a more in-depth analysis was performed for two of the research areas including construction project management and computation and analytics in construction to map the associations and trends between different construction research subjects and selected analytical techniques. In the next step, the results from trend and subject analysis were used to identify a promising technique, Artificial Neural Network (ANN), for studying two construction-related subjects, including prediction of concrete properties and prediction of soil erosion quantity in highway slopes. This research also compared the performance and applicability of ANN against eight predictive modeling techniques commonly used by other industries in predicting the compressive strength of environmentally friendly concrete. 
The results of this research provide a comprehensive analysis of the current status of applying big data analytics techniques in construction research, including trends, frequencies, and usage distribution in six different construction-related research areas, and demonstrate the applicability and performance level of selected data analytics techniques with an emphasis on ANN in construction-related studies. The main purpose of this dissertation was to help practitioners and researchers identify a suitable and applicable data analytics technique for their specific construction/research issue(s) or to provide insights into potential research directions.
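As a rough illustration of the kind of predictive modeling this dissertation evaluates, the sketch below trains a small feed-forward neural network (one hidden tanh layer, full-batch gradient descent) on synthetic mix-proportion features. The feature names, coefficients, and data are all hypothetical stand-ins, not the dissertation's dataset or its actual ANN configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized inputs: (cement, water, fly ash, age)
X = rng.random((200, 4))
# Made-up "normalized compressive strength" target for the demo
y = (3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 0.8 * X[:, 3]).reshape(-1, 1)

# One hidden tanh layer trained with full-batch gradient descent
W1 = rng.normal(0, 0.5, (4, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.05
for _ in range(2000):
    h = np.tanh(X @ W1 + b1)              # hidden activations
    err = (h @ W2 + b2) - y               # prediction error
    gW2 = h.T @ err / len(X); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1.0 - h ** 2)    # backprop through tanh
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# Training error should end up well below the variance of the target
mse = float((((np.tanh(X @ W1 + b1) @ W2 + b2) - y) ** 2).mean())
```

In practice one would add train/test splits and compare against the other eight predictive techniques the dissertation benchmarks, but the core fit-and-backpropagate loop is the same idea.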

Committee:

Qian Chen, Dr. (Advisor)

Subjects:

Civil Engineering; Comparative Literature; Computer Science

Keywords:

Construction Industry; Big Data; Data Analytics; Data mining; Artificial Neural Network; ANN; Compressive Strength; Environmentally Friendly Concrete; Soil Erosion; Highway Slope; Predictive Modeling; Comparative Analysis

Dhar, Samir. Addressing Challenges with Big Data for Maritime Navigation: AIS Data within the Great Lakes System
Doctor of Philosophy, University of Toledo, 2016, Spatially Integrated Social Science
The study presented here deals with commercial vessel tracking in the Great Lakes using the Automatic Identification System (AIS). Specific objectives within this study include the development of methods for data acquisition, data reduction, storage and management, and reporting of vessel activity within the Great Lakes using AIS. These data show considerable promise in tracking commodity flows through the system as well as in documenting traffic volumes at key locations requiring infrastructure investment (particularly dredging). Other applications include detecting vessel calls at specific terminals, locks, and other navigation points of interest. This study documents the techniques developed at The University of Toledo to acquire, reduce, aggregate, and store AIS data. Specific topics include: techniques for reducing data volumes, vessel path tracking, speed estimation on the waterway network, detection of vessel calls made at a dock, and data analysis and mining for errors within AIS data. The study also revealed the importance of AIS technology in maritime safety, although the data are riddled with errors and inaccuracies. These errors within the AIS data will have to be addressed and rectified in the future to make the data accurate and useful. The data reduction algorithm achieves a 98% reduction in AIS data volume, making the data far more manageable. In the future, similar data reduction techniques could be applied to GPS traffic data collected for highways and railways.
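One common way to achieve the kind of data reduction described above is track thinning: keep an AIS report only when the vessel has moved appreciably or a time gap has elapsed. The sketch below is an illustrative stand-in with made-up thresholds, not the study's actual algorithm.

```python
import math

def thin_track(points, min_move=0.01, max_gap=60):
    """Keep an AIS point only if the vessel moved more than `min_move`
    degrees since the last kept point, or more than `max_gap` seconds
    elapsed (illustrative thresholds, not the study's actual values)."""
    kept = [points[0]]
    for t, lat, lon in points[1:]:
        lt, la, lo = kept[-1]
        if math.hypot(lat - la, lon - lo) >= min_move or t - lt >= max_gap:
            kept.append((t, lat, lon))
    return kept

# A vessel idling at a dock reports every 2 seconds; nearly all points
# collapse, mirroring the large reduction ratios possible for AIS data
track = [(t, 41.69 + 1e-5 * (t % 3), -83.47) for t in range(0, 600, 2)]
reduced = thin_track(track)
```

For a stationary vessel, 300 reports collapse to one kept point per 60-second gap, a roughly 97% reduction on this toy track.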

Committee:

Peter Lindquist (Committee Chair); Kevin Czajkowski (Committee Member); Neil Reid (Committee Member); Mark Vonderembse (Committee Member); Richard Stewart (Committee Member)

Subjects:

Geographic Information Science; Geography; Information Technology; Remote Sensing; Social Research; Transportation

Keywords:

Automatic Identification System, AIS, Big Data, Data Reduction Technique, Vessel Path, Vessel Call, Great Lakes, Maritime, VTS

Kulkarni, Kunal Vikas. Performance Characterization and Improvements of SQL-On-Hadoop Systems
Master of Science, The Ohio State University, 2016, Computer Science and Engineering
Impala and Hive bring SQL technologies to Hadoop systems, enabling users to run analytics queries against data stored in HDFS and Apache HBase without requiring data movement or transformation. In this work, we characterize BigDataBench SQL workloads in Impala as I/O-, communication-, or compute-intensive. We perform detailed profiling and analysis of query execution in Impala to understand the performance of SQL queries. From this analysis, we observe that the performance of Inner Join queries in Impala can be improved, since the existing Join implementation is blocking. This work implements a non-blocking Join in which reading the right-side table of the Join and building its hash table are overlapped with the construction of the left-side table data. Experimental results show that the non-blocking Join implementation improves the execution of Join queries by 9-12%. Next, a scalability study of Impala is performed to evaluate how well Impala scales out as the number of compute nodes increases for divergent SQL queries. We observe that the default Inner Join SQL query does not scale well, since Impala performs a broadcast Join by default. We change the default Inner Join in Impala to a partitioned/shuffle Join, and the results show that it scales linearly. We then evaluate Hive SQL queries running on top of Triple-H, an RDMA (Remote Direct Memory Access) based HDFS that is optimized for HDFS writes. We design new write-intensive SQL benchmark queries, and the experimental results show that Triple-H brings a 45% benefit to write-intensive queries and a 25% benefit to read-intensive queries in Hive. In another scheme, we evaluate querying of HBase tables in Hive running on top of Triple-H and see a 20-33% benefit for write-intensive queries and a 15% benefit for read-intensive queries. Together, these results demonstrate improvements for SQL queries on Hadoop systems.
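The non-blocking Join idea described above can be sketched in a few lines: build the right side's hash table in a background thread while the left side is still being materialized, and synchronize only before probing. This is a toy Python illustration of the overlap, not Impala's actual C++ implementation.

```python
import threading

def hash_join(left_rows, right_rows, key=0):
    """Toy non-blocking hash join: the build of the right side's hash
    table overlaps with materializing the left side, instead of blocking
    until the build finishes."""
    table = {}

    def build():
        for row in right_rows:
            table.setdefault(row[key], []).append(row)

    builder = threading.Thread(target=build)
    builder.start()
    left = list(left_rows)   # left-side construction, overlapped with build
    builder.join()           # wait only once, right before probing begins
    return [l + r for l in left for r in table.get(l[key], [])]

out = hash_join([(1, 'a'), (2, 'b')], [(1, 'x'), (3, 'y')])
```

In a real engine the left side would be streaming in from disk or the network, so the overlap hides the build latency rather than merely reordering it.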

Committee:

Dhabaleswar Panda, Dr. (Advisor); P. Sadayappan, Dr. (Committee Member); Xiaoyi Lu, Dr. (Committee Member)

Subjects:

Computer Science

Keywords:

Hadoop; SQL; Impala; Hive; Big Data; Joins; HDFS

Gadiraju, Krishna Karthik. Benchmarking Performance for Migrating a Relational Application to a Parallel Implementation
MS, University of Cincinnati, 2014, Engineering and Applied Science: Computer Science
Many organizations rely on relational database platforms for OLAP-style querying (aggregation and filtering) for small to medium size applications. We investigate the impact of scaling up the data sizes for such queries. We intend to illustrate what kind of performance results an organization could expect should they migrate current applications to big data environments. This thesis benchmarks the performance of Hive, a parallel data warehouse platform that is a part of the Hadoop software stack. We set up a 4-node Hadoop cluster using Hortonworks HDP 1.3.2. We use the data generator provided by the TPC-DS benchmark to generate data of different scales. We use a representative query provided in the TPC-DS query set and run the SQL and Hive Query Language (HiveQL) versions of the same query on a relational database installation (MySQL) and on the Hive cluster. An analysis of the results shows that for all the dataset sizes used, Hive is faster than MySQL when executing the query. Hive loads the large datasets faster than MySQL, while it is marginally slower than MySQL when loading the smaller datasets.
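A minimal harness for the comparison methodology described above (run the same query against two backends, time repeated executions, keep the best) might look like the following. The two callables here are stand-ins; the thesis issued a real TPC-DS query to MySQL and to a Hive cluster.

```python
import time

def benchmark(run_query, repeats=3):
    """Time a query callable several times and return the best run,
    as one might when comparing MySQL and Hive on the same query
    (hypothetical harness, not the thesis's actual setup)."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return min(times)

# Stand-ins for issuing the same aggregation query to each engine
rows = list(range(100000))
mysql_like = lambda: sum(r for r in rows if r % 7 == 0)
hive_like = lambda: sum(r for r in rows if r % 7 == 0)

t_mysql, t_hive = benchmark(mysql_like), benchmark(hive_like)
```

Taking the minimum over repeats reduces noise from caching and scheduling; the thesis additionally varies the TPC-DS scale factor to see where each platform wins.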

Committee:

Karen Davis, Ph.D. (Committee Chair); Prabir Bhattacharya, Ph.D. (Committee Member); Paul Talaga, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Hive;Hadoop;benchmarking;big data;SQL;queries

Elkin, Lauren S. Predicting Diffusion of Contagious Diseases Using Social Media Big Data
Master of Sciences (Engineering), Case Western Reserve University, 2015, EECS - Computer and Information Sciences
Influenza (flu) outbreaks affect approximately 200,000 people in the United States annually, which sometimes causes overcrowding in hospitals. Predicting future outbreaks and understanding how they are spreading across geographies can better prepare hospitals. In this study, we analyze social media micro-blogs and geographical locations to understand how outbreaks spread, and to enhance disease forecasting. In this paper, we use Twitter as our data source, influenza-like illnesses (ILI) as our disease epidemic, and states in the United States as our geographical locations. We present a novel network-based model that utilizes social media data to make predictions about disease diffusion a week in advance. We show that flu-related tweets align well with ILI activity (p<0.049). We also show that our model yielded accurate predictions for upcoming ILI activity (p<0.04), and for predicting flu diffusion (76% accuracy). Our methods can be translated to apply to any social media source, contagious disease, and region.
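The network-based prediction described above can be caricatured as follows: a state's signal next week is a mix of its own social-media activity and its neighbors'. The adjacency, tweet counts, and weights below are all hypothetical; the paper learns its model from real Twitter and ILI data.

```python
# Hypothetical state adjacency and this week's flu-related tweet counts
neighbors = {'OH': ['PA', 'MI'], 'PA': ['OH'], 'MI': ['OH']}
flu_tweets = {'OH': 120, 'PA': 300, 'MI': 80}

def predict_next_week(state, self_w=0.6, nbr_w=0.4):
    """Predict next week's ILI signal as a convex mix of the state's own
    tweet activity and the mean of its neighbors' (toy stand-in for the
    paper's learned network model)."""
    nbrs = neighbors[state]
    nbr_mean = sum(flu_tweets[n] for n in nbrs) / len(nbrs)
    return self_w * flu_tweets[state] + nbr_w * nbr_mean

pred_oh = predict_next_week('OH')
```

Even this crude mixing captures the core intuition: a quiet state bordered by noisy neighbors should expect rising activity a week out.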

Committee:

Gurkan Bebek, Dr. (Advisor); Mehmet Koyuturk, Dr. (Committee Member); Xiang Zhang, Dr. (Committee Member)

Subjects:

Computer Science; Information Science

Keywords:

social media big data; influenza; diffusion

Su, Yu. Big Data Management Framework based on Virtualization and Bitmap Data Summarization
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
In recent years, science has become increasingly data driven. Data collected from instruments and simulations is extremely valuable for a variety of scientific endeavors. The key challenge facing these efforts is that dataset sizes continue to grow rapidly. With the growing computational capabilities of parallel machines, the temporal and spatial scales of simulations are becoming increasingly fine-grained. However, data transfer bandwidths and disk IO speeds are growing at a much slower pace, making it extremely hard for scientists to transport these rapidly growing datasets. Our overall goal is to provide a virtualization and bitmap based data management framework for “big data” applications. The challenges arise from four aspects. First, the “big data” problem creates a strong requirement for efficient but light-weight server-side data subsetting and aggregation, to decrease the data loading and transfer volume and to help scientists find the subsets of the data that are of interest to them. Second, data sampling, which focuses on selecting a small set of samples to represent the entire dataset, can greatly decrease the data processing volume and improve efficiency. However, finding a sample accurate enough to preserve scientific data features is difficult, and estimating sampling accuracy is also time-consuming. Third, correlation analysis over multiple variables plays a very important role in scientific discovery. However, scanning through multiple variables for correlation calculation is extremely time-consuming. Finally, because of the huge gap between computing and storage, a large amount of data analysis time is spent on IO. In an in-situ environment, it is very difficult to generate, before the data is written to disk, a smaller profile of the data that represents the original dataset and still supports different analyses.
In our work, we propose a data management framework to support more efficient scientific data analysis, which contains two modules: SQL-based Data Virtualization and Bitmap-based Data Summarization. The SQL-based Data Virtualization module supports high-level SQL-like queries over different kinds of low-level data formats such as NetCDF and HDF5. From the scientists’ perspective, all they need to know is how to use SQL queries to specify their data subsetting, aggregation, sampling, or even correlation analysis requirements. Our module automatically translates the high-level SQL queries into low-level data access operations, fetches the data subsets, performs the different calculations, and returns the final results to the scientists. The Bitmap-based Data Summarization module treats the bitmap index as a data summarization and supports different kinds of analysis using only bitmaps. Indexing technology, and bitmap indexing in particular, has been widely used in databases to improve query efficiency. The major contribution of our work is the observation that a bitmap index preserves both the value distribution and the spatial locality of a scientific dataset. Hence, it can be treated as a summarization of the data at a much smaller size. We demonstrate that many different kinds of analyses can be supported using only bitmaps.
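The bitmap-summarization idea above can be shown in miniature: bin a variable's values, keep one bitmap per bin (bit i corresponding to grid cell i), and answer a subsetting query with bitwise operations instead of scanning the data. The data and bin edges here are invented for illustration.

```python
import numpy as np

# Toy variable values over 8 grid cells, plus hypothetical value bins
data = np.array([0.1, 3.2, 7.7, 2.5, 9.9, 4.4, 0.3, 8.8])
bins = [(0, 2.5), (2.5, 5), (5, 7.5), (7.5, 10)]

# One boolean bitmap per bin; bit position = cell index, so the index
# preserves both value distribution and spatial locality
bitmaps = [(data >= lo) & (data < hi) for lo, hi in bins]

# Subsetting query "value >= 5" answered with a bitwise OR, no data scan
mask = bitmaps[2] | bitmaps[3]
count = int(mask.sum())
cells = np.flatnonzero(mask).tolist()   # which grid cells match
```

Real bitmap indexes (e.g., WAH/FastBit-style) compress these bitmaps heavily, which is what makes them small enough to serve as a data summary.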

Committee:

Gagan Agrawal (Advisor)

Subjects:

Computer Science

Keywords:

Big Data; High-Performance Computing; Bitmap Index; Data Virtualization; Sampling; Correlation Analysis; Time Steps Selection; In-Situ Analysis; Distributed Computing; Scientific Data Management; Wide-area Data Transfer

Jamthe, Anagha. Mitigating interference in Wireless Body Area Networks and harnessing big data for healthcare
PhD, University of Cincinnati, 2015, Engineering and Applied Science: Computer Science and Engineering
The Wireless Body Area Network (WBAN) has become an important field of research that could provide a cost-effective solution for ubiquitous health-care monitoring of the human body. In the recent past, it has attracted attention from several researchers due to its potential applications in various disciplines including health care, sports medicine, and entertainment. It is rapidly replacing its wired counterparts due to several attractive features such as light weight, easy portability, support for real-time remote monitoring, and ease of use. Users of WBANs are increasing exponentially as more people embrace wearable monitoring devices for numerous health-care purposes. Interference is considered one of the major issues in WBANs; it arises primarily from the close proximity of other WBANs, random human mobility, and the distributed nature of people carrying WBANs. Coexisting WBANs have a high chance of interference, which can degrade network performance. If left unchecked, interference poses a serious threat to the reliable operation of the network. It could cause the loss of critical medical data of patients, which might even prove to be life threatening. The primary motivation behind this dissertation is to avoid such a situation by using various interference mitigation techniques. Graceful coexistence can be ensured by scheduling the transmissions between coexisting WBANs. The MAC layer is responsible for scheduling data transmissions and coordinating nodes’ channel access to avoid possible collisions during data transmissions. In this dissertation, we address both intra-WBAN and inter-WBAN interference issues. We model a fuzzy logic based inference engine to make decisions while scheduling transmissions in isolated WBANs. For coexisting WBANs, which are distributed and lack a central coordinator, we propose a QoS-based MAC scheduling approach that avoids inter-WBAN interference.
Our proposed MAC scheduling scheme can be used to improve network performance, which is also confirmed by the results. We also discuss one of the important challenges in modeling such MAC schemes, namely random human mobility. In this dissertation we also discuss leveraging big data technology for healthcare. The use of wearable devices is growing tremendously, and so is the data generated by them. To efficiently convert this big data into a useful source of information that can be used for diagnosis or prognosis, a distributed parallel framework is needed. Using the latest big data solutions, we can efficiently store and process healthcare sensor data. This offers cheaper, more reliable, and faster computation compared to traditional database management systems. Various analytic methods for clinical prediction can be used with this framework to enable automated learning and accurate prediction. We also discuss emerging technologies for WBANs, such as implanted medical sensor devices and the communication standards associated with them. The emerging technologies of near-field communication and beacons can be harnessed to develop a smart medicine management mobile application. We conclude this dissertation by identifying future directions in data security for WBANs and the implementation of personalized medicine.
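A fuzzy logic based scheduling decision of the kind mentioned above can be sketched with triangular membership functions and weighted-average defuzzification. The rule set, membership parameters, and output singletons below are invented for illustration; the dissertation's actual inference engine differs.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def schedule_priority(criticality, buffer_fill):
    """Toy fuzzy inference for a WBAN node's transmission priority
    (hypothetical rules, not the dissertation's engine).
    Inputs and output are in [0, 1]."""
    crit_hi = tri(criticality, 0.4, 1.0, 1.6)   # "data criticality is high"
    full_hi = tri(buffer_fill, 0.4, 1.0, 1.6)   # "buffer is nearly full"
    # Rule 1: high criticality -> high priority (singleton 1.0)
    # Rule 2: nearly full buffer -> medium priority (singleton 0.5)
    total = crit_hi + full_hi
    return (crit_hi * 1.0 + full_hi * 0.5) / total if total else 0.0

urgent = schedule_priority(0.9, 0.2)   # critical vitals, empty buffer
routine = schedule_priority(0.2, 0.9)  # routine data, buffer filling up
```

The appeal of fuzzy inference here is that it blends conflicting criteria (criticality, buffer state, link quality) into one ranking without hard thresholds.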

Committee:

Dharma Agrawal, D.Sc. (Committee Chair); Richard Beck, Ph.D. (Committee Member); Prabir Bhattacharya, Ph.D. (Committee Member); Chia Han, Ph.D. (Committee Member); Wen-Ben Jone, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Wireless Body Area Networks;Big Data in healthcare;Interference mitigation;IEEE 802.15.6;coexistence in WBANs;sensor data analysis

Bicer, Tekin. Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments
Doctor of Philosophy, The Ohio State University, 2014, Computer Science and Engineering
Scientific applications, simulations, and instruments generate massive amounts of data. This data not only contributes to already existing scientific areas but also leads to new sciences. However, the management of this large-scale data and its analysis are both challenging processes. In this context, we require tools, methods, and technologies such as reduction-based processing structures, cloud computing and storage, and efficient parallel compression methods. In this dissertation, we first focus on parallel and scalable processing of data stored in S3, a cloud storage resource, using compute instances in Amazon Web Services (AWS). We develop MATE-EC2, which allows the specification of data processing using a variant of the Map-Reduce paradigm. We show various optimizations, including data organization, job scheduling, and data retrieval strategies, that can be leveraged based on the performance characteristics of cloud storage resources. Furthermore, we investigate the efficiency of our middleware in both homogeneous and heterogeneous environments. Next, we improve our middleware so that users can perform transparent processing on data that is distributed among local and cloud resources. With this work, we maximize the utilization of geographically distributed resources. We evaluate our system's overhead, scalability, and performance with varying data distributions. The users of data-intensive applications have different requirements in hybrid cloud settings; two of the most important are the execution time of the application and the resulting cost on the cloud. Our third contribution is a time and cost model for data-intensive applications that run on hybrid cloud environments. The proposed model lets our middleware adapt to performance changes and dynamically allocate the necessary resources from its environments, so that applications can meet user-specified constraints.
Fourth, we investigate compression approaches for scientific datasets and build a compression system. The proposed system focuses on the implementation and application of domain-specific compression algorithms. We port our compression system into the aforementioned middleware and implement different compression algorithms. Our framework enables the middleware to maximize the bandwidth utilization of data-intensive applications while minimizing storage requirements. Although compression can help minimize the input and output overhead of data-intensive applications, using compression during parallel operations is not trivial. Specifically, the inability to determine compressed data chunk sizes in advance complicates parallel write operations. In our final work, we develop different methods for enabling compression during parallel input and output operations. We then port our proposed methods into PnetCDF, a widely used scientific data management library, and show how transparent compression can be supported during parallel output operations. The proposed system lets an existing parallel simulation program begin outputting and storing data in a compressed fashion. Similarly, data analysis applications can transparently access compressed data using our system.
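The core difficulty named above, that compressed chunk sizes are unknown until after compression, can be made concrete in a few lines: each writer compresses its chunk, the sizes are exchanged, and an exclusive prefix sum of the sizes gives each writer its file offset (what an MPI_Exscan would provide across ranks). This is a single-process Python sketch of the coordination pattern, not the dissertation's PnetCDF implementation.

```python
import zlib

# Each "process" compresses its own chunk; sizes are unknown in advance
chunks = [bytes([i]) * 10000 for i in range(4)]
compressed = [zlib.compress(c) for c in chunks]
sizes = [len(c) for c in compressed]

# Exclusive prefix sum of compressed sizes -> each writer's file offset
offsets = [sum(sizes[:i]) for i in range(len(sizes))]

# With offsets agreed, all writers could write concurrently; here we
# just concatenate to simulate the resulting file
blob = b''.join(compressed)

# Any reader can seek to offsets[i] and decompress chunk i independently
recovered = zlib.decompress(blob[offsets[2]: offsets[2] + sizes[2]])
```

Storing the (offset, size) table alongside the data is what lets analysis applications access compressed chunks transparently and in parallel.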

Committee:

Gagan Agrawal (Advisor); Feng Qin (Committee Member); Spyros Blanas (Committee Member)

Subjects:

Computer Science

Keywords:

Data-Intensive Computing; Map-Reduce; Cloud Computing; Big Data; Scientific Data Management; Compression

Upadhyay, Abhyudaya. Big Vector: An External Memory Algorithm and Data Structure
MS, University of Cincinnati, 2015, Engineering and Applied Science: Computer Science
In data-centered domains such as science, finance, and social media, it is essential to collect vast quantities of data for research purposes in order to remain relevant and competitive. However, the effective utilization of this data by computers is hindered by insufficient memory and processing capacities, since present data structures and free memory (RAM and virtual memory) were not designed to process large data sets. As a solution to this issue, we have developed Big Vector, a new data storage container capable of storing large amounts of data, which features a user-friendly STL vector interface and is dynamically resizable at run time. This paper demonstrates that Big Vector provides larger storage than standard in-memory containers such as the array, vector (STL), and linked list, making Big Vector useful for programmers hindered by inadequate storage space and memory resources.
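The essence of an external-memory vector is that elements live in a file and are fetched by offset arithmetic, so capacity is bounded by disk rather than RAM. The sketch below is a minimal disk-backed vector of 64-bit integers with a list-like interface; it is far simpler than the thesis's Big Vector, which adds caching and an STL-style C++ API.

```python
import struct
import tempfile

class BigVector:
    """Minimal sketch of a disk-backed, dynamically growable vector of
    signed 64-bit integers (illustrative, not the thesis's design)."""
    ITEM = struct.Struct('<q')   # fixed-size records enable O(1) seeks

    def __init__(self):
        self._f = tempfile.TemporaryFile()   # backing store on disk
        self._n = 0

    def append(self, value):
        self._f.seek(self._n * self.ITEM.size)
        self._f.write(self.ITEM.pack(value))
        self._n += 1

    def __getitem__(self, i):
        if not 0 <= i < self._n:
            raise IndexError(i)
        self._f.seek(i * self.ITEM.size)     # offset = index * record size
        return self.ITEM.unpack(self._f.read(self.ITEM.size))[0]

    def __len__(self):
        return self._n

v = BigVector()
for x in range(1000):
    v.append(x * x)
```

A production version would buffer appends and cache recently read pages, which is where models like Vitter's Parallel Disk Model (cited in the thesis keywords) come into play.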

Committee:

Paul Talaga, Ph.D. (Committee Chair); Raj Bhatnagar, Ph.D. (Committee Member); John Franco, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Vectors;Big Data;External Memory Management;Data Structures;Memory I/O;Vitter PDM Model

Bhatta, Sanjeev. Conditional Correlation Analysis
Master of Science (MS), Wright State University, 2017, Computer Science
Correlation analysis is a frequently used statistical measure for examining the relationships among variables in practical applications. However, traditional correlation analysis uses an overly simplistic method to do so: it measures how two variables are related by examining only their relationship over the entire underlying data space. As a result, traditional correlation analysis may miss a strong correlation between variables, especially when that relationship exists only in a small subpopulation of the larger data space. In the era of Big Data, where data is often highly diverse and can differ noticeably within the same application, this may lose a fair share of information and is no longer acceptable. To remedy this situation, we introduce a new approach called Conditional Correlation Analysis (CCR) in this thesis. Instead of computing the correlation among variables over the entire data space, this approach first divides the data space into multiple subpopulations using patterns. It then computes the correlation for each subpopulation and identifies the subpopulations that are highly different (in terms of correlation strength) from the global population. Moreover, we introduce the concept of CCRs and ways to mine them, provide measures to evaluate the unusualness of CCRs, and give experiments that evaluate and illustrate the CCR approach in financial and medical applications.
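The phenomenon motivating this thesis is easy to construct synthetically: two variables that look weakly related globally can be strongly correlated within one pattern-defined subpopulation. The data and pattern below are invented for illustration; the thesis mines such patterns rather than assuming them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Globally, x and y look only moderately related, but within the
# subpopulation "group == 1" they are almost perfectly correlated
n = 400
group = rng.integers(0, 2, n)
x = rng.normal(size=n)
noise = rng.normal(scale=0.1, size=n)
y = np.where(group == 1, x + noise, rng.normal(size=n))

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

global_r = corr(x, y)
# Conditional correlations, one per pattern-defined subpopulation
cond_r = {g: corr(x[group == g], y[group == g]) for g in (0, 1)}
```

A conditional-correlation miner would flag the pattern "group == 1" because its correlation strength deviates sharply from the global value, the unusualness this thesis sets out to measure.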

Committee:

Guozhu Dong, Ph.D. (Committee Chair); Keke Chen, Ph.D. (Committee Member); Derek Doran, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

subpopulation; conditional correlation; big data; patterns; unusualness

Kurt, Mehmet Can. Fault-tolerant Programming Models and Computing Frameworks
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
Fault tolerance on parallel systems has always been a big challenge for High Performance Computing (HPC), and hence it has drawn a lot of attention from the community. This pursuit of fault-tolerant systems is now more important than ever due to recent advances in hardware. As the emergence of first multi-core and, more recently, many-core machines evinces, computing power is constantly being increased with more processing cores, resulting in more parallelism. To satisfy this demand and to increase the power of individual components, chips are manufactured with decreasing feature sizes. Another trend is power optimization, since it might not be feasible to run all system resources at their peak levels all the time due to factors such as heat dissipation and maintaining a total power budget. These trends in hardware also change the way scientific applications are implemented. The community designs new and diverse parallel programming models to harvest the computing power available in new hardware architectures. These models provide additional support to programmers so that they can achieve scalable performance by tuning applications via additional APIs, specifications, or annotations. Unfortunately, these changes in hardware and software also bring new challenges. For instance, the increasing number of components in HPC systems results in an increasing probability of failure. Trends such as decreasing feature sizes and low-voltage computing cause more frequent bit flips. Lastly, when incorporated incorrectly or inaccurately, programmer specifications for performance tuning can cause errors during execution. Considering these new problems, the community foresees that Mean Time Between Failures (MTBF) rates will decrease so significantly in the future that current fault-tolerance solutions will become completely inapplicable.
In this dissertation, we introduce fault-tolerance solutions in the context of existing and new parallel programming models and query and data-analysis frameworks. Our solutions target the three types of failures commonly seen: fail-stop failures, soft errors, and programmer-induced errors. With the proposed solutions, we address the following key challenges. (1) Replication is a standard technique employed in big data analysis platforms to ensure the availability of underlying data in the presence of fail-stop failures. How should we create and organize data replicas so that we guarantee efficient recovery, preserving load balance among the remaining processing units when failures occur? (2) Programming models are expected to play a key role in overcoming the challenges in future HPC systems, including resilience. Can we design a programming model that exposes the core execution state and the most critical computations in an application through a set of programming abstractions? (3) With the help of these abstractions, can such a programming model automate application-level checkpointing and reduce the amount of checkpointed state? Can we use the same knowledge to detect silent data corruptions with low overheads by executing a subset of an application's computations redundantly? (4) For checkpoint/restart solutions, can we design recovery techniques that make no assumptions about the number of processing units with which the execution is restarted? (5) Fault tolerance has mostly been addressed in the context of the SPMD paradigm. Is it possible to design fault-tolerance solutions against soft errors in different parallel programming paradigms such as the task graph execution model? (6) In addition to fail-stop failures and soft errors due to manufacturing issues and machine defects, can we also deal with potential failures that are induced by programmer specifications while tuning an application to improve performance?
First, we present the design and implementation of a fault-tolerant environment for processing queries and data analysis tasks on large scientific datasets. For two common query and data analysis tasks, we first provide a framework that employs standard data indexing techniques and achieves highly efficient execution when there are no failures. Then, we show how the framework recovers efficiently from failures of up to a certain number of nodes and still maintains load balance among the remaining nodes after recovery completes. We achieve these goals by developing a data replication scheme, which we refer to as subchunk or subpartition replication. Our extensive evaluation demonstrates that this replication scheme outperforms traditional solutions. Second, we focus on designing a parallel programming paradigm that models the computations and communication in iterative scientific applications through an underlying domain and interactions among domain elements. With proper abstractions, the proposed model hides the details of inter-process communication and work partitioning (including re-partitioning in the presence of heterogeneous processing cores) from users. More importantly, it captures the most critical execution state and instructions in an application through the concepts of the compute-function and the computation-space object. The model supports automated, yet efficient, application-level checkpointing and at the same time detects soft errors that occur in processing cores and corrupt the main application state, using a low-overhead redundant execution strategy. We analyze the performance of our programming model under various scenarios on both homogeneous and heterogeneous configurations. Next, we direct our attention to the task graph execution model, a parallel programming paradigm different from Single Program Multiple Data (SPMD), the paradigm for which most existing fault-tolerance solutions in the literature have been proposed.
We design a fault-tolerant dynamic task graph scheduling algorithm that recovers corrupted data blocks and metadata in a task graph from an arbitrary number of soft errors with low time and space overheads. We provide a task re-execution algorithm that is selective in the sense that only the corrupted portion of the task graph is recovered. Furthermore, as opposed to traditional checkpoint/restart solutions, recovery is performed in a non-collective fashion, so that only the threads observing the failure participate in the recovery process while the remaining threads continue normal execution. We evaluate our fault-tolerance solution extensively under different failure scenarios and show that the recovery overheads are negligible for the common case of a small number of failures. As our last work, we focus on another type of failure, caused by the tuning efforts of runtime software and programmers to improve parallel execution performance. First, we propose a memory management scheme to reduce the total memory consumption of applications expressed as task graphs. The presented optimization technique is based on recycling data blocks among tasks, and, in contrast to common use-count-based memory allocators, it is able to handle task graphs with dynamic dependence relations efficiently. Recycling operations are dictated by functions that are either explored automatically by the runtime or specified explicitly by user annotations. Either way, an incorrect recycling operation can lead to data races and erroneous program output. Therefore, to detect such cases while still benefiting from data block recycling, we propose two algorithms that prune the space of candidate recycling functions and efficiently recover the effects of any invalid choice of recycling operation during execution. We demonstrate that the proposed schemes reduce total memory consumption significantly while avoiding any potential hazards.
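The selective re-execution idea above can be sketched on a four-task toy graph: when one task's output is found corrupted, recompute only that task and its downstream dependents, leaving the rest of the graph untouched. The graph and tasks below are invented; the dissertation's scheduler is multithreaded and handles metadata corruption as well.

```python
# Toy task graph: d depends on b and c, which both depend on a
deps = {'a': [], 'b': ['a'], 'c': ['a'], 'd': ['b', 'c']}
funcs = {'a': lambda: 2,
         'b': lambda: results['a'] + 1,
         'c': lambda: results['a'] * 3,
         'd': lambda: results['b'] + results['c']}

results = {}
for task in ('a', 'b', 'c', 'd'):       # initial run, topological order
    results[task] = funcs[task]()

def recover(corrupted):
    """Selective recovery: re-execute the corrupted task and every task
    reachable from it, skipping the unaffected portion of the graph."""
    dirty = {corrupted}
    for task in ('a', 'b', 'c', 'd'):    # walk in topological order
        if task in dirty or any(p in dirty for p in deps[task]):
            dirty.add(task)
            results[task] = funcs[task]()
    return dirty

results['b'] = -999                      # simulate a silent data corruption
redone = recover('b')
```

Only 'b' and 'd' are re-executed; 'a' and 'c' keep their results, which is exactly the saving over a full checkpoint/restart.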

Committee:

Gagan Agrawal (Advisor); Saday Sadayappan (Committee Member); Radu Teodorescu (Committee Member)

Subjects:

Computer Science

Keywords:

fault-tolerance, fail-stop failures, soft errors, programmer errors, application-level checkpointing, replication, task re-execution, recovery, soft error detection, big data processing, SPMD, task graph scheduling, memory management

Massimino, Brett JOperational Factors Affecting the Confidentiality of Proprietary Digital Assets
Doctor of Philosophy, The Ohio State University, 2014, Business Administration
The leakage of an organization's proprietary, digital assets to unauthorized parties can be a catastrophic event for any organization. The magnitude of these events has been recently underscored by the Target data breach, in which 70 million consumer credit card accounts were compromised and financial costs are expected to exceed $1 billion. Digital assets have steadily progressed beyond low-value data and information, and into high-value knowledge-based domains. Failures to protect these latter types of digital assets can have even greater implications for firms or even macroeconomic conditions. Using the Target event as an illustrative motivation, we highlight the importance of two relatively unexplored topics within the domain of digital asset protection: (1) vendor management, and (2) worker adherence to standard, well-codified procedures and technologies. We explicitly consider each of these topics through the separate empirical efforts detailed in this dissertation. Our first empirical effort examines the effects of sourcing and location decisions on the confidentiality of digital assets. We frame our study within a product-development dyad, with a proprietary, digital asset being shared between partners. We treat confidentiality as a performance dimension that is influenced by each organization accessing the asset. Specifically, we empirically investigate the realm of electronic video game development and the illegal distribution activities of these products. We employ a series of web-crawling data collection programs to compile an extensive secondary dataset covering the legitimate development activities of the industry. We then harvest data from the archives of a major black-market distribution channel and leverage these data to derive a novel, product-level measure of asset confidentiality. We examine the interacting factors of industrial clustering (agglomeration) and national property rights legislation in affecting this confidentiality measure.
We find that (1) firms within industry clusters tend to have significantly higher levels of asset confidentiality, (2) strong national property rights tend to suppress this benefit, and (3) these effects are greatly amplified for client organizations. Our second empirical effort seeks insight into the compliance behaviors of workers with tasks related to digital asset protections. Here, we frame a general, dual-task setting in which a worker has a procedural task as a primary responsibility (e.g., manufacturing), but is also requested to comply with a discretionary, protection-oriented task. We draw from Goal Setting Theory and task switching theories to elicit two factors which may significantly impact the worker's performance on each of these tasks: (1) the level of resource utilization for the worker, and (2) the level of attribution (group vs. individual) held for the protection-oriented task. In our analyses, we examine several performance variables, including task performance, task switching and sequencing behaviors, and goal achievement levels. Through a controlled laboratory experiment, we find that individual accountability on the protection task positively relates to the subjects' performance on both tasks. We also find evidence for a negative, nonlinear relationship between resource utilization and performance of the protection-oriented task, and find that this relationship is further moderated by the protection task's type of outcome attribution.

Committee:

John Gray (Advisor); Kenneth Boyer (Advisor); James Hill (Committee Member); Elliot Bendoly (Committee Member)

Subjects:

Business Administration

Keywords:

Information Security; Intellectual Property; Breach; Digital Economy; Supply Chain Management; Confidentiality; Asset Protections; Software Development; Worker Behaviors; Behavioral Operations Management; Data Protection; Big Data

Huai, YinBuilding High Performance Data Analytics Systems based on Scale-out Models
Doctor of Philosophy, The Ohio State University, 2015, Computer Science and Engineering
To respond to the data explosion, new system infrastructures have been built based on scale-out models for the purposes of high data availability and reliable large-scale computation. With increasing adoption of data analytics systems, users continuously demand high throughput and high performance on various applications. In this dissertation, we identify three critical issues in achieving high throughput and high performance for data analytics: efficient table placement methods (i.e., the methods used to place structured data), generating high-quality distributed query plans without unnecessary data movements, and effective support of out-of-band communications. To address these three issues, we conducted a comprehensive study of the design choices of different table placement methods, designed and implemented two optimizations to remove unnecessary data movements in distributed query plans, and introduced a system facility called SideWalk to facilitate the implementation of out-of-band communications. In our first work, on table placement methods, we comprehensively studied existing table placement methods and generalized their basic structure. Based on this basic structure, we conducted a comprehensive evaluation of the I/O performance of different design choices of table placement methods. Based on our evaluation and analysis, we provided a set of guidelines for users and developers to tune their implementations of table placement methods. In our second work, we focused on building our optimizations on Apache Hive, a widely used open-source data warehousing system in the Hadoop ecosystem. We analyzed operators that may require data movements in the context of the entire query plan. Our optimization methods remove unnecessary data movements from distributed query plans. Our evaluation shows that these optimization methods can significantly reduce query execution time.
In our third work, we designed and implemented SideWalk, a system facility for implementing out-of-band communications. We designed the APIs of SideWalk based on our abstraction of out-of-band communications. With SideWalk, users can implement out-of-band communications in various applications instead of relying on ad hoc approaches. Through our evaluation, we show that SideWalk can effectively support out-of-band communications, which are used in implementing advanced data processing flows, and that users can conduct out-of-band communications in a reusable way. Without SideWalk, users commonly need to build out-of-band communications in an ad hoc way, which is hard to reuse and limits programming productivity. The studies proposed in this dissertation have been comprehensively tested and evaluated to show their effectiveness. The guidelines on table placement methods from our table placement study have been verified by newly implemented and widely used file formats, Optimized Record Columnar File (ORCFile) and Parquet. The optimization methods in our query planner work have been adopted by Apache Hive, which is a widely used data warehousing system in the Hadoop ecosystem and is shipped by all major Hadoop vendors.
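The row-group layout behind ORCFile and Parquet, the formats named above, can be illustrated with a toy in-memory example. This is a minimal sketch under simplifying assumptions (the function names are hypothetical, and real formats add compression, indexes, and metadata), but it shows why a scan of one column need not touch the bytes of any other column.

```python
def to_row_groups(rows, group_size):
    """Toy columnar table placement: split rows into fixed-size row
    groups, storing each column contiguously within a group."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        # transpose the chunk: one list per column within the row group
        groups.append([list(col) for col in zip(*chunk)])
    return groups

def read_column(groups, col_idx):
    """Scan a single column across all row groups, skipping the others."""
    out = []
    for g in groups:
        out.extend(g[col_idx])
    return out
```

For example, three two-column rows with a group size of 2 yield two row groups, and `read_column(groups, 0)` reassembles the first column without reading the second.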

Committee:

Xiaodong Zhang (Advisor); Feng Qin (Committee Member); Spyridon Blanas (Committee Member)

Subjects:

Computer Science

Keywords:

Big Data; Systems; Table Placement; Query Optimization; Out-of-band Communications

Aring, Danielle CIntegrated Real-Time Social Media Sentiment Analysis Service Using a Big Data Analytic Ecosystem
Master of Computer and Information Science, Cleveland State University, 2017, Washkewicz College of Engineering
Big data analytics are at the center of modern science and business. Our social media networks, mobile devices, and enterprise systems generate enormous volumes of data on a daily basis. This wide availability provides organizations in every field with opportunities to discover valuable intelligence for critical decision-making. However, traditional analytic architectures are insufficient to handle the unprecedentedly large volume of data and the complexity of data processing. This thesis presents an analytic framework for the unprecedented scale of big data that performs data stream sentiment analysis effectively in real time. The work presents a Social Media Big Data Sentiment Analytics Service System (SMBDSASS). The architecture leverages the Apache Spark stream data processing framework, coupled with a NoSQL Hive big data ecosystem. Two sentiment analysis models were developed. The first, a topic-based model, performs sentiment (opinion) analysis on sentences related to a user-provided topic or person of interest in a tweet stream. The second, an aspect (feature) based model, performs aspect analysis on reviews containing important feature terms for a user-provided product of interest. The experimental results of the proposed framework on a real-time tweet stream and product reviews show comparable improvements over the results in the existing literature, with 73% accuracy for the topic-based sentiment model and 74% accuracy for the aspect (feature) based sentiment model. The work demonstrated that our topic- and aspect-based sentiment analysis models, built on the real-time stream data processing framework of Apache Spark with machine learning classifiers and coupled with a NoSQL big data ecosystem, offer an efficient, scalable, real-time stream data-processing alternative to common batch data mining frameworks for complex multiphase sentiment analysis.

Committee:

Sun Sunnie Chung, Ph.D. (Committee Chair); Yongjigan Fu, Ph.D. (Committee Member); Ifthkar Sikder, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

big data analytics, sentiment analysis, stream data-processing

Lin, CanchuExploring Big Data Capability: Drivers and Impact on Supply Chain Performance
Doctor of Philosophy, University of Toledo, 2016, Manufacturing and Technology Management
Although success stories of some big companies have been reported in the popular press, Big Data remains underexplored in supply chain management research. This project takes the initiative in investigating the role of Big Data in supply chain performance. Specifically, this project develops the construct of Big Data capability to characterize what is involved in Big Data and how companies use it to develop their competitive advantage. Then, the project moves to identify some key antecedents of the development of Big Data capability. Next, the project explores how Big Data capability facilitates the knowledge management process in the supply chain. Finally, the project seeks to measure the impact of knowledge management enacted by Big Data capability on performance. The project used survey methodology to collect data to empirically assess the validity of the theoretical model outlined above. Data analysis results indicate that: 1) technology orientation facilitates the development of Big Data capability; 2) developmental culture negatively moderates the relationship between technology orientation and the development of Big Data capability; 3) Big Data capability positively impacts firms' performance in new product development and product improvement by enhancing their knowledge management process; and 4) relationship building positively moderates that process.

Committee:

Anand Kunnathur (Committee Chair); Jerzy Kamburowski (Committee Member); Michael Mallin (Committee Member); David Black (Committee Member)

Subjects:

Management

Keywords:

Big Data Capability, strategic orientations, developmental culture, knowledge co-creation, knowledge sharing, new product development, product improvement

Fathi Salmi, MeisamProcessing Big Data in Main Memory and on GPU
Master of Science, The Ohio State University, 2016, Computer Science and Engineering
Many large-scale systems were designed with the assumption that I/O is the bottleneck, but this assumption has been challenged in the past decade by new trends in hardware capabilities and workload demands. The computational power of CPU cores has not improved in proportion to the performance of disks and network interfaces in the past decade, while the demand for computational power in various workloads has grown out of proportion. GPUs outperform CPUs on various workloads, such as query processing and machine learning. When such workloads run on a single computer, data processing systems must use GPUs to stay competitive. However, GPUs have never been studied for large-scale data analytics systems. To maximize GPU performance, core assumptions about the behavior of large-scale systems must be re-examined and the whole system redesigned. In this report, we used Apache Spark as a case study of the performance benefits of using GPUs in a large-scale, distributed, in-memory data analytics system. Our system, Spark-GPU, exploits the massively parallel processing power of GPUs in a large-scale, in-memory system and accelerates crucial data analytics workloads. Spark-GPU minimizes memory management overhead, reduces extraneous garbage collection, minimizes internal and external data transfers, converts data into a GPU-friendly format, and provides batch processing. Spark-GPU detects GPU-friendly tasks based on predefined patterns in computation and automatically schedules them on the available GPUs in the cluster. We have evaluated Spark-GPU with a set of representative data analytics workloads to show its effectiveness. The results show that Spark-GPU significantly accelerates data mining and statistical analysis workloads, but provides limited performance speedup for traditional query processing workloads.
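The batch-processing point above reflects a general pattern: grouping records before device transfer so each kernel launch and host-to-device copy amortizes over many records. A minimal sketch of that pattern, assuming a generic record stream; `batch_for_gpu` is a hypothetical name and no actual GPU transfer is shown.

```python
def batch_for_gpu(stream, batch_size):
    """Group individual records into fixed-size batches, so downstream
    GPU work operates on many records per launch instead of one at a time."""
    batch = []
    for rec in stream:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch
```

Streaming seven records with a batch size of three yields batches of 3, 3, and 1.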

Committee:

Xiaodong Zhang, Dr. (Advisor); Yang Wang, Dr. (Committee Member)

Subjects:

Computer Science; Information Technology; Statistics

Keywords:

big-data; GPU; Spark; Hadoop; data analytics;

Glendenning, Kurtis M.Browser Based Visualization for Parameter Spaces of Big Data Using Client-Server Model
Master of Science (MS), Wright State University, 2015, Computer Science
Visualization is an important task in data analytics, as it allows researchers to view abstract patterns within the data instead of reading through extensive raw data. The ability to interact with the visualizations is an essential aspect, since it allows data to be explored intuitively to find meaning and patterns more efficiently. Interactivity, however, becomes progressively more difficult as the size of the dataset increases. This project begins by leveraging existing web-based data visualization technologies and extends their functionality through the use of parallel processing. The methodology utilizes state-of-the-art techniques, such as Node.js, to split the visualization rendering and user interactivity controls across a client-server infrastructure. The approach minimizes data transfer by performing the rendering step on the server while allowing the use of HPC systems to render the visualizations more quickly. To improve the scaling of the system to larger datasets, parallel processing and visualization optimization techniques are used.
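The data-transfer argument above rests on reducing a dataset to roughly screen resolution on the server before anything crosses the network. A minimal sketch of that reduction step for a 1D series; the thesis renders full visualizations server-side, and `downsample_for_view` is a hypothetical illustration of the principle only.

```python
def downsample_for_view(values, screen_px):
    """Server-side reduction: aggregate a long series into at most
    screen_px (min, max) pairs, so the client receives O(pixels) data
    rather than the raw dataset."""
    n = len(values)
    if n <= screen_px:
        return [(v, v) for v in values]
    out = []
    for p in range(screen_px):
        # each pixel column summarizes its slice of the data
        lo = p * n // screen_px
        hi = (p + 1) * n // screen_px
        bucket = values[lo:hi]
        out.append((min(bucket), max(bucket)))
    return out
```

Ten values rendered into five pixel columns become five (min, max) pairs, regardless of how large the input grows.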

Committee:

Thomas Wischgoll, Ph.D. (Advisor); Michael Raymer, Ph.D. (Committee Member); Derek Doran, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

big data, visualization, parallel coordinate plot, client-server model, parallel processing

Jayapandian, Catherine PraveenaCloudwave: A Cloud Computing Framework for Multimodal Electrophysiological Big Data
Doctor of Philosophy, Case Western Reserve University, 2014, EECS - Computer and Information Sciences
Multimodal electrophysiological data, such as electroencephalography (EEG) and electrocardiography (ECG), are central to effective patient care and clinical research in many disease domains (e.g., epilepsy, sleep medicine, and cardiovascular medicine). Electrophysiological data are an example of clinical 'big data' characterized by volume (on the order of terabytes (TB) of data generated every year), velocity (gigabytes (GB) of data per month per facility), and variety (about 20-200 multimodal parameters per study), referred to as the '3Vs of Big Data.' Current approaches to storing and analyzing signal data using desktop machines and conventional file formats are inadequate to meet the challenges of the growing volume of data and the need to support multi-center collaborative studies with real-time and interactive access. This dissertation introduces a web-based electrophysiological data management framework called Cloudwave, using a highly scalable open-source cloud computing approach and a hierarchical data format. Cloudwave has been developed as part of the National Institute of Neurological Disorders and Stroke (NINDS) funded multi-center project Prevention and Risk Identification of SUDEP Mortality (PRISM). The key contributions of this dissertation are: 1. An expressive data representation format called Cloudwave Signal Format (CSF) suitable for data interchange in cloud-based web applications; 2. Cloud-based storage of CSF files processed from EDF using Hadoop MapReduce and HDFS; 3. A web interface for visualization of multimodal electrophysiological data in CSF; and 4. Computational processing of ECG signals using Hadoop MapReduce for measuring cardiac functions.
Comparative evaluations of Cloudwave with traditional desktop approaches demonstrate one order of magnitude improvement in storage performance over 77GB of patient data, one order of magnitude improvement in computing cardiac measures for single-channel ECG data, and a 20-fold improvement for four-channel ECG data using a 6-node cluster in a local cloud. Therefore, our Cloudwave approach helps address the challenges in the management, access, and utilization of an important type of multimodal big data in biomedicine.
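The MapReduce-style computation of cardiac measures can be sketched with a toy map and reduce over R-peak timestamps. This is a simplified stand-in for Cloudwave's Hadoop jobs, assuming R peaks have already been detected; the function names are hypothetical.

```python
from collections import defaultdict

def map_rr(channel, r_peaks):
    """Map step: emit (channel, R-R interval in seconds) pairs from
    consecutive R-peak timestamps."""
    return [(channel, b - a) for a, b in zip(r_peaks, r_peaks[1:])]

def reduce_heart_rate(pairs):
    """Reduce step: mean R-R interval per channel -> beats per minute."""
    acc = defaultdict(list)
    for ch, rr in pairs:
        acc[ch].append(rr)
    return {ch: 60.0 / (sum(v) / len(v)) for ch, v in acc.items()}
```

R peaks one second apart produce a 60 bpm estimate; in a real cluster the map and reduce steps run as distributed Hadoop tasks over many patients' files.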

Committee:

Guo-Qiang Zhang, PhD (Committee Chair); Satya Sahoo, PhD (Committee Member); Xiang Zhang, PhD (Committee Member); Samden Lhatoo, MD, FRCP (Committee Member)

Subjects:

Bioinformatics; Biomedical Research; Computer Science; Neurosciences

Keywords:

Big Data; Data management; Cloud computing; Electrophysiology; Web application; Ontology; Signal analysis

Chatra Raveesh, SandeepUsing the Architectural Tradeoff Analysis Method to Evaluate the Software Architecture of a Semantic Search Engine: A Case Study
Master of Science, The Ohio State University, 2013, Computer Science and Engineering
The software architecture of a system greatly determines its quality. Evaluating the architecture during the early stages of development can reduce risk and, when used appropriately, have a favorable effect on the system. The Architectural Tradeoff Analysis Method (ATAM) is an architecture evaluation technique for understanding the tradeoffs in the architecture of software systems. This thesis describes the application of ATAM to evaluate the query engine component of ResearchIQ. ResearchIQ is a semantically anchored resource discovery tool that helps researchers in the domain of clinical and translational science discover resources in a simplified manner. The primary goal of ResearchIQ is the effective delivery of search results to researchers. A large part of the thesis is devoted to evaluating the architectural alternatives for the query engine component of ResearchIQ using ATAM. Three initial architecture alternatives are presented in the thesis. The thesis introduces the system (ResearchIQ) being evaluated, along with its business drivers and background. It also provides a general overview of the ATAM process, describes the application of ATAM to the ResearchIQ system, and presents the important results. The document is intended as a report toward developing a prototype implementation that led to the final framework enhancing the performance of ResearchIQ.

Committee:

Jayashree Ramanathan, Dr (Advisor); Rajiv Ramnath, Dr (Advisor)

Subjects:

Computer Science

Keywords:

Semantic; Hadoop; Big data; Search engine

Tepe, EmreStatistical Modeling and Simulation of Land Development Dynamics
Doctor of Philosophy, The Ohio State University, 2016, City and Regional Planning
The impacts of neighborhood and historical conditions on land parcel development have been recognized as important for deriving a robust understanding of land dynamics. However, dynamic models that explicitly incorporate spatial and temporal dependencies involve challenges in data availability, methodology, and computation. Recent improvements in GIS technology and the growing availability of spatially explicit data at disaggregate levels offer new research opportunities for spatio-temporal modeling of urban dynamics. Parameter estimation requires more complicated methods to maximize complex likelihood functions with analytically intractable normalizing constants. Furthermore, working with a parcel-level dataset quickly increases the sample size, with additional computational challenges in handling large datasets. In this research, parcel-level urban dynamics are investigated with the geocoded Auditor's tax database for Delaware County, Ohio. In contrast to earlier research using time series of remote-sensing and land-cover data to derive measures of urban land-use dynamics, the available information on the year when construction took place on each parcel is used to measure these dynamics. A binary spatio-temporal autologistic model (STARM), incorporating space, time, and their interactions, is first used to investigate parcel-level dynamics. This model is able to capture the impacts of the contemporaneous and historical neighborhood conditions around parcels, and is a modified version of the autologistic model introduced by Zhu, Zheng, Carroll, and Aukema (2008). Second, a multinomial STARM is formulated as an extension of the binary case in order to estimate the probability of a parcel's status changing to a discrete land-use category. To the best of our knowledge, methods for estimating the parameters of binary spatio-temporal autologistic models are not available in any commercial or open-source statistical software.
A statistical program was written in Python that estimates Monte Carlo maximum likelihood parameters of the STARM. Parallel processing techniques are used because of the computational challenges of parameter estimation when using the complete dataset (73,000 parcels). This study contributes to the modeling of land development by demonstrating quantitatively the impacts of contemporaneous and historical neighborhood conditions on land dynamics, while offering a feasible methodological and computational approach.
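The conditional form of a binary autologistic model can be written down compactly: the log-odds of development are a linear function of parcel covariates plus terms for contemporaneous and lagged neighborhood development. A minimal sketch, assuming a single covariate and simple neighbor counts; the parameter names are illustrative, not those of the STARM estimated in the dissertation.

```python
import math

def autologistic_prob(intercept, beta, x,
                      theta_space, n_dev_neighbors,
                      theta_time, n_dev_neighbors_lag):
    """Conditional probability that a parcel develops, given a covariate x
    and counts of developed neighbors now and in the previous period
    (binary spatio-temporal autologistic form)."""
    eta = (intercept + beta * x
           + theta_space * n_dev_neighbors      # contemporaneous neighborhood
           + theta_time * n_dev_neighbors_lag)  # historical neighborhood
    return 1.0 / (1.0 + math.exp(-eta))
```

With all terms zero the probability is exactly 0.5; a positive spatial parameter raises the probability as the count of developed neighbors grows, which is the neighborhood effect the model is built to capture.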

Committee:

Jean-Michel Guldmann (Advisor); Philip A. Viton (Committee Member); Gulsah Akar (Committee Member)

Subjects:

Economics; Land Use Planning; Regional Studies; Statistics; Urban Planning

Keywords:

urban dynamics, spatial and temporal modeling, land use change, urban growth, autologistic regression, big data

Chaudhuri, AbonGeometric and Statistical Summaries for Big Data Visualization
Doctor of Philosophy, The Ohio State University, 2013, Computer Science and Engineering
In recent times, the visualization and data analysis paradigm is adapting fast to keep up with the rapid growth in computing power and data size. Modern scientific simulations run at massive scale to produce huge datasets, which must be analyzed and visualized by domain experts to continue innovation. In the presence of large-scale data, it is important to identify and extract the informative regions at an early stage so that the subsequent analysis algorithms, which are usually memory- and compute-intensive, can focus only on those regions. Transforming the raw data into a compact yet meaningful representation also helps maintain the interactivity of queries and the visualization of analysis results. In this dissertation, we propose a novel and general-purpose framework suitable for exploring large-scale data. We propose to use importance-based data summaries, which can substitute for the raw data to answer queries and drive visual exploration. Since the definition of importance depends on the nature of the data and the task at hand, we propose using suitable statistical and geometric measures, or combinations of various measures, to quantify importance and perform data reduction on scalar and vector field data. Our research demonstrates two instances of the proposed framework. The first instance applies to large numbers of streamlines computed from vector fields. We make the visual exploration of such data much easier than navigating through a cluttered 3D visualization of the raw data. In this case, we introduce a fractal-dimension-based metric called the box counting ratio, which quantifies the geometric complexity of streamlines (or parts of streamlines) by their space-filling capacity. We utilize this metric to extract, organize, and visualize streamlines of varying density and complexity hidden in a large number of streamlines. The extracted complex regions from the streamlines represent the data summaries in this case.
We organize and present them on an interactive 2D information space, which allows user selection and visualization of streamlines in the original spatial domain. We also extend this framework to support exploration using an ensemble of measures, including the box counting ratio. We strengthen our claims with elaborate case studies using combustion and climate simulation datasets. We also use our framework to speed up query-driven exploration of volume data. Our approach speeds up range query response by using distribution-based data summaries instead of repeatedly scanning sub-domains of the raw data. Our work is mainly concerned with the range distribution query, which returns the distribution of an axis-aligned query region. Since the response time of such queries scales up with the data and the query size, maintaining interactivity is a challenging task. Our research offers the ability to answer a distribution query for any arbitrary region in constant time, regardless of data and query size. We adapt an integral-image-based data structure to reduce the computation, I/O, and communication cost of answering queries, and propose a similarity-based indexing technique to reduce the storage cost of the data structure. Our scheme exploits the similarity present among nearby regions in the data, and hence their respective distributions. We demonstrate the benefits that our technique offers to many visualization applications which directly or indirectly require distributions.
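The constant-time range distribution query can be illustrated in 1D with prefix sums of histograms; the integral-image structure extends the same idea to higher dimensions. A minimal sketch under those assumptions; the function names are hypothetical, and the dissertation's actual data structure additionally uses similarity-based indexing to cut storage.

```python
def build_integral_histograms(data, num_bins, lo, hi):
    """Prefix sums of per-element histograms: cum[i][b] counts the
    values falling in bin b among data[0:i]. Build cost is O(n * bins)."""
    width = (hi - lo) / num_bins
    cum = [[0] * num_bins]
    for v in data:
        b = min(int((v - lo) / width), num_bins - 1)
        row = cum[-1][:]
        row[b] += 1
        cum.append(row)
    return cum

def range_distribution(cum, start, end):
    """Histogram of data[start:end] in O(bins) time, independent of
    how large the queried range is."""
    return [cum[end][b] - cum[start][b] for b in range(len(cum[0]))]
```

After the one-time build, every range query is a single subtraction per bin, which is why the response time no longer scales with the query size.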

Committee:

Han-Wei Shen (Advisor); Roger Crawfis (Committee Member); Rephael Wenger (Committee Member); Tom Peterka (Committee Member)

Subjects:

Computer Science

Keywords:

big data visualization; scientific visualization; flow visualization; data management; data analysis

Kidd, Ian V.Object Dependent Properties of Multicomponent Acrylic Systems
Master of Sciences (Engineering), Case Western Reserve University, 2014, Materials Science and Engineering
Degradation of multi-component acrylic systems is becoming increasingly important as polymers and complex systems become commonplace in technological applications. For outdoor applications, understanding the interactions between each stressor and the optical, chemical, and mechanical response is important. This study focuses mainly on the magnitude and variance of optical and chemical properties of hardcoat acrylics on PET (Polyethylene terephthalate) or TPU (thermoplastic polyurethane) substrates, using big-data and unbiased statistics and analytics. PET shows a strong tendency to yellow and haze in accelerated and real-world exposures. A 0.90 correlation coefficient exists between yellowness and UVA-340 irradiance. A 0.8 correlation coefficient exists between haze and UVA-340 irradiance, but moisture must be present for hazing to occur. In TPU films, yellowing occurs until 200 MJ/m2 of UVA-340 irradiance, after which the films clear. Meanwhile, hardcoat acrylics with a TPU substrate are highly resistant to haze in all exposures studied. As optical degradation occurs up to 4000 hours of exposure, little correlation to carbonyl, C-H stretch, or N-H stretch area exists. A weak correlation is observed between increasing optical degradation and spectral attenuation, possibly indicating a complete breakdown of the polymers. Development of a model that relates observable degradation to surface and bulk phenomena can give insights into how to reduce degradation. All of this data must be used to focus the direction of R&D efforts to increase the useful lifetime of multi-component acrylic systems.
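The correlation coefficients reported above (e.g., 0.90 between yellowness and UVA-340 irradiance) are standard Pearson correlations, which can be computed directly. A minimal sketch for illustration; the study's actual analysis uses a full statistics and analytics pipeline.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Perfectly proportional series give r close to 1, and perfectly inverse series give r close to -1; values like the study's 0.90 indicate a strong but not exact linear relationship.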

Committee:

Roger French (Advisor); James McGuffin-Cawley (Committee Member); Timothy Peshek (Committee Member); Laura Bruckman (Other); Olivier Rosseler (Other)

Subjects:

Engineering; Materials Science; Optics; Polymers

Keywords:

Degradation; hardcoat acrylic; accelerated exposure; real-world; optical properties; study protocol; big data; yellowing; haze; PET; TPU; crazing; correlation; lifetime and degradation science; data science