Search Results (1 - 25 of 66 Results)


Kurz, Kyle W.
A Parallel, High-Throughput Framework for Discovery of DNA Motifs
Master of Science (MS), Ohio University, 2010, Computer Science (Engineering and Technology)

The search for genomic information has just begun. New genomes are sequenced daily, and each brings new challenges and knowledge to the scientific table that must be carefully mined and studied to glean every possible bit of information. The amount of data created during genomic sequencing is simply too great for researchers to handle, creating a need for computational tools capable of processing the genomic input and analyzing it for information. The field of bioinformatics focuses on this combination of computer science and biology, providing useful software applications that ease the workload of biologists.

One specific area of interest to biological researchers is the study of DNA words, or motifs, as they relate to gene regulation. These regulatory elements may be transcription factor binding sites (TFBS), which bind RNA polymerase II to the DNA strand, or enhancer/silencer sequences that up- and down-regulate transcription of the gene to which they are related by binding specific proteins. Many tools, such as Weeder [43], WordSpy [65] and YMF [55], are currently available for the study of over- and under-represented words in a DNA sequence, a trait which is believed to be useful in the identification of these regulatory elements. These tools all perform similar tasks by enumerating all words, or substrings, found in their input, then scoring and ranking the resulting words for presentation to the user. Optionally, many tools also cluster groups of words together to form degenerate motifs, which allow for evolutionary and environmental variation in the binding site.
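
The enumerate-score-rank pipeline common to these tools can be illustrated with a minimal sketch (a simple uniform-background scoring model is assumed here for clarity; real tools such as Weeder or YMF use more sophisticated background models):

```python
from collections import Counter

def enumerate_words(seq, k):
    """Count every k-length substring (word) in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def rank_words(counts, k, seq_len):
    """Rank words by observed/expected ratio under a uniform background,
    where each k-mer is expected with probability 4**-k per position."""
    positions = seq_len - k + 1
    expected = positions * (0.25 ** k)
    return sorted(((w, c / expected) for w, c in counts.items()),
                  key=lambda pair: pair[1], reverse=True)

seq = "ACGTACGTACGT"
counts = enumerate_words(seq, 4)
ranked = rank_words(counts, 4, len(seq))
```

Over-represented words (here the repeated "ACGT") rise to the top of the ranking, which is the signal these tools exploit when hunting for candidate regulatory elements.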

The Open Word Enumeration Framework (OWEF), presented in this thesis, provides a new framework on which DNA word enumeration tools can be built. OWEF provides a set of abstract base classes representing the core stages of a word enumeration tool and defines a set of standard interfaces for each stage, allowing multiple algorithmic implementations of these base classes to co-exist and be selected individually at runtime.
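
As a rough illustration of this design (hypothetical class and method names; the actual OWEF interfaces in the thesis may differ), interchangeable stage implementations behind abstract base classes, selected by name at runtime, might look like:

```python
from abc import ABC, abstractmethod
from collections import Counter

# Hypothetical stage interfaces in the spirit of OWEF's abstract base classes.
class WordCounter(ABC):
    @abstractmethod
    def count(self, sequence, k):
        ...

class WordScorer(ABC):
    @abstractmethod
    def score(self, counts):
        ...

class NaiveCounter(WordCounter):
    def count(self, sequence, k):
        return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

class FrequencyScorer(WordScorer):
    def score(self, counts):
        total = sum(counts.values())
        return {word: c / total for word, c in counts.items()}

# Implementations co-exist in registries and are selected at runtime.
COUNTERS = {"naive": NaiveCounter}
SCORERS = {"frequency": FrequencyScorer}

def run_pipeline(sequence, k, counter="naive", scorer="frequency"):
    counts = COUNTERS[counter]().count(sequence, k)
    return SCORERS[scorer]().score(counts)

scores = run_pipeline("ACGTACGT", 2)
```

Swapping in a different counting or scoring algorithm is then a matter of registering a new subclass, which is the kind of flexibility the framework's standard interfaces are meant to enable.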

In addition to providing a level of abstraction that allows for simpler development, the framework also provides a scalable solution to alleviate memory bottlenecks. The framework contains skeleton code for both a shared memory implementation, providing fast analysis on single-node, multiprocessor systems, and a distributed memory solution, which splits the tasks among several networked nodes to provide a large amount of accessible main memory to the application.

In summary, the OWEF framework is useful as a development tool by providing a set of interfaces and methods to allow developers to focus on specific aspects of the algorithms they are designing, while also providing a standardized, flexible interface to researchers, eliminating the need for specialized tools and providing a general-purpose toolkit for DNA word enumeration tasks.

Committee:

Lonnie Welch, PhD (Committee Chair); Frank Drews, PhD (Committee Member); Chang Liu, PhD (Committee Member); Robert Colvin, PhD (Committee Member)

Subjects:

Bioinformatics; Computer Science

Keywords:

DNA motifs; bioinformatics; motif discovery; bioinformatics framework

GUDIVADA, RANGA CHANDRA
DISCOVERY AND PRIORITIZATION OF BIOLOGICAL ENTITIES UNDERLYING COMPLEX DISORDERS BY PHENOME-GENOME NETWORK INTEGRATION
PhD, University of Cincinnati, 2007, Engineering : Biomedical Engineering
An important goal for biomedical research is to elucidate causal and modifier networks of human disease. While integrative functional genomics approaches have shown success in the identification of biological modules associated with normal and disease states, a critical bottleneck is representing knowledge capable of encompassing asserted or derivable causality mechanisms. Both single gene and more complex multifactorial diseases often exhibit several phenotypes, and a variety of approaches suggest that phenotypic similarity between diseases can be a reflection of shared activities of common biological modules composed of interacting or functionally related genes. Thus, analyzing the overlaps and interrelationships of clinical manifestations of a series of related diseases may provide a window into the complex biological modules that lead to a disease phenotype. In order to evaluate our hypothesis, we are developing a systematic and formal approach to extract phenotypic information present in textual form within the Online Mendelian Inheritance in Man (OMIM) and Syndrome DB databases to construct a disease-clinical phenotypic feature matrix to be used by various clustering procedures to find similarity between diseases. Our objective is to demonstrate relationships detectable across a range of disease concept types modeled in UMLS to analyze the detectable clinical overlaps of several Cardiovascular Syndromes (CVS) in OMIM in order to find the associations between phenotypic clusters and the functions of underlying genes and pathways. Most of the current biomedical knowledge is spread across different databases in different formats, and mining these datasets leads to large and unmanageable results. Semantic Web principles and standards provide an ideal platform to integrate such heterogeneous information and could allow the detection of implicit relations and the formulation of interesting hypotheses.
We implemented a page-ranking algorithm on the Semantic Web to prioritize biological entities by their relative contribution and relevance, which can be combined with this clustering approach. In this way, disease-gene, disease-pathway or disease-process relationships could be prioritized by mining a phenome-genome framework that not only discovers but also determines the importance of the resources by making queries over higher-order relationships of multi-dimensional data that reflect the feature complexity of diseases.
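
The page-ranking idea can be sketched with a plain iterative PageRank over a toy entity graph (node labels and parameters below are illustrative only, not the implementation described in the dissertation):

```python
def pagerank(graph, damping=0.85, iters=50):
    """Plain iterative PageRank over an adjacency dict {node: [out-links]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if not outs:  # dangling node: spread its rank evenly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
        rank = new
    return rank

# A toy disease-gene-pathway graph (made-up labels for illustration).
g = {"disease": ["geneA", "geneB"],
     "geneA": ["pathway1"],
     "geneB": ["pathway1"],
     "pathway1": ["disease"]}
r = pagerank(g)
```

Entities that accumulate links from many relevant neighbors (here the shared pathway) receive higher rank, which is the basic mechanism behind prioritizing disease-gene and disease-pathway relationships.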

Committee:

Dr. Bruce Aronow (Advisor)

Keywords:

Semantic Web; RDF; OWL; SPARQL; Ontology; Biomedical Informatics; Bioinformatics; Integrative Bioinformatics; Text Mining; Phenome; Genome; Disease Modularity; Data Integration; Semantic Integration

Hayes, Matthew
Algorithms to Resolve Large Scale and Complex Structural Variants in the Human Genome
Doctor of Philosophy, Case Western Reserve University, 2013, EECS - Computer and Information Sciences
It has been shown that large scale genomic structural variants (SV) are closely associated with disease onset. In particular, the presence of these abnormalities may contribute to the onset and susceptibility of cancer through various mechanisms. Knowing the location and type of these variants can assist medical researchers in making insights into methods for diagnosis and treatment. It is also important to develop efficient methods to locate these variants. This thesis presents several algorithms for identifying and characterizing structural variants using array comparative genomic hybridization (aCGH) and high throughput next-generation sequencing (NGS) platforms. The aCGH-based algorithm (CGH-Triangulator) is considerably faster than a state-of-the-art method for identifying change points in aCGH data, and it has greater prediction power on datasets with low-to-moderate levels of noise. The NGS-based algorithms include methods to identify basic SV types, including deletions, inversions, translocations, and tandem repeats. They also include methods to identify double minute chromosomes, which are more complex structural variants. These methods use a hybrid strategy to identify variants at base-pair resolution. Using two primary prostate cancer datasets and simulated datasets, we compared our methods to previously published NGS algorithms. Overall, our methods had favorable performance with respect to breakpoint prediction accuracy, sensitivity, and specificity. In particular, this thesis presents one of the first attempts to algorithmically detect double minute chromosomes, which are complex rearrangements that are present in many cancers.

Committee:

Jing Li (Advisor)

Subjects:

Bioinformatics; Computer Science; Molecular Biology

Keywords:

Bioinformatics; structural variation; Bellerophon; complex genomic rearrangements; genomic structural variation; inversion; tandem repeat; translocation; interchromosomal insertion; double minute; double minute chromosome; deletion; computational biology

Zhang, Xuan
Supporting on-the-fly data integration for bioinformatics
Doctor of Philosophy, The Ohio State University, 2007, Computer and Information Science
The use of computational tools and on-line data knowledgebases has changed the way biologists conduct their research. The fusion of biology and information science is expected to continue. Data integration is one of the challenges faced by bioinformatics. In order to build an integration system for modern biological research, three problems have to be solved. A large number of existing data sources have to be incorporated, and when new data sources are discovered, they should be utilized right away. The variety of biological data formats and access methods has to be addressed. Finally, the system has to be able to understand the rich and often fuzzy semantics of biological data. Motivated by the above challenges, a system and a set of tools have been implemented to support on-the-fly integration of biological data. Metadata about the underlying data sources are the backbone of the system. Data mining tools have been developed to help users write the descriptors semi-automatically. With an automatic code generation approach, we have developed several tools for bioinformatics integration needs. An automatic data wrapper generation tool is able to transform data between heterogeneous data sources. Another code generation system can create programs to answer projection, selection, cross product and join queries from flat file data. Real bioinformatics requests have been used to test our system and tools. These case studies show that our approach can reduce the human effort involved in an information integration system. Specifically, it makes the following contributions. 1) Data mining tools allow new data sources to be understood with ease and integrated into the system on-the-fly. 2) Changes in data format are localized by using the metadata descriptors, so system maintenance cost is low. 3) Users interact with our system through high-level declarative interfaces, reducing programming effort. 4) Our tools process data directly from flat files and require no database support; data parsing and processing are done implicitly. 5) Request analysis and request execution are separated, and our tools can be used in a data grid environment.
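
A minimal sketch of the flat-file querying idea, answering a selection-plus-projection query directly over delimited text with no database (column names are hypothetical; the real system generates such programs from metadata descriptors):

```python
import csv
import io

def query(flat_file_text, select_cols, where=lambda row: True):
    """Run a projection + selection directly over tab-delimited flat-file
    text: parse rows, filter with a predicate, keep only requested columns."""
    reader = csv.DictReader(io.StringIO(flat_file_text), delimiter="\t")
    return [{c: row[c] for c in select_cols} for row in reader if where(row)]

# Toy flat file with made-up records.
data = "gene\torganism\tlength\nBRCA1\thuman\t81189\nlacZ\te.coli\t3075\n"
rows = query(data, ["gene"], where=lambda r: int(r["length"]) > 10000)
```

The same pattern extends naturally to cross products and joins by nesting a second reader inside the row loop.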

Committee:

Gagan Agrawal (Advisor)

Subjects:

Computer Science

Keywords:

information integration; bioinformatics

Moller, Abraham Ghoreishi
Mapping ecologically important virus-host interactions in geographically diverse solar salterns with metagenomics
Master of Science, Miami University, 2016, Cell, Molecular and Structural Biology (CMSB)
Viruses that infect microbes are critical players in the world’s ecosystems. By lysing microbes, viruses turn over nutrients, regulate microbial populations, and maintain global biogeochemical cycles. Despite this ecological importance, determining viral-microbial interactions (especially interactions with lytic viruses, which do not integrate into their host’s genome) remains a major challenge. In this work, we determine viral interactions with microbes in salt-collecting ponds (salterns), where viruses are the dominant predator of the microbial community. Low microbial diversity, environmental stability, and high viral density also make solar salterns excellent model ecosystems for studying viral-archaeal interactions. By using a suite of bioinformatics tools to analyze saltern metagenomes, we mapped virus-host interactions across geographically diverse salterns and related them to carbon cycling. Our studies suggest viruses are critical players in saltern carbon cycling, and that the loss of CRISPRs in archaeal hosts may play an important role in regulating virus-mediated nutrient cycling in these environments.

Committee:

Chun Liang, PhD (Advisor); Michael Crowder, PhD (Committee Chair); Gary Lorigan, PhD (Committee Member)

Subjects:

Bioinformatics; Ecology; Microbiology

Keywords:

metagenomics, bioinformatics, CRISPR, virus-host interactions, hypersaline ecosystems, solar salterns, environmental microbiology

Garcia, Krystine
Bioinformatics Pipeline for Improving Identification of Modified Proteins by Neutral Loss Peak Filtering
Master of Science (MS), Ohio University, 2015, Biomedical Engineering (Engineering and Technology)
Research has found that the central dogma of molecular biology is much more complex than simply making a gene into mRNA and then protein. One way this manifests is in post-translational modification (PTM) of proteins. PTMs are changes to specific side chains of amino acids after translation from mRNA. Unfortunately, only a few have been carefully studied, meaning that most are not well understood. In turn, this has created problems for current methods of protein identification. Tandem mass spectrometry (MS/MS) is one common method of protein isolation and fragmentation that is usually followed by computational analysis for protein and peptide identification. The output from MS/MS is a spectrum of ion mass-to-charge ratios (m/z) versus their intensity, or a peak list, which is fed into the identification software for analysis. While these programs are generally good at identification, they were not created to incorporate large numbers of modifications. The problem is that spectra can include significant amounts of noise from modifications that can mask the y- and b-ion peaks used for identification. This has caused preprocessing to become both prevalent and necessary for identification of modified proteins. Through preprocessing and filtering of peaks from MS/MS data, we have enhanced the identification of proteins with modifications and specific phenotypes, presenting scientists with a more powerful tool for their protein research.
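
One simple form of such peak filtering removes peaks that sit at precursor-minus-neutral-loss masses, so they cannot mask the y- and b-ion peaks (an illustrative sketch, not the pipeline's actual algorithm; the 98 Da loss below approximates phosphoric acid from a phosphorylated peptide):

```python
def filter_neutral_loss_peaks(peaks, precursor_mz, losses, tol=0.5):
    """Drop peaks within `tol` of (precursor m/z - neutral-loss mass).

    peaks: list of (m/z, intensity) tuples; losses: neutral-loss masses in Da.
    """
    targets = [precursor_mz - loss for loss in losses]
    return [(mz, inten) for mz, inten in peaks
            if all(abs(mz - t) > tol for t in targets)]

# Toy spectrum; the 402.1 peak matches the 98 Da (H3PO4) neutral loss.
spectrum = [(500.0, 120.0), (402.1, 900.0), (300.2, 250.0)]
cleaned = filter_neutral_loss_peaks(spectrum, 500.0, losses=[98.0])
```

After filtering, the dominant neutral-loss peak no longer swamps the fragment-ion peaks passed on to the identification software.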

Committee:

Lonnie Welch, Dr. (Advisor); Jennifer Hines, Dr. (Committee Member); Frank Drews, Dr. (Committee Member); Frank Schwartz, Dr. (Committee Member)

Subjects:

Biochemistry; Bioinformatics; Biomedical Engineering; Computer Science

Keywords:

post-translational modifications; mass spectrometry; bioinformatics; peptide peak filtering; proteins

Camerlengo, Terry Luke
Techniques for Storing and Processing Next-Generation DNA Sequencing Data
Master of Science, The Ohio State University, 2014, Biophysics
Genomics is undergoing unprecedented transformation due to rapid improvements in genetic sequencing technology, which have lowered costs for genetic sequencing experiments while increasing the amount of data generated in a typical experiment (McKinsey Global Institute, May 2013, pp. 86-94). The increase in data has shifted the burden from analysis and research to expertise in IT hardware and network support for distributed and efficient processing. Bioinformaticians, in response to a data-rich environment, are challenged to develop better and faster algorithms to solve problems in genomics and molecular biology research. This thesis examines the storage and data processing issues inherent in next-generation DNA sequencing (NGS). This work details the design and implementation of a software prototype that exemplifies current approaches to the efficient storage of NGS data. The software library is utilized within the context of a previous software project which accompanies the publication related to the HT_SOSA assay. The software for HT_SOSA, called NGSPositionCounter, demonstrates a workflow that is common in a molecular biology research lab. In an effort to scale beyond the research institute, the software library’s architecture takes into account scalability considerations for the data storage and processing demands that are more likely to be encountered in a clinical or commercial enterprise.
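
One of the storage techniques named in this record's keywords, 4-bit base encoding, can be sketched as follows (this assumes IUPAC-style 4-bit codes packing two bases per byte, not necessarily the prototype's exact scheme):

```python
# 4-bit codes: one bit per canonical base leaves room for IUPAC
# ambiguity codes such as N (all four bits set).
CODE = {"A": 0x1, "C": 0x2, "G": 0x4, "T": 0x8, "N": 0xF}
BASE = {v: k for k, v in CODE.items()}

def pack(seq):
    """Pack two bases per byte using 4-bit codes (half the text size)."""
    if len(seq) % 2:
        seq += "N"  # pad odd-length input
    return bytes((CODE[seq[i]] << 4) | CODE[seq[i + 1]]
                 for i in range(0, len(seq), 2))

def unpack(data, length):
    """Recover the original sequence, trimming any padding."""
    out = []
    for b in data:
        out.append(BASE[b >> 4])
        out.append(BASE[b & 0xF])
    return "".join(out[:length])

packed = pack("ACGTN")
```

Two bases per byte halves storage relative to ASCII text; the "3 bases per byte" keyword refers to a tighter packing that trades ambiguity-code support for density.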

Committee:

Kun Huang, Ph.D (Advisor); Carlos Alvarez, Ph.D (Committee Member); Raghu Machiraju, Ph.D (Committee Member)

Subjects:

Bioinformatics

Keywords:

DNA sequence storage; 4 bit encoding; reference-based compression; Needleman-Wunsch; DNA base pair compression; sequence compression; MongoDB; NGS data management; bioinformatics; NoSQL; 3 bases per byte

Johnson, Stephen Robert
iPathCase
Master of Sciences, Case Western Reserve University, 2012, EECS - Computer and Information Sciences

PathCase is a system designed to provide life scientists with an integrated environment to study pathways, regardless of the source producing the corresponding data. It includes databases for different pathway data sources, web interfaces for viewing the information in the databases, tools for analyzing the data, and web services to access the data programmatically.

This thesis describes the design and implementation of new touch screen-based tools to view and analyze the data in the PathCaseKEGG and PathCaseMAW databases. The first tool, iPathCase-SMDA, provides compartment-aware pathway visualizations and access to the Steady-State Metabolic Network Dynamics Analysis tool. The second tool, iPathCase-KEGG, provides visualizations of KEGG pathways, including relevant information from other data sources outside of PathCaseKEGG and KEGG.

Committee:

Michael Branicky, ScD (Committee Chair); Gultekin Ozsoyoglu, PhD (Advisor); Mehmet Koyuturk, PhD (Committee Member); Meral Ozsoyoglu, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

bioinformatics; metabolic pathways; iPad; KEGG; MAW; SMDA; PathCase

SINHA, AMIT U.
Discovery and Analysis of Genomic Patterns: Applications to Transcription Factor Binding and Genome Rearrangement
PhD, University of Cincinnati, 2008, Engineering : Computer Science

One of the most challenging open problems of the post-genomic era for computer scientists, bioinformaticians and molecular biologists is to identify and characterize all the elements involved in gene regulation, at both the transcriptional and post-transcriptional level. The rapid availability of multiple genomes also creates a need for efficient computational solutions to aid in comparative syntenic analysis (the analysis of relative gene-order conservation between species), which can provide key insights into evolutionary chromosomal dynamics, rearrangement rates between species, and speciation analysis. Thus, in this dissertation, we address these issues and develop computational approaches to identify genomic patterns at both the micro and macro levels.

To address the problem of gene regulation, we developed algorithms for identifying transcription factor binding sites (short repeated patterns, or sequence motifs) in genomes. We have developed a level-based search algorithm which is able to identify regulatory motifs in a wide variety of datasets, and we demonstrate that our method works more efficiently than the current best methods. Further, we have also developed statistical models for identification of known motifs. Finally, we refine the motif discovery process through methods that discriminate and characterize the co-factors of a transcription factor.

At the macro level, we developed efficient methods (Cinteny) for fast identification of syntenic blocks with various levels of coarse graining and for determining evolutionary relationships between genomes in terms of the number of rearrangements (the reversal distance). The Cinteny web server integrates syntenic region browsing with evolutionary distance assessment, offers the flexibility to adjust all parameters and recompute the results on-the-fly, and provides the ability to work with user-provided data.

Committee:

Raj Bhatnagar (Advisor); Jaroslaw Meller (Committee Co-Chair); Anil Jegga (Committee Co-Chair); Ali Minai (Committee Member); Yizong Cheng (Committee Member)

Subjects:

Bioinformatics

Keywords:

computational biology; bioinformatics; transcription factor; genome rearrangement

Ozer, Hatice Gulcin
Residue Associations In Protein Family Alignments
Doctor of Philosophy, The Ohio State University, 2008, Biophysics

The increasing amount of data on biomolecule sequences and their multiple alignments for families has promoted an interest in discovering structural and functional characteristics of proteins from sequence alone. In many proteins, interactions between residues appear to be key players in structure and function. Consensus, weight-matrix, and hidden Markov models cannot detect interpositional correlations and alternating motifs within a family alignment. We propose and analyze a method for detecting interpositional correlations and examine the applicability of this method to structural prediction.

We present the Multiple Alignment Variation Linker (MAVL) and StickWRLD to analyze biomolecule sequence alignments and visualize positive and negative interpositional residue associations. In the MAVL analysis system, the expected number of sequences that should share identities, and the residuals, are calculated based on the observed population of sequences actually sharing the residues. Correlating pairs of residues are visualized in a StickWRLD diagram. This analysis system allows us to extract information from the alignments that is not accessible to traditional column-based methods. In addition, a StickWRLD diagram enables the user to visualize the family alignment and positional dependencies in 3D.

We discuss methodologies to identify residue associations in protein family alignments, including the use of the residuals and the phi coefficient to determine the strength of a residue association, and the Fisher exact probability test to evaluate statistical significance. We computed identity-wise residue associations for 961 Pfam family alignments and examined the physical proximity and physicochemical properties of associated residues in the alignments, as well as their presence on secondary structural elements. We observed that the proximity of residues increases as the strength of association and its statistical significance increase. Specifically, associations between aromatic residues and hydrophilic residues are present in closer proximity. The amino acid contact predictivity of the residual parameter is the highest compared to the phi coefficient and the statistical significance. Compared to the expected distributions, we observed larger proportions of pairs in which both residues are in a helix, one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region.
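
The two association statistics named above can be computed directly from a 2x2 contingency table of residue co-occurrence counts (a self-contained sketch with made-up counts; the thesis applies these to Pfam alignment columns):

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table [[a, b], [c, d]]: e.g. counts of sequences with
    both residues, only the first, only the second, or neither."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

def fisher_exact_p(a, b, c, d):
    """One-sided Fisher exact p-value: hypergeometric tail over tables at
    least as extreme (larger a) with the same margins."""
    def table_p(a, b, c, d):
        n = a + b + c + d
        return (math.comb(a + b, a) * math.comb(c + d, c)) / math.comb(n, a + c)
    p = 0.0
    while b >= 0 and c >= 0:
        p += table_p(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return p

# Toy counts: 30 sequences share both residues, 30 share neither.
phi = phi_coefficient(30, 5, 5, 30)
p_value = fisher_exact_p(30, 5, 5, 30)
```

A phi near 1 with a tiny p-value, as here, indicates a strong positive association between the two alignment positions.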

Committee:

William Ray, PhD (Advisor); Charles Daniels, PhD (Committee Member); Hakan Ferhatosmanoglu, PhD (Committee Member); Thomas Magliery, PhD (Committee Member)

Subjects:

Bioinformatics; Biophysics

Keywords:

Family Alignment; Positional Dependency; Amino Acid Correlation; Residue Correlation; Residue Association; Protein Sequence; Protein Structure; Pfam database; Bioinformatics; Fisher Exact test; Phi coefficient

Xu, Hua
Novel data analysis methods and algorithms for identification of peptides and proteins by use of tandem mass spectrometry
Doctor of Philosophy, The Ohio State University, 2007, Chemistry
Tandem mass spectrometry is one of the most important tools for protein analysis. This thesis is focused on the development of new methods and algorithms for tandem mass spectrometry data analysis. A database search engine, MassMatrix, has also been developed that incorporates these methods and algorithms. The program is publicly available both on the web server at www.massmatrix.net and as a deliverable software package for personal computers. Three different scoring algorithms have been developed to identify and characterize proteins and peptides by use of tandem mass spectrometry data. The first is targeted at the next generation of tandem mass spectrometers that are capable of high mass accuracy and resolution. Two scores calculated by the algorithm are sensitive to high mass accuracy because the algorithm explicitly incorporates mass accuracy into scoring potential peptide and protein matches for tandem mass spectra. The algorithm is further improved by employing Monte Carlo simulations to calculate ion abundance based scores without any assumptions or simplifications. For high mass accuracy data, MassMatrix provides improvements in sensitivity over other database search programs. The second scoring algorithm, based on peptide sequence tags inferred from tandem mass spectra, further improves the performance of MassMatrix for low mass accuracy tandem mass spectrometry data. The third algorithm is the first automated data analysis method that uses peptide retention times in liquid chromatography to evaluate potential peptide matches for tandem mass spectrometry data. The algorithm predicts reverse phase liquid chromatography retention times of peptides from their hydrophobicities and compares the predicted retention times with the observed ones to evaluate the peptide matches. In order to handle low quality data, a new method has also been developed to reduce noise in tandem mass spectra and screen out poor quality spectra.
In addition, a data analysis method for identification of disulfide bonds in proteins and peptides by tandem mass spectrometry data has been developed and incorporated in MassMatrix. By this new approach, proteins and peptides with disulfide bonds can be directly identified in tandem mass spectrometry with high confidence without any chemical reduction and/or other derivatization.
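
The retention-time evaluation idea behind the third algorithm can be sketched as a sum of per-residue hydrophobicity coefficients (the coefficient values and tolerance below are made up for illustration; MassMatrix's actual model and parameters differ):

```python
# Hypothetical per-residue hydrophobicity coefficients (illustrative only).
HYDRO = {"A": 0.8, "L": 8.4, "F": 10.5, "G": -0.9, "K": -1.9, "S": -0.8}

def predict_retention(peptide, slope=1.0, intercept=2.0):
    """Predict RP-LC retention time as an affine function of the summed
    residue hydrophobicities of the peptide."""
    h = sum(HYDRO.get(aa, 0.0) for aa in peptide)
    return slope * h + intercept

def evaluate_match(peptide, observed_rt, tol=5.0):
    """Accept a peptide-spectrum match only if its observed retention time
    falls within a tolerance of the predicted one."""
    return abs(predict_retention(peptide) - observed_rt) <= tol
```

Matches whose observed retention time disagrees badly with the hydrophobicity-based prediction can then be down-weighted or rejected.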

Committee:

Michael Freitas (Advisor)

Subjects:

Chemistry, Analytical

Keywords:

Mass spectrometry; Proteomics; Database search; Data analysis; Bioinformatics

Kirac, Mustafa
Pattern Oriented Methods for Inferring Protein Annotations within Protein Interaction Networks
Doctor of Philosophy, Case Western Reserve University, 2009, EECS - Computer and Information Sciences

Discovering protein functions is a major task in computational biology, since proteins have key roles in the underlying mechanisms of cellular processes, phenotypes, and diseases. The most common in silico protein annotation method is function transfer through sequence homology, which does not always produce correct results. Consequently, we propose an alternative research direction of assigning functional annotations to proteins (and genes) based on biological network information. In general, our approach is to transfer functionality between related proteins. We present our approaches in three parts:

1. In the first part, we compute the probabilistic significance of GO annotation sequences obtained from the annotations of a sequence of proteins in a protein-protein interaction network. After identifying significant annotation sequences, we predict the annotation of a target protein by picking the most significant candidate GO annotation sequence observed in the close neighborhood of the target protein. Our cross-validation prediction experiments with pre-annotated proteins recovered correct annotations of proteins with 81% precision at 45% recall.

2. In the second part, we develop and evaluate a new pattern-based function annotation framework. For a given target protein P, and for each GO term t, we compare (through graph alignment) neighborhood of P with neighborhoods of proteins annotated by t. We then assign to P the GO term whose neighborhoods are the most similar to the neighborhood of P. In this part, we improve the accuracy of techniques introduced in the first part, by 30.44%, 41.94%, and 2.62% in the organism-specific networks of fly, worm, and yeast, respectively.

3. In the third part, we present a technique that improves our pattern-based methodologies with an iterative prediction algorithm. In this part, by using a multi-iteration algorithm, we predict functions of protein P at one step, and employ predicted functions of P for fine-tuning the predictions of other target proteins at a later step. Plugging in the iterative prediction algorithm improves the accuracy of pattern-based function annotation framework presented in the second part by 11.24%, 14.32%, 5.6%, and 15.14% in organism-specific networks of fly, human, worm, and yeast, respectively.
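
A much-simplified stand-in for this kind of network-based annotation transfer is a majority vote over a protein's direct interaction neighbors (hypothetical protein and GO identifiers; the thesis's pattern-based graph alignment is considerably more involved):

```python
from collections import Counter

def predict_annotation(target, network, annotations):
    """Transfer the most frequent GO term among a protein's direct
    interaction neighbors (a baseline stand-in for neighborhood-based
    function prediction)."""
    votes = Counter(term
                    for neighbor in network.get(target, [])
                    for term in annotations.get(neighbor, []))
    return votes.most_common(1)[0][0] if votes else None

# Toy interaction network with made-up GO term labels.
net = {"P1": ["P2", "P3", "P4"]}
annot = {"P2": ["GO:0005634"], "P3": ["GO:0005634"], "P4": ["GO:0003677"]}
pred = predict_annotation("P1", net, annot)
```

The thesis improves on this kind of baseline by comparing whole neighborhood patterns through graph alignment and by iterating predictions across the network.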

Committee:

Gultekin Ozsoyoglu (Advisor); Rob Ewing (Committee Member); Jiong Yang (Committee Member); Mehmet Koyuturk (Committee Member); Zehra Meral Ozsoyoglu (Committee Member)

Subjects:

Bioinformatics; Computer Science

Keywords:

Bioinformatics; protein interaction networks; protein function prediction

Xu, Yaomin
New Clustering and Feature Selection Procedures with Applications to Gene Microarray Data
Doctor of Philosophy, Case Western Reserve University, 2008, Statistics

Statistical data mining is one of the most active research areas. In this thesis we develop two new data mining procedures and explore their applications to genetic data.

The first procedure is called PfCluster (Profile Cluster Analysis). It is a clustering method designed for profiled genetic data. PfCluster is efficient and flexible in uncovering clusters determined by a new class of biologically meaningful distance metrics. A new internal quality measure of clusters, the coherence index, is developed to find coherent clusters. An efficient mechanism for choosing the threshold of coherent clusters is also derived and implemented. The threshold is based on the first- and second-order approximations to the true threshold under a null distribution for parallel clusters. PfCluster has been applied to simulated data and two real data examples: a biomarker LOH dataset and a microarray gene expression dataset. PfCluster is competitive with correlation-based clustering procedures.

The second procedure is called RPselection (resampling-based partitioning selection). It is a feature selection algorithm designed for microarray studies. It selects a subset of genes that maximizes a fitness score. The fitness score measures the relevance between the partition labels from a clustering result and an external class label derived from the clinical outcomes. The score is computed using a resampling procedure. The RPselection algorithm has been applied to simulated data and a real uveal melanoma gene expression dataset. RPselection outperforms gene-by-gene test-based feature selection procedures.

Software development is an integral part of modern statistical research. Two software packages, pfclust and rpselect, are developed in this thesis based on our PfCluster method and RPselection algorithm, respectively. Both packages are implemented within the R object-oriented programming framework, and they can be easily customized and extended by users.

The ideas in our two procedures can be generalized and applied to other data mining tasks. This thesis concludes with discussion on connections between two methods and the related future research.

Committee:

Jiayang Sun (Advisor)

Subjects:

Statistics

Keywords:

Bioinformatics; coherence index; data mining; feature selection; gene expression pathway; gene profiling; informative gene; microarray data; profile cluster analysis; partitioning; regulatory network; statistical pattern recognition

Shankar, Vijay
Extension of Multivariate Analyses to the Field of Microbial Ecology
Doctor of Philosophy (PhD), Wright State University, 2016, Biomedical Sciences PhD
Ground-breaking advancements in molecular and analytical techniques in the past decade have enabled researchers to accumulate data at an extraordinary rate. Especially in the field of microbial ecology, the introduction of technologies such as high-throughput sequencing, quantitative microarrays, nuclear magnetic resonance and mass spectrometry has led to the interrogation of diverse and previously unexplored microbial communities at unparalleled depth. Analysis and interpretation of patterns within datasets acquired with such high-throughput methods require powerful statistical approaches. A class of such techniques called multivariate statistical analyses is an excellent choice for analysis of complex microbiota-related datasets. This field of statistics is constantly evolving as new techniques and procedures are being developed and applied to explore and interpret the underlying patterns both statistically and visually. As a result, the decision-making process involved in the choice of the technique that best suits the scientific question and the dataset is no longer trivial. Additionally, the current trends in the use of multivariate statistics in microbial ecology indicate a strong preference toward exploratory analyses, resulting in limitations to possible biological interpretations. In order to facilitate a more extensive integration of multivariate statistics in microbial ecology, I apply a diverse set of analytical methods to human-associated microbial and metabolite datasets that allows us to draw biologically relevant inferences. Specifically, I use indirect gradient analyses to show that the largest gradients of variability correspond to the separation of samples based on sample groups. I use direct gradient analyses to explain a significant portion of the overall variability present within the response variables using independently measured environmental variables. 
I use classifier techniques to build highly accurate discriminant models based on the differences in the response variables across sample groups and identify the variables that contribute the most to sample group separation. Using correlation-based bipartite analyses, I identify statistically significant associations between two different sets of response variables that were measured for the same set of samples. Finally, I integrate the analytical insights from the above approaches into a generalized protocol for the analysis of multivariate datasets in the field of microbial ecology.
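As a rough illustration of the indirect gradient analyses described above, the sketch below runs a bare-bones PCA (via SVD) on synthetic data for two sample groups; the data, group sizes, and function name are hypothetical and are not taken from the dissertation.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Indirect gradient analysis via PCA: project samples onto the
    axes of greatest variance in the response data."""
    Xc = X - X.mean(axis=0)            # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T  # sample coordinates on the top axes
    explained = (s**2) / (s**2).sum()  # proportion of variance per axis
    return scores, explained[:n_components]

# Two hypothetical sample groups that differ along one dominant gradient
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(10, 5))
group_b = rng.normal(1.0, 0.1, size=(10, 5))
scores, explained = pca_scores(np.vstack([group_a, group_b]))
# The first axis separates the groups and carries most of the variance
```

Here the largest gradient of variability (the first principal axis) corresponds to the separation of the two sample groups, mirroring the result reported in the abstract.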

Committee:

Oleg Paliy, Ph.D. (Committee Chair); Gerald Alter, Ph.D. (Committee Member); Michael Raymer, Ph.D. (Committee Member); Nicholas Reo, Ph.D. (Committee Member)

Subjects:

Bioinformatics; Biomedical Research; Biostatistics; Microbiology

Keywords:

microbiology; biostatistics; bioinformatics; biomedical research

Gilder, Jason R. Computational methods for the objective review of forensic DNA testing results
Doctor of Philosophy (PhD), Wright State University, 2007, Computer Science and Engineering PhD
Since the advent of criminal investigations, investigators have sought a "gold standard" for the evaluation of forensic evidence. Currently, deoxyribonucleic acid (DNA) technology is the most reliable method of identification. Short Tandem Repeat (STR) DNA genotyping has the potential for impressive match statistics, but the methodology is not infallible. The condition of an evidentiary sample and potential issues with the handling and testing of a sample can lead to significant issues with the interpretation of DNA testing results. Forensic DNA interpretation standards are determined by laboratory validation studies that often involve small sample sizes. This dissertation presents novel methodologies to address several open problems in forensic DNA analysis and demonstrates the improvement of the reported statistics over existing methodologies. Establishing a dynamically calculated RFU threshold specific to each analysis run improves the identification of signal from noise in DNA test data. Objectively identifying data consistent with degraded DNA sample input allows for a better understanding of the nature of an evidentiary sample and affects the potential for identifying allelic dropout (missing data). The interpretation of mixtures of two or more individuals has been problematic, and new mathematical frameworks are presented to assist in that interpretation. Assessing the weight of a DNA database match (a cold hit) relies on statistics that assume that all individuals in a database are unrelated – this dissertation explores the statistical consequences of related individuals being present in the database. Finally, this dissertation presents a statistical basis for determining if a DNA database search resulting in a very similar but nonetheless non-matching DNA profile indicates that a close relative of the source of the DNA in the database is likely to be the source of an evidentiary sample.
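The run-specific RFU threshold idea can be sketched as a mean-plus-k-standard-deviations limit of detection computed from that run's baseline noise; the noise values and the constant k below are hypothetical, and this is not the dissertation's actual formulation.

```python
import statistics

def dynamic_rfu_threshold(noise_peaks, k=3.0):
    """Run-specific analytical threshold: mean of the run's baseline
    noise plus k standard deviations, instead of a fixed
    laboratory-wide RFU cutoff."""
    mu = statistics.mean(noise_peaks)
    sd = statistics.stdev(noise_peaks)
    return mu + k * sd

# Baseline noise peaks read from a hypothetical electropherogram run
noise = [8, 12, 10, 9, 11, 13, 10, 9]
threshold = dynamic_rfu_threshold(noise)      # ~15.3 RFU for this run
peaks = [6, 40, 150, 9, 300]
called = [p for p in peaks if p > threshold]  # peaks treated as true signal
```

A run with noisier baseline data would yield a higher threshold, which is exactly the adaptivity a fixed cutoff lacks.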

Committee:

Travis Doom (Advisor)

Keywords:

Bioinformatics; Forensics; PCR-STR DNA; Limit of detection; Degraded DNA; Mixture interpretation; DNA databases; Familial searching

Moss, Tiffanie. CHARACTERIZATION OF STRUCTURAL VARIANTS AND ASSOCIATED MICRORNAS IN FLAX FIBER AND LINSEED GENOTYPES BY BIOINFORMATIC ANALYSIS AND HIGH-THROUGHPUT SEQUENCING
Doctor of Philosophy, Case Western Reserve University, 2012, Biology
The recent sequencing and assembly of the Bethune genome, a linseed type, and the sequencing of several of the genotrophs of Stormont cirrus, a fiber type, provided a platform for analysis of structural variation sites between flax fiber and linseed types which could be used to identify regions worthy of investigation for the improvement of seed and fiber traits in flax. The most well characterized site of structural variation in flax is that of LIS-1, an environmentally inducible and heritable 5.7Kb structural variation. Previous investigations suggest the LIS-1 structural variant is the result of a programmed DNA rearrangement. The only vaguely comparable system for controlled or programmed DNA rearrangement is that seen in the ciliate macronucleus. The scnRNA mechanism used by ciliates utilizes much of the same machinery as that of small RNAs and the action is similar to that of heterochromatin-associated siRNAs. By identifying small RNAs that map to other regions of structural variation in flax, the impact these regions could have on microRNA-regulated biological pathways, and the association of other small RNAs with these regions, may be discerned and used to prioritize structural variants for further molecular investigation. Regions of the Bethune genome that may be associated with small RNAs were determined by computational prediction of microRNAs and mapping of small RNA reads from high throughput RNA sequencing. Over 25,000 miRNAs were predicted from the flax genome and Unigene database using the novoMIR plant miRNA prediction program. Of these, 649 discrete miRNA gene units were identified as having potential targets among the 30,649 flax Unigenes. RNAseq provided preliminary support for 349 of the predicted miRNAs derived from novoMIR and identified an additional 1.4 million unique reads to be included in further investigations.
Sequence similarity to public miRNA databases suggests that the flax transcriptome utilizes most of the conserved miRNAs among angiosperms and will likely have similar regulatory roles. Using Perl programming scripts, 44,106 PAV sites of at least 1Kb were identified between flax linseed and fiber types. A total of 143 PAV sites were found to be associated with computationally predicted miRNAs or sRNAs identified by high-throughput sequencing.
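Associating small-RNA loci with structural variants reduces, at its core, to interval overlap. A minimal sketch with hypothetical coordinates (the actual work used Perl scripts over genome-scale data):

```python
def overlapping_pav_sites(pav_sites, mirna_loci):
    """Count PAV sites that overlap at least one small-RNA locus;
    intervals are inclusive (start, end) pairs on the same assembly."""
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]
    return sum(any(overlaps(p, m) for m in mirna_loci) for p in pav_sites)

# Hypothetical PAV sites and miRNA loci (positions in bp)
pav = [(1000, 2500), (5000, 6200), (9000, 9900)]
mirs = [(2400, 2460), (7000, 7080)]
n = overlapping_pav_sites(pav, mirs)   # only the first PAV site overlaps
```

For real chromosome-scale data an interval tree would replace the quadratic scan, but the overlap predicate is the same.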

Committee:

Christopher Cullis (Advisor); Stephen Haynesworth (Committee Member); Emmitt Jolly (Committee Member); Barbara Kuemerle (Committee Member); Saba Valadkhan (Committee Member)

Subjects:

Bioinformatics; Biology; Genetics; Molecular Biology; Plant Biology

Keywords:

microRNA; miRNA; flax; linseed; bioinformatics; computational prediction; PAV; structural variation; genotroph

MARKEY, MICHAEL PATRICK. TRANSCRIPTIONAL REGULATION BY THE RETINOBLASTOMA TUMOR SUPPRESSOR: NOVEL TARGETS AND MECHANISMS
PhD, University of Cincinnati, 2004, Medicine : Cell and Molecular Biology
The retinoblastoma tumor suppressor (RB) is a key regulator of the cell cycle. It is targeted for loss or functional inactivation in the majority of cancers. Through interaction with the E2F family of transcription factors, it regulates the expression of many genes involved in the transition from the G1 phase of the cell cycle to S phase. Beyond this, RB has been implicated in a variety of cellular processes, including differentiation, development, and progression through S and G2. Despite the importance of RB, the targets of RB-mediated transcriptional repression remain mostly speculative. Here we have undertaken a comprehensive genomic study to identify genes regulated by RB. Microarray analyses of cells expressing active RB revealed that RB represses a wide variety of genes involved in several cellular functions. Many of these RB targets are genes which are activated by the expression of various E2F family members. However, repression by RB and activation by E2F are not equal and opposite events. Genes may or may not be affected by both, and not always to the same extent. Besides known E2F targets, several novel genes were linked to the RB pathway. These include geminin, an important regulator of DNA licensing, which is discussed here in greater detail. Furthermore, conditional loss of RB from adult cells resulted in deregulation of many of the same genes repressed by active RB. Interestingly, a number of genes were also repressed when RB is lost. These fall into several classes, but include many genes involved in immune functioning. Taken together, these data represent an important step forward in our understanding of this vital tumor suppressor.

Committee:

Dr. Erik Knudsen (Advisor)

Keywords:

retinoblastoma; RB; cancer; cell cycle; bioinformatics; microarray

Kalluru, Vikram Gajanan. Identify Condition Specific Gene Co-expression Networks
Master of Science, The Ohio State University, 2012, Electrical and Computer Engineering
Since co-expressed genes often are co-regulated by a group of transcription factors, different conditions (e.g., disease versus normal) may lead to different transcription factor activities and therefore different co-expression relationships. A method for identifying condition specific co-expression networks by combining the recently developed network quasi-clique mining algorithm and the Expected Conditional F-statistic has been proposed. This method has been applied to compare the transcriptional programs between the non-basal and basal types of breast cancers. This work is a translational bioinformatics study that integrates network analysis, lifting traditional gene-list-based disease biomarker discovery to the level of gene and protein interactions. This work presents a method for identifying condition specific gene co-expression networks. The method involves construction of a Weighted Graph Co-expression Network (WGCN) and mining the WGCNs to identify dense co-expression networks, followed by a chi-square test based enrichment analysis for detecting condition specific co-expression relationships. The expression values in all the conditions for the genes constituting a condition specific co-expression network are visualized as heat maps, which suggest that the genes are highly correlated in a specific condition but the correlations are disrupted in other conditions.
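The thesis combines a quasi-clique miner with an Expected Conditional F-statistic; the sketch below only illustrates the chi-square enrichment step on a toy gene module, with hypothetical expression values and a hand-rolled Pearson correlation.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]]
    (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def condition_specific_edges(expr_a, expr_b, cutoff=0.8):
    """Count gene pairs strongly co-expressed (|r| >= cutoff) in each
    condition, then test whether co-expression is enriched in condition A."""
    genes = list(expr_a)
    pairs = [(g, h) for i, g in enumerate(genes) for h in genes[i + 1:]]
    in_a = sum(abs(pearson(expr_a[g], expr_a[h])) >= cutoff for g, h in pairs)
    in_b = sum(abs(pearson(expr_b[g], expr_b[h])) >= cutoff for g, h in pairs)
    m = len(pairs)
    return chi_square_2x2(in_a, m - in_a, in_b, m - in_b)

# Hypothetical module: tightly correlated in condition A, scrambled in B
expr_a = {"g1": [1, 2, 3, 4, 5], "g2": [2, 4, 6, 8, 10],
          "g3": [1.1, 2.2, 2.9, 4.1, 5.0]}
expr_b = {"g1": [1, 2, 3, 4, 5], "g2": [5, 1, 4, 2, 3],
          "g3": [3, 2, 5, 1, 4]}
stat = condition_specific_edges(expr_a, expr_b)
```

A large statistic flags the module as condition specific: correlated in one condition, disrupted in the other, matching the heat-map pattern the abstract describes.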

Committee:

Kun Huang, PhD (Advisor); Raghu Machiraju, PhD (Committee Member)

Subjects:

Bioinformatics; Computer Engineering; Computer Science; Engineering

Keywords:

gene coexpression; bioinformatics; differential gene coexpression networks; systems biology; basal breast cancer; non-basal breast cancer

Gowrisankar, Sivakumar. Predicting Functional Impact of Coding and Non-Coding Single Nucleotide Polymorphisms
PhD, University of Cincinnati, 2008, Engineering : Biomedical Engineering

Determining the functional impact of coding and non-coding single nucleotide polymorphisms (SNPs) is one of the primary challenges in establishing genotype-phenotype relations. SNPs constitute more than 90% of the genetic variation, account for most trait differences among individuals, and are one of the primary genotype data captured when studying the genetic basis of disease. The advent of efficient high-throughput DNA sequencers and GeneChips™ necessitates robust computational analysis pipelines to handle the genotype data more efficiently and facilitate seamless integration with clinical data. To address this, we have developed a bioinformatics-based comprehensive analysis pipeline which predicts the effect of coding and non-coding SNPs.

Based on the hypothesis that by integrating multiple coding SNP-impact predictions we can analyze and predict the SNP outcome better, we integrated three impact-prediction scores and one population-based score to obtain an SVM-based meta-prediction model. Through cross-validation studies, we demonstrate that our approach improves the SNP-effect prediction. For the first time, we have used the population-based minor allele frequency (MAF) as one of the features for SNP-effect prediction and show that it significantly improves the performance of the prediction algorithm. We then extended this approach to predict the impact of non-coding promoter SNPs. Our results, through feature combinations and cross-validation, show that integrating multiple sequence-based features improves performance of the SNP-effect predictor. Also, for the first time, we demonstrate that the loss or gain of guanine in the SNP-overlapping putative transcription binding sites can be used as a measure of likelihood for an alteration in the native binding site, thereby increasing the odds of the SNP being functional.

Through various test cases, we demonstrate the utility of our algorithm. Using a specific test case of p53 binding sites, we also demonstrate a method for enhancing prediction by including experiment-based transactivation data for p53 response elements (REs), which improves the ability to predict the impact of SNPs overlapping p53 REs. Taken together, this provides a framework for demonstrating how prediction of TFBS functions may be enhanced in a high throughput fashion using assay screening data.
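A minimal sketch of the score-integration idea: a tiny perceptron stands in for the SVM meta-predictor (the dissertation used an SVM), learning to combine three hypothetical impact scores plus MAF into one deleterious/benign call. All feature values below are invented for illustration.

```python
def train_perceptron(X, y, epochs=50, lr=0.1):
    """Tiny linear classifier standing in for the SVM meta-predictor:
    learns a weighted combination of per-SNP feature scores."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1
            if pred != yi:  # update weights only on mistakes
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1

# Features per SNP: [impact_score_1, impact_score_2, impact_score_3, MAF]
# (hypothetical; deleterious SNPs tend toward high scores and low MAF)
X = [[0.9, 0.8, 0.85, 0.01], [0.8, 0.9, 0.70, 0.02],
     [0.2, 0.1, 0.30, 0.40], [0.1, 0.2, 0.15, 0.35]]
y = [1, 1, -1, -1]   # 1 = deleterious, -1 = benign
w, b = train_perceptron(X, y)
```

The point is that MAF enters the model as just another feature, exactly as the abstract describes; the learner then weighs it against the sequence-based scores.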

Committee:

Bruce J. Aronow, PhD (Committee Chair); Anil G. Jegga, DVM, MRes (Committee Member); Marepalli B. Rao, PhD (Other)

Subjects:

Bioinformatics

Keywords:

bioinformatics; SNP; polymorphisms; p53; functional SNPs; coding-SNPs; non-coding SNPs; p53 transactivation

Choi, Ickwon. Computational Modeling for Censored Time to Event Data Using Data Integration in Biomedical Research
Doctor of Philosophy, Case Western Reserve University, 2011, EECS - Computer and Information Sciences

Medical prognostic models are designed by clinicians to predict the future course or outcome of disease progression after diagnosis or treatment. The data used to develop these clinical models are required to contain a high number of events per variable (EPV) for the resulting model to be reliable. If our objective is to optimize predictive performance by some criterion, we can often achieve a reduced model that trades a little bias for lower variance, improving overall performance. To accomplish this goal, we propose a new variable selection approach that combines Stepwise Tuning in the Maximum Concordance Index (STMC) and Forward Nested Subset Selection (FNSS) in two stages. In the first stage, the proposed variable selection is employed to identify the best subset of risk factors optimized with the concordance index using inner cross validation for optimism correction in the outer loop of cross validation, yielding potentially different final models for each of the folds. We then feed the intermediate results of the first stage into another selection method in the second stage to resolve the overfitting problem and to select a final model from the variation of predictors in the selected models. Two case studies on survival data sets of different sizes, as well as a simulation study, demonstrate that the proposed approach is able to select an improved and reduced average model under a sufficient sample and event size compared to other selection methods such as stepwise selection using the likelihood ratio test, Akaike Information Criterion (AIC), and least absolute shrinkage and selection operator (lasso). Finally, we achieve improved final models in each dataset as compared to full models according to most criteria. The selected models and the final models were analyzed in a systematic scheme through validation for independent performance evaluation.

For the second part of this dissertation, we build prognostic models that use clinicopathologic features and predict prognosis after a certain treatment. Most of the recent research efforts have focused on high dimensional genomic data with small samples. Since clinically similar but molecularly heterogeneous tumors may produce different clinical outcomes, the combination of clinical and genomic information is crucial to improve the quality of prognostic prediction. However, there is a lack of schemes for integrating the two into a clinico-genomic model, due to the large number of variables and small sample size, particularly when a parsimonious model is desired. We propose a methodology to build a reduced yet accurate integrative model using a hybrid approach based on the Cox regression model, which uses several dimension reduction techniques, L2 penalized maximum likelihood estimation (PMLE), and resampling methods to tackle the problems above. The predictive accuracy of the modeling approach is assessed by several metrics via an independent and thorough scheme to compare competing methods. In breast cancer data studies for metastasis and mortality outcome, in a DLBCL data study, and in simulation studies, we demonstrate that the proposed methodology can improve prediction accuracy and build a final model with a hybrid signature that is parsimonious when integrating both types of variables. The selected clinical factors and genomic biomarkers are found to be highly relevant to the biological processes and can be considered as potential biomarkers for cancer prognosis and therapy. Furthermore, selected but unidentified genes are open to thorough investigation.
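The concordance index that drives the STMC selection step can be sketched as Harrell's c-index: among usable pairs of subjects, the fraction where the higher predicted risk fails first. The cohort below is hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index for censored survival data: among comparable
    pairs, the fraction where the higher-risk subject fails earlier.
    Tied risk scores count as half-concordant."""
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is usable only if subject i is observed to fail first
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical cohort: event = 1 observed failure, event = 0 censored
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.5, 0.1]   # perfectly anti-ordered with survival time
c = concordance_index(times, events, risks)   # -> 1.0
```

A c-index of 0.5 is chance-level discrimination and 1.0 is perfect ranking; variable subsets are scored by this quantity under cross validation in the first stage.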

Committee:

Michael Kattan (Advisor); Mehmet Koyuturk (Committee Chair); Andy Podgurski (Committee Member); Soumya Ray (Committee Member)

Subjects:

Computer Science

Keywords:

Statistical Machine Learning; Biomedical Informatics; Bioinformatics; Censored Time To Event Data; Clinico-genomic Model; Cox Proportional Hazards Model; Microarray analysis

Marsolo, Keith Allen. A workflow for the modeling and analysis of biomedical data
Doctor of Philosophy, The Ohio State University, 2007, Computer and Information Science
The use of data mining techniques for the classification of shape and structure can provide critical results when applied to biomedical data. On a molecular level, an object's structure influences its function, so structure-based classification can lead to a notion of functional similarity. On a more macro scale, anatomical features can define the pathology of a disease, while changes in those features over time can illustrate its progression. Thus, structural analysis can play a vital role in clinical diagnosis. When examining the problem of structural or shape classification, one would like to develop a solution that satisfies a specific task, yet is general enough to be applied elsewhere. In this work, we propose a workflow that can be used to model and analyze biomedical data, both static and time-varying. This workflow consists of four stages: 1) Modeling, 2) Biomedical Knowledge Discovery, 3) Incorporation of Domain Knowledge and 4) Visual Interpretation and Query-based Retrieval. For each stage we propose either new algorithms or suggest ways to apply existing techniques in a previously-unused manner. We present our work as a series of case studies and extensions. We also address a number of specific research questions. These contributions are as follows: We show that generalized modeling methods can be used to effectively represent data from several biomedical domains. We detail a multi-stage classification technique that seeks to improve performance by first partitioning data based on global, high-level details, then classifying each partition using local, fine-grained features. We create an ensemble-learning strategy that boosts performance by aggregating the results of classifiers built from models of varying spatial resolutions.
This allows a user to benefit from models that provide a global, coarse-grained representation of the object as well as those that contain more fine-grained details, without suffering from the loss of information or noise effects that might arise from using only a single selection. Finally, we propose a method to model and characterize the defects and deterioration of function that can be indicative of certain diseases.
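In its simplest form, the resolution-ensemble strategy reduces to aggregating per-resolution predictions into one label. A toy sketch (the label names, weights, and function names are hypothetical; the dissertation's aggregation scheme may differ):

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Aggregate class labels from classifiers built at different
    spatial resolutions (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Variant that weights each resolution, e.g. by its validation
    accuracy, so a trusted resolution can outvote a noisy one."""
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(totals, key=totals.get)

# Hypothetical per-resolution calls for one structural model
coarse_to_fine = ["alpha", "beta", "alpha"]
label = majority_vote(coarse_to_fine)
label_w = weighted_vote(coarse_to_fine, [0.6, 0.9, 0.7])
```

Either way, no single resolution has to be chosen up front, which is the benefit the paragraph above describes.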

Committee:

Srinivasan Parthasarathy (Advisor)

Subjects:

Computer Science

Keywords:

Biomedical Data Modeling; Spatial Modeling; Biomedical Knowledge Discovery; Classification of Structure-based Data.; Bioinformatics; Protein Modeling; Protein Classification

Wolfe, Richard A. In Silico Discovery of Pollen-specific Cis-regulatory Elements in the Arabidopsis Hydroxyproline-Rich Glycoprotein Gene Family
Master of Science (MS), Ohio University, 2014, Computer Science (Engineering and Technology)
Within every cell is a copy of an organism's DNA. This copy of DNA has all of the information needed for the cell to express every gene in the organism's genome. Although each cell is capable, individual cells do not express every gene in their DNA. The genes expressed by a cell are regulated by transcription factors (TFs) that bind to a transcription factor binding site (TFBS) located in the promoter region of the gene. TFs must bind to TFBSs in order for a gene to be expressed. Tissues are groups of cells that perform a specific function; therefore, the cells of a specific tissue express genes that are not expressed in other cell types. Hydroxyproline-rich glycoproteins (HRGPs) are proteins that are found in the plant cell wall, and they can be further classified according to the degree they are glycosylated as arabinogalactan-proteins (AGPs), extensins (EXTs), and proline-rich proteins (PRPs). Currently, the TFBSs for EXTs, AGPs, and PRPs expressed in the pollen cells of Arabidopsis are unknown and their discovery will provide a better understanding of the regulatory and evolutionary processes of these genes. Motif discovery and other bioinformatics tools were used to search the promoter regions of EXT, AGP, and PRP genes expressed in the Arabidopsis pollen cells and select motifs that are putative TFBSs. The best set of motifs discovered as putative pollen-specific TFBSs are GCYAMGKA, ACTMGGAA, CATSAAAMGA, and ATTKGKTTCT. Of the 8 pollen-specific promoters, GCYAMGKA occurs in 5 promoters, ACTMGGAA occurs in 2 promoters, CATSAAAMGA occurs in 4 promoters, and ATTKGKTTCT occurs in 3 promoters. Also, all of the 8 HRGP pollen-specific promoters have an occurrence of at least one of these four motifs and none of the four motifs occur in the 84 HRGP promoters of genes not expressed in pollen cells.
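The motifs reported above use IUPAC degenerate nucleotide codes (Y = C/T, M = A/C, K = G/T, S = C/G). A minimal scanner for such motifs might look like this; the promoter sequence shown is hypothetical, and this is an illustration rather than the tools actually used in the thesis.

```python
# IUPAC nucleotide ambiguity codes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
         "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
         "H": "ACT", "V": "ACG", "N": "ACGT"}

def motif_occurs(motif, promoter):
    """Scan a promoter for any window matching a degenerate IUPAC motif."""
    m = len(motif)
    for i in range(len(promoter) - m + 1):
        window = promoter[i:i + m]
        if all(base in IUPAC[code] for code, base in zip(motif, window)):
            return True
    return False

# GCYAMGKA matches GCCAAGGA (Y -> C, M -> A, K -> G); promoter is made up
hit = motif_occurs("GCYAMGKA", "TTGCCAAGGATT")
```

Counting hits of each motif across the 8 pollen-specific promoters and the 84 non-pollen promoters then yields occurrence tables like the one summarized above.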

Committee:

Lonnie Welch (Advisor)

Subjects:

Bioinformatics; Computer Science

Keywords:

Bioinformatics; transcription factor; transcription factor binding site; motif discovery; computer

Albukhnefis, Adil Lateef Mahmood. Nuclei and Nucleoli Segmentation and Analysis
MS, Kent State University, 2016, College of Arts and Sciences / Department of Computer Science
In biomedical imaging, segmentation and analysis play an important diagnostic role. Nuclei and nucleoli segmentation and classification have a significant impact on the cancer and tumor diagnostics in biological and medical research studies. Typically, segmentation is difficult in microscopic images because of object shapes and clustering in many samples. In this work we introduce a method that combines simplicity and efficiency. The proposed method utilizes the ImageJ framework to automatically segment and classify nuclei and nucleoli after applying some preprocessing techniques to improve the image quality and remove noise. The required preprocessing steps differ based on the kind of segmentation required. Both 2D and 3D segmentation are achieved for the nuclei and nucleoli. The analysis approach provides statistics about volume, area, surface and other properties of the segmented nuclei and nucleoli. The classification process then groups the segmented nuclei and nucleoli based on the previous criteria. Finally, the visualization process shows the results of the proposed method overlaid on the original data set. The proposed method provides a very efficient system for nuclei and nucleoli segmentation and achieves about 98% accuracy. Furthermore, the plugin is extremely fast when compared to manual segmentation especially with large data sets.
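The actual pipeline is an ImageJ plugin; as a language-agnostic illustration of the core threshold-then-label step it performs on a preprocessed image, here is a toy 2D version with hypothetical intensities.

```python
def segment_nuclei(image, threshold):
    """Toy 2D segmentation: global threshold followed by 4-connected
    component labelling via flood fill, one label per nucleus."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    current = 0
    for r in range(h):
        for c in range(w):
            if image[r][c] >= threshold and labels[r][c] == 0:
                current += 1
                stack = [(r, c)]
                while stack:                     # flood fill one nucleus
                    y, x = stack.pop()
                    if (0 <= y < h and 0 <= x < w and labels[y][x] == 0
                            and image[y][x] >= threshold):
                        labels[y][x] = current
                        stack += [(y + 1, x), (y - 1, x),
                                  (y, x + 1), (y, x - 1)]
    return current, labels

# Two bright blobs on a dark background (hypothetical intensity values)
img = [[0, 0, 0, 0, 0],
       [0, 9, 9, 0, 0],
       [0, 9, 9, 0, 8],
       [0, 0, 0, 0, 8]]
count, labels = segment_nuclei(img, threshold=5)   # count == 2
```

Per-object statistics (area, and in 3D, volume and surface) then follow directly by counting the pixels or voxels carrying each label.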

Committee:

Cheng Chang Lu (Advisor); Robert Clements (Committee Member); Austin Melton (Committee Member)

Subjects:

Bioinformatics; Biomedical Research

Keywords:

Biomedical imaging; segmentation; analysis; bioinformatics; image processing

Krabacher, Rachel M. Identifying Unique Material Binding Peptides Using a High Throughput Method
Master of Science (M.S.), University of Dayton, 2016, Chemical Engineering
Through biotic-abiotic interactions, it has been shown that peptides can recognize and selectively bind to a wide variety of materials dependent on both their surface properties and the environment. Better understanding of these peptides and the materials to which they bind can be beneficial in the development of biofunctionalization approaches for creating hybrid materials and sensors. Several research groups have identified material binding peptides using biopanning with phage or cell peptide display libraries. However, limitations with sequence diversity of traditional bacteriophage (phage) display libraries and loss of unique phage clones during the amplification cycles result in a smaller pool of peptide sequences identified. In order to overcome some of the limitations of traditional biopanning methodology, a modified method using phage display along with high-throughput next generation sequencing to select for unique peptides specific for different classes of single wall carbon nanotubes has been devised. The process, analysis and characterization of peptide sequences identified using the modified method are described and compared to peptides identified using the traditional methods. Selected sequences from this study were immobilized on surfaces and used in site-specific capture of metallic and/or semiconducting carbon nanotubes. A dispersion experiment was carried out to identify chiral specific peptides. From this research, successful methods have been identified to select and confirm binding peptides specific to various materials. Knowledge of chiral specific recognizing peptides can allow for the potential purification and separation of specific chirality carbon nanotubes, thus opening the door for a number of carbon nanotube applications which had been previously hindered by mixed carbon nanotube samples.
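One way to mine such NGS data is to rank peptides by frequency enrichment between an early and a late biopanning round. The sketch below is a generic illustration of that idea, not the thesis's actual analysis; sequences and counts are hypothetical, and a pseudocount guards against division by zero for reads absent from the early round.

```python
from collections import Counter

def round_enrichment(early_reads, late_reads):
    """Rank peptide sequences by frequency enrichment between an early
    and a late biopanning round, given raw NGS read lists."""
    early, late = Counter(early_reads), Counter(late_reads)
    ne, nl = len(early_reads), len(late_reads)
    scores = {}
    for pep, count in late.items():
        f_late = count / nl
        f_early = (early.get(pep, 0) + 1) / (ne + 1)  # pseudocount
        scores[pep] = f_late / f_early
    return sorted(scores, key=scores.get, reverse=True)

r1 = ["AAAA", "CCCC", "GGGG", "AAAA", "TTTT"]   # hypothetical round-1 reads
r3 = ["CCCC"] * 6 + ["AAAA"] * 2 + ["GGGG"]     # round 3: CCCC enriched
ranked = round_enrichment(r1, r3)               # ranked[0] == "CCCC"
```

Ranking by enrichment rather than raw final-round abundance helps distinguish genuine binders from clones that simply amplify well, one of the biases of traditional biopanning noted above.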

Committee:

Kristen Comfort, Dr. (Advisor); Rajesh Naik, Dr. (Advisor); Kevin Myers, Dr. (Committee Member); Christina Harsch, Dr. (Committee Member)

Subjects:

Biochemistry; Bioinformatics; Chemical Engineering; Materials Science

Keywords:

Carbon nanotubes; phage display; peptide immobilization; high throughput sequencing; bioinformatics; peptide binding; biotic-abiotic interaction

Tarcha, Eric J. Application of Immunoproteomics and Bioinformatics to Coccidioidomycosis Vaccinology
Doctor of Philosophy in Medical Sciences (Ph.D.), University of Toledo, 2006, College of Graduate Studies
Coccidioides is a primary fungal pathogen endemic to the alkaline desert soil of the Southwestern United States and the etiological agent of coccidioidomycosis (Valley fever), a respiratory disease in humans. Coccidioides represents the only fungal pathogen on the Centers for Disease Control’s select agent list of possible weapons of bioterrorism, and can cause significant morbidity in infected individuals. Thus, the need for a human vaccine against coccidioidomycosis has come to the forefront. Work has centered on identification and characterization of recombinant T cell-reactive antigens, because clinical and experimental data suggest that activation of a durable MHC II restricted T cell Th1 immune response is of principal importance in establishing durable protective immunity. Immunoprotection experiments in mice using single recombinant vaccine proteins of Coccidioides have resulted in less than optimal survival and clearance of the fungus from the infected host compared to whole cell or multicomponent subcellular vaccines, indicating a lack of protective epitopes in the single protein vaccines. Therefore, it is likely that a protective recombinant protein subunit human vaccine against Coccidioides will contain genetically unrestricted (“promiscuous”) protective T cell epitopes, and will most likely be multivalent in nature. In this study we describe an immunoproteomic and bioinformatic approach for profiling a diverse immunogenic protein component of the coccidioidal parasitic cell wall. A phospholipase B (Plb), alpha-mannosidase (Amn1), and an aspartyl protease (Pep1) were selected as candidate vaccine proteins on the basis of their immunogenicity, cellular localization, predicted promiscuous T cell epitope content, and T cell reactivity. These antigens were evaluated individually, and in combination, for their protective efficacy in a pulmonary murine model of infection.
Each individual protein showed significant protection in infected mice as evaluated by survival after lethal challenge (53%-61% survival). A combinatorial vaccine composed of all three protective antigens enhanced survival in infected mice (86% survival) and significantly improved clearance of the pathogen from lungs of surviving mice by 90 d post-challenge. This strategy has been successful in producing the most comprehensive profile of immunogenic coccidioidal cell wall antigens to date, and lays the groundwork for the development of an epitope-driven, multivalent human vaccine against coccidioidomycosis.

Committee:

Garry Cole, Ph.D. (Advisor)

Keywords:

vaccine development; multivalent vaccines; bioinformatics; proteomics; coccidioides
