Search Results (1 - 25 of 74 Results)

Kurz, Kyle W.
A Parallel, High-Throughput Framework for Discovery of DNA Motifs
Master of Science (MS), Ohio University, 2010, Computer Science (Engineering and Technology)

The search for genomic information has just begun. New genomes are sequenced daily, and each brings new challenges and knowledge to the scientific table that must be carefully mined and studied to glean out every possible bit of information. The amount of data created during genomic sequencing is simply too great for researchers to handle, creating a need for computational tools capable of processing the genomic input and analyzing it for information. The area of bioinformatics focuses on this combination of computer science and biology, bringing useful software applications to the table in an effort to ease the workload of biologists.

One specific area of interest to biological researchers is the study of DNA words or motifs as they relate to gene regulation. These regulatory elements may be transcription factor binding sites (TFBS), which bind RNA polymerase II to the DNA strand, or enhancer/silencer sequences that up- and down-regulate transcription of the gene to which they are related by binding specific proteins. Many tools such as Weeder [43], WordSpy [65] and YMF [55] are currently available for the study of over- and under-represented words in a DNA sequence, a trait which is believed to be useful in identification of these regulatory elements. These tools all perform similar tasks by enumerating all words, or substrings, found in their input, then scoring and ranking the resulting words for presentation to the user. Optionally, many tools also cluster groups of words together to form degenerate motifs which allow for evolutionary and environmental variation in the binding site.
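The enumerate, score, and rank steps described above can be sketched in a few lines. This is a minimal illustration, not the method used by Weeder, WordSpy, or YMF: the function names are hypothetical, and the score is an assumed observed/expected ratio under a uniform background in which every base is equally likely.

```python
from collections import Counter


def enumerate_words(sequence: str, k: int) -> Counter:
    """Count every length-k word (substring) in a DNA sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rank_words(sequence: str, k: int) -> list[tuple[str, float]]:
    """Rank words by observed count divided by the expected count.

    Assumption: a uniform background, so each of the 4**k possible
    words is expected equally often across the available positions.
    """
    counts = enumerate_words(sequence, k)
    positions = max(len(sequence) - k + 1, 0)
    expected = positions / (4 ** k)
    scored = [(word, count / expected) for word, count in counts.items()]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Real tools typically replace the uniform background with a model estimated from the genome, which changes the scores but not the overall shape of the pipeline.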

The Open Word Enumeration Framework (OWEF), presented in this thesis, provides a new framework on which DNA word enumeration tools can be built. The OWEF framework provides a set of abstract base classes representing the core stages of a word enumeration tool and defines a set of standard interfaces for each stage, allowing multiple algorithmic implementations of these base classes to co-exist and be selected individually at runtime.
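The idea of abstract base classes per stage, with implementations chosen at runtime, might look roughly like the sketch below. All class, registry, and function names here are hypothetical and are not taken from OWEF; only two stages (counting and scoring) are shown.

```python
from abc import ABC, abstractmethod
from collections import Counter


class WordCounter(ABC):
    """Stage interface: turn a sequence into word counts."""

    @abstractmethod
    def count(self, sequence: str, k: int) -> dict[str, int]: ...


class WordScorer(ABC):
    """Stage interface: turn word counts into scores."""

    @abstractmethod
    def score(self, counts: dict[str, int]) -> dict[str, float]: ...


class SlidingWindowCounter(WordCounter):
    def count(self, sequence: str, k: int) -> dict[str, int]:
        return dict(Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1)))


class FrequencyScorer(WordScorer):
    def score(self, counts: dict[str, int]) -> dict[str, float]:
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()} if total else {}


# Hypothetical registries: each stage implementation is selected by name at runtime.
COUNTERS = {"sliding": SlidingWindowCounter}
SCORERS = {"frequency": FrequencyScorer}


def run_pipeline(sequence: str, k: int, counter: str = "sliding",
                 scorer: str = "frequency") -> dict[str, float]:
    counts = COUNTERS[counter]().count(sequence, k)
    return SCORERS[scorer]().score(counts)
```

Adding a new algorithm then means writing one subclass and registering it, without touching the other stages.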

In addition to providing a level of abstraction that allows for simpler development, the framework also provides a scalable solution to alleviate memory bottlenecks. The framework contains skeleton code for both a shared memory implementation, providing fast analysis on single-node, multiprocessor systems, and a distributed memory solution, which splits the tasks among several networked nodes to provide a large amount of accessible main memory to the application.

In summary, the OWEF framework is useful as a development tool by providing a set of interfaces and methods to allow developers to focus on specific aspects of the algorithms they are designing, while also providing a standardized, flexible interface to researchers, eliminating the need for specialized tools and providing a general-purpose toolkit for DNA word enumeration tasks.

Committee:

Lonnie Welch, PhD (Committee Chair); Frank Drews, PhD (Committee Member); Chang Liu, PhD (Committee Member); Robert Colvin, PhD (Committee Member)

Subjects:

Bioinformatics; Computer Science

Keywords:

DNA motifs; bioinformatics; motif discovery; bioinformatics framework

GUDIVADA, RANGA CHANDRA
DISCOVERY AND PRIORITIZATION OF BIOLOGICAL ENTITIES UNDERLYING COMPLEX DISORDERS BY PHENOME-GENOME NETWORK INTEGRATION
PhD, University of Cincinnati, 2007, Engineering : Biomedical Engineering
An important goal for biomedical research is to elucidate causal and modifier networks of human disease. While integrative functional genomics approaches have shown success in the identification of biological modules associated with normal and disease states, a critical bottleneck is representing knowledge capable of encompassing asserted or derivable causality mechanisms. Both single gene and more complex multifactorial diseases often exhibit several phenotypes and a variety of approaches suggest that phenotypic similarity between diseases can be a reflection of shared activities of common biological modules composed of interacting or functionally related genes. Thus, analyzing the overlaps and interrelationships of clinical manifestations of a series of related diseases may provide a window into the complex biological modules that lead to a disease phenotype. In order to evaluate our hypothesis, we are developing a systematic and formal approach to extract phenotypic information present in textual form within Online Mendelian Inheritance in Man (OMIM) and Syndrome DB databases to construct a disease - clinical phenotypic feature matrix to be used by various clustering procedures to find similarity between diseases. Our objective is to demonstrate relationships detectable across a range of disease concept types modeled in UMLS to analyze the detectable clinical overlaps of several Cardiovascular Syndromes (CVS) in OMIM in order to find the associations between phenotypic clusters and the functions of underlying genes and pathways. Most of the current biomedical knowledge is spread across different databases in different formats and mining these datasets leads to large and unmanageable results. Semantic Web principles and standards provide an ideal platform to integrate such heterogeneous information and could allow the detection of implicit relations and the formulation of interesting hypotheses. 
We implemented a page-ranking algorithm on the Semantic Web to prioritize biological entities by their relative contribution and relevance, which can be combined with this clustering approach. In this way, disease-gene, disease-pathway or disease-process relationships could be prioritized by mining a phenome - genome framework that not only discovers but also determines the importance of the resources by making queries of higher order relationships of multi-dimensional data that reflect the feature complexity of diseases.
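To make the prioritization idea concrete, the sketch below runs a generic PageRank-style power iteration over a small directed graph. It is an illustration under stated assumptions, not the dissertation's implementation: the graph encoding, the damping factor of 0.85, and the function name are all assumed, and a real Semantic Web graph would carry typed RDF edges rather than plain links.

```python
def pagerank(graph: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Generic PageRank by power iteration on a directed graph.

    graph maps each node to the list of nodes it links to.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

In a toy graph where two diseases both link to one gene, that gene ends up ranked above either disease, which is the kind of relative-importance signal the abstract describes.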

Committee:

Dr. Bruce Aronow (Advisor)

Keywords:

Semantic Web; RDF; OWL; SPARQL; Ontology; Biomedical Informatics; Bioinformatics; Integrative Bioinformatics; Text Mining; Phenome; Genome; Disease Modularity; Data Integration; Semantic Integration

Hayes, Matthew
Algorithms to Resolve Large Scale and Complex Structural Variants in the Human Genome
Doctor of Philosophy, Case Western Reserve University, 2013, EECS - Computer and Information Sciences
It has been shown that large scale genomic structural variants (SV) are closely associated with disease onset. In particular, the presence of these abnormalities may contribute to the onset and susceptibility of cancer through various mechanisms. Knowing the location and type of these variants can assist medical researchers in making insights into methods for diagnosis and treatment. It is also important to develop efficient methods to locate these variants. This thesis presents several algorithms for identifying and characterizing structural variants using array comparative genomic hybridization (aCGH) and high throughput next-generation sequencing (NGS) platforms. The aCGH-based algorithm (CGH-Triangulator) is considerably faster than a state-of-the-art method for identifying change points in aCGH data, and it has greater prediction power on datasets with low-to-moderate levels of noise. The NGS-based algorithms include methods to identify basic SV types, including deletions, inversions, translocations, and tandem repeats. They also include methods to identify double minute chromosomes, which are more complex structural variants. These methods use a hybrid strategy to identify variants at base-pair resolution. Using two primary prostate cancer datasets and simulated datasets, we compared our methods to previously published NGS algorithms. Overall, our methods had favorable performance with respect to breakpoint prediction accuracy, sensitivity, and specificity. In particular, this thesis presents one of the first attempts to algorithmically detect double minute chromosomes, which are complex rearrangements that are present in many cancers.

Committee:

Jing Li (Advisor)

Subjects:

Bioinformatics; Computer Science; Molecular Biology

Keywords:

Bioinformatics; structural variation; Bellerophon; complex genomic rearrangements; genomic structural variation; inversion; tandem repeat; translocation; interchromosomal insertion; double minute; double minute chromosome; deletion; computational biology

Zhang, Xuan
Supporting on-the-fly data integration for bioinformatics
Doctor of Philosophy, The Ohio State University, 2007, Computer and Information Science
The use of computational tools and on-line data knowledgebases has changed the way biologists conduct their research. The fusion of biology and information science is expected to continue. Data integration is one of the challenges faced by bioinformatics. In order to build an integration system for modern biological research, three problems have to be solved. A large number of existing data sources have to be incorporated, and when new data sources are discovered, they should be utilized right away. The variety of biological data formats and access methods has to be addressed. Finally, the system has to be able to understand the rich and often fuzzy semantics of biological data. Motivated by the above challenges, a system and a set of tools have been implemented to support on-the-fly integration of biological data. Metadata about the underlying data sources are the backbone of the system. Data mining tools have been developed to help users write the descriptors semi-automatically. With an automatic code generation approach, we have developed several tools for bioinformatics integration needs. An automatic data wrapper generation tool is able to transform data between heterogeneous data sources. Another code generation system can create programs to answer projection, selection, cross product and join queries from flat file data. Real bioinformatics requests have been used to test our system and tools. These case studies show that our approach can reduce the human effort involved in an information integration system. Specifically, it makes the following contributions. 1) Data mining tools allow new data sources to be understood with ease and integrated into the system on-the-fly. 2) Changes in data format are localized by using the metadata descriptors. System maintenance cost is low. 3) Users interact with our system through high-level declarative interfaces. Programming efforts are reduced. 4) Our tools process data directly from flat files and require no database support. Data parsing and processing are done implicitly. 5) Request analysis and request execution are separated and our tools can be used in a data grid environment.
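The selection, projection, and join queries over flat files mentioned above can be illustrated with a hand-written sketch. The dissertation generates such programs automatically from metadata descriptors; the code below does not attempt that and only shows what the resulting operations compute. The function names and the tab-delimited layout with a header row are assumptions.

```python
import csv
import io


def read_flat(text: str, delimiter: str = "\t") -> list[dict[str, str]]:
    """Parse delimited flat-file text with a header row into row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def select(rows, predicate):
    """Selection: keep the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def project(rows, fields):
    """Projection: keep only the named fields of each row."""
    return [{field: row[field] for field in fields} for row in rows]


def join(left, right, key):
    """Equi-join on a shared key, using a hash index over the right side."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]
```

For example, joining a gene table with an annotation table on a shared `gene` column and then projecting two fields answers a typical lookup request without loading either file into a database.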

Committee:

Gagan Agrawal (Advisor)

Subjects:

Computer Science

Keywords:

information integration; bioinformatics

Kuntala, Prashant Kumar
Optimizing Biomarkers From an Ensemble Learning Pipeline
Master of Science (MS), Ohio University, 2017, Electrical Engineering & Computer Science (Engineering and Technology)
Understanding gene expression patterns is crucial in deciphering any observed biological phenotypes. Transcription factors (TF) are proteins that regulate genes by binding to a transcription factor binding site (TFBS) within the promoter region of a gene. Motif discovery is a computational approach that conventionally uses stochastic models, enumeration methods and many other techniques to report candidate motifs (TFBS). These methods generate similar motifs for a TF due to various reasons. Motif selection algorithms successfully identify a small set of motifs that address the specificity problem and coverage problem in motif discovery. However, these selected motifs do not always capture all the binding site preferences for a TF. This study verifies the hypothesis that motif discovery tools generate similar motifs for a transcription factor and that, once these variants (similar motifs) are identified, they can be used to form a super motif set, which may improve the accuracy of motif discovery. This study introduces the concept of the Super motif set, a new model to accurately predict the binding sites for a TF. Two heuristic algorithms are introduced to identify Super motif sets, utilizing motif selection algorithms and a motif comparison tool. The identified super motif sets capture the biological diversity in TFBS preferences of a TF. The algorithms are evaluated on ChIP-seq data for 54 transcription factor groups from the ENCODE project. Moreover, the proposed algorithms are used to optimize the motifs that are reported by motif selection algorithms and to report super motif sets in three case studies: Chagas disease, pollen specific HRGP genes in Arabidopsis thaliana and Shigellosis. On average, two motif variants are added to the selected motifs, which improves the accuracy of motif discovery.

Committee:

Frank Drews (Advisor); Lonnie Welch (Committee Chair); Jundong Liu (Committee Member); Erin Murphy (Committee Member)

Subjects:

Bioinformatics; Biology; Biomedical Research; Computer Engineering; Computer Science; Genetics; Molecular Biology

Keywords:

Motif Discovery; Motif Selection; Super Motif Set; Transcription Factor; Heuristic algorithm; DNA Motifs; Ensemble Learning; Genomics; ENCODE; Chagas disease; Shigellosis; Bioinformatics; Computational Biology;

Moller, Abraham Ghoreishi
Mapping ecologically important virus-host interactions in geographically diverse solar salterns with metagenomics
Master of Science, Miami University, 2016, Cell, Molecular and Structural Biology (CMSB)
Viruses that infect microbes are critical players in the world’s ecosystems. By lysing microbes, viruses turn over nutrients, regulate microbial populations, and maintain global biogeochemical cycles. Despite this ecological importance, determining viral-microbial interactions - especially interactions with lytic viruses, which do not integrate into their host’s genome - remains a major challenge. In this work, we determine viral interactions with microbes in salt-collecting ponds (salterns), where viruses are the dominant predator of the microbial community. Low microbial diversity, environmental stability, and high viral density also make solar salterns excellent model ecosystems for studying viral-archaeal interactions. By using a suite of bioinformatics tools to analyze saltern metagenomes, we mapped virus-host interactions across geographically diverse salterns and related them to carbon cycling. Our studies suggest viruses are critical players in saltern carbon cycling, and the loss of CRISPRs in archaea hosts may play an important role in regulating virus-mediated nutrient cycling in these environments.

Committee:

Chun Liang, PhD (Advisor); Michael Crowder, PhD (Committee Chair); Gary Lorigan, PhD (Committee Member)

Subjects:

Bioinformatics; Ecology; Microbiology

Keywords:

metagenomics, bioinformatics, CRISPR, virus-host interactions, hypersaline ecosystems, solar salterns, environmental microbiology

Garcia, Krystine
Bioinformatics Pipeline for Improving Identification of Modified Proteins by Neutral Loss Peak Filtering
Master of Science (MS), Ohio University, 2015, Biomedical Engineering (Engineering and Technology)
Research has found that the central dogma of molecular biology is much more complex than making a gene into mRNA and then protein. One way that this has been found to be true is in post-translational modification (PTM) of proteins. PTMs are changes to specific side chains of amino acids after translation from mRNA. Unfortunately, only a few have been carefully studied meaning that most are not well understood. In turn, this has created problems for current methods of protein identification. Tandem mass spectrometry (MS/MS) is one common method of protein isolation and fragmentation that is usually followed by computational analysis for protein and peptide identification. The output from MS/MS is a spectrum of ion mass-to-charge ratios (m/z) versus their intensity, or a peak list, which is fed into the identification software for analysis. While these programs are generally good at identification, they were not created to incorporate large numbers of modifications. The problem is that spectra can include significant amounts of noise from modifications that can mask the y- and b- ion peaks used for identification. This has caused preprocessing to become both prevalent and necessary for identification of modified proteins. Through preprocessing and filtering of peaks from MS/MS data, we have enhanced the identification of proteins with modifications and specific phenotypes, presenting scientists with a more powerful tool for their protein research.
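As an illustration of neutral loss peak filtering, the sketch below drops peaks that sit at the precursor m/z minus a neutral loss. It is a simplified stand-in, not the thesis pipeline: the function name, the 0.5 m/z tolerance, and the default loss of 97.9769 Da (the monoisotopic mass of phosphoric acid, a common loss from phosphopeptides) are assumptions chosen for the example.

```python
def filter_neutral_loss(peaks, precursor_mz, charge,
                        loss_mass=97.9769, tolerance=0.5):
    """Remove peaks near the expected neutral-loss position.

    peaks is a list of (m/z, intensity) pairs. The neutral-loss peak
    is expected at precursor_mz - loss_mass / charge; any peak within
    `tolerance` m/z units of that position is dropped.
    """
    target = precursor_mz - loss_mass / charge
    return [(mz, intensity) for mz, intensity in peaks
            if abs(mz - target) > tolerance]
```

The filtered peak list would then be passed to the identification software in place of the raw one.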

Committee:

Lonnie Welch, Dr. (Advisor); Jennifer Hines, Dr. (Committee Member); Frank Drews, Dr. (Committee Member); Frank Schwartz, Dr. (Committee Member)

Subjects:

Biochemistry; Bioinformatics; Biomedical Engineering; Computer Science

Keywords:

post-translational modifications; mass spectrometry; bioinformatics; peptide peak filtering; proteins

Camerlengo, Terry Luke
Techniques for Storing and Processing Next-Generation DNA Sequencing Data
Master of Science, The Ohio State University, 2014, Biophysics
Genomics is undergoing unprecedented transformation due to rapid improvements in genetic sequencing technology, which has lowered costs for genetic sequencing experiments while increasing the amount of data generated in a typical experiment (McKinsey Global Institute, May 2013, pp. 86-94). The increase in data has shifted the burden from analysis and research to expertise in IT hardware and network support for distributed and efficient processing. Bioinformaticians, in response to a data-rich environment, are challenged to develop better and faster algorithms to solve problems in genomics and molecular biology research. This thesis examines the storage and data processing issues inherent in next-generation DNA sequencing (NGS). This work details the design and implementation of a software prototype that exemplifies the current approaches as they relate to the efficient storage of NGS data. The software library is utilized within the context of a previous software project which accompanies the publication related to the HT_SOSA assay. The software for the HT_SOSA, called NGSPositionCounter, demonstrates a workflow that is common in a molecular biology research lab. In an effort to scale beyond the research institute, the software library’s architecture takes into account scalability considerations for data storage and processing demands that are more likely to be encountered in a clinical or commercial enterprise.

Committee:

Kun Huang, Ph.D (Advisor); Alvarez Carlos, Ph.D (Committee Member); Machiraju Raghu, Ph.D (Committee Member)

Subjects:

Bioinformatics

Keywords:

DNA sequence storage; 4 bit encoding; reference-based compression; Needleman-Wunsch; DNA base pair compression; sequence compression; MongoDB; NGS Data management; bioinformatics; NoSQL; 3 bases per byte
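The "4 bit encoding" keyword above refers to storing each base in half a byte. A minimal sketch of that idea follows; the specific code table and the function names are assumptions for illustration and may differ from the encoding used in the thesis.

```python
# Assumed 4-bit code table: one bit per base, all bits set for N (unknown).
BASE_TO_CODE = {"A": 0x1, "C": 0x2, "G": 0x4, "T": 0x8, "N": 0xF}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def pack(sequence: str) -> bytes:
    """Pack a DNA string into bytes, two bases per byte."""
    codes = [BASE_TO_CODE[base] for base in sequence.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad the final low nibble; 0 means "no base"
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    codes = []
    for byte in data:
        codes.append(byte >> 4)
        codes.append(byte & 0xF)
    return "".join(CODE_TO_BASE[code] for code in codes[:length])
```

Because padding is ambiguous without it, the original sequence length has to be stored alongside the packed bytes. Four bits per base halves storage relative to one ASCII character per base while still leaving room for ambiguity codes.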

Johnson, Stephen Robert
iPathCase
Master of Sciences, Case Western Reserve University, 2012, EECS - Computer and Information Sciences

PathCase is a system designed to provide life scientists with an integrated environment to study pathways, regardless of the source producing the corresponding data. It includes databases for different pathway data sources, web interfaces for viewing the information in the databases, tools for analyzing the data, and web services to access the data programmatically.

This thesis describes the design and implementation of new touch screen-based tools to view and analyze the data in the PathCaseKEGG and PathCaseMAW databases. The first tool, iPathCase-SMDA, provides compartment-aware pathway visualizations and access to the Steady-State Metabolic Network Dynamics Analysis tool. The second tool, iPathCase-KEGG, provides visualizations of KEGG pathways, including relevant information from other data sources outside of PathCaseKEGG and KEGG.

Committee:

Michael Branicky, ScD (Committee Chair); Gultekin Ozsoyoglu, PhD (Advisor); Mehmet Koyuturk, PhD (Committee Member); Meral Ozsoyoglu, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

bioinformatics; metabolic pathways; iPad; KEGG; MAW; SMDA; PathCase

SINHA, AMIT U.
Discovery and Analysis of Genomic Patterns: Applications to Transcription Factor Binding and Genome Rearrangement
PhD, University of Cincinnati, 2008, Engineering : Computer Science

One of the most challenging open problems of the post-genomic era for computer scientists, bioinformaticians and molecular biologists is to identify and characterize all the elements involved in gene regulation, at both the transcriptional and post-transcriptional level. The rapid availability of multiple genomes also creates a need for efficient computational solutions to aid in comparative syntenic analysis - the analysis of relative gene-order conservation between species - which can provide key insights into evolutionary chromosomal dynamics, rearrangement rates between species, and speciation analysis. Thus, in this dissertation, we address these issues and develop computational approaches to identify genomic patterns at both micro and macro levels.

To address the problem of gene regulation, we developed algorithms for identifying transcription factor binding sites (short repeated patterns or sequence motifs) in genomes. We have developed a level based search algorithm which is able to identify regulatory motifs in a wide variety of datasets and demonstrate that our method works more efficiently than the current best methods. Further, we have also developed statistical models for identification of known motifs. Finally, we refine the motif discovery process through methods that discriminate and characterize the co-factors of a transcription factor.

At the macro level, we developed efficient methods (Cinteny) for fast identification of syntenic blocks with various levels of coarse graining and for determining evolutionary relationships between genomes in terms of the number of rearrangements (the reversal distance). The Cinteny web server integrates syntenic region browsing with evolutionary distance assessment, offers the flexibility to adjust all parameters and recompute the results on-the-fly, and can work with user-provided data.

Committee:

Raj Bhatnagar (Advisor); Jaroslaw Meller (Committee Co-Chair); Anil Jegga (Committee Co-Chair); Ali Minai (Committee Member); Yizong Cheng (Committee Member)

Subjects:

Bioinformatics

Keywords:

computational biology; bioinformatics; transcription factor; genome rearrangement

Ozer, Hatice Gulcin
Residue Associations In Protein Family Alignments
Doctor of Philosophy, The Ohio State University, 2008, Biophysics

The increasing amount of data on biomolecule sequences and their multiple alignments for families, has promoted an interest in discovering structural and functional characteristics of proteins from sequence alone. In many proteins interactions between residues appear to be key players in structure and function. Consensus, or weight matrix, or hidden Markov models cannot detect interpositional correlations and alternating motifs within a family alignment. We propose and analyze a method for detecting interpositional correlations and examine the applicability of this method to structural prediction.

We present the Multiple Alignment Variation Linker (MAVL) and StickWRLD to analyze biomolecule sequence alignments and visualize positive and negative interpositional residue associations. In the MAVL analysis system, the expected number of sequences that should share identities and residuals are calculated based on the observed population of sequences actually sharing the residues. Correlating pairs of residues are visualized in a StickWRLD diagram. This analysis system allows us to extract information from the alignments which is not accessible to traditional column-based methods. In addition, a StickWRLD diagram enables the user to visualize the family alignment and positional dependencies in 3D.

We discuss methodologies to identify residue associations in protein family alignments. We discuss the use of the residuals and the phi coefficient to determine the strength of a residue association, and the Fisher exact probability test to evaluate its statistical significance. We computed identitywise residue associations for 961 Pfam family alignments and examined physical proximity and physiochemical properties of associated residues in the alignments and their presence on secondary structural elements. We observed that the proximity of residues increases as the strength of association and its statistical significance increase. Specifically, associations between aromatic residues and hydrophilic residues are present in closer proximity. The amino acid contact predictivity of the residual parameter is the highest compared to the phi coefficient and the statistical significance. Compared to the expected distributions, we observed larger proportions of the pairs such that both residues are in a helix or one residue is in a structural element while the other is in a flexible region, or both residues are in a flexible region.
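The residual and phi coefficient for one residue pair can be computed from a 2x2 contingency table over the aligned sequences. The sketch below assumes the residual is simply the observed count minus the count expected under independence; the dissertation may normalize it differently. The function name and the string-per-column input layout are also assumptions.

```python
from math import sqrt


def association(column_i: str, column_j: str, residue_a: str, residue_b: str):
    """Return (residual, phi) for residue_a at position i and residue_b at position j.

    column_i and column_j hold one character per aligned sequence.
    """
    if len(column_i) != len(column_j) or not column_i:
        raise ValueError("columns must be non-empty and of equal length")
    n11 = n10 = n01 = n00 = 0
    for x, y in zip(column_i, column_j):
        has_a, has_b = x == residue_a, y == residue_b
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    n = n11 + n10 + n01 + n00
    # Expected co-occurrence if the two positions were independent.
    expected = (n11 + n10) * (n11 + n01) / n
    residual = n11 - expected
    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    phi = (n11 * n00 - n10 * n01) / denominator if denominator else 0.0
    return residual, phi
```

A positive residual with phi near 1 indicates residues that co-occur far more often than chance; statistical significance would be assessed separately, for example with a Fisher exact test on the same table.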

Committee:

William Ray, PhD (Advisor); Charles Daniels, PhD (Committee Member); Hakan Ferhatosmanoglu, PhD (Committee Member); Thomas Magliery, PhD (Committee Member)

Subjects:

Bioinformatics; Biophysics

Keywords:

Family Alignment; Positional Dependency; Amino Acid Correlation; Residue Correlation; Residue Association; Protein Sequence; Protein Structure; Pfam database; Bioinformatics; Fisher Exact test; Phi coefficient

Xu, Hua
Novel data analysis methods and algorithms for identification of peptides and proteins by use of tandem mass spectrometry
Doctor of Philosophy, The Ohio State University, 2007, Chemistry
Tandem mass spectrometry is one of the most important tools for protein analysis. This thesis is focused on the development of new methods and algorithms for tandem mass spectrometry data analysis. A database search engine, MassMatrix, has also been developed that incorporates these methods and algorithms. The program is publicly available both on the web server at www.massmatrix.net and as a deliverable software package for personal computers. Three different scoring algorithms have been developed to identify and characterize proteins and peptides by use of tandem mass spectrometry data. The first one is targeted at the next generation of tandem mass spectrometers that are capable of high mass accuracy and resolution. Two scores calculated by the algorithm are sensitive to high mass accuracy due to the fact that this new algorithm explicitly incorporates mass accuracy into scoring potential peptide and protein matches for tandem mass spectra. The algorithm is further improved by employing Monte Carlo Simulations to calculate ion abundance based scores without any assumptions or simplifications. For high mass accuracy data, MassMatrix provides improvements in sensitivity over other database search programs. The second scoring algorithm based on peptide sequence tags inferred from tandem mass spectra further improves the performance of MassMatrix for low mass accuracy tandem mass spectrometry data. The third algorithm is the first automated data analysis method that uses peptide retention times in liquid chromatography to evaluate potential peptide matches for tandem mass spectrometry data. The algorithm predicts reverse phase liquid chromatography retention times of peptides by their hydrophobicities and compares the predicted retention times with the observed ones to evaluate the peptide matches. In order to handle low quality data, a new method has also been developed to reduce noise in tandem mass spectra and screen poor quality spectra. 
In addition, a data analysis method for identification of disulfide bonds in proteins and peptides by tandem mass spectrometry data has been developed and incorporated in MassMatrix. By this new approach, proteins and peptides with disulfide bonds can be directly identified in tandem mass spectrometry with high confidence without any chemical reduction and/or other derivatization.

Committee:

Michael Freitas (Advisor)

Subjects:

Chemistry, Analytical

Keywords:

Mass spectrometry; Proteomics; Database search; Data analysis; Bioinformatics

Kirac, Mustafa
Pattern Oriented Methods for Inferring Protein Annotations within Protein Interaction Networks
Doctor of Philosophy, Case Western Reserve University, 2009, EECS - Computer and Information Sciences

Discovering protein functions is a major task in computational biology, since proteins have key roles in the underlying mechanisms of cellular processes, phenotypes, and diseases. The most common in silico protein annotation method is function transfer through sequence homology, which does not always produce correct results. Consequently, we propose an alternative research direction of assigning functional annotations to proteins (and genes) based on biological network information. In general, our approach is to transfer functionality between related proteins. We present our approaches in three parts:

1. In the first part, we compute the probabilistic significance of GO annotation sequences obtained from the annotations of a sequence of proteins in a protein-protein interaction network. After identifying significant annotation sequences, we predict the annotation of a target protein by picking the most significant candidate GO annotation sequence observed in the close neighborhood of the target protein. Our cross-validation prediction experiments with pre-annotated proteins recovered correct annotations of proteins with 81% precision with the recall at 45%.

2. In the second part, we develop and evaluate a new pattern-based function annotation framework. For a given target protein P, and for each GO term t, we compare (through graph alignment) neighborhood of P with neighborhoods of proteins annotated by t. We then assign to P the GO term whose neighborhoods are the most similar to the neighborhood of P. In this part, we improve the accuracy of techniques introduced in the first part, by 30.44%, 41.94%, and 2.62% in the organism-specific networks of fly, worm, and yeast, respectively.

3. In the third part, we present a technique that improves our pattern-based methodologies with an iterative prediction algorithm. In this part, by using a multi-iteration algorithm, we predict functions of protein P at one step, and employ predicted functions of P for fine-tuning the predictions of other target proteins at a later step. Plugging in the iterative prediction algorithm improves the accuracy of pattern-based function annotation framework presented in the second part by 11.24%, 14.32%, 5.6%, and 15.14% in organism-specific networks of fly, human, worm, and yeast, respectively.
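The general idea of transferring functions between related proteins in an interaction network can be shown with a much simpler baseline than the three methods above: a majority vote over the annotations of a protein's direct neighbors. This is not the dissertation's algorithm (it uses annotation-sequence significance, graph alignment, and iterative refinement); the function name and data layout are assumptions.

```python
from collections import Counter


def predict_annotation(target, interactions, annotations):
    """Predict a GO term for `target` by majority vote among its neighbors.

    interactions maps a protein to the proteins it interacts with;
    annotations maps a protein to its known GO terms.
    Returns None when no neighbor carries any annotation.
    """
    votes = Counter()
    for neighbor in interactions.get(target, ()):
        votes.update(annotations.get(neighbor, ()))
    if not votes:
        return None
    best = max(votes.values())
    # Break ties alphabetically so the result is deterministic.
    return min(term for term, count in votes.items() if count == best)
```

The pattern-based and iterative methods described in the abstract can be read as progressively richer replacements for this vote, using larger neighborhoods and previously predicted labels.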

Committee:

Gultekin Ozsoyoglu (Advisor); Rob Ewing (Committee Member); Jiong Yang (Committee Member); Mehmet Koyuturk (Committee Member); Zehra Meral Ozsoyoglu (Committee Member)

Subjects:

Bioinformatics; Computer Science

Keywords:

Bioinformatics; protein interaction networks; protein function prediction

Xu, Yaomin
New Clustering and Feature Selection Procedures with Applications to Gene Microarray Data
Doctor of Philosophy, Case Western Reserve University, 2008, Statistics

Statistical data mining is one of the most active research areas. In this thesis we develop two new data mining procedures and explore their applications to genetic data.

The first procedure is called PfCluster - Profile Cluster Analysis. It is a clustering method designed for profiled genetic data. The PfCluster is efficient and flexible in uncovering clusters determined by a new class of biologically meaningful distance metrics. A new internal quality measure of clusters, coherence index, is developed to find coherent clusters. An efficient mechanism for choosing the threshold of coherent clusters is also derived and implemented. The threshold is based on the first and second order approximations to the true threshold under a null distribution for parallel clusters. The PfCluster has been applied to simulated data and two real data examples: a biomarker LOH dataset and a microarray gene expression dataset. PfCluster is competitive with the correlation-based clustering procedures.

The second procedure is called RPselection - Resampling based partitioning selection. It is a feature selection algorithm designed for microarray studies. It selects a subset of genes that maximizes a fitness score. The fitness score measures the relevance between the partition labels from a clustering result and an external class label derived from the clinical outcomes. The score is computed using a resampling procedure. The RPselection algorithm has been applied to simulated data and a real uveal melanoma gene expression dataset. RPselection outperforms gene-by-gene test-based feature selection procedures.

Software development is an integral part of modern statistical research. Two software packages, pfclust and rpselect, are developed in this thesis based on our PfCluster method and RPselection algorithm. The pfclust and rpselect packages are implemented in the R object-oriented programming framework, and they can be easily customized and extended by users.

The ideas in our two procedures can be generalized and applied to other data mining tasks. This thesis concludes with a discussion of the connections between the two methods and related future research.

Committee:

Jiayang Sun (Advisor)

Subjects:

Statistics

Keywords:

Bioinformatics; coherence index; data mining; feature selection; gene expression pathway; gene profiling; informative gene; microarray data; profile cluster analysis; partitioning; regulatory network; statistical pattern recognition

Brennan, Patrick J
An Investigation of Personal Ancestry Using Haplotypes
Master of Science, University of Toledo, 2017, Biology (Cell-Molecular Biology)
Several companies over the past decade have started to offer ancestry analysis, the most notable company being 23andMe. For a relatively low price, 23andMe will sequence select variants in a person’s genome to determine where their ancestors came from. Since 23andMe is a private company, the exact techniques and algorithms it uses to determine ancestry are proprietary. Many customers have wondered about the accuracy of these results, often citing their own genealogical research of recent ancestors. To bridge the gap between 23andMe and the public, we sought to provide a tool that could assess the ancestry results of 23andMe. Using publicly available 23andMe genotype files, we constructed a program pipeline that takes these files and compares them against genomes from the 1000 Genomes Project. We constructed haplotypes from the 23andMe file by converting 50 adjacent SNPs (single nucleotide polymorphisms) into haplotypes and comparing them against the haplotypes of 2504 individuals in the Phase 3 data from the 1000 Genomes Project. To smooth the data, we bundled together six of our haplotype segments to form an “IBD segment” (Identity-by-descent segment) and used a point scoring system to calculate the highest matching population. Our pipeline determined ancestry results for 57 individuals with similar results to 23andMe. Fifty of our subjects showed European ancestry, while the other seven subjects showed ancestry from East Asia, Africa, America, and South Asia. Of our 5 geographic categories (South Asia, Africa, America, Europe, East Asia), 98% of our subjects showed ancestral representation from 4 of the 5 categories. In addition to ancestry, we also investigated IBD sharing across populations, particularly IBD segments in the Human Leukocyte Antigen (HLA) region on chromosome 6. We hope this tool will help 23andMe customers and those of similar genotyping companies understand the methods used to determine ancestry and verify their results.
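
The windowing and bundling steps described in this abstract can be illustrated with a minimal Python sketch. The window size (50 SNPs) and bundle size (six segments) come from the abstract; everything else is an assumption made for illustration, including the function names, the treatment of genotypes as already-phased allele strings, and the scoring rule (one point per bundle to the population matching the most windows). The thesis's actual point system is not given here.

```python
from collections import Counter

WINDOW = 50  # adjacent SNPs per haplotype (from the abstract)
BUNDLE = 6   # haplotype segments per "IBD segment" (from the abstract)


def windows(alleles, size=WINDOW):
    """Split an allele sequence into consecutive, non-overlapping windows."""
    return [tuple(alleles[i:i + size])
            for i in range(0, len(alleles) - size + 1, size)]


def score_populations(query, references, size=WINDOW, bundle=BUNDLE):
    """Hypothetical scoring rule: for each bundle of windows, award one point
    to the population whose individuals match the most windows in it.

    references maps a population label to a list of allele sequences.
    """
    q = windows(query, size)
    ref_windows = {pop: [windows(ind, size) for ind in inds]
                   for pop, inds in references.items()}
    points = Counter()
    for start in range(0, len(q) - bundle + 1, bundle):
        matches = Counter()
        for pop, inds in ref_windows.items():
            for k in range(start, start + bundle):
                # A window counts once per population if any individual matches.
                if any(k < len(w) and w[k] == q[k] for w in inds):
                    matches[pop] += 1
        if matches:
            best, _ = matches.most_common(1)[0]
            points[best] += 1
    return points
```

Calling `score_populations(query, references)` returns a `Counter` of points per population; the smaller `size` and `bundle` arguments are only there so the sketch can be tried on toy sequences.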

Committee:

Alexei Fedorov (Committee Chair); Robert Blumenthal (Committee Member); Sadik Khuder (Committee Member)

Subjects:

Bioinformatics

Keywords:

haplotypes; population; 1000 genomes; genomes; ancestry; 23andMe; perl; bioinformatics; SNP

Shankar, Vijay
Extension of Multivariate Analyses to the Field of Microbial Ecology
Doctor of Philosophy (PhD), Wright State University, 2016, Biomedical Sciences PhD
Ground-breaking advancements in molecular and analytical techniques in the past decade have enabled researchers to accumulate data at an extraordinary rate. Especially in the field of microbial ecology, the introduction of technologies such as high-throughput sequencing, quantitative microarrays, nuclear magnetic resonance and mass spectrometry has led to the interrogation of diverse and previously unexplored microbial communities at unparalleled depth. Analysis and interpretation of patterns within datasets acquired with such high-throughput methods require powerful statistical approaches. A class of such techniques called multivariate statistical analyses is an excellent choice for analysis of complex microbiota-related datasets. This field of statistics is constantly evolving as new techniques and procedures are being developed and applied to explore and interpret the underlying patterns both statistically and visually. As a result, the decision-making process involved in the choice of the technique that best suits the scientific question and the dataset is no longer trivial. Additionally, the current trends in the use of multivariate statistics in microbial ecology indicate a strong preference toward exploratory analyses, resulting in limitations to possible biological interpretations. In order to facilitate a more extensive integration of multivariate statistics in microbial ecology, I apply a diverse set of analytical methods to human-associated microbial and metabolite datasets that allows us to draw biologically relevant inferences. Specifically, I use indirect gradient analyses to show that the largest gradients of variability correspond to the separation of samples based on sample groups. I use direct gradient analyses to explain a significant portion of the overall variability present within the response variables using independently measured environmental variables. 
I use classifier techniques to build highly accurate discriminant models based on the differences in the response variables across sample groups and identify the variables that contribute the most to sample group separation. Using correlation-based bipartite analyses, I identify statistically significant associations between two different sets of response variables that were measured for the same set of samples. Finally, I integrate the analytical insights from the above approaches into a generalized protocol for the analysis of multivariate datasets in the field of microbial ecology.

Committee:

Oleg Paliy, Ph.D. (Committee Chair); Gerald Alter, Ph.D. (Committee Member); Michael Raymer, Ph.D. (Committee Member); Nicholas Reo, Ph.D. (Committee Member)

Subjects:

Bioinformatics; Biomedical Research; Biostatistics; Microbiology

Keywords:

microbiology; biostatistics; bioinformatics; biomedical research

Gilder, Jason R.
Computational methods for the objective review of forensic DNA testing results
Doctor of Philosophy (PhD), Wright State University, 2007, Computer Science and Engineering PhD
Since the advent of criminal investigations, investigators have sought a "gold standard" for the evaluation of forensic evidence. Currently, deoxyribonucleic acid (DNA) technology is the most reliable method of identification. Short Tandem Repeat (STR) DNA genotyping has the potential for impressive match statistics, but the methodology is not infallible. The condition of an evidentiary sample and potential issues with the handling and testing of a sample can lead to significant issues with the interpretation of DNA testing results. Forensic DNA interpretation standards are determined by laboratory validation studies that often involve small sample sizes. This dissertation presents novel methodologies to address several open problems in forensic DNA analysis and demonstrates the improvement of the reported statistics over existing methodologies. Establishing a dynamically calculated RFU threshold specific to each analysis run improves the identification of signal from noise in DNA test data. Objectively identifying data consistent with degraded DNA sample input allows for a better understanding of the nature of an evidentiary sample and affects the potential for identifying allelic dropout (missing data). The interpretation of mixtures of two or more individuals has been problematic and new mathematical frameworks are presented to assist in that interpretation. Assessing the weight of a DNA database match (a cold hit) relies on statistics that assume that all individuals in a database are unrelated – this dissertation explores the statistical consequences of related individuals being present in the database. Finally, this dissertation presents a statistical basis for determining if a DNA database search resulting in a very similar but nonetheless non-matching DNA profile indicates that a close relative of the source of the DNA in the database is likely to be the source of an evidentiary sample.

Committee:

Travis Doom (Advisor)

Keywords:

Bioinformatics; Forensics; PCR-STR DNA; Limit of detection; Degraded DNA; Mixture interpretation; DNA databases; Familial searching

Moss, Tiffanie
CHARACTERIZATION OF STRUCTURAL VARIANTS AND ASSOCIATED MICRORNAS IN FLAX FIBER AND LINSEED GENOTYPES BY BIOINFORMATIC ANALYSIS AND HIGH-THROUGHPUT SEQUENCING
Doctor of Philosophy, Case Western Reserve University, 2012, Biology
The recent sequencing and assembly of the Bethune genome, a linseed type, and the sequencing of several of the genotrophs of Stormont cirrus, a fiber type, provided a platform for analysis of structural variation sites between flax fiber and linseed types which could be used to identify regions worthy of investigation for the improvement of seed and fiber traits in flax. The most well characterized site of structural variation in flax is that of LIS-1, an environmentally inducible and heritable 5.7Kb structural variation. Previous investigations suggest the LIS-1 structural variant is the result of a programmed DNA rearrangement. The only vaguely comparable system for controlled or programmed DNA rearrangement is that seen in the ciliate macronucleus. The scnRNA mechanism used by ciliates utilizes much of the same machinery as that of small RNAs and the action is similar to that of heterochromatin-associated siRNAs. By identifying small RNAs which map to other regions of structural variation in flax, the impact these regions could have on the microRNA regulated biological pathways, and association of other small RNAs mapping to these regions may be discerned and used to prioritize structural variants for further molecular investigation. Regions of the Bethune genome which may be associated with small RNAs were determined by computational prediction of microRNAs and mapping of small RNA reads from high throughput RNA sequencing. Over 25,000 miRNAs were predicted from the flax genome and Unigene database using the novoMIR plant miRNA prediction program. Of these, 649 discrete miRNA gene units were identified as having potential targets among the 30,649 flax Unigenes. RNAseq provided preliminary support for 349 of the predicted miRNAs derived from novoMIR and identified an additional 1.4 million unique reads to be included in further investigations. 
Sequence similarity to public miRNA databases suggests that the flax transcriptome utilizes most of the miRNAs conserved among angiosperms, which will likely have similar regulatory roles. Using Perl programming scripts, 44,106 PAV sites of at least 1Kb were identified between flax linseed and fiber types. 143 PAV sites were found to be associated with computationally predicted miRNAs or sRNAs identified by high-throughput sequencing.

Committee:

Christopher Cullis (Advisor); Stephen Haynesworth (Committee Member); Emmitt Jolly (Committee Member); Barbara Kuemerle (Committee Member); Saba Valadkhan (Committee Member)

Subjects:

Bioinformatics; Biology; Genetics; Molecular Biology; Plant Biology

Keywords:

microRNA; miRNA; flax; linseed; bioinformatics; computational prediction; PAV; structural variation; genotroph

MARKEY, MICHAEL PATRICK
TRANSCRIPTIONAL REGULATION BY THE RETINOBLASTOMA TUMOR SUPPRESSOR: NOVEL TARGETS AND MECHANISMS
PhD, University of Cincinnati, 2004, Medicine : Cell and Molecular Biology
The retinoblastoma tumor suppressor (RB) is a key regulator of the cell cycle. It is targeted for loss or functional inactivation in the majority of cancers. Through interaction with the E2F family of transcription factors, it regulates the expression of many genes involved in the transition from the G1 phase of the cell cycle to S phase. Beyond this, RB has been implicated in a variety of cellular processes, including differentiation, development, and progression through S and G2. Despite the importance of RB, the targets of RB-mediated transcriptional repression remain mostly speculative. Here we have undertaken a comprehensive genomic study to identify genes regulated by RB. Microarray analyses of cells expressing active RB revealed that RB represses a wide variety of genes involved in several cellular functions. Many of these RB targets are genes which are activated by the expression of various E2F family members. However, repression by RB and activation by E2F are not equal and opposite events. Genes may or may not be affected by both, and not always to the same extent. Besides known E2F targets, several novel genes were linked to the RB pathway. These include geminin, an important regulator of DNA licensing, which is discussed here in greater detail. Furthermore, conditional loss of RB from adult cells resulted in deregulation of many of the same genes repressed by active RB. Interestingly, a number of genes were also repressed when RB was lost. These fall into several classes, but include many genes involved in immune functioning. Taken together, these data represent an important step forward in our understanding of this vital tumor suppressor.

Committee:

Dr. Erik Knudsen (Advisor)

Keywords:

retinoblastoma; RB; cancer; cell cycle; bioinformatics; microarray

Kalluru, Vikram Gajanan
Identify Condition Specific Gene Co-expression Networks
Master of Science, The Ohio State University, 2012, Electrical and Computer Engineering
Since co-expressed genes often are co-regulated by a group of transcription factors, different conditions (e.g., disease versus normal) may lead to different transcription factor activities and therefore different co-expression relationships. A method for identifying condition specific co-expression networks by combining the recently developed network quasi-clique mining algorithm and the Expected Conditional F-statistic has been proposed. This method has been applied to compare the transcriptional programs between the non-basal and basal types of breast cancers. This work is a translational bioinformatics study integrating network analysis which lifts the traditional gene list based disease biomarker discovery to the gene and protein interaction level. This work presents a method for identifying condition specific gene co-expression networks. The method involves construction of a Weighted Graph Co-expression Network (WGCN) and mining the WGCNs to identify dense co-expression networks followed by a chi-square test based enrichment analysis for detecting condition specific co-expression relationship. The expression values in all the conditions for the genes constituting a condition specific co-expression network are visualized as heat maps which suggest that the genes are highly correlated in a specific condition but the correlations are disrupted in other conditions.
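
To make the idea of a condition-specific co-expression relationship concrete, here is a minimal Python sketch. It is a deliberate simplification and not the thesis's method: it replaces quasi-clique mining, the Expected Conditional F-statistic, and the chi-square enrichment test with plain thresholding of absolute Pearson correlation. The function names and the 0.8 threshold are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpression_edges(expr, threshold=0.8):
    """Weighted co-expression graph: keep gene pairs with |r| >= threshold.

    expr maps a gene name to its expression values across samples.
    """
    edges = {}
    for g1, g2 in combinations(sorted(expr), 2):
        weight = abs(pearson(expr[g1], expr[g2]))
        if weight >= threshold:
            edges[(g1, g2)] = weight
    return edges


def condition_specific(expr_a, expr_b, threshold=0.8):
    """Gene pairs co-expressed in condition A but not in condition B."""
    a = coexpression_edges(expr_a, threshold)
    b = coexpression_edges(expr_b, threshold)
    return sorted(set(a) - set(b))
```

With expression tables for two conditions (for example basal and non-basal samples), `condition_specific(basal, non_basal)` lists the gene pairs whose correlation is strong only in the first condition, which is the pattern the heat maps in the abstract are said to show.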

Committee:

Kun Huang, PhD (Advisor); Raghu Machiraju, PhD (Committee Member)

Subjects:

Bioinformatics; Computer Engineering; Computer Science; Engineering

Keywords:

gene coexpression; bioinformatics; differential gene coexpression networks; systems biology; basal breast cancer; non-basal breast cancer

Gowrisankar, Sivakumar
Predicting Functional Impact of Coding and Non-Coding Single Nucleotide Polymorphisms
PhD, University of Cincinnati, 2008, Engineering : Biomedical Engineering

Determining the functional impact of coding and non-coding single nucleotide polymorphisms (SNPs) is one of the primary challenges in establishing genotype-phenotype relations. The SNPs constitute more than 90% of the genetic variation and account for most trait differences among individuals and are one of the primary genotype data captured when studying the genetic basis of disease. The advent of efficient high-throughput DNA sequencers and GeneChips™ necessitates robust computational analysis pipelines to handle the genotype data more efficiently and facilitate seamless integration with clinical data. To address this, we have developed a bioinformatics-based comprehensive analysis pipeline which predicts the effect of coding and non-coding SNPs.

Based on the hypothesis that by integrating multiple coding SNP-impact predictions we can analyze and predict the SNP outcome better, we integrated three impact-prediction scores and one population-based score to obtain a SVM-based meta-prediction model. Through cross-validation studies, we demonstrate that our approach improves the SNP-effect prediction. For the first time, we have used the population-based minor allele frequency (MAF) as one of the features for SNP-effect prediction and prove that it significantly improves the performance of the prediction algorithm. We then extended this approach to predict the impact of non-coding promoter SNPs. Our results, through feature combinations and cross-validation, show that integrating multiple sequence-based features improves performance of the SNP-effect predictor. Also for the first time we demonstrate that the loss or gain of guanine in the SNP-overlapping putative transcription binding sites can be used as a measure of likelihood for an alteration in the native binding site, thereby increasing the odds of the SNP being functional.

Through various test cases, we demonstrate the utility of our algorithm. Using a specific test case of p53 binding sites, we also demonstrate a method for the enhancement of prediction based on the inclusion of experimental-based transactivation data for p53 response-elements (REs) that can enhance the ability to predict the impact of SNPs overlapping p53 REs. Taken together this provides a framework for demonstrating how prediction of TFBS functions may be enhanced in a high throughput fashion using assay screening data.

Committee:

Bruce J. Aronow, PhD (Committee Chair); Anil G. Jegga, DVM, MRes (Committee Member); Marepalli B. Rao, PhD (Other)

Subjects:

Bioinformatics

Keywords:

bioinformatics; SNP; polymorphisms; p53; functional SNPs; coding-SNPs; non-coding SNPs; p53 transactivation

Choi, Ickwon
Computational Modeling for Censored Time to Event Data Using Data Integration in Biomedical Research
Doctor of Philosophy, Case Western Reserve University, 2011, EECS - Computer and Information Sciences

Medical prognostic models are designed by clinicians to predict the future course or outcome of disease progression after diagnosis or treatment. The data, which are used when these clinical models are developed, are required to contain a high number of events per variable (EPV) for the resulting model to be reliable. If our objective is to optimize predictive performance by some criterion, we can often achieve a reduced model that has a little bias with low variance, but whose overall performance is improved. To accomplish this goal, we propose a new variable selection approach that combines Stepwise Tuning in the Maximum Concordance Index (STMC) and Forward Nested Subset Selection (FNSS) in two stages. In the first stage, the proposed variable selection is employed to identify the best subset of risk factors optimized with the concordance index using inner cross validation for optimism correction in the outer loop of cross validation, yielding potentially different final models for each of the folds. We then feed the intermediate results of the prior stage into another selection method in the second stage to resolve the overfitting problem and to select a final model from the variation of predictors in the selected models. Two case studies on relatively different sized survival data sets as well as a simulation study demonstrate that the proposed approach is able to select an improved and reduced average model under a sufficient sample and event size compared to other selection methods such as stepwise selection using the likelihood ratio test, Akaike Information Criterion (AIC), and least absolute shrinkage and selection operator (lasso). Finally, we achieve improved final models in each dataset as compared to full models according to most criteria. The results of the model selection methods and the final models were analyzed in a systematic scheme through validation for independent performance evaluation.
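
For readers unfamiliar with the concordance index that the first stage optimizes, the following is a minimal Python sketch of Harrell's C for right-censored data. The abstract does not say which variant the dissertation uses, so this standard pairwise definition (ties in predicted risk count one half) is an assumption, as are the function and argument names.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the risk score orders correctly.

    times  - observed follow-up times
    events - 1 (or True) if the event was observed, 0 if censored
    risks  - predicted risk scores; higher means earlier expected event

    A pair (i, j) is comparable when subject i has the shorter time and an
    observed event, so we know i truly failed before j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why a selection procedure can use it directly as the criterion to maximize within each cross-validation fold.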

For the second part of this dissertation, we build prognostic models that use clinicopathologic features and predict prognosis after a certain treatment. Most of the recent research efforts have focused on high dimensional genomic data with a small sample. Since clinically similar but molecularly heterogeneous tumors may produce different clinical outcomes, the combination of clinical and genomic information is crucial to improve the quality of prognostic prediction. However, there is a lack of a scheme for integrating both into a clinico-genomic model, particularly a parsimonious one, due to the large number of variables and small sample size. We propose a methodology to build a reduced yet accurate integrative model using a hybrid approach based on the Cox regression model, which uses several dimension reduction techniques, L2 penalized maximum likelihood estimation (PMLE), and resampling methods to tackle the problems above. The predictive accuracy of the modeling approach is assessed by several metrics via an independent and thorough scheme to compare competing methods. In breast cancer data studies for metastasis and mortality outcome, in a DLBCL data study, and in simulation studies, we demonstrate that the proposed methodology can improve prediction accuracy and build a final model with a hybrid signature that is parsimonious when integrating both types of variables. The selected clinical factors and genomic biomarkers are found to be highly relevant to the biological processes and can be considered as potential biomarkers for cancer prognosis and therapy. Furthermore, selected but unidentified genes are open to thorough investigation.

Committee:

Michael Kattan (Advisor); Mehmet Koyuturk (Committee Chair); Andy Podgurski (Committee Member); Soumya Ray (Committee Member)

Subjects:

Computer Science

Keywords:

Statistical Machine Learning; Biomedical Informatics; Bioinformatics; Censored Time To Event Data; Clinico-genomic Model; Cox Proportional Hazards Model; Microarray analysis

Sharpnack, Michael F
Integrative Genomics Methods for Personalized Treatment of Non-Small-Cell Lung Cancer
Doctor of Philosophy, The Ohio State University, 2018, Biomedical Sciences
Lung cancer is the most deadly form of cancer, responsible for over 1.6 million deaths annually, the majority of which are due to non-small cell lung cancer, of which adenocarcinoma and squamous cell carcinoma are the major subtypes. Standard chemotherapy produces responses in a small minority of patients, and despite the tremendous growth of personalized therapies in the last decade, only a minority of patients benefit from these treatments in the North American setting. A greater understanding of the biology of non-small cell lung cancer is desperately needed to develop novel targeted therapies and their accompanying biomarkers. Understanding the function of cancer-associated genes requires the integration and analysis of multiple modalities of biological data. Cancer associated genes can be activated or repressed by DNA somatic mutations, RNA alternative splicing, epigenetic changes, microRNA-mediated silencing, post-translational regulation, and other mechanisms. To understand how tumors form and grow, we have to be able to measure DNA, RNA, protein, metabolites, and lipids. Further, integrative and analytical methods are necessary to leverage these data together, collectively termed integrative genomics. Here, we leverage DNA mutations and copy number measurements, RNA transcriptomics, proteomics, and clinical data to discover regulatory relationships in tumors, develop prognostic biomarkers, and identify mediators of tumor mutation burden. First, we focus on the RNA editing protein ADAR, and propose an immune-mediated function in lung adenocarcinoma. Second, we develop a method to integrate RNA and protein expression data to predict binary clinical variables, and test its ability to predict tumor recurrence in surgically resected lung adenocarcinoma samples. Finally, we define the relationship between tumor mutation burden and genome stability protein inactivation to better understand tumor immunogenicity in non-small cell lung cancer. 
Taken together, these approaches present a comprehensive methodology to utilize integrative genomic data for clinical applications in non-small cell lung cancer.

Committee:

Kun Huang (Advisor); Jeffrey Parvin (Committee Member); David Carbone (Committee Member); Kai He (Committee Member)

Subjects:

Bioinformatics; Oncology

Keywords:

Bioinformatics; Cancer; Genomics; Integrative Genomics; Biology; Oncology

Sullivan, Courtney R
Bioenergetic Abnormalities in Schizophrenia
PhD, University of Cincinnati, 2018, Medicine: Neuroscience/Medical Science Scholars Interdisciplinary
Schizophrenia is a devastating illness that displays a wide range of psychotic symptoms, as well as cognitive deficits and profound negative symptoms that are often treatment resistant. Cognition is intimately related to synaptic function, which relies on the ability of cells to obtain adequate amounts of energy. Studies have shown that disrupting bioenergetic pathways affects working memory and other cognitive behaviors. Thus, investigating bioenergetic function in schizophrenia could provide important insights into treatments or prevention of cognitive disorders. Therefore, we characterized a major pathway supplying energy to neurons (lactate shuttle) in the dorsolateral prefrontal cortex (DLPFC) in schizophrenia. We found a significant decrease in the activity of two key glycolytic enzymes in schizophrenia (hexokinase and phosphofructokinase), suggesting a decrease in the capacity to generate bioenergetic intermediates through glycolysis in this illness. Notably, we did not detect protein changes in enzymes or transporters in this pathway in the DLPFC, suggesting the bioenergetic interplay of astrocytes and neurons in schizophrenia is highly complex and may not be fully appreciated at the region-level. Thus, we utilized a cell-level approach (laser capture microdissection) and found significant mRNA changes in glycolytic enzymes (hexokinase-1, phosphofructokinase-muscle, phosphofructokinase-liver, and glucose-6-phosphate isomerase), lactate transporters (monocarboxylate transporter 1), and glucose transporters (glucose transporter 1, GLUT1 and GLUT3) in pyramidal neurons in schizophrenia. We did not find any changes in astrocytes, suggesting neuron-specific deficits in glycolytic pathways in the DLPFC in schizophrenia. To build on these findings, we performed bioinformatic analyses to examine the implications of an altered bioenergetic profile in schizophrenia. We first sought to replicate our findings in additional cohorts of schizophrenia and control subjects. 
We probed 2 independent transcriptomic datasets for our metabolic targets. Supporting our hypothesis, we found several glycolytic targets to also be dysregulated in schizophrenia in these databases. We then used the Library of Integrated Network-Based Cellular Signatures (LINCS) database to generate transcriptional signatures containing differentially expressed genes associated with bioenergetic abnormalities in schizophrenia. Using these signatures, we performed enrichment analyses to examine biological significance and found hits for cell metabolism, proliferation, and immunity/inflammation pathways. Furthermore, we compared our disease signatures to a library of “drug activity transcriptional signatures” to identify possible perturbagens with the ability to “reverse” the disease signature. Top perturbagens included peroxisome proliferator-activated receptor (PPAR) agonists, capable of bolstering metabolic pathways and possibly reversing cognitive deficits. To further elucidate the role of bioenergetics in cognitive dysfunction, we utilized the GluN1 knockdown (KD) model of schizophrenia. This model exhibits several endophenotypes of schizophrenia including impairments in executive function. With the goal of reversing these deficits, we selected a top perturbagen from our drug discovery bioinformatic analysis with the hypothesis that this drug intervention may help restore schizophrenia endophenotypes in this model. We investigated the effects of treatment with pioglitazone, a ligand for PPARγ, in the GluN1 KD model and found that pioglitazone helped restore explicit memory. This suggests pioglitazone may improve specific subtypes of cognition. This work has important implications for the treatment of cognitive illnesses with bioenergetic deficits such as schizophrenia.

Committee:

Mark Baccei, Ph.D. (Committee Chair); Temugin Berta (Committee Member); Michael Lieberman, Ph.D. (Committee Member); Robert McCullumsmith, M.D. (Committee Member); Robert McNamara, Ph.D. (Committee Member)

Subjects:

Neurology

Keywords:

schizophrenia; bioenergetics; glycolysis; GluN1 knockdown mouse; LINCS bioinformatics; pioglitazone

Wiredja, Danica
Phosphoproteomic Characterization of Systems-Wide Differential Signaling Induced by Small Molecule PP2A Activation
Doctor of Philosophy, Case Western Reserve University, Systems Biology and Bioinformatics
Protein phosphatase 2A (PP2A) is a serine/threonine phosphatase that downregulates numerous biological processes implicated in anti-apoptotic and pro-proliferative signaling. Considering that it is frequently inhibited in cancers, PP2A is believed to be a tumor suppressor whose reactivation can potentially suppress oncogenic pathways and drive an anti-cancer response. We have developed and tested a novel series of small molecule activators of PP2A (SMAPs) that not only bind to and activate this phosphatase but also inhibit cell growth and promote apoptosis in non-small cell lung cancer (NSCLC) models. Given PP2A’s breadth of cellular targets, comprehensive characterization of the net signaling changes induced by this treatment requires a systems-level approach. Mass spectrometry-based phosphoproteomics is a high-throughput system capable of quantifying thousands of phosphopeptides in a single experiment. Consequently, this dissertation project aims to explore the global differential signaling fluxes induced by phosphatase activation using this platform, coupled to downstream bioinformatics analyses. To accomplish this, I first generated a new computational tool that performs kinase-level activity inferences from phosphoproteomics data. The resulting KSEA App was then applied, in conjunction with other pathway and network-level bioinformatic methods, to estimate relative changes in signaling pathway regulation and identify the major targets affected by SMAP treatment. Finally, proteome-wide analysis of the time-dependent phosphorylation changes with the PP2A activator versus dual kinase inhibitors revealed some striking differences in the internal rewiring between phosphatase activation and kinase inhibition. These findings, in turn, guided us to design a new drug combination featuring SMAP + AZD-6244 (a MEK inhibitor), which displayed enhanced anti-tumor activity in vivo. 
Altogether, this project features a unique bioinformatics workflow for phosphoproteomics data analysis that helped uncover the global signaling implications of phosphatase activation.

Committee:

Mark Chance (Advisor); Goutham Narla (Advisor)

Subjects:

Bioinformatics; Cellular Biology

Keywords:

phosphoproteomics; phosphatase; bioinformatics
