Analyzing TCGA Genomic and Expression Data Using SVM with Embedded Parameter Tuning

2014, Master of Science, University of Akron, Computer Science.
High-throughput next-generation sequencing has revolutionized genomic sequencing. It allows thousands of genes, and even the entire exome of an organism, to be studied simultaneously. Together with other high-throughput technologies such as DNA microarrays, it has broadened the applications of genomic sequencing and profoundly changed biomedical research. Compared with microarray data, the data generated by next-generation sequencing are considerably more reliable. As a result, the technique has rapidly emerged as a major tool for obtaining gene mutation and expression profiles of human cancers. The availability of these large genomic datasets presents unique scientific challenges and opportunities. One such challenge is to understand and characterize the patterns of genomic mutation and gene expression in the different cancer types represented in the datasets. Many data mining approaches have already been developed to analyze large datasets for feature selection and sample classification. Because mutation and gene expression profiles are noisy owing to both biological and technical variation, the effectiveness and robustness of a machine-learning-based classification system depend significantly on the nature of the input data. In this study, we explore DNA mutation and gene expression patterns in lung cancer using support vector machines with embedded parameter tuning. The two datasets used are derived from the somatic mutation data and RNA-seq gene expression profiles available in TCGA (The Cancer Genome Atlas). The embedded parameter tuning mines the training dataset using validation techniques and a committee voting approach. We show that support vector machines with tuning significantly improve robustness and classification accuracy compared with regular support vector machines.
The approach was applied to the two datasets to explore mutation patterns in lung adenocarcinoma between smokers and non-smokers, as well as expression patterns between lung adenocarcinoma and lung squamous cell carcinoma, two subtypes of lung cancer. Our results reveal no obvious mutation patterns that distinguish smokers from non-smokers. Because the TCGA lung adenocarcinoma dataset contains only a few samples from non-smokers, it is possible that the results are attributable both to the nature of lung cancer and to the unbalanced representation of the two classes under consideration. In contrast, the gene expression patterns show a pronounced difference between the two subtypes of lung cancer, supporting the conclusion that cancer tissues of different subtypes are distinguishable at the expression level. We conclude that support vector machines with embedded parameter tuning are an effective tool for analyzing RNA-seq gene expression data.
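
The abstract does not spell out how the embedded parameter tuning works, so the following is only an illustrative sketch of one way validation-based tuning and committee voting could be combined. Everything specific here is an assumption rather than the thesis's method: the linear SVM trained with a Pegasos-style subgradient update, the candidate values of the regularization parameter `lam`, the five-fold split, the majority vote, and the synthetic two-cluster data standing in for TCGA features. All function names are hypothetical.

```python
import random

def train_linear_svm(X, y, lam, epochs=30, seed=0):
    """Pegasos-style subgradient descent for a linear SVM; labels are +1/-1."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)          # last weight acts as the bias
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)        # decaying step size
            xi = X[i] + [1.0]            # constant feature for the bias
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1.0 - eta * lam) * wj for wj in w]   # regularization shrink
            if margin < 1.0:             # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, x):
    score = sum(wj * xj for wj, xj in zip(w, x + [1.0]))
    return 1 if score >= 0 else -1

def accuracy(w, X, y):
    return sum(predict(w, x) == yi for x, yi in zip(X, y)) / len(y)

def tune_committee(X, y, lambdas=(1.0, 0.1, 0.01), k=5, seed=0):
    """Assumed scheme: for each of k folds, keep the model whose lambda gives
    the best accuracy on the held-out fold; the k winners form a committee."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[f::k] for f in range(k)]
    committee = []
    for f in range(k):
        held_out = set(folds[f])
        train_idx = [i for i in order if i not in held_out]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_va, y_va = [X[i] for i in folds[f]], [y[i] for i in folds[f]]
        models = [train_linear_svm(X_tr, y_tr, lam) for lam in lambdas]
        committee.append(max(models, key=lambda w: accuracy(w, X_va, y_va)))
    return committee

def committee_predict(committee, x):
    """Majority vote over the committee (odd k avoids ties)."""
    votes = sum(predict(w, x) for w in committee)
    return 1 if votes >= 0 else -1

def make_blobs(n, seed):
    """Synthetic stand-in data: two Gaussian clusters centered at (2, 2) and (-2, -2)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        X.append([rng.gauss(2.0 * label, 0.5), rng.gauss(2.0 * label, 0.5)])
        y.append(label)
    return X, y

X_train, y_train = make_blobs(60, seed=1)
X_test, y_test = make_blobs(40, seed=2)
committee = tune_committee(X_train, y_train)
test_acc = sum(committee_predict(committee, x) == t
               for x, t in zip(X_test, y_test)) / len(y_test)
print(f"committee size: {len(committee)}, test accuracy: {test_acc:.2f}")
```

In this sketch the parameter choice never touches the test samples: each committee member is selected on a fold held out from the training data only, which is one plausible reading of "embedded" tuning. A kernel SVM or a different voting rule could be substituted without changing the overall structure.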
Zhong-Hui Duan, Dr. (Advisor)
Yingcai Xiao, Dr. (Committee Member)
En Cheng, Dr. (Committee Member)
85 p.

Recommended Citations

  • Zhao, H. (2014). Analyzing TCGA Genomic and Expression Data Using SVM with Embedded Parameter Tuning [Master's thesis, University of Akron]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=akron1415629295

    APA Style (7th edition)

  • Zhao, Haitao. Analyzing TCGA Genomic and Expression Data Using SVM with Embedded Parameter Tuning. 2014. University of Akron, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=akron1415629295.

    MLA Style (8th edition)

  • Zhao, Haitao. "Analyzing TCGA Genomic and Expression Data Using SVM with Embedded Parameter Tuning." Master's thesis, University of Akron, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=akron1415629295

    Chicago Manual of Style (17th edition)