Analyzing TCGA Genomic and Expression Data Using SVM with Embedded Parameter Tuning

Zhao, Haitao

ZhaoH.the (final comments 1).pdf (1.41 MB)

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=akron1415629295

Year and Degree

2014, Master of Science, University of Akron, Computer Science.

Abstract

High-throughput next-generation sequencing has revolutionized genomic sequencing. It allows thousands of genes, and even the entire exome of a given organism, to be studied simultaneously. Together with other high-throughput technologies such as DNA microarrays, it has broadened the applications of genomic sequencing and changed biomedical research in a profound way. Compared with microarray data, the big data generated by next-generation sequencing are considerably more reliable. As such, the technique has rapidly emerged as a major tool for obtaining gene mutation and expression profiles of human cancers. The availability of these big genomic data presents unique scientific challenges and opportunities. One such challenge is to understand and characterize the patterns of genomic mutation and gene expression in the different cancer types represented in the datasets. Many data mining approaches have already been developed to analyze such large datasets for feature selection and sample classification. Because mutation and gene expression profiles are noisy owing to both biological and technical variation, the effectiveness and robustness of a machine-learning-based classification system depend significantly on the nature of the input data. In this study, we explore DNA mutation and gene expression patterns in lung cancer using support vector machines with embedded parameter tuning. The two datasets used are derived from the somatic mutation data and RNA-seq gene expression profiles available in TCGA (The Cancer Genome Atlas). The embedded parameter tuning is based on mining the training dataset using validation techniques and the concept of committee voting. We show that support vector machines with tuning are significantly more robust and more accurate in classification than regular support vector machines.
The approach was applied to the two datasets to explore the mutation patterns that distinguish smokers from non-smokers in lung adenocarcinoma, and the expression patterns that distinguish lung adenocarcinoma from lung squamous cell carcinoma, two subtypes of lung cancer. Our results reveal no obvious mutation patterns between smokers and non-smokers. Since the TCGA lung adenocarcinoma dataset contains only a few samples from non-smokers, it is possible that the results are attributable to both the nature of lung cancer and the unbalanced representation of the two classes under consideration. On the other hand, the gene expression patterns show a pronounced difference between the two subtypes of lung cancer, validating the conclusion that cancer tissues of different subtypes are differentiable at the expression level. We conclude that support vector machines with embedded parameter tuning are an effective tool for analyzing RNA-seq gene expression data.
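
The abstract does not spell out the tuning procedure, so the following is only a minimal sketch of one plausible reading of "validation techniques" combined with "committee voting", not the thesis's actual algorithm. It assumes a linear soft-margin SVM trained by stochastic gradient descent, a hypothetical grid of regularization strengths `lam`, 5-fold validation on the training set, and a committee of the three best-scoring settings that classify by majority vote. All function names are illustrative, and the data are synthetic Gaussian clusters standing in for real TCGA profiles.

```python
import random

def train_linear_svm(X, y, lam, epochs=30, lr=0.01, seed=0):
    """SGD on the L2-regularized hinge loss (a linear soft-margin SVM)."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]  # shrink: L2 regularization
            if margin < 1.0:                          # hinge-loss subgradient step
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def cv_accuracy(X, y, lam, k=5):
    """Mean validation accuracy of one lam over k interleaved folds."""
    scores = []
    for f in range(k):
        tr = [i for i in range(len(X)) if i % k != f]
        va = [i for i in range(len(X)) if i % k == f]
        model = train_linear_svm([X[i] for i in tr], [y[i] for i in tr], lam)
        scores.append(accuracy(model, [X[i] for i in va], [y[i] for i in va]))
    return sum(scores) / k

def tuned_committee(X, y, grid, size=3):
    """Rank the grid by validation accuracy, then retrain the best `size` values."""
    ranked = sorted(grid, key=lambda lam: cv_accuracy(X, y, lam), reverse=True)
    return [train_linear_svm(X, y, lam) for lam in ranked[:size]]

def committee_predict(committee, x):
    """Majority vote over the committee (odd size, so no ties)."""
    votes = sum(predict(m, x) for m in committee)
    return 1 if votes >= 0 else -1

def make_data(n, rng):
    """Synthetic stand-in for real profiles: two Gaussian clusters, labels +1/-1."""
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        X.append([rng.gauss(2.0 * label, 1.0) for _ in range(2)])
        y.append(label)
    return X, y

rng = random.Random(42)
X_train, y_train = make_data(80, rng)
X_test, y_test = make_data(40, rng)
committee = tuned_committee(X_train, y_train, grid=[0.001, 0.01, 0.1, 1.0, 10.0])
test_acc = sum(committee_predict(committee, x) == t
               for x, t in zip(X_test, y_test)) / len(y_test)
print(f"committee of {len(committee)} models, test accuracy {test_acc:.2f}")
```

Because the parameter search uses only the training samples, the held-out test set stays untouched until the final evaluation. Voting across several well-scoring settings, rather than trusting a single best one, is one way to reduce sensitivity to noise in the validation estimate.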

Committee

Zhong-Hui Duan, Dr. (Advisor)
Yingcai Xiao, Dr. (Committee Member)
En Cheng, Dr. (Committee Member)

Pages

85 p.

Subject Headings

Bioinformatics; Computer Science

Keywords

SVM; TCGA; Gene Expression; RNA-seq; Classification

Recommended Citations

APA Style (7th edition)
Zhao, H. (2014). Analyzing TCGA Genomic and Expression Data Using SVM with Embedded Parameter Tuning [Master's thesis, University of Akron]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=akron1415629295

MLA Style (8th edition)
Zhao, Haitao. Analyzing TCGA Genomic and Expression Data Using SVM with Embedded Parameter Tuning. 2014. University of Akron, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=akron1415629295.

Chicago Manual of Style (17th edition)
Zhao, Haitao. "Analyzing TCGA Genomic and Expression Data Using SVM with Embedded Parameter Tuning." Master's thesis, University of Akron, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=akron1415629295

Document number:

akron1415629295

Download Count:

1,342
