Files
PremBhusalThesis.pdf (2.13 MB)
Scalable Clustering for Immune Repertoire Sequence Analysis
Author Info
Bhusal, Prem
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=wright1558631347622374
Year and Degree
2019, Master of Science (MS), Wright State University, Computer Science.
Abstract
The development of next-generation sequencing technology has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecular level. Large sequence datasets (e.g., millions of sequences) are being collected to comprehensively understand how the immune system of a patient evolves over different stages of disease development. A recent study has shown that the hierarchical clustering (HC) algorithm gives the best results for B-cell clone analysis, an important type of immune repertoire sequencing (IR-Seq) analysis. However, due to its inherent complexity, the classical hierarchical clustering algorithm does not scale well to large sequence datasets. Surprisingly, no algorithms have been developed to address this scalability issue for immunology research. In this thesis, we study two different strategies, aiming to find the best scalable methods that can preserve the quality of the hierarchical clustering structure. The two strategies are (1) non-Euclidean indexing methods for speeding up classical hierarchical clustering (HC), and (2) a new tree-based sequence summarization approach, SCT, that scans the large sequence dataset once and generates summaries for hierarchical clustering (HC). For comparative analysis, we also experimented with the Spark-based minimum-spanning-tree algorithm (SparkMST), which generates a result equivalent to single-linkage hierarchical clustering (SLINK).

We have implemented all these algorithms and experimented with real sequence datasets for B-cell clone analysis. The results show that (1) the indexing-enhanced HC (e.g., using the Vantage-Point tree for indexing) preserves the clustering quality very well, while also significantly reducing the time complexity of the original HC; (2) SCT with HC is the fastest approximate HC method, with slightly sacrificed quality; and (3) SparkMST scales out satisfactorily and gives a significant performance gain with a large Spark cluster.
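The equivalence between a minimum spanning tree and single-linkage clustering mentioned in the abstract can be illustrated on a single machine. The sketch below is not the thesis's SparkMST implementation. It assumes Levenshtein edit distance as the sequence distance (the abstract does not name the metric), and the function names `edit_distance`, `minimum_spanning_tree`, and `single_linkage_clusters` are hypothetical. It builds an MST with Prim's algorithm and then drops every MST edge longer than a threshold; the remaining connected components are the flat clusters that a single-linkage dendrogram yields when cut at the same threshold.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def minimum_spanning_tree(seqs):
    """Prim's algorithm on the complete distance graph.

    Returns a list of (distance, i, j) edges, n - 1 of them for n sequences.
    """
    n = len(seqs)
    if n == 0:
        return []
    # best[j] = (cheapest known distance from the tree to j, tree endpoint)
    best = {j: (edit_distance(seqs[0], seqs[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        d, i = best.pop(j)
        edges.append((d, i, j))
        # Relax distances of the remaining nodes against the new tree node.
        for k in best:
            dk = edit_distance(seqs[j], seqs[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    return edges


def single_linkage_clusters(seqs, threshold):
    """Keep MST edges with distance <= threshold; components are clusters."""
    parent = list(range(len(seqs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for d, i, j in minimum_spanning_tree(seqs):
        if d <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for idx, seq in enumerate(seqs):
        groups.setdefault(find(idx), []).append(seq)
    return sorted(sorted(g) for g in groups.values())
```

Calling `single_linkage_clusters(seqs, 1)` on a list of short sequences groups together those connected by chains of single-character edits. This version computes on the order of n² distances, which is the cost that the indexing and distributed approaches described in the abstract aim to reduce.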
Committee
Keke Chen, Ph.D. (Advisor)
Krishnaprasad Thirunarayan, Ph.D. (Committee Member)
Tanvi Banerjee, Ph.D. (Committee Member)
Pages
50 p.
Subject Headings
Computer Science
Keywords
Clustering; Immune-Repertoire; Sequence; Hierarchical Clustering
Recommended Citations
Bhusal, P. (2019). Scalable Clustering for Immune Repertoire Sequence Analysis [Master's thesis, Wright State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=wright1558631347622374
APA Style (7th edition)
Bhusal, Prem. Scalable Clustering for Immune Repertoire Sequence Analysis. 2019. Wright State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=wright1558631347622374.
MLA Style (8th edition)
Bhusal, Prem. "Scalable Clustering for Immune Repertoire Sequence Analysis." Master's thesis, Wright State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=wright1558631347622374
Chicago Manual of Style (17th edition)
Document number:
wright1558631347622374
Download Count:
324
Copyright Info
© 2019, all rights reserved.
This open access ETD is published by Wright State University and OhioLINK.