Files
Arpan_Phd_Thesis.pdf (12.68 MB)
Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems
Author Info
Jain, Arpan
ORCID® Identifier
http://orcid.org/0000-0003-2522-8522
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1672752270919153
Abstract Details
Year and Degree
2023, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Abstract
Deep learning has achieved state-of-the-art performance in several artificial intelligence tasks such as object recognition, speech recognition, machine translation, and summarization. Deep learning is a subset of machine learning that learns multiple levels of data representation using Neural Networks (NNs), and its rise can be attributed to the availability of large datasets and computational power. Large-scale Deep Neural Networks (DNNs) can deliver state-of-the-art performance by learning complex relationships, enabling them to push the boundaries of artificial intelligence. However, training such large-scale DNNs is a compute-intensive task: these models can have billions of parameters, which increases both the memory and computational requirements of DNN training. Hence, distributed DNN training has become the default approach for training large-scale DNNs like AmoebaNet, GPT3, and T5.

Broadly, the DNN training pipeline can be divided into three phases: 1) Data Loading and Data Augmentation, 2) Forward/Backward Pass, and 3) Model Validation. Traditionally, these phases are executed sequentially on a single CPU or GPU due to a lack of additional resources. Multiple processing elements can be used to parallelize the computation in each phase and reduce the overall training time. In this dissertation, we propose novel parallelization strategies for distributed DNN training that alleviate bottlenecks in the different phases of the pipeline and parallelize the computation across multiple processing elements. Naive parallelization strategies may not yield performance benefits because of high communication overhead; therefore, we need strategies designed to distribute the work efficiently while keeping the communication overhead low.

There are several challenges in the existing DNN training pipeline. Data loading/augmentation and model validation can account for up to 20% of the overall training time, making training large-scale DNNs time-consuming. Therefore, we propose a new parallelization scheme that uses the computing power of NVIDIA's recently released Data Processing Units (DPUs) to offload the data loading and model validation phases and accelerate the performance of Data Parallelism.

The forward and backward passes remain the most compute-intensive phase in the DNN training pipeline. Increasing the number of layers and parameters in DNNs to achieve better accuracy has become a common approach in deep learning, and in the last couple of years several DNNs such as AmoebaNet, T5, and GPT3 have pushed the boundaries of parameter and layer counts. However, computation and memory requirements also grow with the number of layers and parameters, so these models cannot be trained on a single processing element. Broadly, large-scale DNNs fall into two categories: 1) in-core models (DNNs that fit inside the memory of a single processing element) and 2) out-of-core models (DNNs that are too large to fit inside the memory of a single processing element). Technically, an in-core model can be trained on a single processing element, but the training time would be prohibitively high, making such training impractical.
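To make the data-parallel pattern described above concrete, here is a minimal, hypothetical sketch in Python using NumPy and mpi4py (MPI and Data Parallelism appear in the keywords below): each MPI rank works on its own shard of data, computes local gradients on a toy quadratic loss, and averages them with an Allreduce before an identical weight update. The model, shapes, and learning rate are illustrative assumptions, not the dissertation's actual implementation or its DPU-offload scheme.

    # Hypothetical data-parallel training sketch: every rank holds a full copy
    # of a toy model and works on its own shard of the data; gradients are
    # averaged across ranks with MPI Allreduce before each update.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    rng = np.random.default_rng(seed=rank)   # per-rank data shard
    weights = np.full(1024, 0.01)            # toy model, identical on all ranks

    for step in range(10):
        # Phase 1: data loading/augmentation (stand-in: random mini-batch).
        local_batch = rng.standard_normal((32, 1024))

        # Phase 2: forward/backward pass (stand-in: gradient of 0.5 * mean((x.w)^2)).
        local_grad = (local_batch * (local_batch @ weights)[:, None]).mean(axis=0)

        # Gradient averaging across ranks: the core communication step of
        # data parallelism.
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
        global_grad /= size

        # Identical update on every rank keeps the model replicas in sync.
        weights -= 0.01 * global_grad

Run under an MPI launcher, e.g. mpirun -np 4 python data_parallel_sketch.py (a hypothetical filename); the Allreduce is the communication overhead that grows with scale and that parallelization strategies aim to keep low.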
For such large-scale models, we need novel parallelization strategies that accelerate training and exploit the inherent parallelism of the DNN architecture. Because of the limited memory of modern accelerators such as GPUs, several large-scale DNNs cannot be trained on a single processing element; hence, we need parallelization strategies that distribute the layers/neurons among multiple processing elements to reduce the memory requirement on each processing element. In this dissertation, we propose several novel parallelization strategies to alleviate current bottlenecks in the different phases of the DNN training pipeline and reduce the overall training time. The key idea is to develop custom parallelization strategies for each type of DNN architecture so that they exploit its inherent parallelism and computation pattern to reduce training time, while remaining generic and applicable to a large number of deep learning models. We have developed several novel parallelization strategies, such as Data Sub-Graph Parallelism, Bi-Directional Parallelism, and Hybrid Five-Dimensional Parallelism, that accelerate DNN training for in-core, out-of-core model, and out-of-core layer DNNs, respectively. The developed strategies have been evaluated on large-scale GPU systems and made available as public software releases or published papers.
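For the out-of-core case, a generic sketch of layer-wise model parallelism in PyTorch is shown below: the layers of a toy network are split across two devices so that neither device holds the full model, and activations cross the device boundary during the forward pass (autograd routes gradients back across it during the backward pass). The two-way split, layer sizes, and device choices are assumptions for illustration only; this is not the dissertation's Data Sub-Graph, Bi-Directional, or Hybrid Five-Dimensional Parallelism design.

    # Generic layer-wise model parallelism sketch: each partition of the
    # network lives on a different device, halving per-device memory needs.
    import torch
    import torch.nn as nn

    # Fall back to CPU if two GPUs are not available (illustration only).
    dev0 = torch.device("cuda:0" if torch.cuda.device_count() > 1 else "cpu")
    dev1 = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

    class TwoDeviceMLP(nn.Module):
        def __init__(self):
            super().__init__()
            # First half of the layers lives on device 0 ...
            self.part0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to(dev0)
            # ... and the second half on device 1.
            self.part1 = nn.Sequential(nn.Linear(4096, 10)).to(dev1)

        def forward(self, x):
            x = self.part0(x.to(dev0))
            # Activations cross the device boundary between partitions.
            return self.part1(x.to(dev1))

    model = TwoDeviceMLP()
    out = model(torch.randn(8, 1024))
    out.sum().backward()   # autograd sends gradients back across devices

The choice of split point controls both the per-device memory footprint and the volume of activations that must be communicated between devices at every step.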
Committee
Dhabaleswar Panda (Advisor)
Raghu Machiraju (Committee Member)
Aamir Shafi (Committee Member)
Hari Subramoni (Committee Member)
Rajiv Ramnath (Committee Member)
Pages
243 p.
Subject Headings
Computer Science
Keywords
Deep Learning, HPC, Distributed DNN Training, Model Parallelism, Data Parallelism, Hybrid Parallelism, MPI
Recommended Citations
APA Style (7th edition)
Jain, A. (2023). Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1672752270919153

MLA Style (8th edition)
Jain, Arpan. Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems. 2023. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1672752270919153.

Chicago Manual of Style (17th edition)
Jain, Arpan. "Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems." Doctoral dissertation, Ohio State University, 2023. http://rave.ohiolink.edu/etdc/view?acc_num=osu1672752270919153
Document number:
osu1672752270919153
Download Count:
372
Copyright Info
© 2023, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.