Lobster: Load Balance-Aware I/O for Distributed DNN Training
| Content Provider | Hyper Articles en Ligne (HAL) |
|---|---|
| Author | Liu, Jie; Nicolae, Bogdan; Li, Dong |
| Copyright Year | 2022 |
| Abstract | The resource-hungry and time-consuming process of training Deep Neural Networks (DNNs) can be accelerated by optimizing and/or scaling computations on accelerators such as GPUs. However, the loading and pre-processing of training samples then often emerges as a new bottleneck. This data loading process engages a complex pipeline that extends from the sampling of training data on external storage to delivery of those data to GPUs, and that comprises not only expensive I/O operations but also decoding, shuffling, batching, augmentation, and other operations. We propose in this paper a new holistic approach to data loading that addresses three challenges not sufficiently addressed by other methods: I/O load imbalances among the GPUs on a node; rigid resource allocations to data loading and data preprocessing steps, which lead to idle resources and bottlenecks; and limited efficiency of caching strategies based on pre-fetching due to eviction of training samples needed soon at the expense of those needed later. We first present a study of key bottlenecks observed as training samples flow through the data loading and preprocessing pipeline. Then, we describe Lobster, a data loading runtime that uses performance modeling and advanced heuristics to combine flexible thread management with optimized eviction for distributed caching in order to mitigate I/O overheads and load imbalances. Experiments with a range of models and datasets show that the Lobster approach reduces both I/O overheads and end-to-end training times by up to 1.5× compared with state-of-the-art approaches. |
| Related Links | https://hal.science/hal-03718681/file/icpp22-81.pdf |
| Conference Proceedings | ICPP '22: The 51st International Conference on Parallel Processing |
| DOI | 10.1145/3545008.3545090 |
| Language | English |
| Publisher | HAL CCSD |
| Access Restriction | Open |
| Subject Keyword | Distributed, Parallel, and Cluster Computing [cs.DC]; Computer Science [cs] |
| Content Type | Text |
| Resource Type | Conference Proceedings |
| Subject | Medicine |