Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Hydra: A System for Large Multi-Model Deep Learning

Kabir Nagrecha, Arun Kumar

2021-10-16 · Transfer Learning · Model Selection · Scheduling · Deep Learning · Language Modelling
Paper · PDF · Code (official)

Abstract

Scaling up model depth and size is now a common approach to raising accuracy in many deep learning (DL) applications, as evidenced by the widespread success of multi-billion- or even trillion-parameter models in natural language processing (NLP) research. Despite this success in DL research and at major technology companies, broader practical adoption of such large models among domain scientists and businesses is still bottlenecked by GPU memory limits, high training costs, and low GPU availability, even on public clouds. Model selection needs further compound these resource challenges: users often need to compare dozens of models with different hyper-parameters or neural architectures to suit their specific task and dataset. In this paper, we present Hydra, a system designed to tackle such challenges by enabling out-of-the-box scaling for multi-large-model DL workloads, even on commodity GPUs, in a resource-efficient manner. Hydra is the first approach to holistically optimize the execution of multi-model workloads for large DL models. We do this by adapting prior "model-parallel" execution schemes to work with scalable parameter offloading across the memory hierarchy and further hybridizing this approach with task-parallel job scheduling techniques. Hydra decouples the scalability of model parameters from the parallelism of execution, thus enabling DL users to train even a 6-billion-parameter model on a single commodity GPU. It also fully exploits the speedup potential of task parallelism in multi-GPU setups, yielding near-linear strong scaling and making rigorous model selection perhaps more practical for such models. We evaluate end-to-end performance by fine-tuning GPT-2 for language modeling. We find that Hydra offers between 50% and 100% higher training throughput than even the best settings of state-of-the-art industrial frameworks such as DeepSpeed and GPipe for multi-large-model training.
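
The abstract compresses two mechanisms into one sentence: spilled "model-parallel" execution, where only the active shard of a large model is resident in GPU memory at any moment, and task-parallel scheduling of the many independent jobs that make up a model-selection workload. The PyTorch sketch below is a minimal illustration of how those two ideas compose; it is not Hydra's actual implementation, and all names here (run_offloaded_forward, run_jobs_task_parallel, the toy shard sizes) are hypothetical.

```python
# Minimal sketch (illustrative only, not Hydra's API) of parameter
# offloading combined with task-parallel job scheduling.
import queue
import threading
import torch
import torch.nn as nn

def run_offloaded_forward(shards, x, device):
    """Spilled model-parallel forward pass: only one shard's parameters
    reside on the GPU at a time, so the full model can exceed GPU memory."""
    x = x.to(device)
    for shard in shards:
        shard.to(device)   # promote this shard's parameters: host -> GPU
        x = shard(x)
        shard.to("cpu")    # demote them again to free GPU memory
    return x

def run_jobs_task_parallel(jobs, devices):
    """Task-parallel scheduling: independent model-selection jobs (e.g.,
    different hyper-parameter configs) are pulled from a shared queue,
    one worker per device, which is what yields near-linear scaling
    across GPUs for multi-model workloads."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)

    def worker(device):
        while True:
            try:
                shards, batch = q.get_nowait()
            except queue.Empty:
                return
            run_offloaded_forward(shards, batch, device)

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # A toy "large" model split into shards; each job is (shards, batch).
    jobs = [
        ([nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(4)],
         torch.randn(8, 256))
        for _ in range(2)
    ]
    run_jobs_task_parallel(jobs, [device])
```

The key design choice the sketch mirrors is the decoupling the abstract describes: how much of the model fits on a device is handled inside the forward pass (offloading), while how many jobs run at once is handled entirely by the scheduler, so neither concern constrains the other.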

Results

Task                Dataset     Metric                 Value   Model
Language Modelling  WikiText-2  Test perplexity        15.17   GPT-2 (fine-tuned)
Language Modelling  WikiText-2  Validation perplexity  15.69   GPT-2 (fine-tuned)
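
For context on the metric above: perplexity is the exponential of the mean per-token cross-entropy loss (in nats), so the reported values map directly back to a language-modeling loss. A minimal computation, assuming a loss measured in nats:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean per-token cross-entropy, in nats)."""
    return math.exp(mean_nll)

# A mean test loss of ~2.719 nats corresponds to the reported 15.17.
print(perplexity(2.719))  # ~15.17
```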

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC (2025-07-18)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services (2025-07-17)
Transient-Stability-Aware Frequency Provision in IBR-Rich Grids via Information Gap Decision Theory and Deep Learning (2025-07-17)
Leveraging Asynchronous Cross-border Market Data for Improved Day-Ahead Electricity Price Forecasting in European Markets (2025-07-17)