Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam

2023-05-10, NeurIPS 2023

Tasks: Image Classification, Video-Text Retrieval, Text Retrieval, Zero-Shot Environment Sound Classification, Zero-Shot Action Recognition, Zero-Shot Transfer Image Classification, Video Classification, Classification, Zero-Shot Learning


Abstract

We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers, and greatly mitigates the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including video classification, image classification, and image-text and video-text retrieval. Most notably, we train a sparse IMP-MoE-L variant focusing on video tasks that achieves a new state of the art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving on the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of its total training computational cost.
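The first insight in the abstract, alternating gradient updates across tasks and loss functions on shared parameters, can be illustrated with a toy sketch. This is not the IMP implementation: the shared "model" here is a single weight vector rather than a Transformer encoder, the two tasks (a regression loss and a binary classification loss) are stand-ins for the paper's modalities and objectives, and every function and variable name below is hypothetical. The only point shown is the update pattern: each step draws one task, computes that task's gradient alone, and applies it to the shared parameters.

```python
import numpy as np

def mse_loss_and_grad(w, X, y):
    """Mean squared error and its gradient with respect to w."""
    residual = X @ w - y
    return float(np.mean(residual ** 2)), 2.0 * X.T @ residual / len(y)

def logistic_loss_and_grad(w, X, y):
    """Binary cross-entropy on logits X @ w and its gradient with respect to w."""
    z = X @ w
    loss = float(np.mean(np.logaddexp(0.0, z) - y * z))
    probs = 1.0 / (1.0 + np.exp(-z))
    return loss, X.T @ (probs - y) / len(y)

def alternating_gradient_descent(tasks, w, lr=0.1, steps=200):
    """Round-robin over tasks: each step uses exactly one task's gradient."""
    w = w.copy()
    for step in range(steps):
        loss_fn, X, y = tasks[step % len(tasks)]  # pick the task for this step
        _, grad = loss_fn(w, X, y)
        w -= lr * grad  # update the shared parameters
    return w

# Toy data (assumption): both tasks depend on the same underlying weights.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5, 1.5])
X_reg = rng.normal(size=(64, 4))  # the two tasks use different batch sizes
X_cls = rng.normal(size=(32, 4))
tasks = [
    (mse_loss_and_grad, X_reg, X_reg @ w_true),
    (logistic_loss_and_grad, X_cls, (X_cls @ w_true > 0).astype(float)),
]

w0 = np.zeros(4)
w_final = alternating_gradient_descent(tasks, w0)
initial_losses = [fn(w0, X, y)[0] for fn, X, y in tasks]
final_losses = [fn(w_final, X, y)[0] for fn, X, y in tasks]
print("initial:", initial_losses)
print("final:  ", final_losses)
```

Because each step touches only one task, the tasks can use different loss functions and input shapes without being combined into a single summed objective. The round-robin schedule is one simple choice made for this sketch; the abstract does not specify how tasks are scheduled.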

Results

Task	Dataset	Metric	Value	Model
Zero-Shot Transfer Image Classification	ImageNet	Accuracy (Private)	83.9	IMP-MoE-L
Zero-Shot Action Recognition	UCF101	Top-1 Accuracy	91.5	IMP-MoE-L
Zero-Shot Action Recognition	Kinetics	Top-1 Accuracy	76.8	IMP-MoE-L
Zero-Shot Action Recognition	HMDB51	Top-1 Accuracy	59.1	IMP-MoE-L
