Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

2023-11-09 · CVPR 2024
Tasks: Action Classification · Audio Classification · Video Question Answering
Paper · PDF

Abstract

One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title or a description. Furthermore, video and audio inputs are of much larger volumes and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. Here we decouple the multimodal modeling, dividing it into separate, focused autoregressive models that process the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features, producing compact but expressive representations per snippet. Our approach achieves the state of the art on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
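
To make the snippet-partitioning and Combiner idea from the abstract concrete, below is a minimal PyTorch sketch. The names (Combiner, partition_into_snippets), dimensions, and the transformer-based fusion are illustrative assumptions, not the authors' released implementation; the point is only to show how long, time-aligned audio-video token streams can be split into consecutive snippets and compressed into a small set of latent tokens per snippet.

```python
# Hypothetical sketch of snippet partitioning + a Combiner-style fusion module.
# Module names, dimensions, and the transformer-based fusion are assumptions for
# illustration; this is not the paper's released implementation.
import torch
import torch.nn as nn


class Combiner(nn.Module):
    """Fuses per-snippet audio+video tokens into a small set of latent tokens."""

    def __init__(self, dim: int = 512, num_latents: int = 32, num_layers: int = 2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.num_latents = num_latents

    def forward(self, av_tokens: torch.Tensor) -> torch.Tensor:
        # av_tokens: (batch, n_audio_tokens + n_video_tokens, dim) for ONE snippet.
        b = av_tokens.shape[0]
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Concatenate learned latents with the snippet tokens, let self-attention mix
        # them, and keep only the latent positions as the compact snippet summary.
        fused = self.encoder(torch.cat([latents, av_tokens], dim=1))
        return fused[:, : self.num_latents]  # (batch, num_latents, dim)


def partition_into_snippets(tokens: torch.Tensor, snippet_len: int) -> torch.Tensor:
    """Splits a long time-aligned token sequence into consecutive snippets."""
    b, t, d = tokens.shape
    t_trim = (t // snippet_len) * snippet_len  # drop the ragged tail for simplicity
    return tokens[:, :t_trim].reshape(b, t_trim // snippet_len, snippet_len, d)


if __name__ == "__main__":
    batch, dim = 2, 512
    av_tokens = torch.randn(batch, 256, dim)           # time-aligned audio+video tokens
    snippets = partition_into_snippets(av_tokens, 64)   # (2, 4, 64, 512)

    combiner = Combiner(dim=dim)
    per_snippet = torch.stack(
        [combiner(snippets[:, i]) for i in range(snippets.shape[1])], dim=1
    )  # (2, 4, 32, 512): a short, fixed-width sequence per snippet
    print(per_snippet.shape)
```

Because each snippet is reduced to a fixed number of latent tokens, the sequence handed to the downstream autoregressive model grows only with the number of snippets rather than with the raw audio-video token count, which is how the abstract's claim about controlling sequence length for long videos plays out in practice.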

Results

Task                     | Dataset         | Metric         | Value | Model
Video                    | Kinetics-Sounds | Top 1 Accuracy | 90.1  | Mirasol3B
Video Question Answering | ActivityNet-QA  | Accuracy       | 51.13 | Mirasol3B
Video Question Answering | MSRVTT-QA       | Accuracy       | 50.42 | Mirasol3B
Video Question Answering | NExT-QA         | Accuracy       | 72    | Mirasol3B
Audio Classification     | EPIC-SOUNDS     | Accuracy       | 78.2  | Mirasol3B
Audio Classification     | VGGSound        | Top 1 Accuracy | 69.8  | Mirasol3B
Classification           | EPIC-SOUNDS     | Accuracy       | 78.2  | Mirasol3B
Classification           | VGGSound        | Top 1 Accuracy | 69.8  | Mirasol3B

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025-06-28)
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs (2025-06-27)
Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons (2025-06-24)
Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier (2025-06-23)
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? (2025-06-19)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)