
Omnivore: A Single Model for Many Visual Modalities

Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra

Published: 2022-01-20
Venue: CVPR 2022
Tasks: Image Classification, Action Classification, Scene Recognition, Semantic Segmentation, Action Recognition

Abstract

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at-par or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
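The abstract says one set of parameters is trained jointly on classification tasks from several modalities, but it does not say how the training data is mixed. As a minimal sketch of one possible arrangement, the snippet below interleaves single-modality mini-batches from several datasets into one schedule. The function names (`make_batches`, `joint_schedule`), the toy datasets, and the shuffling strategy are all assumptions for illustration, not the authors' training recipe.

```python
import random

def make_batches(dataset, batch_size):
    """Split one dataset into consecutive mini-batches."""
    return [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]

def joint_schedule(datasets, batch_size, seed=0):
    """Return a shuffled list of (modality, batch) pairs covering every dataset once.

    Assumption: each mini-batch holds samples from a single modality, and
    batches from different modalities are interleaved so that one shared
    model sees all of them within an epoch.
    """
    rng = random.Random(seed)
    schedule = []
    for name, data in datasets.items():
        for batch in make_batches(data, batch_size):
            schedule.append((name, batch))
    rng.shuffle(schedule)
    return schedule

# Hypothetical toy datasets; each sample is (sample_id, label).
toy = {
    "image": [(i, 0) for i in range(5)],
    "video": [(i, 1) for i in range(3)],
    "rgbd": [(i, 2) for i in range(2)],
}
for modality, batch in joint_schedule(toy, batch_size=2):
    print(modality, len(batch))
```

In a real training loop, each `(modality, batch)` pair would be passed through the shared model and scored with that dataset's classification loss.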

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-400 | Acc@1 | 84.1 | OMNIVORE (Swin-L) |
| Video | Kinetics-400 | Acc@5 | 96.1 | OMNIVORE (Swin-L) |
| Video | Kinetics-400 | Acc@1 | 84 | OMNIVORE (Swin-B) |
| Video | Kinetics-400 | Acc@5 | 96.2 | OMNIVORE (Swin-B) |
| Scene Parsing | SUN-RGBD | Accuracy (%) | 67.2 | OMNIVORE (Swin-B) |
| Activity Recognition | EPIC-KITCHENS-100 | Action@1 | 49.9 | OMNIVORE (Swin-B, finetuned) |
| Activity Recognition | EPIC-KITCHENS-100 | Noun@1 | 61.7 | OMNIVORE (Swin-B, finetuned) |
| Activity Recognition | EPIC-KITCHENS-100 | Verb@1 | 69.5 | OMNIVORE (Swin-B, finetuned) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 71.4 | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 93.5 | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) |
| Animation | SUN-RGBD | Accuracy (%) | 67.2 | OMNIVORE (Swin-B) |
| Action Recognition | EPIC-KITCHENS-100 | Action@1 | 49.9 | OMNIVORE (Swin-B, finetuned) |
| Action Recognition | EPIC-KITCHENS-100 | Noun@1 | 61.7 | OMNIVORE (Swin-B, finetuned) |
| Action Recognition | EPIC-KITCHENS-100 | Verb@1 | 69.5 | OMNIVORE (Swin-B, finetuned) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 71.4 | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 93.5 | OMNIVORE (Swin-B, IN-21K+ Kinetics400 pretrain) |
| 3D Character Animation From A Single Photo | SUN-RGBD | Accuracy (%) | 67.2 | OMNIVORE (Swin-B) |
| 2D Semantic Segmentation | SUN-RGBD | Accuracy (%) | 67.2 | OMNIVORE (Swin-B) |
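For readers unfamiliar with the Acc@1 and Acc@5 (top-1 and top-5) metrics reported above, the following is a minimal sketch of how top-k accuracy is conventionally computed: a sample counts as correct if its true label is among the k highest-scoring classes. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not code from the paper or from any benchmark's evaluation script.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample.
    labels: list of integer class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)

# Toy example with three samples and three classes (made-up numbers).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

Top-k accuracy can only stay the same or increase as k grows, which is why the Acc@5 values in the table are higher than the corresponding Acc@1 values.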

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)