Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

OmniVec: Learning robust representations with cross modal sharing

Siddharth Srivastava, Gaurav Sharma

2023-11-07
Tasks: Image Classification, Action Classification, Audio Classification, Text Summarization, Semantic Segmentation, 3D Point Cloud Classification, Fine-Grained Image Classification

Abstract

The majority of research on learning-based methods has focused on designing and training networks for specific tasks. However, many learning-based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction that learns multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, and task-specific prediction heads. We first pre-train it with self-supervised masked training, then train it sequentially on the different tasks. We train the network on all major modalities, e.g., visual, audio, text, and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, which allows us to achieve state-of-the-art results on most of the benchmarks. We also show that the trained network generalizes to cross-modal tasks as well as to unseen datasets and tasks.
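
The layout described in the abstract (task-specific encoders, a common trunk, task-specific heads) can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: the class name `SharedTrunkModel`, the layer width, the random weights, and the task names are all hypothetical, and neither the masked pre-training nor the sequential training stage is modeled.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (out_dim rows, in_dim columns) as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x, relu=True):
    """Matrix-vector product, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out


class SharedTrunkModel:
    """Toy layout: per-task encoder -> one shared trunk -> per-task head.

    Hypothetical illustration only; sizes and weights are arbitrary.
    """

    def __init__(self, tasks, width=16, seed=0):
        # tasks maps a task name to (input_dim, num_outputs).
        rng = random.Random(seed)
        self.encoders = {t: make_linear(d, width, rng) for t, (d, _) in tasks.items()}
        self.trunk = make_linear(width, width, rng)  # shared by every task
        self.heads = {t: make_linear(width, c, rng) for t, (_, c) in tasks.items()}

    def forward(self, task, x):
        h = apply_linear(self.encoders[task], x)   # task-specific encoder
        h = apply_linear(self.trunk, h)            # common trunk
        return apply_linear(self.heads[task], h, relu=False)  # task-specific head


# Two made-up tasks with different input sizes share the same trunk weights.
model = SharedTrunkModel({"image_cls": (8, 3), "audio_cls": (5, 2)})
image_scores = model.forward("image_cls", [0.5] * 8)
audio_scores = model.forward("audio_cls", [0.2] * 5)
```

Because every task routes through `self.trunk`, any update to the trunk made while training one task would affect all others, which is the kind of cross-modal sharing the abstract refers to.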

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MSR-VTT-1kA | text-to-video R@10 | 89.4 | OmniVec |
| Video | MSR-VTT-1kA | text-to-video R@10 | 78.6 | OmniVec (pretrained) |
| Video | YouCook2 | text-to-video R@10 | 70.8 | OmniVec |
| Video | YouCook2 | text-to-video R@10 | 64.2 | OmniVec (pretrained) |
| Video | Moments in Time | Top 1 Accuracy | 49.8 | OmniVec |
| Video | Kinetics-400 | Acc@1 | 91.1 | OmniVec |
| Video | MIT | Top 1 Accuracy | 49.8 | OmniVec |
| Activity Recognition | UCF101 | 3-fold Accuracy | 99.6 | OmniVec |
| Semantic Segmentation | NYU Depth v2 | Mean IoU | 60.8 | OmniVec |
| Semantic Segmentation | S3DIS Area5 | mIoU | 75.9 | OmniVec |
| Audio Classification | ESC-50 | Accuracy (5-fold) | 98.4 | OmniVec |
| Audio Classification | ESC-50 | Top-1 Accuracy | 98.4 | OmniVec |
| Audio Classification | AudioSet | Test mAP | 0.548 | OmniVec |
| Text Summarization | DialogSum | BertScore | 71.91 | OmniVec |
| Text Summarization | DialogSum | Rouge1 | 46.91 | OmniVec |
| Text Summarization | DialogSum | Rouge2 | 21.22 | OmniVec |
| Text Summarization | DialogSum | RougeL | 40.19 | OmniVec |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy | 93.8 | OmniVec |
| Image Classification | Places365 | Top 1 Accuracy | 63.5 | OmniVec (ViT) |
| Image Classification | Oxford-IIIT Pet Dataset | Accuracy | 99.2 | OmniVec |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | Overall Accuracy | 96.1 | OmniVec |
| Shape Representation Of 3D Point Clouds | ModelNet40-C | Error Rate | 0.156 | OmniVec |
| Fine-Grained Image Classification | Oxford-IIIT Pet Dataset | Accuracy | 99.2 | OmniVec |
| Action Recognition | UCF101 | 3-fold Accuracy | 99.6 | OmniVec |
| 3D Point Cloud Classification | ScanObjectNN | Overall Accuracy | 96.1 | OmniVec |
| 3D Point Cloud Classification | ModelNet40-C | Error Rate | 0.156 | OmniVec |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 89.4 | OmniVec |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 78.6 | OmniVec (pretrained) |
| Video Retrieval | YouCook2 | text-to-video R@10 | 70.8 | OmniVec |
| Video Retrieval | YouCook2 | text-to-video R@10 | 64.2 | OmniVec (pretrained) |
| Classification | ESC-50 | Accuracy (5-fold) | 98.4 | OmniVec |
| Classification | ESC-50 | Top-1 Accuracy | 98.4 | OmniVec |
| Classification | AudioSet | Test mAP | 0.548 | OmniVec |
| 10-shot image generation | NYU Depth v2 | Mean IoU | 60.8 | OmniVec |
| 10-shot image generation | S3DIS Area5 | mIoU | 75.9 | OmniVec |
| 3D Point Cloud Reconstruction | ScanObjectNN | Overall Accuracy | 96.1 | OmniVec |
| 3D Point Cloud Reconstruction | ModelNet40-C | Error Rate | 0.156 | OmniVec |
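
For readers unfamiliar with the text-to-video R@10 metric reported above, Recall@K is commonly computed as the percentage of text queries whose matching video appears among the K highest-scoring videos. The sketch below illustrates that general definition only; it is not the evaluation code behind these results. It assumes one ground-truth video per query, placed on the diagonal of a hypothetical `similarity` matrix, and the scores are made up for illustration.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth item ranks in the top k.

    similarity[i][j] is the score of text query i against video j.
    Assumption: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Video indices sorted from highest to lowest score for this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Made-up scores for 3 queries x 3 videos.
toy_similarity = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.8, 0.3, 0.5],  # query 1: correct video ranked last
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r_at_1 = recall_at_k(toy_similarity, 1)
r_at_3 = recall_at_k(toy_similarity, 3)
```

In this toy case two of three queries retrieve their video at rank 1, and all three do within the top 3.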
