Siddharth Srivastava, Gaurav Sharma
The majority of research in learning-based methods has focused on designing and training networks for specific tasks. However, many learning-based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction that learns multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, and task-specific prediction heads. We first pre-train it with self-supervised masked training, then train it sequentially on the different tasks. We train the network on all major modalities, e.g., visual, audio, text, and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, which allows us to achieve state-of-the-art results on most of the benchmarks. We also show that the trained network generalizes to cross-modal tasks as well as to unseen datasets and tasks.
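The layout described in the abstract (task-specific encoders, a common trunk, task-specific heads) can be illustrated with a small structural sketch. This is not the authors' implementation: the class name `JointNetwork`, the helper `linear`, the task names, the dimensions, and the fixed random weights are all assumptions made for illustration, and neither the masked pre-training nor the sequential training is shown.

```python
import random


def linear(in_dim, out_dim, rng):
    """Return a fixed random linear map (hypothetical stand-in for a learned layer)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        if len(x) != in_dim:
            raise ValueError(f"expected {in_dim} inputs, got {len(x)}")
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


class JointNetwork:
    """Toy network: one encoder and one head per task, one trunk shared by all tasks."""

    def __init__(self, task_dims, embed_dim=8, seed=0):
        # task_dims maps task name -> (input_dim, output_dim); these are assumed values.
        rng = random.Random(seed)
        self.encoders = {t: linear(i, embed_dim, rng) for t, (i, _) in task_dims.items()}
        self.trunk = linear(embed_dim, embed_dim, rng)  # shared across every task
        self.heads = {t: linear(embed_dim, o, rng) for t, (_, o) in task_dims.items()}

    def forward(self, task, x):
        z = self.encoders[task](x)                  # task-specific encoding
        h = [max(0.0, v) for v in self.trunk(z)]    # shared trunk followed by ReLU
        return self.heads[task](h)                  # task-specific prediction


net = JointNetwork({
    "image_classification": (12, 5),   # hypothetical: 12 input features, 5 classes
    "audio_classification": (6, 3),    # hypothetical: 6 input features, 3 classes
})
image_logits = net.forward("image_classification", [0.1] * 12)
audio_logits = net.forward("audio_classification", [0.2] * 6)
print(len(image_logits), len(audio_logits))
```

Both calls pass through the same `trunk` object, which is the only place where parameters would be shared between tasks in this sketch.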
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MSR-VTT-1kA | text-to-video R@10 | 89.4 | OmniVec |
| Video | MSR-VTT-1kA | text-to-video R@10 | 78.6 | OmniVec (pretrained) |
| Video | YouCook2 | text-to-video R@10 | 70.8 | OmniVec |
| Video | YouCook2 | text-to-video R@10 | 64.2 | OmniVec (pretrained) |
| Video | Moments in Time | Top 1 Accuracy | 49.8 | OmniVec |
| Video | Kinetics-400 | Acc@1 | 91.1 | OmniVec |
| Activity Recognition | UCF101 | 3-fold Accuracy | 99.6 | OmniVec |
| Semantic Segmentation | NYU Depth v2 | Mean IoU | 60.8 | OmniVec |
| Semantic Segmentation | S3DIS Area5 | mIoU | 75.9 | OmniVec |
| Audio Classification | ESC-50 | Accuracy (5-fold) | 98.4 | OmniVec |
| Audio Classification | ESC-50 | Top-1 Accuracy | 98.4 | OmniVec |
| Audio Classification | AudioSet | Test mAP | 0.548 | OmniVec |
| Text Summarization | DialogSum | BertScore | 71.91 | OmniVec |
| Text Summarization | DialogSum | Rouge1 | 46.91 | OmniVec |
| Text Summarization | DialogSum | Rouge2 | 21.22 | OmniVec |
| Text Summarization | DialogSum | RougeL | 40.19 | OmniVec |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy | 93.8 | OmniVec |
| Image Classification | Places365 | Top 1 Accuracy | 63.5 | OmniVec (ViT) |
| Image Classification | Oxford-IIIT Pet Dataset | Accuracy | 99.2 | OmniVec |
| Shape Representation Of 3D Point Clouds | ScanObjectNN | Overall Accuracy | 96.1 | OmniVec |
| Shape Representation Of 3D Point Clouds | ModelNet40-C | Error Rate | 0.156 | OmniVec |
| Fine-Grained Image Classification | Oxford-IIIT Pet Dataset | Accuracy | 99.2 | OmniVec |
| Action Recognition | UCF101 | 3-fold Accuracy | 99.6 | OmniVec |
| 3D Point Cloud Classification | ScanObjectNN | Overall Accuracy | 96.1 | OmniVec |
| 3D Point Cloud Classification | ModelNet40-C | Error Rate | 0.156 | OmniVec |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 89.4 | OmniVec |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 78.6 | OmniVec (pretrained) |
| Video Retrieval | YouCook2 | text-to-video R@10 | 70.8 | OmniVec |
| Video Retrieval | YouCook2 | text-to-video R@10 | 64.2 | OmniVec (pretrained) |
| Classification | ESC-50 | Accuracy (5-fold) | 98.4 | OmniVec |
| Classification | ESC-50 | Top-1 Accuracy | 98.4 | OmniVec |
| Classification | AudioSet | Test mAP | 0.548 | OmniVec |
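
Several rows above report text-to-video R@10, i.e. recall at rank 10. A minimal sketch of how recall@K is commonly computed from a query-by-candidate similarity matrix follows. It assumes that the correct video for text query `i` sits at column `i`, which is a convention chosen for this example and not something the table specifies; the function name `recall_at_k` and the toy scores are likewise illustrative.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth item is among the top-k ranked candidates.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for 3 queries and 3 videos (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.8, 0.3, 0.5],  # query 1: correct video ranked last
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r_at_1 = recall_at_k(toy_similarity, 1)
r_at_3 = recall_at_k(toy_similarity, 3)
print(round(r_at_1, 1), round(r_at_3, 1))
```

With these toy scores, two of the three queries rank their correct video first, so recall@1 is two thirds, while recall@3 is 100 because every candidate list has only three entries.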