OmniVec: Learning robust representations with cross modal sharing

Siddharth Srivastava, Gaurav Sharma

2023-11-07

Image Classification, Action Classification, Audio Classification, Text Summarization, Semantic Segmentation, 3D Point Cloud Classification, Fine-Grained Image Classification

Abstract

The majority of research in learning-based methods has focused on designing and training networks for specific tasks. However, many learning-based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction that learns multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, and task-specific prediction heads. We first pre-train it with self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g. visual, audio, text and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, which allows us to achieve state-of-the-art results on most of the benchmarks. We also show that the trained network generalizes to cross-modal tasks as well as to unseen datasets and tasks.
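
The data flow described in the abstract (a per-modality encoder, a shared trunk, a per-task head, and masking of inputs for self-supervised pre-training) can be illustrated with a toy sketch. This is not the authors' implementation: the class `SharedTrunkNet`, the helper `mask_input`, the layer width, and the use of small random linear maps in place of real encoders are all assumptions made purely for illustration.

```python
import random


def linear(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, x, relu=True):
    """Matrix-vector product, optionally followed by a ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, v) for v in out] if relu else out


class SharedTrunkNet:
    """Hypothetical toy model: one encoder per modality, one shared trunk, one head per task."""

    def __init__(self, input_dims, head_dims, width=8, seed=0):
        rng = random.Random(seed)
        # Modality-specific encoders map each input size to a common width.
        self.encoders = {m: linear(d, width, rng) for m, d in input_dims.items()}
        # The trunk is a single set of weights reused by every modality and task.
        self.trunk = linear(width, width, rng)
        # Task-specific heads map the shared representation to task outputs.
        self.heads = {t: linear(width, d, rng) for t, d in head_dims.items()}

    def forward(self, x, modality, task):
        z = apply(self.encoders[modality], x)
        z = apply(self.trunk, z)
        return apply(self.heads[task], z, relu=False)


def mask_input(x, ratio, rng):
    """Zero out a random fraction of positions (assumed masking scheme).

    Returns the masked input and the sorted list of masked indices.
    """
    k = int(len(x) * ratio)
    masked_idx = set(rng.sample(range(len(x)), k))
    masked = [0.0 if i in masked_idx else v for i, v in enumerate(x)]
    return masked, sorted(masked_idx)


net = SharedTrunkNet(
    input_dims={"image": 12, "audio": 6},
    head_dims={"image_cls": 5, "audio_cls": 3},
)
image_logits = net.forward([1.0] * 12, "image", "image_cls")
audio_logits = net.forward([0.5] * 6, "audio", "audio_cls")
masked, masked_idx = mask_input([1.0] * 10, 0.5, random.Random(0))
```

Both forward passes go through the same `net.trunk` weights, which is the only place where information could be shared across modalities in this sketch. In a real system, the masked positions returned by `mask_input` would serve as reconstruction targets during pre-training.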

Results

Task	Dataset	Metric	Value	Model
Video	MSR-VTT-1kA	text-to-video R@10	89.4	OmniVec
Video	MSR-VTT-1kA	text-to-video R@10	78.6	OmniVec (pretrained)
Video	YouCook2	text-to-video R@10	70.8	OmniVec
Video	YouCook2	text-to-video R@10	64.2	OmniVec (pretrained)
Video	Moments in Time	Top 1 Accuracy	49.8	OmniVec
Video	Kinetics-400	Acc@1	91.1	OmniVec
Activity Recognition	UCF101	3-fold Accuracy	99.6	OmniVec
Semantic Segmentation	NYU Depth v2	Mean IoU	60.8	OmniVec
Semantic Segmentation	S3DIS Area5	mIoU	75.9	OmniVec
Audio Classification	ESC-50	Accuracy (5-fold)	98.4	OmniVec
Audio Classification	ESC-50	Top-1 Accuracy	98.4	OmniVec
Audio Classification	AudioSet	Test mAP	0.548	OmniVec
Text Summarization	DialogSum	BertScore	71.91	OmniVec
Text Summarization	DialogSum	Rouge1	46.91	OmniVec
Text Summarization	DialogSum	Rouge2	21.22	OmniVec
Text Summarization	DialogSum	RougeL	40.19	OmniVec
Image Classification	iNaturalist 2018	Top-1 Accuracy	93.8	OmniVec
Image Classification	Places365	Top 1 Accuracy	63.5	OmniVec (ViT)
Image Classification	Oxford-IIIT Pet Dataset	Accuracy	99.2	OmniVec
Shape Representation Of 3D Point Clouds	ScanObjectNN	Overall Accuracy	96.1	OmniVec
Shape Representation Of 3D Point Clouds	ModelNet40-C	Error Rate	0.156	OmniVec
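
For readers unfamiliar with two of the metrics in the table, the following is a minimal sketch of how text-to-video R@K and mean IoU are commonly computed. The function names are hypothetical, the values are scaled to percentages on the assumption that the table reports them that way, and the exact evaluation protocol of each benchmark (tie handling, ignored classes) is not stated on this page and may differ.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose correct item is ranked in the top k.

    similarity[i][j] is the score of text query i against video j;
    video i is assumed to be the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union (in percent) over classes that occur.

    pred and target are flat lists of per-pixel (or per-point) class labels.
    Classes absent from both pred and target are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return 100.0 * sum(ious) / len(ious)


sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, 1)  # queries 0 and 2 rank their own video first
r2 = recall_at_k(sim, 2)  # query 1 finds its video within the top 2
miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this toy example, `r1` is two hits out of three queries and `r2` is three out of three; `miou` averages an IoU of 1/2 for class 0 and 2/3 for class 1.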
