VicTR: Video-conditioned Text Representations for Activity Recognition

Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

2023-04-05CVPR 2024 1Action Classification Form Zero-Shot Action Recognition Activity Recognition

Abstract

Vision-Language models (VLMs) have excelled in the image-domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video-domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), often keeping text embeddings unchanged or even being discarded. In this paper, we argue the contrary, that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more-flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g. object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.

Results

Task	Dataset	Metric	Value	Model
Video	Charades	MAP	57.6	VicTR (ViT-L/14)
Video	Kinetics-400	Acc@1	87	VicTR (ViT-L/14)
Zero-Shot Action Recognition	UCF101	Top-1 Accuracy	72.4	VicTR (ViT-B/16)
Zero-Shot Action Recognition	HMDB51	Top-1 Accuracy	51	VicTR (ViT-B/16)

Related Papers

ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs2025-07-15 FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation2025-07-11 SEZ-HARN: Self-Explainable Zero-shot Human Activity Recognition Network2025-06-25 Controlled Retrieval-augmented Context Evaluation for Long-form RAG2025-06-24 FormGym: Doing Paperwork with Agents2025-06-17 Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis2025-06-17 FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding2025-06-16 Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks2025-06-16