Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VicTR: Video-conditioned Text Representations for Activity Recognition

Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

2023-04-05 · CVPR 2024

Tasks: Action Classification · Zero-Shot Action Recognition · Activity Recognition

Abstract

Vision-Language models (VLMs) have excelled in the image domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video domain, instead of being trained from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), often keeping text embeddings unchanged or even discarding them. In this paper, we argue the contrary: better video-VLMs can be designed by focusing more on augmenting text, rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g., object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.
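The core idea in the abstract -- conditioning class-text embeddings on a specific video's visual embeddings before contrastive scoring -- can be illustrated with a minimal sketch. This is not the authors' implementation: the single cross-attention step, the function names (`video_conditioned_text`, `classify`), the embedding sizes, and the temperature value are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def video_conditioned_text(text_emb, frame_emb):
    """Condition class-text embeddings on one video's frame embeddings.

    Illustrative single cross-attention step: queries are the C class-text
    embeddings, keys/values are the T frame embeddings. Returns unit-norm
    video-conditioned text embeddings of shape (C, d).
    """
    d = text_emb.shape[-1]
    attn = softmax(text_emb @ frame_emb.T / np.sqrt(d), axis=-1)  # (C, T)
    conditioned = text_emb + attn @ frame_emb                     # residual update
    return conditioned / np.linalg.norm(conditioned, axis=-1, keepdims=True)

def classify(video_emb, cond_text_emb, temperature=0.07):
    """Contrastive scoring: cosine similarity between the pooled video
    embedding and each video-conditioned class embedding, as class probs."""
    v = video_emb / np.linalg.norm(video_emb)
    logits = cond_text_emb @ v / temperature
    return softmax(logits)
```

In contrast to recipes that only add temporal modeling on the visual side, the text embeddings here change per video, which is the "more flexible contrastive latent space" the abstract refers to.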

Results

Task | Dataset | Metric | Value | Model
Video | Charades | MAP | 57.6 | VicTR (ViT-L/14)
Video | Kinetics-400 | Acc@1 | 87 | VicTR (ViT-L/14)
Zero-Shot Action Recognition | UCF101 | Top-1 Accuracy | 72.4 | VicTR (ViT-B/16)
Zero-Shot Action Recognition | HMDB51 | Top-1 Accuracy | 51 | VicTR (ViT-B/16)

Related Papers

ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs (2025-07-15)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
SEZ-HARN: Self-Explainable Zero-shot Human Activity Recognition Network (2025-06-25)
Controlled Retrieval-augmented Context Evaluation for Long-form RAG (2025-06-24)
FormGym: Doing Paperwork with Agents (2025-06-17)
Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis (2025-06-17)
FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding (2025-06-16)
Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks (2025-06-16)