TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

Jianwei Yang, Yonatan Bisk, Jianfeng Gao

2021-08-23 | ICCV 2021 10

Tasks: Action Segmentation, Video Retrieval, Representation Learning, Zero-Shot Video Retrieval, Contrastive Learning, Retrieval, Temporal Action Localization

Abstract

Contrastive learning has been widely used to train transformer-based vision-language models for video-text alignment and multi-modal representation learning. This paper presents a new algorithm called Token-Aware Cascade contrastive learning (TACo) that improves contrastive learning using two novel techniques. The first is a token-aware contrastive loss, which is computed by taking into account the syntactic classes of words. This is motivated by the observation that, for a video-text pair, the content words in the text, such as nouns and verbs, are more likely to be aligned with the visual content of the video than the function words. The second is a cascade sampling method that generates a small set of hard negative examples for efficient loss estimation in the multi-modal fusion layers. To validate the effectiveness of TACo, in our experiments we fine-tune pretrained models on a set of downstream tasks: text-video retrieval (YouCook2, MSR-VTT and ActivityNet), video action step localization (CrossTask), and video action segmentation (COIN). The results show that our models attain consistent improvements over previous methods across different experimental settings, setting a new state of the art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT and ActivityNet.
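The two ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `NOUN`/`VERB` tag labels, the temperature parameter, and the rule of picking the top-k most similar non-matching texts as hard negatives are all assumptions made for illustration. It shows a plain InfoNCE-style contrastive loss over a video-text similarity matrix, a filter that keeps only content words given part-of-speech tags from some external tagger, and a simple hard-negative selection step.

```python
import math

# Assumed tag names; a real POS tagger may use a different tag set.
CONTENT_TAGS = {"NOUN", "VERB"}


def content_tokens(tokens, tags):
    """Keep only tokens whose (externally supplied) POS tag marks a content word."""
    return [tok for tok, tag in zip(tokens, tags) if tag in CONTENT_TAGS]


def info_nce(sim, temperature=1.0):
    """Mean video-to-text InfoNCE loss for a square similarity matrix.

    sim[i][j] is the similarity of video i and text j; matched pairs are
    assumed to sit on the diagonal.
    """
    losses = []
    for i, row in enumerate(sim):
        scaled = [s / temperature for s in row]
        m = max(scaled)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in scaled))
        losses.append(log_denom - scaled[i])
    return sum(losses) / len(losses)


def hardest_negatives(sim, k):
    """For each video i, return the indices of the k most similar non-matching texts."""
    picks = []
    for i, row in enumerate(sim):
        candidates = [j for j in range(len(row)) if j != i]
        candidates.sort(key=lambda j: row[j], reverse=True)
        picks.append(candidates[:k])
    return picks


if __name__ == "__main__":
    print(content_tokens(["add", "the", "chopped", "onion"],
                         ["VERB", "DET", "VERB", "NOUN"]))
    sim = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3], [0.4, 0.6, 0.7]]
    print(info_nce(sim, temperature=0.1))
    print(hardest_negatives(sim, k=1))
```

In a setup like the one the abstract describes, the cheap similarity matrix would be used to select a few hard negatives, so that the more expensive fusion layers only need to score those pairs rather than every pair in the batch.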

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Video | CrossTask | Recall | 42.5 | TACo
Video | MSR-VTT-1kA | text-to-video Median Rank | 4 | TACo
Video | MSR-VTT-1kA | text-to-video R@1 | 28.4 | TACo
Video | MSR-VTT-1kA | text-to-video R@10 | 71.2 | TACo
Video | MSR-VTT-1kA | text-to-video R@5 | 57.8 | TACo
Video | ActivityNet | text-to-video Median Rank | 3 | TACo
Video | ActivityNet | text-to-video R@1 | 30.4 | TACo
Video | ActivityNet | text-to-video R@5 | 61.2 | TACo
Video | ActivityNet | text-to-video R@50 | 93.4 | TACo
Video | YouCook2 | text-to-video Median Rank | 4 | TACo
Video | YouCook2 | text-to-video R@1 | 29.6 | TACo
Video | YouCook2 | text-to-video R@10 | 72.7 | TACo
Video | YouCook2 | text-to-video R@5 | 59.7 | TACo
Video | MSR-VTT | text-to-video Median Rank | 5 | TACo
Video | MSR-VTT | text-to-video R@1 | 24.8 | TACo
Video | MSR-VTT | text-to-video R@10 | 64 | TACo
Video | MSR-VTT | text-to-video R@5 | 52.1 | TACo
Temporal Action Localization | CrossTask | Recall | 42.5 | TACo
Zero-Shot Learning | CrossTask | Recall | 42.5 | TACo
Action Localization | CrossTask | Recall | 42.5 | TACo
Action Localization | COIN | Frame accuracy | 68.4 | TACo
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 4 | TACo
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 28.4 | TACo
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 71.2 | TACo
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 57.8 | TACo
Video Retrieval | ActivityNet | text-to-video Median Rank | 3 | TACo
Video Retrieval | ActivityNet | text-to-video R@1 | 30.4 | TACo
Video Retrieval | ActivityNet | text-to-video R@5 | 61.2 | TACo
Video Retrieval | ActivityNet | text-to-video R@50 | 93.4 | TACo
Video Retrieval | YouCook2 | text-to-video Median Rank | 4 | TACo
Video Retrieval | YouCook2 | text-to-video R@1 | 29.6 | TACo
Video Retrieval | YouCook2 | text-to-video R@10 | 72.7 | TACo
Video Retrieval | YouCook2 | text-to-video R@5 | 59.7 | TACo
Video Retrieval | MSR-VTT | text-to-video Median Rank | 5 | TACo
Video Retrieval | MSR-VTT | text-to-video R@1 | 24.8 | TACo
Video Retrieval | MSR-VTT | text-to-video R@10 | 64 | TACo
Video Retrieval | MSR-VTT | text-to-video R@5 | 52.1 | TACo
Action Segmentation | COIN | Frame accuracy | 68.4 | TACo
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 9.8 | TACo
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 33.4 | TACo
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 25 | TACo
Zero-Shot Video Retrieval | YouCook2 | text-to-video Mean Rank | 8 | TACo
Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 19.9 | TACo
Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 55.7 | TACo
Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 43.2 | TACo
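
For readers unfamiliar with the retrieval metrics in the table, the sketch below shows how R@K and median rank are commonly computed from a text-to-video similarity matrix. It assumes one ground-truth video per text query, located at the same index as the query, and it counts only strictly higher scores when ranking; the evaluation protocol behind the reported numbers may differ in such details. The function names are illustrative.

```python
import statistics


def ranks_of_matches(sim):
    """Return the 1-based rank of the ground-truth video for each text query.

    sim[i][j] is the similarity of text i and video j; the correct video for
    text i is assumed to be video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Count the videos that score strictly higher than the true match.
        better = sum(1 for s in row if s > row[i])
        ranks.append(better + 1)
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose true video appears in the top k results."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the true video across all queries (lower is better)."""
    return statistics.median(ranks)


if __name__ == "__main__":
    sim = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.5], [0.1, 0.2, 0.7]]
    ranks = ranks_of_matches(sim)
    print(ranks, recall_at_k(ranks, 1), median_rank(ranks))
```

Higher R@K is better, while lower median or mean rank is better, which is why a median rank of 3 on ActivityNet sits alongside an R@1 of 30.4.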
