Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TempCLR: Temporal Alignment Representation with Contrastive Learning

Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin, Guangxing Han, Shih-Fu Chang

2022-12-28 · Video Retrieval · Representation Learning · Long Video Retrieval (Background Removed) · Few Shot Action Recognition · Contrastive Learning · Action Recognition · Retrieval

Paper · PDF · Code (official)

Abstract

Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to the paired video clips in a common feature space. For long videos, given a paragraph of description where the sentences describe different segments of the video, by matching all sentence-clip pairs, the paragraph and the full video are aligned implicitly. However, such unit-level comparison may ignore global temporal context, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal succession by shuffling video clips w.r.t. temporal granularity. Then, we obtain the representations for clips/sentences, which perceive the temporal information and thus facilitate the sequence alignment. In addition to pre-training on the video and paragraph, our approach can also generalize on the matching between video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gain over all three tasks. Detailed ablation studies are provided to justify the approach design.
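The sequence-level distance described above is the standard dynamic-time-warping recurrence: given a pairwise sentence-clip cost matrix (e.g. one minus cosine similarity of their embeddings), DTW finds the minimum cumulative cost of a monotonic alignment. The official code linked above contains the paper's actual implementation; the following is only a minimal textbook sketch of that recurrence, with the cost matrix assumed to be precomputed:

```python
from math import inf

def dtw_distance(cost):
    """Minimum cumulative cost of a temporally monotonic alignment
    between a sequence of sentences (rows) and clips (columns),
    given their pairwise cost matrix. Classic DTW recurrence:
    each cell extends the cheapest of the three admissible
    predecessors (up, left, diagonal)."""
    n, m = len(cost), len(cost[0])
    # acc[i][j] = min cumulative cost aligning first i rows with first j cols
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j],      # skip a sentence step
                acc[i][j - 1],      # skip a clip step
                acc[i - 1][j - 1],  # advance both
            )
    return acc[n][m]
```

For a perfectly aligned pair the diagonal costs vanish and the distance is zero, which is what makes the quantity usable as a sequence-level term in a contrastive loss (small for matched video-paragraph pairs, large for shuffled or mismatched ones).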

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 74.5 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 94.6 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 97.0 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@1 | 83.5 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@5 | 97.2 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@10 | 99.3 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@1 | 84.9 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@5 | 97.9 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@10 | 99.5 | TempCLR |

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)