Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Temporal Context Aggregation for Video Retrieval with Contrastive Learning

Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue

Date: 2020-08-04
Tasks: Video Retrieval, Representation Learning, Contrastive Learning, Retrieval

Abstract

Current research on Content-Based Video Retrieval requires higher-level video representations that describe the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames of a video as individual images or short clips, which makes modeling long-range semantic dependencies difficult. In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism. To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and uses a memory bank mechanism to increase the capacity of negative samples. Extensive experiments are conducted on multiple video retrieval datasets, such as CC_WEB_VIDEO, FIVR-200K, and EVVE. The proposed method shows a significant performance advantage (~17% mAP on FIVR-200K) over state-of-the-art methods that use video-level features, and delivers competitive results with 22x faster inference compared with methods that use frame-level features.
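The two ideas in the abstract, aggregating temporal context with self-attention and training with a contrastive loss over a bank of negatives, can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The attention below uses identity projections (no learned weights), the video-level vector is a simple mean pool, the loss is a generic InfoNCE formulation, and every name (`self_attention`, `video_embedding`, `info_nce`, `MemoryBank`) and the temperature value are assumptions made for illustration. Hard negative mining is not modeled.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def self_attention(frames):
    """Scaled dot-product self-attention over frame features.

    Assumption: identity query/key/value projections, single head.
    """
    d = len(frames[0])
    out = []
    for q in frames:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in frames])
        out.append([sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)])
    return out


def video_embedding(frames):
    """Mean-pool the context-aware frame features into one unit-length vector."""
    ctx = self_attention(frames)
    d = len(ctx[0])
    mean = [sum(f[i] for f in ctx) / len(ctx) for i in range(d)]
    return l2_normalize(mean)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """Negative log-probability of the positive among positive + negatives."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])


class MemoryBank:
    """FIFO queue of past embeddings reused as extra negatives (hypothetical helper)."""

    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)

    def enqueue(self, embeddings):
        self.queue.extend(embeddings)

    def negatives(self):
        return list(self.queue)
```

A possible usage: embed an anchor video and a positive with `video_embedding`, push embeddings of unrelated videos into a `MemoryBank`, and pass `bank.negatives()` to `info_nce`. Because the bank persists across batches, the number of negatives per step is bounded by its capacity rather than by the batch size.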

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | FIVR-200K | mAP (CSVR) | 0.83 | TCAf |
| Video | FIVR-200K | mAP (DSVR) | 0.877 | TCAf |
| Video | FIVR-200K | mAP (ISVR) | 0.703 | TCAf |
| Video | FIVR-200K | mAP (CSVR) | 0.698 | TCAsym |
| Video | FIVR-200K | mAP (DSVR) | 0.728 | TCAsym |
| Video | FIVR-200K | mAP (ISVR) | 0.592 | TCAsym |
| Video | FIVR-200K | mAP (CSVR) | 0.553 | TCAc |
| Video | FIVR-200K | mAP (DSVR) | 0.57 | TCAc |
| Video | FIVR-200K | mAP (ISVR) | 0.473 | TCAc |
| Video Retrieval | FIVR-200K | mAP (CSVR) | 0.83 | TCAf |
| Video Retrieval | FIVR-200K | mAP (DSVR) | 0.877 | TCAf |
| Video Retrieval | FIVR-200K | mAP (ISVR) | 0.703 | TCAf |
| Video Retrieval | FIVR-200K | mAP (CSVR) | 0.698 | TCAsym |
| Video Retrieval | FIVR-200K | mAP (DSVR) | 0.728 | TCAsym |
| Video Retrieval | FIVR-200K | mAP (ISVR) | 0.592 | TCAsym |
| Video Retrieval | FIVR-200K | mAP (CSVR) | 0.553 | TCAc |
| Video Retrieval | FIVR-200K | mAP (DSVR) | 0.57 | TCAc |
| Video Retrieval | FIVR-200K | mAP (ISVR) | 0.473 | TCAc |
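
For readers unfamiliar with the metric in the table, the following is a minimal sketch of mean average precision (mAP) for ranked retrieval. It uses the textbook definition and assumes binary relevance labels per query. The function names and input layout are hypothetical, and the official FIVR-200K evaluation protocol may differ in details such as how the query video itself or unlabeled videos are handled.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k taken at each rank k that holds a relevant item."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items missing from the ranking contribute zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """mAP: average of per-query AP. Both arguments are dicts keyed by query id."""
    aps = [average_precision(rankings[q], ground_truth[q]) for q in ground_truth]
    return sum(aps) / len(aps)
```

As a worked example, a ranking `["a", "b", "c", "d"]` with relevant set `{"a", "c"}` has precision 1/1 at rank 1 and 2/3 at rank 3, so its AP is (1 + 2/3) / 2 = 5/6.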
