Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-granularity Correspondence Learning from Long-term Noisy Videos

Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng

2024-01-30 · Action Segmentation · Video Retrieval · Long Video Retrieval (Background Removed) · Video Understanding

Paper · PDF · Code

Abstract

Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitive computational cost of modeling long videos. To address this issue, one feasible solution is to learn the correspondence between video clips and captions, which, however, inevitably encounters the multi-granularity noisy correspondence (MNC) problem. Specifically, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), both of which hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in the video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton handles potential faulty negative samples in the clip-caption contrast by rectifying the alignment target with the OT assignment, ensuring precise temporal modeling. Extensive experiments on video retrieval, video question answering, and action segmentation verify the effectiveness of our method. Code is available at https://lin-yijie.github.io/projects/Norton.
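To make the two OT ingredients named in the abstract concrete, here is a minimal numpy sketch: an entropic (Sinkhorn-style) transport plan that could serve as a soft clip-caption realignment target, and a log-sum-exp "soft-maximum" pooling over a frame-word similarity matrix that emphasizes key frames and crucial words. Function names, hyperparameters, and the uniform-marginal choice are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.1, n_iters=50):
    """Entropic-regularized OT with uniform marginals.

    Returns a soft assignment matrix over (clip, caption) pairs;
    epsilon and n_iters are illustrative hyperparameters.
    """
    n, m = cost.shape
    K = np.exp(-cost / epsilon)          # Gibbs kernel
    u = np.ones(n) / n
    v = np.ones(m) / m
    for _ in range(n_iters):             # alternating marginal scaling
        u = (1.0 / n) / (K @ v)
        v = (1.0 / m) / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan

def soft_maximum_similarity(frame_feats, word_feats, alpha=10.0):
    """Log-sum-exp pooling over the frame-word similarity matrix.

    A smooth stand-in for a hard max: large alpha weights the most
    similar (frame, word) pairs, small alpha approaches the mean.
    """
    sim = frame_feats @ word_feats.T                          # (F, W)
    per_frame = np.log(np.exp(alpha * sim).mean(axis=1)) / alpha
    return np.log(np.exp(alpha * per_frame).mean()) / alpha
```

In this sketch, a clip-caption cost matrix (e.g. one minus cosine similarity) is fed to `sinkhorn`, and the resulting plan can be row-normalized to serve as a soft alignment target in place of the identity target used by standard contrastive losses, which is how faulty negatives can be down-weighted.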

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-MC | Accuracy | 92.7 | Norton |
| Action Localization | COIN | Frame accuracy | 69.8 | Norton |
| Action Segmentation | COIN | Frame accuracy | 69.8 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 75.5 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 95.0 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 97.7 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@1 | 88.7 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@5 | 98.8 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@10 | 99.5 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@1 | 88.9 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@5 | 98.4 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@10 | 99.5 | Norton |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 10.7 | Norton |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 24.1 | Norton |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 24.2 | Norton |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 51.9 | Norton |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 64.1 | Norton |

Related Papers

- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
- Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
- Self-supervised pretraining of vision transformers for animal behavioral analysis and neural encoding (2025-07-13)
- Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
- Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
- MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)