Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-granularity Correspondence Learning from Long-term Noisy Videos

Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng

2024-01-30 · Action Segmentation · Video Retrieval · Long Video Retrieval (Background Removed) · Video Understanding

Paper · PDF · Code

Abstract

Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitive computational cost of modeling long videos. To address this issue, one feasible solution is to learn the correspondence between video clips and captions, which, however, inevitably encounters the multi-granularity noisy correspondence (MNC) problem. Specifically, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), both of which hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in the video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton handles potential faulty negative samples in the clip-caption contrast by rectifying the alignment target with the OT assignment, ensuring precise temporal modeling. Extensive experiments on video retrieval, video question answering, and action segmentation verify the effectiveness of our method. Code is available at https://lin-yijie.github.io/projects/Norton.
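To make the two OT ingredients named in the abstract concrete, here is a minimal numpy sketch: an entropic (Sinkhorn-style) transport plan that could serve as a soft clip-caption realignment target, and a log-sum-exp "soft-maximum" pooling over a frame-word similarity matrix that emphasizes key frames and crucial words. Function names, hyperparameters, and the uniform-marginal choice are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.1, n_iters=50):
    """Entropic-regularized OT with uniform marginals.

    Returns a soft assignment matrix over (clip, caption) pairs;
    epsilon and n_iters are illustrative hyperparameters.
    """
    n, m = cost.shape
    K = np.exp(-cost / epsilon)          # Gibbs kernel
    u = np.ones(n) / n
    v = np.ones(m) / m
    for _ in range(n_iters):             # alternating marginal scaling
        u = (1.0 / n) / (K @ v)
        v = (1.0 / m) / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan

def soft_maximum_similarity(frame_feats, word_feats, alpha=10.0):
    """Log-sum-exp pooling over the frame-word similarity matrix.

    A smooth stand-in for a hard max: large alpha weights the most
    similar (frame, word) pairs, small alpha approaches the mean.
    """
    sim = frame_feats @ word_feats.T                          # (F, W)
    per_frame = np.log(np.exp(alpha * sim).mean(axis=1)) / alpha
    return np.log(np.exp(alpha * per_frame).mean()) / alpha
```

In this sketch, a clip-caption cost matrix (e.g. one minus cosine similarity) is fed to `sinkhorn`, and the resulting plan can be row-normalized to serve as a soft alignment target in place of the identity target used by standard contrastive losses, which is how faulty negatives can be down-weighted.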

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-MC | Accuracy | 92.7 | Norton |
| Action Localization | COIN | Frame accuracy | 69.8 | Norton |
| Action Segmentation | COIN | Frame accuracy | 69.8 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 75.5 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 95.0 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 97.7 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@1 | 88.7 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@5 | 98.8 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@10 | 99.5 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@1 | 88.9 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@5 | 98.4 | Norton |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@10 | 99.5 | Norton |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 10.7 | Norton |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 24.1 | Norton |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 24.2 | Norton |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 51.9 | Norton |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 64.1 | Norton |

Related Papers

- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
- Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
- Self-supervised pretraining of vision transformers for animal behavioral analysis and neural encoding (2025-07-13)
- Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
- Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
- MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)