Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Zhenzhi Wang, LiMin Wang, Tao Wu, TianHao Li, Gangshan Wu

2021-09-10
Representation Learning, Video Grounding, Metric Learning, Temporal Sentence Grounding

Abstract

Temporal grounding aims to localize the video moment that is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline to a fused representation, with research focused on designing complicated prediction heads or fusion strategies. Instead, treating temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) that directly models the similarity between language queries and video moments in a joint embedding space. This metric-learning framework makes it possible to fully exploit negative samples in two new ways: constructing negative cross-modal pairs in a mutual matching scheme, and mining negative pairs across different videos. These additional negative samples can enhance the joint representation learning of the two modalities via cross-modal mutual matching, which maximizes their mutual information. Experiments show that MMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present a winning solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising approach to temporal grounding, since it captures the essential cross-modal correlation in a joint embedding space. Code is available at https://github.com/MCG-NJU/MMN.
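
The mutual matching idea described in the abstract can be illustrated with a small symmetric contrastive loss. The sketch below is not the authors' implementation (the linked repository is the authoritative source). It assumes that moment and sentence embeddings already live in a joint space, that cosine similarity with a temperature is the scoring function, and that every non-matching moment or sentence in the batch, including those from other videos, counts as a negative. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(scores, index):
    """Numerically stable log-softmax of `scores`, evaluated at `index`."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[index] - log_z


def mutual_matching_loss(moments, sentences, positives, temperature=0.1):
    """Symmetric cross-modal contrastive loss (illustrative assumption).

    moments:   moment embeddings pooled from ALL videos in the batch
    sentences: sentence embeddings for ALL queries in the batch
    positives: (moment_index, sentence_index) pairs that match
    """
    sim = [[cosine(m, s) / temperature for s in sentences] for m in moments]
    total = 0.0
    for mi, si in positives:
        # Sentence -> moment: every other moment, including moments from
        # other videos, acts as a negative for this sentence.
        column = [sim[k][si] for k in range(len(moments))]
        # Moment -> sentence: every other sentence acts as a negative.
        row = sim[mi]
        total -= log_softmax_at(column, mi) + log_softmax_at(row, si)
    return total / (2 * len(positives))


# Toy batch: three moments, two sentences, 2-D embeddings.
moments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentences = [[1.0, 0.0], [0.0, 1.0]]
aligned = mutual_matching_loss(moments, sentences, [(0, 0), (1, 1)])
shuffled = mutual_matching_loss(moments, sentences, [(0, 1), (1, 0)])
print(aligned < shuffled)
```

A practical system would also need to handle moments that partially overlap the ground truth, which this simplified sketch treats as plain negatives.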

Results

Task	Dataset	Metric	Value	Model
Video Understanding	Charades-STA	R1@0.5	55.2	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Video Understanding	Charades-STA	R1@0.7	32.2	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Video Understanding	Charades-STA	R5@0.5	88.3	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Video Understanding	Charades-STA	R5@0.7	62.7	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Video Understanding	Charades-STA	R1@0.5	49.4	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Video Understanding	Charades-STA	R1@0.7	29.8	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Video Understanding	Charades-STA	R5@0.5	85.8	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Video Understanding	Charades-STA	R5@0.7	60.5	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Video	Charades-STA	R1@0.5	55.2	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Video	Charades-STA	R1@0.7	32.2	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Video	Charades-STA	R5@0.5	88.3	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Video	Charades-STA	R5@0.7	62.7	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Video	Charades-STA	R1@0.5	49.4	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Video	Charades-STA	R1@0.7	29.8	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Video	Charades-STA	R5@0.5	85.8	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Video	Charades-STA	R5@0.7	60.5	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Temporal Sentence Grounding	Charades-STA	R1@0.5	55.2	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Temporal Sentence Grounding	Charades-STA	R1@0.7	32.2	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Temporal Sentence Grounding	Charades-STA	R5@0.5	88.3	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Temporal Sentence Grounding	Charades-STA	R5@0.7	62.7	MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Temporal Sentence Grounding	Charades-STA	R1@0.5	49.4	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Temporal Sentence Grounding	Charades-STA	R1@0.7	29.8	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Temporal Sentence Grounding	Charades-STA	R5@0.5	85.8	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Temporal Sentence Grounding	Charades-STA	R5@0.7	60.5	MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
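
The metric names in the table follow the usual temporal grounding convention: Rn@m is the percentage of queries for which at least one of the top-n predicted moments has a temporal IoU with the ground-truth moment that reaches the threshold m. A minimal sketch of that computation is given below. The function names and the toy data are hypothetical, and whether the comparison is strict or inclusive at the threshold is an assumption here, since evaluation scripts differ on that point.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    (p_start, p_end), (g_start, g_end) = pred, gt
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at(ranked_predictions, ground_truths, n, iou_threshold):
    """Rn@m as a percentage.

    ranked_predictions: per query, a list of (start, end) moments sorted
                        from highest to lowest score
    ground_truths:      per query, one (start, end) moment
    """
    hits = 0
    for preds, gt in zip(ranked_predictions, ground_truths):
        # A query counts as a hit if any of its top-n moments is close enough.
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:n]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


# Toy example with three queries.
predictions = [
    [(0.0, 10.0), (5.0, 15.0)],
    [(0.0, 2.0), (20.0, 30.0)],
    [(10.0, 20.0)],
]
ground_truths = [(4.0, 14.0), (21.0, 29.0), (10.0, 19.0)]
r1 = recall_at(predictions, ground_truths, n=1, iou_threshold=0.5)
r5 = recall_at(predictions, ground_truths, n=5, iou_threshold=0.5)
print(round(r1, 1), round(r5, 1))
```

In this toy data only the third query is correct at rank 1, while all three are recovered within the top 5, which is why R5 is never lower than R1 at the same threshold.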