Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Zhenzhi Wang, Limin Wang, Tao Wu, Tianhao Li, Gangshan Wu

2021-09-10 · Representation Learning · Video Grounding · Metric Learning · Temporal Sentence Grounding
Paper · PDF · Code (official) · Code

Abstract

Temporal grounding aims to localize the video moment that is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation, with the research focus on designing complicated prediction heads or fusion strategies. Instead, viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs in a mutual matching scheme and mining negative pairs across different videos. These new negative samples enhance the joint representation learning of the two modalities via cross-modal mutual matching, maximizing their mutual information. Experiments show that our MMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present a winner solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning remains a promising method for temporal grounding by capturing the essential cross-modal correlation in a joint embedding space. Code is available at https://github.com/MCG-NJU/MMN.
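The core idea in the abstract — a bidirectional contrastive objective over moment and query embeddings, with off-diagonal pairs in the batch (including pairs from other videos) serving as negatives — can be sketched as follows. This is an illustrative reading of the mutual matching scheme, not the official MMN implementation; the function name, the InfoNCE-style formulation, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def mutual_matching_loss(moment_emb, query_emb, temperature=0.07):
    """Illustrative bidirectional contrastive loss in a joint embedding space.

    moment_emb: (B, D) embeddings of the ground-truth moment for each video
    query_emb:  (B, D) embeddings of the paired language queries

    Off-diagonal entries of the similarity matrix act as negative pairs,
    including pairs drawn from *different* videos in the batch
    (cross-video negative mining).
    """
    m = F.normalize(moment_emb, dim=-1)
    q = F.normalize(query_emb, dim=-1)
    sim = m @ q.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(sim.size(0))        # matched pairs on the diagonal
    loss_m2q = F.cross_entropy(sim, targets)   # match each moment to its query
    loss_q2m = F.cross_entropy(sim.t(), targets)  # and each query to its moment
    return 0.5 * (loss_m2q + loss_q2m)
```

Because the loss is symmetric in the two modalities, gradients push matched moment–query pairs together and all mismatched pairs apart, which is the "mutual matching" behavior the abstract describes.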

Results

The same eight results are cross-listed under the Video Understanding, Video, and Temporal Sentence Grounding tasks; they are shown once below.

Dataset | Metric | Value | Model
Charades-STA | R1@0.5 | 55.2 | MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Charades-STA | R1@0.7 | 32.2 | MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Charades-STA | R5@0.5 | 88.3 | MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Charades-STA | R5@0.7 | 62.7 | MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus)
Charades-STA | R1@0.5 | 49.4 | MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Charades-STA | R1@0.7 | 29.8 | MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Charades-STA | R5@0.5 | 85.8 | MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
Charades-STA | R5@0.7 | 60.5 | MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus)
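The Rk@m metrics in the table are the standard temporal-grounding recalls: a query counts as correct if any of the top-k predicted segments has temporal IoU of at least m with the ground-truth moment. A minimal sketch of that computation (helper names are illustrative, not from the MMN codebase):

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rank_at_k(ranked_preds, gt, k, threshold):
    """1 if any of the top-k ranked segments overlaps the ground truth with
    temporal IoU >= threshold, else 0. Averaging this over all queries
    yields the R{k}@{threshold} numbers reported above."""
    return int(any(temporal_iou(p, gt) >= threshold for p in ranked_preds[:k]))
```

For example, R1@0.5 = 55.2 means that for 55.2% of queries the single top-ranked moment overlaps the ground truth with IoU ≥ 0.5.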

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Unsupervised Ground Metric Learning (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)