Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

Hanjun Li, Xiujun Shu, Sunan He, Ruizhi Qiao, Wei Wen, Taian Guo, Bei Gan, Xing Sun

2023-08-08 · ICCV 2023 · Temporal Sentence Grounding · Contrastive Learning
Paper · PDF · Code (official)

Abstract

Temporal sentence grounding (TSG) aims to locate a specific moment in an untrimmed video given a natural language query. Weakly supervised methods still show a large performance gap compared to fully supervised ones, while the latter require laborious timestamp annotations. In this study, we aim to reduce the annotation cost while keeping competitive performance on the TSG task compared to fully supervised methods. To achieve this goal, we investigate the recently proposed glance-supervised temporal sentence grounding task, which requires only a single-frame annotation (referred to as a glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples reliable positive moments from a 2D temporal map by jointly leveraging the Gaussian prior and semantic consistency, which helps align positive sentence-moment pairs in the joint embedding space. Moreover, to alleviate the annotation bias resulting from glance annotation and to model complex queries consisting of multiple events, we propose the DGA module, which dynamically adjusts the distribution to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of the proposed D3G: it outperforms state-of-the-art weakly supervised methods by a large margin and narrows the performance gap to fully supervised methods. Code is available at https://github.com/solicucu/D3G.
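The core idea of using a Gaussian prior around a glance annotation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the midpoint-based centering, and the fixed sigma are all illustrative assumptions; it only shows how candidate moments (start, end) on a 2D temporal map might be weighted by their distance from the glance frame.

```python
import numpy as np

def gaussian_prior_weights(num_clips, glance_idx, sigma=2.0):
    """Weight each candidate moment (s, e) on a 2D temporal map by a Gaussian
    centered at the glance-annotated clip index (illustrative sketch only).

    The upper triangle (s <= e) holds valid moments; invalid cells stay zero.
    """
    weights = np.zeros((num_clips, num_clips))
    for s in range(num_clips):
        for e in range(s, num_clips):
            center = (s + e) / 2.0  # midpoint of the candidate moment
            weights[s, e] = np.exp(-((center - glance_idx) ** 2) / (2 * sigma ** 2))
    return weights

# Moments whose midpoint lies near the glance receive the highest prior weight,
# so sampling positives proportional to these weights favors nearby moments.
w = gaussian_prior_weights(num_clips=16, glance_idx=5)
```

In D3G this prior is combined with semantic consistency to select positives, and the DGA module further adjusts the distribution; the fixed, symmetric Gaussian above is only the starting point.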

Results

Identical values are reported under the Video Understanding, Video, and Temporal Sentence Grounding task categories:

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| Charades-STA | R1@0.5 | 46 | D3G (Semi-weak, MViT-K400-Pretrain-feature, evaluated by AdaFocus) |
| Charades-STA | R1@0.7 | 20.2 | D3G (Semi-weak, MViT-K400-Pretrain-feature, evaluated by AdaFocus) |
| Charades-STA | R5@0.5 | 83.1 | D3G (Semi-weak, MViT-K400-Pretrain-feature, evaluated by AdaFocus) |
| Charades-STA | R5@0.7 | 50.2 | D3G (Semi-weak, MViT-K400-Pretrain-feature, evaluated by AdaFocus) |
| Charades-STA | R1@0.5 | 41.7 | D3G (Semi-weak, I3D-K400-Pretrain-feature, evaluated by AdaFocus) |
| Charades-STA | R1@0.7 | 18.8 | D3G (Semi-weak, I3D-K400-Pretrain-feature, evaluated by AdaFocus) |
| Charades-STA | R5@0.5 | 78.2 | D3G (Semi-weak, I3D-K400-Pretrain-feature, evaluated by AdaFocus) |
| Charades-STA | R5@0.7 | 48 | D3G (Semi-weak, I3D-K400-Pretrain-feature, evaluated by AdaFocus) |
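The Rn@m entries above denote Recall@n at temporal IoU ≥ m: the percentage of queries for which at least one of the top-n predicted moments overlaps the ground truth with IoU at or above the threshold. A minimal sketch of the metric (function names are illustrative, not from the paper's codebase):

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_n_iou(predictions, ground_truths, n, iou_thresh):
    """Percentage of queries whose top-n predictions contain at least one
    moment with IoU >= iou_thresh against the ground-truth moment."""
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_thresh for p in preds[:n]):
            hits += 1
    return 100.0 * hits / len(ground_truths)

# Example: two queries, one hit at R1@0.5
preds = [[(0.0, 10.0), (20.0, 30.0)], [(5.0, 6.0)]]
gts = [(0.0, 9.0), (50.0, 60.0)]
score = recall_at_n_iou(preds, gts, n=1, iou_thresh=0.5)  # 50.0
```

Higher IoU thresholds (0.7 vs 0.5) demand tighter localization, which is why the R1@0.7 values in the table are much lower than R1@0.5.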

Related Papers

- SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
- SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
- LLM-Driven Dual-Level Multi-Interest Modeling for Recommendation (2025-07-15)
- Latent Space Consistency for Sparse-View CT Reconstruction (2025-07-15)
- Self-supervised pretraining of vision transformers for animal behavioral analysis and neural encoding (2025-07-13)