Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SSCS

Support-set Based Cross-Supervision

Computer Vision · Introduced 2021 · 1 paper
Source Paper

Description

SSCS, or Support-set Based Cross-Supervision, is a module for video grounding that consists of two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective learns effective representations through contrastive learning, while the caption objective trains a powerful video encoder supervised by text. Because some visual entities co-exist in both the ground-truth and background intervals (i.e., mutual exclusion), naive contrastive learning is ill-suited to video grounding. SSCS addresses this problem by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities.
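The two objectives can be sketched as a symmetric InfoNCE contrastive term on paired video/text embeddings plus a weighted caption term. This is a minimal illustration, not the paper's exact formulation: the function names, the NLL caption surrogate, and the loss weight are all assumptions.

```python
# Illustrative sketch of SSCS-style cross-supervision (assumed form):
# a discriminative contrastive loss over paired embeddings combined
# with a generative caption loss, here passed in as a precomputed NLL.
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (video, text) pairs."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) similarity matrix

    def xent(m):
        # cross entropy with the diagonal (matched pairs) as targets
        m = m - m.max(axis=1, keepdims=True)
        logp = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the video->text and text->video directions
    return 0.5 * (xent(logits) + xent(logits.T))

def total_loss(video_emb, text_emb, caption_nll, weight=1.0):
    """Combine the discriminative and generative objectives."""
    return info_nce(video_emb, text_emb) + weight * caption_nll

rng = np.random.default_rng(0)
v, t = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
loss = total_loss(v, t, caption_nll=2.3)
```

In practice both terms would backpropagate into the shared video and text encoders; the sketch only shows how the two supervision signals are combined.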

Specifically, in the figure to the right, two video-text pairs $\{V_i, L_i\}$ and $\{V_j, L_j\}$ from the batch are shown for clarity. After feeding them into a video encoder and a text encoder, the clip-level and sentence-level embeddings ($\{X_i, Y_i\}$ and $\{X_j, Y_j\}$) in a shared space are obtained. Based on the support-set module, weighted averages of $X_i$ and $X_j$ are computed to obtain $\bar{X}_i$ and $\bar{X}_j$, respectively. Finally, the contrastive and caption objectives are combined to pull together the representations of clips and text from the same pair and push apart those from different pairs.
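The support-set weighted average described above can be sketched as text-conditioned attention over all clip embeddings in the batch. The softmax attention form and the function names here are illustrative assumptions, not the paper's exact equations.

```python
# Hedged sketch of the support-set weighted average (assumed form):
# each sentence embedding Y_i attends over the clip embeddings pooled
# from the whole batch, yielding a soft aggregate X_bar_i.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def support_set_average(X, Y):
    """X: (B, T, D) clip embeddings per video; Y: (B, D) sentence embeddings.
    Returns X_bar: (B, D), a weighted average over ALL clips in the batch."""
    B, T, D = X.shape
    clips = X.reshape(B * T, D)               # pool clips across the batch
    weights = softmax(Y @ clips.T, axis=1)    # (B, B*T) attention weights
    return weights @ clips                    # (B, D) support-set averages

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5, 8))                # B=2 videos, T=5 clips, D=8
Y = rng.normal(size=(2, 8))
X_bar = support_set_average(X, Y)
```

Pooling clips from the whole batch is what lets entities shared between ground-truth and background intervals contribute to the aggregate, rather than being treated as pure negatives.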

Papers Using This Method

Support-Set Based Cross-Supervision for Video Grounding (2021-08-24)