Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SSCS

Support-set Based Cross-Supervision

Computer Vision · Introduced 2021 · 1 paper
Source Paper

Description

SSCS, or Support-set Based Cross-Supervision, is a module for video grounding that consists of two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective learns effective representations through contrastive learning, while the caption objective trains a powerful video encoder supervised by text. Because some visual entities co-exist in both the ground-truth and background intervals (i.e., mutual exclusion), naive contrastive learning is ill-suited to video grounding. SSCS addresses this problem by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities.
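The two objectives can be sketched as a symmetric InfoNCE contrastive term on paired video/text embeddings plus a weighted caption term. This is a minimal illustration, not the paper's exact formulation: the function names, the NLL caption surrogate, and the loss weight are all assumptions.

```python
# Illustrative sketch of SSCS-style cross-supervision (assumed form):
# a discriminative contrastive loss over paired embeddings combined
# with a generative caption loss, here passed in as a precomputed NLL.
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (video, text) pairs."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) similarity matrix

    def xent(m):
        # cross entropy with the diagonal (matched pairs) as targets
        m = m - m.max(axis=1, keepdims=True)
        logp = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the video->text and text->video directions
    return 0.5 * (xent(logits) + xent(logits.T))

def total_loss(video_emb, text_emb, caption_nll, weight=1.0):
    """Combine the discriminative and generative objectives."""
    return info_nce(video_emb, text_emb) + weight * caption_nll

rng = np.random.default_rng(0)
v, t = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
loss = total_loss(v, t, caption_nll=2.3)
```

In practice both terms would backpropagate into the shared video and text encoders; the sketch only shows how the two supervision signals are combined.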

Specifically, in the figure to the right, two video-text pairs $\{V_i, L_i\}$ and $\{V_j, L_j\}$ from the batch are shown for clarity. After feeding them into a video encoder and a text encoder, the clip-level and sentence-level embeddings ($\{X_i, Y_i\}$ and $\{X_j, Y_j\}$) in a shared space are obtained. Based on the support-set module, weighted averages of $X_i$ and $X_j$ are computed to obtain $\bar{X}_i$ and $\bar{X}_j$, respectively. Finally, the contrastive and caption objectives are combined to pull together the representations of clips and text from the same pair and push apart those from different pairs.
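The support-set weighted average described above can be sketched as text-conditioned attention over all clip embeddings in the batch. The softmax attention form and the function names here are illustrative assumptions, not the paper's exact equations.

```python
# Hedged sketch of the support-set weighted average (assumed form):
# each sentence embedding Y_i attends over the clip embeddings pooled
# from the whole batch, yielding a soft aggregate X_bar_i.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def support_set_average(X, Y):
    """X: (B, T, D) clip embeddings per video; Y: (B, D) sentence embeddings.
    Returns X_bar: (B, D), a weighted average over ALL clips in the batch."""
    B, T, D = X.shape
    clips = X.reshape(B * T, D)               # pool clips across the batch
    weights = softmax(Y @ clips.T, axis=1)    # (B, B*T) attention weights
    return weights @ clips                    # (B, D) support-set averages

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5, 8))                # B=2 videos, T=5 clips, D=8
Y = rng.normal(size=(2, 8))
X_bar = support_set_average(X, Y)
```

Pooling clips from the whole batch is what lets entities shared between ground-truth and background intervals contribute to the aggregate, rather than being treated as pure negatives.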

Papers Using This Method

Support-Set Based Cross-Supervision for Video Grounding (2021-08-24)