Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos-Arroyo, Fabian Caba Heilbron, Bernard Ghanem
The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup. A key finding is that current grounding methods alone fail in this challenging setup because they cannot process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model, Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% on MAD and 4.52% on Ego4D (NLQ). Code, data, and the MAD audio features needed to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.
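The interplay between the two models can be illustrated with a minimal sketch. The abstract does not state how the Guidance Model's output is combined with the base grounding model's predictions, so the multiplicative rescaling below is an assumption, and `Proposal`, `guided_rerank`, and `ground_window` are hypothetical names introduced only for illustration; the released code may differ.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    start: float  # seconds
    end: float    # seconds
    score: float  # confidence assigned by the base grounding model


def guided_rerank(windows, guidance_scores, ground_window, top_k=5):
    """Rescale per-window proposals by a guidance score, then rank them.

    windows:         list of (start, end) short temporal windows
    guidance_scores: one describability score in [0, 1] per window
    ground_window:   callable (start, end) -> list[Proposal]; a stand-in
                     for the base grounding model (hypothetical interface)
    """
    ranked = []
    for (w_start, w_end), g in zip(windows, guidance_scores):
        for p in ground_window(w_start, w_end):
            # Assumption: combine the two signals by multiplication, so
            # proposals from low-describability windows are suppressed.
            ranked.append(Proposal(p.start, p.end, p.score * g))
    ranked.sort(key=lambda p: p.score, reverse=True)
    return ranked[:top_k]
```

Under this assumption, a window with a guidance score near zero is effectively pruned, since none of its proposals can reach the top of the ranking regardless of the base model's confidence.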
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MAD | R@1,IoU=0.1 | 9.3 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@1,IoU=0.3 | 4.65 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@1,IoU=0.5 | 2.16 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@5,IoU=0.1 | 18.96 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@5,IoU=0.3 | 13.06 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@5,IoU=0.5 | 7.4 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@10,IoU=0.1 | 24.3 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@10,IoU=0.3 | 17.73 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@10,IoU=0.5 | 11.09 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@50,IoU=0.1 | 39.79 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@50,IoU=0.3 | 32.23 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@50,IoU=0.5 | 23.21 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@100,IoU=0.1 | 47.35 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@100,IoU=0.3 | 39.58 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@100,IoU=0.5 | 29.68 | Zero-Shot CLIP + Guidance Model |
| Video | MAD | R@1,IoU=0.1 | 5.6 | VLG-Net + Guidance Model |
| Video | MAD | R@1,IoU=0.3 | 4.28 | VLG-Net + Guidance Model |
| Video | MAD | R@1,IoU=0.5 | 2.48 | VLG-Net + Guidance Model |
| Video | MAD | R@5,IoU=0.1 | 16.07 | VLG-Net + Guidance Model |
| Video | MAD | R@5,IoU=0.5 | 8.78 | VLG-Net + Guidance Model |
| Video | MAD | R@10,IoU=0.1 | 23.64 | VLG-Net + Guidance Model |
| Video | MAD | R@10,IoU=0.3 | 19.86 | VLG-Net + Guidance Model |
| Video | MAD | R@10,IoU=0.5 | 13.72 | VLG-Net + Guidance Model |
| Video | MAD | R@50,IoU=0.1 | 45.35 | VLG-Net + Guidance Model |
| Video | MAD | R@50,IoU=0.3 | 39.77 | VLG-Net + Guidance Model |
| Video | MAD | R@50,IoU=0.5 | 30.22 | VLG-Net + Guidance Model |
| Video | MAD | R@100,IoU=0.1 | 55.59 | VLG-Net + Guidance Model |
| Video | MAD | R@100,IoU=0.3 | 49.38 | VLG-Net + Guidance Model |
| Video | MAD | R@100,IoU=0.5 | 39.12 | VLG-Net + Guidance Model |
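
The R@K, IoU=θ metrics in the table follow the usual video grounding convention: the percentage of queries for which at least one of the top-K ranked predicted moments overlaps the ground-truth moment with a temporal IoU of at least θ. A minimal sketch of that computation is shown below. The function names and the `(start, end)` tuple layout are illustrative assumptions, as is counting an IoU exactly equal to θ as a hit; the official evaluation code may differ in such details.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truths, k, iou_threshold):
    """Percentage of queries with a hit among the top-k predictions.

    predictions:   one list per query of (start, end) moments, sorted by
                   descending confidence
    ground_truths: one (start, end) moment per query
    """
    if not ground_truths:
        return 0.0
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        # A query counts as a hit if any top-k moment overlaps enough.
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(ground_truths)
```

For example, `recall_at_k(preds, gts, k=5, iou_threshold=0.3)` would correspond to the R@5,IoU=0.3 rows, given predictions ranked per query.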