Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchical information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available as open source at https://github.com/gingsi/coot-videotext
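
The abstract names a cross-modal cycle-consistency loss but does not give its formula. The following is a minimal sketch of one plausible reading, assuming a soft nearest-neighbor formulation: each sentence embedding is mapped to a softly weighted clip embedding, mapped back to a soft position in the sentence sequence, and penalized by the squared distance between the starting index and that soft position. The function names (`soft_nearest_neighbor`, `cycle_consistency_loss`) and the use of plain Python lists are illustrative assumptions; this is not the code from the linked repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_nearest_neighbor(query, candidates):
    """Weighted average of candidates; weights favor candidates close to query."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(query)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(text_embs, video_embs):
    """Assumed formulation: text -> video -> text cycle, squared index error.

    text_embs:  list of sentence embeddings (hypothetical inputs)
    video_embs: list of clip embeddings in the same embedding space
    """
    total = 0.0
    for i, s in enumerate(text_embs):
        # Step 1: soft nearest neighbor of sentence i among the clips.
        v_bar = soft_nearest_neighbor(s, video_embs)
        # Step 2: cycle back and compute a soft position over the sentences.
        back = softmax([-sq_dist(v_bar, t) for t in text_embs])
        mu = sum(w * k for k, w in enumerate(back))
        # Step 3: penalize landing away from the starting index.
        total += (i - mu) ** 2
    return total / len(text_embs)


if __name__ == "__main__":
    sentences = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
    clips = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
    print(cycle_consistency_loss(sentences, clips))
```

Under these assumptions the loss is close to zero when every sentence has a distinct, well-separated matching clip, and it grows when the sentence embeddings collapse together so that the cycle can no longer recover the starting index.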
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Captioning | YouCook2 | BLEU-3 | 17.97 | COOT |
| Video Captioning | YouCook2 | BLEU-4 | 11.3 | COOT |
| Video Captioning | YouCook2 | CIDEr | 0.57 | COOT |
| Video Captioning | YouCook2 | METEOR | 19.85 | COOT |
| Video Captioning | YouCook2 | ROUGE-L | 37.94 | COOT |
| Video Captioning | ActivityNet Captions | BLEU-3 | 17.43 | COOT (ae-test split) - Only Appearance features |
| Video Captioning | ActivityNet Captions | BLEU-4 | 10.85 | COOT (ae-test split) - Only Appearance features |
| Video Captioning | ActivityNet Captions | CIDEr | 28.19 | COOT (ae-test split) - Only Appearance features |
| Video Captioning | ActivityNet Captions | METEOR | 15.99 | COOT (ae-test split) - Only Appearance features |
| Video Captioning | ActivityNet Captions | ROUGE-L | 31.45 | COOT (ae-test split) - Only Appearance features |
| Video Retrieval | YouCook2 | text-to-video Median Rank | 9 | COOT |
| Video Retrieval | YouCook2 | text-to-video R@1 | 16.7 | COOT |
| Video Retrieval | YouCook2 | text-to-video R@10 | 52.3 | COOT |
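
The retrieval rows above report R@1, R@10, and Median Rank for text-to-video retrieval. As a rough illustration of how such metrics are commonly computed, the sketch below assumes a square similarity matrix in which entry `sim[i][j]` scores text query `i` against video `j` and the correct video for query `i` is video `i`. The function name `retrieval_metrics` and the tie-handling rule (only strictly higher scores push the rank down) are assumptions for illustration, not the evaluation code of the paper.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 10)):
    """Compute recall@K (in percent) and median rank from a similarity matrix.

    sim[i][j] is the similarity of text query i to video j; the ground-truth
    match for query i is assumed to be video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video: 1 + number of videos scored strictly higher.
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)

    metrics = {}
    for k in ks:
        hits = sum(1 for r in ranks if r <= k)
        metrics[f"R@{k}"] = 100.0 * hits / len(ranks)
    metrics["MedR"] = median(ranks)
    return metrics


if __name__ == "__main__":
    sim = [
        [0.9, 0.1, 0.2],  # query 0: correct video ranked 1st
        [0.8, 0.5, 0.3],  # query 1: correct video ranked 2nd
        [0.1, 0.2, 0.7],  # query 2: correct video ranked 1st
    ]
    print(retrieval_metrics(sim))
```

Higher R@K is better, while a lower Median Rank is better, which is why the two kinds of values in the table are not directly comparable.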