COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

2020-11-01, NeurIPS 2020
Tasks: Cross-Modal Retrieval, Video Retrieval, Representation Learning, Video-Text Retrieval, Video Captioning

Abstract

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
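
To make the idea of a cross-modal cycle-consistency loss more concrete, the following is a minimal sketch in plain Python. It assumes one common formulation: each sentence embedding is mapped to a soft nearest neighbor among the clip embeddings, that neighbor is mapped back to a soft position in the sentence sequence, and the squared distance between the original and recovered positions is penalized. The function names and the use of negative squared Euclidean distance as the similarity are illustrative assumptions, not taken from the official repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_nearest_neighbor(query, candidates):
    """Weighted average of candidates; weights favor candidates close to query."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(query)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(clips, sentences):
    """Hypothetical helper: mean squared gap between each sentence index
    and the soft index recovered after cycling sentence -> clips -> sentences."""
    total = 0.0
    for i, sent in enumerate(sentences):
        # Step 1: soft nearest neighbor of the sentence among the clips.
        clip_nn = soft_nearest_neighbor(sent, clips)
        # Step 2: cycle back and compute a soft position in the sentence sequence.
        back = softmax([-sq_dist(clip_nn, s) for s in sentences])
        soft_index = sum(w * k for k, w in enumerate(back))
        total += (i - soft_index) ** 2
    return total / len(sentences)
```

Under these assumptions, the loss is close to zero when each sentence cycles back to its own position (for example, when clip and sentence embeddings are well separated and aligned), and it grows when sentences are ambiguous, so that the cycle cannot recover the starting position.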

Results

Task	Dataset	Metric	Value	Model
Video Captioning	YouCook2	BLEU-3	17.97	COOT
Video Captioning	YouCook2	BLEU-4	11.3	COOT
Video Captioning	YouCook2	CIDEr	0.57	COOT
Video Captioning	YouCook2	METEOR	19.85	COOT
Video Captioning	YouCook2	ROUGE-L	37.94	COOT
Video Captioning	ActivityNet Captions	BLEU-3	17.43	COOT (ae-test split) - Only Appearance features
Video Captioning	ActivityNet Captions	BLEU-4	10.85	COOT (ae-test split) - Only Appearance features
Video Captioning	ActivityNet Captions	CIDEr	28.19	COOT (ae-test split) - Only Appearance features
Video Captioning	ActivityNet Captions	METEOR	15.99	COOT (ae-test split) - Only Appearance features
Video Captioning	ActivityNet Captions	ROUGE-L	31.45	COOT (ae-test split) - Only Appearance features
Video Retrieval	YouCook2	text-to-video Median Rank	9	COOT
Video Retrieval	YouCook2	text-to-video R@1	16.7	COOT
Video Retrieval	YouCook2	text-to-video R@10	52.3	COOT
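
The retrieval metrics in the table (R@1, R@10, Median Rank) are standard ranking measures. The sketch below shows how they are typically computed from a text-to-video similarity matrix. It assumes that query `i` has exactly one correct video, stored at index `i`, and that ties are resolved in favor of the correct video; the function names are illustrative and do not come from the official code.

```python
from statistics import median


def ranks_from_similarity(sim):
    """sim[i][j] is the similarity between text query i and video j.
    Assumes the correct video for query i is video i."""
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank is 1 plus the number of videos scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video; lower is better."""
    return median(ranks)


if __name__ == "__main__":
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.7, 0.4],
    ]
    ranks = ranks_from_similarity(sim)
    print(ranks, recall_at_k(ranks, 1), median_rank(ranks))
```

Higher is better for R@K, while a lower median rank indicates that correct videos tend to appear nearer the top of the ranked list.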