
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

2020-11-01 · NeurIPS 2020
Tasks: Cross-Modal Retrieval · Video Retrieval · Representation Learning · Video-Text Retrieval · Video Captioning
Paper · PDF · Code (official)

Abstract

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchical information and to model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open source at https://github.com/gingsi/coot-videotext
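The official repository contains the authors' implementation. As a rough illustration of the cross-modal cycle-consistency idea described in the abstract, here is a minimal PyTorch sketch: a sentence embedding is mapped to its soft nearest-neighbor clip, that soft clip is mapped back to a soft position in the sentence sequence, and the loss penalizes failure to return to the starting position. The tensor shapes, the dot-product similarity, and the squared-error penalty on the returned index are assumptions drawn from the abstract, not the paper's exact formulation.

```python
# Minimal sketch of a cross-modal cycle-consistency loss in the spirit of
# COOT's description; shapes and distance choices here are assumptions.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(clip_emb: torch.Tensor,
                           sent_emb: torch.Tensor) -> torch.Tensor:
    """clip_emb: (n_clips, d), sent_emb: (n_sents, d) for one video/paragraph."""
    # sentence -> soft nearest-neighbor clip
    sim_sc = sent_emb @ clip_emb.t()                   # (n_sents, n_clips)
    soft_clip = F.softmax(sim_sc, dim=1) @ clip_emb    # (n_sents, d)

    # soft clip -> soft position back in the sentence sequence
    sim_cs = soft_clip @ sent_emb.t()                  # (n_sents, n_sents)
    beta = F.softmax(sim_cs, dim=1)                    # attention over sentences
    idx = torch.arange(sent_emb.size(0), dtype=sent_emb.dtype,
                       device=sent_emb.device)
    cycled_pos = beta @ idx                            # expected return index

    # a cycle is consistent if it returns to where it started
    return F.mse_loss(cycled_pos, idx)
```

In training, a loss like this would be added to the retrieval objective, e.g. `loss = retrieval_loss + lambda_cmc * cycle_consistency_loss(clips, sents)`, encouraging aligned clip and sentence embeddings across modalities.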

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | YouCook2 | text-to-video Median Rank | 9 | COOT
Video Retrieval | YouCook2 | text-to-video R@1 | 16.7 | COOT
Video Retrieval | YouCook2 | text-to-video R@10 | 52.3 | COOT
Video Captioning | YouCook2 | BLEU-3 | 17.97 | COOT
Video Captioning | YouCook2 | BLEU-4 | 11.3 | COOT
Video Captioning | YouCook2 | CIDEr | 0.57 | COOT
Video Captioning | YouCook2 | METEOR | 19.85 | COOT
Video Captioning | YouCook2 | ROUGE-L | 37.94 | COOT
Video Captioning | ActivityNet Captions | BLEU-3 | 17.43 | COOT (ae-test split, appearance features only)
Video Captioning | ActivityNet Captions | BLEU-4 | 10.85 | COOT (ae-test split, appearance features only)
Video Captioning | ActivityNet Captions | CIDEr | 28.19 | COOT (ae-test split, appearance features only)
Video Captioning | ActivityNet Captions | METEOR | 15.99 | COOT (ae-test split, appearance features only)
Video Captioning | ActivityNet Captions | ROUGE-L | 31.45 | COOT (ae-test split, appearance features only)
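For context on the retrieval rows above, the sketch below shows one standard way to compute text-to-video R@K and Median Rank from a query-by-video similarity matrix. It is a generic illustration, not code from the COOT repository; the function name and the convention that query i's ground-truth match is video i are our assumptions.

```python
# Hypothetical helper: text-to-video retrieval metrics (R@1, R@10,
# Median Rank) from a similarity matrix with ground truth on the diagonal.
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """sim[i, j] = similarity of text query i to video j."""
    order = np.argsort(-sim, axis=1)            # best-matching video first
    # 1-indexed rank at which each query's correct video appears
    ranks = np.where(order == np.arange(len(sim))[:, None])[1] + 1
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "Median Rank": float(np.median(ranks)),
    }
```

With L2-normalized embeddings, the similarity matrix is simply `text_emb @ video_emb.T`, so the call would look like `retrieval_metrics(text_emb @ video_emb.T)`.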

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction (2025-07-15)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)