Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

BMT

Reported on 18 benchmarks across 6 tasks · 1 paper · 17 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Computer Vision · 15 results

  • Video on ActivityNet Captions
    Average F1 · 2020-05-17
    60.27
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Video on ActivityNet Captions
    Average Precision · 2020-05-17
    48.23
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Video on ActivityNet Captions
    Average Recall · 2020-05-17
    80.31
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Temporal Action Localization on ActivityNet Captions
    Average F1 · 2020-05-17
    60.27
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Temporal Action Localization on ActivityNet Captions
    Average Precision · 2020-05-17
    48.23
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Temporal Action Localization on ActivityNet Captions
    Average Recall · 2020-05-17
    80.31
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Action Localization on ActivityNet Captions
    Average F1 · 2020-05-17
    60.27
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Action Localization on ActivityNet Captions
    Average Precision · 2020-05-17
    48.23
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Action Localization on ActivityNet Captions
    Average Recall · 2020-05-17
    80.31
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Video Captioning on ActivityNet Captions
    BLEU-3 · 2020-05-17
    3.84
    best: 17.43 (COOT (ae-test split) - Only Appearance features)
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Video Captioning on ActivityNet Captions
    BLEU-4 · 2020-05-17
    1.88
    best: 9.45 (ADV-INF + Global)
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Dense Video Captioning on ActivityNet Captions
    BLEU-3 · 2020-05-17
    3.84
    best: 4.16 (TSP)
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Dense Video Captioning on ActivityNet Captions
    BLEU-4 · 2020-05-17
    1.88
    best: 9.45 (ADV-INF + Global)
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Dense Video Captioning on ActivityNet Captions
    METEOR · 2020-05-17
    8.44
    best: 17 (Vid2Seq)
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Video Captioning on ActivityNet Captions
    METEOR · 2020-05-17
    8.44
    best: 17.97 (VLTinT (ae-test split) C3D/Ling)
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)

Methodology · 3 results

  • Zero-Shot Learning on ActivityNet Captions
    Average F1 · 2020-05-17
    60.27
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Zero-Shot Learning on ActivityNet Captions
    Average Precision · 2020-05-17
    48.23
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
  • Zero-Shot Learning on ActivityNet Captions
    Average Recall · 2020-05-17
    80.31
    SOTA
    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (arXiv:2005.08271)
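
The Average F1, Average Precision, and Average Recall figures listed above appear to be related by the usual harmonic-mean definition of F1. The short sketch below checks that arithmetic on the reported values. It assumes F1 is computed directly from the two averaged figures, which this page does not state; the function name is illustrative only.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units, e.g. percent)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Values as reported on this page for BMT on ActivityNet Captions.
avg_precision = 48.23
avg_recall = 80.31

f1 = f1_from_precision_recall(avg_precision, avg_recall)
print(round(f1, 2))  # 60.27, matching the reported Average F1
```

Under that assumption, the three metrics are not independent: the F1 entries follow from the precision and recall entries.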