Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MaMMUT (ours)

Reported on 17 benchmarks across 6 tasks · 1 paper · 4 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
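
As a rough illustration of what exact-name matching implies, the sketch below groups result rows by the raw model-name string with no normalization. The row layout and the `group_by_exact_name` helper are assumptions made for this example, not the site's actual implementation. The two "MaMMUT (ours)" values are copied from this page; the bare "MaMMUT" row is hypothetical and carries no value.

```python
from collections import defaultdict

# Hypothetical rows: (model name, benchmark, metric, value).
# The "MaMMUT (ours)" values come from this page; the "MaMMUT" row is
# invented only to show a near-miss name, so its value is left as None.
rows = [
    ("MaMMUT (ours)", "Flickr30k", "Image-to-text R@1", 94.9),
    ("MaMMUT (ours)", "MSR-VTT", "CIDEr", 73.6),
    ("MaMMUT", "Flickr30k", "Image-to-text R@1", None),
]


def group_by_exact_name(rows):
    """Group result rows by the model-name string, with no normalization."""
    grouped = defaultdict(list)
    for name, benchmark, metric, value in rows:
        grouped[name].append((benchmark, metric, value))
    return dict(grouped)


results = group_by_exact_name(rows)
```

Under this scheme "MaMMUT (ours)" and "MaMMUT" land in separate groups even if they refer to the same model, and two unrelated variants sharing one name would be merged. This is why the note above advises caution when comparing entries.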

Computer Vision · 7 results

  • Image Retrieval on Flickr30k
    Image-to-text R@1 · 2023-03-29
    94.9
    SOTA
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Image Retrieval on Flickr30k
    Image-to-text R@10 · 2023-03-29
    99.9
    SOTA
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Image Retrieval on Flickr30k
    Image-to-text R@5 · 2023-03-29
    99.5
    SOTA
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Video Captioning on MSR-VTT
    CIDEr · 2023-03-29
    73.6
    best: 80 (mPLUG-2)
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Image Retrieval on Flickr30k
    Recall@1 · 2023-03-29
    82.5
    best: 89.7 (BLIP-2 ViT-G (zero-shot, 1K test set))
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Image Retrieval on Flickr30k
    Recall@10 · 2023-03-29
    98
    best: 98.9 (BLIP-2 ViT-G (zero-shot, 1K test set))
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Image Retrieval on Flickr30k
    Recall@5 · 2023-03-29
    96
    best: 98.1 (BLIP-2 ViT-G (zero-shot, 1K test set))
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839

Miscellaneous · 6 results

  • Image Retrieval with Multi-Modal Query on COCO 2014
    Image-to-text R@1 · 2023-03-29
    70.7
    best: 84.8 (BEiT-3)
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Image Retrieval with Multi-Modal Query on COCO 2014
    Image-to-text R@10 · 2023-03-29
    93.7
    best: 98.5 (X2-VLM (large))
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Image Retrieval with Multi-Modal Query on COCO 2014
    Image-to-text R@5 · 2023-03-29
    89.1
    best: 96.5 (X2-VLM (large))
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Cross-Modal Information Retrieval on COCO 2014
    Image-to-text R@1 · 2023-03-29
    70.7
    best: 84.8 (BEiT-3)
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Cross-Modal Information Retrieval on COCO 2014
    Image-to-text R@10 · 2023-03-29
    93.7
    best: 98.5 (X2-VLM (large))
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Cross-Modal Information Retrieval on COCO 2014
    Image-to-text R@5 · 2023-03-29
    89.1
    best: 96.5 (X2-VLM (large))
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839

Natural Language Processing · 4 results

  • Visual Question Answering (VQA) on MSVD-QA
    Accuracy · uses extra data · 2023-03-29
    0.602
    best: 0.61 (VLAB)
    SOTA
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Cross-Modal Retrieval on COCO 2014
    Image-to-text R@1 · 2023-03-29
    70.7
    best: 84.8 (BEiT-3)
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Cross-Modal Retrieval on COCO 2014
    Image-to-text R@10 · 2023-03-29
    93.7
    best: 98.5 (X2-VLM (large))
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839
  • Cross-Modal Retrieval on COCO 2014
    Image-to-text R@5 · 2023-03-29
    89.1
    best: 96.5 (X2-VLM (large))
    MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks · arXiv:2303.16839