Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Ethan Perez, Davide Testuggine
Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.
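To make the fusion idea concrete, here is a minimal PyTorch sketch of a multimodal bitransformer: pooled image features are projected into the text transformer's embedding space and consumed as additional input tokens. This assumes a pretrained BERT text encoder and a ResNet-152 image backbone; `NUM_IMAGE_TOKENS`, `img_proj`, and classifying from the first position are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152
from transformers import BertModel

NUM_IMAGE_TOKENS = 3  # pooled image embeddings treated as tokens (assumption)
NUM_CLASSES = 2       # e.g. a binary classification task

class MultimodalBitransformer(nn.Module):
    """Fuses image features into a pretrained BERT as extra input tokens."""

    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        # Image encoder: ResNet with its pooling and classifier heads removed,
        # so it outputs a spatial grid of 2048-d features.
        cnn = resnet152(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(cnn.children())[:-2])
        # Pool the grid down to NUM_IMAGE_TOKENS feature vectors per image.
        self.pool = nn.AdaptiveAvgPool2d((1, NUM_IMAGE_TOKENS))
        # Project each 2048-d image feature into BERT's embedding space.
        self.img_proj = nn.Linear(2048, hidden)
        self.classifier = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, input_ids, attention_mask, image):
        # Text tokens -> BERT word embeddings.
        txt_embeds = self.bert.embeddings.word_embeddings(input_ids)
        # Image -> NUM_IMAGE_TOKENS projected "token" embeddings.
        feats = self.pool(self.cnn(image))        # (B, 2048, 1, N)
        feats = feats.flatten(2).transpose(1, 2)  # (B, N, 2048)
        img_embeds = self.img_proj(feats)         # (B, N, hidden)
        # Concatenate image tokens with text tokens and run the transformer,
        # which then attends jointly over both modalities.
        inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)
        img_mask = torch.ones(
            img_embeds.shape[:2],
            dtype=attention_mask.dtype,
            device=attention_mask.device,
        )
        mask = torch.cat([img_mask, attention_mask], dim=1)
        out = self.bert(inputs_embeds=inputs_embeds, attention_mask=mask)
        # Classify from the first position's final hidden state
        # (a simplification of pooling over a [CLS]-style token).
        return self.classifier(out.last_hidden_state[:, 0])
```

Because the image features enter as ordinary tokens, the pretrained self-attention layers handle the cross-modal interaction directly, with no bespoke fusion module required.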
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Natural Language Inference | V-SNLI | Accuracy (%) | 90.5 | MMBT |