
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations

Björn Plüster, Jakob Ambsdorf, Lukas Braach, Jae Hee Lee, Stefan Wermter

2022-12-08 · Visual Entailment · Explanation Generation · Visual Question Answering (VQA)
Paper · PDF · Code (official)

Abstract

Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks, as pursued in recent VL-NLE models. While current models offer impressive performance on task accuracy and explanation plausibility, they suffer from a range of issues: some feature a modular design in which the explanation-generation module is poorly integrated with a separate task-answer prediction module, some employ backbone models trained on limited sets of tasks, and others incorporate ad hoc solutions to increase performance on single datasets. We propose to evade these limitations by applying recent advances in large-scale multi-task pretraining of generative Transformer models to the problem of VL-NLE tasks. Our approach outperforms recent models by a large margin, with human annotators preferring the generated explanations over the ground truth on two of the three evaluated datasets. As a novel challenge in VL-NLE research, we propose the problem of multi-task VL-NLE and show that jointly training on multiple tasks can increase explanation quality. We discuss the ethical implications of high-quality NLE generation and other open issues in recent VL-NLE research.
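The core recipe the abstract describes, jointly fine-tuning a single generative model on several VL-NLE datasets, amounts to mixing training batches across tasks so the model learns to emit one sequence containing both the answer and its explanation. The sketch below illustrates such a mixing loop in PyTorch, under loud assumptions: the random-tensor datasets are stand-ins for VQA-X, e-SNLI-VE, and VCR; an nn.Linear with MSE loss replaces the OFA seq2seq backbone and its language-modeling loss; and the temperature exponent alpha is an illustrative choice. None of this reproduces the authors' actual training code.

```python
import random
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

def toy_dataset(n, dim=16):
    # Stand-in for (image, question) features and target sequences;
    # a real setup would return tokenized "answer because explanation" text.
    return TensorDataset(torch.randn(n, dim), torch.randn(n, dim))

# Hypothetical dataset sizes, roughly echoing that the tasks differ in scale.
datasets = {
    "vqa_x": toy_dataset(1000),
    "esnli_ve": toy_dataset(4000),
    "vcr": toy_dataset(2000),
}
loaders = {name: DataLoader(ds, batch_size=8, shuffle=True)
           for name, ds in datasets.items()}
iters = {name: iter(dl) for name, dl in loaders.items()}

# Temperature-scaled sampling: alpha < 1 flattens the size distribution
# so smaller datasets are not drowned out in the task mixture.
alpha = 0.5
names = list(datasets)
weights = [len(datasets[n]) ** alpha for n in names]

model = nn.Linear(16, 16)  # placeholder for the generative backbone
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    # Draw a task, then a batch from that task's loader.
    task = random.choices(names, weights=weights, k=1)[0]
    try:
        x, y = next(iters[task])
    except StopIteration:  # restart an exhausted loader
        iters[task] = iter(loaders[task])
        x, y = next(iters[task])
    loss = nn.functional.mse_loss(model(x), y)  # stand-in for LM loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The single shared optimizer and per-step task sampling are what make this "joint" rather than sequential training; how the task mixture is weighted is a free design choice, and the size-temperature scheme here is only one common option.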

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | VQA-X | Accuracy | 92.6 | OFA-X-MT
Visual Question Answering (VQA) | VQA-X | Accuracy | 91.2 | OFA-X
Visual Question Answering (VQA) | VCR (Q-A) test | Accuracy | 71.2 | OFA-X
Visual Question Answering (VQA) | VCR (Q-A) test | Accuracy | 62 | OFA-X-MT
Natural Language Inference | e-SNLI-VE | Accuracy | 80.9 | OFA-X
Natural Language Inference | e-SNLI-VE | Accuracy | 78.9 | OFA-X-MT
Explanation Generation | VCR | Human Explanation Rating | 77.3 | OFA-X-MT
Explanation Generation | VCR | Human Explanation Rating | 68.9 | OFA-X
Explanation Generation | VQA-X | Human Explanation Rating | 89.5 | OFA-X
Explanation Generation | VQA-X | Human Explanation Rating | 87.8 | OFA-X-MT
Explanation Generation | e-SNLI-VE | Human Explanation Rating | 85.7 | OFA-X
Explanation Generation | e-SNLI-VE | Human Explanation Rating | 80.4 | OFA-X-MT

Related Papers

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
Hierarchical Interaction Summarization and Contrastive Prompting for Explainable Recommendations (2025-07-08)
The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems (2025-07-02)
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025-06-28)