FiLM: Visual Reasoning with a General Conditioning Layer

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville

2017-09-22

Tasks: Image Retrieval with Multi-Modal Query, Visual Question Answering (VQA) Split B, Visual Reasoning, Visual Question Answering (VQA) Split A, Visual Question Answering (VQA)

Abstract

We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
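
The feature-wise affine transformation described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes convolutional feature maps of shape (batch, channels, height, width) and a per-channel scale `gamma` and shift `beta` predicted from a conditioning vector. The helper `film_parameters`, its single linear projection, and all shapes and variable names are hypothetical choices made for the example; the source does not specify how the conditioning information is encoded.

```python
import numpy as np


def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel.

    features: (batch, channels, height, width)
    gamma, beta: (batch, channels), one scale and one shift per feature map
    """
    # Broadcast the per-channel parameters over the spatial dimensions.
    return gamma[:, :, None, None] * features + beta[:, :, None, None]


def film_parameters(conditioning, weight, bias):
    """Hypothetical generator: one linear map from conditioning to (gamma, beta)."""
    projected = conditioning @ weight + bias      # (batch, 2 * channels)
    gamma, beta = np.split(projected, 2, axis=1)  # two (batch, channels) arrays
    return gamma, beta


rng = np.random.default_rng(0)
batch, channels, height, width, cond_dim = 2, 4, 3, 3, 5

features = rng.standard_normal((batch, channels, height, width))
conditioning = rng.standard_normal((batch, cond_dim))  # e.g. an encoded question

# Assumed initialisation: small weights, bias giving gamma near 1 and beta near 0,
# so the layer starts close to the identity.
weight = rng.standard_normal((cond_dim, 2 * channels)) * 0.1
bias = np.concatenate([np.ones(channels), np.zeros(channels)])

gamma, beta = film_parameters(conditioning, weight, bias)
modulated = film(features, gamma, beta)
print(modulated.shape)  # (2, 4, 3, 3)
```

Because `gamma` and `beta` depend only on the conditioning input, the same feature maps are modulated differently for each question, while the cost stays at two parameters per feature map.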

Results

Task                                   | Dataset      | Metric    | Value | Model
Visual Question Answering (VQA)        | CLEVR        | Accuracy  | 97.7  | CNN+GRU+FiLM
Visual Question Answering (VQA)        | CLEVR-Humans | Accuracy  | 75.9  | CNN+GRU+FiLM
Image Retrieval with Multi-Modal Query | MIT-States   | Recall@1  | 10.1  | FiLM
Image Retrieval with Multi-Modal Query | MIT-States   | Recall@5  | 27.7  | FiLM
Image Retrieval with Multi-Modal Query | MIT-States   | Recall@10 | 38.3  | FiLM

Related Papers

LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)