Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VisualBERT

Computer Vision · Introduced 2019 · 25 papers
Source Paper

Description

VisualBERT reuses self-attention to implicitly align elements of the input text with regions of the input image. Images are modeled with visual embeddings, where each embedding corresponds to a bounding region produced by an object detector. Each visual embedding is the sum of three components: 1) a visual feature representation of the bounding region, 2) a segment embedding indicating that the token is an image embedding rather than a text embedding, and 3) a position embedding. Image regions and language tokens are then processed together by a Transformer, allowing self-attention to discover implicit alignments between language and vision. VisualBERT is pre-trained on COCO, which consists of images paired with captions, using two objectives: a masked language modeling objective and a sentence-image prediction task. It can then be fine-tuned on different downstream tasks.
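The construction of the visual embeddings can be sketched as follows. This is a minimal illustration in NumPy, not the actual VisualBERT implementation: the dimensions, random weights, and variable names are hypothetical stand-ins (the real model uses learned parameters and a hidden size of 768).

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 8        # hypothetical hidden size (real VisualBERT uses 768)
NUM_REGIONS = 3   # bounding regions returned by an object detector
FEAT_DIM = 16     # detector feature dimension (e.g. pooled R-CNN features)

# 1) visual feature representation: project detector features into the hidden space
region_feats = rng.normal(size=(NUM_REGIONS, FEAT_DIM))
W_proj = rng.normal(size=(FEAT_DIM, HIDDEN))
f_visual = region_feats @ W_proj

# 2) segment embedding: one vector marking "this token is an image region"
f_segment = rng.normal(size=(HIDDEN,))

# 3) position embedding: one vector per region slot
f_position = rng.normal(size=(NUM_REGIONS, HIDDEN))

# each visual embedding is the sum of the three components
visual_embeddings = f_visual + f_segment + f_position  # (NUM_REGIONS, HIDDEN)

# the visual embeddings are appended to the text token embeddings, so the
# Transformer's self-attention can attend jointly over words and image regions
text_embeddings = rng.normal(size=(5, HIDDEN))  # 5 hypothetical word tokens
joint_input = np.concatenate([text_embeddings, visual_embeddings], axis=0)
print(joint_input.shape)  # (8, 8)
```

The key design point is that image regions enter the model as ordinary input tokens; no cross-modal attention module is added, and alignment between words and regions emerges from standard self-attention over the joint sequence.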

Papers Using This Method

- Visual Question Answering on Multiple Remote Sensing Image Modalities (2025-05-21)
- Seeing Through VisualBERT: A Causal Adventure on Memetic Landscapes (2024-10-17)
- OSPC: Detecting Harmful Memes with Large Language Model as a Catalyst (2024-06-14)
- Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking (2024-01-29)
- A Review of Vision-Language Models and their Performance on the Hateful Memes Challenge (2023-05-09)
- Controlling for Stereotypes in Multimodal Language Model Evaluation (2023-02-03)
- A survey on knowledge-enhanced multimodal learning (2022-11-19)
- Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment Analysis (2022-10-11)
- Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing (2022-10-10)
- Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer (2022-06-22)
- Visual Spatial Reasoning (2022-04-30)
- Visio-Linguistic Brain Encoding (2022-04-18)
- Hateful Memes Challenge: An Enhanced Multimodal Framework (2021-12-20)
- Multimodal Learning: Are Captions All You Need? (2021-11-16)
- Seeing things or seeing scenes: Investigating the capabilities of V&L models to align scene descriptions to images (2021-10-16)
- Understanding of Emotion Perception from Art (2021-10-13)
- MARMOT: A Deep Learning Framework for Constructing Multimodal Representations for Vision-and-Language Tasks (2021-09-23)
- What Vision-Language Models `See' when they See Scenes (2021-09-15)
- Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning (2021-09-14)
- BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis (2021-08-10)