Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VisualBERT

Computer Vision · Introduced 2019 · 25 papers
Source Paper

Description

VisualBERT reuses self-attention to implicitly align elements of the input text with regions of the input image. Images are modeled with visual embeddings, where each embedding corresponds to a bounding region produced by an object detector. Each visual embedding is the sum of three components: 1) a visual feature representation of the bounding region, 2) a segment embedding indicating that the token is an image embedding rather than a text embedding, and 3) a position embedding. Image regions and language tokens are then processed together by a Transformer, allowing self-attention to discover implicit alignments between language and vision. VisualBERT is pre-trained on COCO, which consists of images paired with captions, using two objectives: a masked language modeling objective and a sentence-image prediction task. It can then be fine-tuned on different downstream tasks.
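The construction of the visual embeddings can be sketched as follows. This is a minimal illustration in NumPy, not the actual VisualBERT implementation: the dimensions, random weights, and variable names are hypothetical stand-ins (the real model uses learned parameters and a hidden size of 768).

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 8        # hypothetical hidden size (real VisualBERT uses 768)
NUM_REGIONS = 3   # bounding regions returned by an object detector
FEAT_DIM = 16     # detector feature dimension (e.g. pooled R-CNN features)

# 1) visual feature representation: project detector features into the hidden space
region_feats = rng.normal(size=(NUM_REGIONS, FEAT_DIM))
W_proj = rng.normal(size=(FEAT_DIM, HIDDEN))
f_visual = region_feats @ W_proj

# 2) segment embedding: one vector marking "this token is an image region"
f_segment = rng.normal(size=(HIDDEN,))

# 3) position embedding: one vector per region slot
f_position = rng.normal(size=(NUM_REGIONS, HIDDEN))

# each visual embedding is the sum of the three components
visual_embeddings = f_visual + f_segment + f_position  # (NUM_REGIONS, HIDDEN)

# the visual embeddings are appended to the text token embeddings, so the
# Transformer's self-attention can attend jointly over words and image regions
text_embeddings = rng.normal(size=(5, HIDDEN))  # 5 hypothetical word tokens
joint_input = np.concatenate([text_embeddings, visual_embeddings], axis=0)
print(joint_input.shape)  # (8, 8)
```

The key design point is that image regions enter the model as ordinary input tokens; no cross-modal attention module is added, and alignment between words and regions emerges from standard self-attention over the joint sequence.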

Papers Using This Method

- Visual Question Answering on Multiple Remote Sensing Image Modalities (2025-05-21)
- Seeing Through VisualBERT: A Causal Adventure on Memetic Landscapes (2024-10-17)
- OSPC: Detecting Harmful Memes with Large Language Model as a Catalyst (2024-06-14)
- Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking (2024-01-29)
- A Review of Vision-Language Models and their Performance on the Hateful Memes Challenge (2023-05-09)
- Controlling for Stereotypes in Multimodal Language Model Evaluation (2023-02-03)
- A survey on knowledge-enhanced multimodal learning (2022-11-19)
- Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment Analysis (2022-10-11)
- Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing (2022-10-10)
- Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer (2022-06-22)
- Visual Spatial Reasoning (2022-04-30)
- Visio-Linguistic Brain Encoding (2022-04-18)
- Hateful Memes Challenge: An Enhanced Multimodal Framework (2021-12-20)
- Multimodal Learning: Are Captions All You Need? (2021-11-16)
- Seeing things or seeing scenes: Investigating the capabilities of V&L models to align scene descriptions to images (2021-10-16)
- Understanding of Emotion Perception from Art (2021-10-13)
- MARMOT: A Deep Learning Framework for Constructing Multimodal Representations for Vision-and-Language Tasks (2021-09-23)
- What Vision-Language Models `See' when they See Scenes (2021-09-15)
- Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning (2021-09-14)
- BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis (2021-08-10)