ViLBERT

Vision-and-Language BERT

Computer Vision · Introduced 2019 · 30 papers

Description

Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.
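The co-attentional interaction between the two streams can be illustrated with a minimal sketch: each stream's queries attend over the other modality's keys and values, so visual regions are contextualized by the text and vice versa. This is a simplified single-head illustration in NumPy, not ViLBERT's actual implementation; the feature dimensions and sequence lengths below are arbitrary placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention: queries from one stream,
    # keys/values from the other (the "co-attention" exchange)
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
visual = rng.normal(size=(36, 64))  # e.g. 36 image-region features
text = rng.normal(size=(12, 64))    # e.g. 12 token features

# co-attention: the visual stream attends to the text stream, and vice versa
visual_out = cross_attention(visual, text, text)
text_out = cross_attention(text, visual, visual)
```

In the full model, each such exchange is followed by the usual transformer feed-forward and residual layers within each stream, and multiple heads and layers are stacked.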

Papers Using This Method

- A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product (2024-03-13)
- Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking (2024-01-29)
- The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models (2023-10-23)
- Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving (2023-07-18)
- Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input (2023-06-25)
- Weakly Supervised Visual Question Answer Generation (2023-06-11)
- Unified Multimodal Model with Unlikelihood Training for Visual Dialog (2022-11-23)
- A survey on knowledge-enhanced multimodal learning (2022-11-19)
- Probing Cross-modal Semantics Alignment Capability from the Textual Perspective (2022-10-18)
- Image Retrieval from Contextual Descriptions (2022-03-29)
- TriBERT: Human-centric Audio-visual Representation Learning (2021-12-01)
- Multimodal Learning: Are Captions All You Need? (2021-11-16)
- Image Retrieval from Contextual Descriptions (2021-11-16)
- TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation (2021-10-26)
- Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning (2021-09-14)
- Enhance Multimodal Model Performance with Data Augmentation: Facebook Hateful Meme Challenge Solution (2021-05-25)
- Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads (2021-04-30)
- Playing Lottery Tickets with Vision and Language (2021-04-23)
- On the Role of Images for Analyzing Claims in Social Media (2021-03-17)
- Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks (2020-12-22)