ViLBERT

Vision-and-Language BERT

Computer Vision · Introduced 2019 · 30 papers

Description

Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.
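The co-attentional interaction between the two streams can be illustrated with a minimal sketch: each stream's queries attend over the other modality's keys and values, so visual regions are contextualized by the text and vice versa. This is a simplified single-head illustration in NumPy, not ViLBERT's actual implementation; the feature dimensions and sequence lengths below are arbitrary placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention: queries from one stream,
    # keys/values from the other (the "co-attention" exchange)
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
visual = rng.normal(size=(36, 64))  # e.g. 36 image-region features
text = rng.normal(size=(12, 64))    # e.g. 12 token features

# co-attention: the visual stream attends to the text stream, and vice versa
visual_out = cross_attention(visual, text, text)
text_out = cross_attention(text, visual, visual)
```

In the full model, each such exchange is followed by the usual transformer feed-forward and residual layers within each stream, and multiple heads and layers are stacked.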

Papers Using This Method

- A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product (2024-03-13)
- Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking (2024-01-29)
- The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models (2023-10-23)
- Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving (2023-07-18)
- Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input (2023-06-25)
- Weakly Supervised Visual Question Answer Generation (2023-06-11)
- Unified Multimodal Model with Unlikelihood Training for Visual Dialog (2022-11-23)
- A survey on knowledge-enhanced multimodal learning (2022-11-19)
- Probing Cross-modal Semantics Alignment Capability from the Textual Perspective (2022-10-18)
- Image Retrieval from Contextual Descriptions (2022-03-29)
- TriBERT: Human-centric Audio-visual Representation Learning (2021-12-01)
- Multimodal Learning: Are Captions All You Need? (2021-11-16)
- Image Retrieval from Contextual Descriptions (2021-11-16)
- TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation (2021-10-26)
- Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning (2021-09-14)
- Enhance Multimodal Model Performance with Data Augmentation: Facebook Hateful Meme Challenge Solution (2021-05-25)
- Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads (2021-04-30)
- Playing Lottery Tickets with Vision and Language (2021-04-23)
- On the Role of Images for Analyzing Claims in Social Media (2021-03-17)
- Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks (2020-12-22)