InterBERT

Computer Vision · Introduced 2020 · 1 paper

Description

InterBERT models the interaction between information flows from different modalities. The architecture builds multi-modal interaction while preserving the independence of each single-modal representation. It consists of an image embedding layer, a text embedding layer, a single-stream interaction module, and a two-stream extraction module. The model is pre-trained on three tasks: 1) masked segment modeling, 2) masked region modeling, and 3) image-text matching.
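The data flow through the four components can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the inputs are assumed to be already embedded, and a toy self-attention with identity projections (no learned weights) stands in for the Transformer layers a real model would use. It only shows the ordering described above: joint attention over both modalities first, then separate processing per modality.

```python
import math
from typing import List, Tuple

Sequence = List[List[float]]  # one vector per image region or text token


def self_attention(seq: Sequence) -> Sequence:
    """Toy single-head self-attention with identity Q/K/V projections.

    Assumption for illustration only: a real model uses learned projections,
    multiple heads, feed-forward layers, and normalization.
    """
    dim = len(seq[0])
    output = []
    for query in seq:
        # Scaled dot-product scores of this position against every position.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in seq]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of all vectors in the sequence.
        output.append([sum(w / total * vec[i] for w, vec in zip(weights, seq))
                       for i in range(dim)])
    return output


def interbert_forward(image_emb: Sequence,
                      text_emb: Sequence) -> Tuple[Sequence, Sequence]:
    """Hypothetical forward pass: interaction first, then extraction."""
    # Single-stream interaction: both modalities attend over one joint sequence.
    joint = self_attention(image_emb + text_emb)

    # Split the joint sequence back into its image and text parts.
    n_regions = len(image_emb)
    image_stream, text_stream = joint[:n_regions], joint[n_regions:]

    # Two-stream extraction: each modality is processed on its own, so the
    # final representations stay modality-specific.
    return self_attention(image_stream), self_attention(text_stream)


# Usage: 3 image-region vectors and 4 text-token vectors, all of dimension 2.
image_emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_emb = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6], [0.0, 0.0]]
image_out, text_out = interbert_forward(image_emb, text_emb)
```

The sketch keeps the sequence lengths unchanged, so `image_out` has one vector per image region and `text_out` has one per text token. The three pre-training tasks would be applied on top of these outputs and are not modeled here.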

Papers Using This Method

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining (2020-03-30)