Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UNITER

UNiversal Image-TExt Representation Learning

Natural Language Processing · Introduced 2019 · 23 papers
Source Paper

Description

UNITER, or UNiversal Image-TExt Representation, is a large-scale pre-trained model for joint multimodal embedding. It is pre-trained on four image-text datasets: COCO, Visual Genome, Conceptual Captions, and SBU Captions. Its joint multimodal embeddings can power heterogeneous downstream vision-and-language (V+L) tasks. UNITER takes the visual regions of an image and the textual tokens of a sentence as input. An Image Embedder uses Faster R-CNN to extract the visual features of each region, and a Text Embedder tokenizes the input sentence into WordPieces.
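The two embedders can be sketched as follows. This is a minimal illustration, not UNITER's actual code: the projection matrices, the toy vocabulary size, and the reduced dimensions are assumptions; the real model uses 2048-d Faster R-CNN features, a 7-d region location vector (normalized box coordinates plus width, height, and area), and a BERT-style Transformer over the concatenated sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (the paper uses 2048-d visual features and a
# 768-d Transformer hidden size; the vocab here is a toy stand-in).
D_REGION, D_LOC, D_HIDDEN, VOCAB = 2048, 7, 768, 1000

def embed_regions(feats, locs, W_f, W_l):
    """Image Embedder: project Faster R-CNN region features and 7-d
    location vectors into the shared hidden space and sum them."""
    return feats @ W_f + locs @ W_l

def embed_tokens(token_ids, tok_table, pos_table):
    """Text Embedder: WordPiece embedding plus position embedding."""
    positions = np.arange(len(token_ids))
    return tok_table[token_ids] + pos_table[positions]

# Toy inputs: 4 detected regions and 6 WordPiece tokens.
feats = rng.normal(size=(4, D_REGION))
locs = rng.normal(size=(4, D_LOC))
W_f = rng.normal(size=(D_REGION, D_HIDDEN)) * 0.01
W_l = rng.normal(size=(D_LOC, D_HIDDEN)) * 0.01
tok_table = rng.normal(size=(VOCAB, D_HIDDEN)) * 0.01
pos_table = rng.normal(size=(512, D_HIDDEN)) * 0.01
token_ids = np.array([101, 37, 899, 6, 996, 102])  # made-up ids

# The Transformer consumes the concatenated image + text sequence.
seq = np.concatenate([embed_regions(feats, locs, W_f, W_l),
                      embed_tokens(token_ids, tok_table, pos_table)])
print(seq.shape)
```

With 4 regions and 6 tokens the Transformer sees a single sequence of 10 embeddings, all in the same 768-d space, which is what lets self-attention mix the two modalities freely.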

UNITER proposes Word-Region Alignment (WRA) via Optimal Transport, which provides fine-grained alignment between word tokens and image regions by computing the minimum cost of transporting the contextualized image embeddings to the word embeddings, and vice versa.
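The OT cost can be sketched with entropy-regularized Sinkhorn iterations over a cosine-distance cost matrix. Note this is an illustration under assumptions: UNITER actually approximates the transport plan with the IPOT algorithm, and the embedding dimensions and uniform marginals here are toy choices.

```python
import numpy as np

def cosine_cost(X, Y):
    """Pairwise cosine distance between region embeddings X (n x d)
    and word embeddings Y (m x d)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def sinkhorn_cost(C, eps=0.1, iters=200):
    """Entropy-regularized OT with uniform marginals (Sinkhorn).
    UNITER uses the IPOT approximation instead; this is a close
    stand-in for illustration."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]  # transport plan, rows ~ regions
    return float(np.sum(T * C))     # total transport cost = WRA loss

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 16))  # 4 contextualized region embeddings
words = rng.normal(size=(6, 16))    # 6 contextualized word embeddings
wra_loss = sinkhorn_cost(cosine_cost(regions, words))
print(wra_loss)
```

Minimizing this cost during pre-training pulls each word embedding toward the regions it should describe, since a low transport cost is only achievable when matched word-region pairs are close in embedding space.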

Four pre-training tasks were designed for this model: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Unlike previous models, UNITER uses conditional masking in its pre-training tasks: only one modality is masked at a time, conditioned on the full observation of the other.
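Conditional masking can be sketched as below. The function names, data shapes, and masking rate are illustrative assumptions, not UNITER's actual implementation; the point is that MLM never masks regions and MRM never masks words, so the model always conditions on one complete modality.

```python
import random

def conditional_mask(tokens, regions, task, mask_prob=0.15, rng=None):
    """Conditional masking sketch (illustrative, not UNITER's code):
    for MLM, mask only word tokens and leave all regions intact;
    for MRM, mask only regions and leave all words intact."""
    rng = rng or random.Random(0)
    if task == "MLM":
        tokens = [("[MASK]" if rng.random() < mask_prob else t)
                  for t in tokens]
    elif task == "MRM":
        # A masked region's feature vector is zeroed out in the paper;
        # None marks it here.
        regions = [(None if rng.random() < mask_prob else r)
                   for r in regions]
    return tokens, regions

tokens = ["a", "dog", "on", "the", "grass"]
regions = ["r0", "r1", "r2", "r3"]

mlm_tokens, mlm_regions = conditional_mask(tokens, regions, "MLM",
                                           mask_prob=0.5)
mrm_tokens, mrm_regions = conditional_mask(tokens, regions, "MRM",
                                           mask_prob=0.5)
print(mlm_regions == regions, mrm_tokens == tokens)  # both True
```

Joint random masking of both modalities (as in some earlier V+L models) can mask a word and its corresponding region simultaneously, leaving nothing to predict from; conditioning on one intact modality avoids that misalignment.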

Papers Using This Method

Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking (2024-01-29)
Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks (2023-07-14)
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input (2023-06-25)
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment (2022-12-20)
Probing Cross-modal Semantics Alignment Capability from the Textual Perspective (2022-10-18)
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (2022-07-01)
Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval (2022-06-17)
UPB at SemEval-2022 Task 5: Enhancing UNITER with Image Sentiment and Graph Convolutional Networks for Multimedia Automatic Misogyny Identification (2022-05-29)
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval (2022-05-24)
A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension (2022-04-17)
Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering (2022-03-24)
Hateful Memes Challenge: An Enhanced Multimodal Framework (2021-12-20)
Dense Contrastive Visual-Linguistic Pretraining (2021-09-24)
Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots (2021-07-02)
e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks (2021-05-08)
Playing Lottery Tickets with Vision and Language (2021-04-23)
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training (2021-03-11)
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models (2020-12-15)
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers (2020-09-23)
What Does BERT with Vision Look At? (2020-07-01)