Description
UNIMO is a multi-modal pre-training architecture that adapts effectively to both single-modal and multi-modal understanding and generation tasks. UNIMO learns visual and textual representations simultaneously from a large-scale corpus of image collections, text corpora, and image-text pairs. Cross-modal contrastive learning (CMCL) then aligns the visual and textual representations, unifying them in the same semantic space based on the image-text pairs.
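To illustrate the idea behind cross-modal contrastive learning, below is a minimal NumPy sketch of a symmetric InfoNCE-style contrastive objective over a batch of paired image/text embeddings. This is an assumption-laden simplification, not UNIMO's exact CMCL (which additionally constructs positives and negatives via techniques such as text rewriting and image/text retrieval); the function name, cosine similarity, and temperature value are illustrative choices.

```python
import numpy as np

def cmcl_loss(image_emb, text_emb, temperature=0.07):
    """Sketch of a symmetric cross-modal contrastive loss.

    image_emb, text_emb: (B, D) arrays where row i of each is a
    matching image-text pair. Matching pairs are pulled together,
    mismatched pairs pushed apart. NOT UNIMO's exact objective.
    """
    # L2-normalize so the dot product is cosine similarity
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    diag = np.arange(len(logits))             # positives on the diagonal

    def xent(l):
        # cross-entropy with the matching pair as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

A quick sanity check: embeddings that are nearly identical for matching pairs should give a lower loss than the same embeddings with the pairing scrambled.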
Papers Using This Method
- WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models (2022-03-22)
- UNIMO-2: End-to-End Unified Vision-Language Grounded Learning (2022-03-17)
- A Multimodal Sentiment Dataset for Video Recommendation (2021-09-17)
- UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning (2020-12-31)