Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu

2022-11-12

Tasks: Cross-Modal Retrieval · Text-to-Image Generation · Zero-Shot Cross-Modal Retrieval · Image Classification · Zero-Shot Image Classification · XLM-R · Zero-Shot Transfer Image Classification · Contrastive Learning · Image-to-Text Retrieval · Zero-shot Text Retrieval · Zero-Shot Transfer Image Classification (CN) · Zero-shot Image Retrieval · Image Retrieval

Abstract

In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model. Starting from the pre-trained multimodal representation model CLIP released by OpenAI, we altered its text encoder with a pre-trained multilingual text encoder XLM-R, and aligned both language and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations on a wide range of tasks. We set new state-of-the-art performances on a range of tasks including ImageNet-CN, Flickr30k-CN, COCO-CN and XTD. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.
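The two-stage schema described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: random vectors stand in for the outputs of the frozen CLIP towers and the trainable XLM-R student, and all function names, shapes, and the MSE distillation objective are illustrative simplifications, not the authors' implementation.

```python
# Hedged sketch of AltCLIP's two-stage training schema:
# stage 1 (teacher learning) distills the frozen CLIP text encoder into
# an XLM-R student; stage 2 (contrastive learning) aligns the student's
# text embeddings with CLIP image embeddings. Random vectors stand in
# for real encoder outputs -- everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
dim, batch = 64, 4

teacher_text = rng.normal(size=(batch, dim))  # frozen CLIP text embeddings (English)
student_text = rng.normal(size=(batch, dim))  # trainable XLM-R embeddings (parallel text)
image_emb    = rng.normal(size=(batch, dim))  # frozen CLIP image embeddings

def teacher_learning_loss(student, teacher):
    """Stage 1: pull the student's embedding of a parallel sentence toward
    the frozen teacher's embedding of the English original (MSE here as a
    simple stand-in for the paper's distillation objective)."""
    return np.mean((student - teacher) ** 2)

def contrastive_loss(img, txt, temperature=0.07):
    """Stage 2: CLIP-style symmetric InfoNCE over the batch, with matching
    image/text pairs on the diagonal as positives."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent(l):
        # Numerically stable log-softmax; positives sit on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return (xent(logits) + xent(logits.T)) / 2

stage1 = teacher_learning_loss(student_text, teacher_text)
stage2 = contrastive_loss(image_emb, student_text)
```

In the paper's actual setup, stage 1 trains only the new text encoder against the frozen teacher, and stage 2 then fine-tunes on image-text pairs; the sketch above only shows the shape of the two objectives.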

Results

Task                                   | Dataset            | Metric             | Value | Model
---------------------------------------|--------------------|--------------------|-------|--------
Image Retrieval with Multi-Modal Query | Flickr30k          | Image-to-text R@1  | 86    | AltCLIP
Image Retrieval with Multi-Modal Query | Flickr30k          | Image-to-text R@10 | 99.1  | AltCLIP
Image Retrieval with Multi-Modal Query | Flickr30k          | Image-to-text R@5  | 98    | AltCLIP
Image Retrieval with Multi-Modal Query | Flickr30k          | Text-to-image R@1  | 72.5  | AltCLIP
Image Retrieval with Multi-Modal Query | Flickr30k          | Text-to-image R@10 | 95.4  | AltCLIP
Image Retrieval with Multi-Modal Query | Flickr30k          | Text-to-image R@5  | 91.6  | AltCLIP
Zero-Shot Transfer Image Classification | ImageNet V2       | Accuracy (Private) | 68.1  | AltCLIP
Zero-Shot Transfer Image Classification | ImageNet-A        | Accuracy (Private) | 69.5  | AltCLIP
Zero-Shot Transfer Image Classification | ImageNet          | Accuracy (Private) | 74.5  | AltCLIP
Zero-Shot Transfer Image Classification | ImageNet-R        | Accuracy           | 87.2  | AltCLIP
Zero-Shot Transfer Image Classification | CN-ImageNet V2    | Accuracy (Private) | 50.9  | AltCLIP
Zero-Shot Transfer Image Classification | CN-ImageNet       | Accuracy (Private) | 59.6  | AltCLIP
Zero-Shot Transfer Image Classification | CN-ImageNet-A     | Accuracy (Private) | 58.5  | AltCLIP
Zero-Shot Transfer Image Classification | CN-ImageNet-Sketch | Accuracy (Private) | 46.5 | AltCLIP
Zero-Shot Transfer Image Classification | CN-ImageNet-R     | Accuracy (Private) | 79.9  | AltCLIP
Zero-Shot Transfer Image Classification | ImageNet-Sketch   | Accuracy (Private) | 58.7  | AltCLIP

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)