Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, YiXuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Songjing Wang, Yulong Ao, Yiming Ju, Huanhuan Ma, Xiaotong Li, Haiwen Diao, Yufeng Cui, Xinlong Wang, Yaoqi Liu, Fangxiang Feng, Guang Liu

2024-10-24
Question Generation, Image Generation, Visual Question Answering (VQA)

Abstract

Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.
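
The correspondence between image types and instruction types described in the abstract can be pictured as a lookup table that guides which instructions to synthesize for a tagged image. The sketch below is only an illustration under stated assumptions: the tag names, the instruction types, and the function `candidate_instruction_types` are hypothetical and are not taken from the paper or its released code.

```python
from collections import Counter

# Hypothetical tag-to-instruction-type table. The paper's actual tagging
# system and instruction taxonomy are not given on this page.
TAG_TO_INSTRUCTION_TYPES = {
    "chart": ["data extraction", "trend reasoning"],
    "document": ["ocr", "summarization"],
    "natural_scene": ["captioning", "object counting"],
}


def candidate_instruction_types(image_tags):
    """Return instruction types suggested by an image's tags.

    Types supported by more tags come first; ties are broken
    alphabetically so the order is deterministic.
    """
    counts = Counter()
    for tag in image_tags:
        # Unknown tags contribute nothing rather than raising an error.
        for instruction_type in TAG_TO_INSTRUCTION_TYPES.get(tag, []):
            counts[instruction_type] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [instruction_type for instruction_type, _ in ranked]


if __name__ == "__main__":
    # Example: an image tagged as both a chart and a document.
    print(candidate_instruction_types(["chart", "document"]))
```

In a real pipeline the selected instruction types would presumably be passed to an open-source VLM as part of a generation prompt; that step is omitted here because the page does not describe it.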

Results

Task | Dataset | Metric | Value | Model
Image Generation | TextAtlasEval | StyledTextSynth Clip Score | 0.2727 | Infinity-2B
Image Generation | TextAtlasEval | StyledTextSynth FID | 84.95 | Infinity-2B
Image Generation | TextAtlasEval | StyledTextSynth OCR (Accuracy) | 0.8 | Infinity-2B
Image Generation | TextAtlasEval | StyledTextSynth OCR (Cer) | 0.93 | Infinity-2B
Image Generation | TextAtlasEval | StyledTextSynth OCR (F1 Score) | 1.42 | Infinity-2B
Image Generation | TextAtlasEval | TextScenesHQ Clip Score | 0.2346 | Infinity-2B
Image Generation | TextAtlasEval | TextScenesHQ FID | 71.59 | Infinity-2B
Image Generation | TextAtlasEval | TextScenesHQ OCR (Accuracy) | 1.06 | Infinity-2B
Image Generation | TextAtlasEval | TextScenesHQ OCR (Cer) | 0.88 | Infinity-2B
Image Generation | TextAtlasEval | TextScenesHQ OCR (F1 Score) | 1.74 | Infinity-2B
Image Generation | TextAtlasEval | TextVisionBlend Clip Score | 0.1979 | Infinity-2B
Image Generation | TextAtlasEval | TextVisionBlend FID | 95.69 | Infinity-2B
Image Generation | TextAtlasEval | TextVisionBlend OCR (Accuracy) | 2.98 | Infinity-2B
Image Generation | TextAtlasEval | TextVisionBlend OCR (Cer) | 0.83 | Infinity-2B
Image Generation | TextAtlasEval | TextVisionBlend OCR (F1 Score) | 3.44 | Infinity-2B

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)