| Rank | Model | Score | Extra Training Data | Paper | Date | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | MMCTAgent (GPT-4 + GPT-4V) | 74.24 | No | MMCTAgent: Multi-modal Critical Thinking Agent F... | 2024-05-28 | - |
| 2 | Qwen2-VL-72B | 74 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 3 | InternVL2.5-78B | 72.3 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 4 | GPT-4o + text rationale + IoT | 72.2 | No | Image-of-Thought Prompting for Visual Reasoning ... | 2024-05-22 | - |
| 5 | Lyra-Pro | 71.4 | No | Lyra: An Efficient and Speech-Centric Framework ... | 2024-12-12 | Code |
| 6 | GLM-4V-Plus | 71.1 | No | CogVLM2: Visual Language Models for Image and Vi... | 2024-08-29 | Code |
| 7 | Phantom-7B | 70.8 | No | Phantom of Latent for Large Language and Vision ... | 2024-09-23 | Code |
| 8 | InternVL2.5-38B | 68.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 9 | InternVL2-26B (SGP, token ratio 64%) | 65.6 | No | A Stitch in Time Saves Nine: Small VLM is a Prec... | 2024-12-04 | Code |
| 10 | Baichuan-Omni (7B) | 65.4 | No | Baichuan-Omni Technical Report | 2024-10-11 | Code |
| 11 | InternVL2.5-26B | 65 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 12 | Qwen2-VL-7B (finetuned on GAP-VQA train) | 64.954 | No | Gamified crowd-sourcing of high-quality data for... | 2024-10-05 | - |
| 13 | InternVL2-Llama3-76B | 64.4 | No | - | - | - |
| 14 | GLM4 Vision | 63.9 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 15 | LLaVA-OneVision-72B | 63.7 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 16 | Lyra-Base | 63.5 | No | Lyra: An Efficient and Speech-Centric Framework ... | 2024-12-12 | Code |
| 17 | InternVL2-26B (SGP, token ratio 35%) | 63.2 | No | A Stitch in Time Saves Nine: Small VLM is a Prec... | 2024-12-04 | Code |
| 18 | InternVL 1.5 | 62.8 | No | How Far Are We to GPT-4V? Closing the Gap to Com... | 2024-04-25 | Code |
| 19 | InternVL2.5-8B | 62.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 20 | MAmmoTH-VL-8B | 62.3 | No | MAmmoTH-VL: Eliciting Multimodal Reasoning with ... | 2024-12-06 | Code |
| 21 | Qwen2-VL-7B | 62 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 22 | InternVL2-40B | 61.8 | No | - | - | - |
| 23 | Mini-Gemini-HD-BS | 60.8 | No | Mini-Gemini: Mining the Potential of Multi-modal... | 2024-03-27 | Code |
| 24 | InternVL2.5-2B | 60.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 25 | MAmmoTH-VL-8B (SI) | 60.6 | No | MAmmoTH-VL: Eliciting Multimodal Reasoning with ... | 2024-12-06 | Code |
| 26 | InternVL2.5-4B | 60.6 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 27 | Mini-Gemini-HD | 59.3 | No | Mini-Gemini: Mining the Potential of Multi-modal... | 2024-03-27 | Code |
| 28 | GLM-4V-9B | 58 | No | CogVLM2: Visual Language Models for Image and Vi... | 2024-08-29 | Code |
| 29 | LLaVA-OneVision-7B | 57.5 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 30 | LLaVA-NeXT-34B | 57.4 | No | - | - | - |
| 31 | Meteor | 57.3 | No | Meteor: Mamba-based Traversal of Rationale for L... | 2024-05-24 | Code |
| 32 | CROME (Vicuna-13B) | 55.1 | No | CROME: Cross-Modal Adapters for Efficient Multim... | 2024-08-13 | - |
| 33 | IXC2-4KHD | 54.9 | No | InternLM-XComposer2-4KHD: A Pioneering Large Vis... | 2024-04-09 | Code |
| 34 | Weitu-VL-1.0 | 54.7 | No | - | - | - |
| 35 | TroL-7B | 54.7 | No | TroL: Traversal of Layers for Large Language and... | 2024-06-18 | Code |
| 36 | Mini-Gemini | 53 | No | Mini-Gemini: Mining the Potential of Multi-modal... | 2024-03-27 | Code |
| 37 | CogVLM (Vicuna-7B) | 52.8 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 38 | CogAgent | 52.8 | No | CogAgent: A Visual Language Model for GUI Agents | 2023-12-14 | Code |
| 39 | Qwen2-VL-2B (finetuned on GAP-VQA train) | 52.43 | No | Gamified crowd-sourcing of high-quality data for... | 2024-10-05 | - |
| 40 | InternVL2-26B (SGP, token ratio 9%) | 52.1 | No | A Stitch in Time Saves Nine: Small VLM is a Prec... | 2024-12-04 | Code |
| 41 | MM1.5-30B | 52 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 42 | MiniCPM-Llama3-V-2.5-8B (finetuned on GAP-VQA train) | 51.789 | No | Gamified crowd-sourcing of high-quality data for... | 2024-10-05 | - |
| 43 | IXC-2.5-7B | 51.7 | No | InternLM-XComposer-2.5: A Versatile Large Vision... | 2024-07-03 | Code |
| 44 | InternLM-XComposer2 | 51.2 | No | InternLM-XComposer2: Mastering Free-form Text-Im... | 2024-01-29 | Code |
| 45 | Lyra-Mini | 51.2 | No | Lyra: An Efficient and Speech-Centric Framework ... | 2024-12-12 | Code |
| 46 | CuMo-7B | 51 | No | CuMo: Scaling Multimodal LLM with Co-Upcycled Mi... | 2024-05-09 | Code |
| 47 | TACO (Qwen2-7B / SigLIP) | 50.9 | No | TACO: Learning Multi-modal Action Models with Sy... | 2024-12-07 | Code |
| 48 | Qwen-VL-Chat (+ SFT (GPT-4V in VLFeedback)) | 50.7 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 49 | POINTS-9B | 50 | No | POINTS: Improving Your Vision-language Model wit... | 2024-09-07 | - |
| 50 | VILA^2-8B | 50 | No | VILA^2: VILA Augmented VILA | 2024-07-24 | - |
| 51 | Janus-Pro-7B | 50 | No | Janus-Pro: Unified Multimodal Understanding and ... | 2025-01-29 | Code |
| 52 | Silkie | 49.9 | No | Silkie: Preference Distillation for Large Visual... | 2023-12-17 | - |
| 53 | Silkie (Qwen-VL-Chat + DPO w/ VLFeedback) | 49.9 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 54 | Qwen2-VL-2B | 49.5 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 55 | FlashSloth-HD | 49 | No | FlashSloth: Lightning Multimodal Large Language ... | 2024-12-05 | Code |
| 56 | InternVL 1.2 | 48.9 | No | How Far Are We to GPT-4V? Closing the Gap to Com... | 2024-04-25 | Code |
| 57 | SEA-PRIME (Vicuna-13B) | 48.8 | No | SEA: Supervised Embedding Alignment for Token-Le... | 2024-08-21 | - |
| 58 | InternVL2.5-1B | 48.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 59 | MM1-30B-Chat | 48.7 | No | MM1: Methods, Analysis & Insights from Multimoda... | 2024-03-14 | - |
| 60 | SETOKIM (13B) | 48.7 | No | Towards Semantic Equivalence of Tokenization in ... | 2024-06-07 | - |
| 61 | Emu2-Chat | 48.5 | No | Generative Multimodal Models are In-Context Lear... | 2023-12-20 | Code |
| 62 | MG-LLaVA (34B) | 48.5 | No | MG-LLaVA: Towards Multi-Granularity Visual Instr... | 2024-06-25 | Code |
| 63 | SPHINX-Plus | 47.9 | No | SPHINX-X: Scaling Data and Parameters for a Fami... | 2024-02-08 | Code |
| 64 | ConvLLaVA | 45.9 | No | ConvLLaVA: Hierarchical Backbones as Visual Enco... | 2024-05-24 | Code |
| 65 | VILA-13B | 45.7 | No | VILA: On Pre-training for Visual Language Models | 2023-12-12 | Code |
| 66 | TACO (LLaMA3-8B / SigLIP) | 45.7 | No | TACO: Learning Multi-modal Action Models with Sy... | 2024-12-07 | Code |
| 67 | HPT 1.5 Edge | 45.3 | No | - | - | - |
| 68 | TACO (LLaMA3-8B / CLIP) | 45.2 | No | TACO: Learning Multi-modal Action Models with Sy... | 2024-12-07 | Code |
| 69 | LLaVA-v1.6 (7B, w/ STIC) | 45 | No | Enhancing Large Vision Language Models with Self... | 2024-05-30 | Code |
| 70 | H2OVL-Mississippi-2B | 44.7 | No | H2OVL-Mississippi Vision Language Models Technic... | 2024-10-17 | - |
| 71 | PIIP-LLaVA (Vicuna-7B, ConvNeXt-L, CLIP-L) | 44.7 | No | Parameter-Inverted Image Pyramid Networks for Vi... | 2025-01-14 | Code |
| 72 | Imp-4B | 44.6 | No | Imp: Highly Capable Large Multimodal Models for ... | 2024-05-20 | Code |
| 73 | LLaVA-Next-Mistral-7B (+ DPO w/ VLFeedback) | 44.2 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 74 | MGM-7B+RP | 44.1 | No | Img-Diff: Contrastive Data Synthesis for Multimo... | 2024-08-08 | - |
| 75 | LLaVA-Next-Vicuna-7B (+ DPO w/ VLFeedback) | 44.1 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 76 | VW-LMM | 44 | No | Multi-modal Auto-regressive Modeling via Visual ... | 2024-03-12 | Code |
| 77 | MoAI | 43.7 | No | MoAI: Mixture of All Intelligence for Large Lang... | 2024-03-12 | Code |
| 78 | MM1-3B-Chat | 43.7 | No | MM1: Methods, Analysis & Insights from Multimoda... | 2024-03-14 | - |
| 79 | MM1.5-3B-MoE | 43.7 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 80 | Imp-3B | 43.3 | No | Imp: Highly Capable Large Multimodal Models for ... | 2024-05-20 | Code |
| 81 | ShareGPT4V-13B | 43.1 | No | ShareGPT4V: Improving Large Multi-Modal Models w... | 2023-11-21 | Code |
| 82 | Mini-Gemini (+MoCa) | 42.9 | No | Deciphering Cross-Modal Alignment in Large Visio... | 2024-10-09 | Code |
| 83 | MM1.5-7B | 42.2 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 84 | MM1-7B-Chat | 42.1 | No | MM1: Methods, Analysis & Insights from Multimoda... | 2024-03-14 | - |
| 85 | FlashSloth | 41.9 | No | FlashSloth: Lightning Multimodal Large Language ... | 2024-12-05 | Code |
| 86 | DeepSeek-VL | 41.5 | No | DeepSeek-VL: Towards Real-World Vision-Language ... | 2024-03-08 | Code |
| 87 | LLaVA1.5-13B-BPO | 41.4 | No | Strengthening Multimodal Large Language Model wi... | 2024-03-13 | - |
| 88 | ASMv2 | 41.3 | No | The All-Seeing Project V2: Towards General Relat... | 2024-02-29 | Code |
| 89 | FocusLLaVA | 41.3 | No | FocusLLaVA: A Coarse-to-Fine Approach for Effici... | 2024-11-21 | - |
| 90 | SeVa-13B | 41 | No | Self-Supervised Visual Preference Alignment | 2024-04-16 | Code |
| 91 | MM1.5-3B | 41 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 92 | LLaVA-1.5-7B (VG-S) | 40.4 | No | ProVision: Programmatically Scaling Vision-centr... | 2024-12-09 | Code |
| 93 | CoLLaVO | 40.3 | No | CoLLaVO: Crayon Large Language and Vision mOdel | 2024-02-17 | Code |
| 94 | SPHINX-2k | 40.2 | No | SPHINX: The Joint Mixing of Weights, Tasks, and ... | 2023-11-13 | Code |
| 95 | LLaVA-1.5 (LVIS-Instruct4V) | 40.2 | No | To See is to Believe: Prompting GPT-4V for Bette... | 2023-11-13 | Code |
| 96 | mPLUG-Owl3 | 40.1 | No | mPLUG-Owl3: Towards Long Image-Sequence Understa... | 2024-08-09 | Code |
| 97 | Mono-InternVL-2B | 40.1 | No | Mono-InternVL: Pushing the Boundaries of Monolit... | 2024-10-10 | - |
| 98 | LLaVA1.5-13B-MDA | 39.9 | No | Looking Beyond Text: Reducing Language bias in L... | 2024-11-21 | - |
| 99 | LLaVA-VT (Vicuna-13B) | 39.8 | No | Beyond Embeddings: The Promise of Visual Table i... | 2024-03-27 | Code |
| 100 | MM1.5-1B-MoE | 39.8 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 101 | Janus-Pro-1B | 39.8 | No | Janus-Pro: Unified Multimodal Understanding and ... | 2025-01-29 | Code |
| 102 | SQ-LLaVA∗ | 39.7 | No | SQ-LLaVA: Self-Questioning for Large Vision-Lang... | 2024-03-17 | Code |
| 103 | OmniFusion (grid split + ruDocVQA) | 39.4 | No | OmniFusion Technical Report | 2024-04-09 | - |
| 104 | DeepStack-L-HD (Vicuna-13B) | 39.3 | No | DeepStack: Deeply Stacking Visual Tokens is Surp... | 2024-06-06 | - |
| 105 | LAF-13B | 38.9 | No | From Training-Free to Adaptive: Empirical Insigh... | 2024-01-31 | - |
| 106 | InfiMM-HD | 38.9 | No | InfiMM-HD: A Leap Forward in High-Resolution Mul... | 2024-03-03 | - |
| 107 | InternLM-XC2 + MMDU-45k | 38.8 | No | MMDU: A Multi-Turn Multi-Image Dialog Understand... | 2024-06-17 | Code |
| 108 | LLaVA-1.5-7B (DC-S) | 38.5 | No | ProVision: Programmatically Scaling Vision-centr... | 2024-12-09 | Code |
| 109 | LayoutLMv3+ConvNeXt+CLIP | 38.4 | No | MouSi: Poly-Visual-Expert Vision-Language Models | 2024-01-30 | Code |
| 110 | VOLCANO 13B | 38 | No | Volcano: Mitigating Multimodal Hallucination thr... | 2023-11-13 | Code |
| 111 | LLaVA-1.5+MMInstruct (Vicuna-13B) | 37.9 | No | MMInstruct: A High-Quality Multi-Modal Instructi... | 2024-07-22 | Code |
| 112 | LLaVA-1.5-13B (+CSR) | 37.8 | No | Calibrated Self-Rewarding Vision Language Models | 2024-05-23 | Code |
| 113 | LLaVA-1.5-LLaMA3-8B | 37.8 | No | What If We Recaption Billions of Web Images with... | 2024-06-12 | - |
| 114 | LLaVA-1.5 + DenseFusion-1M (Vicuna-7B) | 37.8 | No | DenseFusion-1M: Merging Vision Experts for Compr... | 2024-07-11 | Code |
| 115 | ShareGPT4V-7B | 37.6 | No | ShareGPT4V: Improving Large Multi-Modal Models w... | 2023-11-21 | Code |
| 116 | LLaVA-1.5+CoS | 37.6 | No | Chain-of-Spot: Interactive Reasoning Improves La... | 2024-03-19 | Code |
| 117 | LLaVA-COCO-13B | 37.5 | No | COCO is "ALL" You Need for Visual Instruction F... | 2024-01-17 | - |
| 118 | LLaVA-S^2 + DenseFusion-1M (Vicuna-7B) | 37.5 | No | DenseFusion-1M: Merging Vision Experts for Compr... | 2024-07-11 | Code |
| 119 | MM1.5-1B | 37.4 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 120 | Dynamic-LLaVA-13B | 37.3 | No | Dynamic-LLaVA: Efficient Multimodal Large Langua... | 2024-12-01 | Code |
| 121 | SeVa-7B | 37.2 | No | Self-Supervised Visual Preference Alignment | 2024-04-16 | Code |
| 122 | SoM-LLaVA-1.5-T | 37.2 | No | List Items One by One: A New Data Source and Lea... | 2024-04-25 | Code |
| 123 | Emu3 | 37.2 | No | Emu3: Next-Token Prediction is All You Need | 2024-09-27 | Code |
| 124 | LLaVA-Instruct (Vicuna-1.5-13B) | 37.1 | No | MM-Instruct: Generated Visual Instructions for L... | 2024-06-28 | Code |
| 125 | ILLUME | 37 | No | ILLUME: Illuminating Your LLMs to See, Draw, and... | 2024-12-09 | - |
| 126 | LLaVA1.5-7B-BPO | 36.8 | No | Strengthening Multimodal Large Language Model wi... | 2024-03-13 | - |
| 127 | LLaVA-1.5-13B (+ MMFuser) | 36.6 | No | MMFuser: Multimodal Multi-Layer Feature Fuser fo... | 2024-10-15 | Code |
| 128 | CaMML-13B | 36.4 | No | CaMML: Context-Aware Multimodal Learner for Larg... | 2024-01-06 | Code |
| 129 | LLaVA-65B (Data Mixing) | 36.4 | No | An Empirical Study of Scaling Instruct-Tuned Lar... | 2023-09-18 | Code |
| 130 | Vary-base | 36.2 | No | Vary: Scaling up the Vision Vocabulary for Large... | 2023-12-11 | Code |
| 131 | StableLLaVA | 36.1 | No | StableLLaVA: Enhanced Visual Instruction Tuning ... | 2023-08-20 | Code |
| 132 | DreamLLM-7B | 35.9 | No | DreamLLM: Synergistic Multimodal Comprehension a... | 2023-09-20 | Code |
| 133 | MoE-LLaVA-2.7B×4-Top2 | 35.9 | No | MoE-LLaVA: Mixture of Experts for Large Vision-L... | 2024-01-29 | Code |
| 134 | SoM-LLaVA-1.5 | 35.9 | No | List Items One by One: A New Data Source and Lea... | 2024-04-25 | Code |
| 135 | Dragonfly (Llama3-8B) | 35.9 | No | Dragonfly: Multi-Resolution Zoom-In Encoding Enh... | 2024-06-03 | Code |
| 136 | Ferret-v2-13B | 35.7 | No | Ferret-v2: An Improved Baseline for Referring an... | 2024-04-11 | Code |
| 137 | AlignGPT (Vicuna-13B) | 35.6 | No | AlignGPT: Multi-modal Large Language Models with... | 2024-05-23 | - |
| 138 | LLaVA-HR-X | 35.5 | No | Feast Your Eyes: Mixture-of-Resolution Adaptatio... | 2024-03-05 | Code |
| 139 | SQ-LLaVA | 35.5 | No | SQ-LLaVA: Self-Questioning for Large Vision-Lang... | 2024-03-17 | Code |
| 140 | LOVA^3 | 35.2 | No | LOVA3: Learning to Visual Question Answering, As... | 2024-05-23 | Code |
| 141 | LLaVA-InternLM2-7B-ViT + MoSLoRA | 35.2 | No | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Code |
| 142 | InternLM2+ViT (QMoSLoRA) | 35.2 | No | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Code |
| 143 | LLaVA1.5-7B-MDA | 35.2 | No | Looking Beyond Text: Reducing Language bias in L... | 2024-11-21 | - |
| 144 | Mipha-3B+ | 35.1 | No | Rethinking Visual Prompting for Multimodal Large... | 2024-07-05 | - |
| 145 | Merlin | 34.9 | No | Merlin: Empowering Multimodal LLMs with Foresight... | 2023-11-30 | - |
| 146 | Arcana | 34.8 | No | Improving Multi-modal Large Language Model throu... | 2024-10-17 | - |
| 147 | INF-LLaVA | 34.5 | No | INF-LLaVA: Dual-perspective Perception for High-... | 2024-07-23 | Code |
| 148 | LLaVA-1.5+MMInstruct (Vicuna-7B) | 34.4 | No | MMInstruct: A High-Quality Multi-Modal Instructi... | 2024-07-22 | Code |
| 149 | Janus | 34.3 | No | Janus: Decoupling Visual Encoding for Unified Mu... | 2024-10-17 | Code |
| 150 | LLaVA-TokenPacker (Vicuna-13B) | 34.1 | No | TokenPacker: Efficient Visual Projector for Mult... | 2024-07-02 | Code |
| 151 | γ-MoD-LLaVA-HR | 34 | No | γ-MoD: Exploring Mixture-of-Depth Adaptation f... | 2024-10-17 | - |
| 152 | LLaVA-1.5-7B (CSR) | 33.9 | No | Calibrated Self-Rewarding Vision Language Models | 2024-05-23 | Code |
| 153 | DynMOE-LLaVA | 33.6 | No | Dynamic Mixture of Experts: An Auto-Tuning Appro... | 2024-05-23 | Code |
| 154 | Imp-2B | 33.5 | No | Imp: Highly Capable Large Multimodal Models for ... | 2024-05-20 | Code |
| 155 | InfMLLM-7B-Chat | 33.4 | No | InfMLLM: A Unified Framework for Visual-Language... | 2023-11-12 | Code |
| 156 | Video-LaVIT | 33.2 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 157 | LLaVA-Instruct (Vicuna-1.5-7B) | 32.9 | No | MM-Instruct: Generated Visual Instructions for L... | 2024-06-28 | Code |
| 158 | VisionZip (Retain 128 Tokens, fine-tuning) | 32.9 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 159 | Uni-MoE | 32.8 | No | Uni-MoE: Scaling Unified Multimodal LLMs with Mi... | 2024-05-18 | Code |
| 160 | VL-Mamba (Mamba LLM-2.8B) | 32.6 | No | VL-Mamba: Exploring State Space Models for Multi... | 2024-03-20 | - |
| 161 | LLaVA-v1.5 (7B, w/ STIC) | 32.6 | No | Enhancing Large Vision Language Models with Self... | 2024-05-30 | Code |
| 162 | VisionZip (Retain 192 Tokens, fine-tuning) | 32.6 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 163 | VisionZip (Retain 128 Tokens) | 32.6 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 164 | LLaVA-v1.5 (+MoCa) | 32.2 | No | Deciphering Cross-Modal Alignment in Large Visio... | 2024-10-09 | Code |
| 165 | Dynamic-LLaVA-7B | 32.2 | No | Dynamic-LLaVA: Efficient Multimodal Large Langua... | 2024-12-01 | Code |
| 166 | Mipha-3B | 32.1 | No | Mipha: A Comprehensive Overhaul of Multimodal As... | 2024-03-10 | Code |
| 167 | VOLCANO 7B | 32 | No | Volcano: Mitigating Multimodal Hallucination thr... | 2023-11-13 | Code |
| 168 | Video-LLaVA | 32 | No | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 169 | TinyLLaVA-share-Sig-Ph | 32 | No | TinyLLaVA: A Framework of Small-scale Large Mult... | 2024-02-22 | Code |
| 170 | LLaVA-VT (Vicuna-7B) | 31.8 | No | Beyond Embeddings: The Promise of Visual Table i... | 2024-03-27 | Code |
| 171 | VisionZip (Retain 192 Tokens) | 31.7 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 172 | VisionZip (Retain 64 Tokens) | 31.7 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 173 | LLaVA-1.5-7B (+ SIMA) | 31.6 | No | Enhancing Visual-Language Modality Alignment in ... | 2024-05-24 | Code |
| 174 | MiCo-Chat-7B | 31.4 | No | Explore the Limits of Omni-modal Pretraining at ... | 2024-06-13 | Code |
| 175 | LLaVA-1.5-7B + TeamLoRA | 31.2 | No | TeamLoRA: Boosting Low-Rank Adaptation with Expe... | 2024-08-19 | Code |
| 176 | RoboCodeX-13B | 31 | No | RoboCodeX: Multimodal Code Generation for Roboti... | 2024-02-25 | - |
| 177 | HyperLLaVA | 31 | No | HyperLLaVA: Dynamic Visual and Language Expert T... | 2024-03-20 | Code |
| 178 | FAST (Vicuna-7B) | 31 | No | Visual Agents as Fast and Slow Thinkers | 2024-08-16 | Code |
| 179 | JanusFlow | 30.9 | No | JanusFlow: Harmonizing Autoregression and Rectif... | 2024-11-12 | Code |
| 180 | AlignGPT (Vicuna-7B) | 30.8 | No | AlignGPT: Multi-modal Large Language Models with... | 2024-05-23 | - |
| 181 | LLaVolta | 30.7 | No | Efficient Large Multi-modal Models via Visual Co... | 2024-06-28 | Code |
| 182 | LLaVA-AlignedVQ | 30.7 | No | Aligned Vector Quantization for Edge-Cloud Colla... | 2024-11-08 | - |
| 183 | LLaVA-1.5-HACL | 30.4 | No | Hallucination Augmented Contrastive Learning for... | 2023-12-12 | Code |
| 184 | MaVEn | 30.4 | No | MaVEn: An Effective Multi-granularity Hybrid Vis... | 2024-08-22 | - |
| 185 | VisionZip (Retain 64 Tokens, fine-tuning) | 30.2 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 186 | H2OVL-Mississippi-0.8B | 30 | No | H2OVL-Mississippi Vision Language Models Technic... | 2024-10-17 | - |
| 187 | RoboMamba | 29.7 | No | RoboMamba: Efficient Vision-Language-Action Mode... | 2024-06-06 | - |
| 188 | LLaVA-TokenPacker (Vicuna-7B) | 29.6 | No | TokenPacker: Efficient Visual Projector for Mult... | 2024-07-02 | Code |
| 189 | OneLLM-7B | 29.1 | No | OneLLM: One Framework to Align All Modalities wi... | 2023-12-06 | Code |
| 190 | LLaVA-OneVision-0.5B | 29.1 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 191 | Vary-toy | 29 | No | Small Language Model Meets with Reinforced Visio... | 2024-01-23 | - |
| 192 | LLaVA-Phi | 28.9 | No | LLaVA-Phi: Efficient Multi-Modal Assistant with ... | 2024-01-04 | Code |
| 193 | MMAR-7B | 27.8 | No | MMAR: Towards Lossless Multi-Modal Auto-Regressi... | 2024-10-14 | - |
| 194 | SEAL (7B) | 27.7 | No | V*: Guided Visual Search as a Core Mechanism in ... | 2023-12-21 | Code |
| 195 | OtterHD-8B | 26.3 | No | OtterHD: A High-Resolution Multi-modality Model | 2023-11-07 | Code |
| 196 | TGA-7B | 25.6 | No | Cross-Modal Safety Mechanism Transfer in Large V... | 2024-10-16 | - |
| 197 | LinVT | 23.5 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 198 | Xmodel-VLM (Xmodel-LM 1.1B) | 21.8 | No | Xmodel-VLM: A Simple Baseline for Multimodal Vis... | 2024-05-15 | Code |
| 199 | TextBind | 19.4 | No | TextBind: Multi-turn Interleaved Multimodal Inst... | 2023-09-14 | Code |
| 200 | MMAR-0.5B | 18.49 | No | MMAR: Towards Lossless Multi-Modal Auto-Regressi... | 2024-10-14 | - |