Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GPT-4 Technical Report

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O'Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph

2023-03-15 | Preprint 2023 3

Tasks: Spatial Reasoning, Question Answering, Few-Shot Learning, Math, Object Rearrangement, Multi-task Language Understanding, MMR total, Long-Context Understanding, Sentence Completion, Only Connect Walls Dataset Task 1 (Grouping), Common Sense Reasoning, answerability prediction, Arithmetic Reasoning, Bug fixing, Factual Inconsistency Detection in Chart Captioning, Code Generation, Visual Question Answering (VQA), Zero-Shot Learning, FS-MEVQA, Visual Question Answering, Image Retrieval

Abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
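The last claim in the abstract, predicting a large run's performance from much smaller runs, can be illustrated with a minimal sketch. The abstract does not state a functional form, so this example assumes a pure power law, loss = a * compute^b, fitted by least squares in log-log space. The function names (`fit_power_law`, `predict`), the constants, and all data points are hypothetical and synthetic; none of them come from the report.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**b via least squares on log(loss) vs log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b


def predict(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** b


# Synthetic, noise-free "small runs" (assumption: generated from a known law
# so the fit can be checked). Compute is expressed as a fraction of the
# target run, and every run uses at most 1/1,000th of the target compute.
TRUE_A, TRUE_B = 5.0, -0.1
small_compute = [1e-6, 1e-5, 1e-4, 1e-3]
small_loss = [TRUE_A * c ** TRUE_B for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
predicted_full = predict(a, b, 1.0)      # extrapolate to the full-size run

print(f"fitted a={a:.4f}, b={b:.4f}, predicted loss at full compute={predicted_full:.4f}")
```

With real measurements the points would be noisy and the extrapolation would carry uncertainty; a fit that also includes a constant offset term would need a nonlinear optimizer rather than this closed-form regression.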

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Few-Shot Learning | MedConceptsQA | Accuracy | 61.911 | gpt-4-0125-preview |
| Zero-Shot Learning | MedConceptsQA | Accuracy | 52.489 | gpt-4-0125-preview |
| Transfer Learning | MML | Average (%) | 70 | GPT-3.5 Turbo |
| Question Answering | PeerQA | AlignScore | 0.1224 | GPT-4o-2024-08-06-128k |
| Question Answering | PeerQA | Prometheus-2 Answer Correctness | 3.4612 | GPT-4o-2024-08-06-128k |
| Question Answering | PeerQA | Rouge-L | 0.2266 | GPT-4o-2024-08-06-128k |
| Question Answering | TruthfulQA | MC1 | 0.59 | GPT-4 (RLHF) |
| Question Answering | TIQ | P@1 | 28.6 | Gpt-4 |
| Question Answering | DROP Test | F1 | 80.9 | GPT-4 (few-shot, k=3) |
| Question Answering | DROP Test | F1 | 64.1 | GPT 3.5 (few-shot, k=3) |
| Question Answering | TriviaQA | EM | 84.8 | GPT-4-0613 (Zero-shot) |
| Visual Question Answering (VQA) | CORE-MM | Abductive | 77.88 | GPT-4V |
| Visual Question Answering (VQA) | CORE-MM | Analogical | 69.86 | GPT-4V |
| Visual Question Answering (VQA) | CORE-MM | Deductive | 74.86 | GPT-4V |
| Visual Question Answering (VQA) | CORE-MM | Overall score | 74.44 | GPT-4V |
| Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 77.88 | GPT-4V |
| Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 69.86 | GPT-4V |
| Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 74.86 | GPT-4V |
| Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 74.44 | GPT-4V |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 60.7 | GPT-4V-turbo-detail:high (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 59.9 | GPT-4V-turbo-detail:high (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 52.8 | GPT-4V-turbo-detail:low (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 51.4 | GPT-4V-turbo-detail:low (Visual Prompt) |
| Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 58.37 | GPT-4V |
| Visual Question Answering (VQA) | EmbSpatial-Bench | Generation | 36.07 | GPT-4V |
| Visual Question Answering (VQA) | SME | #Learning Samples (N) | 16 | GPT-4-1106-Vision-Preview |
| Visual Question Answering (VQA) | SME | ACC | 42.3 | GPT-4-1106-Vision-Preview |
| Visual Question Answering (VQA) | SME | BLEU-4 | 45.51 | GPT-4-1106-Vision-Preview |
| Visual Question Answering (VQA) | SME | CIDEr | 269.68 | GPT-4-1106-Vision-Preview |
| Visual Question Answering (VQA) | SME | Detection | 7 | GPT-4-1106-Vision-Preview |
| Visual Question Answering (VQA) | SME | METEOR | 35.17 | GPT-4-1106-Vision-Preview |
| Visual Question Answering (VQA) | SME | ROUGE-L | 52.67 | GPT-4-1106-Vision-Preview |
| Visual Question Answering (VQA) | SME | SPICE | 37.67 | GPT-4-1106-Vision-Preview |
| Common Sense Reasoning | WinoGrande | Accuracy | 87.5 | GPT-4 (5-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 81.6 | GPT-3.5 (5-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 96.4 | GPT-4 (few-shot, k=25) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 85.2 | GPT-3.5 (few-shot, k=25) |
| Clustering | OCW | Wasserstein Distance (WD) | 72.9 | GPT-4 (5-shot) |
| Clustering | OCW | # Correct Groups | 269 | GPT-4 (5-shot) |
| Clustering | OCW | # Solved Walls | 7 | GPT-4 (5-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 32.8 | GPT-4 (5-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 29.1 | GPT-4 (5-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 43.4 | GPT-4 (5-shot) |
| Clustering | OCW | Wasserstein Distance (WD) | 73.4 | GPT-4 (1-shot) |
| Clustering | OCW | # Correct Groups | 262 | GPT-4 (1-shot) |
| Clustering | OCW | # Solved Walls | 4 | GPT-4 (1-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 33.5 | GPT-4 (1-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 29.7 | GPT-4 (1-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 43.7 | GPT-4 (1-shot) |
| Clustering | OCW | Wasserstein Distance (WD) | 73.6 | GPT-4 (100-shot) |
| Clustering | OCW | # Correct Groups | 249 | GPT-4 (100-shot) |
| Clustering | OCW | # Solved Walls | 3 | GPT-4 (100-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 32.3 | GPT-4 (100-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 28.5 | GPT-4 (100-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 42.8 | GPT-4 (100-shot) |
| Clustering | OCW | Wasserstein Distance (WD) | 73.7 | GPT-4 (3-shot) |
| Clustering | OCW | # Correct Groups | 272 | GPT-4 (3-shot) |
| Clustering | OCW | # Solved Walls | 5 | GPT-4 (3-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 33.6 | GPT-4 (3-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 29.9 | GPT-4 (3-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 43.9 | GPT-4 (3-shot) |
| Clustering | OCW | Wasserstein Distance (WD) | 75.8 | GPT-4 (0-shot) |
| Clustering | OCW | # Correct Groups | 239 | GPT-4 (0-shot) |
| Clustering | OCW | # Solved Walls | 6 | GPT-4 (0-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 30.7 | GPT-4 (0-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 27.2 | GPT-4 (0-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 41.5 | GPT-4 (0-shot) |
| Clustering | OCW | Wasserstein Distance (WD) | 80.6 | GPT-3.5-turbo (5-shot) |
| Clustering | OCW | # Correct Groups | 149 | GPT-3.5-turbo (5-shot) |
| Clustering | OCW | # Solved Walls | 2 | GPT-3.5-turbo (5-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 25.4 | GPT-3.5-turbo (5-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 22 | GPT-3.5-turbo (5-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 37.3 | GPT-3.5-turbo (5-shot) |
| Clustering | OCW | Wasserstein Distance (WD) | 80.9 | GPT-3.5-turbo (3-shot) |
| Clustering | OCW | # Correct Groups | 140 | GPT-3.5-turbo (3-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 24.7 | GPT-3.5-turbo (3-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 21.3 | GPT-3.5-turbo (3-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 36.8 | GPT-3.5-turbo (3-shot) |
| Clustering | OCW | Wasserstein Distance (WD) | 81.2 | GPT-3.5-turbo (10-shot) |
| Clustering | OCW | # Correct Groups | 137 | GPT-3.5-turbo (10-shot) |
| Clustering | OCW | # Solved Walls | 2 | GPT-3.5-turbo (10-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 24 | GPT-3.5-turbo (10-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 20.4 | GPT-3.5-turbo (10-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 36.1 | GPT-3.5-turbo (10-shot) |
| Clustering | OCW | Wasserstein Distance (WD) | 82.3 | GPT-3.5-turbo (1-shot) |
| Clustering | OCW | # Correct Groups | 123 | GPT-3.5-turbo (1-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 21.2 | GPT-3.5-turbo (1-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 18.2 | GPT-3.5-turbo (1-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 34.4 | GPT-3.5-turbo (1-shot) |
| Clustering | OCW | Wasserstein Distance (WD) | 82.5 | GPT-3.5-turbo (0-shot) |
| Clustering | OCW | # Correct Groups | 114 | GPT-3.5-turbo (0-shot) |
| Clustering | OCW | Adjusted Mutual Information (AMI) | 21.6 | GPT-3.5-turbo (0-shot) |
| Clustering | OCW | Adjusted Rand Index (ARI) | 18.4 | GPT-3.5-turbo (0-shot) |
| Clustering | OCW | Fowlkes Mallows Score (FMS) | 34 | GPT-3.5-turbo (0-shot) |
| Meta-Learning | MedConceptsQA | Accuracy | 61.911 | gpt-4-0125-preview |
| Multi-Task Learning | MML | Average (%) | 70 | GPT-3.5 Turbo |
| Sentence Completion | HellaSwag | Accuracy | 95.3 | GPT-4 (10-shot) |
| Sentence Completion | HellaSwag | Accuracy | 85.5 | GPT-3.5 (10-shot) |
| Arithmetic Reasoning | GSM8K | Accuracy | 57.1 | GPT-3.5 (few-shot, k=5) |
| Legal Reasoning | LegalBench (Rule-recall) | Balanced Accuracy | 59.2 | GPT-4 |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 72.9 | GPT-4 (5-shot) |
| Constrained Clustering | OCW | # Correct Groups | 269 | GPT-4 (5-shot) |
| Constrained Clustering | OCW | # Solved Walls | 7 | GPT-4 (5-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 32.8 | GPT-4 (5-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 29.1 | GPT-4 (5-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 43.4 | GPT-4 (5-shot) |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 73.4 | GPT-4 (1-shot) |
| Constrained Clustering | OCW | # Correct Groups | 262 | GPT-4 (1-shot) |
| Constrained Clustering | OCW | # Solved Walls | 4 | GPT-4 (1-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 33.5 | GPT-4 (1-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 29.7 | GPT-4 (1-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 43.7 | GPT-4 (1-shot) |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 73.6 | GPT-4 (100-shot) |
| Constrained Clustering | OCW | # Correct Groups | 249 | GPT-4 (100-shot) |
| Constrained Clustering | OCW | # Solved Walls | 3 | GPT-4 (100-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 32.3 | GPT-4 (100-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 28.5 | GPT-4 (100-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 42.8 | GPT-4 (100-shot) |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 73.7 | GPT-4 (3-shot) |
| Constrained Clustering | OCW | # Correct Groups | 272 | GPT-4 (3-shot) |
| Constrained Clustering | OCW | # Solved Walls | 5 | GPT-4 (3-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 33.6 | GPT-4 (3-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 29.9 | GPT-4 (3-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 43.9 | GPT-4 (3-shot) |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 75.8 | GPT-4 (0-shot) |
| Constrained Clustering | OCW | # Correct Groups | 239 | GPT-4 (0-shot) |
| Constrained Clustering | OCW | # Solved Walls | 6 | GPT-4 (0-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 30.7 | GPT-4 (0-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 27.2 | GPT-4 (0-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 41.5 | GPT-4 (0-shot) |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 80.6 | GPT-3.5-turbo (5-shot) |
| Constrained Clustering | OCW | # Correct Groups | 149 | GPT-3.5-turbo (5-shot) |
| Constrained Clustering | OCW | # Solved Walls | 2 | GPT-3.5-turbo (5-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 25.4 | GPT-3.5-turbo (5-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 22 | GPT-3.5-turbo (5-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 37.3 | GPT-3.5-turbo (5-shot) |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 80.9 | GPT-3.5-turbo (3-shot) |
| Constrained Clustering | OCW | # Correct Groups | 140 | GPT-3.5-turbo (3-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 24.7 | GPT-3.5-turbo (3-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 21.3 | GPT-3.5-turbo (3-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 36.8 | GPT-3.5-turbo (3-shot) |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 81.2 | GPT-3.5-turbo (10-shot) |
| Constrained Clustering | OCW | # Correct Groups | 137 | GPT-3.5-turbo (10-shot) |
| Constrained Clustering | OCW | # Solved Walls | 2 | GPT-3.5-turbo (10-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 24 | GPT-3.5-turbo (10-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 20.4 | GPT-3.5-turbo (10-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 36.1 | GPT-3.5-turbo (10-shot) |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 82.3 | GPT-3.5-turbo (1-shot) |
| Constrained Clustering | OCW | # Correct Groups | 123 | GPT-3.5-turbo (1-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 21.2 | GPT-3.5-turbo (1-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 18.2 | GPT-3.5-turbo (1-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 34.4 | GPT-3.5-turbo (1-shot) |
| Constrained Clustering | OCW | Wasserstein Distance (WD) | 82.5 | GPT-3.5-turbo (0-shot) |
| Constrained Clustering | OCW | # Correct Groups | 114 | GPT-3.5-turbo (0-shot) |
| Constrained Clustering | OCW | Adjusted Mutual Information (AMI) | 21.6 | GPT-3.5-turbo (0-shot) |
| Constrained Clustering | OCW | Adjusted Rand Index (ARI) | 18.4 | GPT-3.5-turbo (0-shot) |
| Constrained Clustering | OCW | Fowlkes Mallows Score (FMS) | 34 | GPT-3.5-turbo (0-shot) |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LLM | Kendall's Tau-c | 0.205 | GPT-4V |
| Visual Question Answering | ViP-Bench | GPT-4 score (bbox) | 60.7 | GPT-4V-turbo-detail:high (Visual Prompt) |
| Visual Question Answering | ViP-Bench | GPT-4 score (human) | 59.9 | GPT-4V-turbo-detail:high (Visual Prompt) |
| Visual Question Answering | ViP-Bench | GPT-4 score (bbox) | 52.8 | GPT-4V-turbo-detail:low (Visual Prompt) |
| Visual Question Answering | ViP-Bench | GPT-4 score (human) | 51.4 | GPT-4V-turbo-detail:low (Visual Prompt) |
| Visual Question Answering | BenchLMM | GPT-3.5 score | 58.37 | GPT-4V |
| Visual Question Answering | EmbSpatial-Bench | Generation | 36.07 | GPT-4V |
| Visual Question Answering | SME | #Learning Samples (N) | 16 | GPT-4-1106-Vision-Preview |
| Visual Question Answering | SME | ACC | 42.3 | GPT-4-1106-Vision-Preview |
| Visual Question Answering | SME | BLEU-4 | 45.51 | GPT-4-1106-Vision-Preview |
| Visual Question Answering | SME | CIDEr | 269.68 | GPT-4-1106-Vision-Preview |
| Visual Question Answering | SME | Detection | 7 | GPT-4-1106-Vision-Preview |
| Visual Question Answering | SME | METEOR | 35.17 | GPT-4-1106-Vision-Preview |
| Visual Question Answering | SME | ROUGE-L | 52.67 | GPT-4-1106-Vision-Preview |
| Visual Question Answering | SME | SPICE | 37.67 | GPT-4-1106-Vision-Preview |
| Long-Context Understanding | MMNeedle | 1 Image, 2*2 Stitching, Exact Accuracy | 94.6 | GPT-4o |
| Long-Context Understanding | MMNeedle | 1 Image, 4*4 Stitching, Exact Accuracy | 83 | GPT-4o |
| Long-Context Understanding | MMNeedle | 1 Image, 8*8 Stitching, Exact Accuracy | 19 | GPT-4o |
| Long-Context Understanding | MMNeedle | 10 Images, 1*1 Stitching, Exact Accuracy | 97 | GPT-4o |
| Long-Context Understanding | MMNeedle | 10 Images, 2*2 Stitching, Exact Accuracy | 81.8 | GPT-4o |
| Long-Context Understanding | MMNeedle | 10 Images, 4*4 Stitching, Exact Accuracy | 26.9 | GPT-4o |
| Long-Context Understanding | MMNeedle | 10 Images, 8*8 Stitching, Exact Accuracy | 1 | GPT-4o |
| Long-Context Understanding | MMNeedle | 1 Image, 2*2 Stitching, Exact Accuracy | 86.09 | GPT-4V |
| Long-Context Understanding | MMNeedle | 1 Image, 4*4 Stitching, Exact Accuracy | 54.72 | GPT-4V |
| Long-Context Understanding | MMNeedle | 1 Image, 8*8 Stitching, Exact Accuracy | 7.3 | GPT-4V |
| Long-Context Understanding | MMNeedle | 10 Images, 1*1 Stitching, Exact Accuracy | 72.36 | GPT-4V |
| Long-Context Understanding | MMNeedle | 10 Images, 2*2 Stitching, Exact Accuracy | 34.24 | GPT-4V |
| Long-Context Understanding | MMNeedle | 10 Images, 4*4 Stitching, Exact Accuracy | 7.58 | GPT-4V |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 12k | 49.5 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 16k | 44 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 1k | 74 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 2k | 73.5 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 32k | 16 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 4k | 67.5 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 6k | 59.5 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 8k | 53.5 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 12k | 52 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 16k | 44.5 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 1k | 73.5 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 2k | 73.5 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 32k | 30 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 4k | 65.5 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 6k | 63 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | 8k | 56.5 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (TSort) | 128k | 6 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (TSort) | 16k | 3.5 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (TSort) | 2k | 18.5 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (TSort) | 32k | 6 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (TSort) | 4k | 15.5 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (TSort) | 64k | 6 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (TSort) | 8k | 7.5 | GPT-4-Turbo-1106 |
| Long-Context Understanding | Ada-LEval (TSort) | 128k | 2 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (TSort) | 16k | 5.5 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (TSort) | 2k | 15.5 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (TSort) | 32k | 2 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (TSort) | 4k | 16.5 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (TSort) | 64k | 4 | GPT-4-Turbo-0125 |
| Long-Context Understanding | Ada-LEval (TSort) | 8k | 8.5 | GPT-4-Turbo-0125 |
| answerability prediction | PeerQA | Macro F1 | 0.3087 | GPT-4o-2024-08-06 |
| Explanatory Visual Question Answering | SME | #Learning Samples (N) | 16 | GPT-4-1106-Vision-Preview |
| Explanatory Visual Question Answering | SME | ACC | 42.3 | GPT-4-1106-Vision-Preview |
| Explanatory Visual Question Answering | SME | BLEU-4 | 45.51 | GPT-4-1106-Vision-Preview |
| Explanatory Visual Question Answering | SME | CIDEr | 269.68 | GPT-4-1106-Vision-Preview |
| Explanatory Visual Question Answering | SME | Detection | 7 | GPT-4-1106-Vision-Preview |
| Explanatory Visual Question Answering | SME | METEOR | 35.17 | GPT-4-1106-Vision-Preview |
| Explanatory Visual Question Answering | SME | ROUGE-L | 52.67 | GPT-4-1106-Vision-Preview |
| Explanatory Visual Question Answering | SME | SPICE | 37.67 | GPT-4-1106-Vision-Preview |
| Object Rearrangement | Open6DOR V2 | pos-level0 | 39.1 | GPT-4V |
| Object Rearrangement | Open6DOR V2 | pos-level1 | 46.8 | GPT-4V |
| Object Rearrangement | Open6DOR V2 | rot-level0 | 9.1 | GPT-4V |
| Object Rearrangement | Open6DOR V2 | rot-level1 | 6.9 | GPT-4V |
| Object Rearrangement | Open6DOR V2 | rot-level2 | 11.7 | GPT-4V |

Related Papers

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)