GPT-4 Technical Report

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O'Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph


Abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
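The predictability claim in the abstract can be illustrated with a small extrapolation exercise. The sketch below is not the report's method: it assumes a pure power law loss = a * compute^b with no irreducible-loss term, fits it by ordinary least squares in log-log space, and uses synthetic points generated from made-up constants (TRUE_A, TRUE_B). The function names fit_power_law and predict_loss are hypothetical. Compute is expressed as a fraction of the target budget, so the largest small run sits at 1/1,000th of it.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) + b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space (the exponent)
    a = math.exp(mean_y - b * mean_x)  # intercept mapped back to linear space
    return a, b


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** b


# Synthetic small runs drawn from a known law; these are NOT GPT-4 measurements.
TRUE_A, TRUE_B = 10.0, -0.05
small_compute = [1e-6, 1e-5, 1e-4, 1e-3]  # fractions of the target compute
small_loss = [TRUE_A * c ** TRUE_B for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
predicted = predict_loss(a, b, 1.0)  # extrapolate 1,000x beyond the largest run
```

Because the synthetic points are noise-free, the fit recovers the generating constants; with real training runs the points would scatter and an irreducible-loss term would likely be needed.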

Results

Task	Dataset	Metric	Value	Model
Few-Shot Learning	MedConceptsQA	Accuracy	61.911	gpt-4-0125-preview
Zero-Shot Learning	MedConceptsQA	Accuracy	52.489	gpt-4-0125-preview
Transfer Learning	MML	Average (%)	70	GPT-3.5 Turbo
Question Answering	PeerQA	AlignScore	0.1224	GPT-4o-2024-08-06-128k
Question Answering	PeerQA	Prometheus-2 Answer Correctness	3.4612	GPT-4o-2024-08-06-128k
Question Answering	PeerQA	Rouge-L	0.2266	GPT-4o-2024-08-06-128k
Question Answering	TruthfulQA	MC1	0.59	GPT-4 (RLHF)
Question Answering	TIQ	P@1	28.6	GPT-4
Question Answering	DROP Test	F1	80.9	GPT-4 (few-shot, k=3)
Question Answering	DROP Test	F1	64.1	GPT-3.5 (few-shot, k=3)
Question Answering	TriviaQA	EM	84.8	GPT-4-0613 (Zero-shot)
Visual Question Answering (VQA)	CORE-MM	Abductive	77.88	GPT-4V
Visual Question Answering (VQA)	CORE-MM	Analogical	69.86	GPT-4V
Visual Question Answering (VQA)	CORE-MM	Deductive	74.86	GPT-4V
Visual Question Answering (VQA)	CORE-MM	Overall score	74.44	GPT-4V
Visual Question Answering (VQA)	InfiMM-Eval	Abductive	77.88	GPT-4V
Visual Question Answering (VQA)	InfiMM-Eval	Analogical	69.86	GPT-4V
Visual Question Answering (VQA)	InfiMM-Eval	Deductive	74.86	GPT-4V
Visual Question Answering (VQA)	InfiMM-Eval	Overall score	74.44	GPT-4V
Visual Question Answering (VQA)	ViP-Bench	GPT-4 score (bbox)	60.7	GPT-4V-turbo-detail:high (Visual Prompt)
Visual Question Answering (VQA)	ViP-Bench	GPT-4 score (human)	59.9	GPT-4V-turbo-detail:high (Visual Prompt)
Visual Question Answering (VQA)	ViP-Bench	GPT-4 score (bbox)	52.8	GPT-4V-turbo-detail:low (Visual Prompt)
Visual Question Answering (VQA)	ViP-Bench	GPT-4 score (human)	51.4	GPT-4V-turbo-detail:low (Visual Prompt)
Visual Question Answering (VQA)	BenchLMM	GPT-3.5 score	58.37	GPT-4V
Visual Question Answering (VQA)	EmbSpatial-Bench	Generation	36.07	GPT-4V
Visual Question Answering (VQA)	SME	#Learning Samples (N)	16	GPT-4-1106-Vision-Preview
Visual Question Answering (VQA)	SME	ACC	42.3	GPT-4-1106-Vision-Preview
Visual Question Answering (VQA)	SME	BLEU-4	45.51	GPT-4-1106-Vision-Preview
Visual Question Answering (VQA)	SME	CIDEr	269.68	GPT-4-1106-Vision-Preview
Visual Question Answering (VQA)	SME	Detection	7	GPT-4-1106-Vision-Preview
Visual Question Answering (VQA)	SME	METEOR	35.17	GPT-4-1106-Vision-Preview
Visual Question Answering (VQA)	SME	ROUGE-L	52.67	GPT-4-1106-Vision-Preview
Visual Question Answering (VQA)	SME	SPICE	37.67	GPT-4-1106-Vision-Preview
Common Sense Reasoning	WinoGrande	Accuracy	87.5	GPT-4 (5-shot)
Common Sense Reasoning	WinoGrande	Accuracy	81.6	GPT-3.5 (5-shot)
Common Sense Reasoning	ARC (Challenge)	Accuracy	96.4	GPT-4 (few-shot, k=25)
Common Sense Reasoning	ARC (Challenge)	Accuracy	85.2	GPT-3.5 (few-shot, k=25)
Clustering	OCW	Wasserstein Distance (WD)	72.9	GPT-4 (5-shot)
Clustering	OCW	# Correct Groups	269	GPT-4 (5-shot)
Clustering	OCW	# Solved Walls	7	GPT-4 (5-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	32.8	GPT-4 (5-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	29.1	GPT-4 (5-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	43.4	GPT-4 (5-shot)
Clustering	OCW	Wasserstein Distance (WD)	73.4	GPT-4 (1-shot)
Clustering	OCW	# Correct Groups	262	GPT-4 (1-shot)
Clustering	OCW	# Solved Walls	4	GPT-4 (1-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	33.5	GPT-4 (1-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	29.7	GPT-4 (1-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	43.7	GPT-4 (1-shot)
Clustering	OCW	Wasserstein Distance (WD)	73.6	GPT-4 (100-shot)
Clustering	OCW	# Correct Groups	249	GPT-4 (100-shot)
Clustering	OCW	# Solved Walls	3	GPT-4 (100-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	32.3	GPT-4 (100-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	28.5	GPT-4 (100-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	42.8	GPT-4 (100-shot)
Clustering	OCW	Wasserstein Distance (WD)	73.7	GPT-4 (3-shot)
Clustering	OCW	# Correct Groups	272	GPT-4 (3-shot)
Clustering	OCW	# Solved Walls	5	GPT-4 (3-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	33.6	GPT-4 (3-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	29.9	GPT-4 (3-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	43.9	GPT-4 (3-shot)
Clustering	OCW	Wasserstein Distance (WD)	75.8	GPT-4 (0-shot)
Clustering	OCW	# Correct Groups	239	GPT-4 (0-shot)
Clustering	OCW	# Solved Walls	6	GPT-4 (0-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	30.7	GPT-4 (0-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	27.2	GPT-4 (0-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	41.5	GPT-4 (0-shot)
Clustering	OCW	Wasserstein Distance (WD)	80.6	GPT-3.5-turbo (5-shot)
Clustering	OCW	# Correct Groups	149	GPT-3.5-turbo (5-shot)
Clustering	OCW	# Solved Walls	2	GPT-3.5-turbo (5-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	25.4	GPT-3.5-turbo (5-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	22	GPT-3.5-turbo (5-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	37.3	GPT-3.5-turbo (5-shot)
Clustering	OCW	Wasserstein Distance (WD)	80.9	GPT-3.5-turbo (3-shot)
Clustering	OCW	# Correct Groups	140	GPT-3.5-turbo (3-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	24.7	GPT-3.5-turbo (3-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	21.3	GPT-3.5-turbo (3-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	36.8	GPT-3.5-turbo (3-shot)
Clustering	OCW	Wasserstein Distance (WD)	81.2	GPT-3.5-turbo (10-shot)
Clustering	OCW	# Correct Groups	137	GPT-3.5-turbo (10-shot)
Clustering	OCW	# Solved Walls	2	GPT-3.5-turbo (10-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	24	GPT-3.5-turbo (10-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	20.4	GPT-3.5-turbo (10-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	36.1	GPT-3.5-turbo (10-shot)
Clustering	OCW	Wasserstein Distance (WD)	82.3	GPT-3.5-turbo (1-shot)
Clustering	OCW	# Correct Groups	123	GPT-3.5-turbo (1-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	21.2	GPT-3.5-turbo (1-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	18.2	GPT-3.5-turbo (1-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	34.4	GPT-3.5-turbo (1-shot)
Clustering	OCW	Wasserstein Distance (WD)	82.5	GPT-3.5-turbo (0-shot)
Clustering	OCW	# Correct Groups	114	GPT-3.5-turbo (0-shot)
Clustering	OCW	Adjusted Mutual Information (AMI)	21.6	GPT-3.5-turbo (0-shot)
Clustering	OCW	Adjusted Rand Index (ARI)	18.4	GPT-3.5-turbo (0-shot)
Clustering	OCW	Fowlkes Mallows Score (FMS)	34	GPT-3.5-turbo (0-shot)
Meta-Learning	MedConceptsQA	Accuracy	61.911	gpt-4-0125-preview
Multi-Task Learning	MML	Average (%)	70	GPT-3.5 Turbo
Sentence Completion	HellaSwag	Accuracy	95.3	GPT-4 (10-shot)
Sentence Completion	HellaSwag	Accuracy	85.5	GPT-3.5 (10-shot)
Arithmetic Reasoning	GSM8K	Accuracy	57.1	GPT-3.5 (few-shot, k=5)
Legal Reasoning	LegalBench (Rule-recall)	Balanced Accuracy	59.2	GPT-4
Constrained Clustering	OCW	Wasserstein Distance (WD)	72.9	GPT-4 (5-shot)
Constrained Clustering	OCW	# Correct Groups	269	GPT-4 (5-shot)
Constrained Clustering	OCW	# Solved Walls	7	GPT-4 (5-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	32.8	GPT-4 (5-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	29.1	GPT-4 (5-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	43.4	GPT-4 (5-shot)
Constrained Clustering	OCW	Wasserstein Distance (WD)	73.4	GPT-4 (1-shot)
Constrained Clustering	OCW	# Correct Groups	262	GPT-4 (1-shot)
Constrained Clustering	OCW	# Solved Walls	4	GPT-4 (1-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	33.5	GPT-4 (1-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	29.7	GPT-4 (1-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	43.7	GPT-4 (1-shot)
Constrained Clustering	OCW	Wasserstein Distance (WD)	73.6	GPT-4 (100-shot)
Constrained Clustering	OCW	# Correct Groups	249	GPT-4 (100-shot)
Constrained Clustering	OCW	# Solved Walls	3	GPT-4 (100-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	32.3	GPT-4 (100-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	28.5	GPT-4 (100-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	42.8	GPT-4 (100-shot)
Constrained Clustering	OCW	Wasserstein Distance (WD)	73.7	GPT-4 (3-shot)
Constrained Clustering	OCW	# Correct Groups	272	GPT-4 (3-shot)
Constrained Clustering	OCW	# Solved Walls	5	GPT-4 (3-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	33.6	GPT-4 (3-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	29.9	GPT-4 (3-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	43.9	GPT-4 (3-shot)
Constrained Clustering	OCW	Wasserstein Distance (WD)	75.8	GPT-4 (0-shot)
Constrained Clustering	OCW	# Correct Groups	239	GPT-4 (0-shot)
Constrained Clustering	OCW	# Solved Walls	6	GPT-4 (0-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	30.7	GPT-4 (0-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	27.2	GPT-4 (0-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	41.5	GPT-4 (0-shot)
Constrained Clustering	OCW	Wasserstein Distance (WD)	80.6	GPT-3.5-turbo (5-shot)
Constrained Clustering	OCW	# Correct Groups	149	GPT-3.5-turbo (5-shot)
Constrained Clustering	OCW	# Solved Walls	2	GPT-3.5-turbo (5-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	25.4	GPT-3.5-turbo (5-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	22	GPT-3.5-turbo (5-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	37.3	GPT-3.5-turbo (5-shot)
Constrained Clustering	OCW	Wasserstein Distance (WD)	80.9	GPT-3.5-turbo (3-shot)
Constrained Clustering	OCW	# Correct Groups	140	GPT-3.5-turbo (3-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	24.7	GPT-3.5-turbo (3-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	21.3	GPT-3.5-turbo (3-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	36.8	GPT-3.5-turbo (3-shot)
Constrained Clustering	OCW	Wasserstein Distance (WD)	81.2	GPT-3.5-turbo (10-shot)
Constrained Clustering	OCW	# Correct Groups	137	GPT-3.5-turbo (10-shot)
Constrained Clustering	OCW	# Solved Walls	2	GPT-3.5-turbo (10-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	24	GPT-3.5-turbo (10-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	20.4	GPT-3.5-turbo (10-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	36.1	GPT-3.5-turbo (10-shot)
Constrained Clustering	OCW	Wasserstein Distance (WD)	82.3	GPT-3.5-turbo (1-shot)
Constrained Clustering	OCW	# Correct Groups	123	GPT-3.5-turbo (1-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	21.2	GPT-3.5-turbo (1-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	18.2	GPT-3.5-turbo (1-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	34.4	GPT-3.5-turbo (1-shot)
Constrained Clustering	OCW	Wasserstein Distance (WD)	82.5	GPT-3.5-turbo (0-shot)
Constrained Clustering	OCW	# Correct Groups	114	GPT-3.5-turbo (0-shot)
Constrained Clustering	OCW	Adjusted Mutual Information (AMI)	21.6	GPT-3.5-turbo (0-shot)
Constrained Clustering	OCW	Adjusted Rand Index (ARI)	18.4	GPT-3.5-turbo (0-shot)
Constrained Clustering	OCW	Fowlkes Mallows Score (FMS)	34	GPT-3.5-turbo (0-shot)
Factual Inconsistency Detection in Chart Captioning	CHOCOLATE-LLM	Kendall's Tau-c	0.205	GPT-4V
Visual Question Answering	ViP-Bench	GPT-4 score (bbox)	60.7	GPT-4V-turbo-detail:high (Visual Prompt)
Visual Question Answering	ViP-Bench	GPT-4 score (human)	59.9	GPT-4V-turbo-detail:high (Visual Prompt)
Visual Question Answering	ViP-Bench	GPT-4 score (bbox)	52.8	GPT-4V-turbo-detail:low (Visual Prompt)
Visual Question Answering	ViP-Bench	GPT-4 score (human)	51.4	GPT-4V-turbo-detail:low (Visual Prompt)
Visual Question Answering	BenchLMM	GPT-3.5 score	58.37	GPT-4V
Visual Question Answering	EmbSpatial-Bench	Generation	36.07	GPT-4V
Visual Question Answering	SME	#Learning Samples (N)	16	GPT-4-1106-Vision-Preview
Visual Question Answering	SME	ACC	42.3	GPT-4-1106-Vision-Preview
Visual Question Answering	SME	BLEU-4	45.51	GPT-4-1106-Vision-Preview
Visual Question Answering	SME	CIDEr	269.68	GPT-4-1106-Vision-Preview
Visual Question Answering	SME	Detection	7	GPT-4-1106-Vision-Preview
Visual Question Answering	SME	METEOR	35.17	GPT-4-1106-Vision-Preview
Visual Question Answering	SME	ROUGE-L	52.67	GPT-4-1106-Vision-Preview
Visual Question Answering	SME	SPICE	37.67	GPT-4-1106-Vision-Preview
Long-Context Understanding	MMNeedle	1 Image, 2*2 Stitching, Exact Accuracy	94.6	GPT-4o
Long-Context Understanding	MMNeedle	1 Image, 4*4 Stitching, Exact Accuracy	83	GPT-4o
Long-Context Understanding	MMNeedle	1 Image, 8*8 Stitching, Exact Accuracy	19	GPT-4o
Long-Context Understanding	MMNeedle	10 Images, 1*1 Stitching, Exact Accuracy	97	GPT-4o
Long-Context Understanding	MMNeedle	10 Images, 2*2 Stitching, Exact Accuracy	81.8	GPT-4o
Long-Context Understanding	MMNeedle	10 Images, 4*4 Stitching, Exact Accuracy	26.9	GPT-4o
Long-Context Understanding	MMNeedle	10 Images, 8*8 Stitching, Exact Accuracy	1	GPT-4o
Long-Context Understanding	MMNeedle	1 Image, 2*2 Stitching, Exact Accuracy	86.09	GPT-4V
Long-Context Understanding	MMNeedle	1 Image, 4*4 Stitching, Exact Accuracy	54.72	GPT-4V
Long-Context Understanding	MMNeedle	1 Image, 8*8 Stitching, Exact Accuracy	7.3	GPT-4V
Long-Context Understanding	MMNeedle	10 Images, 1*1 Stitching, Exact Accuracy	72.36	GPT-4V
Long-Context Understanding	MMNeedle	10 Images, 2*2 Stitching, Exact Accuracy	34.24	GPT-4V
Long-Context Understanding	MMNeedle	10 Images, 4*4 Stitching, Exact Accuracy	7.58	GPT-4V
Long-Context Understanding	Ada-LEval (BestAnswer)	12k	49.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	16k	44	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	1k	74	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	2k	73.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	32k	16	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	4k	67.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	6k	59.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	8k	53.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	12k	52	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	16k	44.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	1k	73.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	2k	73.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	32k	30	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	4k	65.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	6k	63	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	8k	56.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	128k	6	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	16k	3.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	2k	18.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	32k	6	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	4k	15.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	64k	6	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	8k	7.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	128k	2	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	16k	5.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	2k	15.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	32k	2	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	4k	16.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	64k	4	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	8k	8.5	GPT-4-Turbo-0125
Answerability Prediction	PeerQA	Macro F1	0.3087	GPT-4o-2024-08-06
Explanatory Visual Question Answering	SME	#Learning Samples (N)	16	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	ACC	42.3	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	BLEU-4	45.51	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	CIDEr	269.68	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	Detection	7	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	METEOR	35.17	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	ROUGE-L	52.67	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	SPICE	37.67	GPT-4-1106-Vision-Preview
Object Rearrangement	Open6DOR V2	pos-level0	39.1	GPT-4V
Object Rearrangement	Open6DOR V2	pos-level1	46.8	GPT-4V
Object Rearrangement	Open6DOR V2	rot-level0	9.1	GPT-4V
Object Rearrangement	Open6DOR V2	rot-level1	6.9	GPT-4V
Object Rearrangement	Open6DOR V2	rot-level2	11.7	GPT-4V
Long-Context Understanding	MMNeedle	10 Images, 1*1 Stitching, Exact Accuracy	97	GPT-4o
Long-Context Understanding	MMNeedle	10 Images, 2*2 Stitching, Exact Accuracy	81.8	GPT-4o
Long-Context Understanding	MMNeedle	10 Images, 4*4 Stitching, Exact Accuracy	26.9	GPT-4o
Long-Context Understanding	MMNeedle	10 Images, 8*8 Stitching, Exact Accuracy	1	GPT-4o
Long-Context Understanding	MMNeedle	1 Image, 2*2 Stitching, Exact Accuracy	86.09	GPT-4V
Long-Context Understanding	MMNeedle	1 Image, 4*4 Stitching, Exact Accuracy	54.72	GPT-4V
Long-Context Understanding	MMNeedle	1 Image, 8*8 Stitching, Exact Accuracy	7.3	GPT-4V
Long-Context Understanding	MMNeedle	10 Images, 1*1 Stitching, Exact Accuracy	72.36	GPT-4V
Long-Context Understanding	MMNeedle	10 Images, 2*2 Stitching, Exact Accuracy	34.24	GPT-4V
Long-Context Understanding	MMNeedle	10 Images, 4*4 Stitching, Exact Accuracy	7.58	GPT-4V
Long-Context Understanding	Ada-LEval (BestAnswer)	12k	49.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	16k	44	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	1k	74	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	2k	73.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	32k	16	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	4k	67.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	6k	59.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	8k	53.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (BestAnswer)	12k	52	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	16k	44.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	1k	73.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	2k	73.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	32k	30	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	4k	65.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	6k	63	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (BestAnswer)	8k	56.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	128k	6	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	16k	3.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	2k	18.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	32k	6	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	4k	15.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	64k	6	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	8k	7.5	GPT-4-Turbo-1106
Long-Context Understanding	Ada-LEval (TSort)	128k	2	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	16k	5.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	2k	15.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	32k	2	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	4k	16.5	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	64k	4	GPT-4-Turbo-0125
Long-Context Understanding	Ada-LEval (TSort)	8k	8.5	GPT-4-Turbo-0125
Answerability Prediction	PeerQA	Macro F1	0.3087	GPT-4o-2024-08-06
Explanatory Visual Question Answering	SME	#Learning Samples (N)	16	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	ACC	42.3	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	BLEU-4	45.51	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	CIDEr	269.68	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	Detection	7	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	METEOR	35.17	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	ROUGE-L	52.67	GPT-4-1106-Vision-Preview
Explanatory Visual Question Answering	SME	SPICE	37.67	GPT-4-1106-Vision-Preview
Object Rearrangement	Open6DOR V2	pos-level0	39.1	GPT-4V
Object Rearrangement	Open6DOR V2	pos-level1	46.8	GPT-4V
Object Rearrangement	Open6DOR V2	rot-level0	9.1	GPT-4V
Object Rearrangement	Open6DOR V2	rot-level1	6.9	GPT-4V
Object Rearrangement	Open6DOR V2	rot-level2	11.7	GPT-4V
