Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

Yiqi Wu, Xiaodan Hu, Ziming Fu, Siling Zhou, Jiangong Li

2024-06-14 · Zero-Shot Video Question Answer · MMR total · Semantic correspondence · Video Understanding · Activity Recognition

Abstract

Animal ethology is a crucial aspect of animal research, and animal behavior labeling is the foundation for studying animal behavior. This process typically involves labeling video clips with behavioral semantic tags, a task that is complex, subjective, and multimodal. With the rapid development of multimodal large language models (LLMs), new applications have emerged for animal behavior understanding tasks in livestock scenarios. This study evaluates the visual perception capabilities of multimodal LLMs in animal activity recognition. To achieve this, we created piglet test data comprising close-up video clips of individual piglets and annotated full-shot video clips. These data were used to assess the performance of four multimodal LLMs, Video-LLaMA, MiniGPT4-Video, Video-Chat2, and GPT-4 omni (GPT-4o), in piglet activity understanding. Through comprehensive evaluation across five dimensions, including counting, actor referring, semantic correspondence, time perception, and robustness, we found that while current multimodal LLMs require improvement in semantic correspondence and time perception, they have initially demonstrated visual perception capabilities for animal activity recognition. Notably, GPT-4o showed outstanding performance, with Video-Chat2 and GPT-4o exhibiting significantly better semantic correspondence and time perception in close-up video clips than in full-shot clips. The initial evaluation experiments in this study validate the potential of multimodal large language models in livestock-scene video understanding and provide new directions and references for future research on animal behavior video understanding. Furthermore, by deeply exploring the influence of visual prompts on multimodal large language models, we expect to enhance the accuracy and efficiency of animal behavior recognition in livestock scenarios through human visual processing methods.
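The five-dimension evaluation described above can be sketched as a simple per-dimension scoring loop. This is a minimal illustration, not the authors' code: the dimension names come from the abstract, while the record format and `score_model` helper are hypothetical.

```python
from collections import defaultdict

# The five evaluation dimensions named in the abstract.
DIMENSIONS = {"counting", "actor referring", "semantic correspondence",
              "time perception", "robustness"}

def score_model(predictions, ground_truth):
    """Compute per-dimension accuracy for one model.

    predictions:  list of (dimension, predicted_answer) tuples
    ground_truth: list of (dimension, correct_answer), aligned by index
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for (dim, pred), (_, gold) in zip(predictions, ground_truth):
        assert dim in DIMENSIONS, f"unknown dimension: {dim}"
        total[dim] += 1
        if pred == gold:
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}

# Toy example covering two of the five dimensions:
preds = [("counting", "3"), ("counting", "2"), ("time perception", "10s")]
gold  = [("counting", "3"), ("counting", "3"), ("time perception", "10s")]
print(score_model(preds, gold))
# {'counting': 0.5, 'time perception': 1.0}
```

In practice the predicted answers would come from prompting each multimodal LLM with a video clip and question; exact-match scoring is used here only to keep the sketch self-contained.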

Results

Task | Dataset | Metric | Value | Model
Question Answering | Video-MME (w/o subs) | Accuracy (%) | 70.3 | GPT-4o
Question Answering | Video-MME (w/o subs) | Accuracy (%) | 62.3 | GPT-4o mini
Question Answering | LongVideoBench (zero-shot) | Accuracy (%) | 64 | GPT-4o
Question Answering | Video-MME | Accuracy (%) | 77.2 | GPT-4o
Question Answering | Video-MME | Accuracy (%) | 68.9 | GPT-4o mini
Video Question Answering | Video-MME (w/o subs) | Accuracy (%) | 70.3 | GPT-4o
Video Question Answering | Video-MME (w/o subs) | Accuracy (%) | 62.3 | GPT-4o mini
Video Question Answering | LongVideoBench (zero-shot) | Accuracy (%) | 64 | GPT-4o
Video Question Answering | Video-MME | Accuracy (%) | 77.2 | GPT-4o
Video Question Answering | Video-MME | Accuracy (%) | 68.9 | GPT-4o mini
MMR total | MRR-Benchmark | Total Column Score | 457 | GPT-4o

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)