Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

2025-07-02 Video Understanding

Abstract

Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.
