
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization

Richard Luo, Austin Peng, Adithya Vasudev, Rishabh Jain

2024-05-31 · Video Summarization · Video Captioning
Paper · PDF · Code (official)

Abstract

Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence, except that multiple data streams (such as visual and auditory information) must be processed simultaneously. Comprehending the entire video therefore requires not only understanding the visual-audio content of each shot but also linking the ideas across shots into a larger, all-encompassing story. Despite significant progress in the field, current work often overlooks videos' more granular shot-by-shot semantic information. In this project, we propose Shotluck Holmes, a family of efficient large language vision models (LLVMs) for video summarization and captioning. By leveraging better pretraining and data-collection strategies, we extend the abilities of existing small LLVMs from understanding a single image to understanding a sequence of frames. Specifically, we show that Shotluck Holmes outperforms state-of-the-art results on the Shot2Story video captioning and summarization task with significantly smaller and more computationally efficient models.
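The abstract describes a two-stage, shot-level flow: a small LLVM captions each shot from its sampled frames (and any audio transcript), and the per-shot captions are then fused into a whole-video summary. The sketch below illustrates that flow only; the data structures, function names, and prompt wording are assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative sketch of shot-by-shot captioning followed by summarization.
# All names and prompts here are hypothetical, not the Shotluck Holmes code.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Shot:
    """One contiguous video segment: sampled frames plus its audio transcript."""
    frames: Sequence[bytes]  # e.g. JPEG-encoded frames sampled from the shot
    transcript: str          # ASR text for the shot, if available


def summarize_video(
    shots: List[Shot],
    caption_fn: Callable[[Shot, str], str],
    summarize_fn: Callable[[str, str], str],
) -> str:
    """Caption each shot independently, then fuse the captions into one summary.

    `caption_fn` wraps a small LLVM call (frames + prompt -> shot caption);
    `summarize_fn` wraps a text-only call (joined captions + prompt -> summary).
    """
    shot_captions = []
    for i, shot in enumerate(shots):
        prompt = f"Describe shot {i + 1} of the video using the frames and the narration."
        shot_captions.append(caption_fn(shot, prompt))

    joined = "\n".join(f"Shot {i + 1}: {c}" for i, c in enumerate(shot_captions))
    return summarize_fn(joined, "Write a single coherent summary of the whole video.")
```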

Results

Task                 | Dataset       | Metric | Value | Model
Video                | Shot2Story20K | BLEU-4 | 7.67  | Shotluck-Holmes (3.1B)
Video                | Shot2Story20K | CIDEr  | 152.3 | Shotluck-Holmes (3.1B)
Video                | Shot2Story20K | METEOR | 23.2  | Shotluck-Holmes (3.1B)
Video                | Shot2Story20K | ROUGE  | 43    | Shotluck-Holmes (3.1B)
Video Captioning     | Shot2Story20K | BLEU-4 | 8.7   | Shotluck-Holmes (3.1B)
Video Captioning     | Shot2Story20K | CIDEr  | 63.2  | Shotluck-Holmes (3.1B)
Video Captioning     | Shot2Story20K | METEOR | 25.7  | Shotluck-Holmes (3.1B)
Video Captioning     | Shot2Story20K | ROUGE  | 36.2  | Shotluck-Holmes (3.1B)
Video Summarization  | Shot2Story20K | BLEU-4 | 7.67  | Shotluck-Holmes (3.1B)
Video Summarization  | Shot2Story20K | CIDEr  | 152.3 | Shotluck-Holmes (3.1B)
Video Summarization  | Shot2Story20K | METEOR | 23.2  | Shotluck-Holmes (3.1B)
Video Summarization  | Shot2Story20K | ROUGE  | 43    | Shotluck-Holmes (3.1B)
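BLEU-4, CIDEr, METEOR, and ROUGE are standard overlap-based metrics for generated captions. A minimal way to compute them for your own predictions is the widely used pycocoevalcap package, sketched below; the package choice, the toy captions, and the identifiers are assumptions for illustration, and the paper's exact evaluation setup (tokenization, score scaling) may differ.

```python
# Toy example of scoring generated captions against references with the
# commonly used pycocoevalcap package (pip install pycocoevalcap).
# The captions below are made up and are not from the Shot2Story20K dataset.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map a video/shot id to a list of caption strings.
references = {"shot_0": ["a man plays an acoustic guitar on a small stage"]}
predictions = {"shot_0": ["a man is playing a guitar on stage"]}

bleu, _ = Bleu(4).compute_score(references, predictions)   # bleu[3] is BLEU-4
meteor, _ = Meteor().compute_score(references, predictions)
rouge_l, _ = Rouge().compute_score(references, predictions)
cider, _ = Cider().compute_score(references, predictions)

print(f"BLEU-4: {bleu[3]:.4f}  METEOR: {meteor:.4f}  "
      f"ROUGE-L: {rouge_l:.4f}  CIDEr: {cider:.4f}")
```

Note that pycocoevalcap returns raw scores (e.g. BLEU in [0, 1]); leaderboard tables such as the one above commonly report the same metrics multiplied by 100.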

Related Papers

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness (2025-06-25)
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment (2025-06-12)
Prompts to Summaries: Zero-Shot Language-Guided Video Summarization (2025-06-12)
Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization (2025-06-10)