Tarsier: Recipes for Training and Evaluating Large Video Description Models

Jiawei Wang, Liping Yuan, Yuchen Zhang, Haomiao Sun

Published: 2024-06-30 · arXiv 2024
Tasks: Zero-Shot Video Question Answer · Video Question Answering · Video Captioning · Video Description · Video Understanding · Visual Question Answering (VQA)
Links: Paper · PDF · Code (official)

Abstract

Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a +51.4% advantage in human side-by-side evaluation over the strongest such model. They are also comparable to state-of-the-art proprietary models, with a +12.3% advantage against GPT-4V and a -6.7% disadvantage against Gemini 1.5 Pro. Upgraded to Tarsier2, built upon SigLIP and Qwen2-7B, the model improves further still, showing a +4.8% advantage against GPT-4o. Beyond video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is a new benchmark, DREAM-1K (https://tarsier-vlm.github.io/), for evaluating video description models: a challenging dataset of videos from diverse sources and of varying complexity, together with an automatic method designed specifically to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at https://github.com/bytedance/tarsier.
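
The architecture sentence above maps to very little code. Below is a minimal, hypothetical PyTorch sketch of the frame-wise encode-then-LLM pattern the abstract describes (encode each frame independently with a CLIP-ViT, project into the LLM embedding space, and let the LLM's self-attention do the temporal modeling). All class, method, and dimension names here are illustrative assumptions, not the authors' implementation:

```python
# Sketch of the Tarsier-style pipeline described in the abstract:
# a CLIP-ViT image encoder runs on each frame independently, a linear
# projector maps visual features into the LLM embedding space, and the
# LLM attends across the concatenated frame tokens to model time.
# All names below are illustrative; this is not the authors' code.
import torch
import torch.nn as nn

class FrameWiseVideoLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a CLIP-ViT backbone
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.llm = llm                            # a decoder-only LLM

    def forward(self, frames, text_embeds):
        # frames: (batch, num_frames, 3, H, W); each frame encoded separately
        b, t = frames.shape[:2]
        flat = frames.flatten(0, 1)                   # (b*t, 3, H, W)
        patch_feats = self.vision_encoder(flat)       # (b*t, patches, vision_dim)
        vis_tokens = self.projector(patch_feats)      # map into LLM space
        vis_tokens = vis_tokens.view(b, t * vis_tokens.shape[1], -1)
        # Temporal modeling happens inside the LLM: its self-attention sees
        # all frames' tokens in sequence, followed by the text prompt tokens.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```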
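
The abstract's "automatic method ... to assess the quality of fine-grained video descriptions" suggests an event-level comparison rather than n-gram overlap. A hedged sketch of one plausible scheme, assuming LLM-backed `extract_events` and `entails` helpers (hypothetical placeholders, not the released evaluation code): split both descriptions into atomic events, then score precision and recall via bidirectional entailment.

```python
# Hedged sketch of an event-level description metric in the spirit of
# DREAM-1K's automatic evaluation. `extract_events` and `entails` stand
# in for model-backed components (assumptions, not the authors' code).
from typing import Callable

def description_f1(
    reference: str,
    candidate: str,
    extract_events: Callable[[str], list[str]],
    entails: Callable[[str, str], bool],
) -> tuple[float, float, float]:
    ref_events = extract_events(reference)
    cand_events = extract_events(candidate)
    # Precision: fraction of candidate events supported by the reference.
    precision = sum(entails(reference, e) for e in cand_events) / max(len(cand_events), 1)
    # Recall: fraction of reference events covered by the candidate.
    recall = sum(entails(candidate, e) for e in ref_events) / max(len(ref_events), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1
```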

Results

Task | Dataset | Metric | Value | Model
Video Question Answering | NExT-QA | Accuracy | 79.2 | Tarsier (34B)
Video Question Answering | MSVD-QA | Accuracy | 80.3 | Tarsier (34B)
Video Question Answering | MSVD-QA | Confidence Score | 4.2 | Tarsier (34B)
Video Question Answering | TGIF-QA | Accuracy | 82.5 | Tarsier (34B)
Video Question Answering | TGIF-QA | Confidence Score | 4.4 | Tarsier (34B)
Video Question Answering | MSRVTT-QA | Accuracy | 66.4 | Tarsier (34B)
Video Question Answering | MSRVTT-QA | Confidence Score | 3.7 | Tarsier (34B)
Video Question Answering | EgoSchema (fullset) | Accuracy | 61.7 | Tarsier (34B)
Video Question Answering | EgoSchema (subset) | Accuracy | 68.6 | Tarsier (34B)
Video Question Answering | ActivityNet-QA | Accuracy | 61.6 | Tarsier (34B)
Video Question Answering | ActivityNet-QA | Confidence Score | 3.7 | Tarsier (34B)
Video Question Answering | TVBench | Average Accuracy | 55.5 | Tarsier (34B)
Video Question Answering | TVBench | Average Accuracy | 46.9 | Tarsier (7B)
Video Question Answering | MVBench | Average Accuracy | 67.6 | Tarsier (34B)
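
A note on the metrics above: for open-ended benchmarks such as MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA, a paired Accuracy and 0-5 Confidence Score are conventionally produced by a GPT-assisted judging protocol (popularized by Video-ChatGPT), in which a judge model returns a yes/no correctness label plus a quality score for each prediction. A hedged sketch of that loop, with illustrative prompt wording and judge-model choice rather than the exact setup behind these numbers:

```python
# Hedged sketch of the GPT-assisted open-ended VideoQA protocol commonly
# behind paired Accuracy / Confidence Score numbers: a judge LLM labels
# each prediction yes/no and assigns a 0-5 score. Prompt wording and
# model choice are illustrative assumptions, not the exact setup here.
import json
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str, prediction: str) -> dict:
    prompt = (
        "You evaluate video question answering. Given the question, the "
        "ground-truth answer, and a model's prediction, reply with JSON "
        '{"pred": "yes" or "no", "score": <integer 0-5>} rating '
        "correctness and answer quality.\n"
        f"Question: {question}\nAnswer: {answer}\nPrediction: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

def aggregate(results: list[dict]) -> tuple[float, float]:
    # Accuracy = fraction judged "yes"; Confidence Score = mean 0-5 score.
    acc = sum(r["pred"] == "yes" for r in results) / len(results)
    score = sum(r["score"] for r in results) / len(results)
    return acc, score
```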

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)