Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens

Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang

2023-12-12 · Zero-Shot Video Question Answer · Question Answering · Text Generation · Hallucination · Video Question Answering

Abstract

Recent advances in large video-language models have displayed promising outcomes in video comprehension. Current approaches straightforwardly convert video into language tokens and employ large language models for multi-modal tasks. However, this method often leads to the generation of irrelevant content, commonly known as "hallucination", as the length of the text increases and the impact of the video diminishes. To address this problem, we propose Vista-LLaMA, a novel framework that maintains a consistent distance between all visual tokens and any language tokens, irrespective of the generated text length. Vista-LLaMA omits relative position encoding when determining attention weights between visual and text tokens, while retaining the position encoding between text tokens. This amplifies the effect of visual tokens on text generation, especially when the relative distance between visual and text tokens is large. The proposed attention mechanism significantly reduces the chance of producing text irrelevant to the video content. Furthermore, we present a sequential visual projector that projects the current video frame into tokens of language space with the assistance of the previous frame. This approach not only captures the temporal relationship within the video, but also allows fewer visual tokens to encompass the entire video. Our approach significantly outperforms various previous methods (e.g., Video-ChatGPT, MovieChat) on four challenging open-ended video question answering benchmarks. We reach an accuracy of 60.7 on the zero-shot NExT-QA and 60.5 on the zero-shot MSRVTT-QA, setting a new state-of-the-art performance. This project is available at https://jinxxian.github.io/Vista-LLaMA.
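
The core of the method is an attention variant in which visual tokens sit at an equal, position-free distance from every text token. Below is a minimal single-head sketch of that idea in PyTorch, assuming a RoPE-based decoder; the names edvt_attention, apply_rope, and n_visual, and the simplified (non-interleaved) rotary embedding, are illustrative assumptions rather than the paper's actual implementation.

```python
# Sketch: attention with "equal distance to visual tokens".
# Visual keys are scored without rotary position encoding; text keys keep it.
import torch
import torch.nn.functional as F

def apply_rope(x, positions):
    """Toy rotary position embedding over (seq, d) features (non-interleaved)."""
    d = x.shape[-1]
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = positions[:, None].float() * freqs[None, :]            # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def edvt_attention(q, k, v, n_visual):
    """
    q, k, v: (seq_len, d) single-head projections; the first n_visual rows
    are visual tokens, the rest are text tokens. Attention to visual keys
    uses unrotated q/k, so every text token keeps the same effective distance
    to the visual tokens; attention among text tokens keeps ordinary RoPE.
    """
    seq_len, d = q.shape
    pos = torch.arange(seq_len)
    q_rot, k_rot = apply_rope(q, pos), apply_rope(k, pos)

    # Two sets of logits: position-free for visual keys, RoPE-encoded for text keys.
    logits_visual = q @ k[:n_visual].T / d ** 0.5                    # (seq, n_visual)
    logits_text = q_rot @ k_rot[n_visual:].T / d ** 0.5              # (seq, seq - n_visual)
    logits = torch.cat([logits_visual, logits_text], dim=-1)

    # Causal mask for the text part; visual tokens are always visible.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    causal[:, :n_visual] = True
    logits = logits.masked_fill(~causal, float("-inf"))

    return F.softmax(logits, dim=-1) @ v

# Toy usage: 8 visual tokens followed by 24 text tokens, head dimension 64.
q, k, v = (torch.randn(32, 64) for _ in range(3))
out = edvt_attention(q, k, v, n_visual=8)                            # (32, 64)
```

The design point the sketch illustrates is that queries are rotated only when scoring text keys, so the visual-key logits do not decay as the generated text grows longer.

The sequential visual projector is described only at a high level ("with the assistance of the previous frame"). The sketch below is one plausible reading, assuming the current frame's projected tokens cross-attend to the previous frame's tokens; the class name, dimensions, and conditioning scheme are assumptions, not the paper's design.

```python
# Sketch: a sequential visual projector that conditions each frame's
# language-space tokens on the previous frame's tokens.
import torch
import torch.nn as nn

class SequentialVisualProjector(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)        # frame features -> language space
        self.cross = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, frame_feats):
        """frame_feats: (n_frames, n_patches, vis_dim) -> (n_frames, n_patches, llm_dim)."""
        prev, outputs = None, []
        for feats in frame_feats:                       # iterate over frames in order
            tokens = self.proj(feats).unsqueeze(0)      # (1, n_patches, llm_dim)
            if prev is not None:
                # Let the current frame's tokens attend to the previous frame's
                # tokens, injecting temporal context before they reach the LLM.
                ctx, _ = self.cross(tokens, prev, prev)
                tokens = tokens + ctx
            outputs.append(tokens)
            prev = tokens
        return torch.cat(outputs, dim=0)
```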

Results

Task                     | Dataset                       | Metric           | Value | Model
Question Answering       | NExT-QA (Open-ended VideoQA)  | Accuracy         | 60.7  | Vista-LLaMA
Question Answering       | NExT-QA (Open-ended VideoQA)  | Confidence Score | 3.4   | Vista-LLaMA
Question Answering       | MSRVTT-QA                     | Accuracy         | 60.5  | Vista-LLaMA-7B
Question Answering       | MSRVTT-QA                     | Confidence Score | 3.3   | Vista-LLaMA-7B
Video Question Answering | MSRVTT-QA                     | Accuracy         | 60.5  | Vista-LLaMA-7B
Video Question Answering | MSRVTT-QA                     | Confidence Score | 3.3   | Vista-LLaMA-7B

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)