Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

RTQ: Rethinking Video-language Understanding Based on Image-text Model

Xiao Wang, Yaoyu Li, Tian Gan, Zheng Zhang, Jingjing Lv, Liqiang Nie

2023-12-01 · Video Retrieval · Video Question Answering · Video Captioning

Paper · PDF · Code · Code (official)

Abstract

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods. Code is available at https://github.com/SCZwangxiao/RTQ-MM2023.
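
To make the three stages concrete, here is a minimal PyTorch sketch of the pipeline the abstract names: Refine (drop redundant tokens within frames), Temporal model (relate tokens across frames), and Query (pool task-specific information). All module names, the norm-based token scoring, and the hyperparameters below are hypothetical stand-ins chosen for illustration, not the authors' implementation; the actual code is in the linked repository.

```python
import torch
import torch.nn as nn

class RTQSketch(nn.Module):
    """Illustrative Refine -> Temporal model -> Query pipeline.

    Hypothetical structure only; see https://github.com/SCZwangxiao/RTQ-MM2023
    for the authors' actual implementation.
    """

    def __init__(self, dim=512, num_queries=32, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio  # fraction of frame tokens kept (Refine)
        # Temporal model: a small transformer over tokens from all frames.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Query: learnable task-specific queries that cross-attend to the video.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, frames, tokens_per_frame, dim),
        # e.g. patch features from a frozen image-text encoder.
        b, f, t, d = frame_tokens.shape

        # 1) Refine: keep the top-k tokens per frame. The L2-norm saliency
        #    score here is a placeholder, not the paper's criterion.
        k = max(1, int(t * self.keep_ratio))
        scores = frame_tokens.norm(dim=-1)                        # (b, f, t)
        idx = scores.topk(k, dim=-1).indices
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, d)             # (b, f, k, d)
        refined = frame_tokens.gather(2, idx)                     # (b, f, k, d)

        # 2) Temporal model: flatten frames and model cross-frame relations.
        video_tokens = self.temporal(refined.reshape(b, f * k, d))

        # 3) Query: pool task-specific information with learnable queries.
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # (b, nq, d)
        out, _ = self.cross_attn(q, video_tokens, video_tokens)
        return out  # feed to a retrieval / QA / captioning head
```

The point of the sketch is the ordering: redundancy is removed before temporal modeling (so the transformer sees fewer, more informative tokens), and task-specific querying happens last, which is what lets one video representation serve retrieval, QA, and captioning heads.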

Results

Task                      Dataset      Metric              Value  Model
Video Retrieval           MSR-VTT-1kA  text-to-video R@1   53.4   RTQ
Video Retrieval           MSR-VTT-1kA  text-to-video R@5   76.1   RTQ
Video Retrieval           MSR-VTT-1kA  text-to-video R@10  84.4   RTQ
Video Retrieval           ActivityNet  text-to-video R@1   53.5   RTQ
Video Retrieval           ActivityNet  text-to-video R@5   81.4   RTQ
Video Retrieval           ActivityNet  text-to-video R@10  91.9   RTQ
Video Retrieval           DiDeMo       text-to-video R@1   57.6   RTQ
Video Retrieval           DiDeMo       text-to-video R@5   84.1   RTQ
Video Retrieval           DiDeMo       text-to-video R@10  89.9   RTQ
Video Question Answering  NExT-QA      Accuracy            63.2   RTQ
Video Captioning          MSR-VTT      BLEU-4              49.6   RTQ
Video Captioning          MSR-VTT      CIDEr               69.3   RTQ
Video Captioning          MSR-VTT      ROUGE-L             66.1   RTQ
Video Captioning          MSVD         BLEU-4              66.9   RTQ
Video Captioning          MSVD         CIDEr               123.4  RTQ
Video Captioning          MSVD         ROUGE-L             82.2   RTQ
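
The retrieval rows report text-to-video R@k: the percentage of text queries whose ground-truth video appears in the top k results when videos are ranked by similarity. A minimal sketch of that metric, assuming a square similarity matrix where text i matches video i:

```python
import numpy as np

def recall_at_k(sim, k):
    """sim: (num_texts, num_videos) similarity matrix; the ground-truth
    video for text i is assumed to be video i. Returns R@k in percent."""
    order = np.argsort(-sim, axis=1)  # videos ranked by descending similarity
    # Position of the ground-truth video in each query's ranking (0 = best).
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return 100.0 * np.mean(ranks < k)

# Example: 3 text queries vs. 3 videos.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.1, 0.7, 0.5]])
print(recall_at_k(sim, 1))  # 33.3... : only query 0's match ranks first
print(recall_at_k(sim, 2))  # 100.0  : every match is in the top 2
```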

Related Papers

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025-06-28)
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs (2025-06-27)
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? (2025-06-19)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
CogStream: Context-guided Streaming Video Question Answering (2025-06-12)