The Devil is in Temporal Token: High Quality Video Reasoning Segmentation

Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, Huchuan Lu

Published: 2025-01-15 | Venue: CVPR 2025
Tasks: Referring Video Object Segmentation, Referring Expression Segmentation, Segmentation

Abstract

Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA) module and a Token-driven Keyframe Selection (TKS) module. Specifically, we design frame-level <SEG> and temporal-level <TAK> tokens that utilize the MLLM's autoregressive learning to effectively capture both local and global information. Subsequently, we apply a similarity-based weighted fusion and frame selection strategy, then utilize SAM2 to perform keyframe segmentation and propagation. To enhance keyframe localization accuracy, the TKS filters keyframes based on SAM2's occlusion scores during inference. VRS-HQ achieves state-of-the-art performance on ReVOS, surpassing VISA by 5.9%/12.5%/9.1% in J&F scores across the three subsets. These results highlight the strong temporal reasoning and segmentation capabilities of our method. Code and model weights will be released at VRS-HQ.
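
The similarity-based weighted fusion and frame selection mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes cosine similarity between each frame-level <SEG> embedding and the temporal-level <TAK> embedding, softmax-normalized weights, and an argmax rule for the keyframe. The function names (`cosine`, `softmax`, `fuse_and_select`) and the toy vectors are hypothetical, and the SAM2 occlusion-score filtering step is not modeled.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_select(
    frame_tokens: List[List[float]], temporal_token: List[float]
) -> Tuple[List[float], List[float], int]:
    """Fuse frame-level tokens by similarity to the temporal token.

    Assumption (not confirmed by the source): weights are a softmax over
    cosine similarities, and the keyframe is the most similar frame.
    Returns (fused_token, weights, keyframe_index).
    """
    sims = [cosine(tok, temporal_token) for tok in frame_tokens]
    weights = softmax(sims)
    dim = len(temporal_token)
    # Weighted sum of frame tokens, one coordinate at a time.
    fused = [
        sum(w * tok[i] for w, tok in zip(weights, frame_tokens))
        for i in range(dim)
    ]
    keyframe = max(range(len(sims)), key=sims.__getitem__)
    return fused, weights, keyframe


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for <SEG> tokens of three frames.
    seg_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    tak_token = [1.0, 1.0]
    fused, weights, keyframe = fuse_and_select(seg_tokens, tak_token)
    print("weights:", weights)
    print("fused token:", fused)
    print("keyframe index:", keyframe)
```

In this toy setup the frame whose embedding points in the same direction as the temporal token receives the largest weight and is picked as the keyframe; in the described pipeline, that keyframe would then be passed to SAM2 for segmentation and propagation.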

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | ReVOS | F | 62.5 | VRS-HQ (Chat-UniVi-13B) |
| Video | ReVOS | J | 57.6 | VRS-HQ (Chat-UniVi-13B) |
| Video | ReVOS | J&F | 60 | VRS-HQ (Chat-UniVi-13B) |
| Video | ReVOS | R | 18.9 | VRS-HQ (Chat-UniVi-13B) |
| Video | ReVOS | F | 61.6 | VRS-HQ (Chat-UniVi-7B) |
| Video | ReVOS | J | 56.6 | VRS-HQ (Chat-UniVi-7B) |
| Video | ReVOS | J&F | 59.1 | VRS-HQ (Chat-UniVi-7B) |
| Video | ReVOS | R | 19.7 | VRS-HQ (Chat-UniVi-7B) |
| Video | MeViS | F | 53.7 | VRS-HQ (Chat-UniVi-13B) |
| Video | MeViS | J | 48 | VRS-HQ (Chat-UniVi-13B) |
| Video | MeViS | J&F | 50.9 | VRS-HQ (Chat-UniVi-13B) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 73.1 | VRS-HQ (Chat-UniVi-13B) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 69 | VRS-HQ (Chat-UniVi-13B) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 71 | VRS-HQ (Chat-UniVi-13B) |
| Video Object Segmentation | ReVOS | F | 62.5 | VRS-HQ (Chat-UniVi-13B) |
| Video Object Segmentation | ReVOS | J | 57.6 | VRS-HQ (Chat-UniVi-13B) |
| Video Object Segmentation | ReVOS | J&F | 60 | VRS-HQ (Chat-UniVi-13B) |
| Video Object Segmentation | ReVOS | R | 18.9 | VRS-HQ (Chat-UniVi-13B) |
| Video Object Segmentation | ReVOS | F | 61.6 | VRS-HQ (Chat-UniVi-7B) |
| Video Object Segmentation | ReVOS | J | 56.6 | VRS-HQ (Chat-UniVi-7B) |
| Video Object Segmentation | ReVOS | J&F | 59.1 | VRS-HQ (Chat-UniVi-7B) |
| Video Object Segmentation | ReVOS | R | 19.7 | VRS-HQ (Chat-UniVi-7B) |
| Video Object Segmentation | MeViS | F | 53.7 | VRS-HQ (Chat-UniVi-13B) |
| Video Object Segmentation | MeViS | J | 48 | VRS-HQ (Chat-UniVi-13B) |
| Video Object Segmentation | MeViS | J&F | 50.9 | VRS-HQ (Chat-UniVi-13B) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 73.1 | VRS-HQ (Chat-UniVi-13B) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 69 | VRS-HQ (Chat-UniVi-13B) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 71 | VRS-HQ (Chat-UniVi-13B) |
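
The J&F values listed above are consistent with the usual convention that J&F is the arithmetic mean of region similarity (J) and contour accuracy (F). A minimal sketch of that check follows; the helper names (`j_and_f`, `max_gap`) and the `REPORTED` list are illustrative, the numbers are copied from the results listed above, and the tolerance assumes the listed values were rounded to at most one decimal.

```python
from typing import List, Tuple


def j_and_f(j: float, f: float) -> float:
    """Arithmetic mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2.0


# (dataset, model, J, F, reported J&F), taken from the results listed above.
REPORTED: List[Tuple[str, str, float, float, float]] = [
    ("ReVOS", "VRS-HQ (Chat-UniVi-13B)", 57.6, 62.5, 60.0),
    ("ReVOS", "VRS-HQ (Chat-UniVi-7B)", 56.6, 61.6, 59.1),
    ("MeViS", "VRS-HQ (Chat-UniVi-13B)", 48.0, 53.7, 50.9),
    ("Refer-YouTube-VOS", "VRS-HQ (Chat-UniVi-13B)", 69.0, 73.1, 71.0),
]


def max_gap(rows: List[Tuple[str, str, float, float, float]]) -> float:
    """Largest absolute difference between the computed mean and the reported J&F."""
    return max(abs(j_and_f(j, f) - reported) for _, _, j, f, reported in rows)


if __name__ == "__main__":
    for dataset, model, j, f, reported in REPORTED:
        print(f"{dataset} / {model}: mean(J, F) = {j_and_f(j, f):.2f}, reported = {reported}")
    # Gaps of about 0.05 are expected from rounding to one decimal.
    print(f"largest gap: {max_gap(REPORTED):.3f}")
```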
