Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

Shuting He, Henghui Ding

2024-04-04 · CVPR 2024

Tasks: Referring Expression · Referring Video Object Segmentation · Referring Expression Segmentation · Sentence Embeddings · Video Segmentation · Contrastive Learning · Video Semantic Segmentation

Paper · PDF · Code (official)

Abstract

Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a whole and directly perform identification at the video level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot well comprehend motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. Firstly, we introduce an expression-decoupling module to make static cues and motion cues perform their distinct roles, alleviating the issue of sentence embeddings overlooking motion cues. Secondly, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable $\textbf{9.2\%}$ $\mathcal{J\&F}$ improvement on the challenging $\textbf{MeViS}$ dataset. Code is available at https://github.com/heshuting555/DsHmp.
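The abstract mentions using contrastive learning to distinguish the motions of visually similar objects. As a rough illustration only, a generic InfoNCE-style contrastive loss over motion embeddings might look like the sketch below; this is not the paper's exact formulation, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Generic InfoNCE contrastive loss (illustrative sketch):
    pull the anchor embedding toward its positive (the matching motion)
    and push it away from negatives (motions of other, similar objects).

    anchor:    (d,)   embedding of the referred object's motion
    positive:  (d,)   embedding of the matching motion description
    negatives: (k, d) embeddings of distractor motions
    """
    def l2norm(v):
        # Cosine similarity via L2-normalised dot products
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = l2norm(anchor), l2norm(positive), l2norm(negatives)
    pos = np.exp(a @ p / temperature)        # similarity to the true match
    neg = np.exp(n @ a / temperature).sum()  # similarities to distractors
    return -np.log(pos / (pos + neg))        # small when anchor ~ positive
```

When the anchor is close to its positive and far from the negatives, the loss approaches zero; confusable motions produce a large loss, which is the training signal that separates them.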

Results

Task | Dataset | Metric | Value | Model
Video | MeViS | F | 49.8 | DsHmp
Video | MeViS | J | 43 | DsHmp
Video | MeViS | J&F | 46.4 | DsHmp
Video | Ref-DAVIS17 | F | 68.1 | DsHmp
Video | Ref-DAVIS17 | J | 61.7 | DsHmp
Video | Ref-DAVIS17 | J&F | 64.9 | DsHmp
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 69.1 | DsHmp (Video-Swin-Base)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 65 | DsHmp (Video-Swin-Base)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 67.1 | DsHmp (Video-Swin-Base)
Video Object Segmentation | MeViS | F | 49.8 | DsHmp
Video Object Segmentation | MeViS | J | 43 | DsHmp
Video Object Segmentation | MeViS | J&F | 46.4 | DsHmp
Video Object Segmentation | Ref-DAVIS17 | F | 68.1 | DsHmp
Video Object Segmentation | Ref-DAVIS17 | J | 61.7 | DsHmp
Video Object Segmentation | Ref-DAVIS17 | J&F | 64.9 | DsHmp
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 69.1 | DsHmp (Video-Swin-Base)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 65 | DsHmp (Video-Swin-Base)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 67.1 | DsHmp (Video-Swin-Base)
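The J, F, and J&F numbers above are the standard video object segmentation metrics: J is region similarity (mask IoU), F is boundary accuracy, and J&F is their mean. A minimal sketch of how they can be computed per frame is shown below; this is a simplified stand-in for the official DAVIS evaluation (the real boundary measure uses morphological matching, and this version's dilation wraps at image edges), so treat it as illustrative only.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement
    return np.logical_and(pred, gt).sum() / union

def _boundary(mask):
    """Boundary pixels: mask pixels with at least one background 4-neighbour."""
    m = np.pad(mask.astype(bool), 1, constant_values=False)
    interior = (m[:-2, 1:-1] & m[2:, 1:-1] & m[1:-1, :-2] & m[1:-1, 2:])
    return mask.astype(bool) & ~interior

def _dilate(mask, r):
    """Crude binary dilation by a (2r+1)x(2r+1) square (tolerance band)."""
    m = mask.astype(bool)
    out = m.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(np.roll(m, dy, axis=0), dx, axis=1)
    return out

def jf_score(pred, gt, tol=1):
    """Return (J, F, J&F) for one frame; `tol` is the boundary match radius."""
    j = region_similarity(pred, gt)
    bp, bg = _boundary(pred), _boundary(gt)
    precision = (bp & _dilate(bg, tol)).sum() / max(bp.sum(), 1)
    recall = (bg & _dilate(bp, tol)).sum() / max(bg.sum(), 1)
    f = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return j, f, (j + f) / 2
```

A dataset-level J&F, such as the 46.4 reported on MeViS, is the average of these per-frame (and per-object) scores over the whole benchmark.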

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment (2025-07-20)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
LLM-Driven Dual-Level Multi-Interest Modeling for Recommendation (2025-07-15)