Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Cross-Modal Progressive Comprehension for Referring Segmentation

Si Liu, Tianrui Hui, Shaofei Huang, Yunchao Wei, Bo Li, Guanbin Li

2021-05-15 | Attribute | Referring Expression Segmentation | Segmentation | Semantic Segmentation | Video Segmentation | Video Semantic Segmentation | Image Segmentation
Paper | PDF | Code (official)

Abstract

Given a natural language expression and an image or video, the goal of referring segmentation is to produce pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem through implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem progressively, based on the informative words in the expression: they first roughly locate candidate entities and then distinguish the target one. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme to mimic this behavior, and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. The relational words are then used to highlight the target entity and suppress irrelevant ones through spatial graph reasoning. For video data, our CMPC-V module builds on CMPC-I and further exploits action words to highlight the entity that matches the action cues through temporal graph reasoning. In addition to CMPC, we introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module that integrates the reasoned multimodal features from different levels of the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE forms the image or video version of our referring segmentation framework; these frameworks achieve new state-of-the-art performance on four referring image segmentation benchmarks and three referring video segmentation benchmarks, respectively.
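
The two-step idea in the abstract (perceive every candidate, then use relations to single out the target) can be illustrated with a toy sketch. This is not the CMPC-I module, which works on learned multimodal features and spatial graph reasoning. Everything here is a hypothetical assumption made for illustration: the example expression "man holding a frisbee", the hand-written two-channel region features, the adjacency matrix, and the function names `perceive` and `disambiguate`.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def perceive(region_feats, entity_query):
    """Stage 1 (assumed form): score every region against the entity words."""
    return [dot(f, entity_query) for f in region_feats]


def disambiguate(region_feats, candidate_scores, adjacency, context_query):
    """Stage 2 (assumed form): one message-passing step over the region graph.

    A candidate keeps its score only if a neighbouring region matches the
    context that the relational words point to.
    """
    refined = []
    for i, score in enumerate(candidate_scores):
        support = sum(
            adjacency[i][j] * dot(region_feats[j], context_query)
            for j in range(len(region_feats))
        )
        refined.append(score * support)
    return refined


# Hypothetical scene; feature channels are [person, frisbee].
regions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # man A, man B, frisbee
adjacency = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]   # only man A touches the frisbee

candidates = perceive(regions, entity_query=[1.0, 0.0])
final = disambiguate(regions, candidates, adjacency, context_query=[0.0, 1.0])
target = max(range(len(final)), key=final.__getitem__)
```

After stage 1 both men are candidates; after stage 2 only the man adjacent to the frisbee keeps a non-zero score.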

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Instance Segmentation | A2D Sentences | AP | 0.404 | CMPC-V (I3D) |
| Instance Segmentation | A2D Sentences | IoU mean | 0.573 | CMPC-V (I3D) |
| Instance Segmentation | A2D Sentences | IoU overall | 0.653 | CMPC-V (I3D) |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.655 | CMPC-V (I3D) |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.592 | CMPC-V (I3D) |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.506 | CMPC-V (I3D) |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.342 | CMPC-V (I3D) |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.098 | CMPC-V (I3D) |
| Instance Segmentation | A2D Sentences | AP | 0.351 | CMPC-V (R2D) |
| Instance Segmentation | A2D Sentences | IoU mean | 0.515 | CMPC-V (R2D) |
| Instance Segmentation | A2D Sentences | IoU overall | 0.649 | CMPC-V (R2D) |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.59 | CMPC-V (R2D) |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.527 | CMPC-V (R2D) |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.434 | CMPC-V (R2D) |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.284 | CMPC-V (R2D) |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.068 | CMPC-V (R2D) |
| Instance Segmentation | J-HMDB | AP | 0.342 | CMPC-V |
| Instance Segmentation | J-HMDB | IoU mean | 0.617 | CMPC-V |
| Instance Segmentation | J-HMDB | IoU overall | 0.616 | CMPC-V |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.813 | CMPC-V |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.657 | CMPC-V |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.371 | CMPC-V |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.07 | CMPC-V |
| Referring Expression Segmentation | A2D Sentences | AP | 0.404 | CMPC-V (I3D) |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.573 | CMPC-V (I3D) |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.653 | CMPC-V (I3D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.655 | CMPC-V (I3D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.592 | CMPC-V (I3D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.506 | CMPC-V (I3D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.342 | CMPC-V (I3D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.098 | CMPC-V (I3D) |
| Referring Expression Segmentation | A2D Sentences | AP | 0.351 | CMPC-V (R2D) |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.515 | CMPC-V (R2D) |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.649 | CMPC-V (R2D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.59 | CMPC-V (R2D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.527 | CMPC-V (R2D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.434 | CMPC-V (R2D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.284 | CMPC-V (R2D) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.068 | CMPC-V (R2D) |
| Referring Expression Segmentation | J-HMDB | AP | 0.342 | CMPC-V |
| Referring Expression Segmentation | J-HMDB | IoU mean | 0.617 | CMPC-V |
| Referring Expression Segmentation | J-HMDB | IoU overall | 0.616 | CMPC-V |
| Referring Expression Segmentation | J-HMDB | Precision@0.5 | 0.813 | CMPC-V |
| Referring Expression Segmentation | J-HMDB | Precision@0.6 | 0.657 | CMPC-V |
| Referring Expression Segmentation | J-HMDB | Precision@0.7 | 0.371 | CMPC-V |
| Referring Expression Segmentation | J-HMDB | Precision@0.8 | 0.07 | CMPC-V |
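
This page does not define the metrics reported above. The sketch below therefore assumes the definitions commonly used for referring segmentation: "IoU overall" is total intersection over total union accumulated across all samples, "IoU mean" is the average of per-sample IoU, and Precision@X is the share of samples whose IoU exceeds X. The function names and the toy masks are hypothetical. Treating an empty union as IoU 0 is also an assumption, and AP is left out.

```python
def iou_parts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute overall IoU, mean IoU and Precision@X under the assumed definitions."""
    total_inter = total_union = 0
    ious = []
    for pred, gt in zip(preds, gts):
        inter, union = iou_parts(pred, gt)
        total_inter += inter
        total_union += union
        # Assumption: a sample with an empty union counts as IoU 0.
        ious.append(inter / union if union else 0.0)
    n = len(ious)
    return {
        "iou_overall": total_inter / total_union if total_union else 0.0,
        "iou_mean": sum(ious) / n,
        "precision": {t: sum(iou > t for iou in ious) / n for t in thresholds},
    }


# Two toy 2x2 samples: a perfect prediction and a poor one.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
result = evaluate(preds, gts)
```

In this toy case the per-sample IoUs are 1.0 and 0.25, so the mean IoU (0.625) differs from the overall IoU (3/6 = 0.5). This is why the results above list the two values separately.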

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)