Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Chen Liang, Yu Wu, Yawei Luo, Yi Yang

Published: 2021-03-19
Tasks: Referring Expression Segmentation · Video Segmentation · Video Semantic Segmentation · Video Understanding
Paper · PDF

Abstract

Text-based video segmentation is a challenging task that segments out the objects referred to by natural language in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, conducting vision-language interaction only within the local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can barely build region-level relationships from partial observations, which runs contrary to the descriptive logic of natural language referring expressions. In fact, people usually describe a target object through its relations with other objects, which may not be easy to understand without seeing the whole video. To address this issue, we introduce a novel top-down approach that imitates how humans segment an object under language guidance: we first identify all candidate objects in the video and then choose the referred one by parsing the relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding: positional relations, text-guided semantic relations, and temporal relations. Extensive experiments on A2D Sentences and J-HMDB Sentences show that our method outperforms state-of-the-art methods by a large margin. Qualitative results also show that our predictions are more explainable.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | A2D Sentences | IoU mean | 0.655 | ClawCraneNet |
| Instance Segmentation | A2D Sentences | IoU overall | 0.644 | ClawCraneNet |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.704 | ClawCraneNet |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.677 | ClawCraneNet |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.617 | ClawCraneNet |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.489 | ClawCraneNet |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.171 | ClawCraneNet |
| Instance Segmentation | J-HMDB | IoU mean | 0.655 | ClawCraneNet |
| Instance Segmentation | J-HMDB | IoU overall | 0.644 | ClawCraneNet |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.88 | ClawCraneNet |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.796 | ClawCraneNet |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.566 | ClawCraneNet |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.147 | ClawCraneNet |
| Instance Segmentation | J-HMDB | Precision@0.9 | 0.002 | ClawCraneNet |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.655 | ClawCraneNet |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.644 | ClawCraneNet |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.704 | ClawCraneNet |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.677 | ClawCraneNet |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.617 | ClawCraneNet |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.489 | ClawCraneNet |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.171 | ClawCraneNet |
| Referring Expression Segmentation | J-HMDB | IoU mean | 0.655 | ClawCraneNet |
| Referring Expression Segmentation | J-HMDB | IoU overall | 0.644 | ClawCraneNet |
| Referring Expression Segmentation | J-HMDB | Precision@0.5 | 0.88 | ClawCraneNet |
| Referring Expression Segmentation | J-HMDB | Precision@0.6 | 0.796 | ClawCraneNet |
| Referring Expression Segmentation | J-HMDB | Precision@0.7 | 0.566 | ClawCraneNet |
| Referring Expression Segmentation | J-HMDB | Precision@0.8 | 0.147 | ClawCraneNet |
| Referring Expression Segmentation | J-HMDB | Precision@0.9 | 0.002 | ClawCraneNet |
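The metrics above follow the usual conventions on these benchmarks: per-sample mask IoU, mean IoU (averaged over samples), overall IoU (total intersection over total union across the dataset), and Precision@K (the fraction of samples whose IoU exceeds a threshold K). As a rough sketch of how Precision@K relates to per-sample IoU, assuming masks represented as sets of foreground pixel coordinates (the function names here are illustrative, not from the paper):

```python
def mask_iou(pred, gt):
    """IoU between two masks given as sets of (row, col) foreground pixels."""
    union = pred | gt
    if not union:
        return 1.0  # both masks empty: count as a perfect match
    return len(pred & gt) / len(union)


def precision_at_threshold(ious, threshold):
    """Precision@K: fraction of samples whose IoU exceeds the threshold K."""
    ious = list(ious)
    return sum(iou > threshold for iou in ious) / len(ious)


# Toy example: prediction overlaps the ground truth on 2 of 3 pixels.
pred = {(0, 0), (0, 1), (1, 0)}
gt = {(0, 0), (0, 1), (1, 1)}
print(mask_iou(pred, gt))  # 2 shared pixels / 4 in the union = 0.5
```

Note that mean IoU and overall IoU generally differ: overall IoU weights large objects more heavily, which is why the two rows in the table can diverge.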

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
- Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
- Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation (2025-07-13)
- MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation (2025-07-10)
- Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)