Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation

Tianrui Hui, Shaofei Huang, Si Liu, Zihan Ding, Guanbin Li, Wenguan Wang, Jizhong Han, Fei Wang

2021-05-14 | CVPR 2021 | feature selection | Referring Expression Segmentation

Abstract

Language-queried video actor segmentation aims to predict the pixel-level mask of the actor which performs the actions described by a natural language query in the target frames. Existing methods adopt 3D CNNs over the video clip as a general encoder to extract a mixed spatio-temporal feature for the target frame. Though 3D convolutions are amenable to recognizing which actor is performing the queried actions, they also inevitably introduce misaligned spatial information from adjacent frames, which confuses features of the target frame and yields inaccurate segmentation. Therefore, we propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors. In the decoder, a Language-Guided Feature Selection (LGFS) module is proposed to flexibly integrate spatial and temporal features from the two encoders. We also propose a Cross-Modal Adaptive Modulation (CMAM) module to dynamically recombine spatial- and temporal-relevant linguistic features for multimodal feature interaction in each stage of the two encoders. Our method achieves new state-of-the-art performance on two popular benchmarks with less computational overhead than previous approaches.
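The abstract does not spell out how LGFS combines the two feature streams, so the following is only a rough illustration of the general idea of language-guided selection between a spatial and a temporal feature, not the paper's module. It assumes a per-channel sigmoid gate computed from a language vector; the function name `language_gated_fusion` and the parameters `weights` and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def language_gated_fusion(spatial, temporal, lang, weights, bias):
    """Blend two feature vectors channel by channel under a language-derived gate.

    ASSUMPTION: this gated blend is a generic stand-in, not the LGFS definition.
    spatial, temporal: lists of C floats (features at one pixel location).
    lang:              list of D floats (a sentence-level language feature).
    weights:           C rows of D floats; bias: C floats (hypothetical parameters).
    """
    fused = []
    for c in range(len(spatial)):
        # Gate for channel c: how strongly the query favors the spatial stream.
        score = sum(w * l for w, l in zip(weights[c], lang)) + bias[c]
        gate = sigmoid(score)
        fused.append(gate * spatial[c] + (1.0 - gate) * temporal[c])
    return fused


if __name__ == "__main__":
    spatial = [1.0, 2.0]
    temporal = [3.0, 6.0]
    lang = [0.5, -0.5, 1.0]
    zero_weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    # With zero weights and zero bias every gate is 0.5, so the result is the average.
    print(language_gated_fusion(spatial, temporal, lang, zero_weights, [0.0, 0.0]))
```

With all-zero parameters the gate is 0.5 and the output is the plain average of the two streams; a large positive bias pushes the output toward the spatial feature, a large negative one toward the temporal feature.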

Results

Task | Dataset | Metric | Value | Model
Instance Segmentation | A2D Sentences | AP | 0.399 | Hui et al.
Instance Segmentation | A2D Sentences | IoU mean | 0.561 | Hui et al.
Instance Segmentation | A2D Sentences | IoU overall | 0.662 | Hui et al.
Instance Segmentation | A2D Sentences | Precision@0.5 | 0.654 | Hui et al.
Instance Segmentation | A2D Sentences | Precision@0.6 | 0.589 | Hui et al.
Instance Segmentation | A2D Sentences | Precision@0.7 | 0.497 | Hui et al.
Instance Segmentation | A2D Sentences | Precision@0.8 | 0.333 | Hui et al.
Instance Segmentation | A2D Sentences | Precision@0.9 | 0.091 | Hui et al.
Instance Segmentation | J-HMDB | AP | 0.335 | Hui et al.
Instance Segmentation | J-HMDB | IoU mean | 0.604 | Hui et al.
Instance Segmentation | J-HMDB | IoU overall | 0.598 | Hui et al.
Instance Segmentation | J-HMDB | Precision@0.5 | 0.783 | Hui et al.
Instance Segmentation | J-HMDB | Precision@0.6 | 0.639 | Hui et al.
Instance Segmentation | J-HMDB | Precision@0.7 | 0.378 | Hui et al.
Instance Segmentation | J-HMDB | Precision@0.8 | 0.076 | Hui et al.
Referring Expression Segmentation | A2D Sentences | AP | 0.399 | Hui et al.
Referring Expression Segmentation | A2D Sentences | IoU mean | 0.561 | Hui et al.
Referring Expression Segmentation | A2D Sentences | IoU overall | 0.662 | Hui et al.
Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.654 | Hui et al.
Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.589 | Hui et al.
Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.497 | Hui et al.
Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.333 | Hui et al.
Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.091 | Hui et al.
Referring Expression Segmentation | J-HMDB | AP | 0.335 | Hui et al.
Referring Expression Segmentation | J-HMDB | IoU mean | 0.604 | Hui et al.
Referring Expression Segmentation | J-HMDB | IoU overall | 0.598 | Hui et al.
Referring Expression Segmentation | J-HMDB | Precision@0.5 | 0.783 | Hui et al.
Referring Expression Segmentation | J-HMDB | Precision@0.6 | 0.639 | Hui et al.
Referring Expression Segmentation | J-HMDB | Precision@0.7 | 0.378 | Hui et al.
Referring Expression Segmentation | J-HMDB | Precision@0.8 | 0.076 | Hui et al.
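
For readers unfamiliar with the metrics in the results, the sketch below shows how IoU overall, IoU mean, and Precision@K are commonly computed from binary masks. It is a minimal illustration, not the evaluation code behind these numbers. Assumptions: masks are flat lists of 0/1 values, Precision@K counts a sample when its IoU is strictly greater than K (some implementations use a non-strict comparison), and an empty union scores 0. The function name `iou_stats` is hypothetical.

```python
def iou_stats(pred_masks, gt_masks, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute overall IoU, mean IoU, and Precision@K over a set of samples.

    pred_masks, gt_masks: equal-length lists of flat 0/1 lists, one per sample.
    ASSUMPTION: strict '>' comparison for Precision@K; empty union counts as IoU 0.
    """
    per_sample = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        inter = sum(1 for p, g in zip(pred, gt) if p and g)
        union = sum(1 for p, g in zip(pred, gt) if p or g)
        total_inter += inter
        total_union += union
        per_sample.append(inter / union if union else 0.0)

    n = len(per_sample)
    return {
        # Overall IoU pools pixels across samples, so large objects weigh more.
        "iou_overall": total_inter / total_union if total_union else 0.0,
        # Mean IoU averages per-sample IoU, so every sample weighs the same.
        "iou_mean": sum(per_sample) / n if n else 0.0,
        # Precision@K is the fraction of samples whose IoU exceeds K.
        "precision": {
            t: (sum(1 for v in per_sample if v > t) / n if n else 0.0)
            for t in thresholds
        },
    }


if __name__ == "__main__":
    preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
    gts = [[1, 1, 1, 0], [0, 1, 0, 0]]
    print(iou_stats(preds, gts))
```

The difference between the two IoU variants explains why they can diverge in the results: overall IoU is dominated by samples with large masks, while mean IoU treats every sample equally.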