
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Actor and Action Modular Network for Text-based Video Segmentation

Jianhua Yang, Yan Huang, Kai Niu, Linjiang Huang, Zhanyu Ma, Liang Wang

2020-11-02
Tasks: Action Segmentation, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Video Segmentation, Video Semantic Segmentation, Action Understanding

Abstract

Text-based video segmentation aims to segment an actor in video sequences, where a textual query specifies the actor and the action it performs. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of semantic asymmetry. Semantic asymmetry means that the two modalities contain different amounts of semantic information during the multi-modal fusion process. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and its action individually in two separate modules. Specifically, we first learn the actor-related and action-related content from the video and the textual query, and then match them in a symmetrical manner to localize the target tube. The target tube, which contains the desired actor and action, is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also associates objects across multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and keep its predictions temporally consistent. The whole model supports joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame segmentation and full video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
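The abstract describes matching actor-related and action-related content separately and then selecting a target tube. The toy sketch below illustrates only that general idea and is not the authors' implementation: the abstract does not say how features are extracted or how the two matches are scored and combined, so the cosine similarity, the summed score, and every name here (`cosine`, `select_tube`, the `actor` and `action` feature keys) are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tube(
    tubes: List[Dict[str, Sequence[float]]],
    actor_query: Sequence[float],
    action_query: Sequence[float],
) -> Tuple[int, List[float]]:
    """Score each candidate tube against the actor and action parts of the query.

    Assumption: each tube carries separate hypothetical "actor" and "action"
    feature vectors, and the two match scores are simply added.
    """
    scores = [
        cosine(tube["actor"], actor_query) + cosine(tube["action"], action_query)
        for tube in tubes
    ]
    best = max(range(len(tubes)), key=scores.__getitem__)
    return best, scores


# Three hypothetical candidate tubes with 2-D features.
tubes = [
    {"actor": [1.0, 0.0], "action": [0.0, 1.0]},  # right actor, wrong action
    {"actor": [1.0, 0.0], "action": [1.0, 0.0]},  # right actor, right action
    {"actor": [0.0, 1.0], "action": [1.0, 0.0]},  # wrong actor, right action
]
best_index, scores = select_tube(tubes, actor_query=[1.0, 0.0], action_query=[1.0, 0.0])
print(best_index, scores)
```

In this sketch only the tube that agrees with both the actor part and the action part of the query gets the top score, which is the intuition behind matching the two components separately.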

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | A2D Sentences | AP | 0.396 | AAMN |
| Instance Segmentation | A2D Sentences | IoU mean | 0.552 | AAMN |
| Instance Segmentation | A2D Sentences | IoU overall | 0.617 | AAMN |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.681 | AAMN |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.629 | AAMN |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.523 | AAMN |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.296 | AAMN |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.029 | AAMN |
| Instance Segmentation | J-HMDB | AP | 0.321 | AAMN |
| Instance Segmentation | J-HMDB | IoU mean | 0.576 | AAMN |
| Instance Segmentation | J-HMDB | IoU overall | 0.583 | AAMN |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.773 | AAMN |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.627 | AAMN |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.36 | AAMN |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.044 | AAMN |
| Referring Expression Segmentation | A2D Sentences | AP | 0.396 | AAMN |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.552 | AAMN |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.617 | AAMN |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.681 | AAMN |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.629 | AAMN |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.523 | AAMN |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.296 | AAMN |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.029 | AAMN |
| Referring Expression Segmentation | J-HMDB | AP | 0.321 | AAMN |
| Referring Expression Segmentation | J-HMDB | IoU mean | 0.576 | AAMN |
| Referring Expression Segmentation | J-HMDB | IoU overall | 0.583 | AAMN |
| Referring Expression Segmentation | J-HMDB | Precision@0.5 | 0.773 | AAMN |
| Referring Expression Segmentation | J-HMDB | Precision@0.6 | 0.627 | AAMN |
| Referring Expression Segmentation | J-HMDB | Precision@0.7 | 0.36 | AAMN |
| Referring Expression Segmentation | J-HMDB | Precision@0.8 | 0.044 | AAMN |
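
The page does not define the metrics in the table. As a rough guide, the sketch below computes them the way they are commonly defined in referring segmentation work: IoU mean averages the per-sample IoU, IoU overall divides the total intersection by the total union over all samples, and Precision@K is the fraction of samples whose IoU exceeds K. These definitions, the strict `>` comparison, the handling of empty masks, and the names `intersection_and_union` and `evaluate` are assumptions, not details confirmed by this page. AP is not covered.

```python
from typing import Dict, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def evaluate(
    preds: Sequence[Mask],
    targets: Sequence[Mask],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[str, float]:
    """Compute IoU mean, IoU overall, and Precision@K over paired masks."""
    ious = []
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
        # Assumption: a sample where both masks are empty gets IoU 0.
        ious.append(inter / union if union > 0 else 0.0)

    n = len(ious)
    results = {
        "iou_mean": sum(ious) / n,
        "iou_overall": total_inter / total_union if total_union else 0.0,
    }
    for k in thresholds:
        # Assumption: a sample counts as correct only if its IoU is strictly above k.
        results[f"precision@{k}"] = sum(iou > k for iou in ious) / n
    return results


# Two tiny 2x2 examples: the first prediction covers half the target,
# the second matches its target exactly.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 1], [1, 1]], [[1, 0], [0, 0]]]
results = evaluate(preds, targets)
print(results)
```

In this example the per-sample IoUs are 0.5 and 1.0, so IoU mean is 0.75 while IoU overall is 3/5 = 0.6. The two differ because IoU overall weights each sample by its pixel count, which is why the table reports both.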

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)